Mitigating Catastrophic Forgetting Using Improved Clustering-Based Episodic Memory

Online Continual Learning (Domain-Incremental, Task-Agnostic) — STRATA-I and STRATA-II

Authors

Owen Beabout, Abigail Dodd, Titus Murphy, Enyue Lu

Code

https://github.com/shininglegend/strata

Paper (PDF)

paper.pdf

Poster

poster.pdf

Keywords

continual learning, catastrophic forgetting, episodic memory, domain incremental learning, task-agnostic learning

Abstract

Online Continual Learning (OCL) is a subdomain of machine learning in which models must continuously learn from a perpetual data stream without access to past samples. Models using domain-incremental learning can adapt to shifting sample distributions, known as new tasks, while retaining accuracy on previously trained tasks, without needing to know the precise moment when the task switches. However, domain-incremental learning models are often susceptible to a loss in accuracy on earlier tasks as they train on subsequent tasks. We propose two new domain-incremental balanced stochastic gradient models with improved clustering-based episodic memory, STochastic gRAdient with Task-Agnosticity (STRATA-I and STRATA-II), and demonstrate strong performance on several benchmark datasets and tasks compared to previous state-of-the-art models, including reducing forgetting in at least two cases by over 75%.

Problem

In domain-incremental online continual learning, task identity and task boundaries are not available, while the data distribution shifts over time. Many models forget earlier tasks as training continues.

Previous Work

Prior work has explored task-agnostic clustering-based episodic memory, which attempts to group samples by underlying task structure without access to explicit task labels [1]. Among non-task-agnostic approaches using episodic memory, MEGA-I and MEGA-II are particularly strong methods: they combine gradients from memory and incoming samples by either rotating or balancing them based on relative loss values [2].

[1] Lamers, Christiaan, René Vidal, Nabil Belbachir, Niki van Stein, Thomas Bäck, and Paris Giampouras. "Clustering-based domain-incremental learning." In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 3384-3392. 2023.
[2] Guo, Yunhui, Mingrui Liu, Tianbao Yang, and Tajana Rosing. "Improved schemes for episodic memory-based lifelong learning." Advances in Neural Information Processing Systems 33 (2020): 1023-1035.

Our Upgrades

When a cluster is full, remove the sample farthest from the cluster mean (instead of FIFO).
Integrate this memory into MEGA-style gradient balancing to form STRATA-I and STRATA-II.

Key Results

STRATA-I and STRATA-II generally reduce forgetting compared to baselines across most datasets and task types, with especially strong reductions in class-split settings. Overall accuracy is also competitive and often best.

Forgetting (Permutation Tasks): STRATA models show reduced forgetting on permutation-based tasks compared to baselines.

Forgetting (Class-Split Tasks): STRATA models show especially strong forgetting reductions in class-split settings, with forgetting reduced by over 75% in at least two cases.

First-Task Accuracy (Rotation Tasks): STRATA models maintain stronger first-task performance throughout training.

Overall Accuracy (Rotation Tasks): STRATA models frequently achieve the best or statistically tied-best overall accuracy.

Method summary

Episodic memory update

Per class label, maintain a pool of clusters with a maximum number of clusters and samples per cluster.
Assign new samples to the nearest cluster mean (or create a new cluster if capacity allows).
If a cluster exceeds capacity, remove the sample farthest from the cluster mean (L2 distance), then recompute the mean.
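
The update rules above can be sketched in a few lines of plain Python. This is a minimal illustration, not the code from the repository: the class name ClusterMemory, its parameters, and the pools attribute are hypothetical. It assumes that a new cluster is opened whenever a label's pool still has room, and that the mean used for eviction includes the newly added sample. Means are computed on demand, so they are always current after a removal.

```python
import math


class ClusterMemory:
    """Per-label pool of clusters with farthest-from-mean eviction (illustrative sketch)."""

    def __init__(self, max_clusters, max_per_cluster):
        self.max_clusters = max_clusters        # clusters allowed per class label
        self.max_per_cluster = max_per_cluster  # samples allowed per cluster
        self.pools = {}  # label -> list of clusters; a cluster is a list of samples

    @staticmethod
    def _mean(cluster):
        # Coordinate-wise mean of all samples currently in the cluster.
        n = len(cluster)
        return [sum(column) / n for column in zip(*cluster)]

    def add(self, sample, label):
        clusters = self.pools.setdefault(label, [])

        # Assumption: open a new cluster while the pool has spare capacity.
        if len(clusters) < self.max_clusters:
            clusters.append([list(sample)])
            return

        # Otherwise assign the sample to the cluster with the nearest mean (L2).
        cluster = min(clusters, key=lambda c: math.dist(self._mean(c), sample))
        cluster.append(list(sample))

        # Over capacity: drop the sample farthest from the cluster mean,
        # rather than the oldest one (FIFO).
        if len(cluster) > self.max_per_cluster:
            mean = self._mean(cluster)
            farthest = max(range(len(cluster)),
                           key=lambda i: math.dist(cluster[i], mean))
            cluster.pop(farthest)
```

With max_clusters=1 and max_per_cluster=2, adding [0, 0], [1, 0], and then [10, 0] under the same label evicts [10, 0], since it lies farthest from the three-sample mean. A FIFO policy would instead have discarded [0, 0].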

STRATA-I vs STRATA-II

STRATA-I: loss-weighted mixing of current gradient and reference gradient from episodic memory.
STRATA-II: rotates the current gradient toward the reference gradient using a loss-balanced angle.
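
To make the loss-weighted mixing concrete, here is a minimal sketch of a STRATA-I-style update direction. It assumes the coefficients follow the loss-ratio form described for MEGA-I in [2]: the current gradient gets weight 1 and the reference gradient gets weight loss_ref / loss_cur, and once the current loss drops below a small threshold only the reference gradient is used. Whether STRATA-I uses exactly these coefficients is an assumption; the function name mix_gradients and the default eps are hypothetical. The rotation used by STRATA-II is not sketched here.

```python
def mix_gradients(g_cur, g_ref, loss_cur, loss_ref, eps=1e-3):
    """Combine current and memory gradients, weighted by their losses (sketch).

    g_cur, g_ref: flat lists of floats with the same length.
    loss_cur:     loss on the incoming batch.
    loss_ref:     loss on the batch drawn from episodic memory.
    """
    if loss_cur > eps:
        # Assumed MEGA-I-style rule: the memory gradient gains weight
        # as the memory loss grows relative to the current loss.
        a_cur, a_ref = 1.0, loss_ref / loss_cur
    else:
        # Current batch is already well fit: follow the memory gradient only.
        a_cur, a_ref = 0.0, 1.0
    return [a_cur * gc + a_ref * gr for gc, gr in zip(g_cur, g_ref)]
```

For example, with g_cur = [1.0, 0.0], g_ref = [0.0, 2.0], loss_cur = 2.0, and loss_ref = 1.0, the reference gradient is weighted by 0.5 and the combined direction is [1.0, 1.0].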

Experimental setup

Datasets: MNIST, Fashion-MNIST, CIFAR-10
Task types: permutation, rotation, class-split
Task presentation: sequential tasks (no overlap) and continual/overlapping transitions
Baselines: TA-A-GEM, Bayesian Gradient Descent (BGD), random add/remove episodic memory
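
For readers unfamiliar with the task types, the sketch below shows one common way permutation and class-split tasks are built in continual-learning benchmarks. The exact construction used in these experiments is not specified here, so treat this as an assumption; the helper names make_permutation, permute_image, and class_split are hypothetical.

```python
import random


def make_permutation(num_pixels, seed):
    """Return a fixed pixel permutation; one seed per task (illustrative)."""
    perm = list(range(num_pixels))
    random.Random(seed).shuffle(perm)
    return perm


def permute_image(flat_image, perm):
    """Apply a task's permutation to a flattened image."""
    return [flat_image[i] for i in perm]


def class_split(labels, classes_per_task):
    """Group the distinct class labels into consecutive tasks."""
    classes = sorted(set(labels))
    return [classes[i:i + classes_per_task]
            for i in range(0, len(classes), classes_per_task)]
```

Under this construction, every image in a permutation task is shuffled with the same perm, so the input distribution shifts between tasks while the label set stays fixed. In a class-split task, each task covers a disjoint subset of the labels.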