Continual Learning Research in 2023: A Review With Opinions

Continual learning is the problem of updating a model on new data without forgetting what it learned before. It is also the problem every production ML team faces and almost no research paper solves in conditions that resemble production.

In 2023, the field has a taxonomy of approaches, a graveyard of benchmarks that reward cheating, and a growing gap between “prevents catastrophic forgetting on Split-MNIST” and “personalizes an LLM for a user without destroying base capabilities.” This review covers the landscape with opinions attached, because neutral surveys of a field this messy are useless.

Continual Learning Taxonomy

Continual Learning
  Regularization: EWC, SI, LwF
  Replay: experience replay, generative replay, coreset selection
  Architecture: progressive nets, PackNet, dynamic expansion
  Optimization: GEM, A-GEM, OGD
  Meta-learning: MAML variants, online meta-learning
  Hybrid: ER + EWC, distillation + regularization, replay + distillation

Every branch has papers claiming SOTA. Most results do not replicate under fair evaluation protocols.

The Core Problem: Catastrophic Forgetting

Train a neural network on task A. Fine-tune on task B. Performance on task A collapses. This is catastrophic forgetting, and it is not a corner case. It is the default behavior of gradient-based optimization on non-stationary data distributions.
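To make the failure mode concrete, here is a deliberately tiny sketch: a one-parameter linear model rather than a neural network, with two tasks I made up so that they conflict directly. The names `sgd_fit`, `mse`, `task_a`, and `task_b` are illustrative assumptions, not taken from any paper.

```python
def sgd_fit(w, data, lr=0.1, epochs=50):
    """Plain SGD on squared error for the one-parameter model y = w * x."""
    for _ in range(epochs):
        for x, y in data:
            grad = 2 * (w * x - y) * x   # d/dw of (w*x - y)^2
            w -= lr * grad
    return w


def mse(w, data):
    """Mean squared error of y = w * x on a list of (x, y) pairs."""
    return sum((w * x - y) ** 2 for x, y in data) / len(data)


xs = [-1.0, -0.5, 0.5, 1.0]
task_a = [(x, 2.0 * x) for x in xs]    # task A: y = 2x
task_b = [(x, -1.0 * x) for x in xs]   # task B: y = -x, conflicts with A

w_after_a = sgd_fit(0.0, task_a)             # learn task A from scratch
loss_a_before = mse(w_after_a, task_a)       # near zero

w_after_b = sgd_fit(w_after_a, task_b)       # fine-tune on task B only
loss_a_after = mse(w_after_b, task_a)        # task A error is large again

print(f"task A loss before B: {loss_a_before:.6f}, after B: {loss_a_after:.3f}")
```

Nothing in the optimizer knows task A ever existed, so the single weight simply moves to wherever task B wants it. Real networks have more room to compromise, but the default pressure is the same.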

Production teams encounter this constantly.

The research field formalizes this as continual learning. The solutions proposed in papers rarely survive contact with real deployment constraints.

Elastic Weight Consolidation (EWC)

Kirkpatrick et al.’s EWC (2017) remains the most cited regularization approach. After training on task A, compute the Fisher information matrix diagonal to estimate parameter importance. When training on task B, add a penalty that prevents important parameters from moving far from their task-A values.
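A minimal sketch of that penalty, assuming a logistic regression model so it stays in pure Python. Two simplifications are mine: the Fisher diagonal is the empirical version (squared gradients at the observed labels), and the helper names `diagonal_fisher`, `ewc_penalty`, and `ewc_penalty_grad` are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def diagonal_fisher(theta, data):
    """Empirical Fisher diagonal: mean squared per-example gradient of the
    log-likelihood, evaluated at the observed labels (an approximation)."""
    fisher = [0.0] * len(theta)
    for x, y in data:
        p = sigmoid(sum(t * xi for t, xi in zip(theta, x)))
        for i, xi in enumerate(x):
            g = (y - p) * xi              # d log p(y | x) / d theta_i
            fisher[i] += g * g
    return [f / len(data) for f in fisher]


def ewc_penalty(theta, theta_star, fisher, lam):
    """Quadratic penalty: (lam / 2) * sum_i F_i * (theta_i - theta_star_i)^2."""
    return 0.5 * lam * sum(
        f * (t - s) ** 2 for f, t, s in zip(fisher, theta, theta_star)
    )


def ewc_penalty_grad(theta, theta_star, fisher, lam):
    """Gradient of the penalty, to be added to the task-B loss gradient."""
    return [lam * f * (t - s) for f, t, s in zip(fisher, theta, theta_star)]


# Toy task-A data: feature 0 carries the signal, feature 1 is always zero.
task_a_data = [([1.0, 0.0], 1), ([-1.0, 0.0], 0), ([2.0, 0.0], 1)]
theta_star = [1.0, 0.0]                   # assumed task-A solution
fisher = diagonal_fisher(theta_star, task_a_data)

# Moving the unimportant parameter is free; moving the important one is not.
free_move = ewc_penalty([1.0, 5.0], theta_star, fisher, lam=10.0)
costly_move = ewc_penalty([2.0, 0.0], theta_star, fisher, lam=10.0)
```

During task-B training the total loss would be the task-B loss plus `ewc_penalty(...)`. The strength `lam` is a hyperparameter you have to tune yourself.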

What works: EWC reduces forgetting on small-scale benchmarks (Permuted MNIST, Split MNIST) with clear task boundaries.

What does not work: Fisher diagonal approximations are noisy for large models. The quadratic penalty fights the new task gradient when tasks conflict. EWC does not scale cleanly to models with billions of parameters. Nobody has shown EWC working on a production LLM personalization pipeline at scale.

My opinion: EWC is a pedagogical tool, not a production solution. It teaches you why regularization-based CL is appealing and why it fails when task boundaries blur.

Replay Methods

Experience replay stores a buffer of examples from previous tasks and mixes them into training on new tasks. Simple, effective, and honest about its memory cost.

Experience replay: Keep N examples per task. Sample uniformly during new task training. Works surprisingly well if the buffer is large enough. The question is always: how large, and who pays for storage?
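The "keep N per task, sample uniformly" recipe fits in a few lines. This is a sketch under my own assumptions: examples are opaque Python objects, the class name `ReplayBuffer` and the `replay_fraction` knob are hypothetical, and a real pipeline would sample from storage rather than from an in-memory dict.

```python
import random


class ReplayBuffer:
    """Keeps up to `per_task_capacity` randomly chosen examples per task."""

    def __init__(self, per_task_capacity, seed=0):
        self.per_task_capacity = per_task_capacity
        self.examples = {}                      # task_id -> list of examples
        self.rng = random.Random(seed)

    def add_task(self, task_id, data):
        """Store a uniform random subset of a finished task's data."""
        k = min(self.per_task_capacity, len(data))
        self.examples[task_id] = self.rng.sample(data, k)

    def sample(self, n):
        """Draw up to n stored examples uniformly, without replacement."""
        pool = [ex for stored in self.examples.values() for ex in stored]
        return self.rng.sample(pool, min(n, len(pool)))

    def mixed_batch(self, new_batch, replay_fraction=0.5):
        """New-task batch plus replayed examples, sized relative to the batch."""
        n_replay = int(len(new_batch) * replay_fraction)
        return list(new_batch) + self.sample(n_replay)

    def __len__(self):
        return sum(len(stored) for stored in self.examples.values())


buffer = ReplayBuffer(per_task_capacity=3)
buffer.add_task("A", list(range(100)))          # stand-in examples for task A
buffer.add_task("B", list(range(100, 200)))     # stand-in examples for task B

new_task_batch = [1000, 1001, 1002, 1003]       # stand-in examples for task C
batch = buffer.mixed_batch(new_task_batch, replay_fraction=0.5)
```

The storage bill is `per_task_capacity` times the number of tasks, which is exactly the "how large, and who pays" question.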

Generative replay: Train a generative model on task A data, then synthesize pseudo-examples when learning task B. Clever, but generator quality limits replay quality. GAN-generated replay for CL has mostly been shown to work on MNIST-scale data.

Coreset selection: Instead of random replay, select a diverse subset that maximally preserves task performance. Options include herding, k-center, and gradient-based selection. Sample efficiency improves, but the selection algorithms are expensive.
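As one example of the selection step, here is a sketch of greedy k-center. It assumes each example is already a point in some feature space (in practice an embedding) and that Euclidean distance is meaningful there. The function name `k_center_greedy` and the toy points are mine.

```python
import math


def k_center_greedy(points, k, first=0):
    """Greedy k-center: repeatedly add the point farthest from the current
    selection. Returns indices into `points`, in selection order."""
    if not points or k <= 0:
        return []
    selected = [first]
    # Distance from every point to its nearest selected point so far.
    min_dist = [math.dist(p, points[first]) for p in points]
    while len(selected) < min(k, len(points)):
        idx = max(range(len(points)), key=lambda i: min_dist[i])
        if min_dist[idx] == 0.0:
            break                     # only duplicates of selected points remain
        selected.append(idx)
        for i, p in enumerate(points):
            min_dist[i] = min(min_dist[i], math.dist(p, points[idx]))
    return selected


# Two tight pairs plus one outlier: a diverse pick covers all three regions.
points = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0), (10.0, 0.0)]
chosen = k_center_greedy(points, k=3)
```

Each round scans every point, so the cost is roughly the number of points times `k` distance computations. That is the "expensive selection" trade-off in miniature.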

My opinion: Replay is the most honest approach because it admits that preventing forgetting requires retaining information about old data. The field’s discomfort with replay (it “cheats” by storing data) is ideological, not practical. In production, you have logs. Use them.

Architecture-Based Methods

Progressive neural networks add a new column of parameters for each task with lateral connections to previous columns. No forgetting by construction. Memory grows linearly with tasks.

PackNet prunes and reassigns parameters per task in a shared network. Dynamic expansion methods add neurons or modules for new tasks.
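The prune-and-reassign idea can be sketched on a flat list of weights. This is a simplification in the spirit of PackNet, not the published procedure: it skips the retraining step after pruning, uses integer task ids, and the helpers `assign_free_weights` and `task_mask` are hypothetical names.

```python
def assign_free_weights(weights, owner, task_id, keep_fraction):
    """After training `task_id` on the free weights, keep the largest-magnitude
    fraction for this task and reset the rest to zero for later tasks."""
    free = [i for i, o in enumerate(owner) if o is None]
    n_keep = int(len(free) * keep_fraction)
    keep = set(sorted(free, key=lambda i: abs(weights[i]), reverse=True)[:n_keep])
    for i in free:
        if i in keep:
            owner[i] = task_id        # frozen from now on
        else:
            weights[i] = 0.0          # released for the next task
    return owner


def task_mask(owner, task_id):
    """Weights visible to `task_id`: owned by it or by any earlier task."""
    return [o is not None and o <= task_id for o in owner]


weights = [0.9, -0.1, 0.5, 0.05, -0.7, 0.2]     # after training task 0
owner = [None] * len(weights)                   # None means still free
assign_free_weights(weights, owner, task_id=0, keep_fraction=0.5)

# Pretend task 1 then trained only the three free slots.
weights[1], weights[3], weights[5] = 0.3, -0.8, 0.1
assign_free_weights(weights, owner, task_id=1, keep_fraction=0.5)

mask_task0 = task_mask(owner, 0)
mask_task1 = task_mask(owner, 1)
```

Note that `task_mask` needs a task id at inference time, and the pool of free weights shrinks with every task.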

What works: Clean task boundaries with moderate task count.

What fails: Task count scaling (100 tasks means 100 columns or a fragmented network), inference complexity (which column/module for which input?), and the assumption that task identity is known at inference time.

My opinion: Architecture methods solve forgetting by throwing parameters at the problem. For LLMs where parameters are already expensive, this is a non-starter unless task count is tiny (2-5 distinct domains, not millions of users).

Why Benchmarks Are Broken

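Split-MNIST: the ten digits split into sequential tasks, most commonly five two-digit tasks (0/1, 2/3, and so on). Permuted MNIST: same digits, a different fixed pixel permutation per task. CORe50: incremental object recognition from short videos of hand-held objects.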

These benchmarks share fatal flaws:

Clear task boundaries at train and test time. Production data does not arrive in labeled task blocks. It arrives as a stream with shifting distributions and no task ID.

Small model scale. MLPs and small CNNs on MNIST do not predict behavior on 7B parameter transformers.

No measurement of forward transfer. Most benchmarks only measure forgetting (backward transfer). A method that prevents forgetting by learning nothing on new tasks scores well. That is not continual learning. That is freezing.
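The freezing loophole is easy to see once the metrics are written down. The sketch below uses one common convention (the accuracy-matrix definitions associated with the GEM paper), where `R[i][j]` is accuracy on task j after training through task i. The accuracy numbers are invented for illustration, not measured results.

```python
def average_accuracy(R):
    """Mean accuracy over all tasks after training on the last task."""
    T = len(R)
    return sum(R[T - 1]) / T


def backward_transfer(R):
    """Mean change on earlier tasks between just-learned and final accuracy.
    Negative values mean forgetting."""
    T = len(R)
    return sum(R[T - 1][i] - R[i][i] for i in range(T - 1)) / (T - 1)


def forward_transfer(R, baseline):
    """Mean accuracy on each task just before training on it, relative to
    an untrained baseline for that task."""
    T = len(R)
    return sum(R[i - 1][i] - baseline[i] for i in range(1, T)) / (T - 1)


baseline = [0.5, 0.5, 0.5]                      # chance on binary tasks

# Illustrative model that learns task 0 and then never updates again.
frozen = [[0.9, 0.5, 0.5],
          [0.9, 0.5, 0.5],
          [0.9, 0.5, 0.5]]

# Illustrative model that keeps learning and forgets a little.
learner = [[0.90, 0.50, 0.50],
           [0.80, 0.90, 0.60],
           [0.75, 0.85, 0.90]]

for name, R in [("frozen", frozen), ("learner", learner)]:
    print(name,
          f"ACC={average_accuracy(R):.3f}",
          f"BWT={backward_transfer(R):+.3f}",
          f"FWT={forward_transfer(R, baseline):+.3f}")
```

Judged on backward transfer alone, the frozen model wins with a perfect zero. It only looks as useless as it is once average accuracy and forward transfer sit next to it.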

No compute budget constraints. Methods that store full replay buffers and retrain on all previous data every task are not scalable. Papers rarely report training cost.

No evaluation of base capability preservation for LLMs. The obvious protocol is to fine-tune an LLM on medical QA, then measure both the medical QA improvement and general MMLU retention. Almost no CL benchmarks do this.

What Would Actually Matter for LLM Personalization

The application I care about most: updating an LLM’s behavior for a specific user or domain without degrading general capabilities. This is continual learning in the wild.

Requirements that 2023 research mostly ignores:

  1. No task labels. The system does not know when “task B” started. Data arrives continuously.
  2. Compute budget. Cannot replay full pretraining corpus on every update.
  3. Latency. Updates should not require full retraining. Minutes, not days.
  4. Safety. Personalization must not introduce harmful behavior or leak other users’ data.
  5. Evaluation. Measure both new task performance and general capability retention on standard benchmarks.
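Requirement 5 can be enforced mechanically as a gate on every update. A sketch, assuming you already have benchmark scores before and after the update as plain dicts. The function `accept_update`, the thresholds, and every number below are hypothetical placeholders, not measurements.

```python
def accept_update(before, after, new_task, base_benchmarks,
                  min_gain=0.0, max_drop=0.01):
    """Accept a personalization update only if the new task improves and no
    general benchmark drops by more than `max_drop` (absolute score)."""
    gain = after[new_task] - before[new_task]
    drops = {name: before[name] - after[name] for name in base_benchmarks}
    worst_drop = max(drops.values()) if drops else 0.0
    return {
        "gain": gain,
        "worst_drop": worst_drop,
        "accepted": gain > min_gain and worst_drop <= max_drop,
    }


# Made-up scores in [0, 1]; "medical_qa" is the new task, "mmlu" the base check.
before = {"medical_qa": 0.55, "mmlu": 0.62}
aggressive = {"medical_qa": 0.70, "mmlu": 0.60}    # big gain, 2-point base drop
careful = {"medical_qa": 0.63, "mmlu": 0.615}      # smaller gain, base held

verdict_aggressive = accept_update(before, aggressive, "medical_qa", ["mmlu"])
verdict_careful = accept_update(before, careful, "medical_qa", ["mmlu"])
```

The thresholds are a product decision rather than a research result. What matters is that both numbers are computed on every update and that the base-capability check can veto a gain.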

Current approaches that come closest:

Strong Opinions Section

Opinion 1: The field over-indexes on preventing forgetting and under-indexes on forward transfer and compute efficiency. A method that forgets 5% but learns new tasks 3x faster is more useful than one that forgets 0% at 10x compute cost.

Opinion 2: For LLMs, the right default is a frozen base model plus adapters or retrieval, not continual weight updates. The research community should stop pretending Split-MNIST results inform LLM deployment strategy.

Opinion 3: Replay is not cheating. Data retention is a design choice. Privacy-preserving replay (differential privacy, federated buffers) is an engineering problem, not a reason to abandon the most effective CL strategy.

Opinion 4: Most “continual learning” papers would be better classified as “regularized fine-tuning with extra steps.” The label “continual learning” attracts citations without delivering production value.

What I Am Watching

Closing

Continual learning research in 2023 has strong theoretical foundations and weak production relevance for large-scale models. EWC and architecture methods teach important concepts but do not solve real deployment problems. Replay works but the field stigmatizes it. Benchmarks reward methods that exploit unrealistic assumptions.

If you are a founder or engineer facing forgetting in production: start with replay (mix old and new data), consider LoRA adapters instead of full fine-tuning, and measure both new task and base capability metrics. Ignore Split-MNIST SOTA claims.

The field will mature when benchmarks match production constraints. Until then, read papers for ideas, not for deployment recipes.
