Experience Replay and Zero-Shot Clustering for Continual Learning in Diabetic Retinopathy Detection

Gusseppe Bravo-Rocca¹, Peini Liu¹, Jordi Guitart¹,², Ajay Dholakia³, David Ellison³ and Rodrigo M. Carrillo-Larco⁴

¹Barcelona Supercomputing Center, Barcelona, Spain
²Universitat Politècnica de Catalunya, Barcelona, Spain
³Lenovo Infrastructure Solutions Group, Morrisville, NC, U.S.A.
⁴Emory University, GA, U.S.A.

Keywords: Zero-Shot Clustering, Experience Replay, Diabetic Retinopathy Detection, Privacy-Preserving Learning, Medical Imaging.
Abstract:
We present an approach to mitigate catastrophic forgetting in Continual Learning (CL), focusing on domain
incremental scenarios in medical imaging. Our method leverages Large Language Models (LLMs) to generate
task-agnostic descriptions from multimodal inputs, enabling zero-shot clustering of tasks without supervision.
This clustering underpins an enhanced Experience Replay (ER) strategy, strategically sampling data points to
refresh the model’s memory while preserving privacy. By incrementally updating a multi-head classifier using
only data embeddings, our approach maintains both efficiency and data confidentiality. Evaluated on a chal-
lenging diabetic retinopathy dataset, our method demonstrates significant improvements over traditional CL
techniques, including Elastic Weight Consolidation (EWC), Gradient Episodic Memory (GEM), and Learn-
ing Without Forgetting (LWF). Extensive experiments across Multi-Layer Perceptron (MLP), Residual, and
Attention architectures show consistent performance gains (up to 3.1% in Average Mean Class Accuracy) and
reduced forgetting, with only 6% computational overhead. These results highlight our approach’s potential
for privacy-preserving, efficient CL in sensitive domains like healthcare, offering a promising direction for
developing adaptive AI systems that can learn continuously while respecting data privacy constraints.
1 INTRODUCTION
Continual Learning (CL) aims to develop AI systems
capable of acquiring and refining knowledge over
time, mirroring human-like adaptive learning (Parisi
et al., 2019). Unlike conventional Machine Learn-
ing approaches that separate training and inference
phases, CL models must adapt to evolving real-world
data and tasks (Wang et al., 2024). This adaptation is
crucial in scenarios with blurred task boundaries (Koh
et al., 2022) and in environments requiring continuous
learning without catastrophic forgetting (De Lange
et al., 2022; Kirkpatrick et al., 2017; Robins, 1993).
A significant challenge in CL is maintaining con-
sistent performance across changing multimodal data
distributions, particularly in domains like medical
imaging where privacy concerns and data mutability
are paramount. Domain Incremental Learning (DIL)
faces acute challenges with domain shifts, such as
variations in lighting, population characteristics, or
noise in medical image classifiers. Traditional re-
training approaches are often infeasible due to pri-
vacy constraints in healthcare (Kumar and Srivas-
tava, 2018; Kumari and Singh, 2024; Lenga et al.,
2020), leading to performance degradation on previ-
ously learned tasks (Khan et al., 2024; Kuang et al.,
2018).
To address these challenges, we present a
novel unsupervised learning framework that leverages
Large Language Models (LLMs) for CL in privacy-
sensitive domains. As shown in Figure 1, our ap-
proach uses LLMs to generate textual descriptions
from multimodal inputs (images, labels), enabling
zero-shot clustering without predefined task bound-
aries. That is, once we obtain the LLM-generated descriptions, we map them to embeddings; in the same shared space, we compare these description embeddings with the image embeddings to perform the clustering. This
method is particularly well-suited for medical imag-
ing scenarios, such as Diabetic Retinopathy (DR) de-
tection from fundus images, where data privacy and
distribution shifts are critical concerns.
Our approach extends the concept of ER (Riemer
et al., 2019) by integrating a strategic sampling
methodology derived from zero-shot clusters. This
new ER strategy refreshes the neural network’s mem-
ory, mitigating knowledge degradation across tasks.
The model architecture employs a multi-head classi-
fier that expands incrementally with new tasks, each
head containing a simple linear layer for adaptation.
Key features of our approach include:
Privacy Preservation. By operating on embed-
dings rather than raw images, our method ad-
dresses critical privacy concerns in sensitive do-
mains like healthcare.
Resource Efficiency. Designed to function on
CPUs using embeddings, our approach is compu-
tationally efficient and suitable for deployment in
resource-constrained environments.
Adaptability. The ER strategy, free from fixed
task definitions, improves adaptability to evolv-
ing data distributions in medical imaging scenar-
ios (Zhang et al., 2024; Serra et al., 2018).
Foundation Model Integration. We leverage
CLIP (Radford et al., 2021) to produce robust im-
age embeddings, enhancing the zero-shot cluster-
ing process.
Our method seamlessly integrates with and en-
hances established CL strategies, including Elas-
tic Weight Consolidation (EWC) (Kirkpatrick et al.,
2017), Gradient Episodic Memory (GEM) (Lopez-
Paz and Ranzato, 2017), and Learning Without For-
getting (LWF) (Li and Hoiem, 2017). We demon-
strate its effectiveness on a challenging DR dataset
(Karthik, 2019), showcasing improved robustness
against forgetting and significant performance boosts
over existing techniques.
The main contributions of our paper are:
1. A novel zero-shot clustering framework using
LLM-generated descriptions for unsupervised im-
age clustering, enhancing ER in CL.
2. A privacy-preserving, CPU-based ER strategy
leveraging zero-shot clusters for efficient incre-
mental learning in sensitive domains.
3. Comprehensive experiments demonstrating our
method’s efficacy in preventing catastrophic for-
getting and enhancing existing CL performance
across multiple model architectures (MLP, Resid-
ual, and Attention).
4. A generalizable approach to CL that addresses
key challenges in medical imaging while show-
ing potential applicability to other domains with
similar privacy and distribution shift concerns.
2 RELATED WORK
LLMs for Zero-Shot Learning. LLMs have dra-
matically transformed machine capabilities for under-
standing and generating human-like text, notably en-
hancing zero-shot learning (Brown and et al., 2020).
Our research leverages these capabilities, using LLMs
to create descriptive embeddings for images. These
embeddings, when integrated with the visual embed-
dings from CLIP, facilitate effective zero-shot clus-
tering. This method represents a departure from the
usual applications of LLMs, which typically direct
task execution. Instead, we use their generative power
to enhance data organization for CL.
Experience Replay. ER is rooted in the aspiration to
emulate aspects of human memory processes, where
past experiences are occasionally revisited to solidify
learning. The canonical form of ER (Riemer et al.,
2019) involves interleaved training of new tasks with
memory samples, seeking to approximate the joint
distribution of tasks. Variants like Dark ER (Buzzega
et al., 2020) have added layers of complexity, employ-
ing distillation loss to enforce output consistency. Re-
cent trends in ER have seen the incorporation of dual-
memory architectures, such as approaches mirroring
the interplay between fast and slow learning processes
by maintaining two semantic memories (Arani et al.,
2022). While such architectures provide novel mech-
anisms to handle forgetting, the optimal way to struc-
ture and utilize these memories remains an open chal-
lenge. In our work, we use this idea to incorporate
past data points to inform the replay, based on the data
properties.
Privacy-Preserving Exemplars. ER is essential for
mitigating catastrophic forgetting, typically involving
raw data samples from previous tasks. Our method
enhances privacy by storing only the embeddings of
exemplars, not the raw images. This modification
ensures data privacy while maintaining ER effective-
ness. By fine-tuning zero-shot clustering on train-
ing datasets, we refine exemplar selection, ensuring
the memory buffer contains the most representative
embeddings (Rebuffi et al., 2017; Shin et al., 2017).
Figure 1: Our method uses a Large Language Model (LLM) to generate descriptions d_i for each image x, using its label y for initial domain learning in Task 0. These descriptions underpin unsupervised zero-shot clustering, forming clusters x_i. Key points from these clusters are buffered for replay. A multi-head classifier leverages this buffer in an Experience Replay (ER) strategy, learning the pertinent head i for predictions y, thus preserving knowledge across successive tasks. (The figure depicts the pipeline: multimodal input (image, label) → LLM-generated pairs of descriptions, e.g., Positive: "[Optical image shows] {an iris with retinal disease}" vs. Negative: "[Optical image shows] {normal retina}" → language-image embedding and closest-candidate selection → zero-shot clustering → sampling from each cluster into the memory buffer → training a new head per task on the multi-head classifier.)
This approach addresses privacy concerns in medical
image analysis by storing embeddings instead of ac-
tual images, complying with privacy regulations and
addressing security concerns (Shokri and Shmatikov,
2015). Our strategy meets the growing demand for
privacy-preserving ML techniques.
CLIP Embeddings in CL. Our methodology re-
purposes CLIP as an embedder within a CL frame-
work, eschewing the common practice of fine-tuning
CLIP on downstream tasks. This strategy retains the
model’s zero-shot learning capabilities while avoid-
ing the pitfalls of catastrophic forgetting inherent
in direct fine-tuning scenarios (Garg et al., 2023).
By comparing LLM-generated descriptions, our zero-
shot clustering fine-tuning process identifies optimal
exemplars for memory storage, facilitating more ef-
fective learning across sequential tasks.
Continual Learning in Medical Imaging. CL in
medical imaging presents unique challenges due to
privacy concerns and data distribution shifts. Recent
work has explored continuous domain adaptation for
healthcare applications (Venkataramani et al., 2018),
addressing the evolving nature of medical data. Addi-
tionally, domain adaptation techniques have been ap-
plied to medical image segmentation tasks (Valindria
et al., 2018), demonstrating the potential of transfer
learning in this field. Our work builds upon these
foundations, introducing a novel approach that com-
bines zero-shot learning with ER, specifically tailored
to handle the privacy and distribution shift issues in
medical imaging scenarios.
3 PROBLEM STATEMENT
Challenges and Requirements. The primary chal-
lenges in DIL include:
Catastrophic Forgetting. New knowledge acqui-
sition leads to the erosion of previously learned
information.
Dynamic Data Distributions. The data distribution D_i changes over time, necessitating continual model adaptation.
Privacy Preservation. Direct access to raw data
is often restricted, especially in sensitive applica-
tions like healthcare.
Task Boundary Ambiguity. In real-world sce-
narios, clear task boundaries may not exist, re-
quiring models to adapt without explicit task de-
lineation.
Formal Definition. DIL involves training a model H on a sequence of tasks {T_1, T_2, ..., T_n}, where the data distribution for each task may change over time. In our context, a task T_i represents a specific domain or data distribution, such as fundus images with particular lighting conditions or noise levels. The model H consists of a base network b(x; θ_b) shared across tasks and a set of task-specific heads {g_k(z; θ_k)}, where z = b(x; θ_b) is the shared representation. The objective is to minimize the cumulative loss:

\min_{\theta_b, \{\theta_k\}} \sum_{i=1}^{n} \mathcal{L}\big(H(D_i; \theta_b, \theta_i), Y_i\big), \quad (1)

where θ_b are the parameters of the base network, θ_i are the parameters of the head for task T_i, D_i and Y_i are the data and labels for task T_i, respectively, and \mathcal{L} denotes the loss function.

During training on a new task T_n, a new head g_n(z; θ_n) is added to the model:

H(x; \theta_b, \{\theta_k\}_{k=1}^{n}) = g_n\big(b(x; \theta_b); \theta_n\big). \quad (2)

The goal is to optimize the parameters θ_b and {θ_k} such that the performance on all previously learned tasks {T_1, T_2, ..., T_{n-1}} is maintained while learning the new task T_n.
Task Iteration. In our approach, tasks are presented
sequentially, with each task representing a distinct do-
main or data distribution. The model is trained on
these tasks in order, without revisiting previous tasks
except through the ER mechanism. This setup sim-
ulates real-world scenarios where data distributions
evolve over time and previous data may not be fully
accessible.
Significance and Impact. Addressing these chal-
lenges will enable the development of more robust
and adaptive AI systems. In medical imaging, for
instance, this means more accurate and timely diag-
noses despite evolving data distributions, all while
preserving patient privacy. Successfully solving this
problem will also advance the broader field of CL,
providing insights and techniques applicable to var-
ious dynamic environments where data distributions
shift over time and privacy is a concern.
4 APPROACH
We propose a framework that synergizes LLMs,
specifically GPT-4 (OpenAI, 2023), with vision-
language models like CLIP to enhance CL through
zero-shot clustering and ER. Our approach addresses
key challenges in CL, particularly in privacy-sensitive
domains like medical imaging, by leveraging GPT-
4-generated descriptions and CLIP embeddings for
zero-shot clustering. This method enables the iden-
tification of exemplars for ER without storing raw
images, ensuring privacy preservation and efficient
learning across sequential tasks.
4.1 Zero-Shot Clustering with LLM
and CLIP
Our zero-shot clustering method harnesses the com-
bined strengths of GPT-4 and CLIP to cluster im-
ages into predefined classes without explicit training.
This approach is particularly valuable in CL scenarios
where task boundaries are ambiguous and data distri-
butions evolve over time.
Given a set of images {I_1, I_2, ..., I_n} and their associated labels ('Retinopathy' and 'No Retinopathy'), we employ GPT-4 to generate a set of textual descriptions {D_1, D_2, ..., D_m}. These descriptions capture various aspects of the images, including potential class information (e.g., presence or absence of retinopathy) and other relevant features. We then utilize CLIP (ViT-L/14@336px configuration) to obtain embeddings for both images and textual descriptions, leveraging its ability to create a shared embedding space for multimodal data.
4.1.1 Embedding Generation
The embedding process consists of two key steps:
1. Textual Description Generation and Tokenization: GPT-4 generates descriptions based on image labels, which are then tokenized for CLIP input.
2. CLIP Encoding: Both tokenized descriptions and images are processed by CLIP to obtain normalized embeddings, denoted as X_i for images and T_j for text descriptions.

Formally, we represent this process as:

X_i = \frac{\mathrm{CLIP}_{\text{image}}(I_i)}{\lVert \mathrm{CLIP}_{\text{image}}(I_i) \rVert}, \quad T_j = \frac{\mathrm{CLIP}_{\text{text}}(D_j)}{\lVert \mathrm{CLIP}_{\text{text}}(D_j) \rVert} \quad (3)

where CLIP_image(·) and CLIP_text(·) are CLIP's image and text encoding functions, respectively. Normalization ensures all embeddings lie on the unit sphere, facilitating similarity computations.
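To make this step concrete, the following sketch shows how the normalized embeddings of Equation (3) could be obtained with the open-source clip package; the file name and example prompts are illustrative assumptions, not the authors' released code.

```python
import torch
import clip  # https://github.com/openai/CLIP
from PIL import Image

device = "cpu"  # the paper targets CPU-only deployment
model, preprocess = clip.load("ViT-L/14@336px", device=device)

# Hypothetical inputs: a fundus image and one description per class.
image = preprocess(Image.open("fundus_example.png")).unsqueeze(0).to(device)
descriptions = [
    "Optical image shows normal retina",    # class 0: No Retinopathy
    "Optical image shows retinal disease",  # class 1: Retinopathy
]
tokens = clip.tokenize(descriptions).to(device)

with torch.no_grad():
    X = model.encode_image(image)  # image embeddings
    T = model.encode_text(tokens)  # text embeddings
    # L2-normalize so all embeddings lie on the unit sphere (Eq. 3).
    X = X / X.norm(dim=-1, keepdim=True)
    T = T / T.norm(dim=-1, keepdim=True)
```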
4.1.2 Similarity Computation and Clustering
We perform zero-shot clustering by computing cosine similarities between image and text embeddings. For each image embedding X_i, we calculate its similarity to all text embeddings T_j:

S_{ij} = \cos(X_i, T_j) = \frac{X_i \cdot T_j}{\lVert X_i \rVert \, \lVert T_j \rVert} \quad (4)

Label assignment for each image is determined by the text description yielding the highest similarity score: L_i = \arg\max_j S_{ij}. This process effectively assigns each image to the class best represented by its most similar text description, achieving zero-shot classification without task-specific training.
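Continuing the sketch above, label assignment then reduces to a matrix product and an argmax, since the embeddings are already unit-normalized:

```python
# Zero-shot clustering (Eq. 4): with unit-norm embeddings, cosine
# similarity is just the dot product X_i . T_j.
S = X @ T.T                # similarity matrix, shape [n_images, n_texts]
labels = S.argmax(dim=-1)  # L_i = argmax_j S_ij
```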
4.1.3 Optimizing Description Selection
To maximize the effectiveness of zero-shot cluster-
ing, we optimize the selection of textual descriptions.
We experiment with various templates and descrip-
tion sets (see Table 1), evaluating their clustering per-
formance using F1-score (due to imbalanced data).
Algorithm 1 outlines this optimization process, and
Figure 2 presents the results.
Data: Image set I, class labels L = {0, 1}, set of description sets D, set of templates T
Result: Optimal template T*, optimal description set D*
foreach template t ∈ T do
    foreach description set d ∈ D do
        Generate text prompts using template t and description set d;
        Obtain text embeddings T_j using the CLIP model;
        foreach image I_i ∈ I do
            Compute similarity scores S_ij between image I_i and all embeddings T_j;
            Assign label L_i to image I_i based on the highest similarity score S_ij;
        end
        Evaluate the F1-score for the current combination of t and d;
        if the current score is higher than the best previous score then
            Update T* and D* with the current template t and description set d;
        end
    end
end
Algorithm 1: Optimizing Description Selection for Zero-shot Clustering.
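A minimal sketch of this search is given below; encode_text is an assumed helper returning unit-norm CLIP text embeddings (one row per class, ordered so row j corresponds to class j), and an empty template stands in for the "no template" case.

```python
from itertools import product
from sklearn.metrics import f1_score

def select_best_prompts(image_embs, y_true, templates, description_sets,
                        encode_text):
    """Grid-search over templates x description sets (Algorithm 1).

    image_embs: [n, d] unit-norm image embeddings of the training split;
    y_true: ground-truth labels (0/1) used only to score candidates;
    encode_text: assumed helper returning unit-norm text embeddings.
    """
    best_t, best_d, best_f1 = None, None, -1.0
    for t, d in product(templates, description_sets):
        # Build one prompt per class; an empty template means "no template".
        prompts = [t.format(desc) if t else desc for desc in d]
        T = encode_text(prompts)                   # [2, d]
        preds = (image_embs @ T.T).argmax(axis=1)  # zero-shot labels
        score = f1_score(y_true, preds)
        if score > best_f1:
            best_t, best_d, best_f1 = t, d, score
    return best_t, best_d, best_f1
```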
Figure 2: F1-scores for the optimal template against all description sets. 'Set 3' obtains the highest score for Task 0 as well as for Tasks 1 and 2, indicating a description set and template that remain useful for zero-shot clustering at inference time.
We task GPT-4 to generate 11 description sets for
the classes No Retinopathy and Retinopathy, paired
with 10 templates for constructing prompts. These
sets vary in detail, from technical terms (e.g., "Mild
to Severe non-proliferative diabetic retinopathy") to
simpler descriptions (e.g., "Signs of diabetic retinopa-
thy"). Templates such as "An iris with {}", "A human
eye with {}", and "An ocular image with {}" were
used to form the final prompts.
Table 1: 10 Templates and 11 Description Sets optimized for Zero-shot Clustering.

#   Template                         Binary Description Set
1   An iris with                     Healthy / Diabetic damage
2   A human eye with                 No damage / Diabetic signs
3   no template                      Normal / Retinal disease
4   An ocular image with             No issues / Retinopathy
5   A retinal photo with             Clear fundus / Fundus changes
6   A fundus image displaying        No retinopathy / Mild-severe incl. laser
7   Visible symptoms suggest         Normal fundus / Retinopathy incl. laser
8   Retinal scan reveals             No abnormalities / Non-proliferative
9   Optical image shows              No pathology / Mild-severe incl. laser
10  The condition of the retina is   Healthy / Mild-severe incl. laser
11  (no template)                    No disease / Various stages incl. laser
This comprehensive generation enables our
method to adapt to the nuances of DR detection.
4.2 Stratified Sampling for Experience
Replay
After optimizing the textual descriptions and tem-
plates for zero-shot clustering, we employ stratified
sampling to ensure balanced representation of each
class within the ER buffer. This approach is crucial
for constructing a diverse and representative collection of multimodal inputs, each containing an embedding and a label, ensuring effective and privacy-preserving ER while promoting generalization across tasks and minimizing catastrophic forgetting.
4.2.1 Sampling Procedure
Given a collection of multimodal inputs M, where each input m_i ∈ M is characterized by its embedding E_i and zero-shot label z_i, we define a stratified sampling strategy to select a subset S ⊆ M with proportional representation across the zero-shot classified labels.

For each distinct zero-shot label l ∈ Z derived from the clustering process, we define a subset M_l ⊆ M containing inputs with label l. We then sample n_neighbors inputs from each subset M_l:

S_l = \begin{cases} \mathrm{sample}(M_l, n_{\text{neighbors}}), & \text{if } |M_l| > n_{\text{neighbors}} \\ M_l, & \text{otherwise} \end{cases} \quad (5)

The final sample set S is the union of all samples across the labels:

S = \bigcup_{l \in Z} S_l \quad (6)
This stratified sampling strategy ensures a bal-
anced replay buffer, critical for maintaining diversity
during ER and reducing the risk of catastrophic for-
getting while reinforcing learning across tasks.
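A minimal sketch of Equations (5) and (6), assuming each multimodal input reduces to an (embedding, zero-shot label) pair:

```python
import random
from collections import defaultdict

def stratified_sample(inputs, n_neighbors, seed=0):
    """Stratified sampling over zero-shot labels (Eqs. 5 and 6).

    inputs: iterable of (embedding, zero_shot_label) pairs.
    Returns at most n_neighbors inputs per label; groups smaller than
    n_neighbors are kept whole, exactly as in Eq. 5.
    """
    rng = random.Random(seed)
    groups = defaultdict(list)          # M_l: inputs grouped by label l
    for emb, label in inputs:
        groups[label].append((emb, label))
    sample = []
    for members in groups.values():
        if len(members) > n_neighbors:  # subsample large groups (Eq. 5)
            sample.extend(rng.sample(members, n_neighbors))
        else:
            sample.extend(members)
    return sample                       # S: union over all labels (Eq. 6)
```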
4.2.2 Clustering+Sampling Performance
We evaluate our zero-shot clustering and stratified
sampling approach across three tasks of increasing
complexity (described in Section 5.2.1). Some key
insights can be derived from our results, which are
shown in Table 2.
Table 2: Zero-shot clustering and sampling results across
tasks 0, 1, and 2.
Task Class 0 Class 1 Samples F1-score
0 387 653 20 (1.92%) 0.892
1 1732 890 20 (0.76%) 0.890
2 2120 1542 20 (0.55%) 0.891
Class Distribution. There is significant class im-
balance across tasks, mirroring real-world med-
ical imaging scenarios where pathological cases
are less frequent.
Sampling Efficiency. We consistently select 20 samples per task (1.92% down to 0.55% of the total data), maintaining a compact yet representative memory buffer as the dataset grows.
Performance Stability. F1-scores remain consis-
tent (0.89) across tasks despite increasing com-
plexity and imbalance, highlighting the robust-
ness of our approach in evolving data distribu-
tions.
These quantitative results are further illustrated by the qualitative analysis shown in Figure 3, which presents UMAP projections of clusters and selected samples for each task.

Figure 3: UMAP projections of clusters for embeddings across the three tasks. Each projection includes clusters and 10 samples from the memory buffer for both Class 0 and Class 1. (a) Task 0: clusters with uniform image quality; the model successfully differentiates between Class 0 and Class 1, with well-separated cluster shapes. (b) Task 1: clusters impacted by lighting variation; lighting variations cause more overlap between clusters, making classification harder, yet most sample points are still correctly clustered. (c) Task 2: clusters with Gaussian noise applied; noise increases cluster overlap significantly, yet some samples remain distinguishable, demonstrating moderate model robustness. These projections illustrate how the model adapts to progressively increasing task complexity.

4.2.3 Template and Description Set Performance

We evaluate the performance of various templates and description sets for zero-shot clustering across tasks. Figure 4 presents the top-performing combinations for each task.

The key findings from our template and description set analysis are as follows:

Across all tasks, the template "Optical image shows" combined with description Set 3 ("Normal / Retinal disease") consistently achieves the highest F1-scores (0.892, 0.890, and 0.891 for Tasks 0, 1, and 2, respectively).
Set 3 demonstrates robust performance across all tasks, indicating its effectiveness in zero-shot clustering for DR detection.

Sets 1 ("Healthy / Diabetic damage") and 7 ("Normal fundus / Retinopathy incl. laser") also show promising performance, highlighting the importance of carefully crafted descriptions in zero-shot learning scenarios.

Figure 4: Top 3 templates and description sets for each task, highlighting the highest F1-scores for zero-shot clustering across Tasks 0, 1, and 2. In every task the best-performing pair is {Template 9, Set 3}, with F1-scores of 0.892 (Task 0), 0.890 (Task 1), and 0.891 (Task 2); Set 3 consistently shows top performance in combination with different templates.
These results highlight the robustness of our
method, which consistently adapts to increasing task
complexity (e.g., uniform quality, lighting variation,
Gaussian noise) while maintaining high performance
in zero-shot clustering and sample selection for ER.
This demonstrates its effectiveness in real-world sce-
narios where image quality varies significantly.
4.3 Experience Replay Algorithm
We propose an enhanced ER algorithm that leverages
zero-shot clustering and stratified sampling to address
catastrophic forgetting in CL. Our method comprises
two key components: a zero-shot exemplars buffer
and an ER strategy.
4.3.1 Zero-shot Exemplars Buffer
The zero-shot exemplars buffer maintains a balanced
set of exemplars based on zero-shot clustering out-
comes. It is updated as shown in Algorithm 2.
Input: Maximum buffer size max_size, number of neighbors n_neighbors, training strategy S
Output: Updated replay buffer B
D ← {CreateMultimodalInput(d) | d ∈ S.experience.dataset};
C, Z ← ZeroShotClustering(D, text_embs_best);
E ← StratifiedSampling(C, Z, n_neighbors);
B ← {(E_s, y_s, t_s) | s ∈ E, (E_s, y_s, t_s) = ExtractFeatures(s)};
Update B in strategy S, respecting max_size;
Algorithm 2: Updating the Replay Buffer with Zero-shot Exemplars.
Key features of our zero-shot exemplars buffer include: multimodal input creation encapsulating embeddings and labels; unsupervised clustering using pre-computed text embeddings; stratified sampling for balanced cluster representation; privacy-preserving updates using embeddings instead of raw data; and dynamic buffer group adjustment based on clustering outcomes.
This approach ensures diverse sample representa-
tion while maintaining privacy and efficiency. The use
of zero-shot labels enhances applicability in scenarios
with scarce ground truth.
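Putting the pieces together, a sketch of the buffer refresh in Algorithm 2 might look as follows; it reuses the stratified_sample() helper from Section 4.2.1 and treats the embeddings of the optimal prompts (T*, D*) as given.

```python
import numpy as np

def update_buffer(image_embs, text_embs_best, n_neighbors, max_size):
    """Sketch of Algorithm 2: refresh the zero-shot exemplars buffer.

    image_embs: [n, d] unit-norm embeddings of the current experience;
    text_embs_best: [k, d] embeddings of the optimal prompts (T*, D*).
    """
    # Zero-shot clustering: each embedding joins its closest description.
    zs_labels = np.argmax(image_embs @ text_embs_best.T, axis=1)
    pairs = list(zip(image_embs, zs_labels.tolist()))
    exemplars = stratified_sample(pairs, n_neighbors)
    return exemplars[:max_size]  # respect the maximum buffer size
```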
4.3.2 Experience Replay Strategy
Our ER strategy integrates the zero-shot exemplars
buffer into the training process as shown in Algorithm
3. This strategy addresses key CL challenges through
several mechanisms. It employs adaptive sampling
via zero-shot clustering, mitigating task boundary am-
biguity, while ensuring balanced class and task repre-
sentation through stratified sampling. The approach
maintains privacy preservation by operating on em-
beddings, and achieves computational efficiency with
a compact, diverse replay buffer.
Input: Training strategy S, storage policy P (with zero-shot exemplars buffer B)
Output: Updated training strategy with ER
Attach P to S;
while training do
    if B ∈ P is not empty then
        S.dataloader ← Combine(S.adapted_dataset, B);
    end
    ExecuteTrainingExperience();
    B ← P.update(S);
end
Algorithm 3: Experience Replay Strategy.
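A minimal PyTorch sketch of this loop is shown below, assuming the buffer is a dataset of (embedding, label) pairs; the real strategy additionally routes samples to task-specific heads and hooks into Avalanche's plugin system, which we omit here for brevity.

```python
import torch
from torch.utils.data import ConcatDataset, DataLoader

def train_experience(model, task_dataset, buffer_dataset, epochs=5, lr=1e-3):
    """Minimal ER loop in the spirit of Algorithm 3 (single-head view).

    task_dataset / buffer_dataset: datasets of (CLIP embedding, label)
    pairs; buffer_dataset may be empty for Task 0.
    """
    # Combine the new experience with the replay buffer, if non-empty.
    data = (ConcatDataset([task_dataset, buffer_dataset])
            if len(buffer_dataset) > 0 else task_dataset)
    loader = DataLoader(data, batch_size=32, shuffle=True)
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = torch.nn.CrossEntropyLoss()
    for _ in range(epochs):
        for emb, y in loader:
            opt.zero_grad()
            loss = loss_fn(model(emb), y)  # replay mixed with new data
            loss.backward()
            opt.step()
```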
5 EXPERIMENTAL EVALUATION
We rigorously evaluate our proposed CL approach on
the challenging task of DR detection, assessing its
efficacy under various conditions that simulate real-
world scenarios.
5.1 Experimental Setup
5.1.1 Testbed
Our experiments were conducted on a CPU-based
platform with comprehensive specifications. The
hardware configuration consists of Dual Intel Xeon
Platinum 8360Y CPUs operating at 2.40GHz with
256 GB RAM, running on Ubuntu 22.04 LTS.
Our software stack includes the Intel AI Analytics Toolkit (Docker image: intel/oneapi-aikit:devel-ubuntu22.04, https://hub.docker.com/r/intel/oneapi-aikit), Avalanche 0.3.1 for CL (https://avalanche.continualai.org/), Intel Extension for PyTorch 1.12.100+cpu (https://github.com/intel/intel-extension-for-pytorch), and Intel Extension for Scikit-learn 2023.0.1 (https://github.com/intel/scikit-learn-intelex). This environment ensures reproducibility and leverages optimized libraries for enhanced performance on CPU architectures.
5.1.2 Dataset
We utilize the APTOS 2019 Blindness Detection dataset (Karthik, 2019), comprising 3,662 high-resolution retinal images. This dataset, developed in collaboration with Aravind Eye Hospital in India, captures real-world clinical complexities and image quality variations, providing a robust testbed for our CL approach. Figure 5 presents sample images from this dataset, illustrating the diversity in image quality and pathological conditions.
Figure 5: Representative fundus images from different
tasks, showcasing varying image quality and conditions.
Left: Task 0 - uniform image quality; Center: Task 1 - vari-
ation in lighting; Right: Task 2 - artificially added Gaussian
noise to simulate challenging imaging conditions.
5.2 Experimental Methodology
5.2.1 Task Design
We construct three distinct tasks to assess our model’s
performance under progressively challenging condi-
tions:
Task 0 (Baseline). Uniform image quality, repre-
senting ideal clinical conditions.
Task 1 (Lighting Variation). Introduces varia-
tions in lighting, simulating different imaging en-
vironments.
Task 2 (Noise Addition). Incorporates Gaussian
noise, emulating low-quality or degraded images.
This task progression allows us to evaluate our
model’s robustness to common real-world variations
in medical imaging.
5.2.2 Model Architectures
To rigorously evaluate the generalizability and robust-
ness of our approach, we employ three distinct neural
network architectures, each chosen to address specific
aspects of CL in medical imaging:
Multi-Layer Perceptron (MLP). A baseline ar-
chitecture with one hidden layer, selected for its
simplicity and efficiency. This model serves as
a litmus test for our method’s ability to enhance
even basic architectures in CL scenarios.
Residual Network. Incorporating skip connec-
tions, this architecture mitigates the vanishing
gradient problem, crucial for maintaining perfor-
mance across multiple tasks in CL. Its ability to
learn residual functions is particularly relevant for
detecting subtle changes in medical images across
different domains.
Attention-based Network. Leveraging self-
attention mechanisms, this model excels at cap-
turing complex, long-range dependencies in data.
In the context of medical imaging, it can focus
on the most relevant features for diagnosis, poten-
tially enhancing the model’s adaptability to new
tasks.
All architectures are designed to process 768-
dimensional CLIP embeddings as input, outputting
binary classifications for DR. This unified input-
output structure allows for fair comparison across ar-
chitectures while leveraging the rich semantic infor-
mation captured by CLIP embeddings.
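As a concrete illustration of the shared-base, multi-head design (Equation 2), here is a minimal PyTorch sketch of the MLP variant; the hidden width of 256 is our assumption, not a reported hyperparameter.

```python
import torch.nn as nn

class MultiHeadClassifier(nn.Module):
    """Sketch: a shared base over 768-d CLIP embeddings plus one
    lightweight linear head per task, expanded incrementally (Eq. 2)."""

    def __init__(self, emb_dim=768, hidden=256):
        super().__init__()
        self.base = nn.Sequential(nn.Linear(emb_dim, hidden), nn.ReLU())
        self.heads = nn.ModuleList()

    def add_head(self, n_classes=2):
        # A new head g_n(z; theta_n) is appended when a new task arrives.
        self.heads.append(nn.Linear(self.base[0].out_features, n_classes))

    def forward(self, x, task_id=-1):
        z = self.base(x)               # shared representation z = b(x)
        return self.heads[task_id](z)  # task-specific prediction
```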
5.2.3 Continual Learning Strategies
We benchmark our approach against a diverse set of
state-of-the-art CL strategies, each addressing differ-
ent aspects of the catastrophic forgetting problem:
Naive (fine-tuning). Serves as a baseline, high-
lighting the severity of catastrophic forgetting in
the absence of specialized CL techniques.
Elastic Weight Consolidation (EWC). A
regularization-based approach that selectively
slows down learning on important parameters,
crucial for preserving knowledge in medical
imaging tasks where certain features may be
universally important.
Learning without Forgetting (LwF). Employs
knowledge distillation to retain previous task
information, potentially beneficial in scenarios
where task boundaries in medical imaging are not
clearly defined.
Gradient Episodic Memory (GEM). Constrains
gradient updates to maintain performance on pre-
vious tasks, offering insights into the trade-offs
between stability and plasticity in medical AI
models.
Each strategy is evaluated in its original form and
enhanced with our proposed zero-shot clustering and
stratified sampling approach. This comprehensive
evaluation not only benchmarks our method against
established techniques but also demonstrates its po-
tential as a complementary enhancement to existing
CL strategies in the challenging domain of medical
image analysis.
5.2.4 Evaluation Metric
We employ the Average Mean Class Accuracy
(AMCA) as our primary evaluation metric:
\mathrm{AMCA} = \frac{1}{T} \sum_{t=1}^{T} \left( \frac{1}{C} \sum_{c=1}^{C} a_{c,t} \right) \quad (7)
where T = 3 (number of tasks) and C = 2 (num-
ber of classes: with and without DR). This metric en-
sures robustness against class imbalance and distribu-
tion shifts across tasks.
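A small helper makes the metric concrete; the array layout below is our assumption about how per-class accuracies a_{c,t} are logged.

```python
import numpy as np

def amca(per_class_acc):
    """Average Mean Class Accuracy (Eq. 7).

    per_class_acc: array of shape [T, C] holding the accuracy a_{c,t} of
    class c on task t. Averages over classes first, then over tasks.
    """
    per_class_acc = np.asarray(per_class_acc, dtype=float)
    return float(per_class_acc.mean(axis=1).mean())
```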
5.3 Experimental Protocol
To ensure robust and generalizable results, we im-
plemented a rigorous experimental protocol. Our
approach included extensive hyperparameter explo-
ration, varying the number of neighbors (15, 20, 25,
30, 50) in our stratified sampling approach to assess
sensitivity. For stochastic control, we utilized differ-
ent random seeds for each run, ensuring statistical
validity. We conducted a comprehensive evaluation
across multiple architectures, including MLP, Resid-
ual, and Attention-based networks. These were tested
with various CL strategies: Naive, EWC, LwF, and
GEM, both in their original form and enhanced with
our approach. For each configuration, we recorded
key performance metrics, focusing on AMCA and
Forgetting scores. This comprehensive evaluation
framework enables us to draw statistically significant
conclusions about our method’s efficacy across di-
verse scenarios in medical imaging CL.
5.4 Results and Discussion
We evaluate our approach across multiple architec-
tures (MLP, Residual, Attention) and continual learn-
ing strategies (Naive, GEM, LwF, EWC), compar-
ing performance and computational efficiency against
baseline methods.
5.4.1 Performance Analysis
Table 3 presents Average Mean Class Accuracy
(AMCA) scores and Forgetting metrics across differ-
ent configurations. Our method consistently outper-
forms baselines across all architectures and strategies.
Table 3: Comparison of Mean AMCA scores and forgetting (parentheses) for different base models and strategies. Boldface indicates superior performance between Ours and Original strategies based on Mean AMCA. Averages are computed across different NN values for each base model.

Model     NN   Naive                  GEM                    LwF                    EWC
               Ours       Original    Ours       Original    Ours       Original    Ours       Original
Attention 15   0.936(3.7) 0.929(3.4)  0.967(3.5) 0.965(3.4)  0.823(4.8) 0.793(5.3)  0.930(4.1) 0.928(3.1)
          20   0.938(4.2) 0.926(4.3)  0.968(4.2) 0.965(3.2)  0.816(5.8) 0.793(5.3)  0.934(3.7) 0.919(4.4)
          25   0.939(4.7) 0.922(4.9)  0.967(3.3) 0.965(3.2)  0.826(5.5) 0.789(5.4)  0.933(3.6) 0.915(4.2)
          30   0.937(4.5) 0.930(2.5)  0.967(4.2) 0.961(3.4)  0.825(5.3) 0.792(5.5)  0.929(5.9) 0.918(4.1)
          50   0.938(4.3) 0.932(2.6)  0.965(4.1) 0.960(3.8)  0.824(5.7) 0.791(5.0)  0.932(4.8) 0.911(5.7)
          Avg  0.938(4.3) 0.928(3.5)  0.967(3.9) 0.963(3.4)  0.823(5.4) 0.792(5.3)  0.932(4.4) 0.918(4.3)
Residual  15   0.913(5.1) 0.880(9.7)  0.941(2.6) 0.934(2.8)  0.940(4.6) 0.928(4.5)  0.939(3.8) 0.933(3.7)
          20   0.912(5.0) 0.879(9.7)  0.942(2.8) 0.935(2.9)  0.934(4.5) 0.930(4.5)  0.938(4.3) 0.932(4.1)
          25   0.913(5.3) 0.879(9.7)  0.942(1.7) 0.938(2.8)  0.933(4.6) 0.928(4.4)  0.938(3.9) 0.926(4.5)
          30   0.912(5.7) 0.879(9.6)  0.940(3.9) 0.937(2.7)  0.938(4.4) 0.928(4.5)  0.938(4.9) 0.914(6.7)
          50   0.912(5.0) 0.903(3.8)  0.941(3.1) 0.935(2.6)  0.936(4.8) 0.929(4.5)  0.932(6.7) 0.901(10.0)
          Avg  0.912(5.2) 0.884(8.5)  0.941(2.8) 0.936(2.8)  0.936(4.6) 0.929(4.5)  0.937(4.7) 0.921(5.8)
MLP       15   0.915(5.1) 0.899(3.2)  0.931(5.1) 0.923(3.8)  0.942(4.8) 0.940(4.8)  0.930(6.9) 0.917(5.8)
          20   0.914(5.3) 0.897(4.3)  0.931(5.1) 0.923(3.7)  0.941(5.0) 0.940(4.5)  0.892(5.2) 0.874(5.4)
          25   0.914(5.2) 0.900(3.8)  0.931(5.1) 0.922(4.0)  0.944(5.0) 0.937(5.0)  0.891(5.6) 0.875(4.1)
          30   0.914(5.6) 0.896(4.2)  0.930(5.2) 0.923(3.8)  0.944(4.8) 0.937(5.3)  0.892(5.5) 0.879(3.2)
          50   0.911(5.5) 0.893(5.1)  0.927(5.3) 0.921(3.9)  0.948(4.9) 0.936(5.3)  0.890(5.9) 0.872(4.7)
          Avg  0.914(5.4) 0.897(4.1)  0.930(5.1) 0.922(3.9)  0.944(4.9) 0.938(5.0)  0.899(5.8) 0.883(4.6)
Architectural Robustness. Performance im-
provements range from 0.8% to 3.1% in AMCA
across all architectures, with the most signifi-
cant gains observed in complex models (Residual:
+2.8% for Naive, Attention: +3.1% for LwF).
Strategy Enhancement. Our approach amplifies
the strengths of existing CL strategies. For in-
stance, it reduces Forgetting in Naive learning (8.5
to 5.2 in Residual models) and enhances knowl-
edge distillation in LwF (3.1% AMCA increase
in Attention models).
Forgetting Mitigation. We observe consistent
reductions in Forgetting metrics, particularly no-
table in the Naive strategy across all architectures,
indicating improved knowledge retention.
These results suggest that our zero-shot cluster-
ing and stratified sampling approach provides a more
diverse and representative set of samples, enhancing
both learning and retention in CL scenarios.
5.4.2 Hyperparameter Sensitivity
The number of neighbors (NN) used in our sampling has a significant impact on both performance and forgetting. Mean AMCA peaks at around 25 neighbors, followed by a slight decline and stabilization between 30 and 50 neighbors. Similarly, forgetting is reduced most at around 25 neighbors, with the benefit gradually diminishing as the number of neighbors increases further.
Figure 6: Both metrics reach optimal values around 25-
30 number of neighbors (NN), illustrating the trade-off be-
tween performance and retention.
This highlights a sweet spot between 25 and 30
neighbors, where both performance (Mean AMCA)
and retention (reduced forgetting) are optimized.
Tuning within this range balances sample diversity
and computational efficiency, ensuring high perfor-
mance with minimal forgetting, as shown in Figure 6.
5.4.3 Computational Efficiency
Figure 7 compares the execution times between our
method and the original strategy. Our approach in-
curs a minimal 6% increase in average execution time
(5.81 s vs. 5.48 s). This negligible overhead is con-
sistent across all architectures, with Attention models
showing the highest variability due to their complex-
ity.
The marginal increase in computational cost, coupled with significant performance gains, positions our method as an efficient drop-in enhancement to existing strategies. This balance is particularly valuable
Figure 7: Average execution time comparison across archi-
tectures. Box plots show distribution of execution times,
with mean values in the legend.
in time-sensitive applications like medical imaging,
where improved accuracy without substantial compu-
tational overhead is crucial.
6 CONCLUSION
This study introduced a framework integrating zero-
shot learning with Experience Replay for CL in med-
ical imaging, with a focus on DR detection. Our ap-
proach, leveraging LLMs for DIL, demonstrated sev-
eral key achievements. We observed consistent per-
formance improvements across diverse model archi-
tectures and CL strategies, with AMCA increases up
to 3.1%. The system showed effective mitigation of
catastrophic forgetting, evidenced by reduced Forget-
ting metrics, particularly in naive learning scenarios.
Additionally, we achieved this with negligible com-
putational overhead (6% increase in execution time),
enabling seamless integration into existing systems.
These results highlight the potential of our method
to enhance the adaptability, efficiency, and privacy-
preservation of AI systems in healthcare. The frame-
work’s ability to maintain performance across varying
data distributions while operating on embeddings ad-
dresses critical challenges in medical AI deployment.
Future research directions encompass several key
areas. We aim to scale to more complex, multi-
modal medical datasets and develop adaptive cluster-
ing algorithms for dynamic medical imaging scenar-
ios. Additionally, we plan to explore applicability in
other domains with similar privacy and distribution
shift concerns. A crucial component of future work
involves conducting rigorous ethical analyses, partic-
ularly regarding data privacy and algorithmic bias in
diverse patient populations.
While our work represents a step towards more robust and adaptable AI in healthcare, realizing its full potential requires extensive clinical validation. As
we progress towards real-world applications, address-
ing scalability, generalizability, and ethical consider-
ations will be paramount.
ACKNOWLEDGEMENTS
We thank Lenovo for providing the technical in-
frastructure to run the experiments in this paper.
This work was partially supported by Lenovo and
Intel as part of the Lenovo AI Innovators Uni-
versity Research program, by the Spanish Min-
istry of Science (MICINN), the Research State
Agency (AEI) and European Regional Develop-
ment Funds (ERDF/FEDER) under grant agreements
PID2019-107255GB-C22 and PID2021-126248OB-
I00, MCIN/AEI/10.13039/ 501100011033/FEDER,
UE, and by the Generalitat de Catalunya under con-
tract 2021-SGR-00478.
REFERENCES
Arani, E., Sarfraz, F. B., and Zonooz, B. (2022). Learn-
ing fast, learning slow: A general continual learn-
ing method based on complementary learning system.
arXiv preprint abs/2201.12604.
Brown, T. B. et al. (2020). Language models are few-shot learners. In Proceedings of the 34th International Conference on Neural Information Processing Systems, NIPS'20. Curran Associates Inc.
Buzzega, P., Boschini, M., Porrello, A., Abati, D., and
Calderara, S. (2020). Dark experience for general
continual learning: a strong, simple baseline. In Pro-
ceedings of the 34th International Conference on Neu-
ral Information Processing Systems, NIPS’20. Curran
Associates Inc.
De Lange, M., Aljundi, R., Masana, M., Parisot, S., Jia,
X., Leonardis, A., Slabaugh, G., and Tuytelaars, T.
(2022). A continual learning survey: Defying forget-
ting in classification tasks. IEEE Transactions on Pat-
tern Analysis and Machine Intelligence, 44(7):3366–
3385.
Garg, S., Farajtabar, M., Pouransari, H., Vemulapalli, R.,
Mehta, S., Tuzel, O., Shankar, V., and Faghri, F.
(2023). TiC-CLIP: Continual Training of CLIP Mod-
els. arXiv preprint abs/2310.16226.
Karthik, Maggie, and Dane, S. (2019). APTOS 2019 Blindness Detection. https://kaggle.com/competitions/aptos2019-blindness-detection.
Khan, V., Cygert, S., Deja, K., Trzcinski, T., and Twar-
dowski, B. (2024). Looking through the past: Better
knowledge retention for generative replay in continual
learning. IEEE Access, 12:45309–45317.
Kirkpatrick, J., Pascanu, R., Rabinowitz, N., Veness, J.,
Desjardins, G., Rusu, A. A., Milan, K., Quan, J.,
Ramalho, T., Grabska-Barwinska, A., et al. (2017).
Overcoming catastrophic forgetting in neural net-
works. Proceedings of the national academy of sci-
ences, 114(13):3521–3526.
Koh, H., Kim, D., Ha, J.-W., and Choi, J. (2022). Online
continual learning on class incremental blurry task
configuration with anytime inference. arXiv preprint
abs/2110.10031.
Kuang, K., Cui, P., Athey, S., Xiong, R., and Li, B. (2018).
Stable prediction across unknown environments. In
Proceedings of the 24th ACM SIGKDD International
Conference on Knowledge Discovery & Data Mining,
KDD’18, pages 1617–1626, New York, NY, USA. As-
sociation for Computing Machinery.
Kumar, P. and Srivastava, M. M. (2018). Example min-
ing for incremental learning in medical imaging. In
2018 IEEE Symposium Series on Computational In-
telligence (SSCI), pages 48–51.
Kumari, S. and Singh, P. (2024). Deep learning for unsuper-
vised domain adaptation in medical imaging: Recent
advancements and future perspectives. Computers in
Biology and Medicine, 170:107912.
Lenga, M., Schulz, H., and Saalbach, A. (2020). Con-
tinual learning for domain adaptation in chest x-ray
classification. In Proceedings of the Third Confer-
ence on Medical Imaging with Deep Learning, PMLR
121:413-423, 2020.
Li, Z. and Hoiem, D. (2017). Learning without forgetting.
IEEE Transactions on Pattern Analysis and Machine
Intelligence, 40(12):2935–2947.
Lopez-Paz, D. and Ranzato, M. (2017). Gradient episodic
memory for continual learning. In Proceedings of the
31st International Conference on Neural Information
Processing Systems, NIPS’17, pages 6470–6479. Cur-
ran Associates Inc.
OpenAI (2023). GPT-4 Technical Report.
https://openai.com/research/gpt-4.
Parisi, G. I., Kemker, R., Part, J. L., Kanan, C., and
Wermter, S. (2019). Continual Lifelong Learning
with Neural Networks: A Review. Neural Networks,
113:54–71.
Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh,
G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P.,
Clark, J., Krueger, G., and Sutskever, I. (2021). Learn-
ing transferable visual models from natural language
supervision. In Meila, M. and Zhang, T., editors, Pro-
ceedings of the 38th International Conference on Ma-
chine Learning, ICML 2021, 18-24 July 2021, Virtual
Event, volume 139 of Proceedings of Machine Learn-
ing Research, pages 8748–8763. PMLR.
Rebuffi, S.-A., Kolesnikov, A., Sperl, G., and Lampert,
C. H. (2017). iCaRL: Incremental Classifier and Rep-
resentation Learning. In 2017 IEEE Conference on
Computer Vision and Pattern Recognition (CVPR),
pages 5533–5542.
Riemer, M., Cases, I., Ajemian, R., Liu, M., Rish, I., Tu,
Y., and Tesauro, G. (2019). Learning to learn with-
out forgetting by maximizing transfer and minimizing
interference. arXiv preprint abs/1810.11910.
Robins, A. (1993). Catastrophic forgetting, rehearsal and
pseudorehearsal. Connection Science, 5(2):123–146.
Serra, J., Suris, D., Miron, M., and Karatzoglou, A. (2018).
Overcoming catastrophic forgetting with hard atten-
tion to the task. In International Conference on Ma-
chine Learning, pages 4548–4557. PMLR.
Shin, H., Lee, J. K., Kim, J., and Kim, J. (2017). Con-
tinual learning with deep generative replay. In Pro-
ceedings of the 31st International Conference on Neu-
ral Information Processing Systems, NIPS’17, pages
2994–3003. Curran Associates Inc.
Shokri, R. and Shmatikov, V. (2015). Privacy-preserving
deep learning. In 2015 53rd Annual Allerton Con-
ference on Communication, Control, and Computing
(Allerton), pages 909–910.
Valindria, V. V., Lavdas, I., Bai, W., Kamnitsas, K.,
Aboagye, E. O., Rockall, A. G., Rueckert, D., and
Glocker, B. (2018). Domain Adaptation for MRI Or-
gan Segmentation using Reverse Classification Accu-
racy. arXiv preprint abs/1806.00363.
Venkataramani, R., Ravishankar, H., and Anamandra, S.
(2018). Towards continuous domain adaptation for
healthcare. arXiv preprint abs/1812.01281.
Wang, L., Zhang, X., Su, H., and Zhu, J. (2024). A Com-
prehensive Survey of Continual Learning: Theory,
Method and Application. IEEE Transactions on Pat-
tern Analysis and Machine Intelligence, 46(8):5362–
5383.
Zhang, J., Fu, Y., Peng, Z., Yao, D., and He, K. (2024).
CORE: Mitigating Catastrophic Forgetting in Contin-
ual Learning through Cognitive Replay. arXiv preprint
abs/2402.01348.