Deep Image Clustering with Model-Agnostic Meta-Learning
Kim Bjerge (1), Paul Bodesheim (2) and Henrik Karstoft (1)
(1) Department of Electrical and Computer Engineering, Aarhus University, Finlandsgade 22, 8200 Aarhus N, Denmark
(2) Computer Vision Group, Friedrich Schiller University, Ernst-Abbe-Platz 2, 07743 Jena, Germany
{kbe, hka}@ece.au.dk, paul.bodesheim@uni-jena.de
ORCID: Kim Bjerge https://orcid.org/0000-0001-6742-9504, Paul Bodesheim https://orcid.org/0000-0002-3564-6528, Henrik Karstoft https://orcid.org/0000-0003-3739-8983
Keywords:
Deep Clustering, Episodic Training, Few-Shot Learning, Multivariate Loss, Semi-Supervised Learning.
Abstract:
Deep clustering has proven successful in analyzing complex, high-dimensional real-world data. Typically,
features are extracted from a deep neural network and then clustered. However, training the network to extract
features that can be clustered efficiently in a semantically meaningful way is particularly challenging when
data is sparse. In this paper, we present a semi-supervised method to fine-tune a deep learning network using
Model-Agnostic Meta-Learning, commonly employed in Few-Shot Learning. We apply episodic training
with a novel multivariate scatter loss, designed to enhance inter-class feature separation while minimizing
intra-class variance, thereby improving overall clustering performance. Our approach works with state-of-the-
art deep learning models, spanning convolutional neural networks and vision transformers, as well as different
clustering algorithms like K-means and Spectral clustering. The effectiveness of our method is tested on
several commonly used Few-Shot Learning datasets, where episodic fine-tuning with our multivariate scatter
loss and a ConvNeXt backbone outperforms other models, achieving adjusted rand index scores of 89.7% on
the EU moths dataset and 86.9% on the Caltech birds dataset, respectively. Hence, our proposed method can
be applied across various practical domains, such as clustering images of animal species in biology.
1 INTRODUCTION
In real-world applications such as biology and
ecology, training supervised networks to classify
rare species from images presents significant chal-
lenges (Mora et al., 2011; Binta Islam et al., 2023). In
many cases, images of these species are either scarce
or non-existent, making unsupervised methods such
as deep image clustering a valuable alternative.
Clustering is a core problem in unsuper-
vised learning, with traditional methods like K-
means (Macqueen, 1967) and Spectral cluster-
ing (Von Luxburg, 2007) used to group data into clus-
ters where similar data points are close together and
dissimilar points are far apart. However, these clas-
sical methods often operate in the feature space of
hand-crafted features, which may not capture the un-
derlying structure of the data, limiting their effec-
tiveness. To overcome this, deep clustering tech-
niques have emerged (Bo et al., 2020; Sun et al.,
2022; Huang et al., 2024; Lu et al., 2024), leverag-
ing deep neural networks to learn an embedding that
better reflects the intrinsic data structure before ap-
plying clustering. For instance, deep embedded clus-
tering uses auto-encoders and Kullback-Leibler (KL)
divergence minimization to improve clustering accu-
racy (Xie et al., 2016), while other methods use Con-
volutional Neural Networks (CNN) to extract latent
representations (Yang et al., 2016). Since deep clus-
tering is an unsupervised method, it may still be misled
by noisy or complex data in the absence of labels.
A natural solution to this limitation is to incorpo-
rate some supervised information, leading to semi-
supervised deep clustering (Ren et al., 2019; Qin
et al., 2019; Cai et al., 2023). In real-world appli-
cations, individual labels are often hard to obtain, but
pairwise relations between data points are more ac-
cessible. For example, in face recognition, while the
identity (label) of a person may be unknown, it is of-
ten easy to determine whether two images represent
the same individual (Chopra et al., 2005). The learn-
ing process minimizes a contrastive loss function that
drives the similarity metric to be small for pairs of
faces from the same person, and large for pairs from
different persons. Semi-supervised methods use pair-
wise constraints (whether two data points belong to
the same class) to guide clustering. For example,
Vilhagra et al. proposed a Siamese network (Vilhagra
et al., 2020), which has been applied with Spectral
clustering for single-cell RNA sequencing data (Jiang
et al., 2022). While this approach can enhance per-
formance, its success largely depends on the quality
of the pairwise constraints, and selecting them appro-
priately is a difficult challenge.
In this paper, we propose using Model-Agnostic
Meta-Learning (MAML) (Finn et al., 2017), a
technique commonly applied in Few-Shot Learning
(FSL) (Wang et al., 2020; Song et al., 2023), to ad-
dress the challenge of clustering a set of data points
with limited labeled data. FSL, which involves
training models using only a few labeled examples,
employs a support and query set for episodic train-
ing (Li et al., 2019), and has gained substantial at-
tention for its ability to enable models to learn from
sparse data (Wang et al., 2020). FSL bridges the
gap between traditional deep learning, which requires
large labeled datasets, and the growing need for sys-
tems that can quickly adapt to new tasks or domains.
In our approach, we utilize a small amount of
labeled data, such as similar images from a limited
set of categories, to improve clustering performance.
This method is particularly relevant in domains like
biology, where clustering rare or unseen species is
crucial (Bjerge et al., 2023). By merging FSL with
clustering, we aim to develop models that can gener-
alize beyond their training data, addressing the critical
need for adaptability in real-world applications.
In domains where labeled data is available for
some classes but absent for others, our proposed
method offers a viable solution by clustering data, un-
like supervised classifiers, which struggle to handle
unknown data samples.
Contribution. Our paper introduces a novel
methodology specifically tailored for clustering
samples within the framework of few-shot learning,
utilizing episodic training to achieve domain gen-
eralization (Li et al., 2019; Ren et al., 2018). The
approach is designed to cluster samples from novel
classes not included in the FSL support set.
Our proposal uses transductive inference with
the Prototypical Network during meta-learning with
episodic training of models to tackle the challenges
posed by limited labeled data.
To further improve clustering performance, we
propose a novel multivariate scatter loss func-
tion, extending the univariate scatter loss introduced
by (Bjerge et al., 2024). This function is strategi-
cally applied during episodic training to disentangle
and separate the distribution of classes within the sim-
ilarity space. The paper evaluates four state-of-the-art
deep learning networks fine-tuned with episodic train-
ing and the multivariate scatter loss, using the
K-means and Spectral clustering algorithms.
2 RELATED WORK
2.1 Deep Image Clustering
Deep clustering has emerged as a powerful technique
for analyzing complex, high-dimensional data. A
variety of methods have been proposed to enhance
clustering performance, especially in cases where la-
beled data is scarce or non-existent. Deep Embed-
ded Clustering (DEC), introduced by Xie et al. (Xie
et al., 2016), is one of the state-of-the-art approaches
in deep clustering. However, DEC does not leverage
prior knowledge to guide the learning process. To
address this limitation, Ren et al. (Ren et al., 2019)
propose Semi-Supervised Deep Embedded Clustering
(SDEC), which incorporates labeled data to improve
clustering outcomes. This highlights the importance
of integrating supervision in deep clustering for im-
proved performance. Bo et al. (Bo et al., 2020) further
advanced the field by introducing the Structural Deep
Clustering Network (SDCN), which integrates struc-
tural information into deep clustering. They point out
that most deep clustering methods rely on the neural
network's ability to learn effective feature representa-
tions, often through auto-encoders.
While deep clustering methods have shown great
promise, Sun et al. (Sun et al., 2022) note that the ab-
sence of labels often results in unreliable clustering.
To mitigate this, they propose Deep Active Cluster-
ing, which actively selects key data points for human
labeling. This novel approach improves the clustering
process by intelligently choosing the most informa-
tive samples for annotation, addressing one of the lim-
itations of conventional semi-supervised deep cluster-
ing methods that rely on fixed, pre-labeled data.
The evolution of deep clustering methods has also
been tracked by Lu et al. (Lu et al., 2024), who of-
fer a comprehensive review of prior knowledge used
in deep clustering. Huang et al. (Huang et al., 2024)
provide a general framework for Deep Image Cluster-
ing Networks, outlining the key stages: image pre-
processing, embedding, feature processing, cluster-
ing, and result processing. The challenges posed by
big data clustering are also highlighted by Fahad et
al. (Fahad et al., 2014). They offer a survey of ex-
isting clustering algorithms and a comparison of their
theoretical and empirical performance.
Rodriguez et al. (Rodriguez et al., 2019) per-
formed a systematic comparison of nine well-known
clustering methods, offering insights into their rel-
ative performance across different datasets. Addi-
tionally, various classic clustering algorithms such as
Density-Based Clustering (Ester et al., 1996; Ankerst
et al., 1999), EM Algorithms (Dempster et al.,
1977), Hierarchical Density-Based Clustering (HDB-
SCAN) (McInnes et al., 2017) and Spectral Clus-
tering (Von Luxburg, 2007) have been explored in
the context of clustering of large-scale data. Jain et
al. (Jain, 2010) summarize these methods, discussing
key challenges like feature selection, semi-supervised
clustering, and clustering at scale.
2.2 Few-Shot Learning
Few-Shot Learning comprises diverse approaches
tailored to address the challenges inherent in learning
from limited labeled data. Few-shot learning can be
categorized into two distinct branches: inductive FSL
and transductive FSL. The former involves the pre-
diction of test samples individually, while the latter
addresses the prediction of test samples collectively.
For a comprehensive overview of the evolving
FSL landscape, we refer to surveys such as those pre-
sented in works by Wang et al. (Wang et al., 2020)
and Song et al. (Song et al., 2023). A notable cat-
egory of methods relevant to our proposed approach
leverages Euclidean distance and cosine similarity as
the fundamental measure. This includes Prototyp-
ical Networks (Snell et al., 2017), Finetune (Chen
et al., 2019), Transductive Information Maximiza-
tion (Boudiaf et al., 2020), and Prototypical Rectifi-
cation (Liu et al., 2020).
Meta-learning is a key paradigm in the FSL land-
scape, and a comprehensive survey on the subject
is presented by Hospedales et al. (Hospedales et al.,
2022). Additionally, the integration of episodic train-
ing for domain generalization, as discussed by Li
et al. (Li et al., 2019) and Model-Agnostic Meta-
Learning (Finn et al., 2017), emerges as a crucial as-
pect in enhancing the adaptability and robustness of
FSL models.
In few-shot classification, we are given a small
support set of $(N \cdot K)$ labeled examples $S = \{(X_1^{(1)}, y_1), \ldots, (X_N^{(K)}, y_K)\}$,
where each $X_n^{(k)} \in \mathbb{R}^D$ is
the D-dimensional embedded feature vector of an
example and $y_k \in \{1, \ldots, K\}$ are the correspond-
ing labels (Bjerge et al., 2024). The set $S_k = \{(X_1^{(k)}, y_k), \ldots, (X_N^{(k)}, y_k)\}$ denotes the subset of exam-
ples labeled with class $k$, and the number of class la-
bels in the support set is denoted K-way, where we
have N shots of examples in each $S_k$. A query set $Q$
contains sample images $q_i$ that belong to classes in
the support set, where the goal is to match the query
samples to the correct class label. When training with
a dataset that comprises more classes than those in-
cluded in the support set, a random subset of K classes
is selected for each few-shot task, which encompasses
both the support and query sets. We propose to use
the Prototypical Network (Snell et al., 2017), which
employs the Euclidean distance to compare the center
point of the support set with the query samples.
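To make the episodic setup concrete, the sketch below assembles one K-way, N-shot task from embedded feature vectors and scores query samples by their negative squared Euclidean distance to the class prototypes, as in a Prototypical Network. It is a minimal illustration under our own naming (sample_episode, prototype_logits) and does not reproduce the authors' implementation.

```python
import torch

def sample_episode(features, labels, k_way=5, n_shot=5, n_query=6):
    """Randomly draw a K-way, N-shot support set and a query set
    from a pool of embedded feature vectors with integer labels."""
    classes = torch.randperm(int(labels.max()) + 1)[:k_way]
    support, support_y, query, query_y = [], [], [], []
    for new_label, c in enumerate(classes):
        idx = torch.nonzero(labels == c, as_tuple=True)[0]
        idx = idx[torch.randperm(len(idx))]
        support.append(features[idx[:n_shot]])
        query.append(features[idx[n_shot:n_shot + n_query]])
        support_y += [new_label] * n_shot
        query_y += [new_label] * n_query
    return (torch.cat(support), torch.tensor(support_y),
            torch.cat(query), torch.tensor(query_y))

def prototype_logits(support, support_y, query, k_way):
    """Class prototypes are the mean support embeddings; queries are
    scored by negative squared Euclidean distance to each prototype."""
    protos = torch.stack([support[support_y == k].mean(0) for k in range(k_way)])
    return -torch.cdist(query, protos) ** 2  # higher score = closer prototype

# Usage with random embeddings standing in for backbone features
feats = torch.randn(20 * 30, 128)            # 20 classes, 30 samples each
labs = torch.arange(20).repeat_interleave(30)
s, sy, q, qy = sample_episode(feats, labs, k_way=5, n_shot=5, n_query=6)
logits = prototype_logits(s, sy, q, k_way=5)
loss = torch.nn.functional.cross_entropy(logits, qy)
```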
In conclusion, FSL and deep clustering have made
significant strides, leveraging powerful feature ex-
traction methods from deep learning and incorporat-
ing structural and active learning techniques to en-
hance clustering performance. Despite these advance-
ments, challenges remain, particularly in large-scale
data clustering and the integration of prior knowledge,
which continue to drive innovation in the field.
3 METHOD
3.1 Deep Clustering with Few-Shot
Learning
We propose clustering images by utilizing feature em-
beddings derived from the output vector of a deep
learning (DL) model. The DL model is fine-tuned
through episodic training, a technique commonly
used in FSL (Li et al., 2019). We introduce a multi-
variate loss function to enhance feature clustering dur-
ing episodic training. The pipeline, illustrated in Fig-
ure 1, consists of three steps.
Step 1. involves training and fine-tuning a pre-
trained DL model to extract features that effectively
represent the images, ensuring that the embedding
space is spread across diverse clusters, as visualized
in Figure 2. To enhance the quality of these feature
embeddings, we leverage MAML (Finn et al., 2017;
Hospedales et al., 2022), a technique commonly used
in FSL. MAML uses a training and validation dataset
with classes of images that are not present in the test
dataset, allowing the model to generalize well to new,
unseen classes.
In our study, we selected four DL models that span
both convolutional neural networks (CNNs) and vi-
sion transformers (ViTs). We chose ResNet50v2 (He
et al., 2016) and EfficientNetB3 (Tan and Le, 2019)
as they are widely used CNNs, achieving top-1
accuracies of 76.0% and 81.6% on the ImageNet
dataset (Russakovsky et al., 2015), respectively. Fur-
thermore, we included ConvNeXt-B (Todi et al.,
2023), a recently published model that achieves an
ImageNet top-1 accuracy of 85.3%, and the vision
transformer ViT-B/16 (Dosovitskiy et al., 2021) with
an accuracy of 88.6%.

Figure 1: Pipeline of our proposed deep clustering method for effectively grouping images in few-shot scenarios. A training and validation dataset with different classes is used to fine-tune a deep learning model for feature extraction. In Step 1, the model is fine-tuned with few-shot learning and episodic training, and the best model is chosen based on K-means clustering of the validation dataset. In Steps 2 and 3, the extracted features are fed into a clustering algorithm, which groups a set of new test images into N clusters.

Figure 2: t-SNE projection of feature embeddings for clusters obtained with (a) a pretrained DL model and (b) a DL model fine-tuned with episodic training and the multivariate scatter loss. The scatter plot illustrates the clustering of the first and second components of 11 classes in the embedding space.
Step 2. concerns extracting features from the fine-
tuned DL model. The feature vector is extracted from
the backbone of the trained DL model. The backbone
architecture includes multiple convolutional layers or
vision transformer layers, with the final head of fully
connected layers removed. The test dataset contains
image categories that are not present in the training or
validation datasets. Additionally, the categories in the
training and validation datasets are distinct from each
other as well. The size of the feature vector depends
on the DL model used.
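As an illustration of Step 2, the following sketch removes the classification head from a torchvision ConvNeXt-Base backbone and flattens the pooled output into a 1024-dimensional embedding. This is a minimal example of the feature-extraction step and is not claimed to match the authors' code.

```python
import torch
import torch.nn as nn
from torchvision import models

# Load a ConvNeXt-Base backbone pre-trained on ImageNet and drop its head,
# keeping the convolutional stages and the global average pooling.
weights = models.ConvNeXt_Base_Weights.IMAGENET1K_V1
model = models.convnext_base(weights=weights)
backbone = nn.Sequential(model.features, model.avgpool, nn.Flatten(1))
backbone.eval()

# A single 224x224 RGB image (random here) yields a 1024-dimensional embedding.
with torch.no_grad():
    x = torch.randn(1, 3, 224, 224)
    embedding = backbone(x)
print(embedding.shape)  # torch.Size([1, 1024])
```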
Step 3. involves clustering of feature embeddings
extracted with the DL model. Clustering analysis in-
volves grouping a set of objects such that those within
the same cluster are more similar to each other than
to objects in other clusters. In our study, the objects
being clustered are unlabeled images, which could in-
clude images of various unlabeled categories. In eco-
logical studies, this could involve species such as an-
imals, birds, or insects for which labels are sparse or
entirely absent. The primary challenge is to learn ef-
fective feature embeddings of training images that can
serve as input to a clustering algorithm.
For this purpose, we utilize K-means (Macqueen,
1967) and Spectral clustering (Von Luxburg, 2007),
assuming the number of clusters is known. These two
clustering methods were selected based on a compar-
ative analysis of nine common clustering methods by
Rodriguez et al. (Rodriguez et al., 2019). K-means is
a well-known and widely used method, while Spec-
tral clustering demonstrated superior performance in
that comparison, achieving an Adjusted Rand Index of
68.16%, the best result among the nine clustering
algorithms evaluated.
Unlike K-means, which works well for convex
clusters, spectral clustering can capture more com-
plex, non-linear relationships in the data. Spectral
clustering is a graph-based approach and leverages the
spectral (eigenvalue) properties of a similarity graph
constructed from the data to identify clusters, rather
than relying on traditional distance-based metrics like
K-means. It effectively detects clusters of arbitrary
shapes and sizes, particularly in cases where data
points are not well-separated by traditional distance
metrics.
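For Step 3, the following is a minimal sketch of clustering the extracted embeddings with scikit-learn's K-means and Spectral clustering, assuming the number of clusters is known; the random data and the similarity-graph parameters (e.g., n_neighbors) are illustrative assumptions, not values from the paper.

```python
import numpy as np
from sklearn.cluster import KMeans, SpectralClustering
from sklearn.metrics import adjusted_rand_score

# Embeddings extracted by the fine-tuned backbone and their true labels
# (random data here only stands in for real feature vectors).
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(550, 1024))
true_labels = np.repeat(np.arange(11), 50)
n_clusters = 11

kmeans_labels = KMeans(n_clusters=n_clusters, n_init=10,
                       random_state=0).fit_predict(embeddings)

# Spectral clustering builds a nearest-neighbor similarity graph and
# clusters its spectral embedding, capturing non-convex structure.
spectral_labels = SpectralClustering(
    n_clusters=n_clusters, affinity="nearest_neighbors",
    n_neighbors=10, assign_labels="kmeans", random_state=0,
).fit_predict(embeddings)

print("K-means ARI: ", adjusted_rand_score(true_labels, kmeans_labels))
print("Spectral ARI:", adjusted_rand_score(true_labels, spectral_labels))
```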
While K-means and Spectral clustering are cen-
tral to our demonstration, other clustering meth-
ods could also be employed. These include Fast
Density-Based Clustering (DBSCAN) (Ester et al.,
1996), Expectation-Maximization (EM) (Dempster
et al., 1977), or hierarchical methods such as ag-
glomerative and divisive clustering, as explored by
Jain et al. (Jain, 2010). For unsupervised cluster-
ing where the number of clusters is unknown, meth-
ods like Ordering Points To Identify the Clustering
Structure (OPTICS) (Ankerst et al., 1999) and HDB-
SCAN (McInnes et al., 2017) could be utilized.
A solution to clustering of images not included
in training would benefit many real-life applications
where the number of samples for each class is sparse
or labeled data was non-existent during training.
3.2 Episodic Training with Multivariate
Scatter Loss
In few-shot learning, episodic training with domain
generalization (DG) (Li et al., 2019), also known
as meta-learning (Hospedales et al., 2022), involves
training with a set of tasks during each epoch. Each
task comprises several episodes, with each episode
containing a labeled support set and a query set drawn
from the training dataset. After each epoch, the model
accuracy is evaluated using the validation dataset,
which contains tasks with class categories different
from those in the training dataset, thereby assessing
the model's ability to generalize across domains.
A Prototypical Network uses Euclidean distance
as a similarity function to predict the relationship
between query samples and class labels in the sup-
port set. Experiments have demonstrated that training
with Euclidean distance, rather than cosine similar-
ity, yields superior results (Li et al., 2019). In our
approach, we use the cross-entropy loss together with
a new multivariate scatter loss function during train-
ing. This new loss function is designed to minimize
within-class variance while simultaneously maximiz-
ing the mean separation between classes, enhancing
the model’s discriminative ability. The method is in-
spired by the work of Bodesheim et al. (Bodesheim
et al., 2013) and Bjerge et al. (Bjerge et al., 2024).
For an illustration of class distribution after training
with multivariate scatter loss, see Figure 2. The mul-
tivariate scatter loss is defined as
$$L_m(\theta) = \frac{\sum_{k=1}^{K} \sum_{n=1}^{N_k} (X_n - \bar{X}_k)^T (X_n - \bar{X}_k)}{\sum_{i=1}^{K} \sum_{j=i+1}^{K} (\bar{X}_i - \bar{X}_j)^T (\bar{X}_i - \bar{X}_j)} \quad (1)$$
with $X_n$ being the embedding vector for each sample
in the support set $S_k$ of class $k$. $N_k$ is the number of
samples in each class of the support set and $\theta$ denotes the
parameters of the DL model. With $\bar{X}_k$, $\bar{X}_i$, and $\bar{X}_j$ we
denote the mean center point of samples in the classes
k, i, and j of the support set. The numerator of Eq. (1)
encourages the norm of the data points, centred with
respect to their cluster, to be small, and the denomi-
nator encourages the cluster centres to be far apart.
The cross-entropy loss function in Eq. (2) ensures
that a query sample q is classified correctly according
to the support set during training:
$$L_{c_q}(\theta) = -\hat{y}_j \log\!\left(\frac{\exp(-d_{q,k_j})}{\sum_{i=1}^{K} \exp(-d_{q,k_i})}\right) \quad (2)$$
Here, $d_{q,k_j}$ is the Euclidean distance between the sup-
port center point $k_j$ and the query sample $q$, and $\hat{y}_j$ is
the one-hot encoded vector for the correct label of the
query sample. K is the number of classes (K-way)
in the support set. Finally, a combined loss function
is defined in Eq. (3) to prioritize between the cross-
entropy and scatter loss:

$$L(\theta) = \alpha L_m(\theta) + (1 - \alpha)\, L_c(\theta) \,. \quad (3)$$
The goal is to increase the distance between class dis-
tributions in the support set while increasing the num-
ber of correctly classified query samples related to the
support classes. The loss function L(θ) will prioritize
between minimizing the multivariate scatter loss (Lm)
and the cross-entropy loss for the batch of query sam-
ples (Lc) by adjusting α ∈ [0, 1].
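The sketch below is a minimal PyTorch rendering of Eq. (1)-(3) as we read them: the multivariate scatter loss computed from the support embeddings, combined with the prototypical cross-entropy over softmaxed negative squared Euclidean distances. Function names are ours and the code is an illustration, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def multivariate_scatter_loss(support, support_y, k_way):
    """Eq. (1): within-class scatter of the support embeddings divided by
    the summed pairwise scatter between the class mean vectors."""
    means = torch.stack([support[support_y == k].mean(0) for k in range(k_way)])
    within = sum(((support[support_y == k] - means[k]) ** 2).sum()
                 for k in range(k_way))
    between = sum(((means[i] - means[j]) ** 2).sum()
                  for i in range(k_way) for j in range(i + 1, k_way))
    return within / between

def combined_loss(support, support_y, query, query_y, k_way, alpha=0.1):
    """Eq. (3): alpha * L_m + (1 - alpha) * L_c, where L_c is the
    cross-entropy over softmaxed negative squared distances (Eq. (2))."""
    means = torch.stack([support[support_y == k].mean(0) for k in range(k_way)])
    logits = -torch.cdist(query, means) ** 2   # negative squared Euclidean distances
    l_c = F.cross_entropy(logits, query_y)     # Eq. (2), averaged over the query batch
    l_m = multivariate_scatter_loss(support, support_y, k_way)
    return alpha * l_m + (1.0 - alpha) * l_c
```

During episodic training, gradients of this combined loss flow back through the backbone that produced the support and query embeddings, which is what disentangles the class distributions in the similarity space.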
3.3 Fine-Tuning with Episodic Training
A DL model was used as a backbone to extract fea-
ture embeddings. The outputs from the backbone
of the DL model were flattened and used as embed-
dings, leading to D-dimensional feature vectors. In
our work, we have trained larger models than those
typically used in FSL. The models were fine-tuned using four DL
architectures, generating feature vectors with the fol-
lowing dimensions: 2048 (ResNet50v2), 1536 (Ef-
ficientNetB3), 1024 (ConvNeXt-B), and 768 (ViT-
B/16). For fine-tuning, we used classical pre-trained
weights from the ImageNet dataset (Russakovsky
et al., 2015). These pre-trained DL models were then
fine-tuned with episodic training on a new domain-
specific dataset. During all episodic training sessions,
a 5-shot K-way support set was utilized. The value
of K ranged between 15 and 30, constrained by the
DL model and a maximum of 50 GB of GPU memory.
Training on the datasets was performed using
data augmentation, including image scaling, horizontal
flipping, and color jitter for brightness, contrast,
and saturation.
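The augmentation described above could be expressed roughly as the following torchvision pipeline; the crop scale and jitter strengths are assumptions, since the exact values are not stated in the paper.

```python
from torchvision import transforms

# Training-time augmentation: random resized crop (image scaling),
# horizontal flip, and color jitter for brightness, contrast and saturation.
train_transform = transforms.Compose([
    transforms.RandomResizedCrop(224, scale=(0.8, 1.0)),
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.2),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])
```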
The stochastic gradient descent (SGD) optimizer
was used during episodic training and fine-tuning.
SGD was configured with a momentum of 0.9
and a weight decay of $5.0 \cdot 10^{-4}$, using a multi-step
scheduler to lower the learning rate at two
milestones specified by epochs. The first milestone
was set to 3 epochs and the second to 6 epochs, with
a total of 9 epochs. The limited number of epochs
is a result of episodic training, with each epoch com-
prising 300 few-shot tasks, each consisting of 6 query
samples per class from the support set. SGD was
used with an initial learning rate of $1.0 \cdot 10^{-3}$ for the
pre-trained models during fine-tuning.
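In PyTorch, the optimizer and scheduler configuration described above corresponds roughly to the sketch below; the backbone is a placeholder module and the decay factor gamma is an assumption, since it is not stated.

```python
import torch

backbone = torch.nn.Linear(1024, 1024)  # placeholder for the fine-tuned DL model

# SGD with momentum 0.9, weight decay 5e-4 and initial learning rate 1e-3,
# reduced at epoch milestones 3 and 6 over a total of 9 epochs.
optimizer = torch.optim.SGD(backbone.parameters(), lr=1.0e-3,
                            momentum=0.9, weight_decay=5.0e-4)
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer,
                                                 milestones=[3, 6], gamma=0.1)

for epoch in range(9):
    # Placeholder for the 300 few-shot tasks per epoch; a real loop would
    # compute the combined loss of Eq. (3) and back-propagate it here.
    optimizer.zero_grad()
    loss = backbone(torch.randn(8, 1024)).pow(2).mean()
    loss.backward()
    optimizer.step()
    scheduler.step()
```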
K-means clustering was performed on all images
in the validation dataset after each epoch to select the
best-performing model. The clustering was applied to
the validation datasets described in Section 4.1, which
included 20, 40, 50, or 97 classes, and the cluster
accuracy (CA) obtained after each epoch was used to
select the optimal model.
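A minimal sketch of this per-epoch model selection: cluster the validation embeddings with K-means and score the result with cluster accuracy (Eq. (4) in Section 3.4), keeping the checkpoint with the highest score. The helper names are ours.

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_accuracy(cluster_ids, labels):
    """Eq. (4): fraction of images carrying the majority label of their cluster."""
    correct = sum(np.bincount(labels[cluster_ids == c]).max()
                  for c in np.unique(cluster_ids))
    return correct / len(labels)

def validate(embeddings, labels, n_clusters):
    """Cluster validation embeddings with K-means and return the CA score
    used to pick the best checkpoint after each epoch."""
    cluster_ids = KMeans(n_clusters=n_clusters, n_init=10,
                         random_state=0).fit_predict(embeddings)
    return cluster_accuracy(cluster_ids, labels)

# Example with random stand-in embeddings and integer ground-truth labels
rng = np.random.default_rng(0)
emb = rng.normal(size=(200, 64))
lab = np.repeat(np.arange(20), 10)
print(validate(emb, lab, n_clusters=20))
```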
3.4 Performance Metrics for Clustering
In this section, we briefly outline the criteria used
for performance evaluation based on commonly used
clustering validation indices (Wu et al., 2019; Huang
et al., 2024; Lu et al., 2024). These criteria are anal-
ogous to accuracy or recognition rate in supervised
learning. Four key evaluation metrics are used:
Cluster Accuracy (CA), Adjusted Rand Index
(ARI) (Hubert and Arabie, 1985), Normalized Mutual
Information (NMI) (Strehl and Ghosh, 2003) and Ad-
justed Mutual Information (AMI) (Vinh et al., 2009).
When the true class labels of a dataset are known,
these metrics allow us to assess how accurately a clus-
tering technique partitions the data relative to the cor-
rect class labels.
CA. measures the percentage of correctly classified
images in the clustering solution compared to pre-
defined image class labels. We use CA as defined
by (Fahad et al., 2014):

$$CA = \frac{\sum_{i=1}^{K} \max(C_i \,|\, L_i)}{|\Omega|} \quad (4)$$

where $\Omega$ is the complete set of clustered images, $C_i$
is the set of instances in the $i$th cluster, $L_i$ is the set of
class labels for all images in the $i$th cluster, and
$\max(C_i \,|\, L_i)$ is the number of instances with the major-
ity label in the $i$th cluster (e.g., if label $l$ appears in
the $i$th cluster more often than any other label, then
$\max(C_i \,|\, L_i)$ is the number of instances in $C_i$ with the
label $l$).
ARI. is a measure used to evaluate the similarity
between two clustering results, taking into account
the possibility of chance. It is an improvement over
the basic Rand Index, which measures the proportion
of agreement between two clusterings (i.e., the pro-
portion of points that are either clustered together or
clustered apart in both clusterings). ARI takes into
account the number of instances that exist in the same
cluster and different clusters.
NMI. is a normalized version of Mutual Informa-
tion (MI) using the upper bound to score results be-
tween 0 (no mutual information) and 1 (perfect corre-
lation).
AMI. is an adjustment of the MI score to account
for chance. It accounts for the fact that the Mutual In-
formation is generally higher for two clusterings with
a larger number of clusters, regardless of whether
there is actually more information shared.
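ARI, NMI, and AMI are available directly in scikit-learn; a short sketch with toy labels is shown below (CA can be computed as in the Section 3.3 sketch).

```python
from sklearn.metrics import (adjusted_rand_score,
                             normalized_mutual_info_score,
                             adjusted_mutual_info_score)

true_labels = [0, 0, 1, 1, 2, 2]   # ground-truth classes
cluster_ids = [1, 1, 0, 0, 2, 0]   # cluster assignments from K-means/Spectral

print("ARI:", adjusted_rand_score(true_labels, cluster_ids))
print("NMI:", normalized_mutual_info_score(true_labels, cluster_ids))
print("AMI:", adjusted_mutual_info_score(true_labels, cluster_ids))
```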
4 EXPERIMENTAL SETUP
The proposed method undergoes training and evalu-
ation using four distinct datasets for FSL that are de-
scribed in Sec. 4.1. Classic training and fine-tuning of
models have also been conducted to compare perfor-
mance with episodic training. The best model was
selected based on cluster accuracy using K-means,
evaluated after each epoch for both training methods.
The metrics outlined in the tables of Sec. 5 include
CA, NMI, AMI, and ARI for clustering of features
with classic and episodic training. Metrics in all ta-
bles are computed across 5 random runs of clustering
feature vectors from fine-tuned DL models, with av-
erage (AVG) and standard deviations (SD).
4.1 Datasets
We have selected four datasets with images typically
used for FSL; here, we use the validation and testing
datasets for clustering. These datasets were cho-
sen over commonly used clustering datasets due to
their relatively high resolution and the clear separa-
tion into distinct classes for training, validation, and
testing, as demonstrated in previous studies (Ravi
and Larochelle, 2017; Wang et al., 2019; Bjerge
et al., 2024). Unlike FSL datasets, standard clustering
datasets typically do not provide such distinct class
separations.
Mini-ImageNet is a benchmark dataset and is a
subset of the larger ILSVRC-12 dataset (Russakovsky
et al., 2015). It has a total of 60,000 color images from
100 classes, where each class has 600 images of size
224x224 (Vinyals et al., 2016). In alignment with es-
tablished practices (Ravi and Larochelle, 2017; Wang
et al., 2019), we adopt a partitioning scheme consist-
ing of 60 training classes, complemented by 20 vali-
dation classes and 20 test classes.

Table 1: Metrics (CA, NMI, AMI, ARI) for K-means and Spectral clustering of features from ConvNeXt-B models fine-tuned with classic and episodic training on the four datasets, evaluated by clustering the test datasets. A best α value above zero indicates that the multivariate scatter loss improved episodic training in finding the best model. The best metric results for each clustering method and dataset are highlighted in bold.

Dataset | Cluster method | Training | Best α | CA AVG (SD) | NMI AVG (SD) | AMI AVG (SD) | ARI AVG (SD)
EU Moths | K-means | Classic | - | 0.773 (0.022) | 0.872 (0.009) | 0.785 (0.014) | 0.653 (0.026)
EU Moths | K-means | Episodic | 0.1 | 0.837 (0.029) | 0.912 (0.005) | 0.851 (0.008) | 0.748 (0.009)
EU Moths | Spectral | Classic | - | 0.874 (0.009) | 0.922 (0.003) | 0.868 (0.006) | 0.787 (0.009)
EU Moths | Spectral | Episodic | 0.1 | 0.940 (0.011) | 0.962 (0.004) | 0.935 (0.007) | 0.897 (0.012)
CUB | K-means | Classic | - | 0.844 (0.019) | 0.885 (0.010) | 0.864 (0.011) | 0.759 (0.026)
CUB | K-means | Episodic | 0.1 | 0.834 (0.025) | 0.885 (0.005) | 0.864 (0.006) | 0.745 (0.038)
CUB | Spectral | Classic | - | 0.936 (0.000) | 0.932 (0.000) | 0.919 (0.000) | 0.881 (0.000)
CUB | Spectral | Episodic | 0.0 | 0.928 (0.002) | 0.930 (0.002) | 0.917 (0.002) | 0.869 (0.004)
mini-ImageNet | K-means | Classic | - | 0.895 (0.005) | 0.942 (0.001) | 0.941 (0.001) | 0.853 (0.005)
mini-ImageNet | K-means | Episodic | 0.5 | 0.900 (0.010) | 0.942 (0.001) | 0.941 (0.001) | 0.856 (0.009)
mini-ImageNet | Spectral | Classic | - | 0.903 (0.003) | 0.941 (0.001) | 0.941 (0.001) | 0.832 (0.006)
mini-ImageNet | Spectral | Episodic | - | 0.893 (0.002) | 0.937 (0.000) | 0.936 (0.000) | 0.813 (0.001)
tiered-ImageNet | K-means | Classic | - | 0.975 (0.024) | 0.984 (0.007) | 0.984 (0.007) | 0.965 (0.028)
tiered-ImageNet | K-means | Episodic | 0.5 | 0.984 (0.021) | 0.986 (0.007) | 0.986 (0.007) | 0.977 (0.023)
tiered-ImageNet | Spectral | Classic | - | 0.978 (0.018) | 0.982 (0.007) | 0.982 (0.007) | 0.966 (0.023)
tiered-ImageNet | Spectral | Episodic | 0.2 | 0.982 (0.019) | 0.984 (0.007) | 0.984 (0.007) | 0.971 (0.024)
Tiered-ImageNet (Ren et al., 2018) is a larger sub-
set of the ILSVRC-12 dataset (Russakovsky
et al., 2015). It contains 608 classes with 779,165
images partitioned into disjoint sets for training (351
classes), validation and testing. Images for evaluation
are split into 97 classes for validation and 160 classes
for testing.
Caltech Birds-200-2011 (CUB) (Wah et al., 2011)
is a fine-grained image classification dataset. We
adopt the split of Chen et al. (Chen et al., 2019) for few-shot
classification on CUB, with 120 training, 40
validation, and 40 test classes and a total of 11,788
images for the experiments. To maintain consistency
and comparability with mini-ImageNet, the images
from CUB are uniformly resized to 224x224 pixels.
The EU Moths dataset, as introduced by Böhlke
et al. (Böhlke et al., 2021), encapsulates a collec-
tion of 200 moth species prevalent in Central Europe.
Notably, each species is delineated by few samples,
comprising a mere 11 images, resulting in a total of
2,205 images within the dataset. This dataset was
chosen for our evaluation of the multivariate scatter
loss because it is the most challenging for clus-
tering, with only a few samples per class and species
with highly similar appearances. The images have
high resolution, even surpassing 1000x1000 pixels,
and have been resized to 224x224 pixels. The dataset
is split into 100 training, 50 validation, and 50 test
classes for experiments.
5 RESULTS AND DISCUSSION
Several experiments were performed on the datasets
to evaluate clustering and episodic training with the
multivariate scatter loss function. Here, results were
obtained with pre-trained models on ImageNet and
fine-tuning with classic and episodic training on
the four datasets. Since mini-ImageNet and tiered-
ImageNet are subsets of ImageNet, we only expect to
see minor improvements when fine-tuning the models
with classic or episodic training.
Detailed results, as shown in Table 1, for
training models with ResNet50v2, EfficientNetB3,
ConvNeXt-B and ViT-B/16 can be found on GitHub
(https://github.com/kimbjerge/few-shot-clustering)
together with the source code for the experimental results.
5.1 Evaluation of the Multivariate
Scatter Loss
Spectral clustering of features from fine-tuned mod-
els, using both classic and episodic training on the EU
moth dataset, is shown in Figure 3. The scores repre-
sent the average of five runs. It is important to em-
phasize that the optimal model, along with the corre-
sponding α values, was determined through episodic
fine-tuning using the validation sets.

Figure 3: ARI and AMI metrics with Spectral clustering of features from models fine-tuned on the EU moths dataset with different α values: (a) ResNet50V2, (b) EfficientNetB3, (c) ConvNeXt-B, and (d) ViT-B/16. Each green (AMI) and blue (ARI) dot represents a clustering result out of five random runs for each metric. The red (AMI) and yellow (ARI) circles at α = 0 are the metrics for the classically trained models.
All four models achieved higher AMI and ARI
scores with episodic training compared to fine-
tuning with classic training. The highest relative in-
crease in performance with episodic training is ob-
served for ResNet50v2 and EfficientNetB3. However,
ConvNeXt-B and ViT-B/16 achieved the highest AMI
and ARI scores overall. As the value of α increases,
both AMI and ARI scores tend to decrease. Nonethe-
less, the highest ARI scores were observed with non-
zero α values at: α = 0.5 (ResNet50v2), α = 0.1 (Ef-
ficientNetB3), α = 0.1 (ConvNeXt-B) and α = 0.2
(ViT-B/16). This indicates that the multivariate scat-
ter loss positively impacts episodic training, leading
to improved Spectral clustering performance. How-
ever, there is only a small improvement compared
to α = 0 for all networks, which limits the impact of
the scatter loss.
In Table 1 and on GitHub, we present the
CA, NMI, AMI, and ARI metrics for both clas-
sic and episodic fine-tuning of models, with cluster-
ing performed using K-means and Spectral cluster-
ing. For K-means clustering with α = 0.1, the best
results were achieved with episodic fine-tuning of
the ResNet50v2, ConvNeXt-B, and ViT-B/16 models.
However, for EfficientNetB3, the multivariate scatter
loss did not result in performance improvement for
K-means clustering.
Fine-tuning models on the mini-ImageNet dataset
demonstrates that the multivariate scatter loss yields
a minor improvement in model performance, with the
episodic fine-tuned models achieving the best aver-
age ARI of 0.977 and AMI of 0.984 with α = 0.5
and ConvNeXt-B. In two cases, using ResNet50v2
and ViT-B/16, the highest scores were obtained with
Spectral clustering, solely by applying multivariate
scatter loss (α = 1.0).
5.2 Clustering with the EU Moths and
CUB Datasets
Figure 4 summarizes the ARI scores for Spectral clus-
tering of features from models pre-trained on Ima-
geNet and fine-tuned using either classic or episodic
training on the EU moths and CUB datasets.
As expected, the fine-tuned models outperform
the pre-trained models, as the fine-tuning process
adapts the models to datasets within the same domain,
in our case different species of moths or birds.
Notably, models fine-tuned with episodic train-
ing consistently outperform those trained with clas-
sic fine-tuning. This suggests that episodic training
enhances the models’ ability to generalize and adapt
to new domains, such as switching between differ-
ent species in the moth and bird datasets. For the
CUB dataset, the performance of ConvNeXt-B is the
same for both classic and episodic training. However,
this is not the case for the EU moths dataset, which
has fewer samples per class. The stronger perfor-
mance of episodic training on the EU moths dataset
highlights its effectiveness, particularly when work-
ing with datasets that have limited samples per class.

Figure 4: ARI scores with Spectral clustering of features from pre-trained and fine-tuned models on (a) the EU moths dataset and (b) the CUB dataset. The blue bars present clustering of features from models pre-trained on ImageNet, the green bars present clustering of features from models classically fine-tuned on the respective dataset, and the red bars present clustering of features from the best episodically fine-tuned models.

Figure 5: ARI scores with (a) Spectral clustering of features on mini-ImageNet and (b) K-means clustering of features on tiered-ImageNet, for pre-trained and fine-tuned models. The blue bars present clustering of features from models pre-trained on ImageNet, the green bars present clustering of features from models classically fine-tuned on the respective dataset, and the red bars present clustering of features from the best episodically fine-tuned models.
Table 1 presents the CA, NMI, AMI, and ARI
metrics for K-means and Spectral clustering of fea-
tures extracted from fine-tuned ConvNeXt-B models
using both classic and episodic training of the EU
moths, CUB, mini-ImageNet and tiered-ImageNet
datasets. Metrics in all tables are computed across
5 random runs of clustering feature vectors from fine-
tuned DL models, with average (AVG) and standard
deviations (SD). The best results were obtained us-
ing Spectral clustering on features from a ConvNeXt-
B model trained with episodic learning. For both
datasets, Spectral clustering outperforms K-means
clustering with an increase of 8% - 14% on all clus-
tering metrics.
On the EU moths dataset, the best model achieved
NMI=0.962, AMI=0.935, and ARI=0.897 with mul-
tivariate scatter loss (α = 0.1). For the CUB
dataset, ConvNeXt-B model reached NMI=0.930,
AMI=0.917, and ARI=0.869 with α = 0.0. How-
ever, classic fine-tuning yielded slightly better results,
with an ARI of 0.881. All scores are reported as the
average of 5 runs. The CUB dataset contains more
samples compared to the EU moths dataset, suggest-
ing that the multivariate scatter loss has a greater im-
pact with episodic training on datasets with fewer data
samples per class.
5.3 Clustering with the mini-ImageNet
and tiered-ImageNet Datasets
Figure 5 summarizes the ARI scores for Spectral and
K-means clustering of features extracted from models
pre-trained on ImageNet and fine-tuned using either
classic or episodic training on the mini-ImageNet and
tiered-ImageNet datasets. As expected, both classic
and episodic fine-tuning outperform the pre-trained
models. While episodic fine-tuning generally yields
slightly better results than classic fine-tuning, Effi-
cientNetB3 is an exception, where classic fine-tuning
is the best.
The multivariate scatter loss enhances episodic
training even without the use of cross-entropy loss
(α = 1.0). Given that the models were pre-trained
on ImageNet, the multivariate scatter loss appears to
improve the distribution of features into well-defined
clusters, particularly when the model has already been
pre-trained on the same classes.
The test dataset for tiered-ImageNet contains 160
classes, and Spectral clustering requires several hours
(10-20) to process. As a result, only the performance
of ConvNeXt-B is included in the detailed results on
GitHub. Interestingly, K-means clustering outper-
forms Spectral clustering, achieving an ARI of 0.856
compared to 0.813 with Spectral clustering. This
suggests that while Spectral clustering excels with
test datasets containing 20-50 classes, K-means per-
forms better when the number of classes increases
above 150 and might be the preferred method for larger
datasets.
6 CONCLUSION
In this study, we present a novel method for clus-
tering images using Model-Agnostic Meta-Learning
within the context of few-shot learning and episodic
training with multivariate scatter loss. The proposed
method was evaluated on four commonly used few-
shot learning datasets, employing four state-of-the-art
DL models for feature extraction. ConvNeXt-B out-
performed the other networks, achieving ARI scores
of 0.897 and 0.869 on the EU moths and Caltech
Birds datasets, respectively. On the mini-ImageNet
and tiered-ImageNet datasets, episodic fine-tuned
models with multivariate scatter loss further improved
clustering performance, with ARI scores of 0.977
and 0.856 (α = 0.5). The multivariate scatter loss
consistently enhanced clustering performance during
episodic fine-tuning across most experiments, par-
ticularly on the EU moths dataset, where its effec-
tiveness was demonstrated with only 11 samples per
class. This highlights the method’s potential in han-
dling datasets with limited samples.
We explored two commonly used clustering algo-
rithms: K-means and Spectral clustering. On the EU
moths and Caltech Birds datasets, Spectral clustering
outperformed K-means, with an 8% to 14% improve-
ment across all clustering metrics. However, on the
tiered-ImageNet dataset, which contains a large num-
ber of classes (160), K-means clustering delivered the
best results. Future experiments could explore addi-
tional clustering algorithms, especially when applied
to datasets with few samples in each class.
Our solution represents an advancement in tack-
ling the challenge of image clustering, especially in
real-world scenarios where class samples are sparse.
The demonstrated accuracy and effectiveness of our
proposed method highlight its potential as a valuable
tool in settings with limited labeled data, offering
promising applications across a wide range of prac-
tical domains such as clustering of images of animal
species in biology.
ACKNOWLEDGEMENT
During the preparation of this work, the first author
utilized ChatGPT (OpenAI, 2023) to enhance the clar-
ity and formulation for parts of the written text. How-
ever, the authors take full responsibility for the con-
tent of the publication.
REFERENCES
Ankerst, M., Breunig, M. M., Kriegel, H. P., and Sander,
J. (1999). OPTICS: Ordering Points to Identify the
Clustering Structure. SIGMOD Record (ACM Special
Interest Group on Management of Data), 28(2).
Binta Islam, S., Valles, D., Hibbitts, T. J., Ryberg, W. A.,
Walkup, D. K., and Forstner, M. R. (2023). Animal
Species Recognition with Deep Convolutional Neural
Networks from Ecological Camera Trap Images. Ani-
mals, 13(9).
Bjerge, K., Bodesheim, P., and Karstoft, H. (2024). Few-
shot learning with novelty detection. In Fred, A.,
Hadjali, A., Gusikhin, O., and Sansone, C., editors,
Deep Learning Theory and Applications, pages 340–
363, Cham. Springer Nature Switzerland.
Bjerge, K., Geissmann, Q., Alison, J., Mann, H. M., Høye,
T. T., Dyrmann, M., and Karstoft, H. (2023). Hier-
archical classification of insects with multitask learn-
ing and anomaly detection. Ecological Informatics,
77:102278.
Bo, D., Wang, X., Shi, C., Zhu, M., Lu, E., and Cui, P.
(2020). Structural Deep Clustering Network. In The
Web Conference 2020 - Proceedings of the World Wide
Web Conference, WWW 2020.
Bodesheim, P., Freytag, A., Rodner, E., Kemmler, M., and
Denzler, J. (2013). Kernel null space methods for nov-
elty detection. In Proceedings of the IEEE Computer
Society Conference on Computer Vision and Pattern
Recognition.
Böhlke, J., Korsch, D., Bodesheim, P., and Denzler, J.
(2021). Exploiting Web Images for Moth Species
Classification. In Lecture Notes in Informatics (LNI),
Proceedings - Series of the Gesellschaft für Informatik
(GI), volume P-314.
Boudiaf, M., Masud, Z. I., Rony, J., Dolz, J., Piantanida,
P., and Ayed, I. B. (2020). Transductive informa-
tion maximization for few-shot learning. In Advances
in Neural Information Processing Systems, volume
2020-December.
Cai, J., Hao, J., Yang, H., Zhao, X., and Yang, Y. (2023).
A review on semi-supervised clustering. Information
Sciences, 632.
Chen, W. Y., Wang, Y. C. F., Liu, Y. C., Kira, Z., and Huang,
J. B. (2019). A closer look at few-shot classification.
In 7th International Conference on Learning Repre-
sentations, ICLR 2019.
Chopra, S., Hadsell, R., and LeCun, Y. (2005). Learn-
ing a similarity metric discriminatively, with appli-
cation to face verification. In Proceedings - 2005
IEEE Computer Society Conference on Computer Vi-
sion and Pattern Recognition, CVPR 2005, volume I.
Dempster, A. P., Laird, N. M., and Rubin, D. B. (1977).
Maximum Likelihood from Incomplete Data Via the
EM Algorithm . Journal of the Royal Statistical Soci-
ety Series B: Statistical Methodology, 39(1).
Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn,
D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer,
M., Heigold, G., Gelly, S., Uszkoreit, J., and Houlsby,
N. (2021). An image is worth 16X16 words: Trans-
formers for image recognition at scale. In ICLR 2021
- 9th International Conference on Learning Represen-
tations.
Ester, M., Kriegel, H. P., Sander, J., and Xu, X. (1996). A
Density-Based Algorithm for Discovering Clusters A
Density-Based Algorithm for Discovering Clusters in
Large Spatial Databases with Noise. In Proceedings -
2nd International Conference on Knowledge Discov-
ery and Data Mining, KDD 1996.
Fahad, A., Alshatri, N., Tari, Z., Alamri, A., Khalil, I.,
Zomaya, A. Y., Foufou, S., and Bouras, A. (2014).
A survey of clustering algorithms for big data: Tax-
onomy and empirical analysis. IEEE Transactions on
Emerging Topics in Computing, 2(3).
Finn, C., Abbeel, P., and Levine, S. (2017). Model-agnostic
meta-learning for fast adaptation of deep networks. In
34th International Conference on Machine Learning,
ICML 2017, volume 3.
He, K., Zhang, X., Ren, S., and Sun, J. (2016). Deep resid-
ual learning for image recognition. In Proceedings of
the IEEE Computer Society Conference on Computer
Vision and Pattern Recognition, volume 2016-Decem.
Hospedales, T., Antoniou, A., Micaelli, P., and Storkey, A.
(2022). Meta-Learning in Neural Networks: A Sur-
vey. IEEE Transactions on Pattern Analysis and Ma-
chine Intelligence, 44(9).
Huang, H., Wang, C., Wei, X., and Zhou, Y. (2024).
Deep image clustering: A survey. Neurocomputing,
599:128101.
Hubert, L. and Arabie, P. (1985). Comparing partitions.
Journal of Classification, 2(1).
Jain, A. K. (2010). Data clustering: 50 years beyond K-
means. Pattern Recognition Letters, 31(8).
Jiang, H., Huang, Y., and Li, Q. (2022). Spectral clus-
tering of single cells using Siamese neural network
combined with improved affinity matrix. Briefings in
Bioinformatics, 23(3).
Li, D., Zhang, J., Yang, Y., Liu, C., Song, Y. Z., and
Hospedales, T. (2019). Episodic training for domain
generalization. In Proceedings of the IEEE Interna-
tional Conference on Computer Vision, volume 2019-
October.
Liu, J., Song, L., and Qin, Y. (2020). Prototype Rectifica-
tion for Few-Shot Learning. Lecture Notes in Com-
puter Science (including subseries Lecture Notes in
Artificial Intelligence and Lecture Notes in Bioinfor-
matics), 12346 LNCS:741–756.
Lu, Y., Li, H., Li, Y., Lin, Y., and Peng, X. (2024). A survey
on deep clustering: from the prior perspective. Vici-
nagearth, 1(1):1–17.
Macqueen, J. (1967). Some methods for classification and
analysis of multivariate observations. Proceedings of the
5th Berkeley symposium on mathematical statistics
and probability, Berkeley. University of california
press, 281.
McInnes, L., Healy, J., and Astels, S. (2017). hdbscan: Hi-
erarchical density based clustering. The Journal of
Open Source Software, 2(11).
Mora, C., Tittensor, D. P., Adl, S., Simpson, A. G., and
Worm, B. (2011). How many species are there on
earth and in the ocean? PLoS Biology, 9(8).
OpenAI (2023). ChatGPT 3.5.
Qin, Y., Ding, S., Wang, L., and Wang, Y. (2019). Research
Progress on Semi-Supervised Clustering.
Ravi, S. and Larochelle, H. (2017). Optimization as a model
for few-shot learning. In 5th International Confer-
ence on Learning Representations, ICLR 2017 - Con-
ference Track Proceedings.
Ren, M., Triantafillou, E., Ravi, S., Snell, J., Swersky, K.,
Tenenbaum, J. B., Larochelle, H., and Zemel, R. S.
(2018). Meta-learning for semi-supervised few-shot
classification. In 6th International Conference on
Learning Representations, ICLR 2018 - Conference
Track Proceedings.
Ren, Y., Hu, K., Dai, X., Pan, L., Hoi, S. C., and Xu, Z.
(2019). Semi-supervised deep embedded clustering.
Neurocomputing, 325.
Rodriguez, M. Z., Comin, C. H., Casanova, D., Bruno,
O. M., Amancio, D. R., Costa, L. d. F., and Rodrigues,
F. A. (2019). Clustering algorithms: A comparative
approach. PLoS ONE, 14(1).
Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S.,
Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bern-
stein, M., Berg, A. C., and Fei-Fei, L. (2015). Ima-
geNet Large Scale Visual Recognition Challenge. In-
ternational Journal of Computer Vision, 115(3).
Snell, J., Swersky, K., and Zemel, R. (2017). Prototypical
networks for few-shot learning. In Advances in Neu-
ral Information Processing Systems, volume 2017-
December.
Song, Y., Wang, T., Cai, P., Mondal, S. K., and Sahoo, J. P.
(2023). A Comprehensive Survey of Few-shot Learn-
ing: Evolution, Applications, Challenges, and Oppor-
tunities. ACM Computing Surveys, 55(13s).
Strehl, A. and Ghosh, J. (2003). Cluster ensembles - A
knowledge reuse framework for combining multiple
partitions. In Journal of Machine Learning Research,
volume 3.
Sun, B., Zhou, P., Du, L., and Li, X. (2022). Active deep
image clustering. Knowledge-Based Systems, 252.
Tan, M. and Le, Q. V. (2019). EfficientNet: Rethinking
model scaling for convolutional neural networks. In
36th International Conference on Machine Learning,
ICML 2019, volume 97, pages 6105–6114.
Todi, A., Narula, N., Sharma, M., and Gupta, U. (2023).
ConvNext: A Contemporary Architecture for Convo-
lutional Neural Networks for Image Classification. In
Proceedings - 2023 3rd International Conference on
Innovative Sustainable Computational Technologies,
CISCT 2023.
Vilhagra, L. A., Fernandes, E. R., and Nogueira, B. M.
(2020). TextCSN: A semi-supervised approach for
text clustering using pairwise constraints and convo-
lutional siamese network. In Proceedings of the ACM
Symposium on Applied Computing.
Vinh, N. X., Epps, J., and Bailey, J. (2009). Information the-
oretic measures for clusterings comparison: Is a cor-
rection for chance necessary? In Proceedings of the
26th International Conference On Machine Learning,
ICML 2009.
Vinyals, O., Blundell, C., Lillicrap, T., Kavukcuoglu, K.,
and Wierstra, D. (2016). Matching networks for one
shot learning. In Advances in Neural Information Pro-
cessing Systems.
Von Luxburg, U. (2007). A tutorial on spectral clustering.
Statistics and Computing, 17(4).
Wah, C., Branson, S., Welinder, P., Perona, P., and Be-
longie, S. (2011). Caltech-UCSD Birds-200-2011
(CUB-200-2011). California Institute of Technology,
CNS-TR-2011-001.
Wang, Y., Chao, W.-L., Weinberger, K. Q., and van der
Maaten, L. (2019). SimpleShot: Revisiting
Nearest-Neighbor Classification for Few-Shot Learn-
ing. arXiv.
Wang, Y., Yao, Q., Kwok, J. T., and Ni, L. M. (2020). Gen-
eralizing from a Few Examples: A Survey on Few-
shot Learning. ACM Computing Surveys, 53(3).
Wu, J., Long, K., Wang, F., Qian, C., Li, C., Lin, Z., and
Zha, H. (2019). Deep comprehensive correlation min-
ing for image clustering. In Proceedings of the IEEE
International Conference on Computer Vision, vol-
ume 2019-October.
Xie, J., Girshick, R., and Farhadi, A. (2016). Unsupervised
deep embedding for clustering analysis. In 33rd In-
ternational Conference on Machine Learning, ICML
2016, volume 1.
Yang, J., Parikh, D., and Batra, D. (2016). Joint unsuper-
vised learning of deep representations and image clus-
ters. In Proceedings of the IEEE Computer Society
Conference on Computer Vision and Pattern Recogni-
tion, volume 2016-December.