CLIP-MDGAN: Multi-Discriminator GAN Using CLIP Task Allocation
Shonosuke Gonda, Fumihiko Sakaue and Jun Sato
Nagoya Institute of Technology, Nagoya 466-8555, Japan
gonda@cv.nitech.ac.jp, {sakaue, junsato}@nitech.ac.jp
Keywords:
Image Synthesis, Image Distribution, GAN, Multi-Discriminator, CLIP, Foundation Model, Multimodal.
Abstract:
In a Generative Adversarial Network (GAN), in which the generator and discriminator learn adversarially, the
performance of the generator can be improved by improving the discriminator’s discriminatory ability. Thus,
in this paper, we propose a method to improve the generator’s generative ability by adversarially training a
single generator with multiple discriminators, each with different expertise. Because each discriminator has
its own expertise, their combined discriminatory ability improves, which in turn improves the
generator’s performance. However, it is not easy to give multiple discriminators independent expertise. To
address this, we propose CLIP-MDGAN, which leverages CLIP, a large-scale learning model that has recently
attracted a lot of attention, to classify a dataset into multiple classes with different visual features. Based on
CLIP-based classification, each discriminator is assigned a specific subset of images to promote the devel-
opment of independent expertise. Furthermore, we introduce a method to gradually increase the number of
discriminators in adversarial training to reduce instability in training multiple discriminators and reduce train-
ing costs.
1 INTRODUCTION
In recent years, advances in research in VAE (Vari-
ational Auto Encoder) (Kingma and Welling, 2014),
GAN (Generative Adversarial Network) (Goodfel-
low et al., 2014), and diffusion models (Ramesh
et al., 2021; Rombach et al., 2022) have made it
possible to reproduce the image distribution within
datasets accurately and to generate high-quality im-
ages. Among these approaches, GANs have demon-
strated applications not only in high-fidelity image
generation (Brock et al., 2019) (Karras et al., 2020)
but also in diverse tasks such as domain transla-
tion (Karras et al., 2020), super-resolution (Ledig
et al., 2017), text-to-image generation (Zhang et al.,
2017), and anomaly detection (Zhang et al., 2017).
The GAN mechanism, in which multiple networks improve their capabilities
by competing with one another, is very
important in advancing learning methodologies.
In GANs, training typically progresses by pairing
one Generator with one Discriminator, allowing them
to learn in competition. As the Discriminator’s abil-
ity to distinguish improves, so does the Generator’s
ability to produce realistic images. If we think of the Generator as a student and the Discriminator as a supervisor, a student can acquire a broader and higher level of ability when taught by multiple supervisors, each with different specialized expertise, rather than by a single supervisor. Thus, in this paper we propose a method to enhance the Generator’s capabilities by adversarially training it with multiple Discriminators, each with different expertise, as shown in Figure 1.
Figure 1: A method for automatically selecting the Discriminator corresponding to each training image based on image feature distribution.
Several methods have previously been proposed to
enhance Generator performance using multiple Dis-
criminators (Dinh Nguyen et al., 2017; Durugkar
et al., 2017; Choi and Han, 2022). However, task
allocation to multiple Discriminators is not a trivial
problem and it often suffers from instability. Thus, in
this paper, we propose a method for allocating tasks to
multiple Discriminators by using CLIP (Contrastive
Language-Image Pre-training) (Radford et al., 2021),
a large-scale model pretrained on the relationship between images and texts using a large amount of image
and text data.
In recent years, the advent of large-scale mod-
els pretrained on vast datasets has enabled zero-shot
performance across various downstream tasks by uti-
lizing well-trained Encoders within these large-scale
models. In light of this, we propose an efficient and
stable method for adversarial training between a Gen-
erator and multiple highly specialized Discriminators
by leveraging the Encoder of CLIP to streamline the
specialization of multiple Discriminators. We call it
CLIP-MDGAN.
As illustrated in Figure 1, our CLIP-MDGAN se-
lects the appropriate Discriminator for each image
based on its image features. By strategically employ-
ing CLIP’s Text Encoder and Image Encoder, we can
cluster image groups effectively for each Discrimina-
tor. Since CLIP Encoders are trained to capture cor-
relations between images and text, the feature vec-
tors from its Image Encoder serve as valuable sym-
bolic descriptors, facilitating meaningful clustering of
large datasets. Hence, by clustering images into mul-
tiple classes using these feature vectors, we assign
each Discriminator a specific classification task. This
assignment allows each Discriminator to focus on
identifying images with distinct characteristics, fos-
tering high specialization among multiple Discrim-
inators and enhancing the Generator’s performance
through adversarial training. Our approach avoids the
instability seen in the existing methods during Dis-
criminator selection, while enabling each Discrimina-
tor to develop high specialization efficiently.
This paper presents both a semi-automatic and a
fully automatic approach for assigning tasks to mul-
tiple Discriminators using CLIP. Additionally, to en-
sure fast and stable task allocation to multiple Dis-
criminators, we introduce a method that gradually in-
creases the number of Discriminators in stages.
While this study focuses on enhancing a sin-
gle Generator with multiple Discriminators, this ap-
proach can be extended to frameworks involving mul-
tiple Generators (Ghosh et al., 2018; Hoang et al.,
2018), potentially enabling multi-Generator systems
to benefit from adversarial training with specialized
Discriminators.
2 RELATED WORK
Several methods have previously been proposed
to enhance Generator performance through the
use of multiple Discriminators.
Figure 2: Overview of CLIP-MDGAN.
For example, D2GAN (Dinh Nguyen et al., 2017) demonstrated
that employing two Discriminators with opposing
loss functions could help avoid mode collapse in
GANs. Albuquerque (Albuquerque et al., 2019) intro-
duced multi-objective discriminators, each focused on
distinct tasks, thereby enhancing the Generator’s per-
formance. GMAN (Durugkar et al., 2017) proposed
a balanced approach by combining a hard Discrimi-
nator (strict teacher) and a soft Discriminator (lenient
teacher) to regulate the Generator’s learning.
There have also been attempts to instill specialized
skills within multiple Discriminators during train-
ing. MCL-GAN (Choi and Han, 2022) utilizes Multi-
ple Choice Learning (MCL) (Guzman-Rivera et al.,
2012) to select a limited number of Discriminators
from a larger pool based on each dataset image, facil-
itating adversarial learning with the Generator. How-
ever, this method requires multiple Discriminators for
the classification of a single data instance, resulting in
overlapping roles among these Discriminators. Addi-
tionally, during the early stages of training, task allo-
cation to each Discriminator can be unstable, making
it challenging to maintain operational stability.
Thus, we in this paper propose a method for al-
locating tasks to multiple Discriminators by using
CLIP (Radford et al., 2021), a large-scale model pre-
trained on the relationship between images and texts.
By using the well-trained text and image encoders of
CLIP, our method can avoid instability during Dis-
criminator selection, while enabling each Discrimina-
tor to develop specialization efficiently.
3 TRAINING GENERATORS
WITH MULTIPLE
DISCRIMINATORS
The overall view of our CLIP-MDGAN is as shown
in Figure 2.
Figure 3: Hand-made class assignment for multiple discriminators.
First, we divide Dataset X into N classes X_i (i = 1, ..., N) by using CLIP Encoders. Next, the CLIP-based Discriminator Selector identifies which of these N classes the image G(z) generated by the Generator from noise z belongs to. G(z) is then passed to the Discriminator D_i selected by the Discriminator Selector, and D_i distinguishes between the dataset X_i of Class i and G(z).
In this way, each Discriminator is responsible for identifying a specific class, which improves the overall discriminatory ability of the Discriminators, and the Generator’s image generation ability is further improved by adversarial training with these Discriminators.
4 CLASS ASSIGNMENT FOR
MULTIPLE DISCRIMINATORS
When using multiple specialized Discriminators to
determine the authenticity of images generated by a
Generator, it is essential to assign each Discrimina-
tor a specific image class (or domain) to identify. In
this research, we propose two distinct methods for as-
signing image classes to each Discriminator. The fol-
lowing sections will provide a detailed explanation of
these two methods.
4.1 Class Assignment Based on
Hand-Defined Classes
First, we describe a method for assigning classes
to multiple Discriminators based on hand-defined
classes. In image generation tasks, it is sometimes
possible to roughly categorize the images we aim to
generate based on human senses. In such cases, the
method described in this section is useful.
Consider, for example, the task of generating face
images. In this scenario, we can divide the face im-
ages into some classes based on human senses. For
instance, we might set up four hand-defined classes:
male children, female children, adult men, and adult
women. Then, for assigning each class image to an
individual Discriminator, we need to divide the image
dataset into these multiple classes. In this research,
we use CLIP (Radford et al., 2021), a model trained
on a large dataset of text and images, to obtain the
relationship between each class label and its corre-
sponding set of images in a zero-shot manner.
Specifically, as shown in the upper part of Figure 3, we use CLIP Text Encoder to compute the text features T_i for the hand-defined N classes, Class i (i = 1, ..., N). Next, we select an image x_j (j = 1, ..., M) from the image dataset X that comprises the images we aim to generate and calculate its image feature I_j using CLIP Image Encoder. Then, by selecting the text feature T_i that best matches I_j based on cosine similarity, we assign image x_j to Class i. By repeating this process for all images in dataset X, we partition X into N class-specific subsets X_i (i = 1, ..., N), as shown in the lower part of Figure 3.
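As a concrete illustration, the following sketch shows how such a zero-shot class assignment could be implemented with the OpenAI CLIP package. The class prompts, the model variant ("ViT-B/32"), and the file handling are illustrative assumptions; the paper does not prescribe a particular implementation.

```python
# Sketch: assigning each dataset image to one of N hand-defined classes
# using CLIP text/image features and cosine similarity (Section 4.1).
import torch
import clip  # https://github.com/openai/CLIP
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)  # assumed model variant

# Hand-defined class descriptions (hypothetical prompts).
class_texts = ["a male child", "a female child", "an adult man", "an adult woman"]
with torch.no_grad():
    T = model.encode_text(clip.tokenize(class_texts).to(device))
    T = T / T.norm(dim=-1, keepdim=True)          # text features T_i

def assign_class(image_path: str) -> int:
    """Return the index i of the class whose text feature best matches the image."""
    image = preprocess(Image.open(image_path)).unsqueeze(0).to(device)
    with torch.no_grad():
        I = model.encode_image(image)
        I = I / I.norm(dim=-1, keepdim=True)      # image feature I_j
    return int((I @ T.T).argmax(dim=-1))          # cosine-similarity argmax
```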
In our research, the N Discriminators are each re-
sponsible for determining the authenticity of images
belonging to one of these N classes. This approach
leverages hand-defined categories combined with au-
tomatic class assignment using CLIP, enabling each
Discriminator to specialize in judging a specific sub-
set of images, thus enhancing the Generator’s perfor-
mance across diverse classes.
4.2 Class Assignment Based on Data
Distribution
We next describe a method for assigning classes to
each Discriminator based on the distribution of image
features within the training data. This method auto-
matically determines the optimal class distinction and
allows each discriminator to specialize on a subset of
images with similar characteristics. By using data-
driven clustering, each Discriminator can learn to dis-
tinguish its assigned classes more effectively, which
improves the overall performance of the Generator.
In this method, the image dataset X is automatically divided into N datasets X_i (i = 1, ..., N) based on its distribution. First, the M images x_i (i = 1, ..., M) in the training dataset are input into CLIP Image Encoder, resulting in M image feature vectors I_i (i = 1, ..., M). Then, as an initial state for clustering, the M image feature vectors I_i are treated as M clusters C_i (i = 1, ..., M) in the image feature space, thereby forming the cluster set C as follows:

C = \{C_1, C_2, \ldots, C_M\}    (1)

Figure 4: CLIP-based Discriminator Selector.
The distance d(C_i, C_j) between clusters C_i and C_j is defined based on Ward’s method (Ward, 1963) as follows:

d(C_i, C_j) = \frac{|C_i| |C_j|}{|C_i| + |C_j|} \| \mu_i - \mu_j \|^2    (2)

where |C_i| represents the number of data points in cluster C_i, \mu_i is the centroid (mean vector) of cluster C_i, and \| \mu_i - \mu_j \|^2 denotes the squared Euclidean distance between the centroids of clusters C_i and C_j, respectively.
Then, the two clusters C_i and C_j with the smallest distance d are merged to form a new cluster C_ij:

C_{ij} = C_i \cup C_j    (3)

As a result, the new cluster set C is generated as follows:

C \leftarrow (C \setminus \{C_i, C_j\}) \cup \{C_{ij}\}    (4)

By repeating this operation until the number of clusters reaches N, the image dataset X can be divided into N clusters X_i (i = 1, ..., N) based on the distribution of image features obtained from CLIP.
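A minimal sketch of this data-distribution-based division is given below, using scikit-learn’s AgglomerativeClustering with Ward linkage as one possible implementation of the merging procedure in Eqs. (2)-(4); the paper does not specify a particular library, and the feature array shown is a placeholder.

```python
# Sketch: dividing CLIP image features into N clusters with Ward's method (Section 4.2).
import numpy as np
from sklearn.cluster import AgglomerativeClustering

def split_dataset_by_distribution(image_features: np.ndarray, n_classes: int) -> np.ndarray:
    """image_features: (M, d) array of CLIP image features I_i.
    Returns an array of length M with the cluster index assigned to each image."""
    clustering = AgglomerativeClustering(n_clusters=n_classes, linkage="ward")
    return clustering.fit_predict(image_features)

# Example: divide M images into N = 4 subsets X_1, ..., X_4.
features = np.random.randn(1000, 512).astype(np.float32)   # placeholder CLIP features
labels = split_dataset_by_distribution(features, n_classes=4)
subsets = [np.where(labels == i)[0] for i in range(4)]      # image indices of each X_i
```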
5 DISCRIMINATOR SELECTION
Next, we describe a method for selecting which of
the N Discriminators should handle a generated im-
age G(z) from the Generator. In this research, we in-
troduce a new component, the Discriminator Selector,
responsible for making this selection. This Selector
identifies the appropriate Discriminator (and thus the
appropriate class) for each generated image based on
its image features. We implement the Discriminator
Selector using CLIP.
As shown in Figure 4, for each of the N classes defined in Section 4, we convert the images x_ij ∈ X_i (i = 1, ..., N) into image feature vectors I_ij using CLIP Image Encoder. We then compute a mean feature vector I_i^mean for each class as follows:

I_i^{mean} = \frac{1}{M_i} \sum_{j=1}^{M_i} I_{ij}    (5)

where M_i denotes the number of data in Class i.
Then, as shown in Figure 4, the class c to which the generated image I_gen belongs is determined by selecting the class with the minimum Euclidean distance between the feature vector of the generated image I_gen and the mean feature vector of each class. This is formulated as follows:

c = \arg\min_{i \in \{1, \ldots, N\}} \| I_{gen} - I_i^{mean} \|    (6)
Each time the Generator produces an image, the
Discriminator Selector uses this criterion to assign the
generated image to a specific class. Consequently,
the corresponding Discriminator, specialized in that
class, is selected to evaluate the generated image.
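A minimal sketch of the Discriminator Selector, assuming the class-wise CLIP features have already been computed, is shown below; the class name and tensor shapes are illustrative.

```python
# Sketch: nearest-mean Discriminator Selector (Eqs. (5) and (6), Section 5).
import torch

class DiscriminatorSelector:
    def __init__(self, class_features):
        # class_features[i]: (M_i, d) tensor of CLIP features I_ij for images in class i
        self.means = torch.stack([f.mean(dim=0) for f in class_features])  # I_i^mean

    def select(self, generated_feature: torch.Tensor) -> int:
        """generated_feature: (d,) CLIP feature I_gen of a generated image G(z).
        Returns the index of the class (and thus the Discriminator) to use."""
        dists = torch.norm(self.means - generated_feature.unsqueeze(0), dim=1)
        return int(dists.argmin())
```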
In this method, an appropriate Discriminator is au-
tomatically selected based on the characteristics of
each generated image, resulting in improved discrim-
inative capability compared to using a single Discrim-
inator. Consequently, this enhanced discrimination
leads to improved generative capabilities of the Gen-
erator, as it receives more targeted and specialized
feedback from the selected Discriminator.
6 GRADUAL INCREASE OF
DISCRIMINATORS
In this section, we describe an approach for more ef-
ficient and stable training when using multiple Dis-
criminators. When assigning distinct specializations
to each Discriminator for adversarial learning, the ini-
tial stages of training can be challenging. At this
early stage, the Generator’s image quality is often
low, making it difficult to assign generated images to
the most suitable Discriminator (i.e., to determine the
correct class). This poses a stability issue in the early
training phases. Additionally, optimizing multiple
Discriminators sequentially can be time-consuming.
To address these challenges, our approach gradu-
ally increases the number of Discriminators over the
course of training. In the initial stage, we start with
one Discriminator adversarially trained against a sin-
gle Generator. Once the training has progressed to a
certain extent, we incrementally increase the number
of Discriminators. Each time we increase the number
of Discriminators, we do so by splitting each existing
Discriminator into two. Thus, the number of Discrim-
inators increases as 1, 2, 4, 8, and so forth. For each
split, the final parameters of the original Discrimina-
tor are used as the initial parameters for the two new
Discriminators. This inheritance of parameters en-
sures that the characteristics of the original Discrimi-
nator are retained in the split Discriminators, allowing
for stable handling of the expansion process.
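The splitting step itself can be implemented very simply; the sketch below doubles the set of Discriminators by deep-copying each existing network, which is one straightforward way to realize the parameter inheritance described above (optimizer-state handling and other details are assumptions).

```python
# Sketch: splitting each Discriminator into two parameter-inheriting copies (Section 6).
import copy
import torch.nn as nn

def split_discriminators(discriminators):
    """Doubles the number of Discriminators (1 -> 2 -> 4 -> 8, ...)."""
    new_discriminators = []
    for d in discriminators:
        child_a = copy.deepcopy(d)   # both children start from the parent's final parameters
        child_b = copy.deepcopy(d)
        new_discriminators.extend([child_a, child_b])
    return new_discriminators
```

In the experiments reported later, this split is triggered at fixed epochs (for example, 1G1D to 1G2D at the 50th epoch and 1G2D to 1G4D at the 100th epoch).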
There are several possible methods for determin-
ing the timing of Discriminator increases. In this re-
search, for simplicity, we chose to increase the num-
ber of Discriminators after a fixed number of training
iterations. Developing a method to automatically de-
termine this timing is an area for future investigation.
With this approach, we can avoid instability in
Discriminator selection during the early stages of
training. Additionally, as adversarial learning pro-
gresses more efficiently, this method enables perfor-
mance improvements in the Generator with fewer
training iterations.
7 TRAINING
Next, we discuss the training method for the net-
work. Here, we consider the case of adversarially
training a single Generator and N Discriminators D_i (i = 1, ..., N) using a training dataset divided into N classes.
Suppose that an image G(z) generated by the Generator from noise z is assigned to a Discriminator D_i by the Discriminator Selector. In this scenario, the Generator G and the selected Discriminator D_i engage in adversarial training according to the following objective functions:

G^* = \arg\min_G D_i^*    (7)

D_i^* = \arg\max_{D_i} \mathbb{E}_{x \sim p_{data}(x)}[\log D_i(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D_i(G(z)))]    (8)
In this way, an appropriate Discriminator is se-
lected by the Discriminator Selector based on the
characteristics of the image generated by the Gen-
erator, enabling targeted adversarial learning to take
place.
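The following sketch illustrates one adversarial update implementing Eqs. (7) and (8) in PyTorch. It uses the commonly adopted non-saturating BCE formulation of the generator loss; the networks, the Selector, the CLIP feature extractor, and the per-class real batches are assumed to be provided elsewhere, and all names are illustrative.

```python
# Sketch: one training iteration of the Generator with the selected Discriminator D_i (Section 7).
import torch
import torch.nn.functional as F

def train_step(G, discriminators, opts_D, opt_G, selector, clip_encoder,
               real_batches, z_dim, device):
    # 1) Generate an image and let the Selector pick the responsible class i.
    z = torch.randn(1, z_dim, device=device)
    fake = G(z)
    i = selector.select(clip_encoder(fake).squeeze(0))      # CLIP feature of G(z)
    D_i, opt_D_i = discriminators[i], opts_D[i]
    real = real_batches[i].to(device)                       # images from subset X_i

    # 2) Discriminator update: maximize log D_i(x) + log(1 - D_i(G(z))) (Eq. (8)).
    opt_D_i.zero_grad()
    real_logits = D_i(real)
    fake_logits = D_i(fake.detach())
    loss_D = F.binary_cross_entropy_with_logits(real_logits, torch.ones_like(real_logits)) \
           + F.binary_cross_entropy_with_logits(fake_logits, torch.zeros_like(fake_logits))
    loss_D.backward()
    opt_D_i.step()

    # 3) Generator update against the selected Discriminator (Eq. (7), non-saturating form).
    opt_G.zero_grad()
    gen_logits = D_i(fake)
    loss_G = F.binary_cross_entropy_with_logits(gen_logits, torch.ones_like(gen_logits))
    loss_G.backward()
    opt_G.step()
    return loss_G.item(), loss_D.item()
```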
8 EXPERIMENTS
Figure 5: CelebA dataset (64×64 resized).
Figure 6: Comparison of training images classified into 4 classes C_i (i = 1, ..., 4) by (a) hand-defined classes and (b) image distribution.
Next, we present the experimental results using the proposed method. In the following experiments, we target the task of generating face images using the CelebA (Liu et al., 2015) dataset shown in Figure 5 as training data. In each experiment, the dataset was divided using the proposed method, and a discriminator was selected using the Discriminator Selector to train the GAN. For quantitative evaluation, we used the Fréchet Inception Distance (FID) (Heusel et al., 2017) to measure the similarity in the feature space between the generated images and the real images.
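As a reference for reproducing the evaluation, FID can be computed with off-the-shelf tools; the sketch below uses the torchmetrics implementation, which is an assumption since the paper does not state which FID implementation was used, and the random tensors stand in for real and generated image batches.

```python
# Sketch: computing FID between real and generated images with torchmetrics.
import torch
from torchmetrics.image.fid import FrechetInceptionDistance

fid = FrechetInceptionDistance(feature=2048)  # Inception-v3 pooling features
# Placeholder batches: uint8 tensors of shape (N, 3, H, W) with values in [0, 255].
real_images = torch.randint(0, 256, (64, 3, 64, 64), dtype=torch.uint8)
fake_images = torch.randint(0, 256, (64, 3, 64, 64), dtype=torch.uint8)
fid.update(real_images, real=True)
fid.update(fake_images, real=False)
print(fid.compute())  # lower is better
```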
8.1 Image Generation Accuracy of
Generator
In this experiment, we compared the image genera-
tion accuracy of the generator by setting the num-
ber of discriminators to 1, 2, and 4. Of these three
cases, the case with 1 discriminator is the conven-
tional method.
Figure 7: Images generated from (a) 1G1D, (b) 1G2D, and (c) 1G4D.
Figure 8: Images generated from 1G4D trained by using (a) the hand-defined classes and (b) the data distribution based classes.
For dataset division based on hand-defined classes, we set the following four classes and divided the dataset using CLIP Text Encoder and Image Encoder:
"Man with a smile"
"Man with a Neutral-face"
"Woman with a smile"
"Woman with a Neutral-face"
Example images in the four classes after division are shown in Figure 6 (a). Each row is a group of images corresponding to "Man with a smile", "Man with a Neutral-face", "Woman with a smile", and "Woman with a Neutral-face" respectively. From these images, we find that by using CLIP Encoder, the images can be roughly classified correctly into the four classes.
Next, we show the results of automatic data seg-
mentation based on the distribution of the image fea-
tures in Figure 6 (b). Each of the four rows repre-
sents one of four classes of images after segmenta-
tion. From the top row, these images appear to be
roughly divided into Western women, Western men,
Asians, and Athletes. These results are different from
the hand-defined classes in Figure 6 (a), but we find
that the images have been segmented into groups with
similar image features.
Based on these data classifications, we compared the image generation ability of the generator under adversarial learning between one generator and four discriminators (1G4D) against adversarial learning with one discriminator (1G1D) and with two discriminators (1G2D). In the case of 1G2D, we transitioned from 1G1D to 1G2D at the 50th epoch. In the case of 1G4D, we transitioned from 1G1D to 1G2D at the 50th epoch, and from 1G2D to 1G4D at the 100th epoch.
Table 1: FID score of generated images by the Generator trained with each configuration.

        hand-defined    data distribution
1G1D    95.71           94.74
1G2D    90.33           89.85
1G4D    90.03           87.37
Table 2: FID score of generated images by the Generator trained with each configuration.

                FID
1G4D (4)        failed
1G4D (1,4)      96.42
1G4D (1,2,4)    87.37
Figure 7 shows example images generated from
1G1D, 1G2D, and 1G4D respectively. Figure 8 also
shows a comparison of images generated from 1G4D
using hand-defined classes and using data distribution
based classes.
To quantitatively evaluate the quality of these gen-
erated images, we computed the FID values of the
generated images. The results are shown in Table 1.
Table 1 shows that 1G2D and 1G4D yield generators with higher performance than the conventional method 1G1D, and that the performance of the generator increases as more discriminators are added. Also, comparing the second and third columns of this table, we find that automatic data classification based on the distribution of the data yields a generator with higher performance than the hand-defined classes.
8.2 Gradual Increase of Discriminators
Next, we investigated the difference in image generation results between training while gradually increasing the number of discriminators and training with the full number of discriminators from the beginning. We compared three cases: (1)
1G4D from the beginning (1G4D (4)), (2) starting
with 1G1D and changing to 1G4D at the 50th epoch
(1G4D (1,4)), and (3) starting with 1G1D, changing
to 1G2D at the 50th epoch and changing to 1G4D at
the 100th epoch (1G4D (1,2,4)). The FID score of
the generator trained in each configuration is shown
in Table 2.
In the case of 1G4D (4), the accuracy of the images generated by the Generator in the early stages of learning was low, so the discriminator selection could not be performed appropriately and training failed. On the other hand, comparing the results of 1G4D (1,4) and 1G4D (1,2,4), we find that 1G4D (1,2,4) achieves higher generation accuracy, and that gradually increasing the number of discriminators produces a better Generator.
These results show that when using multiple dis-
criminators, gradually increasing their expertise can
result in better adversarial learning.
9 CONCLUSION
In this paper, we proposed a method to improve the
image generation ability of a generator by adversari-
ally training a generator using multiple discriminators
with different expertise. In particular, we proposed a method to give multiple discriminators independent expertise by dividing a dataset into subsets with independent image features and selecting images for each discriminator by using CLIP. In addition, we showed
a method to gradually increase the number of discrim-
inators in order to eliminate the instability of training
that occurs when using multiple discriminators.
Experimental results showed that, when dividing the dataset into the classes that each discriminator is responsible for, more appropriate expertise is given to the discriminators by dividing based on the distribution of image features obtained by CLIP rather than based on hand-defined classes. It was also revealed that it is more efficient to give the discriminators expertise while gradually increasing their number.
REFERENCES
Albuquerque, I., Monteiro, T., Doan, T., Considine, B.,
Falk, T., and Mitliagkas, I. (2019). Multi-objective
training of generative adversarial networks with multi-
ple discriminators. In Proc. International Conference
on Machine Learning.
Brock, A., Donahue, J., and Simonyan, K. (2019). Large scale GAN training for high fidelity natural image synthesis. In International Conference on Learning Representations (ICLR).
Choi, J. and Han, B. (2022). MCL-GAN: Generative adversarial networks with multiple specialized discriminators. In Proc. Conference on Neural Information Processing Systems (NeurIPS).
Dinh Nguyen, T., Le, T., Vu, H., and Phung, D. (2017).
Dual discriminator generative adversarial nets. In
Proc. Conference on Neural Information Processing
Systems (NIPS).
Durugkar, I., Gemp, I., and Mahadevan, S. (2017). Genera-
tive multi-adversarial networks. In Proc. International
Conference on Learning Representations.
Ghosh, A., Kulharia, V., Namboodiri, V., Torr, P., and Doka-
nia, P. (2018). Multi-agent diverse generative adver-
sarial networks. In Proc. IEEE Conference on Com-
puter Vision and Pattern Recognition (CVPR).
Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B.,
Warde-Farley, D., Ozair, S., Courville, A., and Ben-
gio, Y. (2014). Generative adversarial nets. Advances
in neural information processing systems, 27:2672–
2680.
Guzman-Rivera, A., Batra, D., and Kohli, P. (2012). Mul-
tiple choice learning: Learning to produce multiple
structured outputs. In Proc. Conference on Neural In-
formation Processing Systems (NIPS).
Heusel, M., Ramsauer, H., Unterthiner, T., Nessler, B., and
Hochreiter, S. (2017). Gans trained by a two time-
scale update rule converge to a local nash equilibrium.
Advances in Neural Information Processing Systems,
30.
Hoang, Q., Nguyen, T., Le, T., and Phung, D. (2018).
Mgan: Training generative adversarial nets with mul-
tiple generators. In Proc. International Conference on
Learning Representations.
Karras, T., Laine, S., and Aila, T. (2020). Analyzing and
improving the image quality of StyleGAN. In Proc.
IEEE/CVF Conference on Computer Vision and Pat-
tern Recognition (CVPR), pages 8110–8119.
Kingma, D. P. and Welling, M. (2014). Auto-encoding vari-
ational bayes. arXiv preprint arXiv:1312.6114.
Ledig, C., Theis, L., Huszár, F., Caballero, J., Cunningham, A., Acosta, A., Aitken, A., Tejani, A., Totz, J., Wang, Z., and Shi, W. (2017). Photo-realistic single image super-resolution using a generative adversarial network. Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 4681–4690.
Liu, Z., Luo, P., Wang, X., and Tang, X. (2015). Deep learn-
ing face attributes in the wild. In Proc. International
Conference on Computer Vision (ICCV).
Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh,
G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P.,
Clark, J., Krueger, G., and Sutskever, I. (2021). Learn-
ing transferable visual models from natural language
supervision. In International Conference on Machine
Learning (ICML). PMLR.
Ramesh, A., Pavlov, M., Goh, G., Gray, S., Voss, C.,
Radford, A., Chen, M., and Sutskever, I. (2021).
Zero-shot text-to-image generation. arXiv preprint
arXiv:2102.12092.
Rombach, R., Blattmann, A., Lorenz, D., Esser, P., and
Ommer, B. (2022). High-resolution image synthesis
with latent diffusion models. In Proc. IEEE/CVF Con-
ference on Computer Vision and Pattern Recognition
(CVPR), pages 10684–10695.
Ward, J. H. (1963). Hierarchical grouping to optimize an
objective function. Journal of the American Statistical
Association, 58(301):236–244.
Zhang, H., Xu, T., Li, H., Zhang, S., Huang, X., Wang, X., and Metaxas, D. N. (2017). StackGAN: Text to photo-realistic image synthesis with stacked generative adversarial networks. In Proc. IEEE International Conference on Computer Vision (ICCV), pages 5907–5915.