DragGAN-Based Emotion Image Generation and Analysis for
Animated Faces
Daqi Hu
The School of Computer and Artificial Intelligence, Nanjing University of Science and Technology Zijin College,
Nanjing, China
Keywords: Generative Artificial Intelligence, DragGAN, StyleGAN, Motion Supervision.
Abstract: In recent years, generative artificial intelligence (AI) and its applications have become a hot topic among art designers and content creators, and there is a need for a simpler, more direct way to make small edits to images. In this paper, the author introduces Drag Your Generative Adversarial Network (DragGAN) and adapts its discriminators and features to anime styles. This work consists of two main parts: algorithm design based on the Style-Based GAN (StyleGAN) model and its application to anime-style images. Specifically, an analytical model is first constructed using DragGAN; this process is called motion supervision, and the input image should match the trained model. Second, point tracking is used to iterate the generation process continuously and return the result of each iteration. Third, the analysis compares the predictive performance of different models and provides an interactive GUI application as a demo project. With this research, anyone can edit anime-style portraits with a few clicks and drags, reducing the time spent editing anime-style images and increasing productivity and creativity.
1 INTRODUCTION
Anime style refers to a unique and vibrant art form that
originated in Japan. Today it has become a global
cultural phenomenon. Anime is characterized by
colorful visuals, fantastical themes and rich
expressions. It has captured the hearts of millions of
anime fans worldwide. This unique style of animation
has found its way into a variety of media, including
television series, movies, manga (Japanese comics),
video games and more.
Deep learning-based image-to-image translation has produced excellent results in the past few years (Schmidhuber 2015 & Chen et al 2020). Generative Adversarial Networks (GANs) have proved to be strong at style conversion and have become the leading solution for small image corrections (Chen et al 2020, Goodfellow et al 2014 & Pan et al 2023). Instead of training on raw pixel data, GANs pay more attention to the changes between images. In the context of style conversion, the original image and the result form a "pair", and the GAN trains a discriminator to recognize the pair in order to convert the original image into the result. Image generation with GAN methods is therefore faster and more precise than with traditional reinforcement learning. With Drag Your GAN (DragGAN), users can edit the content of any GAN-generated image with a single drag (Pan et al 2023). It gives users a way to obtain the image they want in a user-friendly web GUI. DragGAN splits the approach into two parts to achieve this goal: a feature-driven motion supervision that directs the handle point toward the target position, and a point tracking method that continuously localizes the handle point's position using the discriminative generator features. DragGAN mainly focuses on the generation of real-life photos. In anime-style image generation, however, the image features are quite different. First, anime-style images have sharp stroke edges; these strokes stand in for the ambient occlusion and subsurface scattering found in real lighting environments. Second, the textures and diffuse shading of cloth and skin are simplified to a single flat color. Because anime faces are stylized well beyond real-world faces, the discriminators and models of DragGAN cannot be used directly.
Most researchers have accomplished this by
generating a 2D image using a 3D model or by
generating the entire picture using a diffusion model
(Deng et al 2020, Jascha et al 2015, Song et al 2020,
Song 2011 & Teed and Deng 2020). Neither of these methods fulfills the requirements of accuracy and flexibility. To solve this problem, this paper introduces DragGAN and improves its discriminators
and features to fit the anime style. Specifically, DragGAN is first used to construct the analysis model; this process is called motion supervision, and the input image should match the trained model. Second, point tracking is used to keep iterating the generation process and to return the result of each iteration, so users can stop at any step before the final iteration if the image already fits their needs. Third, the predictive performance of the different models is analyzed and compared. As mentioned above, the official DragGAN requires, and works well on, real images and their associated pretrained models. What this project does is extend the algorithm to stylized images, using anime style as an example. This research gives users an easier way to alter images, especially for rotation and relocation requirements, and achieves better results than traditional image editing algorithms.
2 METHODOLOGY
2.1 Dataset Description and
Preprocessing
The dataset used in this project is “Ganyu | Genshin Impact Anime Faces GAN Training” (Dataset 2023), downloaded from Kaggle. The pictures in this dataset depict a specific anime character, and each image is 512x512 pixels in size. All of the images should be preprocessed so that the portrait is centered and the image is resized to 512x512. The dataset covers different angles of the character's face and exposes the model to different environments and lighting. To obtain better results, additional styles of the character are also recommended.
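The paper does not include the preprocessing script itself, so the following is only a minimal sketch of the center-alignment and resize step described above, using plain Pillow and hypothetical folder names:

```python
from pathlib import Path
from PIL import Image

def preprocess(src_dir: str, dst_dir: str, size: int = 512) -> None:
    """Center-crop every image to a square and resize it to size x size."""
    out = Path(dst_dir)
    out.mkdir(parents=True, exist_ok=True)
    for path in sorted(Path(src_dir).glob("*.png")):
        img = Image.open(path).convert("RGB")
        w, h = img.size
        side = min(w, h)
        left, top = (w - side) // 2, (h - side) // 2
        img = img.crop((left, top, left + side, top + side))
        img = img.resize((size, size), Image.LANCZOS)
        img.save(out / path.name)

# Example usage (hypothetical paths):
# preprocess("raw_faces", "dataset_512")
```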
2.2 Proposed Approach
This project focuses on providing a method for anime image transition, based on the Style-Based GAN (StyleGAN) network and application work from DragGAN. StyleGAN has been a state-of-the-art family of methods in the GAN field since its first version was released in 2019, and the recent third version, StyleGAN3, fixes several issues and adds support for PyTorch and for Ada-architecture graphics cards. DragGAN supports StyleGAN2 and StyleGAN3 at the same time; either repository can be used to train a model, or their pretrained models can be used directly. One of the most well-known StyleGAN3 models is a real-human image generation model. The whole process of this implementation starts from datasets, which in practice usually means pretrained models. In DragGAN, the exponential-moving-average generator (Gema) of the StyleGAN model is used to manipulate the source and target points given through the user interface. The model then predicts the movement in the next frame's generated image. The data comes from Gema, which will be discussed later in this paper. Finally, it tracks the manipulated points and updates them every frame. The process is shown in Fig. 1.
Figure 1: This project’s architecture (Picture credit:
Original).
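To make the role of Gema concrete, the sketch below loads a pretrained StyleGAN pickle and samples one image. It follows the loading pattern documented in the official StyleGAN3 repository, but the pickle file name here is only a placeholder, and the repository must be on the Python path so the networks can be unpickled:

```python
import pickle
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# A StyleGAN pickle bundles three networks: 'G', 'D' and 'G_ema'.
with open("stylegan3-ganyu-512x512.pkl", "rb") as f:   # placeholder model name
    data = pickle.load(f)
G_ema = data["G_ema"].to(device).eval()

# Sample a random latent z, map it to the intermediate latent w,
# and synthesize one image (unconditional model, so the class input is None).
z = torch.randn([1, G_ema.z_dim], device=device)
w = G_ema.mapping(z, None)
img = G_ema.synthesis(w, noise_mode="const")
print(img.shape)   # e.g. torch.Size([1, 3, 512, 512]), values in [-1, 1]
```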
2.2.1 StyleGAN3
StyleGAN is a generator and discriminator architecture for GANs. A GAN is a deep generative model characterized by producing realistic samples through adversarial training. It consists of two parts: a generator and a discriminator. The generator converts random noise vectors into fake samples, while the discriminator classifies samples as real or generated. The goal of the generator is to produce fake samples that deceive the discriminator, while the goal of the discriminator is to accurately separate real from fake samples. By repeatedly and iteratively training the generator and the discriminator, the performance of the GAN gradually improves, and the generated samples become more and more realistic. StyleGAN3 is an improved version of the GAN that makes breakthroughs in the quality and diversity of the generated images. Compared with a traditional GAN, StyleGAN3 has three notable features. First, higher quality of generated images: StyleGAN3 produces more realistic, clear and detailed images by introducing new architectures and
training techniques. Second, better control of generation: StyleGAN3 allows fine control over different attributes of the generated samples, such as facial expressions, hairstyles, etc., providing more personalization options. Third, super-resolution generation: StyleGAN3 can generate high-resolution images, including detail enhancement and super-resolution reconstruction tasks. StyleGAN3's overall approach is similar to that of traditional GANs, but it introduces a number of improvements and innovations in the model architecture and the training process. These include architectural optimization of the generator and discriminator, feature alignment mechanisms, regularization methods, etc., aiming to improve the quality and diversity of the generated results. It produces an automatically learned, unsupervised separation of high-level features and stochastic variation in the resulting images and enables straightforward, scale-specific control of synthesis. StyleGAN builds on Progressive GAN's method; it leverages modern GPU architectures and is implemented on a modern machine learning framework. Every StyleGAN model provides a Generator, a Discriminator and a generator that tracks an exponential moving average of the weights, denoted Gema; in the pickle file these three parts are called G, D, and Gema. The reason
why DragGAN is implemented on StyleGAN3 is that the third version of StyleGAN shows obvious improvements in video content generation, and the two versions share both the backend and the training utilities. In the original StyleGAN implementation, a style-based generator is used to make the generated image follow the given style. This is what distinguishes StyleGAN, as the state of the art, from earlier GANs. Traditionally, the generator receives the latent code through an input layer; StyleGAN instead starts synthesis from a learned constant image and injects the style at each layer through adaptive instance normalization. That is:
$$\mathrm{AdaIN}(x_i, y) = y_{s,i}\,\frac{x_i - \mu(x_i)}{\sigma(x_i)} + y_{b,i} \quad (1)$$
where each feature map x_i is normalized separately, and the style y = (y_s, y_b) is computed from the intermediate latent w and controls the adaptive instance normalization (AdaIN).
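For illustration, a minimal AdaIN implementation of Eq. (1) in PyTorch is shown below. This is an independent toy version, not StyleGAN's own code, where the operation is fused into the convolution layers; the style vectors y_s and y_b would normally come from an affine transform of w:

```python
import torch

def adain(x: torch.Tensor, y_s: torch.Tensor, y_b: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Adaptive instance normalization, Eq. (1).

    x:   feature maps of shape (N, C, H, W)
    y_s: per-channel style scale of shape (N, C)
    y_b: per-channel style bias of shape (N, C)
    """
    mu = x.mean(dim=(2, 3), keepdim=True)           # per-sample, per-channel mean
    sigma = x.std(dim=(2, 3), keepdim=True) + eps   # per-sample, per-channel std
    x_norm = (x - mu) / sigma                       # normalize each feature map separately
    return y_s[:, :, None, None] * x_norm + y_b[:, :, None, None]

# Quick shape check with identity style (y_s = 1, y_b = 0):
x = torch.randn(2, 8, 16, 16)
print(adain(x, torch.ones(2, 8), torch.zeros(2, 8)).shape)   # torch.Size([2, 8, 16, 16])
```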
2.2.2 Latent Code
In the context of GANs, the Generator plays a crucial
role in producing the generated images, while the
Discriminator is responsible for training the
Generator by providing feedback and ensuring that
the generated results resemble the desired outcomes.
However, GANs have expanded beyond simple
image generation and can now be used to alter
existing images from a given source to a target image.
The connection between the Generator and
Discriminator is achieved through latent code, which
serves as an intermediary during the training process
and remains invisible until all updates have been
completed. As a result, the generated or altered
images are derived from the datasets used for training,
meaning that the outcomes are limited to what was
defined within those datasets. Consequently, if users
aim to alter an image that is not part of the training
dataset, the original DragGAN method is unable to
provide a viable solution. However, the Pivotal Tuning Inversion (PTI) method offers a solution to this limitation (Roich 2021). PTI enables
the alteration of a pretrained model without the need
for retraining. All that is required is the input image
that the user wishes to edit. Figure 1 showcases the
effectiveness of the Inverse Latent approach, which
allows for customizable image editing without the
need to retrain the model. This method seamlessly
integrates into the entire process, providing a user-
friendly and efficient solution for image
manipulation. By leveraging PTI, users now have the
ability to modify images that were not part of the
original training dataset, expanding the scope of
possible image alterations and offering greater
flexibility in creative expression. This advancement
introduces a significant shift in latent-based editing
techniques, granting users the power to customize and
transform images with ease and innovation.
PTI modifies a pretrained model in three steps: inversion, tuning and regularization. It starts from the original StyleGAN model and adapts it to the given image; the inversion step serves to provide a good starting point for the tuning phase. The most editable latent space is StyleGAN's native latent space W. The implementation reconstructs the input image s by optimizing the latent code w, which yields the pivot code w_p, together with the noise vector v. The following objective defines the optimization:
$$w_p, v = \arg\min_{w, v} \; \mathcal{L}_{\mathrm{LPIPS}}\big(s, G(w, v; \theta)\big) + \lambda_v \mathcal{L}_v(v) \quad (2)$$
in which G(w, v; θ) denotes the image produced by the generator G with weights θ. Instead of going through the mapping network as traditional StyleGAN-based methods do, PTI optimizes in the latent space directly. L_LPIPS is the perceptual loss, L_v the noise regularization term, and λ_v a weighting hyperparameter.
The tuning and regularization phases are mostly the same as in the original StyleGAN setup.
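A hedged sketch of the inversion step in Eq. (2) is given below. It is deliberately simplified: only the latent code is optimized with a perceptual loss plus an illustrative pixel term, the noise optimization and its regularizer are omitted, and the lpips package, learning rate and step count are assumptions rather than the exact PTI implementation:

```python
import torch
import lpips   # perceptual (LPIPS) loss: pip install lpips

def invert_latent(G_ema, target, steps=500, lr=0.01, device="cuda"):
    """Simplified inversion: optimize w so that G(w) reconstructs `target` (cf. Eq. 2).

    target: image tensor of shape (1, 3, H, W) scaled to [-1, 1].
    Returns the optimized pivot code w_p for the subsequent tuning phase.
    """
    G_ema.requires_grad_(False)
    percep = lpips.LPIPS(net="vgg").to(device)
    # Initialize w from the mapping of a random z (already broadcast to all layers).
    z = torch.randn([1, G_ema.z_dim], device=device)
    w = G_ema.mapping(z, None).detach().clone().requires_grad_(True)
    opt = torch.optim.Adam([w], lr=lr)
    for _ in range(steps):
        img = G_ema.synthesis(w, noise_mode="const")
        loss = percep(img, target).mean() + 0.1 * torch.nn.functional.mse_loss(img, target)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return w.detach()   # the pivot code w_p
```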
2.2.3 DragGAN
DragGAN is an image editing method based on GAN,
which consists of two main components: a Generator
and a Discriminator. The Generator is responsible for
generating realistic images, while the Discriminator
guides the training of the Generator by evaluating the
generated images to produce more realistic
images.DragGAN extends the functionality of the
GAN by allowing the user to perform image editing
between a given source image and a target image.
Specifically, DragGAN can generate editing results
similar to the target image by learning the differences
between the source and target images based on the
source image provided by the user. This allows the
user to edit the image in terms of morphology, color,
and texture by using DragGAN for applications such
as image style conversion, image enhancement, and
image restructuring.
In DragGAN, the generator produces an image by sampling from the latent space. The latent space is a low-dimensional vector space in which
each vector corresponds to a unique image style. By
adjusting the values of the latent vectors, the user can
control the different features and styles of the
generated image.
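As a toy illustration of this point, the snippet below blends two latent codes and watches the generated image drift from one style to the other; it reuses the placeholder pickle name from the earlier loading sketch and is not part of the DragGAN code:

```python
import pickle
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
with open("stylegan3-ganyu-512x512.pkl", "rb") as f:   # placeholder model name
    G_ema = pickle.load(f)["G_ema"].to(device).eval()

z1 = torch.randn([1, G_ema.z_dim], device=device)
z2 = torch.randn([1, G_ema.z_dim], device=device)
w1, w2 = G_ema.mapping(z1, None), G_ema.mapping(z2, None)

images = []
for alpha in (0.0, 0.25, 0.5, 0.75, 1.0):
    w = (1 - alpha) * w1 + alpha * w2            # linear blend in W space
    images.append(G_ema.synthesis(w, noise_mode="const"))
# Each entry drifts smoothly from the first style to the second,
# showing how latent-vector adjustments control features of the output.
```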
The advantage of DragGAN over traditional rule-
based editing methods is that it learns the distribution
of the input image and edits according to the
distribution of the target image. As a result, the
editing results are more natural and realistic, and can
be adapted to a wider range of image styles and
contents.
The application areas of DragGAN include
computer vision, image processing, and artistic
creation. It provides users with an intuitive, flexible
and efficient image editing tool that empowers them
to create unique and innovative visual effects.
An image I ∈ R^{3×H×W} generated from a latent code L is manipulated through user-specified point pairs. The source points are defined as s_i = (x_{s,i}, y_{s,i}), i = 1, 2, …, n, and the target points as t_i = (x_{t,i}, y_{t,i}), i = 1, 2, …, n. The goal is to move the image content at the source locations to the target locations. In one optimization step, the model obtains an updated latent code L' and a result image I'. In theory this iteration happens in every frame the GUI renders, but in practice it depends heavily on computer performance. In each iteration, the feature maps F, taken from the output of the 6th block of the StyleGAN2 generator, are kept aligned with the original image while the features around each handle point are pushed toward the target. This is called motion supervision, and its loss function is defined as follows:
$$\mathcal{L} = \sum_{i=1}^{n} \sum_{u \in \Omega_1(s_i, r_1)} \big\| F(u) - F(u + N_i) \big\|_1 + \lambda \big\| F - F_0 \big\|_1 \quad (3)$$
where F(u) denotes the feature value of F at pixel u, Ω_1(s_i, r_1) is a small patch of radius r_1 around the source point, F_0 is the feature map of the initial image, and
$$N_i = \frac{t_i - s_i}{\| t_i - s_i \|_2}$$
is the normalized direction vector pointing from the source point toward the target point.
Because the processing phase amounts to a supervised optimization problem, the program runs easily on modern desktop computers and behaves like a nearly real-time image editing solution. Finally, a point tracking phase is required inside the loop because motion supervision cannot guarantee that the handle points end up exactly at the positions the user defined at the beginning; their new locations must be re-estimated after every update. Thus another optimization is required:
$$p_i = \arg\min_{u \in \Omega_2(s_i, r_2)} \big\| F'(u) - f_i \big\|_1 \quad (4)$$
where p_i is the tracked handle point in each iteration and f_i is the feature of the initial handle point. The tracked point is obtained as the nearest neighbor of f_i within a neighborhood Ω_2(s_i, r_2) of the current handle point in the new feature map F'.
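The sketch below ties Eq. (3) and Eq. (4) together as one simplified drag-optimization step. It is not the official DragGAN implementation: the bilinear sampling helper, the patch radii r1 and r2, and the weight lam are illustrative assumptions, and the caller is expected to extract the StyleGAN feature map F from the 6th synthesis block and keep the latent code in the optimizer:

```python
import torch

def bilinear_sample(feat, xs, ys):
    """Sample feature vectors of feat (1, C, H, W) at float pixel coordinates (xs, ys)."""
    _, C, H, W = feat.shape
    gx = 2.0 * xs / (W - 1) - 1.0                     # normalize to [-1, 1] for grid_sample
    gy = 2.0 * ys / (H - 1) - 1.0
    grid = torch.stack([gx, gy], dim=-1).view(1, -1, 1, 2).to(feat.device, feat.dtype)
    out = torch.nn.functional.grid_sample(feat, grid, align_corners=True)
    return out.view(C, -1).t()                        # (num_points, C)

def motion_supervision_step(F_cur, F0, handles, targets, optimizer, lam=20.0, r1=3):
    """One update of Eq. (3); F_cur must be differentiable w.r.t. the latent held by `optimizer`."""
    loss = lam * (F_cur - F0).abs().mean()            # keep the rest of the image close to F0
    offs = torch.arange(-r1, r1 + 1, dtype=torch.float32)
    dy, dx = torch.meshgrid(offs, offs, indexing="ij")
    for s, t in zip(handles, targets):
        d = torch.tensor(t, dtype=torch.float32) - torch.tensor(s, dtype=torch.float32)
        n = d / (d.norm() + 1e-8)                     # normalized direction N_i
        xs, ys = s[0] + dx.flatten(), s[1] + dy.flatten()        # patch Omega_1(s_i, r1)
        f_u = bilinear_sample(F_cur.detach(), xs, ys)            # reference features (no grad)
        f_v = bilinear_sample(F_cur, xs + n[0], ys + n[1])       # shifted features receive gradients
        loss = loss + (f_u - f_v).abs().mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()                                  # caller recomputes F_cur before the next step

def track_point(F_new, f_init, s, r2=12):
    """Point tracking, Eq. (4): nearest-neighbour feature search around the old handle point."""
    xs = torch.arange(s[0] - r2, s[0] + r2 + 1, dtype=torch.float32)
    ys = torch.arange(s[1] - r2, s[1] + r2 + 1, dtype=torch.float32)
    gy, gx = torch.meshgrid(ys, xs, indexing="ij")
    cand = bilinear_sample(F_new, gx.flatten(), gy.flatten())
    idx = (cand - f_init).abs().sum(dim=1).argmin()
    return gx.flatten()[idx].item(), gy.flatten()[idx].item()    # new handle position p_i
```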
2.3 Implementation Details
This project is based on PyTorch 2.0. The requirements of StyleGAN3 need to be installed first, as this is the base environment for all of the involved projects, including StyleGAN3, DragGAN and PTI. After ensuring all requirements are satisfied, review the system environment PATH and the CUDA toolkit version. The CUDA toolkit must be 11.8: although it is not optimized for the latest GPUs (e.g. the Ada-architecture RTX 4090 or the Hopper H100), the SDK source changed in 12.x and the PyTorch plugins in this project cannot be compiled against it. Then check whether the current PyTorch build supports the GPU; it should be torch==2.0+cu118. The training process of StyleGAN is known to be long, so high-end GPUs are recommended. The pillow package should be downgraded to version 9.5.0. If the source model (.pkl file) comes from other pretrained datasets, follow the naming format “<model_type>-<custom_name>-<512x512>.pkl”, where <model_type> can be “stylegan2”, “stylegan-human” or “stylegan3”; other names are not recognized and will raise an exception at runtime. Although these projects support both conda and pypi environments, a native Python installation usually does not include the C debug symbols (this can only be ensured during the first installation; updating afterwards cannot recover them), and a “cannot open file python311.lib” error may appear
when attempting to build the PyTorch plugin. For this reason, a conda environment is recommended.
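A small sanity-check script (a convenience sketch, not part of the official setup instructions) can confirm the versions mentioned above before building the plugins:

```python
import torch
import PIL

print("torch:", torch.__version__)             # expected: 2.0.x+cu118
print("cuda toolkit:", torch.version.cuda)     # expected: 11.8
print("cuda available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("gpu:", torch.cuda.get_device_name(0))
print("pillow:", PIL.__version__)              # expected: 9.5.0
```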
3 RESULTS AND DISCUSSION
As depicted in Figure 2, a red point is used as the source point and a blue point as the target point. The drag process starts with the objective of bringing the red point and its neighboring points closer to the blue point. This process relies on the generator component within the GAN, specifically Gema. It can be likened to watching a generated video that stops once the red point reaches the blue point, or that can be manually interrupted at an intermediate stage.
In comparison to the original DragGAN method,
this project ensures that stylized images can be both
edited and generated using the StyleGAN3 model.
Additionally, they can be manipulated using the
DragGAN framework and its accompanying program.
This advancement makes significant contributions to
the editing of anime characters, negating the
requirement for precise 3D models or professional
sketching skills. It facilitates a quick preview of how
different emotions or physical features manifest on the
same character.
Compared to diffusion models, GAN-based
models offer more precise control over details, which
is a notable advantage. This enables the model to
finely manipulate and generate stylized images,
further enhancing the capabilities of image editing
techniques. The marriage of the GAN-based approach
with the DragGAN method opens up new possibilities
in the realm of anime character customization and
facilitates the exploration of diverse artistic
expressions.
Figure 2: Result of drag (Picture credit: Original).
4 CONCLUSION
This project mainly focuses on re-targeting the program and adapting it to stylized images and unreal content. The research uses anime style as a reference and example to show how easy and direct it is to alter an image through a simple drag, compared to traditional image editing methods, which require the user to have a solid background in aesthetics and art training. Meanwhile, this will not take sketchers' place, because GAN models cannot truly create an image: although they can generate images, this "generation" is named mostly after its programming appearance, in contrast to traditional machine learning methods. Compared to diffusion models, which really can create images from scratch, this project can therefore protect the designs and rights of original creators and art designers, while providing the necessary and functional altering methods that replace previous CPU-algorithm-based editing methods. In the future, the author will connect StyleGAN3, DragGAN and PTI together into a combined tool in which users only need to upload the image they want to alter and get the result in the same program.
REFERENCES
J. Schmidhuber, “Deep learning in neural networks: an
overview,” Neural Netw, vol 61, 2015, pp. 85–117
J. Chen, G. Liu, X. Chen, “AnimeGAN: a novel lightweight
GAN for photo animation,” International symposium
on intelligence computation and applications, vol.1205,
2020. pp. 242-256
I. J. Goodfellow, et al, “Generative adversarial nets,” In:
Proceedings 28th Annual Conference on Neural
Information Processing Systems 2014, NIPS 2014,
Montreal, QC, Canada, pp. 2672–2680
D. Ulyanov, A. Vedaldi, and V. Lempitsky, “Instance
normalization: The missing ingredient for fast
stylization,” arXiv 2016, unpublished.
X. Pan, A. Tewari, T. Leimkühler, “Drag your gan:
Interactive point-based manipulation on the generative
image manifold,” ACM SIGGRAPH 2023 Conference
Proceedings, vol. 2023, pp. 1-11
Y. Deng, J. Yang, D. Chen, “Disentangled and controllable
face image generation via 3d imitative-contrastive
learning,” Proceedings of the IEEE/CVF conference on
computer vision and pattern recognition, 2020, pp.
5154-5163
S. Jascha, W. Eric, M. Niru, G. Surya, “Deep unsupervised
learning using nonequilibrium thermodynamics,” In
International Conference on Machine Learning. PMLR,
vol. 2015, pp. 2256–2265
J. Song, C. Meng, S. Ermon, “Denoising diffusion implicit
models,” arXiv, 2020, unpublished.
Y. Song, J. Sohl-Dickstein, D. Kingma, “Score-based
generative modeling through stochastic differential
equations,” arXiv, 2011, unpublished.
Z. Teed, J. Deng, “Raft: Recurrent all-pairs field transforms
for optical flow,” Computer Vision–ECCV 2020: 16th
European Conference, Proceedings. Springer
International Publishing, 2020, pp. 402-419
Dataset, https://www.kaggle.com/datasets/andy8744/ganyu-genshin-impact-anime-faces-gan-training, last accessed 2023/08/25
D. Roich, R. Mokady, A. H. Bermano, and D. Cohen-Or,
“Pivotal tuning for latent-based editing of real images,”
arXiv, 2021, unpublished.