PhotoHandler: Manipulation of Portrait Images with StyleGANs Using Text
Geral Castillo-Arredondo, Dante Moreno-Carhuacusma and Willy Ugarte
Universidad Peruana de Ciencias Aplicadas, Lima, Peru
Keywords:
StyleGAN, StyleCLIP, Portrait Images, Image Manipulation, Mobile Application, NLP.
Abstract:
There are many methods and approaches for image manipulation. However, according to the state of the art, those based on StyleGANs have shown the best overall results for this task. In this work, we propose to use StyleGANs to allow the manipulation of portrait images using text. We use a method called StyleCLIP, which combines StyleGANs with CLIP, a neural network that connects text and images. This method has two numeric parameters: the manipulation strength and the disentanglement threshold. Our contribution consists of adjusting these parameters based on the entered text in order to improve the resulting edited image. With this technology, we built an app that helps people with poor digital skills edit their photos easily by using words. Users' opinions of the application were validated through a survey, and the results obtained show that our method produces a satisfactory image edit for most users.
1 INTRODUCTION
Most people lack the digital skills needed for photo editing and handling. Even though many editing tools exist, most of them require some level of expertise. Thus, these technological skills have become more important in recent years.
In 2018, 92% of households in the US had at least one Information and Communication Technology (ICT) device, most of them mobile devices¹. Many people tend to edit their photos before uploading them to social networks, either to improve them or to make them more appealing.
In August 2020, the mobile case company Case24 conducted a survey which revealed that only 29% of the sample would post a photo on social networks without editing it (Vázquez, 2021). Likewise, 71% of the people surveyed confirmed that Facetune was their favourite retouching application. According to the City University of London, 90% of young women surveyed in a study reported that they use a filter or edit their photos before posting².
ORCID: https://orcid.org/0000-0002-7510-618X
¹ US Census Bureau, 2021. “Computer and Internet Use in the United States: 2018.” https://www.census.gov/newsroom/press-releases/2021/computer-internet-use.html
² City University London, 2021. “90% of young women report using a filter or editing their photos before posting.” https://www.sciencedaily.com/releases/2021/03/210308111852.htm
Furthermore, in 2014 a survey was carried out by the Pew Research Center in which 77% of older adults reported that they need assistance if they want to use a tablet or smartphone³.
³ No Isolation, 2021. “Why do many seniors have trouble using technology?” https://www.noisolation.com/research/why-do-many-seniors-have-trouble-using-technology
Indeed, many people, including the elderly, will
have trouble enhancing their photos through the use of
editing apps like Facetune. Thus, having apps that can
easily improve the editing experience without needing
high digital skills could have a great impact.
This problem is difficult because the goal is to perform multiple edits on portrait images based on a large domain of words. Many methods fail because they have difficulty abstracting features correctly, resulting in an edit that does not preserve the identity of the person whose photo has been edited (Hou et al., 2022).
Another difficulty lies in finding a method that is able to connect a large domain of words with different image manipulation directions. In this sense, it is important to have an adequate manipulation direction for each word of the domain. For example, if the word is “old” or “smile”,
the goal is to manipulate the image so that the person's face looks older or smiles.
There are multiple solutions that have proposed different methods to modify or generate images based on a description given by one or more words. There are methods that allow semantic control over portrait images in order to generate multiple types of edits; among them, we can mention PIE (Tewari et al., 2020), InterFaceGAN (Shen et al., 2020) and GuidedStyle (Hou et al., 2022). However, such proposals do not support a large domain of words and are not flexible with respect to different manipulation directions. On the other hand, there are methods that use natural language to generate edits on images, such as TEA-cGAN (Zhu et al., 2019), LatteGAN (Matsumori et al., 2021) and Image-Text Shared Space (Xu et al., 2022). However, these proposals focus on editing images in general and are not optimized for editing people's faces.
StyleCLIP (Patashnik et al., 2021) is presented as a solution that allows editing images using text by combining two methods: StyleGAN (Karras et al., 2021) and CLIP (Radford et al., 2021). StyleCLIP has two numeric parameters: the manipulation strength and the disentanglement threshold. These parameters are set manually and play an important role in the result of the image edit.
Therefore, we propose to adjust these parameters based on the entered text using natural language processing and classifiers. In this sense, our proposal consists of using the StyleCLIP method and adjusting its numerical parameters automatically in order to solve the problem. The results obtained through surveys demonstrate the satisfaction of users with the provided solution.
Our main contributions are as follows:
- We propose the use of a natural language processing model and classifiers to adjust the numeric parameters of StyleCLIP based on the input text.
- We have developed a mobile application to facilitate the user experience.
- We present an analysis of our method and a comparison with state-of-the-art approaches.
This paper is organized as follows. In Section 2, similar solutions from the literature are explained and compared. In Section 3, we explain the technologies related to StyleCLIP, StyleGAN, CLIP and NLP, along with some terms related to portrait image manipulation, and we detail the contribution of our proposal. Section 4 details the experiments and their results. Finally, Section 5 concludes the paper.
2 RELATED WORKS
The manipulation of images guided by text has been approached in different works. In this section, we discuss the existing solutions, how they apply GANs and how our proposed approach differs from them.
LatteGAN (Matsumori et al., 2021) uses a Visually Guided Language Attention (Latte) module to extract text representations for the generator and a U-Net discriminator to generate images while tackling Multi-Turn Image Manipulation (MTIM). This model can perform iterative manipulations to the images while exploring the capabilities of GANs in image manipulation. In order to achieve this, LatteGAN focuses on distinguishing objects in an image and then performs manipulations guided by text prompts instructing it to add or remove certain objects from the picture. Our approach differs in this aspect, since our goal is to perform manipulations on human portraits and their many different features, not to add or remove unrelated objects. However, the application of MTIM could prove valuable in the construction of a model able to give more control to the user while only requiring images and text prompts as inputs.
In order to manipulate human portraits, the PIE (Tewari et al., 2020) model uses StyleGAN and StyleRig. With the latter, the model gains control over certain portrait features such as the head pose, expression and lighting. The model highlights the potential of StyleGAN, which we adopt through the application of StyleCLIP. Though PIE proved to be a useful and valuable solution to the manipulation of portraits, we did not apply it due to the time and computation constraints of our project. Though we aim to perform different and broader manipulations of features such as age, gender and hair color, StyleRig could yield better results in the manipulation of expressions and lighting in a portrait.
The authors of GuidedStyle (Hou et al., 2022) ap-
proach the manipulation of images as a style trans-
fer problem, similar to StyleCLIP. Their model uses
a residual attention network (RA-MLP) to transform
the original latent code. With the application of RA-
MLPs, the model is able to gradually edit certain at-
tributes of the face in the image. To achieve a dis-
entangled manipulation, the authors added limitations
such as attending different layers of the style, and only
retaining the strongest manipulations on each layer.
This allows the model to perform precise manipulations. However, it is limited to editing attributes that
were previously used to train the model. Although our
approach and contribution enhance the precision of
StyleCLIP, the method used in GuidedStyle is deeper
in its implementation, and could prove useful for fu-
ture projects.
Finally, the closest work to our approach is StyleCLIP (Patashnik et al., 2021), which showcases the power of StyleGAN applied with CLIP to perform a wide range of manipulations on human portraits. Unlike the GuidedStyle model (Hou et al., 2022), it is not restricted to the attributes available in the RA-MLP. However, it requires numeric parameters that control the disentanglement and strength of the manipulations. The best values of these parameters vary from case to case, depending on the feature of the portrait that is being edited, as well as on how much impact the manipulation should have. Our approach uses a simple classification model to adjust these parameters from the text prompt, setting the numeric parameters according to the possible scenarios mapped in our contribution. Compared to a simple implementation of StyleCLIP with preset values for every case, our approach enhances the manipulations.
3 IMAGE MANIPULATION WITH
StyleGANs
3.1 Preliminary Concepts
The manipulation of images through machine learning methods involves particular challenges. For example, the resulting image should still have the same content as the original, but with a different style, according to the desired changes. If the interaction between the user and the solution is to be kept to a minimum, then the model should receive the information regarding the desired manipulation and pair it with visual concepts accurately.
3.1.1 Style-Based Generation of Images
Generative adversarial networks (GANs) use two neural networks to create artificial samples similar to the training ones (Goodfellow et al., 2014). The first network, the generator, produces samples, and the second network, called the discriminator, determines whether a sample comes from the original training set or not.
Both networks are trained together until the generator outputs samples that the discriminator classifies as part of the training set. This architecture can be seen in Figure 1. Given a training set of human faces, a GAN could generate an artificial human face that should look realistic. However, this does not allow for much manipulation of the generated image.
Figure 1: Typical Generative Adversarial Networks (GAN) architecture (Vint et al., 2021).
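The adversarial setup can be made concrete with a minimal sketch (our illustration with a toy generator and discriminator on random tensors, not StyleGAN or any model used in this work): one discriminator update pushes real samples toward label 1 and generated samples toward label 0, and one generator update tries to make generated samples be classified as real.

```python
import torch
import torch.nn as nn

latent_dim, img_dim = 64, 32 * 32
# Toy generator and discriminator; real GANs use much deeper convolutional networks.
G = nn.Sequential(nn.Linear(latent_dim, 256), nn.ReLU(), nn.Linear(256, img_dim), nn.Tanh())
D = nn.Sequential(nn.Linear(img_dim, 256), nn.LeakyReLU(0.2), nn.Linear(256, 1))

opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCEWithLogitsLoss()

real = torch.rand(16, img_dim) * 2 - 1   # stand-in for a batch of real training images
z = torch.randn(16, latent_dim)          # latent noise fed to the generator

# Discriminator step: real samples -> 1, generated samples -> 0.
fake = G(z).detach()
loss_d = bce(D(real), torch.ones(16, 1)) + bce(D(fake), torch.zeros(16, 1))
opt_d.zero_grad(); loss_d.backward(); opt_d.step()

# Generator step: try to make the discriminator output 1 on generated samples.
loss_g = bce(D(G(z)), torch.ones(16, 1))
opt_g.zero_grad(); loss_g.backward(); opt_g.step()
```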
The content of an image and its style can be sepa-
rated, so it’s possible to change the style of an image
while preserving its content using adaptive instance
normalization (AdaIN) (Huang and Belongie, 2017).
To do this, instance normalization (IN) performs style
normalization by normalizing feature statistics, which
carry the style information of an image. This IN is
extended to receive content input and style input, and
then adjust the mean and variance of the content input
to match those of the style input.
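As a rough illustration of the AdaIN operation described above (a NumPy sketch on toy feature maps, not the implementation from Huang and Belongie), each content channel is normalized and then rescaled and shifted with the per-channel statistics of the style features:

```python
import numpy as np

def adain(content, style, eps=1e-5):
    """Adaptive instance normalization over (C, H, W) feature maps:
    normalize each content channel, then match its mean and std to
    those of the corresponding style channel."""
    c_mean = content.mean(axis=(1, 2), keepdims=True)
    c_std = content.std(axis=(1, 2), keepdims=True) + eps
    s_mean = style.mean(axis=(1, 2), keepdims=True)
    s_std = style.std(axis=(1, 2), keepdims=True) + eps
    return s_std * (content - c_mean) / c_std + s_mean

content = np.random.randn(64, 16, 16)   # toy content feature map
style = np.random.randn(64, 16, 16)     # toy style feature map
stylized = adain(content, style)
```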
Both style transfer and the GAN architecture serve as the base for StyleGAN (Karras et al., 2021). This model achieves the separation of high-level attributes and stochastic variations in the generated images, such as freckles and hair, obtaining better results than a regular GAN. The generator network in
StyleGAN starts with a learned constant input and ad-
justs the style of the image at each convolution layer.
StyleGAN can control the image synthesis via scale-
specific modifications to the styles.
3.1.2 Contrastive Language-Image Pre-Training
Computer vision systems are typically trained to predict a fixed set of predetermined object categories. Because of this, labeled data is needed whenever the model must identify a new visual concept. Contrastive Language-Image Pre-training, or CLIP (Radford et al., 2021), attempts to remove this restriction.
The neural network in the CLIP model is trained on 400 million image-text pairs harvested from the Web (Patashnik et al., 2021). Instead of trying to determine the exact word associated with an image, CLIP uses a contrastive representation to predict which of the possible image-text pairings actually occurred in a batch.
To do this, it uses a text encoder and an image encoder. Figure 2 shows how the text and image encoders embed the text-image pairs used in contrastive training, and how these encoders also enable zero-shot prediction. Natural language is used to reference learned visual concepts and to describe new ones.
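The contrastive idea can be illustrated with a small NumPy sketch: given L2-normalized image and text embeddings (random toy vectors here, not real CLIP outputs), the pairing scores are simply cosine similarities, and zero-shot prediction picks the prompt with the highest similarity to a given image.

```python
import numpy as np

rng = np.random.default_rng(0)

def normalize(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

# Toy stand-ins for a batch of 4 image and 4 text embeddings of dimension 512.
image_emb = normalize(rng.standard_normal((4, 512)))
text_emb = normalize(rng.standard_normal((4, 512)))

# Contrastive objective: the diagonal of this similarity matrix (the true
# pairings) should score higher than the off-diagonal entries.
logits = image_emb @ text_emb.T

# Zero-shot prediction for the first image: choose the most similar prompt.
predicted_prompt_index = int(np.argmax(logits[0]))
```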
3.1.3 Classification Models and Natural
Language Processing
Natural language processing (NLP) is a broad field
that combines linguistics and artificial intelligence in
order to understand, analyse and even generate lan-
guages that humans use naturally (Abdar et al., 2021).
The use of NLP allows a better interaction between
the human and the computer, which is useful when
trying to develop a solution for people without tech-
nological expertise.
The classification of natural language text prompts can be achieved with decision trees in a fast and accurate manner (Boros et al., 2017). Decision trees have the advantage of a clear structure that makes them easy to understand conceptually and to implement. By selective pruning and fine-tuning, decision trees can also be used in the feature selection process of machine learning algorithms. Deep learning techniques have been used for more complex NLP-related tasks (Abdar et al., 2021).
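As an illustration of this kind of classifier (a toy sketch with made-up prompts and labels, not our final model), a decision tree over bag-of-words features can be assembled with scikit-learn:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier

# Toy prompts and labels in the spirit of the categories defined later in the paper.
texts = ["blue eyes", "pink hair", "old man", "female face", "pale face", "freckled face"]
labels = ["specific", "specific", "entangled", "entangled", "medium", "medium"]

# Bag-of-words features feed a small decision tree classifier.
clf = make_pipeline(CountVectorizer(), DecisionTreeClassifier(random_state=0))
clf.fit(texts, labels)
print(clf.predict(["green eyes"]))  # classify an unseen prompt with the toy tree
```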
3.1.4 Text Driven Manipulation of Images
Thanks to the potential opened by CLIP, models like StyleCLIP try to guide the style-based manipulation of images with text (Patashnik et al., 2021). To achieve disentangled manipulation of images, a text prompt describing an attribute is mapped into a single global direction in StyleGAN's latent space.
This desired manipulation, called s, should yield an image where that attribute is introduced or amplified, without significantly affecting other attributes. The manipulation strength of s is defined by the parameter α. First, the CLIP text encoder is used to obtain a vector t in CLIP's joint language-image embedding space.
The objective is to map this vector into a manipulation direction s in the latent space S of StyleGAN. To compute stable directions, StyleCLIP uses prompt engineering, a process that consists of feeding sentences with the same meaning as the prompt to the text encoder and averaging their embeddings.
In particular, StyleCLIP requires a text description of
the target attribute that will be edited, and neutral text.
For example, to manipulate faces, the target prompt
could be “old man’s face”, and the neutral class de-
scribed as “face”.
StyleCLIP then applies prompt engineering to average the embeddings and produce t. After t is determined from natural language, StyleCLIP determines the channel relevance for each channel c of S using mean projections.
Having estimated the relevance R_c of each channel to the desired manipulation, StyleCLIP ignores the channels whose R_c is below the disentanglement threshold β. This parameter controls the degree of disentanglement in the manipulation: higher threshold values result in more disentangled manipulations, but at the same time the visual effect of the manipulation is reduced.
For high-level attributes like age or gender, the
manipulation involves a combination of several lower
level attributes (for example, grey hair, wrinkles, and
skin color), making multiple channels relevant, so a
lower β is preferable in these cases.
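The role of the two parameters can be summarized with a short sketch (our simplification of the procedure described above, using a toy relevance vector): channels whose relevance falls below β are zeroed out, and the surviving direction is scaled by α before being applied in StyleGAN's style space S.

```python
import numpy as np

def global_direction(relevance, alpha, beta):
    """Simplified sketch of the global-direction step: ignore channels whose
    relevance is below the disentanglement threshold beta, then scale the
    remaining direction by the manipulation strength alpha."""
    direction = np.where(np.abs(relevance) >= beta, relevance, 0.0)
    return alpha * direction

relevance = np.array([0.02, 0.15, 0.30, 0.05])   # toy per-channel relevance R_c
s_edit = global_direction(relevance, alpha=4.0, beta=0.1)
# latent_edited = latent + s_edit  (applied to the style code in S)
```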
3.2 Method
We noticed the presence of a large group of users who
are not familiar with technology, for whom the image
manipulation process is cumbersome. StyleCLIP is a method that allows editing images from text in such a way that the final edit is related to the entered text. This could prove useful for these users thanks to the potential of this method.
However, considering the characteristics of the
users, the entire solution should be developed so that
it would not require expertise or many parameters be-
ing directly manipulated by the user.
Therefore, we proposed an addition to the Style-
CLIP implementation: a self-modulator model based
on natural language processing that adjusts the nu-
meric parameters of StyleCLIP based on the target
text.
The purpose is to find the best values for the nu-
merical parameters in such a way that the generated
image is as close as possible to the target text descrip-
tion.
In its global manipulation approach, StyleCLIP
requires the following parameters:
- Target text prompt: Describes the desired image with the changes to be made. For example, given a portrait, the target text could be “a happy face”.
- Neutral class, or neutral text prompt: Describes what the input image contains. It should be a concise and factual description.
- Manipulation strength (α): The numerical parameter that defines how much impact the generated global manipulation direction has on the image.
- Disentanglement threshold (β): The numerical parameter that defines how many attributes of the image are affected by the manipulation.
Figure 2: Summary of the approach used in CLIP (Radford et al., 2021).
The α and β numerical parameters have a considerable impact on the final result of the edit. In that sense, depending on the type of edit, there are optimal values for these numerical parameters that bring the edit closer to what the target text defines.
The impact of α is different from the impact of β, as each parameter affects a different aspect of the image editing process. The value of the manipulation strength α defines the intensity of the edit: the higher its value, the closer the generated image will be to the description given in the target text.
Likewise, if α takes negative values, the generated image moves away from the description given in the target text, to the point of depicting the opposite of what is described.
It is important to mention that the minimum value that α can take is -10, while the maximum value is 10. The disentanglement threshold β is a parameter that limits the breadth of the edits that are made; in other words, β limits the number of changes allowed in the generated image.
The higher the value of β, the more specific and precise the image changes are, since fewer changes are allowed. Conversely, the lower the value of this parameter, the broader the changes, allowing more aspects of the generated image to be modified. In this case, the minimum value that β can take is 0.08, while the maximum value is 0.3.
For example, changing the age of a person re-
quires the manipulation of several attributes of the im-
age, so it requires a low β.
Also, to make the aging effect noticeable, a high α amplifies the strength of the manipulation. On the other hand, suppose you want to edit a portrait image so that the person in the photo smiles. In this case, a high β value is required, since the changes in the image should be limited so that they only affect the person's mouth. Likewise, the α value should be low so that the smile looks natural.
Figure 3: Model of our proposal.
Figure 3 shows the general concept behind our
proposal. The StyleCLIP numerical parameters are
managed based on the results of the parameter modu-
lator, which receives the target description. It can be
seen that the StyleCLIP model receives five parame-
ters.
In that sense, the parameters that are not numeric
such as the input image, the neutral description and
the target description are given by the user.
On the other hand, numerical parameters such as
the manipulation strength (α) and the disentangle-
ment threshold (β) are determined from the target de-
scription. In the parameter modulator, a natural lan-
guage processing (NLP) model is used as part of the
process.
The process followed by the parameter modulator consists of the following steps (a minimal sketch is given after the list):
- Step 1: Receive the target description text and use an NLP model to classify it into one of three defined categories.
- Step 2: Determine the values of the numerical parameters based on the class assigned to the target description; numerical values are defined for each category or class.
- Step 3: Set the determined numeric parameters as the output of the parameter modulator and send them as input parameters to the StyleCLIP model.
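The sketch below illustrates these three steps; `classify_prompt` stands in for the trained NLP classifier, and the preset α and β values are placeholders rather than the tuned values reported in Section 4.

```python
# Placeholder presets per category; the actual values are chosen experimentally (Section 4).
PRESETS = {
    "specific":  {"alpha": 2.0, "beta": 0.25},
    "medium":    {"alpha": 5.0, "beta": 0.15},
    "entangled": {"alpha": 8.0, "beta": 0.10},
}

def classify_prompt(target_text: str) -> str:
    """Stand-in for the trained NLP classifier (Step 1)."""
    return "medium"

def modulate_parameters(target_text: str, neutral_text: str = "a face") -> dict:
    category = classify_prompt(target_text)          # Step 1: classify the target text
    alpha = PRESETS[category]["alpha"]               # Step 2: look up the preset values
    beta = PRESETS[category]["beta"]
    # Step 3: return the parameters that are forwarded to the StyleCLIP model.
    return {"target": target_text, "neutral": neutral_text, "alpha": alpha, "beta": beta}

print(modulate_parameters("a happy face"))
```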
To implement this proposal, a model capable of clas-
sifying a given text was developed and trained. Also,
a dataset, which contains a wide variety of classified
texts, was used in the training of the proposed model.
The text in question to be classified is the target
description or target text, which corresponds to one
of the StyleCLIP parameters.
In this approach, we consider three possible categories for the classification:
- Specific: The target text describes specific or small manipulations, such as changes in hair color or eye color. For example, “blue eyes” or “pink hair” should not affect many attributes. Therefore, we set a high disentanglement threshold β and a low manipulation strength α, which should produce fine-grained changes in the image.
- Entangled: The target text describes manipulations that affect many attributes of the image, such as changing gender or age. Target texts such as “old man” or “female face” belong in this category, because age and gender have a deep impact on most attributes. Therefore, we set a low β with a high α to make the effect more noticeable.
- Medium: This category sets both α and β to medium values, since the model did not predict a specific or an entangled manipulation. An example of a prompt in this category would be “freckled face”, since it is not intended to change many things in the image, but it is not a small change either.
The categories mentioned have been defined because
they are considered the most relevant based on the
types of manipulations that have been identified.
In this sense, three types of manipulation have
been identified taking into account the number of
changes in the generated image.
These manipulation types are broad manipula-
tions, specific manipulations and medium manipula-
tions.
In addition, it is considered that the identified cat-
egories cover the majority of manipulations that the
user can try, and therefore are sufficient to generate
an improvement in the result.
The dataset used consists of a CSV file that contains three columns of words. Each column represents a category or class. Therefore, the words that belong to a class are found in the column that represents that class or category.

Table 1: Dataset example.
Specific      Entangled      Medium
green eyes    baby face      hairy face
blue eyes     old man        freckled face
pink hair     female face    pale face
red mouth     male face      makeup face
small nose    donald trump   crying face
In this sense, using supervised learning, the model
is intended to learn to classify words from the infor-
mation given in the dataset.
To build the dataset, we first proceeded to obtain
a set of words. After that, these words were classified
manually.
For the manual classification, it was necessary to
determine which was the most suitable class for each
word.
For example, if the words were “elderly face”,
then the corresponding class would be “Entangled”,
since we want to generate many changes in the im-
age.
In the same way, if the words were “pale face”, the
corresponding class would be “Medium”, since the
changes are not very specific or small, but neither is it
intended to greatly affect the features of the person in
the image.
Table 1 shows an example of how the dataset is
structured. However, this example does not represent
the total amount of data in the final dataset, which has
more than 20 words per column.
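A dataset with this three-column layout can be turned into (text, label) training pairs with a few lines of pandas; the file name used here is an assumption, not the project's actual file.

```python
import pandas as pd

# Columns are the three classes: Specific, Entangled, Medium (see Table 1).
df = pd.read_csv("target_texts.csv")

pairs = [
    (text, column.lower())        # e.g. ("green eyes", "specific")
    for column in df.columns
    for text in df[column].dropna()
]
print(pairs[:3])
```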
After the proposed model is able to classify the
words, the α and β values are defined, which depend
directly on the assigned class or category.
For each class or category, static values for α and
β are defined.
The defined values of α and β should optimize the
result in the generated image taking into account the
class assigned to the target description.
To define the most appropriate values of α and β
for the case of each category, a series of experimenta-
tion processes are followed, which will be described
in Section 4.
Also, the classifier model has to be previously
trained with the mentioned dataset.
We limited the target texts used in the dataset to
the ones that could be applied to human faces. We
also set the neutral text as “a face”.
This is configured this way because the available StyleCLIP implementation only works with images of faces.
Figure 4: Text Classifier for Parameter Tuning.
Once the model predicts the category of the desired manipulation based on the target text, we adjust the α and β values accordingly and send all four StyleCLIP parameters mentioned above to the model.
In Figure 4, the processes that constitute the parameter modulator are shown in detail. First, the target description or target prompt text is entered into the NLP classifier. This classifier determines what kind of edit the target description represents. To define the type of edit, a process structured as a decision tree is followed.
In this sense, we first identify whether the target prompt text represents a detailed and specific edit. If so, a low value is assigned to the manipulation strength (α) and a high value to the disentanglement threshold (β).
However, if it is not identified as a minor or specific edit, the classifier proceeds to identify whether the target description represents an edit that affects many attributes of the image. If this is the case, a high value is assigned to the manipulation strength (α) and a low value to the disentanglement threshold (β).
On the other hand, if it is not identified as a broad edit, medium values are assigned to both α and β.
Once the most appropriate values for α and β have
been identified, these are sent to the StyleCLIP model
in order to obtain the resulting manipulated image.
Apart from the numerical parameters mentioned,
to obtain the desired result, StyleCLIP must also re-
ceive other parameters such as the input image, the
neutral description or neutral class and the target de-
scription or target prompt text.
Likewise, it is important to mention that the neu-
tral description will always have the value of “a face”
or “human face”, since all the images to be tested cor-
respond to portraits of human faces.
4 EXPERIMENTS
4.1 Experimental Protocol
We use the StyleCLIP API available on Replicate (https://replicate.com/orpatashnik/styleclip) to perform the experiments and develop the mobile app.
The classification model was developed in Python with the spaCy library, which was chosen for its focus on natural language processing. The model was later encapsulated and deployed in Google Cloud Functions for ease of use.
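A hedged sketch of how such a deployment could look with the functions-framework package is shown below; the request fields, model path and response format are assumptions rather than the project's actual interface.

```python
import functions_framework
import spacy

# Trained spaCy pipeline with a text-categorization component (path is an assumption).
nlp = spacy.load("./textcat_model")

@functions_framework.http
def classify(request):
    """HTTP entry point: receives a target text and returns its predicted category."""
    payload = request.get_json(silent=True) or {}
    doc = nlp(payload.get("target_text", ""))
    category = max(doc.cats, key=doc.cats.get) if doc.cats else "medium"
    return {"category": category, "scores": doc.cats}
```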
Given the motivation for the problem, we developed a mobile app, 'Photo Handler', to reach the audience that would benefit from the solution.
The source code for the classification model and the mobile app is stored in the repositories available at https://github.com/PRY20220107-PhotoHandler.
A deployment guide and the executable for the mobile app, with the model included, are available at https://drive.google.com/drive/folders/1irjtFwA8oS0FLEzIlcIsaTk6XFU4fLZs?usp=sharing
4.2 Results
4.2.1 Parameter Values
We defined three categories for the classification
model: specific, entangled and medium.
First, we estimated a range of values for the ma-
nipulation strength α and disentanglement threshold
β parameters based on the scenarios of each manipu-
lation type.
Originally, the values for α range from -10 to 10, with negative values instructing the model to perform the opposite of what the input text suggests. Meanwhile, the β parameter ranges from 0 to 0.30.
It is important to mention that negative values of α are not considered, since we do not want to perform the opposite edit to what is described in the input text.
We divided those ranges into thirds to obtain low, medium and high values. To determine which values within these ranges are better suited for each manipulation type, we performed a set of tests comparing the results.
- Specific: This type of manipulation is located on a few attributes of the face, such as changes in hair or eye color. It should not affect other features of the portrait; therefore, we tested high values of β ranging from 0.21 to 0.30 and a low manipulation strength α from 0 to 3.3.
- Medium: This category sets both parameters to medium values: α ranges from 3.4 to 6.6 and β from 0.11 to 0.2.
- Entangled: The manipulations affect many attributes of the image and require a noticeable change, such as gender or age. We experimented with low values of β, from 0.08 to 0.11, and high values of α, between 6.7 and 10.

Table 2: Parameter test for different manipulations and target texts (the output images of each test are omitted here).
(a) Specific manipulations using “a face with a bowl cut” as the target text:
    α: 3.30  3.00  2.60  2.00  0.02  1.50  3.30  0.40
    β: 0.21  0.24  0.30  0.21  0.27  0.28  0.30  0.21
(b) Medium manipulations using “a happy face” as the target text:
    α: 6.60  3.40  4.10  4.50  4.10  5.40  3.70  6.20
    β: 0.20  0.11  0.15  0.15  0.19  0.11  0.12  0.11
(c) Entangled manipulations using “an old face” as the target text:
    α: 10.00  6.70  6.80  8.90  7.40  9.00  7.90  6.70
    β:  0.10  0.08  0.09  0.09  0.10  0.10  0.08  0.11
In the first set of tests, to evaluate which values work best for specific manipulations (see Table 2a), we used “a face with a bowl cut” as the input text for the specific manipulation on three different images. We varied the α and β values within the corresponding range.
The results were compared and a value combination was chosen based on the perception of the final images. In the original tables, the input images are shown at the top of each sub-table, and the row highlighted in green marks the test that produced the most realistic edit; the combination of α and β values found in that row was selected. We repeated this process for the medium (see Table 2b) and entangled manipulations (see Table 2c).
We tested the results of the same three images on
the corresponding α and β value ranges, depending
on the manipulation type. We then selected the values
that would yield a satisfactory result.
We used a labelled dataset of target texts to train a classification model with spaCy. The model estimates the probability of a target text belonging to each manipulation category and returns the most likely one. With that information, the mobile app internally sends the corresponding α and β values, the neutral text (“a face”) and the target text to the StyleCLIP API.
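For illustration, the call to the StyleCLIP API could look like the sketch below, written in Python even though the app itself is mobile; the input field names and the use of the latest model version are assumptions that would need to be checked against the deployed Replicate model, and the file and values are placeholders.

```python
import replicate  # requires the REPLICATE_API_TOKEN environment variable

# Values produced by the parameter modulator for an "entangled" prompt (placeholders).
inputs = {
    "image": open("portrait.jpg", "rb"),
    "neutral": "a face",
    "target": "an old face",
    "manipulation_strength": 8.0,        # alpha
    "disentanglement_threshold": 0.10,   # beta
}

# Field names are assumptions, not the model's documented schema; in practice a
# specific model version tag may need to be appended after a colon.
output = replicate.run("orpatashnik/styleclip", input=inputs)
print(output)
```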
4.2.2 User Satisfaction
To validate our results, and due to the visual nature of the image manipulations, we performed experiments focused on the perception of the resulting images. A group of 58 people was surveyed on the results obtained with the finished solution.
These people were between 18 and 50 years old and wanted to edit their photos but did not have the digital skills to do so. Each person was shown the pictures in Figures 5a, 5b and 5c. They were then asked whether the results matched what they would expect and whether they would be satisfied with them.
First of all, it is important to mention that each
figure shown to the people surveyed represents a type
of manipulation. In Figure 5a, a specific type of manipulation is shown; in Figure 5b, a medium type of edit is shown; and in Figure 5c, an entangled type of edit is shown.
Therefore, through the survey, it is possible to validate the rate of user satisfaction with respect to each type of manipulation. Table 3 shows the results of the survey, listing the percentage of people who were satisfied with the edit for each type of manipulation.

Figure 5: Various results for different manipulations.
(a) Specific result with the target text “red haired face”.
(b) Medium result with the target text “smiley face”.
(c) Entangled result with the target text “baby face”.

Table 3: Perception of the resulting images.
Manipulation type   Users satisfied with the result   Users not satisfied with the result
Specific            87.9%                             12.1%
Medium              94.8%                              5.2%
Entangled           81.0%                             19.0%
4.3 Discussion
As the results from the survey highlight in Table 3,
most users are satisfied with the manipulations. Our
model presents difficulties performing entangled ma-
nipulations, with the lowest user satisfaction rate
(81.0%). Figure 5c highlights how the low disentanglement threshold greatly affects the final result.
On the other hand, the edits shown in Figures 5a and 5b present higher rates of satisfaction
for specific manipulations (87.9%) and medium ma-
nipulations (94.8%), respectively. In this case, the
medium manipulations present the highest rate of sat-
isfaction, so it is possible to conclude that in general
the model works better if the numerical parameters
are kept at medium values.
Future work could expand the possible scenarios
mapped for the different types of manipulations. En-
tangled manipulations in particular require attention
in order to avoid noticeable changes in other face fea-
tures that are undesired by the user. Due to time con-
straints, the classification model does not estimate a
numeric value for the parameters.
Instead, the model assigns a predefined set of val-
ues based on the prediction of the kind of manipula-
tion the user intends. A new model based on regres-
sion could potentially have better results in most use
cases of StyleCLIP with a wider range of alpha and
beta values.
As noted, the response times of the model deployed in Google Cloud Functions and of the StyleCLIP API are noticeable in the final implementation. Future work could make modifications to the
mance and processing time, as well as incorporating
the adjustments to manipulation strength and disen-
tanglement threshold inside the model.
5 CONCLUSION
We conclude that the model implemented to adjust the numeric parameters of StyleCLIP makes the edit more faithful to the entered target text, and therefore better results are obtained in
the edited image. This model classifies the target text
based on three categories, and the numeric parame-
ters of StyleCLIP are adjusted based on the category
assigned to the description or target text.
Based on the surveys carried out, we conclude that
the results obtained with the presented proposal sat-
isfy the majority of people who want to edit their pho-
tos but don't have the digital skills to do so. It is also not a limitation that the StyleCLIP implementation used only works with portrait images, since those are the types of images we want to edit to solve the described problem.
In future work, we would like to improve the model with respect to the adjustment of the numerical parameters, so that they are predicted automatically from the target text. In the current proposal, the numerical parameters are set to predefined values based on the category assigned to the target description (Leon-
Urbano and Ugarte, 2020; Ysique-Neciosup et al.,
2022). Additionally, the proposed model could also
be extended to other types of images or 3D mod-
els (Guillermo et al., 2022).
REFERENCES
Abdar, M., Pourpanah, F., Hussain, S., Rezazadegan, D.,
Liu, L., Ghavamzadeh, M., Fieguth, P. W., Cao, X.,
Khosravi, A., Acharya, U. R., Makarenkov, V., and
Nahavandi, S. (2021). A review of uncertainty quan-
tification in deep learning: Techniques, applications
and challenges. Inf. Fusion, 76.
Boros, T., Dumitrescu, S. D., and Pipa, S. (2017). Fast and
accurate decision trees for natural language process-
ing tasks. In RANLP. INCOMA Ltd.
Goodfellow, I. J., Pouget-Abadie, J., Mirza, M., Xu, B.,
Warde-Farley, D., Ozair, S., Courville, A. C., and Ben-
gio, Y. (2014). Generative adversarial nets. In NIPS.
Guillermo, L., Rojas, J.-M., and Ugarte, W. (2022). Emo-
tional 3d speech visualization from 2d audio visual
data. International Journal of Modeling, Simulation,
and Scientific Computing, 0(0):2450002.
Hou, X., Zhang, X., Liang, H., Shen, L., Lai, Z., and Wan,
J. (2022). Guidedstyle: Attribute knowledge guided
style manipulation for semantic face editing. Neural
Networks, 145.
Huang, X. and Belongie, S. J. (2017). Arbitrary style trans-
fer in real-time with adaptive instance normalization.
In ICCV. IEEE.
Karras, T., Laine, S., and Aila, T. (2021). A style-based
generator architecture for generative adversarial net-
works. IEEE Trans. Pattern Anal. Mach. Intell.,
43(12).
Leon-Urbano, C. and Ugarte, W. (2020). End-to-end elec-
troencephalogram (EEG) motor imagery classification
with long short-term. In SSCI, pages 2814–2820.
IEEE.
Matsumori, S., Abe, Y., Shingyouchi, K., Sugiura, K., and
Imai, M. (2021). Lattegan: Visually guided language
attention for multi-turn text-conditioned image manip-
ulation. IEEE Access, 9.
Patashnik, O., Wu, Z., Shechtman, E., Cohen-Or, D., and
Lischinski, D. (2021). Styleclip: Text-driven manipu-
lation of stylegan imagery. In ICCV. IEEE.
Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G.,
Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark,
J., Krueger, G., and Sutskever, I. (2021). Learning
transferable visual models from natural language su-
pervision. In ICML, volume 139. PMLR.
Shen, Y., Gu, J., Tang, X., and Zhou, B. (2020). Interpreting
the latent space of gans for semantic face editing. In
CVPR. IEEE.
Tewari, A., Elgharib, M., R., M. B., Bernard, F., Seidel, H., Pérez, P., Zollhöfer, M., and Theobalt, C. (2020). PIE: portrait image embedding for semantic control. ACM Trans. Graph., 39(6).
Vint, D., Anderson, M., Yang, Y., Ilioudis, C. V., Cate-
rina, G. D., and Clemente, C. (2021). Automatic
target recognition for low resolution foliage penetrat-
ing SAR images using cnns and gans. Remote. Sens.,
13(4).
Vázquez, B. C. (2021). El papel de los influencers en la creación y reproducción del estereotipo de belleza femenina en instagram [The role of influencers in the creation and reproduction of the female beauty stereotype on Instagram]. Master's thesis, Universidad de Salamanca.
Xu, X., Chen, Y., Tao, X., and Jia, J. (2022). Text-
guided human image manipulation via image-text
shared space. IEEE Trans. Pattern Anal. Mach. Intell.,
44(10).
Ysique-Neciosup, J., Chavez, N. M., and Ugarte, W. (2022).
Deephistory: A convolutional neural network for au-
tomatic animation of museum paintings. Comput. An-
imat. Virtual Worlds, 33(5).
Zhu, D., Mogadala, A., and Klakow, D. (2019). Image manipulation with natural language using two-sided attentive conditional generative adversarial network. CoRR, abs/1912.07478.