The Analysis of Image Inpainting Based on Pix2Pix Model and Mix
Loss
Xu Yan
School of Statistics and Data Science, Nankai University, Tianjin, China
Keywords: Image Translation, Pix2Pix, Loss Function.
Abstract: This paper investigates the application of image translation, an important use case for generative adversarial
networks (GANs) that has received widespread attention from scholars in recent years. This study builds an
image-to-image translation (Pix2Pix) model with a U-net as the generator and a PatchGAN as the discriminator
and observes the performance obtained by adjusting the loss function of the generator. The first experiment modifies
the scale factors for the GAN loss and the Mean Absolute Error loss (L1 loss), and the second replaces the L1
loss with the squared loss (L2 loss). After comparing the image authenticity and detail handling of
the different results, it is observed that an overall better translation is achieved when the scale factor is set to 1:100.
If finer detail handling is required, lowering the scale factor to 1:10 can be beneficial. However, it is also found
that including the L2 loss in the generator loss function does not yield favorable results. This work provides guidance for
future choices of hyperparameters for the pix2pix model and lays the foundation for further research into loss
functions.
1 INTRODUCTION
Image inpainting is a technique that repairs
damaged portions of target images, reconstructing
them to generate high-quality results that closely
approximate the semantics of the original images. In
recent years, with advances in computing power and
the rapid development of machine learning,
achievements in computer vision have greatly
enhanced science, technology, and the quality of
human life. Deep learning-based image processing
techniques play a crucial role in many practical
applications, such as object removal in image editing,
restoration of old photos, repair of occluded portions
of specific objects, facial restoration, and more (Lecun
1998). Image inpainting is currently one of the main
focal points of research in the field of computer vision.
The inception of Convolutional Neural Networks
(CNNs) made it possible to extract image semantics
and features, making them one of the earliest neural
network models employed in image processing.
Moreover, owing to their capability to extract image
features, CNNs can also be utilized in tasks like
texture synthesis and image style transfer (Gatys et al
2015, Gatys et al 2016). The introduction and widespread application of
Generative Adversarial Networks (GAN) have further
enhanced the visual outcomes of image inpainting
(Goodfellow et al 2014). In the realm of image
processing using GANs, Yu et al. introduced the
concept of gated convolutions, which elevated the
effectiveness of image restoration (Yu et al 2019).
However, due to the larger model size and a high
number of parameters, training costs are significantly
increased. Also, because of the uncertainty in filling
the missing regions of damaged images using regular
GAN methods, it is challenging to determine the
inpainting area, which can even lead to severe
restoration errors. Wang et al. proposed a two-stage
visual consistency network, consisting of a mask
prediction module and a robust restoration module
(Wang et al 2020). This approach significantly
improved the model's generalization ability. By
incorporating image semantic understanding through
an attention mechanism, the precision of the restored
images can be further enhanced (Yu et al 2018).
While this approach often leads to image
discontinuity issues, Liu et al. introduced a coherent
semantic attention mechanism, which focuses on the
interrelation of deep features in the area to be
restored (Liu et al 2019). This
effectively resolves color discontinuities and
boundary distortions. Image-to-image translation with
Conditional Adversarial Networks (Pix2Pix) is an
image translation technique based on GANs. It is used
to transform images from one domain to another.
The same model architecture exhibits varying
gradients under different loss functions, thereby
affecting training efficiency and outcomes. By
modifying the loss function expression, the clarity and
coherence of generated images can be enhanced.
The main objective of this study is to modify the
loss function formula of the generator to observe the
effects of different loss functions on image translation
results. The pix2pix model employs Mix Loss, a
combination of GAN loss and Mean Absolute Error
loss (L1 loss), to define the generator's loss function.
This article aims to explore the advantages and
limitations of this approach. First, the differences in
training behavior between including and excluding
the L1 loss in the generator's loss function are
observed. Second, multiple experiments are
conducted by adjusting the ratio of L1 loss to GAN
loss in the generator of the pix2pix model. The
prediction results are obtained for the same input data
from different models trained for the same number of
iterations. Third, the training results of the model at
the same epoch under different coefficient ratios are
observed and compared. In addition, this paper
employs No-Reference Blind Video Quality
Assessment (NR-BVQA) in combination with human
subjective perception to assess the coherence and
authenticity of generated images. The experimental
results demonstrate that using only the GAN loss can
lead to gradient explosion, and if the L1 loss ratio is
too small, it can result in severe image distortion,
while if the L1 loss ratio is too large, it can cause the
image to become overly blurry. In particular, if L1 loss
is replaced with the squared loss (L2 loss),
although it can accelerate convergence, the final
results may become excessively smooth and lose
image details.
2 METHODOLOGY
2.1 Dataset Description and
Preprocessing
The dataset used in this study, called “facade”, is
sourced from Kaggle's pix2pix dataset (Dataset). This
dataset consists of paired images, with each pair
containing a photograph of a building and its
corresponding sketch at the same resolution. In this
dataset, the network is trained to translate hand-
drawn sketches of buildings into photographs. The entire dataset
comprises 400 pairs of training data and 106 pairs of
test data. Figure 1 shows an example of the training
data.
Figure 1: Images from the facade dataset (Original).
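To make the data format concrete, a minimal loading sketch is given below. It assumes a PyTorch/torchvision setup and that each file in the Kaggle release stores the photograph and its sketch side by side in a single image; the file path, output size, and the left/right ordering of the two halves are illustrative assumptions rather than details reported in this paper.

```python
# Minimal sketch of loading one paired "facade" example, assuming each file
# stores the photo and its sketch side by side (an assumption about the release).
from PIL import Image
import torchvision.transforms.functional as TF

def load_pair(path, image_size=256):
    combined = Image.open(path).convert("RGB")
    w, h = combined.size
    left = combined.crop((0, 0, w // 2, h))      # one half of the pair
    right = combined.crop((w // 2, 0, w, h))     # the other half
    # Resize and scale to [-1, 1], the usual input range for a tanh generator.
    prep = lambda img: TF.to_tensor(TF.resize(img, [image_size, image_size])) * 2 - 1
    # Which half is the photo and which is the sketch depends on the release,
    # so the (input, target) order below may need to be swapped.
    return prep(right), prep(left)
```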
2.2 Proposed Approach
Under the same model, different choices of loss
functions can impact the convergence of the loss
function, the model's learning rate, and the accuracy
of predictions, significantly influencing the model's
performance. This paper places a strong emphasis on
analyzing the performance of the pix2pix model under
different generator loss functions. This analysis
includes modifying the hyperparameters of the loss
function and changing the composition structure of the
generator loss. By observing the prediction results of
the same sketch under different loss functions, it seeks
to explore the strengths and weaknesses of the loss
function choices as discussed in the original paper
(Isola et al 2017). In the end, the paper aims to provide
insights into selecting the most suitable loss function
for different tasks. The experimental workflow is
illustrated in Figure 2.
Figure 2: The pipeline of the study (Original).
2.2.1 L1 Loss
The L1 Loss is calculated by adding the absolute
differences between each predicted value and its
corresponding target value, and then taking the
average. The mathematical expression for L1 Loss is:

$$L_1 = \frac{1}{n}\sum_{i=1}^{n}\left| y_i - \hat{y}_i \right| \quad (1)$$

where $\hat{y}_i$ denotes the predicted value and $y_i$ denotes the corresponding true value. One characteristic of L1
Loss is that it's insensitive to outliers because it
employs absolute differences. This property makes it
perform well on datasets with a considerable amount
of noise. L1 Loss can be used to measure the absolute
difference between generated images and target
images. This helps improve the stability of the
generator, making the generated images closer to the
target images. During training, L1 Loss serves as an
important feedback signal, assisting the generator in
gradually producing more realistic images.
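As a minimal code sketch of Eq. (1), assuming a PyTorch setup, the L1 term can be computed as follows; the tensors below are random placeholders standing in for the generator output and the target photograph.

```python
import torch
import torch.nn as nn

# L1 (mean absolute error) between a generated image and its target, as in Eq. (1).
l1_criterion = nn.L1Loss()                        # averages |y_i - y_hat_i| over all elements
fake_image = torch.randn(1, 3, 256, 256)          # placeholder generator output
real_image = torch.randn(1, 3, 256, 256)          # placeholder target photograph
loss_l1 = l1_criterion(fake_image, real_image)    # scalar feedback signal for the generator
```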
2.2.2 L2 Loss
L2 Loss, also known as Mean Squared Error (MSE), is
a loss function commonly used for regression
problems. The mathematical expression for L2 Loss
is as follows:

$$L_2 = \frac{1}{n}\sum_{i=1}^{n}\left( y_i - \hat{y}_i \right)^2 \quad (2)$$
The advantage of L2 loss is that it is a smooth,
continuously differentiable function, which makes it
easy to handle in optimization algorithms like
gradient descent. Additionally, it is typically a convex
function, implying it has a global minimum.
However, L2 loss calculates errors using squared
terms, which means it is more sensitive to large errors
because squaring amplifies the impact of these errors.
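A small numerical example, assuming a PyTorch setup and using invented error values purely for illustration, makes this outlier sensitivity concrete:

```python
import torch
import torch.nn as nn

# One large error among several small ones: L1 grows linearly with the outlier,
# while the squared term in L2 (MSE) lets the outlier dominate the average.
errors = torch.tensor([0.1, 0.1, 0.1, 5.0])
target = torch.zeros_like(errors)
print(nn.L1Loss()(errors, target))   # tensor(1.3250)
print(nn.MSELoss()(errors, target))  # tensor(6.2575)
```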
2.2.3 GAN Loss
Adversarial loss, one of the primary components of
the generator loss, is typically implemented as a
Binary Cross-Entropy (BCE) loss. It drives the
model's training by having the generator and
discriminator compete with each other. The
generator aims to create
more realistic data, while the discriminator strives to
differentiate between real and generated data. This
adversarial training approach leads to continuous
improvement in the generator's ability to produce
more authentic data. However, GAN loss training is
often less stable compared to traditional supervised
learning because the competition between the
generator and discriminator can result in oscillations
during the training process. Therefore, it's crucial to
carefully select hyperparameters and employ various
techniques to stabilize the training. Additionally,
during the experiments, we also attempted to replace
L1 Loss with L2 Loss and observed the training
results.
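The sketch below shows how such a mixed generator objective can be assembled, assuming a PyTorch setup; the names generator, discriminator, and lambda_l1 are illustrative, and the conditional PatchGAN input (the sketch concatenated with the generated photo) follows the original pix2pix formulation. Replacing nn.L1Loss with nn.MSELoss yields the L2 variant examined later.

```python
import torch
import torch.nn as nn

bce = nn.BCEWithLogitsLoss()   # adversarial (GAN) term
l1 = nn.L1Loss()               # reconstruction term
lambda_l1 = 100.0              # GAN Loss : L1 Loss ratio of 1:100

def generator_loss(generator, discriminator, sketch, real_photo):
    fake_photo = generator(sketch)
    # The conditional discriminator scores (input, output) pairs patch by patch;
    # the generator is rewarded when every patch is judged "real".
    pred_fake = discriminator(torch.cat([sketch, fake_photo], dim=1))
    gan_term = bce(pred_fake, torch.ones_like(pred_fake))
    l1_term = l1(fake_photo, real_photo)
    return gan_term + lambda_l1 * l1_term
```

With lambda_l1 = 100.0 this configuration corresponds to the 1:100 ratio that later yields the best overall translation results.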
2.2.4 U-net
The generator in the pix2pix model uses a U-net
network architecture. It is a deep learning
convolutional neural network architecture consisting
of an encoder and a decoder. The first half is used for
feature extraction, while the second half is used for
upsampling. Its specific network architecture is
shown in Figure 3. In this architecture, the down-
sampling path consists of convolutional layers and
pooling layers, which are used to reduce image
resolution, decrease the spatial size of the image, and
simultaneously extract image features. In contrast, the
up-sampling path serves the opposite purpose and has
a complementary architecture compared to the down-
sampling path. U-Net also employs skip connections
to connect feature maps of different depths between
the encoder and decoder. This helps in transmitting
both low-level and high-level features, addressing the
common issue of information loss. The U-net
network's five pooling layers enable it to achieve
multi-scale feature recognition in images, making it
highly effective for semantic image segmentation.
Figure 3: Unet architecture (Original).
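A compact sketch of this encoder-decoder structure with skip connections is given below, assuming a PyTorch setup; the channel counts and the number of down/up-sampling steps are illustrative and smaller than the full architecture shown in Figure 3.

```python
import torch
import torch.nn as nn

class TinyUNet(nn.Module):
    """Miniature U-net-style generator with skip connections (illustrative only)."""
    def __init__(self, in_ch=3, out_ch=3, base=64):
        super().__init__()
        # Down-sampling path: convolutions halve the spatial size and extract features.
        self.enc1 = nn.Sequential(nn.Conv2d(in_ch, base, 4, 2, 1), nn.LeakyReLU(0.2))
        self.enc2 = nn.Sequential(nn.Conv2d(base, base * 2, 4, 2, 1), nn.LeakyReLU(0.2))
        self.enc3 = nn.Sequential(nn.Conv2d(base * 2, base * 4, 4, 2, 1), nn.LeakyReLU(0.2))
        # Up-sampling path: transposed convolutions restore the resolution.
        self.dec3 = nn.Sequential(nn.ConvTranspose2d(base * 4, base * 2, 4, 2, 1), nn.ReLU())
        self.dec2 = nn.Sequential(nn.ConvTranspose2d(base * 4, base, 4, 2, 1), nn.ReLU())
        self.dec1 = nn.Sequential(nn.ConvTranspose2d(base * 2, out_ch, 4, 2, 1), nn.Tanh())

    def forward(self, x):
        e1 = self.enc1(x)
        e2 = self.enc2(e1)
        e3 = self.enc3(e2)
        d3 = self.dec3(e3)
        d2 = self.dec2(torch.cat([d3, e2], 1))  # skip connection: reuse encoder features
        d1 = self.dec1(torch.cat([d2, e1], 1))  # skip connection at the shallowest depth
        return d1
```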
2.3 Implementation Details
In this experiment, the Adam optimizer is adopted.
Adam has relatively low
demands on storage and computational resources,
making it advantageous for deep neural networks
dealing with large-scale data and parameters.
Additionally, Adam's ability to adaptively adjust the
learning rate can accelerate the model's training
process. In addition, each model is trained for a total
of 20 epochs so that variations in training results
among different models can be observed and
differences in training efficiency can be assessed. In
the selection of hyperparameters, the experiments
test four different ratios of GAN Loss to L1 Loss,
namely 1:100, 1:200, 1:10, and 1:1, and observe the
impact of GAN Loss and L1 Loss on the training
results under these various ratio combinations.
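The following sketch summarizes this training setup, assuming a PyTorch setup; the learning rate and betas are commonly used pix2pix values assumed here rather than reported in the paper, and the tiny convolutional placeholders only keep the snippet runnable.

```python
import torch
import torch.nn as nn

generator = nn.Conv2d(3, 3, 3, padding=1)        # placeholder for the U-net generator
discriminator = nn.Conv2d(6, 1, 3, padding=1)    # placeholder for the PatchGAN

# Adam optimizers; lr and betas are assumed values, not taken from the paper.
opt_g = torch.optim.Adam(generator.parameters(), lr=2e-4, betas=(0.5, 0.999))
opt_d = torch.optim.Adam(discriminator.parameters(), lr=2e-4, betas=(0.5, 0.999))

num_epochs = 20
ratios = {"1:100": 100.0, "1:200": 200.0, "1:10": 10.0, "1:1": 1.0}
for name, lambda_l1 in ratios.items():
    # One model is trained per ratio; the generator objective is
    # gan_loss + lambda_l1 * l1_loss, as sketched in Section 2.2.3.
    print(f"GAN:L1 = {name} -> lambda_l1 = {lambda_l1}, epochs = {num_epochs}")
```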
3 RESULTS AND DISCUSSION
In the results section, the paper showcases and
discusses the image translation outcomes under
various loss functions. Keeping the training dataset
consistent with the pix2pix dataset and maintaining all
training parameters except the loss functions
unchanged, different models translate the same test
image at the same number of training iterations, and
the translated images are then compared to evaluate
the differences between models.
This section consists of two parts: Scale Factor
Selection and Choice of L1 and L2.
3.1 The Performance of Scale Factor
Selection
As can be seen from Figures 4-7, each model has
produced its respective training results within 20 epochs.
From the loss function curves, it is evident that when
the L1 loss proportion is relatively higher, there is a
noticeable downward trend in the loss function. This
leads to good prediction performance even before
complete convergence is achieved. Conversely, when
the L1 loss proportion is too small, the loss function
exhibits strong oscillations, and there is no sign of
convergence even with increased training cycles. This
behavior is attributed to the nature of the GAN loss
itself, which struggles to converge without the
presence of pre-training data. From the translation of
the hand-drawn images, it can be observed that the
translation results are better when the scale factor is
set at 1:10 and 1:100. In comparison, when the L1 loss
proportion is too small, although the image resolution
is higher, it leads to the generation of more
inconsistent regions. On the other hand, in cases where
the L1 loss proportion is higher, since the L1 loss
measures the absolute difference between the original
and predicted images, it results in both reduced image
resolution and enhanced image smoothness. Therefore,
adjusting the scale factor to balance resolution and
smoothness is a key aspect of the experiment.
Figure 4: Loss function curves and results for one of the four tested GAN Loss : L1 Loss ratios (Original).
Figure 5: Loss function curves and results for one of the four tested GAN Loss : L1 Loss ratios (Original).
Figure 6: Loss function curves and results for one of the four tested GAN Loss : L1 Loss ratios (Original).
Figure 7: Loss function curves and results for one of the four tested GAN Loss : L1 Loss ratios (Original).
3.2 The Performance of Choice of L1
and L2
As can be seen from Figures 8-11, L1 loss tends to
generate relatively smooth results but may result in the
loss of some details, whereas L2 loss tends to produce
more precise translated images but may also make the
generated images more susceptible to noise. As can be
seen from Figures 8-11, when the scale factor is set to
1:100, it can be observed that L1 loss converges
significantly while L2 loss does not. This is because
L2 loss is more likely to get stuck in local minima and
may struggle to achieve a smaller loss, whereas L1
loss, being a convex function, does not face this issue.
At the same time, it can also be observed from the
results that when using L2 loss, the images are
noticeably blurrier, and artifacts are introduced. This
occurrence is because the square operation in L2 loss
penalizes large errors but is relatively insensitive to
small errors. As a result, it may not perform well in
terms of fine detail.
Figure 8: Loss function curves and results for one of the compared L1/L2 loss configurations (Original).
Figure 9: Loss function curves and results for one of the compared L1/L2 loss configurations (Original).
Figure 10: Loss function curves and results for one of the compared L1/L2 loss configurations (Original).
Figure 11: Loss function curves and results for one of the compared L1/L2 loss configurations (Original).
In summary, properly setting the ratio between
GAN Loss and L1 Loss can indeed contribute to the
training and translation performance of an image
translation network. Additionally, under specific
requirements, adjusting the values of these scale
factors can be used to effectively control the model's
translation results.
4 CONCLUSION
This study presents the generation results of the
Pix2Pix model under different generator loss
functions. First, it constructs a Pix2Pix network with a
U-net as the generator and PatchGAN as the
discriminator. Second, it trains the model for image
translation using the Kaggle pix2pix dataset as the
training set. Then, it adjusts the scale factors for
GAN loss and L1 loss to identify settings that yield
better overall translation results. The findings indicate that a higher
ratio of L1 loss results in lower resolution, while a
higher ratio of GAN loss leads to inconsistent regions
in the generated images. The study reveals that setting
the scale factor to 1:100 results in images that combine
realism and good resolution. On the other hand, a scale
factor of 1:10 can be employed to improve resolution,
particularly for finer details. Finally, the research
identifies that L2 loss does not promote model
convergence and is less suitable for handling details
compared to L1 loss. Therefore, it is not recommended
for use in this model. In the future, considering the
incorporation of other loss functions, such as SSIM
loss, into the generator loss function is a promising
avenue for enhancing image translation results.
REFERENCES
Y. Lecun, L. Bottou, Y. Bengio, P. Haffner, "Gradient-based learning applied to document recognition," Proceedings of the IEEE, vol. 86, 1998, pp. 2278-2324.
L. Gatys, A. S. Ecker, M. Bethge, "Texture synthesis using convolutional neural networks," Advances in Neural Information Processing Systems, vol. 28, 2015.
L. A. Gatys, A. S. Ecker, M. Bethge, "Image style transfer using convolutional neural networks," Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 2414-2423.
I. Goodfellow, J. Pouget-Abadie, M. Mirza, et al., "Generative adversarial nets," Advances in Neural Information Processing Systems, vol. 27, 2014.
J. Yu, Z. Lin, J. Yang, et al., "Free-form image inpainting with gated convolution," Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 4471-4480.
Y. Wang, Y. C. Chen, X. Tao, et al., "VCNet: A robust approach to blind image inpainting," Computer Vision - ECCV 2020: 16th European Conference, Glasgow, 2020, pp. 752-768.
J. Yu, Z. Lin, J. Yang, et al., "Generative image inpainting with contextual attention," Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 5505-5514.
H. Liu, B. Jiang, Y. Xiao, et al., "Coherent semantic attention for image inpainting," Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 4170-4179.
Dataset, https://www.kaggle.com/datasets/vikramtiwari/pix2pix-dataset
P. Isola, J. Y. Zhu, T. Zhou, et al., "Image-to-image translation with conditional adversarial networks," Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 1125-1134.