Super-Resolution Analysis of Animal Images Based on ESRGAN Model

Shaoxu Li

College of Computer Science and Technology, Beijing Jiao Tong University, Weihai, China

Keywords: Super-Resolution, Animal, Images, Training Dataset.

Abstract: Image super-resolution (SR) plays a crucial role in enhancing image quality. This study applies an Enhanced Super-Resolution Generative Adversarial Network (ESRGAN) model to improving the resolution of animal images. The objective of this paper is to explore how the choice of training dataset affects the SR result on a specific animal target dataset, by comparing models trained on a mixed-species animal dataset with models trained on a dataset of similar animals. The effect of different types of animal image datasets on the performance of ESRGAN models is also analyzed. The models are trained on their respective datasets under the same loss function. The target dataset of species-specific animal images is then processed with each model to verify the effectiveness of the approach in real-world applications. Finally, this study emphasizes the key role of dataset selection in the performance of ESRGAN models. The method provides an effective tool for animal image processing that can be applied to a variety of real-world scenarios, thus contributing to animal conservation, medical imaging, and scientific research.

1 INTRODUCTION

Image super-resolution (SR) has become increasingly important with the development of imaging technology. SR technology converts low-resolution (LR) images into high-resolution (HR) images, providing clearer and more detailed image information for various application scenarios (Sharma and Shrivastava 2022). With its wide application in medicine, photo beautification, celestial body research, and other areas (Frid-Adar et al 2018, Yang et al 2019 & Schawinski et al 2017), SR is becoming more important in practice. This study explores the application of the Enhanced Super-Resolution Generative Adversarial Network (ESRGAN) to animal image SR. Animal images are of great significance in ecology, animal behavior research and medicine. Because of the limitations of field environments and equipment, acquired animal images often have limited resolution, resulting in information loss and difficulties in analysis. Improving the resolution of animal images makes it possible to study and understand the ecological habits, behavior patterns and health status of animals more accurately, which provides a tool for the protection and study of animals.

In the field of image SR, researchers have proposed many methods, including traditional interpolation methods, deep learning-based methods and generative adversarial networks (GANs) (Keys 1981, Sharma and Shrivastava 2022 & Favorskaya and Pakhirka 2023). ESRGAN improves on SRGAN by introducing an architecture composed of residual-in-residual dense blocks (RRDB) (Wang et al 2018). In recent years, ESRGAN, as a variant of GANs, has made significant breakthroughs and is widely used in image SR tasks. At the same time, many studies have examined processing methods for animal images (Yang et al 2008). Integrating previous work and technological developments gives a better understanding of the trends and limitations of the current field.

The main objective of this study is to apply GAN techniques to the animal image SR problem. The training effect of different animal image datasets is explored using ESRGAN. Specifically, this experiment trains two models using a species-specific animal image training set and a mixed-species animal image training set

Li, S.

Super-Resolution Analysis of Animal Images Based on ESRGAN Model.

DOI: 10.5220/0012804200003885

Paper published under CC license (CC BY-NC-ND 4.0)

In Proceedings of the 1st International Conference on Data Analysis and Machine Learning (DAML 2023), pages 433-437

ISBN: 978-989-758-705-4


respectively, and then compares their SR effects on species-specific animal images. Finally, this study explores the practical implications of this research through quantitative and qualitative analysis of the experimental results. The results show that the proposed model can effectively perform animal image SR. This study is expected to provide a tool for ecologists, animal behaviorists and medical researchers to help them study and analyze animal images more accurately.

2 METHODOLOGY

2.1 Dataset Description and

Preprocessing

The purpose of this study is to explore the SR effect of models trained on different animal datasets (Dataset). The training dataset for the first model consists of 100 images of various animals. The training dataset for the second model consists of 100 images of a single animal species. To improve the generalization of the trained models, the original images are randomly cropped to a specified size before further processing. The images are additionally randomly flipped horizontally and rotated by 90 degrees with a 50% probability. To make the experimental results easier to compare, the images provided with the dataset are processed into both low-resolution and high-resolution versions in this experiment. In both cases, the image data is resized to the specified size to ensure that the generated images have the desired dimensions.
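The augmentation steps described above could be sketched as follows. This is a minimal illustration in plain Python that treats an image as a list of pixel rows. The function names (`random_crop`, `hflip`, `rot90`, `augment`), the patch size, and the choice to apply the flip and the rotation independently with probability 0.5 are assumptions made for illustration, not details taken from the study's code.

```python
import random


def random_crop(img, size, rng):
    """Cut a size x size patch at a random position from a list-of-rows image."""
    height, width = len(img), len(img[0])
    top = rng.randint(0, height - size)    # randint is inclusive on both ends
    left = rng.randint(0, width - size)
    return [row[left:left + size] for row in img[top:top + size]]


def hflip(img):
    """Mirror the image left to right."""
    return [row[::-1] for row in img]


def rot90(img):
    """Rotate the image by 90 degrees clockwise."""
    return [list(col) for col in zip(*img[::-1])]


def augment(img, size, rng, p=0.5):
    """Random crop, then flip and rotate, each applied with probability p (assumed independent)."""
    patch = random_crop(img, size, rng)
    if rng.random() < p:
        patch = hflip(patch)
    if rng.random() < p:
        patch = rot90(patch)
    return patch


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the sketch is repeatable
    image = [[r * 8 + c for c in range(8)] for r in range(8)]
    for row in augment(image, 4, rng):
        print(row)
```

Cropping before flipping and rotating keeps the later operations cheap, since they act only on the small patch. A square patch also keeps its shape under a 90-degree rotation.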

2.2 Proposed Approach

The main goal of this research is to explore how to maximize the SR effect on animal images using datasets of different animal species. The model is first defined based on ESRGAN and then trained separately on two different datasets. The first dataset contains various species of animals, and the second dataset contains only a single species. During training, the change of the loss values is plotted. Finally, the test set is imported for testing and the results of the various indicators are compared. The research process is shown in Figure 1.

2.2.1 ESRGAN

The architecture used in this experiment is based on the ESRGAN network model. The model architecture of ESRGAN includes a generator and a discriminator (Jiang et al 2022). The generator accepts an LR input image and gradually improves the image resolution through multiple convolutional layers and residual blocks. Its goal is to generate HR images with more details. The discriminator distinguishes between real HR images and images generated by the generator. During training, the quality of the fake images produced by the generator is continuously improved according to the results of the discriminator, so that the generator learns to produce sharp images.

2.2.2 Network Architecture

In this experiment, the architecture of the generator is shown in Figure 2.

Figure 1: The research process (Original).


Figure 2: The architecture of generator (Original).

The input is passed through a convolutional layer and fed into the sequence of residual blocks. In an ESRGAN model, the residual sequence block is a sequence containing multiple RRDBs. An RRDB is a kind of deep neural network block used for image processing. An RRDB contains multiple dense blocks, usually five. Each dense block contains a series of convolutional layers for local feature extraction and enhancement. All dense blocks share the same structure, as shown in Figure 3.

Figure 3: The architecture of RRDB (Original).

These residual blocks are stacked on top of each

other. Each residual block performs deep feature

extraction and augmentation. The intermediate

convolutional layer is responsible for processing the

output of the residual block. The upsampling block

increases the resolution of the feature map. The final

processing layer contains two convolutional layers.

The architecture of the discriminator is shown in

Figure 4.

Figure 4: The architecture of Discriminator (Original).

The discriminator employs the visual geometry group (VGG) network to assess the likelihood that a real image is more authentic than a fake image. The discriminator's output is calculated as follows:

$$D_{Ra}(x_r, x_f) = \sigma\left(C(x_r) - \mathbb{E}_{x_f}[C(x_f)]\right)$$ (1)

$$D_{Ra}(x_f, x_r) = \sigma\left(C(x_f) - \mathbb{E}_{x_r}[C(x_r)]\right)$$ (2)

where $D_{Ra}(x_r, x_f)$ and $D_{Ra}(x_f, x_r)$ are the outputs of the discriminator. When the discriminator judges whether the real image is real, the output is $D_{Ra}(x_r, x_f)$. Otherwise, the output is $D_{Ra}(x_f, x_r)$. $C(x_r)$ and $C(x_f)$ represent the output scores of the discriminator for the real and fake HR images, $\mathbb{E}_{x_f}[C(x_f)]$ represents the expected value of the output of the discriminator for the fake HR images generated by the generator, $\mathbb{E}_{x_r}[C(x_r)]$ represents the expected value of the output for the real HR images, and $\sigma$ represents the sigmoid function.

So the loss function of the discriminator is defined as follows:

$$L_D^{Ra} = -\mathbb{E}_{x_r}\left[\log\left(D_{Ra}(x_r, x_f)\right)\right] - \mathbb{E}_{x_f}\left[\log\left(1 - D_{Ra}(x_f, x_r)\right)\right]$$ (3)

Conversely, the generator's adversarial loss function is formulated as follows:

$$L_G^{Ra} = -\mathbb{E}_{x_r}\left[\log\left(1 - D_{Ra}(x_r, x_f)\right)\right] - \mathbb{E}_{x_f}\left[\log\left(D_{Ra}(x_f, x_r)\right)\right]$$ (4)

where $L_G^{Ra}$ is the adversarial loss function of the generator, $L_D^{Ra}$ is the loss function of the discriminator, $\log$ denotes the logarithm, $D_{Ra}(x_r, x_f)$ and $D_{Ra}(x_f, x_r)$ are defined above, $\mathbb{E}_{x_r}[\cdot]$ is the expectation over the real images, and $\mathbb{E}_{x_f}[\cdot]$ is the expectation over the fake images.
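As an illustration of how the relativistic discriminator outputs and the two adversarial losses fit together, the following sketch evaluates them for one batch of raw discriminator scores. It follows the relativistic average formulation of the original ESRGAN paper (Wang et al 2018), in which each score is compared with the batch mean of the opposite class before a sigmoid is applied. The function names, the use of plain Python lists, and the small constant `eps` that guards the logarithm are assumptions for illustration and do not reproduce the code used in this study.

```python
import math


def sigmoid(z):
    """Logistic function, mapping a raw score difference to (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def mean(values):
    """Batch average, standing in for the expectation."""
    return sum(values) / len(values)


def relativistic_losses(c_real, c_fake, eps=1e-12):
    """Return (discriminator_loss, generator_adversarial_loss) for one batch.

    c_real and c_fake are raw discriminator scores C(x_r) and C(x_f).
    """
    mean_real, mean_fake = mean(c_real), mean(c_fake)
    d_real = [sigmoid(c - mean_fake) for c in c_real]  # D_Ra(x_r, x_f)
    d_fake = [sigmoid(c - mean_real) for c in c_fake]  # D_Ra(x_f, x_r)

    loss_d = (-mean([math.log(d + eps) for d in d_real])
              - mean([math.log(1.0 - d + eps) for d in d_fake]))
    loss_g = (-mean([math.log(1.0 - d + eps) for d in d_real])
              - mean([math.log(d + eps) for d in d_fake]))
    return loss_d, loss_g


if __name__ == "__main__":
    # Hypothetical scores: the discriminator rates real images higher than fakes.
    print(relativistic_losses([2.0, 1.5], [-1.0, -0.5]))
```

The two losses are symmetric: the generator loss swaps the roles of the real and fake terms, so a batch on which the discriminator is confident gives a small discriminator loss and a large generator loss.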

2.2.3 Loss Function

The perceptual loss is calculated by using a pre-trained convolutional neural network to assess the similarity in features between the generated image and the genuine image. In ESRGAN, the perceptual loss is computed by comparing the feature activations of the generated image with those of the real image. With the above definitions, the loss function of the generator G is:

$$L_G = L_{percep} + \lambda L_G^{Ra} + \eta L_1$$ (5)

where $L_{percep}$ is the perceptual loss, which in this study is computed as an L1 loss, and $L_1$ is the pixel-wise loss, that is:

$$L_1 = \mathbb{E}_{x_i}\left\| G(x_i) - y \right\|_1$$ (6)

Here $G(x_i)$ is the image generated from the LR input $x_i$ and $y$ is the corresponding real HR image.

In the study, $\lambda = \ldots$ and $\eta = \ldots$. For the discriminator, the loss function is as defined above:

$$L_D = L_D^{Ra} = -\mathbb{E}_{x_r}\left[\log\left(D_{Ra}(x_r, x_f)\right)\right] - \mathbb{E}_{x_f}\left[\log\left(1 - D_{Ra}(x_f, x_r)\right)\right]$$ (7)

2.3 Implementation Details

The Python version used in this experiment is 3.11.5. The number of iterations is 2000, the learning rate is 1e-4, the total number of training cycles is 5, the batch size is 8, the gradient penalty weight is 10, the number of worker threads is 4, and training is run on a GPU.
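For reference, the settings listed above could be collected in a single configuration object, as in the sketch below. The class name `TrainConfig`, the field names, and the device string `"cuda"` are assumptions made for illustration. Only the numerical values come from the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainConfig:
    """Hyperparameters as reported in the text; the names are hypothetical."""
    iterations: int = 2000
    learning_rate: float = 1e-4
    epochs: int = 5                      # "total training cycle" in the text
    batch_size: int = 8
    gradient_penalty_weight: float = 10.0
    num_workers: int = 4
    device: str = "cuda"                 # assumed identifier for "GPU"


if __name__ == "__main__":
    print(TrainConfig())
```

Freezing the dataclass prevents accidental changes during a run, so both models are trained with exactly the same settings.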


3 RESULTS AND DISCUSSION

This section presents the experimental results and discusses their significance. In this experiment, the first model is trained on a mixed dataset of 100 animal pictures. The resulting model is then used to process the target dataset, which is composed of different kinds of dogs, and the average Peak Signal-to-Noise Ratio (PSNR) value is obtained. Then, a second model is trained on a dataset consisting of images of different types of dogs (different from the target dataset), and the second model is used to process the same target dataset and obtain the average PSNR value.
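The average PSNR used to compare the two models could be computed along the following lines. This is a minimal sketch using the standard PSNR definition, 10 log10(MAX^2 / MSE), on flattened pixel lists. The function names and the peak value of 255 (8-bit images) are assumptions, and the study's own evaluation code may differ.

```python
import math


def psnr(reference, estimate, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equally sized pixel lists."""
    if len(reference) != len(estimate) or not reference:
        raise ValueError("inputs must be non-empty and of equal length")
    mse = sum((r - e) ** 2 for r, e in zip(reference, estimate)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


def average_psnr(pairs, max_value=255.0):
    """Mean PSNR over (reference, estimate) pairs from a target dataset."""
    scores = [psnr(ref, est, max_value) for ref, est in pairs]
    return sum(scores) / len(scores)


if __name__ == "__main__":
    # Hypothetical 4-pixel images, used only to exercise the functions.
    hr = [10.0, 20.0, 30.0, 40.0]
    sr = [11.0, 19.0, 31.0, 39.0]
    print(average_psnr([(hr, sr)]))
```

A higher PSNR corresponds to a lower mean squared error against the reference image, which is why the model with the larger average value is regarded as producing the better reconstruction.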

3.1 Training Dataset of Mixed Animals

For the first model, the training dataset of mixed

animals is used for training. Figure 5 and Figure 6

illustrate the fluctuations in the loss values of both the

generator and discriminator.

Figure 5: G-Loss (Generator Loss) and D-Loss

(Discriminator Loss) (Original).

Figure 6: G-Loss (Original).

The results show that, when some abnormal peaks are discarded, the discriminator loss fluctuates between -100 and 100. After 900 to 1000 training iterations, the fluctuation of the discriminator loss becomes significantly smaller, and the value settles between -50 and 0. With some of the abnormal peaks discarded, the generator loss lies between 1 and 6, and its fluctuation becomes smaller after 1500 to 2000 training iterations. Under this model, the average PSNR value of the target dataset images is 64.27290923861564.

3.2 Training Dataset of a Single Animal

For the second model, a training dataset consisting of images of dogs is used for training. Figure 7 and Figure 8 depict the variations in the loss values of the generator and discriminator.

Figure 7: G-Loss and D-Loss (Original).

Figure 8: G-Loss (Original).

The results indicate that the discriminator loss falls

within the range of -40 to 0, excluding certain

abnormal peaks. After 1000-1250 training iterations,

the fluctuation in the discriminator loss noticeably

decreases, with values ranging from -20 to 0. In the

case of discarding some abnormal peaks, the generator

loss is between 0 and 6. Under this model, the average

PSNR value is 65.84025702654814.

3.3 Comparison of Results

Throughout the training process, the absolute value of

the discriminator loss for the first model is notably


higher than that of the second model, and it exhibits greater fluctuations. Similarly, the generator loss for the first model shows more significant fluctuations and has a substantially higher absolute value than that of the second model. This suggests that the training dataset of the second model is more conducive to training. Measured by PSNR, the SR result of the first model (64.27) is slightly lower than that of the second model (65.84). A higher PSNR means that the SR result deviates less from the original image, indicating improved quality of the generated image. This means that training on a single-species dataset and using the resulting model to perform SR on images of the corresponding species gives slightly better results. In summary, the results suggest that it is better to use single-species datasets to train models and process images of that species.

4 CONCLUSION

The purpose of this study is to compare the SR effects of models trained on different training datasets on the same target dataset. Specifically, it investigates whether a model trained on a dataset of the same kind of animal has a better SR effect on a target dataset of that kind of animal. In this experiment, an SR model based on ESRGAN is established and trained separately on a dataset consisting of pictures of various animals and on a dataset consisting only of pictures of dogs. By comparing the changes in the loss values of the discriminator and the generator during training, the experiment assesses the training effect of each model and the training time required. Finally, SR processing is performed on the same target dataset composed of dog pictures and the SR effects are compared. The results show that models trained with different training sets do not produce the same SR effect when images of a particular kind of animal are processed. The model trained with the same type of animal images as the target dataset achieves a better SR effect on the target animal images. At the same time, training sets consisting of a single species lead to faster training with lower loss. In future work, removing the image backgrounds from the dataset will be considered to obtain better experimental results, more kinds of training datasets will be used to compare the experimental effects, and more metrics will be used to measure the SR effect of the images.

REFERENCES

A. Sharma, B. P. Shrivastava, “Different Techniques of

Image SR Using Deep Learning: A Review,” IEEE

Sensors Journal, vol. 23, 2022, pp. 1724-1733.

M. Frid-Adar, I. Diamant, E. Klang, et al. “GAN-based

synthetic medical image augmentation for increased

CNN performance in liver lesion classification,”

Neurocomputing, vol. 321, 2018, pp. 321-331.

Q. Yang, Y. Ma, F. Chen, et al. “Recent advances in photo-

activated sulfate radical-advanced oxidation process

(SR-AOP) for refractory organic pollutants removal in

water,” Chemical Engineering Journal, vol. 378, 2019,

pp. 122149.

K. Schawinski, C. Zhang, H. Zhang, et al. “Generative

adversarial networks recover features in astrophysical

images of galaxies beyond the deconvolution limit,”

Monthly Notices of the Royal Astronomical Society:

Letters, vol. 467, 2017, pp. 110-114.

R. Keys, “Cubic convolution interpolation for digital image

processing,” IEEE transactions on acoustics, speech,

and signal processing, vol. 29, 1981, pp. 1153-1160.


M. N. Favorskaya, A. I. Pakhirka, “SF-SRGAN: Progressive
GAN-based Face Hallucination,” The International
Archives of the Photogrammetry, Remote Sensing and
Spatial Information Sciences, vol. 48, 2023, pp. 47-52.

X. Wang, K. Yu, S. Wu, et al. “Esrgan: Enhanced super-

resolution generative adversarial networks,”

Proceedings of the European conference on computer

vision (ECCV) workshops, 2018.

J. Yang, J. Wright, T. Huang, et al. “Image super-resolution

as sparse representation of raw image patches,” 2008

IEEE conference on computer vision and pattern

recognition. IEEE, 2008, pp. 1-8.

Dataset
https://www.kaggle.com/datasets/iamsouravbanerjee/animal-image-dataset-90-different-animals

J. Jiang, L. Zhao, Y. Jiao, “Research on Image Super-

resolution Reconstruction Based on Deep Learning,”

International Journal of Advanced Network, Monitoring

and Controls, vol. 7, 2022, pp. 1-21.
