Exploring Histopathological Image Augmentation Through StyleGAN2ADA: A Quantitative Analysis

Glenda P. Train¹ᵃ, Johanna E. Rogalsky²ᵇ, Sergio O. Ioshii³ᶜ, Paulo M. Azevedo-Marques²ᵈ and Lucas F. Oliveira¹ᵉ

¹ Department of Informatics, Federal University of Paraná (UFPR), Brazil
² Interunit Graduate Program in Bioengineering, University of São Paulo (USP), Brazil
³ Pathological Anatomy Laboratory, Erasto Gaertner Hospital, Brazil
{glendaproenca, johanna.elisabeth8}@gmail.com, sergio.ioshii@pucpr.br, pmarques@fmrp.usp.br, lferrari@inf.ufpr.br
Keywords:
Breast Cancer, Estrogen Receptor, Progesterone Receptor, Digital Pathology, Data Augmentation,
Classification, Artificial Intelligence, Deep Learning, Medical Image Processing.
Abstract:
Due to the rapid development of technology in the last decade, pathology has entered its digital era with the diffusion of whole-slide images (WSIs). With this shift, reliable automated diagnosis has become highly desirable to reduce the time and effort experts spend on exhausting, time-consuming tasks. However, given the scarcity of publicly labeled medical data and the imbalance between data classes, various data augmentation techniques are needed to mitigate these problems. This paper presents experiments that investigate the impact of adding synthetic IHC images on the classification of staining intensity levels of cancer cells with estrogen and progesterone biomarkers. We tested SVM, CNN, DenseNet, and ViT models, trained with and without images generated by StyleGAN2ADA and AutoAugment. The experiments covered class balancing and the addition of synthetic images to the training process, improving the classification F1-score by up to 14 percentage points. In almost all experiments using StyleGAN2ADA images, the F1-score improved.
1 INTRODUCTION
Cancer is a term that defines a large group of diseases characterized by the rapid creation of abnormal cells that grow beyond their usual limits and can spread to other body regions. This spread of abnormal cells is the main cause of death from cancer. According to the World Health Organization (WHO), the global cancer incidence exceeded 19 million cases, with almost 10 million deaths in 2020 (WHO, 2022). In 2022, Breast Cancer (BC) ranked second in incidence and fourth in mortality, as reported by the International Agency for Research on Cancer (IARC) (IARC, 2023).
To reach a diagnosis, the immunohistochemistry (IHC) process analyses biopsy samples with respect to the Estrogen Receptor (ER) and Progesterone Receptor (PR) biomarkers. These receptors are proteins inside or on cells that can bind to certain substances in the blood (American Cancer Society, 2021), leading cancerous cells to overexpress them and thereby promote uncontrolled cell growth (Yip and Rhodes, 2014).

ᵃ https://orcid.org/0009-0003-8417-7179
ᵇ https://orcid.org/0009-0003-2282-3606
ᶜ https://orcid.org/0000-0002-7871-4463
ᵈ https://orcid.org/0000-0002-7271-2774
ᵉ https://orcid.org/0000-0002-8198-0877
The Allred score assesses the hormone receptor
expression by summing up the Proportion Score (PS),
which indicates the relative proportion of cancer cells
in the tissue (Mouelhi et al., 2018), and the Intensity
Score (IS), which evaluates the intensity of cell stain-
ing (Kim et al., 2016). The PS score ranges from 0 to
5, and the IS score has values 0 (negative), 1+ (weakly
positive), 2+ (moderately positive), and 3+ (strongly
positive) (Rogalsky et al., 2021). Since the score re-
lies on the pathologist’s or histopathologist’s experi-
ence and professional training, this process is suscep-
tible to human error and fatigue, which may lead to
misdiagnoses (Han et al., 2017).
The creation of Whole-Slide Images (WSIs)
marked the beginning of the pathology digital era and
prompted researchers to automate diagnosis and assist
in IHC image reporting (Laurinavicius et al., 2016).
In this context, automatic classification to categorize
disease characteristics more specifically, as well as segmentation to determine the location of cancer cells in IHC images, have emerged (Mouelhi et al., 2018;
Cordeiro et al., 2018; Tang et al., 2019; Rogalsky
et al., 2021; Rmili et al., 2022; Mridha et al., 2022;
Choi et al., 2023; Krinski et al., 2023).
Recent interest from IHC medical imaging researchers has led to significant progress. However, challenges remain for medical image sets, which often lack variability and have imbalanced classes (Mukherkjee et al., 2022). Data augmentation techniques based on image-processing methods, together with synthetic images produced by Generative Adversarial Networks (GANs), are helping to address these issues by generating new samples from existing datasets (Krinski et al., 2023; Osuala et al., 2023).
Therefore, to advance the automation of breast cancer diagnosis, we investigated the StyleGAN2ADA network to generate high-quality medical images for each class of the IS score and compared them with images generated by the AutoAugment model. In addition, we defined and applied four classification methods to categorize Estrogen Receptor (ER) and Progesterone Receptor (PR) biomarker patches. To this end, we performed a quantitative analysis of the results, combining data augmentation techniques and classification methods and evaluating them through the f1-score.
2 MATERIALS AND METHODS
2.1 Datasets
In this paper, we used the dataset from the Rogalsky study (Rogalsky, 2021), which includes IHC-DAB WSIs from 78 patients evaluated for the ER and PR biomarkers. The author provided 1801 (ER) and 1625 (PR) patches of 400x300 pixels, selected from ROIs at 40x magnification (discarding non-pathological regions). Each patch received an intensity score (IS) assigned by experts. Fig. 1 presents image samples from both datasets (first row), and Table 1 shows the IS class distributions.
Table 1: Distribution of IS classes from Estrogen Receptor
(ER) and Progesterone Receptor (PR) images that compose
the HistoBC-HR dataset of Rogalsky (2021).
Exam Type 0 1+ 2+ 3+ Total
ER 414 149 293 945 1801
PR 515 171 226 713 1625
2.2 Data Augmentation
Medical image datasets often have low variability and
high imbalance between classes (Mukherkjee et al.,
2022). For instance, our dataset is limited in sample count: class 1+ has only 149 examples, while class 3+ contains 945 images, illustrating the imbalance between classes 1+ and 3+. Data Augmentation (DA) techniques have recently been used to address these challenges, including image-processing methods as well as synthetic image generation with GANs (Osuala et al., 2023). In this paper, we investigated two such techniques: AutoAugment and StyleGAN2ADA (step 1 in Fig. 2).
2.2.1 AutoAugment
In the context of our research, we used the AutoAugment implementation provided by the PyTorch library, with the augmentation policy pre-learned on the CIFAR-10 dataset. The idea of AutoAugment is to automate the search for data augmentation policies, optimizing the selection of transformations, the probability of applying them, and the magnitude of each operation. The policy includes transformations such as rotations, translations, brightness adjustments, color changes, and equalizations, applied with different probabilities and magnitudes. We chose the pre-trained AutoAugment to represent image-processing forms of data augmentation, since it applies its operations without requiring manual adjustment of which ones to use and at what magnitudes. For more details about the transformations, consult Cubuk et al. (2019).
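The listing below is a minimal sketch of this setup using torchvision's AutoAugment with the CIFAR-10 policy; the file path and the number of copies per patch are illustrative and not taken from our pipeline.

```python
from PIL import Image
from torchvision import transforms

# CIFAR-10 AutoAugment policy shipped with torchvision, as described above.
policy = transforms.AutoAugment(transforms.AutoAugmentPolicy.CIFAR10)

def augment_patch(path, n_copies=5):
    """Return n_copies randomly augmented versions of one IHC patch."""
    patch = Image.open(path).convert("RGB")
    return [policy(patch) for _ in range(n_copies)]

# Hypothetical usage: extra = augment_patch("er_patches/class_1plus/patch_001.png")
```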
2.2.2 StyleGAN2ADA
Generative Adversarial Networks (GANs) are recent
techniques for generating synthetic images, consist-
ing of two networks: a generator that creates new im-
ages and a discriminator that distinguishes real from
fake (Osuala et al., 2023). Amidst current archi-
tectures, the Style Generative Adversarial Network
with Adaptive Discriminator Augmentation (StyleGAN2ADA) stands out for generating high-quality images and addressing the overfitting issues of its predecessors.
decessors. This network tackles the challenges aris-
ing from limited datasets by incorporating image-
processing data augmentation techniques during the
training process, thereby diversifying and increasing
the number of dataset samples (Karras et al., 2020).
To evaluate the StyleGAN2ADA, we separated
the same-class images from the dataset, forming four
training sets (0, 1+, 2+, and 3+) for each type of
exam (ER and PR). Then, we trained a specific StyleGAN2ADA for each class and generated new synthetic images of both biomarkers and all intensity scores.
Figure 1: Examples of real images, synthetic images obtained by AutoAugment, and images generated by StyleGAN2ADA
on the Estrogen Receptor (ER) and Progesterone Receptor (PR) dataset.
Figure 2: Overview of the proposed work. Step 1 uses the ER and PR image datasets, applying AutoAugment (AA) with
pre-trained weights for data augmentation, and also trains a StyleGAN2ADA (SG) to generate synthetic data for each IS class.
Step 2 organizes the data into five experiments: E1, consisting solely of the original dataset; E2, the original dataset with the
addition of 100 images produced by AA to the training set; E3, the original dataset with 100 synthetic images generated by
SG; E4, with class balancing in the training set using AA-generated images; and E5, also balancing the classes but using
SG-generated images. Based on these experiments, cross-validation defines the training, validation, and testing sets, with
the validation and testing sets consisting of real data. Finally, IS score classification uses the Rogalsky Methodology, CNN,
DenseNet, and ViT models (’C’ stands for Convolution, ’P’ for Pooling, and ’D’ for Dense Layer). The evaluation of the
experiments was performed using the f1-score.
In this step, we used the PyTorch library to implement StyleGAN2ADA, with 1500 training epochs and 0.5 as the truncation hyperparameter when sampling new artificial images. We defined these values after running smaller experiments on a validation set.
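As an illustration, the sketch below generates patches from one of the per-class generators, following the documented usage pattern of the NVIDIA stylegan2-ada-pytorch code base; the snapshot file name is hypothetical, and unpickling requires the repository's dnnlib/torch_utils modules on the Python path.

```python
import pickle
import torch

# Load one per-class generator (unconditional, so no class label is passed).
# "network-snapshot-er-3plus.pkl" is a hypothetical file name.
with open("network-snapshot-er-3plus.pkl", "rb") as f:
    G = pickle.load(f)["G_ema"].cuda().eval()

z = torch.randn(16, G.z_dim, device="cuda")                 # latent codes for 16 new patches
imgs = G(z, None, truncation_psi=0.5, noise_mode="const")   # truncation of 0.5, as in our experiments
imgs = ((imgs.clamp(-1, 1) + 1) * 127.5).to(torch.uint8)    # map the [-1, 1] output to 8-bit images
```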
2.3 Classification
2.3.1 Rogalsky Methodology (RM)
As a first step to achieve the IS classification, we
adapted the methods proposed by Rogalsky (Rogal-
sky, 2021). On each patch we applied the Contrast
Limited Adaptive Histogram Equalization (CLAHE)
method followed by a thresholding, starting with a
Gaussian blur filter to flatten gradients and avoid
noise amplifications. Then, we converted the image
to grayscale and passed it through the Otsu technique
(see Fig. 2). With this initial segmentation of the cells,
we transformed the images from the original color
space to the HSV color space, splitting the H, S, and V
channels. After this, we extracted the positive cells (in
brown) and negative cells (in blue) with color decon-
volution (mask values available in Rogalsky (2021)).
From the deconvolution images, we calculated in-
tensity histograms, which underwent MinMax nor-
malization to keep values between 0 and 100. The
intensity histograms served as features for the train-
ing, validation, and test sets. The training stage con-
sisted of passing the training and validation sets to a
Support Vector Machine (SVM) model. During this
phase, the validation set optimized the model’s hy-
perparameters. Finally, we fed the test set to the trained SVM and calculated overall and per-class f1-scores. We repeated this process five times, using the 5-fold cross-validation method.
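A condensed sketch of this preprocessing chain is given below, using OpenCV; the CLAHE parameters and the choice of applying it to the LAB lightness channel are our assumptions, and the color-deconvolution masks from Rogalsky (2021) are omitted.

```python
import cv2

def patch_features(bgr_patch, bins=100):
    """CLAHE -> blur -> grayscale -> Otsu mask -> HSV split -> masked intensity histogram."""
    clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))        # illustrative parameters
    lab = cv2.cvtColor(bgr_patch, cv2.COLOR_BGR2LAB)
    lab[..., 0] = clahe.apply(lab[..., 0])                             # equalize the lightness channel
    eq = cv2.cvtColor(lab, cv2.COLOR_LAB2BGR)

    blurred = cv2.GaussianBlur(eq, (5, 5), 0)                          # flatten gradients before Otsu
    gray = cv2.cvtColor(blurred, cv2.COLOR_BGR2GRAY)
    _, cell_mask = cv2.threshold(gray, 0, 255,
                                 cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)

    hsv = cv2.cvtColor(eq, cv2.COLOR_BGR2HSV)
    _, _, v = cv2.split(hsv)

    # Intensity histogram of the segmented cells, MinMax-scaled to [0, 100];
    # the full method computes histograms on color-deconvolved images instead.
    hist = cv2.calcHist([v], [0], cell_mask, [bins], [0, 256]).ravel()
    return 100 * (hist - hist.min()) / (hist.max() - hist.min() + 1e-8)

# The resulting feature vectors feed an sklearn.svm.SVC tuned on the validation fold.
```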
2.3.2 Proposed CNN
With the rise of IHC WSIs, Deep Learning mod-
els, particularly CNNs, gained prominence for their
ability to automatically extract features, often outper-
forming manually adjusted methods for feature ex-
traction (Cordeiro, 2019; Mridha et al., 2022). Given
this scenario, we proposed a CNN based on the ar-
chitecture from Tang et al. (2019), which uses a
lightweight CNN capable of accurately classifying
IHC images. The patches were normalized and re-
sized to 256x256 pixels, and the architecture included
6 Convolutional and Max Pooling layers, along with
2 Dense layers (Fig. 2). The convolutional layers
had 64, 64, 128, 256, 256, and 512 filters, and training used a dropout rate of 0.2, a learning rate of 0.00008, the Adam optimizer, early stopping, and the Multi-Label Soft Margin loss function. The network output the probability of the image belonging to each IS class, and its performance was evaluated using the same protocol and metric as the RM.
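A PyTorch sketch of this architecture is shown below; kernel sizes, padding, and the width of the first dense layer are not specified in the text and are therefore assumptions.

```python
import torch
import torch.nn as nn

class ISNet(nn.Module):
    """Lightweight CNN: 6 conv + max-pooling blocks (64, 64, 128, 256, 256, 512 filters), 2 dense layers."""
    def __init__(self, n_classes=4):
        super().__init__()
        channels = [3, 64, 64, 128, 256, 256, 512]
        blocks = []
        for c_in, c_out in zip(channels[:-1], channels[1:]):
            blocks += [nn.Conv2d(c_in, c_out, kernel_size=3, padding=1),
                       nn.ReLU(inplace=True),
                       nn.MaxPool2d(2)]
        self.features = nn.Sequential(*blocks)             # 256x256 input -> 512 x 4 x 4 maps
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Dropout(0.2),
            nn.Linear(512 * 4 * 4, 256),                   # hidden dense width assumed
            nn.ReLU(inplace=True),
            nn.Linear(256, n_classes))                     # one score per IS class

    def forward(self, x):
        return self.classifier(self.features(x))

model = ISNet()
criterion = nn.MultiLabelSoftMarginLoss()                  # expects one-hot style targets
optimizer = torch.optim.Adam(model.parameters(), lr=0.00008)
```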
2.3.3 DenseNet Approach
To address vanishing-gradient issues, strengthen fea-
ture propagation, and reduce CNN parameters, the
DenseNet (Densely Connected Convolutional Net-
work) architecture was proposed (Huang et al., 2017).
A key innovation was the use of dense connections
between layers (Dense Blocks), where all layers are
connected, allowing each to receive inputs from all
previous layers and pass features to the next, thereby
enabling more effective learning (Huang et al., 2017).
For our experiments, we implemented DenseNet121,
a 121-layer variation (Fig. 2), using Pytorch and pre-
trained weights from the ImageNet1K dataset, and
performed fine-tuning with early stopping. Hyperpa-
rameters were set according to default values in the
library, with a Cross-Entropy loss function, 224x224
image size, and the Adam optimizer. Regarding the
division of data and metrics, DenseNet followed the
same steps described in the proposed CNN.
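The fine-tuning setup can be sketched as below (assuming a recent torchvision release); replacing the classifier head for the four IS classes is the only change made explicit, and the remaining hyperparameters follow library defaults, as stated above.

```python
import torch
import torch.nn as nn
from torchvision import models

# DenseNet121 pre-trained on ImageNet1K, with the classifier replaced for the four IS classes.
model = models.densenet121(weights=models.DenseNet121_Weights.IMAGENET1K_V1)
model.classifier = nn.Linear(model.classifier.in_features, 4)

criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters())   # default Adam settings; 224x224 inputs, early stopping during training
```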
2.3.4 ViT Approach
To provide an alternative to CNNs with lower com-
putational costs, the Vision Transformer (ViT) was
proposed in Dosovitskiy et al. (2021). ViT divides
the input image into fixed-size patches and employs
self-attention mechanisms to capture features at var-
ious levels. This enables the model to understand
both global image context and relationships between
patches (Fig. 2). The classification step is performed
by an MLP with a hidden layer during pre-training
and a single linear layer during fine-tuning (Dosovit-
skiy et al., 2021). We implemented the ViT using
Pytorch and performed fine-tuning with pre-trained
weights from the ImageNet1K dataset. The model
considered normalized 224x224 images, learning rate
of 0.001, Cross-Entropy loss, and Adam optimizer,
along with early stopping to prevent overfitting. Data
division and evaluation metrics followed the approach
described in previous classifiers.
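The corresponding fine-tuning sketch is given below; we assume the ViT-B/16 variant from torchvision, which is not stated explicitly in the text.

```python
import torch
import torch.nn as nn
from torchvision import models

# ViT-B/16 pre-trained on ImageNet1K; only the classification head is replaced.
model = models.vit_b_16(weights=models.ViT_B_16_Weights.IMAGENET1K_V1)
model.heads.head = nn.Linear(model.heads.head.in_features, 4)   # four IS classes

criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)      # 224x224 normalized inputs, early stopping
```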
3 EXPERIMENTS
3.1 Cross-Validation Approach
To structure the training process, we opted for the 5-
fold cross-validation, chosen for its ability to provide
reliable performance estimates with small datasets,
avoiding misinterpretations common with methods
like Holdout (Maleki et al., 2020). This approach
averages results across five combinations of training, validation, and test sets. Unlike standard cross-validation, we designated one fold exclusively for testing, while one of the remaining four training folds served as validation. This adaptation increased the variety of the test data, offering a more realistic performance estimate while balancing result reliability with training and execution time.
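The sketch below illustrates this adapted scheme with scikit-learn; the data, the classifier, the way one of the four remaining folds is taken as validation, and the f1 averaging mode are placeholders or assumptions.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.svm import SVC
from sklearn.metrics import f1_score

X = np.random.rand(200, 100)               # placeholder feature vectors
y = np.random.randint(0, 4, size=200)      # placeholder IS labels (0, 1+, 2+, 3+)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
fold_scores = []
for train_val_idx, test_idx in skf.split(X, y):
    cut = len(train_val_idx) // 4                      # roughly one of the four remaining folds
    val_idx, train_idx = train_val_idx[:cut], train_val_idx[cut:]
    clf = SVC().fit(X[train_idx], y[train_idx])        # val_idx would drive hyperparameter tuning / early stopping
    preds = clf.predict(X[test_idx])
    fold_scores.append(f1_score(y[test_idx], preds, average="weighted"))

print(f"mean f1 over the 5 folds: {np.mean(fold_scores):.4f}")
```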
3.2 Hyperparameter Optimization
After defining the data augmentation and classifica-
tion methods, we optimized key hyperparameters. For
StyleGAN2ADA, we set the truncation factor to 0.5,
balancing increased variance in synthetic images with
the preservation of original features (Karras et al.,
2020), based on experiments with truncation values
between 0.3 and 0.7. For classifiers, we adjusted
the learning rate and implemented early stopping to
mitigate overfitting (Bai et al., 2021). Training halts
if validation loss worsens by more than 10% across
three instances, and the learning rate is halved each
time, enabling finer adjustments. These values were
determined through smaller experiments and learning
curve analysis.
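One possible reading of this early-stopping rule is sketched below; the exact bookkeeping of what counts as a worsening "instance" is our interpretation.

```python
class EarlyStopper:
    """Halve the learning rate whenever validation loss worsens by more than 10%;
    stop training after three such events."""
    def __init__(self, tolerance=0.10, patience=3):
        self.best = float("inf")
        self.events = 0
        self.tolerance = tolerance
        self.patience = patience

    def step(self, val_loss, optimizer):
        if val_loss > (1.0 + self.tolerance) * self.best:
            self.events += 1
            for group in optimizer.param_groups:   # halve the learning rate on each worsening
                group["lr"] *= 0.5
        else:
            self.best = min(self.best, val_loss)
        return self.events >= self.patience        # True -> halt training

# Typical use inside the epoch loop: `if stopper.step(val_loss, optimizer): break`
```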
3.3 IS Classification
Regarding the automatic scoring of cancer levels from
Estrogen Receptor (ER) and Progesterone Receptor
(PR) images, we defined five experiments to study the
impact of adding synthetic images on the training pro-
cess of the classification models. For this, we consid-
ered AutoAugment and StyleGAN2ADA as data aug-
mentation methods and selected Rogalsky Methodol-
ogy, CNN, DenseNet, and ViT models to categorize
images into the four IS score values. In the evalua-
tion step, the f1-scores of the test sets from the 5-folds
were aggregated by the mean. We present the descrip-
tions of these experiments below:
E1 – DS: Training, validation, and testing of the proposed classification methods with the original dataset (DS).
E2 – AA+100: Addition of 100 synthetic images
for each IS class to the original training set, gen-
erated by the AutoAugment model.
E3 – SG+100: Addition of 100 synthetic images
for each IS class to the original training set, gen-
erated by the StyleGAN2ADA model.
E4 – AA+B: Balancing of the training set classes
in relation to the majority class using the Au-
toAugment model. After balancing, 100 more
synthetic images were added for each IS class.
E5 – SG+B: Balancing of the training set classes
in relation to the majority class using the Style-
GAN2ADA model. After balancing, 100 more
synthetic images were added for each IS class.
Table 2 details the number of samples per class
in each experiment. For experiments E2 and E3, we
added 100 synthetic images using AutoAugment and
StyleGAN2ADA, respectively, to evaluate the impact
of a small amount of data augmentation without ex-
ceeding the number of real images in each class. The
minority class in the ER dataset had 123 samples, and
in the PR dataset, 102, prompting the selection of 100
synthetic images. Experiments E4 and E5 aimed to
balance the classes by adding synthetic images un-
til the minority classes matched the majority class,
followed by an additional 100 generated images per
class. All experiments were conducted with four clas-
sifiers, and the results were compared to E1, which
used the original dataset without synthetic images.
Table 2: Number of training samples of each class for all
experiments with the ER and PR datasets.
             ER                             PR
             0    1+   2+   3+   Total      0    1+   2+   3+   Total
DS (E1)      330  123  235  753  1441       319  102  137  417   975
AA+100 (E2)  430  223  335  853  1841       419  202  237  517  1375
SG+100 (E3)  430  223  335  853  1841       419  202  237  517  1375
AA+B (E4)    853  853  853  853  3412       517  517  517  517  2068
SG+B (E5)    853  853  853  853  3412       517  517  517  517  2068
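As a concrete illustration of how the balanced sets (E4 and E5) in Table 2 can be assembled, the helper below first tops every class up to the majority-class size with synthetic patches and then adds 100 more per class; synthesize is a hypothetical callable standing in for AutoAugment or the per-class StyleGAN2ADA generator.

```python
def build_balanced_set(real_by_class, synthesize, extra=100):
    """Top up each class to the majority-class size, then add `extra` synthetic patches per class."""
    majority = max(len(patches) for patches in real_by_class.values())
    augmented = {}
    for cls, patches in real_by_class.items():
        needed = (majority - len(patches)) + extra
        augmented[cls] = list(patches) + synthesize(cls, needed)
    return augmented

# E.g., for ER the majority class 3+ has 753 training patches, so class 1+ (123)
# receives 630 + 100 = 730 synthetic patches, giving the 853 per class shown in Table 2.
```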
4 RESULTS
The average 5-fold results with the Rogalsky Method-
ology (RM), CNN, DenseNet (DN), and ViT models
for each experiment are presented in Table 3, covering
the f1-score metric from both datasets (ER and PR).
4.1 ER Dataset
Starting with the Rogalsky Methodology in the ER dataset scenario, AutoAugment (E2 and E4) worsened the average f1-score by 5 and 9 percentage points, while E3 reduced it only minimally (0.31 percentage points) and E5 improved it by 0.38 percentage points.
AutoAugment transformations may have introduced
noise or altered the main characteristics of the origi-
nal images, such as coloration and contrast (see sec-
ond row of Fig. 1). This may have transformed the
intensity of the cells, confusing the classifier. Also,
Table 3: Classification methods f1-scores in experiments E1, E2, E3, E4, and E5, considering all IS classes from ER and PR
biomarkers. The blue values show improvements, and the red values indicate worsening compared to the results of experiment
DS. The results refer to the 5-fold averages.
Exam Type  Model  DS (E1)        AA+100 (E2)    SG+100 (E3)    AA+B (E4)      SG+B (E5)
ER         RM     82.86 ± 02.34  77.94 ± 05.12  82.55 ± 02.92  74.33 ± 05.91  83.25 ± 03.06
ER         CNN    71.77 ± 06.88  77.37 ± 04.63  81.20 ± 03.03  65.76 ± 36.92  81.16 ± 04.04
ER         DN     76.88 ± 05.47  77.32 ± 04.95  87.46 ± 03.98  76.93 ± 03.88  80.60 ± 04.08
ER         ViT    82.86 ± 03.43  82.97 ± 02.89  83.68 ± 02.65  82.44 ± 03.03  84.04 ± 03.71
PR         RM     76.36 ± 04.91  69.56 ± 06.81  78.63 ± 04.13  68.23 ± 06.83  77.86 ± 03.01
PR         CNN    57.16 ± 02.80  58.84 ± 03.05  64.69 ± 05.57  69.36 ± 06.49  71.31 ± 03.95
PR         DN     80.47 ± 04.48  80.42 ± 02.77  82.00 ± 02.82  79.34 ± 02.98  81.25 ± 03.37
PR         ViT    91.75 ± 06.35  89.76 ± 05.09  91.76 ± 06.26  89.47 ± 05.59  92.22 ± 06.53
since the SVM does not present problems with imbalanced data (Cortes and Vapnik, 1995), adding images generated by StyleGAN2ADA neither substantially improved nor worsened the model's performance. Thus, the network's synthetic images outperform those from AutoAugment and do not harm the classification process.
With the proposed CNN, synthetic images im-
proved E2, E3, and E5 experiments, with an in-
crease of up to 9 percentage points in E3 using Style-
GAN2ADA. These improvements can be attributed
to the greater variability introduced by synthetic im-
ages (see third row of Fig. 1), which helped general-
ize the model. With AutoAugment, there was a gain
of 5.6 percentage points in E2, but a drop of 6 per-
centage points in E4. Because it is a model with no
pre-trained weights, the CNN was able to stabilize its
training with the addition of images in experiment E2
(ao Huang et al., 2022). However, in the case of E4,
the model began to rely mostly on these synthetic im-
ages, resulting in misclassification between classes.
DenseNet presented the best results among the
evaluated models, consistently outperforming E1,
with an f1-score of 87.46% in E3. The greater depth
of the network layers, the various innovations brought
by the architecture, and the use of pre-trained weights
may have contributed to better performance. How-
ever, E5 had a lower average f1-score compared to
the CNN and the RM. Thus, we emphasize the im-
portance of carefully optimizing the number of syn-
thetic images included in the training set, as a more
complex model does not necessarily guarantee better
performance.
With ViT, f1-scores were higher than 82% in all
experiments, making it the most consistent model. E5
showed the best performance, with a 1.18 percent-
age point improvement. While the improvement was
modest, the synthetic images did not degrade the re-
sults, and given that ViT typically requires millions
of data points to achieve optimal performance, this
highlights the potential of StyleGAN2ADA to support
the learning process effectively. The larger dataset,
class balance, and high-quality synthetic images from
the network could enhance learning by enriching the
training set and stabilizing ViT’s feature extraction
and MLP training (Dosovitskiy et al., 2021).
4.2 PR Dataset
In the analysis of the RM model on the PR dataset,
we noticed that the f1-scores improved in experiments
E3 and E5 (StyleGAN2ADA) compared to E1 (DS).
In contrast, the experiments with AutoAugment (E2
and E4) showed a decline, with a drop of 8 percent-
age points. This suggests that image-processing data
augmentation transformations can be detrimental if
not properly optimized for the specific problem (see
second row of Fig. 1). The experiment using 100 synthetic images from StyleGAN2ADA increased the f1-score by 2.27 percentage points, demonstrating that these images can enhance the learning process, even in models that do not depend on large datasets or class balancing techniques (Cortes and Vapnik, 1995).
Using the CNN, all experiments with synthetic
image insertion (E2, E3, E4, and E5) improved the
f1-scores compared to the original data experiment.
For class 1+, the CNN initially failed to classify any
test samples in E1 and E2, achieving f1-scores of 0%.
From E3 onward, the model began to succeed, reach-
ing 60% in E5. This trend was reflected in overall
f1-scores, with gains of 14.15 percentage points in
E5 and 12.12 in E4. These results highlight the im-
portance of balancing training sets with synthetic im-
ages, particularly to stabilize learning in models with-
out pre-trained weights (ao Huang et al., 2022).
Regarding DenseNet, we observed that the f1-
scores did not show notable variations in the overall
results. The highest gain was 1.53 percentage points,
reaching an 82% f1-score, making DenseNet the first
classifier to surpass 80% on the PR data. Thus, we
conclude that even though the improvements from
one experiment to another are minor, they are crucial
for more accurate diagnoses in the medical field and
indicate potential advancements in the area with the
use of StyleGAN2ADA.
Finally, ViT achieved the best results in the re-
search, surpassing a 90% overall f1-score. We ob-
served that experiment E3 maintained the results of
E1, and E5 reached 92.22%, improving the metric by
0.47 percentage points. E2 and E4 reduced the met-
rics, presenting drops of 1.99 percentage points in E2
and 2.28 in E4. The fact that StyleGAN2ADA im-
ages did not harm performance highlights their po-
tential. ViT models require millions of data points
to achieve optimal performance, a requirement unmet
by small datasets. This suggests that synthetic data
could offer a cost-efficient alternative, eliminating the
need for large, annotated real datasets by providing
millions of high-quality synthetic samples.
5 LIMITATIONS
We did not apply the proposal to other datasets with distinct types of pathologies, although a network capable of synthesizing high-quality medical images could contribute to several areas thanks to its ability to learn the concept of the images and respect the main characteristics of the data. We also did not consider the use of pre-trained weights in StyleGAN2ADA, even though fine-tuning can facilitate the training process because it starts from information that may be relevant. Finally, we did not carry out interpretability studies of the feature maps of the generative network and the CNN-based classifiers; such a study would allow a greater understanding of which characteristics the models consider essential to define each class.
6 DISCUSSION
In this research, we conducted an impact study to
evaluate the effect of adding synthetic medical im-
ages into the classification methods training process.
The objective was to classify the cell staining inten-
sity score of patches from ER and PR biomarkers re-
sponsible for breast cancer detection and categoriza-
tion. To achieve this, we generated images with the
StyleGAN2ADA and AutoAugment models and in-
corporated synthetic images into the training process
of four classification models.
In the ER dataset, we achieved the best classifica-
tion results with DenseNet, obtaining an f1-score of
87.46% and improving the metric by 10.58 percent-
age points compared to the experiment with original
data. In the PR dataset, we achieved an f1-score of
92.22% with ViT, along with an increase of 14 per-
centage points with the CNN. Experiments using syn-
thetic images from AutoAugment produced worse re-
sults, with drops of up to 8 percentage points, indi-
cating that simple data augmentation techniques can
interfere with critical features of medical images. On
the other hand, images generated by StyleGAN2ADA
improved the results in most experiments by increas-
ing the variability of the training set and promoting
better generalization of the classifiers.
To the best of our knowledge, our research is the
first in the field to use StyleGAN2ADA in the context
of IHC images. In future work, we aim to evaluate the
proposed methods on other datasets, examine the im-
pact of pre-trained weights on StyleGAN2ADA train-
ing, and apply interpretability studies to facilitate the
adoption of this proposal in clinical environments.
7 CONCLUSIONS
In summary, in the context of breast cancer images
associated with the ER and PR biomarkers, the use
of the state-of-the-art network StyleGAN2ADA im-
proved performance in IS classification by increasing
data variability. In contrast, data augmentation tech-
niques based solely on image-processing, such as the
pre-trained AutoAugment, proved inadequate for this
problem. Furthermore, we emphasize the importance
of carefully selecting the evaluation method in class-
imbalanced scenarios and testing different amounts of
synthetic images added to the training set to ensure re-
liable and robust results.
ACKNOWLEDGEMENTS
This research received financial support from the Coordenação de Aperfeiçoamento de Pessoal de Nível Superior - Brasil (CAPES) - Finance Code 001 and the Brazilian National Research Council (CNPq; grant 441782/2018-3). The work was submitted to the Ethics Committee on Research and conforms with the ethical aspects defined by the National Health Council's Resolution 466/2012 (CAAE - 84415418.5.0000.0098 and approval number 2.568.281).
REFERENCES
American Cancer Society (2021). Breast Can-
cer Hormone Receptor Status. https:
//www.cancer.org/cancer/breast-cancer/
understanding-a-breast-cancer-diagnosis/
breast-cancer-hormone-receptor-status.html.
ao Huang, Z., Sang, Y., Sun, Y., and Lv, J. (2022). A neu-
ral network learning algorithm for highly imbalanced
data classification. Information Sciences, 612:496–
513. https://doi.org/10.1016/j.ins.2022.08.074.
Bai, Y., Yang, E., Han, B., Yang, Y., Li, J., Mao, Y., Niu,
G., and Liu, T. (2021). Understanding and improving
early stopping for learning with noisy labels.
Choi, S., Cho, S. I., Jung, W., Lee, T., Choi, S. J., and
et al. (2023). Deep learning model improves tumor-
infiltrating lymphocyte evaluation and therapeutic re-
sponse prediction in breast cancer. npj Breast Cancer.
https://doi.org/10.1038/s41523-023-00577-4.
Cordeiro, C. Q. (2019). An Automatic Patch-Based Ap-
proach for HER-2 Scoring in Immunohistochemical
Breast Cancer Images. https://acervodigital.ufpr.br/
handle/1884/66131.
Cordeiro, C. Q., Ioshii, S. O., Alves, J. H., and de Oliveira, L. F. (2018). An Automatic Patch-based Approach for HER-2 Scoring in Immunohistochemical Breast Cancer Images Using Color Features. XVIII Simpósio Brasileiro de Computação Aplicada à Saúde. https://doi.org/10.5753/sbcas.2018.3685.
Cortes, C. and Vapnik, V. (1995). Support-vector networks. Machine Learning, 20. https://doi.org/10.1007/BF00994018. Accessed: 27/03/2023.
Cubuk, E. D., Zoph, B., Mané, D., Vasudevan, V., and Le, Q. V. (2019). Autoaugment: Learning augmentation strategies from data. In 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn,
D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer,
M., Heigold, G., Gelly, S., Uszkoreit, J., and Houlsby,
N. (2021). An image is worth 16x16 words: Trans-
formers for image recognition at scale.
Han, Z., Wei, B., Zheng, Y., Yin, Y., Li, K., and
Li, S. (2017). Breast Cancer Multi-classification
from Histopathological Images with Structured Deep
Learning Model. Scientific Reports. https://doi.org/
10.1038/s41598-017-04075-z.
Huang, G., Liu, Z., Maaten, L. V. D., and Weinberger, K. Q.
(2017). Densely connected convolutional networks.
In 2017 IEEE Conference on Computer Vision and
Pattern Recognition (CVPR), pages 2261–2269, Los
Alamitos, CA, USA. IEEE Computer Society.
IARC (2023). Cancer Today. https://gco.iarc.fr/today/online-analysis-multi-bars?
Karras, T., Aittala, M., Hellsten, J., Laine, S., Lehtinen, J., and Aila, T. (2020). Training Generative Adversarial Networks with Limited Data. https://doi.org/10.48550/arXiv.2006.06676. Accessed: 24/04/2023.
Kim, S.-W., Roh, J., and Park, C.-S. (2016). Immuno-
histochemistry for Pathologists: Protocols, Pitfalls,
and Tips. Journal of Pathology and Translational
Medicine 2016; 50: 411-418. https://doi.org/10.4132/
jptm.2016.08.08.
Krinski, B. A., Ruiz, D. V., Laroca, R., and Todt, E. (2023).
DACov: a deeper analysis of data augmentation on the
computed tomography segmentation problem. Com-
puter Methods in Biomechanics and Biomedical En-
gineering: Imaging & Visualization, 0(0):1–18. https:
//doi.org/10.1080/21681163.2023.2183807.
Laurinavicius, A., Plancoulaine, B., Herlin, P., and Lauri-
naviciene, A. (2016). Comprehensive Immunohisto-
chemistry: Digital, Analytical and Integrated. Patho-
biology 2016;83:156-163. https://doi.org/10.1159/
000442389.
Maleki, F., Muthukrishnan, N., Ovens, K., Md, C., and Forghani, R. (2020). Machine Learning Algorithm Validation. Neuroimaging Clinics of North America, 30:433–445. http://dx.doi.org/10.1016/j.nic.2020.08.004. Accessed: 25/04/2023.
Mouelhi, A., Rmili, H., Ali, J. B., Sayadi, M., Doghri,
R., and Mrad, K. (2018). Fast unsupervised nuclear
segmentation and classification scheme for automatic
allred cancer scoring in immunohistochemical breast
tissue images. Computer Methods and Programs in
Biomedicine. https://doi.org/10.1016/j.cmpb.2018.08.
005.
Mridha, M. F., Morol, M. K., Ali, M. A., and Shovon,
M. S. H. (2022). convoHER2: A Deep Neural Net-
work for Multi-Stage Classification of HER2 Breast
Cancer. AIUB Journal of Science and Engineering
(AJSE). https://doi.org/10.53799/ajse.v22i1.477.
Mukherkjee, D., Saha, P., Kaplun, D., Sinitca, A., and Sarkar, R. (2022). Brain tumor image generation using an aggregation of GAN models with style transfer. Sci Rep. 2022; 12: 9141. https://doi.org/10.1038/s41598-022-12646-y.
Osuala, R., Kushibar, K., Garrucho, L., Linardos, A.,
Szafranowska, Z., Klein, S., Glocker, B., Diaz, O.,
and Lekadir, K. (2023). Data synthesis and adver-
sarial networks: A review and meta-analysis in can-
cer imaging. Medical Image Analysis, 84:102704.
https://doi.org/10.1016/j.media.2022.102704.
Rmili, H., Mouelhi, A., Solaiman, B., Doghri, R., and
Labidi, S. (2022). A novel pre-processing approach
based on colour space assessment for digestive neu-
roendocrine tumour grading in immunohistochemi-
cal tissue images. Pol J Pathol. 2022;73(2):134-158.
https://doi.org/10.5114/pjp.2022.119841.
Rogalsky, J. E. (2021). Semi-automatic ER and PR scoring
in immunohistochemistry H-BAD breast cancer im-
ages. https://acervodigital.ufpr.br/handle/1884/73470.
Rogalsky, J. E., Ioshii, S. O., and de Oliveira, L. F. (2021). Automatic ER and PR scoring in Immunohistochemistry H-DAB Breast Cancer images. XXI Simpósio Brasileiro de Computação Aplicada à Saúde. https://doi.org/10.5753/sbcas.2021.16075.
Tang, Z., Chuang, K. V., DeCarli, C., Jin, L.-W., Beckett,
L., Keiser, M. J., and Dugger, B. N. (2019). Inter-
pretable classification of Alzheimer’s disease patholo-
gies with a convolutional neural network pipeline.
Nature Communications 10. https://doi.org/10.1038/
s41467-019-10212-1.
WHO (2022). Cancer. https://www.who.int/news-room/
fact-sheets/detail/cancer.
Yip, C.-H. and Rhodes, A. (2014). Estrogen and pro-
gesterone receptors in breast cancer. Future Oncol-
ogy, 10(14), 2293-2301. https://doi.org/10.2217/fon.
14.110.