Breast Cancer Image Classification Using Deep Learning and Test-Time Augmentation

João Fernando Mari¹, Larissa Ferreira Rodrigues Moreira¹, Leandro Henrique Furtado Pinto Silva¹, Mauricio C. Escarpinati² and André R. Backes³

¹Institute of Exact and Technological Sciences, Federal University of Viçosa - UFV, Rio Paranaíba-MG, Brazil
²School of Computer Science, Federal University of Uberlândia, Uberlândia-MG, Brazil
³Department of Computing, Federal University of São Carlos, São Carlos-SP, Brazil

{joaof.mari, larissa.f.rodrigues, leandro.furtado}@ufv.br, mauricio@ufu.br, arbackes@yahoo.com.br
Keywords: Breast Cancer, Deep Learning, Test-Time Augmentation, Image Classification.
Abstract:
Deep learning-based computer vision methods can improve diagnostic accuracy, efficiency, and productivity.
While traditional approaches primarily apply Data Augmentation (DA) during the training phase, Test-Time
Augmentation (TTA) offers a complementary strategy to improve the predictive capabilities of trained models
without increasing training time. In this study, we propose a simple and effective TTA strategy to enhance the
classification of histopathological images of breast cancer. After optimizing hyperparameters, we evaluated the
TTA strategy across all magnifications of the BreakHis dataset using three deep learning architectures, trained
with and without DA. We compared five sets of transformations and multiple prediction rounds. The proposed
strategy significantly improved the mean accuracy across all magnifications, demonstrating its effectiveness in
improving model performance.
1 INTRODUCTION
Histopathological image classification plays a crucial
role in diagnosing breast cancer. Pathologists an-
alyze microscopic slides of breast tissue at various
magnifications to identify tumor characteristics, such
as determining whether a tumor is benign or malig-
nant. However, manual analysis is often subjective,
time-consuming, and prone to variability among ex-
perts. To address these limitations, computer vision
and deep learning techniques have been increasingly
adopted, offering improved diagnostic accuracy and
efficiency (Gautam, 2023).
Traditional deep learning methods for image
classification commonly employ data augmentation
(DA) during training to enhance model generaliza-
tion (da Silva et al., 2020; Gautam, 2023; Barbosa
et al., 2024). Although DA is effective, it does not
leverage the potential of augmentations during the
testing phase. Test-Time Augmentation (TTA) ex-
tends the application of data transformations to in-
ference, enhancing the model’s predictions without
incurring additional training costs (Calvo-Zaragoza
et al., 2020; Shanmugam et al., 2021; Valero-Mas
et al., 2024). Previous research has shown the benefits
of TTA in various medical imaging tasks, including
skin cancer classification and bone fracture detection,
demonstrating its ability to improve predictive accu-
racy in different datasets and architectures (Shorten
and Khoshgoftaar, 2019; Garcea et al., 2023).
Despite its proven effectiveness, TTA has only
been minimally explored in the context of histopatho-
logical image classification of breast cancer. Most
studies focus on specific magnification levels or mod-
els, neglecting a broader evaluation across magnifica-
tions and architectures (Gupta et al., 2021; Oza et al.,
2024). This leaves a significant gap in understand-
ing TTA's potential in multi-resolution medical imag-
ing scenarios, such as the BreakHis dataset (Spanhol
et al., 2016).
In this paper, we propose a simple and effec-
tive TTA strategy to improve the classification of
histopathological images of breast cancer. Using
the BreakHis dataset, we evaluate TTA across all
magnification levels and analyze its impact on three
deep learning architectures: ResNet-50, Vision Trans-
former (ViT), and Swin Transformer V2. Our study
also compares the effects of training-time DA on the
performance of TTA.
The main contributions of this work are: (i) a
comprehensive evaluation of TTA across four magnification levels (40×, 100×, 200×, and 400×) in the BreakHis dataset, (ii) an analysis of TTA's impact on three deep learning architectures trained both with and without DA, (iii) identification of the best transformation sets for TTA in breast cancer image classification, and (iv) a demonstration of TTA's ability to improve model generalization and accuracy under diverse testing conditions.
The remainder of this paper is organized as fol-
lows. Section 2 reviews related work. We detail the
materials and methods used in Section 3. Section
4 presents the experimental design, while Section 5
presents and discusses the results obtained. Finally,
Section 6 concludes with future research directions.
2 RELATED WORK
Nguyen et al. (2019) developed an approach to breast cancer classification using weighted feature selection with a Convolutional Neural Network (CNN) classifier, incorporating a Test-Time Augmentation (TTA) strategy that transformed test images with horizontal and vertical flips and 90° rotations to improve classification accuracy. Gupta et al. (2021) proposed
a modified ResNet to classify images as benign or
malignant on the BreakHis dataset by incorporating
TTA. However, their Residual network was trained
exclusively using 40× magnification images.
Kandel and Castelli (2021) investigated the im-
pact of TTA on X-ray images for bone fracture detec-
tion using five CNN architectures and the combina-
tion of different Data Augmentation (DA) techniques
based on rotation, flips, and zoom. In Nanni et al. (2021), the authors evaluated different datasets, including
virus textures, and explored the effectiveness of vari-
ous DA techniques such as kernel filters, color space
transformations, geometric transformations, random
erasing/cutting, and image mixing.
Jiahao et al. (2021) proposed a method based on
the EfficientNet architecture to identify skin cancer
in dermoscopy images by leveraging DA strategies
during training and Test-Time Augmentation during
inference to improve classification accuracy. In the
same application context, Goceri (2023) applied TTA
to identify skin cancer and evaluated three CNN ar-
chitectures: DenseNet, ResNet, and VGG.
Müller et al. (2022) evaluated the use of TTA
on different medical imaging datasets, including
histopathological images of colorectal cancer. More-
over, they tested different CNN architectures and ob-
served that TTA is a promising technique that does
not require additional training time.
Oza et al. (2024) investigated the use of deep
learning to diagnose breast cancer from mammo-
grams with abnormal lesions, using a transfer learning
strategy and TTA to enhance the classification perfor-
mance. They evaluated four pre-trained CNN models
and exploited DA based on rotation with various angles, horizontal flip, zoom, and shearing.
Unlike the aforementioned studies, to the best of
our knowledge, this is the first study to explore TTA
across all magnifications of the BreakHis dataset. We
used modern deep learning architectures, including
ResNet-50, ViT-16, and Swin Transformer V2, en-
hanced by hyperparameter optimization. Moreover,
TTA enhances the capability of the model to adapt
to various testing conditions, ensuring a more re-
liable assessment of breast cancer classification in
histopathological images across different image mag-
nifications.
3 MATERIAL AND METHODS
3.1 Dataset
We used the BreakHis dataset¹ (Spanhol et al., 2016,
2017), a well-established benchmark for breast cancer
histopathological image classification. The dataset
comprises 7,909 microscopic images of breast tumor
tissue collected from 82 patients. These images are
captured at four magnification levels: 40×, 100×,
200×, and 400×. Each image is labeled benign or
malignant, with 2,480 images in the benign class and
5,429 in the malignant class. This diversity in magni-
fication levels allows us to evaluate the classification
models under varying resolutions, a common scenario
in histopathological analysis. The dataset includes five predefined random folds to ensure robust evaluation.
Each fold features a patient-wise split, where all im-
ages from a single patient are allocated exclusively to
the training or test sets. This approach avoids over-
lap and ensures the models are tested on unseen data,
replicating real-world scenarios where generalization
to new patients is critical.
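To make the patient-wise constraint concrete, the sketch below shows one way such a split can be derived. This is illustrative only, since BreakHis already ships with the five predefined folds; the helper name and inputs are assumptions.

```python
# Illustrative patient-wise split. BreakHis already provides five
# predefined patient-wise folds, so this only demonstrates the principle
# that all images from one patient land on the same side of the split.
from sklearn.model_selection import GroupShuffleSplit

def patient_wise_split(image_paths, patient_ids, test_size=0.3, seed=0):
    """Split images so that no patient appears in both train and test."""
    splitter = GroupShuffleSplit(n_splits=1, test_size=test_size,
                                 random_state=seed)
    train_idx, test_idx = next(splitter.split(image_paths,
                                              groups=patient_ids))
    train = [image_paths[i] for i in train_idx]
    test = [image_paths[i] for i in test_idx]
    return train, test
```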
3.2 Architectures
We selected three deep learning architectures to eval-
uate our approaches: ResNet-50 (He et al., 2016), ViT
b16 (Dosovitskiy et al., 2021), and Swin Transformer
V2 (Liu et al., 2022). These architectures are widely
¹https://web.inf.ufpr.br/vri/databases/breast-cancer-histopathological-database-breakhis/
recognized for their high performance in tasks related
to image classification and object detection.
ResNet-50 introduced residual learning, enabling
the training of deeper networks than earlier CNNs.
This architecture mitigates the vanishing gradient
problem using residual blocks comprising convolu-
tional layers and skip connections. Skip connections
bypass one or more residual blocks, allowing the net-
work to learn identity mappings and enhancing gradi-
ent flow during backpropagation (He et al., 2016).
ViT b16 (Vision Transformer) leverages a
Transformer-based architecture originally designed
for natural language processing and adapts it to com-
puter vision tasks. Proposed by Dosovitskiy et al.
(2021), ViT divides input images into fixed-size
patches, treating each patch as a token. These tokens
are processed through a series of self-attention mech-
anisms to learn embeddings, enabling the model to
capture global relationships across the image.
Swin Transformer V2, introduced by Liu et al.
(2022), enhances the Swin Transformer with a hier-
archical architecture that uses windowing for local
attention and improves the computational efficiency
over ViT’s global attention. By processing smaller
image windows and shifting them across layers, it ef-
fectively captures both the local and global contexts,
making it highly suitable for image classification.
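As a reference for how these backbones can be instantiated, the sketch below loads the ImageNet-pre-trained models from torchvision and replaces each classification head for the binary task. The specific weight enums are those available in torchvision 0.17 and should be treated as an assumption of this sketch, not as the authors' exact code.

```python
import torch.nn as nn
from torchvision import models

def build_model(name: str, num_classes: int = 2) -> nn.Module:
    """Load an ImageNet-pre-trained backbone and swap its classification head."""
    if name == "resnet50":
        m = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)
        m.fc = nn.Linear(m.fc.in_features, num_classes)
    elif name == "vit_b_16":
        m = models.vit_b_16(weights=models.ViT_B_16_Weights.IMAGENET1K_V1)
        m.heads.head = nn.Linear(m.heads.head.in_features, num_classes)
    elif name == "swin_v2_b":
        m = models.swin_v2_b(weights=models.Swin_V2_B_Weights.IMAGENET1K_V1)
        m.head = nn.Linear(m.head.in_features, num_classes)
    else:
        raise ValueError(f"unknown architecture: {name}")
    return m  # all layers are left unfrozen for full fine-tuning
```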
4 EXPERIMENT DESIGN
In this study, we designed experiments to evaluate
the effectiveness of TTA strategies for histopatho-
logical image classification of breast cancer using
the BreakHis dataset (Section 3.1). The dataset is
provided with five randomized folds, already split
patient-wise into training and test sets. For each fold,
we further split 20% of the training set, on an image-
wise basis, to construct a validation set.
The validation set served two purposes: hyper-
parameter optimization and early stopping. Hyper-
parameter optimization focused on selecting the best
values for batch size (BS) and learning rate (LR). A
grid search approach was used, exploring BS values
of 16, 32, 64, 128 and LR values of 0.01, 0.001,
0.0001, 0.00001. To reduce computational demands,
hyperparameter optimization was conducted only on
the first fold using images with a magnification of
40× for each architecture. The best-performing hy-
perparameters were then applied to all folds and mag-
nifications.
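The grid search itself reduces to an exhaustive loop over the 16 (BS, LR) pairs. In the sketch below, train_on_fold1_40x and validation_accuracy are hypothetical helpers wrapping model training on fold 1 at 40× and validation evaluation.

```python
# Exhaustive grid search over batch size and learning rate.
import itertools

BATCH_SIZES = [16, 32, 64, 128]
LEARNING_RATES = [0.01, 0.001, 0.0001, 0.00001]

def grid_search(train_on_fold1_40x, validation_accuracy):
    best_params, best_acc = None, -1.0
    for bs, lr in itertools.product(BATCH_SIZES, LEARNING_RATES):
        model = train_on_fold1_40x(batch_size=bs, learning_rate=lr)
        acc = validation_accuracy(model)
        if acc > best_acc:
            best_params, best_acc = (bs, lr), acc
    return best_params, best_acc
```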
The Adam optimizer and cross-entropy loss func-
tion were used to fine-tune models pre-trained on the
ImageNet dataset (Deng et al., 2009). All layers of the
models were unfrozen during training to enable fine-
tuning. A learning rate scheduler was implemented to
reduce the learning rate if the validation loss did not
improve after ten consecutive epochs (reduce LR on
plateau). An early stopping strategy was also applied,
halting training if the validation loss failed to improve
after 21 epochs.
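A minimal PyTorch sketch of this training recipe follows; the epoch cap and device handling are assumptions, while the optimizer, loss, scheduler patience (10), and early-stopping patience (21) follow the values stated above.

```python
import torch

def fine_tune(model, train_loader, val_loader, lr, max_epochs=200, device="cuda"):
    """Adam + cross-entropy fine-tuning with reduce-LR-on-plateau and
    early stopping on the validation loss (sketch, not the authors' code)."""
    model.to(device)
    criterion = torch.nn.CrossEntropyLoss()
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, patience=10)
    best_loss, stale = float("inf"), 0
    for epoch in range(max_epochs):
        model.train()
        for x, y in train_loader:
            x, y = x.to(device), y.to(device)
            optimizer.zero_grad()
            loss = criterion(model(x), y)
            loss.backward()
            optimizer.step()
        # Evaluate the validation loss for scheduling and early stopping.
        model.eval()
        val_loss = 0.0
        with torch.no_grad():
            for x, y in val_loader:
                x, y = x.to(device), y.to(device)
                val_loss += criterion(model(x), y).item() * x.size(0)
        val_loss /= len(val_loader.dataset)
        scheduler.step(val_loss)
        if val_loss < best_loss:
            best_loss, stale = val_loss, 0
        else:
            stale += 1
            if stale >= 21:  # early stopping
                break
    return model
```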
Figure 1 illustrates the experimental design, which
includes the steps of data splitting, hyperparameter
optimization, model training with and without data
augmentation, and prediction using the TTA strategy.
This systematic approach ensures the reliable evalua-
tion of TTA's impact on classification performance.
Figure 1: The experiment design illustration. (a) We split
the training set to create a validation set. (b) Hyperparame-
ter optimization. (c) Training the models with and without
DA. (d) Prediction with the proposed TTA strategy.
4.1 Training-Time Data Augmentation
We trained one model for each architecture, magni-
fication, and fold, following the hyperparameter op-
timization and training strategies outlined in Section
4. This training was conducted with and without a
DA strategy. Three different transformation pipelines
were used for the datasets: T-1 was applied to the vali-
dation and test sets, T-2 was applied to the training set
for models trained without DA, and T-3 was applied
to the training set for models trained with DA.
In all cases, the images were normalized using the
mean and standard deviation of the ImageNet dataset
to ensure compatibility with pre-trained models. The
transformations are detailed as follows. T-1: the images were resized to 256×256 pixels, followed by a center crop to 224×224 pixels. T-2: random resized cropping with patches covering between 80% and 100% of the original image size. T-3: a sequence of augmentations including random horizontal flipping; random rotation between -15° and 15°; random resized cropping with patches covering 80% to 100% of the original image size; color jittering, with brightness, contrast, and saturation adjusted by a factor randomly chosen between 0.8 and 1.2; and random erasing, with patches covering 2% to 20% of the original image size. These transformations were carefully
selected to enhance the models’ generalization capa-
bilities, particularly when using DA, and to provide a
consistent testing baseline for fair comparison.
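In torchvision terms, the three pipelines can be sketched as below; the 224×224 target size for the random crops and the exact parameter spellings are assumptions consistent with the description above.

```python
from torchvision import transforms

# ImageNet statistics, required for compatibility with pre-trained models.
normalize = transforms.Normalize([0.485, 0.456, 0.406],
                                 [0.229, 0.224, 0.225])

# T-1: validation/test pipeline (resize, center crop).
T1 = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    normalize,
])

# T-2: training without DA (random resized crop, 80-100% of the image).
T2 = transforms.Compose([
    transforms.RandomResizedCrop(224, scale=(0.8, 1.0)),
    transforms.ToTensor(),
    normalize,
])

# T-3: training with DA (flip, rotation, crop, jitter, erasing).
T3 = transforms.Compose([
    transforms.RandomHorizontalFlip(),
    transforms.RandomRotation(15),
    transforms.RandomResizedCrop(224, scale=(0.8, 1.0)),
    transforms.ColorJitter(brightness=(0.8, 1.2), contrast=(0.8, 1.2),
                           saturation=(0.8, 1.2)),
    transforms.ToTensor(),
    normalize,
    transforms.RandomErasing(scale=(0.02, 0.2)),  # erasing acts on the tensor
])
```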
4.2 Test-Time Augmentation (TTA)
In this work, we propose a simple but effective TTA
strategy to enhance the accuracy of classification
models with minimal computational overhead. The
approach involves making multiple predictions for
each test image by applying a DA strategy to generate
augmented versions of the image. The final prediction
is determined using a majority voting mechanism, as
defined by Equation 1.
$$\hat{y} = \begin{cases} 1, & \text{if } \sum_{i=1}^{N} f(T(x)) > \frac{N}{2}, \\ 0, & \text{otherwise,} \end{cases} \qquad (1)$$

where $\hat{y}$ is the final prediction label, $N$ denotes the number of TTA prediction rounds, and $T(x)$ refers to the transformation set applied to the test image $x$. The function $f$ represents the deep learning model used to predict the output for the transformed image $T(x)$. The term $\sum_{i=1}^{N} f(T(x))$ is the sum of the binary predictions (0 or 1) for each transformed image, and $N/2$ serves as the decision threshold. If the sum is greater than $N/2$, more than half of the predictions are 1, and the final prediction $\hat{y}$ is set to 1; otherwise, $\hat{y}$ is set to 0.
This majority voting strategy aggregates predictions
from multiple augmented versions of the test image
to improve robustness.
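A minimal sketch of this majority-voting prediction, assuming a PIL input image and a model returning class logits, is:

```python
import torch

@torch.no_grad()
def tta_predict(model, image, tta_transform, n_rounds, device="cuda"):
    """Majority-vote TTA for one binary-classified image (Equation 1).
    `tta_transform` is one of the transformation sets (e.g., T-A ... T-E)."""
    model.eval()
    votes = 0
    for _ in range(n_rounds):
        x = tta_transform(image).unsqueeze(0).to(device)
        votes += model(x).argmax(dim=1).item()  # binary prediction in {0, 1}
    return 1 if votes > n_rounds / 2 else 0
```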
We evaluated five transformation sets for TTA during the prediction phase. T-A: random resized crop with patches between 80% and 100% of the original image size. T-B: random resized crop with patches between 50% and 100% of the original image size. T-C: random horizontal flip, followed by random rotation (-15° to 15°) and a random resized crop with patches between 80% and 100% of the original size. T-D: random horizontal flip, followed by random rotation (-15° to 15°) and a random resized crop with patches between 50% and 100% of the original size. T-E: identical to the transformation set T-3 described in Section 4.1, combining random horizontal flip, random rotation, random resized crop, color jittering, and random erasing. This TTA strategy leverages multiple augmentations to reduce prediction uncertainty and improve the robustness of the model by considering diverse perspectives of the same input image.
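Under the same assumptions as the Section 4.1 sketch (224×224 crops, ImageNet normalization), the five sets can be written as:

```python
from torchvision import transforms

normalize = transforms.Normalize([0.485, 0.456, 0.406],
                                 [0.229, 0.224, 0.225])

def rrc(lo):
    # Random resized crop covering lo..100% of the original image.
    return transforms.RandomResizedCrop(224, scale=(lo, 1.0))

flip_rot = [transforms.RandomHorizontalFlip(), transforms.RandomRotation(15)]
tail = [transforms.ToTensor(), normalize]

TTA_SETS = {
    "T-A": transforms.Compose([rrc(0.8)] + tail),
    "T-B": transforms.Compose([rrc(0.5)] + tail),
    "T-C": transforms.Compose(flip_rot + [rrc(0.8)] + tail),
    "T-D": transforms.Compose(flip_rot + [rrc(0.5)] + tail),
    # T-E mirrors the training-time T-3 pipeline, rebuilt inline here.
    "T-E": transforms.Compose(flip_rot + [
        rrc(0.8),
        transforms.ColorJitter(brightness=(0.8, 1.2), contrast=(0.8, 1.2),
                               saturation=(0.8, 1.2)),
    ] + tail + [transforms.RandomErasing(scale=(0.02, 0.2))]),
}
```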
5 RESULTS AND DISCUSSION
For the experiments, we used a PC running Linux Ubuntu 22.04 LTS, equipped with an Intel Core i5-12400 CPU (6 cores, up to 4.40 GHz), 32 GB of RAM, and an NVIDIA RTX 4090 GPU with 24 GB of memory. The experiments were developed using Python
3.10, PyTorch 2.2.2, torchvision 0.17.2 with CUDA
Toolkit 10.1, and Scikit-learn 1.4.2. The pre-trained
models were obtained from the torchvision library.
To evaluate the performance of our method, we
used accuracy, which measures the proportion of
correctly classified samples out of the total number of
samples, providing a straightforward metric to evalu-
ate the overall effectiveness of the models in classify-
ing histopathological images.
Table 1 presents the results of the hyperparame-
ter optimization described in Section 4. Each row
lists the BS and LR that achieved the highest valida-
tion accuracy for the respective architecture. We also
included the number of epochs required before early
stopping was triggered. These optimized BS and LR
values were consistently applied to train all models in
this study.
Table 1: Optimized hyperparameter values for each architecture.

Architecture     BS  LR      Acc. Val.  Epochs
ResNet-50        32  0.0001  0.9820     14
ViT b16          64  0.0001  0.9910     28
Swin T. V2 base  16  0.0001  0.9930      9
Table 2 presents the test accuracy for models
trained with and without DA. Since the BreakHis
dataset consists of five folds, the reported values rep-
resent the mean accuracy across these folds. The re-
sults demonstrate that applying DA during training
improves the test accuracy using the standard predic-
tion strategy, i.e., without employing TTA. Bold val-
ues indicate the highest accuracy achieved for each
magnification level and architecture. The values in
this table provide the baseline for evaluating the TTA
strategies.
Table 2: Test accuracy obtained with the standard test strategy for models trained with and without DA.

        ResNet-50         ViT b16           Swin T. V2 base
Mag.    No DA    DA       No DA    DA       No DA    DA
40×     0.8483   0.8798   0.8394   0.8582   0.8883   0.8963
100×    0.8578   0.8750   0.8298   0.8525   0.8929   0.8785
200×    0.8692   0.8866   0.8724   0.8825   0.8913   0.8979
400×    0.8170   0.8679   0.8167   0.8426   0.8558   0.8608
Mean    0.8481   0.8773   0.8396   0.8590   0.8821   0.8834
Tables 3 and 4 summarize the mean accuracy
across all magnifications (40×, 100×, 200×, and
400×) achieved using the TTA strategy for each trans-
formation set (T-A, T-B, T-C, T-D, and T-E) with 1, 5,
Table 3: The test accuracy obtained through the TTA strategy with 1, 5, 9, 15, and 25 rounds of predictions considering five different transformation sets (TS) with the models trained without DA (No DA).
Accuracy (Improvement over No DA) [t-value]
Arch. TS No DA 1 5 9 15 25
ResNet-50
T-A 0.8481 0.8548 (+0.0067) [3.57] 0.8551 (+0.0070) [3.07] 0.8549 (+0.0068) [3.06] 0.8551 (+0.0070) [3.16] 0.8551 (+0.0070) [3.16]
T-B 0.8481 0.8523 (+0.0042) [3.33] 0.8554 (+0.0073) [7.46] 0.8561 (+0.0080) [10.01] 0.8567 (+0.0086) [7.38] 0.8554 (+0.0073) [8.33]
T-C 0.8481 0.8507 (+0.0026) [0.57] 0.8519 (+0.0038) [0.91] 0.8515 (+0.0034) [0.80] 0.8534 (+0.0053) [1.14] 0.8542 (+0.0061) [1.50]
T-D 0.8481 0.8563 (+0.0082) [2.85] 0.8607 (+0.0126) [6.56] 0.8601 (+0.0120) [9.56] 0.8616 (+0.0136) [8.52] 0.8610 (+0.0129) [9.38]
T-E 0.8481 0.7969 (-0.0512) [-5.92] 0.8235 (-0.0246) [-4.30] 0.8283 (-0.0198) [-3.63] 0.8342 (-0.0139) [-3.70] 0.8360 (-0.0121) [-2.43]
ViT b16
T-A 0.8396 0.8414 (+0.0019) [0.67] 0.8415 (+0.0019) [0.63] 0.8409 (+0.0014) [0.47] 0.8408 (+0.0012) [0.41] 0.8406 (+0.0010) [0.36]
T-B 0.8396 0.8406 (+0.0011) [0.45] 0.8480 (+0.0085) [6.18] 0.8494 (+0.0098) [5.92] 0.8490 (+0.0095) [6.39] 0.8489 (+0.0093) [8.23]
T-C 0.8396 0.8451 (+0.0056) [1.27] 0.8494 (+0.0098) [2.82] 0.8484 (+0.0088) [2.65] 0.8503 (+0.0108) [3.29] 0.8503 (+0.0108) [2.81]
T-D 0.8396 0.8445 (+0.0049) [1.24] 0.8532 (+0.0136) [3.35] 0.8549 (+0.0153) [3.48] 0.8541 (+0.0146) [3.05] 0.8556 (+0.0160) [3.38]
T-E 0.8396 0.8288 (-0.0108) [-1.58] 0.8442 (+0.0046) [1.16] 0.8481 (+0.0085) [2.69] 0.8507 (+0.0112) [4.72] 0.8504 (+0.0109) [3.59]
Swin T. V2
base
T-A 0.8821 0.8805 (-0.0016) [-0.43] 0.8810 (-0.0010) [-0.29] 0.8806 (-0.0014) [-0.38] 0.8806 (-0.0014) [-0.39] 0.8806 (-0.0014) [-0.39]
T-B 0.8821 0.8813 (-0.0008) [-0.49] 0.8845 (+0.0025) [1.40] 0.8855 (+0.0034) [2.29] 0.8850 (+0.0029) [2.34] 0.8852 (+0.0031) [1.87]
T-C 0.8821 0.8822 (+0.0001) [0.04] 0.8860 (+0.0039) [1.27] 0.8870 (+0.0050) [1.35] 0.8863 (+0.0043) [1.52] 0.8873 (+0.0052) [1.95]
T-D 0.8821 0.8851 (+0.0031) [1.75] 0.8883 (+0.0063) [5.21] 0.8886 (+0.0066) [2.86] 0.8885 (+0.0064) [2.69] 0.8893 (+0.0072) [3.51]
T-E 0.8821 0.8719 (-0.0101) [-2.11] 0.8822 (+0.0002) [0.04] 0.8827 (+0.0006) [0.15] 0.8842 (+0.0021) [0.53] 0.8852 (+0.0032) [0.97]
Table 4: The test accuracy obtained through the TTA strategy with 1, 5, 9, 15, and 25 rounds of predictions considering five different transformation sets (TS) with the models trained with DA.
Mean accuracy (Improvement over DA) [t-value]
Arch. TS DA 1 5 9 15 25
ResNet-50
T-A 0.8773 0.8792 (+0.0019) [0.45] 0.8787 (+0.0013) [0.34] 0.8786 (+0.0013) [0.32] 0.8785 (+0.0012) [0.29] 0.8785 (+0.0012) [0.29]
T-B 0.8773 0.8793 (+0.0020) [0.99] 0.8813 (+0.0039) [1.29] 0.8818 (+0.0045) [1.34] 0.8832 (+0.0058) [2.25] 0.8827 (+0.0054) [1.95]
T-C 0.8773 0.8778 (+0.0004) [0.10] 0.8791 (+0.0017) [0.46] 0.8798 (+0.0024) [0.64] 0.8809 (+0.0036) [0.84] 0.8805 (+0.0031) [0.73]
T-D 0.8773 0.8811 (+0.0038) [1.60] 0.8828 (+0.0055) [1.79] 0.8844 (+0.0070) [2.25] 0.8851 (+0.0077) [2.54] 0.8847 (+0.0074) [2.39]
T-E 0.8773 0.8694 (-0.0079) [-2.11] 0.8799 (+0.0026) [0.59] 0.8828 (+0.0054) [1.25] 0.8827 (+0.0054) [1.29] 0.8834 (+0.0061) [1.39]
ViT b16
T-A 0.8590 0.8581 (-0.0008) [-0.29] 0.8583 (-0.0006) [-0.24] 0.8583 (-0.0007) [-0.25] 0.8581 (-0.0009) [-0.33] 0.8581 (-0.0009) [-0.34]
T-B 0.8590 0.8606 (+0.0017) [0.86] 0.8646 (+0.0056) [2.48] 0.8644 (+0.0054) [2.41] 0.8656 (+0.0066) [2.26] 0.8650 (+0.0060) [1.94]
T-C 0.8590 0.8602 (+0.0012) [0.60] 0.8629 (+0.0040) [2.73] 0.8623 (+0.0033) [2.17] 0.8625 (+0.0035) [2.02] 0.8622 (+0.0032) [1.87]
T-D 0.8590 0.8646 (+0.0057) [1.93] 0.8650 (+0.0061) [3.04] 0.8663 (+0.0073) [3.49] 0.8660 (+0.0070) [3.37] 0.8659 (+0.0070) [4.36]
T-E 0.8590 0.8535 (-0.0055) [-1.81] 0.8565 (-0.0025) [-0.77] 0.8595 (+0.0005) [0.21] 0.8592 (+0.0002) [0.11] 0.8611 (+0.0022) [0.84]
Swin T. V2
base
T-A 0.8834 0.8846 (+0.0013) [0.40] 0.8845 (+0.0011) [0.36] 0.8844 (+0.0010) [0.31] 0.8844 (+0.0010) [0.31] 0.8844 (+0.0010) [0.31]
T-B 0.8834 0.8844 (+0.0011) [0.95] 0.8860 (+0.0027) [1.41] 0.8868 (+0.0034) [2.41] 0.8871 (+0.0037) [2.66] 0.8869 (+0.0036) [2.58]
T-C 0.8834 0.8831 (-0.0003) [-0.08] 0.8857 (+0.0023) [0.50] 0.8864 (+0.0030) [0.76] 0.8859 (+0.0025) [0.67] 0.8860 (+0.0026) [0.74]
T-D 0.8834 0.8829 (-0.0005) [-0.16] 0.8858 (+0.0024) [1.26] 0.8858 (+0.0024) [1.38] 0.8864 (+0.0031) [2.14] 0.8868 (+0.0035) [2.53]
T-E 0.8834 0.8812 (-0.0022) [-0.83] 0.8841 (+0.0007) [0.16] 0.8852 (+0.0018) [0.51] 0.8866 (+0.0032) [1.02] 0.8868 (+0.0035) [1.07]
Table 5: The test accuracy obtained through the TTA strategy with 1, 5, 9, 15, and 25 rounds of predictions considering the transformation set T-D with the models trained without DA.
Accuracy when using transformation set T-D (Improvement over No DA)
Arch. Mag. No DA 1 5 9 15 25
ResNet-50
(T-D)
40× 0.8483 0.8627 (+0.0144) 0.8671 (+0.0189) 0.8641 (+0.0158) 0.8664 (+0.0182) 0.8648 (+0.0165)
100× 0.8578 0.8704 (+0.0126) 0.8699 (+0.0120) 0.8700 (+0.0122) 0.8726 (+0.0148) 0.8722 (+0.0143)
200× 0.8692 0.8690 (-0.0002) 0.8776 (+0.0084) 0.8802 (+0.0110) 0.8794 (+0.0101) 0.8785 (+0.0092)
400× 0.8170 0.8231 (+0.0061) 0.8282 (+0.0112) 0.8259 (+0.0089) 0.8281 (+0.0111) 0.8285 (+0.0115)
Mean: 0.8481 0.8563 (+0.0082) 0.8607 (+0.0126) 0.8601 (+0.0120) 0.8616 (+0.0136) 0.8610 (+0.0129)
ViT b16
(T-D)
40× 0.8394 0.8341 (-0.0053) 0.8422 (+0.0028) 0.8417 (+0.0023) 0.8407 (+0.0013) 0.8412 (+0.0018)
100× 0.8298 0.8467 (+0.0170) 0.8551 (+0.0253) 0.8568 (+0.0270) 0.8575 (+0.0278) 0.8581 (+0.0283)
200× 0.8724 0.8754 (+0.0030) 0.8835 (+0.0111) 0.8873 (+0.0149) 0.8842 (+0.0118) 0.8884 (+0.0160)
400× 0.8167 0.8216 (+0.0050) 0.8319 (+0.0153) 0.8338 (+0.0171) 0.8341 (+0.0174) 0.8346 (+0.0179)
Mean: 0.8396 0.8445 (+0.0049) 0.8532 (+0.0136) 0.8549 (+0.0153) 0.8541 (+0.0146) 0.8556 (+0.0160)
Swin T. V2
base (T-D)
40× 0.8883 0.8969 (+0.0086) 0.8987 (+0.0104) 0.9021 (+0.0137) 0.9028 (+0.0145) 0.9023 (+0.0139)
100× 0.8929 0.8964 (+0.0035) 0.8980 (+0.0052) 0.9002 (+0.0074) 0.8980 (+0.0051) 0.8996 (+0.0068)
200× 0.8913 0.8920 (+0.0007) 0.8966 (+0.0053) 0.8937 (+0.0024) 0.8941 (+0.0029) 0.8943 (+0.0031)
400× 0.8558 0.8552 (-0.0006) 0.8600 (+0.0042) 0.8585 (+0.0027) 0.8589 (+0.0031) 0.8608 (+0.0050)
Mean: 0.8821 0.8851 (+0.0031) 0.8883 (+0.0063) 0.8886 (+0.0066) 0.8885 (+0.0064) 0.8893 (+0.0072)
Figure 2: Line charts illustrating the mean test accuracy achieved using the TTA strategy with 1, 5, 9, and 25 prediction
rounds, evaluated across the five transformation strategies applied to models trained without DA (solid lines) and with DA
(dashed lines).
Table 6: The test accuracy obtained through Test-Time Augmentation with 1, 5, 9, 15, and 25 rounds of predictions considering the best transformation set with the models trained with DA.
Accuracy when using the best transformation set (Improvement over DA)
Arch. Mag. DA 1 5 9 15 25
ResNet-50
(T-D)
40× 0.8798 0.8857 (+0.0059) 0.8875 (+0.0077) 0.8899 (+0.0101) 0.8908 (+0.0110) 0.8899 (+0.0101)
100× 0.8750 0.8821 (+0.0071) 0.8831 (+0.0081) 0.8819 (+0.0069) 0.8829 (+0.0079) 0.8836 (+0.0086)
200× 0.8866 0.8931 (+0.0065) 0.8977 (+0.0111) 0.9006 (+0.0140) 0.9007 (+0.0141) 0.9004 (+0.0138)
400× 0.8679 0.8636 (-0.0043) 0.8630 (-0.0049) 0.8650 (-0.0029) 0.8658 (-0.0021) 0.8651 (-0.0028)
Mean: 0.8773 0.8811 (+0.0038) 0.8828 (+0.0055) 0.8844 (+0.0070) 0.8851 (+0.0077) 0.8847 (+0.0074)
ViT b16
(T-D)
40× 0.8582 0.8694 (+0.0112) 0.8686 (+0.0104) 0.8711 (+0.0129) 0.8713 (+0.0131) 0.8701 (+0.0119)
100× 0.8525 0.8611 (+0.0086) 0.8619 (+0.0094) 0.8622 (+0.0097) 0.8605 (+0.0080) 0.8601 (+0.0076)
200× 0.8825 0.8783 (-0.0042) 0.8863 (+0.0038) 0.8850 (+0.0025) 0.8841 (+0.0016) 0.8860 (+0.0035)
400× 0.8426 0.8497 (+0.0071) 0.8433 (+0.0007) 0.8467 (+0.0041) 0.8481 (+0.0055) 0.8475 (+0.0049)
Mean: 0.8590 0.8646 (+0.0057) 0.8650 (+0.0061) 0.8663 (+0.0073) 0.8660 (+0.0070) 0.8659 (+0.0070)
Swin T. V2
base (T-B)
40× 0.8963 0.8969 (+0.0006) 0.9052 (+0.0089) 0.9039 (+0.0076) 0.9039 (+0.0076) 0.9045 (+0.0082)
100× 0.8785 0.8824 (+0.0039) 0.8810 (+0.0025) 0.8831 (+0.0046) 0.8833 (+0.0048) 0.8806 (+0.0021)
200× 0.8979 0.8957 (-0.0022) 0.8982 (+0.0003) 0.8989 (+0.0010) 0.8981 (+0.0002) 0.8990 (+0.0011)
400× 0.8608 0.8628 (+0.0020) 0.8598 (-0.0010) 0.8614 (+0.0006) 0.8630 (+0.0022) 0.8636 (+0.0028)
Mean: 0.8834 0.8844 (+0.0011) 0.8860 (+0.0027) 0.8868 (+0.0034) 0.8871 (+0.0037) 0.8869 (+0.0036)
Figure 3: Line charts illustrating the mean test accuracy achieved using the TTA strategy with 1, 5, 9, and 25 prediction rounds, evaluated across the four magnifications (40×, 100×, 200×, 400×), focusing exclusively on the transformation set T-D. The results are presented for models trained without DA (solid lines) and with DA (dashed lines).
9, 15, and 25 prediction rounds. Table 3 presents the
results for models trained without DA, while Table 4
shows results for models trained with DA. The val-
ues in parentheses indicate the improvement in accu-
racy compared to the standard prediction strategy. For
each row, the best accuracy, corresponding to the op-
timal number of prediction rounds, is italicized. Simi-
larly, the best accuracy for each column, representing
the most effective transformation set for each archi-
tecture, is highlighted in bold.
To assess the statistical significance of the im-
provements achieved by the TTA strategies compared
to the standard prediction, we conducted a T-test. The
resulting t-values are shown in brackets, with those
exceeding the critical value of 2.3534 highlighted in
bold, indicating that the corresponding TTA strategies
produced a meaningful improvement over the stan-
dard prediction method.
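A minimal sketch of this significance test, assuming lists of per-fold accuracies for the TTA and baseline predictions, is:

```python
# Paired t-test over the per-fold accuracies; t-values above the paper's
# critical threshold of 2.3534 are treated as significant improvements.
from scipy import stats

def tta_significance(tta_fold_accs, baseline_fold_accs):
    t_stat, p_value = stats.ttest_rel(tta_fold_accs, baseline_fold_accs)
    return t_stat, p_value
```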
To provide a clearer understanding of the data pre-
sented in Tables 3 and 4, Figure 2 illustrates line
charts for each architecture, highlighting the accuracy
values. The solid lines correspond to models trained
without DA, while the dashed lines represent models
trained with DA. Each line in the charts corresponds
to a specific transformation set, and the black horizon-
tal line indicates the baseline accuracy achieved with-
out applying TTA. This visualization emphasizes the
relative performance of each transformation set com-
pared to the baseline across all architectures.
These results indicate that the transformation set
T-D consistently outperforms others across most ar-
chitectures, regardless of the number of prediction
rounds. An exception is observed for the Swin Trans-
former V2 trained with DA, where the best results are
achieved using the transformation set T-B. It is im-
portant to point out that TTA with only one predic-
tion round represents a special case, as it lacks the
redundancy provided by multiple predictions, which
can enhance accuracy. Transformation sets T-C and
T-D are more aggressive than T-A and T-B due to the
inclusion of additional horizontal flips and random ro-
tation operations. Transformation set T-E, which in-
corporates even more aggressive augmentations, such
as color jittering and random erasing, does not per-
form well in the TTA strategy, likely because these
transformations significantly alter the image charac-
teristics, reducing the model’s ability to generalize.
Transformation sets T-A and T-B are similar, differing
only in the size of the crop regions in the random re-
sized crop operation. The same distinction applies to
T-C and T-D. Specifically, T-A and T-C crop regions
between 80% and 100% of the original image size,
while T-B and T-D crop regions between 50% and
100%. These findings suggest that cropping larger re-
gions is more effective for TTA, as it preserves more
contextual information in the input images, enhancing
the model’s ability to make accurate predictions.
Tables 5 and 6 present the results obtained when
applying the best transformation sets during the pre-
diction step, as identified in Tables 3 and 4. The T-D
transformation set yielded the highest accuracy across
magnifications for most models. The exception was
the Swin Transformer V2 trained with DA, where T-
B achieved the best performance. These tables pro-
vide detailed classification accuracy for each magnifi-
cation (40×, 100×, 200×, and 400×) and the overall
mean accuracy across all magnifications.
To provide a clearer visualization of the data pre-
sented in Tables 5 and 6, Figure 3 displays line charts
summarizing these values for each architecture, fo-
cusing on the best transformation set. The solid lines
correspond to models trained without DA, while the
dashed lines represent those trained with DA. Each
line in the charts indicates the performance for a
specific magnification level (40×, 100×, 200×, and
400×), while the horizontal lines denote the base-
line accuracy achieved without applying TTA for each
magnification. This visualization highlights the per-
formance improvements across magnifications and
the relative impact of TTA strategies compared to the
baseline.
These results indicate that the TTA strategy pro-
vides a more significant improvement when applied to
models trained without DA compared to those trained
with DA. The improvements for models trained with-
out DA were up to 0.0189, 0.0283, and 0.0145 for
ResNet-50, ViT b16, and Swin Transformer V2, re-
spectively. In contrast, for models trained with DA,
the gains were more modest, reaching up to 0.0141,
0.0131, and 0.0089 for ResNet-50, ViT b16, and Swin
Transformer V2, respectively. This can be attributed
to the fact that training-time DA already enhances the
generalization capability of the models, leaving less
room for additional improvement through TTA. Nev-
ertheless, these results demonstrate that TTA can still
provide meaningful enhancements to model perfor-
mance, even when DA is applied during training.
Considering each magnification (Tables 5 and 6, Figure 3), the best results for 40×, 100×, and 200× were achieved using TTA strategies; only for 400× was the best result obtained with standard prediction, with an accuracy of 0.8679 for ResNet-50 trained with DA. For 40×, the Swin Transformer V2 trained with DA achieved the highest accuracy of 0.9045 with 5 rounds of the T-B transformation set. At 100×, the Swin Transformer V2 trained without DA achieved the best accuracy of 0.9002 with 9 rounds of the T-D transformation set. For 200×, the ResNet-50 model trained with DA achieved the highest accuracy of 0.9007 with 15 rounds of the T-D transformation set.
Although the best accuracy across all models was
not achieved using a TTA strategy, TTA significantly
improved results for all models trained without DA
and also enhanced the performance of ViT b16 trained
with DA. These findings underscore the effective-
ness of TTA strategies in improving model perfor-
mance, particularly at lower magnifications, and high-
light their potential for boosting accuracy even when
DA has already been applied during training.
6 CONCLUSION
In this study, we proposed a straightforward TTA ap-
proach to enhance the classification of breast can-
cer histopathological images using three deep learn-
ing models trained with and without DA. The TTA
strategy was evaluated using five transformation sets
across all magnifications in the BreakHis dataset.
This strategy led to peak accuracies for the 40×,
100×, and 200× magnifications, achieving 0.9045,
0.9002, and 0.9007, respectively. Given that TTA in-
troduces no additional cost during training and only
enhances inference, it serves as a valuable tool to im-
prove model accuracy and offers a practical and effi-
cient way to enhance the performance of deep learn-
ing models in breast cancer image classification tasks.
ACKNOWLEDGEMENTS
We would like to thank FAPEMIG, Brazil (Grant
number CEX - APQ-02964-17) for financial support.
André R. Backes gratefully acknowledges the finan-
cial support of CNPq (National Council for Scien-
tific and Technological Development, Brazil) (Grant
#307100/2021-9). This study was financed in part by
the Coordenação de Aperfeiçoamento de Pessoal de Nível Superior - Brazil (CAPES) - Finance Code 001.
REFERENCES
Barbosa, G., Moreira, L., de Sousa, P. M., Moreira, R.,
and Backes, A. (2024). Optimization and Learning
Rate Influence on Breast Cancer Image Classification.
In Proceedings of the 19th International Joint Con-
ference on Computer Vision, Imaging and Computer
Graphics Theory and Applications - Volume 3: VIS-
APP, pages 792–799. INSTICC, SciTePress.
Calvo-Zaragoza, J., Rico-Juan, J. R., and Gallego, A.-J.
(2020). Ensemble classification from deep predic-
tions with test data augmentation. Soft Computing,
24(2):1423–1433.
da Silva, M., Rodrigues, L., and Mari, J. F. (2020). Op-
timizing data augmentation policies for convolutional
neural networks based on classification of sickle cells.
In Anais do XVI Workshop de Visão Computacional,
pages 46–51, Porto Alegre, RS, Brasil. SBC.
Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., and Fei-Fei,
L. (2009). ImageNet: A large-scale hierarchical im-
age database. In 2009 IEEE Conference on Computer
Vision and Pattern Recognition, pages 248–255, Mi-
ami, FL, USA. IEEE.
Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn,
D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer,
M., Heigold, G., Gelly, S., Uszkoreit, J., and Houlsby,
N. (2021). An image is worth 16x16 words: Trans-
formers for image recognition at scale. In Interna-
tional Conference on Learning Representations.
Garcea, F., Serra, A., Lamberti, F., and Morra, L. (2023).
Data augmentation for medical imaging: A systematic
literature review. Computers in Biology and Medicine,
152:106391.
Gautam, A. (2023). Recent advancements of deep learn-
ing in detecting breast cancer: a survey. Multimedia
Systems, 29(3):917–943.
Goceri, E. (2023). Comparison of the impacts of der-
moscopy image augmentation methods on skin cancer
classification and a new augmentation method with
wavelet packets. International Journal of Imaging
Systems and Technology, 33(5):1727–1744.
Gupta, V., Vasudev, M., Doegar, A., and Sambyal, N.
(2021). Breast cancer detection from histopathol-
ogy images using modified residual neural net-
works. Biocybernetics and Biomedical Engineering,
41(4):1272–1287.
He, K., Zhang, X., Ren, S., and Sun, J. (2016). Deep resid-
ual learning for image recognition. In Proceedings of
the IEEE conference on computer vision and pattern
recognition, pages 770–778.
Jiahao, Z., Jiang, Y., Huang, R., and Shi, J. (2021).
Efficientnet-based model with test time augmentation
for cancer detection. In 2021 IEEE 2nd International
Conference on Big Data, Artificial Intelligence and
Internet of Things Engineering (ICBAIE), pages 548–
551.
Kandel, I. and Castelli, M. (2021). Improving convolu-
tional neural networks performance for image clas-
sification using test time augmentation: a case study
using mura dataset. Health Information Science and
Systems, 9(1):33.
Liu, Z., Hu, H., Lin, Y., Yao, Z., Xie, Z., Wei, Y., Ning,
J., Cao, Y., Zhang, Z., Dong, L., Wei, F., and Guo, B.
(2022). Swin transformer v2: Scaling up capacity and
resolution. In 2022 IEEE/CVF Conference on Com-
puter Vision and Pattern Recognition (CVPR), pages
11999–12009.
Müller, D., Soto-Rey, I., and Kramer, F. (2022). An analysis
on ensemble learning optimized medical image clas-
sification with deep convolutional neural networks.
IEEE Access, 10:66467–66480.
Nanni, L., Paci, M., Brahnam, S., and Lumini, A. (2021).
Comparison of different image data augmentation ap-
proaches. Journal of Imaging, 7(12).
Nguyen, C. P., Hoang Vo, A., and Nguyen, B. T. (2019).
Breast Cancer Histology Image Classification using
Deep Learning. In 2019 19th International Sympo-
sium on Communications and Information Technolo-
gies (ISCIT), pages 366–370.
Oza, P., Sharma, P., and Patel, S. (2024). Breast lesion clas-
sification from mammograms using deep neural net-
work and test-time augmentation. Neural Computing
and Applications, 36(4):2101–2117.
Shanmugam, D., Blalock, D., Balakrishnan, G., and Guttag,
J. (2021). Better aggregation in test-time augmenta-
tion. In 2021 IEEE/CVF International Conference on
Computer Vision (ICCV), pages 1194–1203.
Shorten, C. and Khoshgoftaar, T. M. (2019). A survey on
image data augmentation for deep learning. Journal
of Big Data, 6(1):60.
Spanhol, F. A., Oliveira, L. S., Cavalin, P. R., Petitjean, C.,
and Heutte, L. (2017). Deep features for breast cancer
histopathological image classification. In 2017 IEEE
International Conference on Systems, Man, and Cy-
bernetics (SMC), pages 1868–1873. IEEE.
Spanhol, F. A., Oliveira, L. S., Petitjean, C., and Heutte,
L. (2016). A Dataset for Breast Cancer Histopatho-
logical Image Classification. IEEE Transactions on
Biomedical Engineering, 63(7):1455–1462.
Valero-Mas, J. J., Gallego, A. J., and Rico-Juan, J. R.
(2024). An overview of ensemble and feature learn-
ing in few-shot image classification using siamese
networks. Multimedia Tools and Applications,
83(7):19929–19952.