Beyond Data Augmentations: Generalization Abilities of Few-Shot
Segmentation Models
Muhammad Ahsan¹ (https://orcid.org/0009-0000-5982-1979), Guy Ben-Yosef² (https://orcid.org/0000-0002-4368-0750) and Gemma Roig¹ (https://orcid.org/0000-0002-6439-8076)
¹ Institute of Computer Science, Goethe University Frankfurt, Germany
² GE Research, U.S.A.
Keywords:
Machine Vision, Deep Neural Networks, Meta-Learning, Few-Shot Learning, Semantic Segmentation.
Abstract:
Few-shot learning in semantic segmentation has gained significant attention recently for its adaptability in applications where only a few or no examples are available as support for training. Here we advocate for a new testing paradigm, which we coin half-shot learning (HSL), that evaluates a model's ability to generalize to new categories when support objects are partially viewed, significantly cropped, occluded, noised, or aggressively transformed. This new paradigm introduces challenges that will spark advances in the field, allowing us to benchmark existing models and analyze their acquired sense of objectness. Humans are remarkably good at recognizing objects even when they are partially obstructed. HSL seeks to bridge the gap between human-like perception and machine learning models by forcing them to recognize objects from incomplete, fragmented, or noisy views, just as humans do. We propose a highly augmented image set for HSL, built by intentionally manipulating PASCAL-5i and COCO-20i to fit this paradigm. Our results reveal the shortcomings of state-of-the-art few-shot learning models and suggest improvements through data augmentation or the incorporation of additional attention-based modules to enhance the generalization capabilities of few-shot semantic segmentation (FSS). To improve the training method, we propose a channel and spatial attention module (Woo et al., 2018), in which an FSS model is retrained with the attention module and tested against the highly augmented support information. Our experiments demonstrate that an FSS model trained with the proposed method achieves significantly higher accuracy (approximately 5%) when exposed to limited or highly cropped support data.
1 INTRODUCTION
Deep convolutional neural networks (CNNs) have
driven significant progress in various computer vision
tasks like image classification, semantic segmenta-
tion, and object detection during the past several years
(Lu et al., 2021). Recent advancements in CNNs cover progress in layer design (Srivastava et al., 2015), activation and loss functions (Janocha and Czarnecki, 2017), regularization (Moradi et al., 2020), and optimization and computational speed (Cheng et al., 2018). However, gathering enough labeled data is notoriously tedious, particularly for dense prediction tasks like instance segmentation and semantic segmentation (Gu et al., 2018). Few-shot learning was introduced to mitigate this frequent lack of annotated data.
The goal in few-shot learning (FSL) is to learn a new
concept representation from only a few annotated ex-
amples. This is achieved by learning feature repre-
sentations via meta-learning, thus being able to gen-
eralize to new, unseen classes (Hu et al., 2018). For few-shot segmentation (FSS), the input to the model includes a query image Q as well as k support images {S_i} and k masks {M_i} in which a given single object class C is annotated. The model then returns a segmentation mask of the class C in the query image Q. Typically, the class C is not seen during training; namely, the set of dataset classes S is split into two disjoint sets, S_train (seen classes) and S_test (unseen classes), and during inference C ∈ S_test.
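To make the setup concrete, here is a minimal sketch of a 1-way k-shot segmentation episode; the field names and the model call signature are illustrative assumptions, not the API of any of the models studied here.

```python
from dataclasses import dataclass
import torch

@dataclass
class FSSEpisode:
    """One 1-way k-shot segmentation episode (field names are illustrative)."""
    query_image: torch.Tensor             # 3 x H x W
    query_mask: torch.Tensor              # H x W ground truth, used only for evaluation
    support_images: list[torch.Tensor]    # k tensors, 3 x H x W each
    support_masks: list[torch.Tensor]     # k binary masks, H x W, annotating class C
    class_id: int                         # C, drawn from S_test at inference time

def predict_episode(model, episode: FSSEpisode) -> torch.Tensor:
    """Predict the binary mask of class C in the query image.
    `model` is any FSS model exposing this (hypothetical) call signature."""
    return model(episode.query_image,
                 episode.support_images,
                 episode.support_masks)
```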
The goal of our work is to explore the limitations of current FSS models and to gain insights for developing novel, improved architectures. To achieve our goal, we deep-dive into a few of the recent approaches suggested for FSS tasks, namely prototype learning to segment an object (Liu et al., 2020), learning through
fixed background for different foreground objects and
vice versa (Lang et al., 2022), a model training tech-
nique to make new-class adaptation more manageable
with the class weight transformer (Lu et al., 2021),
image segmentation learning with task-specific edge
detection (Chen et al., 2016), and prototype align-
ment networks (Wang et al., 2019a). To evaluate gen-
eralization capabilities of FSS models, we introduce
HSL, which tests the ability of a model to generalize to unseen categories when the support information is highly augmented or limited (see Fig. 1 and Sec. 3.1). Based on insights from HSL, we propose a novel training method to improve the generalization ability of FSS models when exposed to highly augmented support information. We integrate the FSS model with a channel & spatial attention module, CBAM (Woo et al., 2018), which is beneficial when highly augmented or partial object information is used as support during training. Our primary contributions can be summarized as follows:
• We propose a challenging testing paradigm, called HSL, for FSS models to evaluate their ability to learn from partial object information.
• We propose a training method incorporating a channel & spatial attention module (CBAM) to improve the models' performance in HSL scenarios.
• We use Grad-CAM, a visualization technique that leverages gradients to identify the importance of spatial locations within convolutional layers.
2 RELATED WORK
The challenge of FSL has been an active area of re-
search for many years. In this work, we explore sev-
eral key components crucial to our study. First, we
discuss FSL, a paradigm that enables models to generalize well from a limited number of training examples when acquiring a large labeled dataset is impractical or expensive. Next, we delve into FSS, a specialized application of FSL focused on accurately segmenting different objects in images using only a few annotated examples. Finally, we examine the attention module, a mechanism that improves model performance by allowing it to focus on the most relevant input features.
2.1 Few-Shot Learning
FSL is the task of training models to generalize from
a small number of labeled examples to correctly clas-
sify or segment unseen samples. The main focus of
FSL is developing machine learning models suitable
for real-world scenarios where obtaining a large
dataset is impractical or expensive. Most of the cur-
rent approaches in the FSL domain are based on a
meta-learning framework, where a base learner adapts
to new learning tasks derived from a base dataset to
simulate few-shot scenarios (Wang et al., 2019b).
In real-world applications, we are often confronted
with incomplete or imperfect data. While FSL aims
to address scenarios with limited examples, it still as-
sumes that the available data are reasonably complete
and high-quality. In contrast, HSL introduces the no-
tion of training and testing models with significantly
imperfect or partial data.
2.2 Few-Shot Segmentation
FSS addresses the challenge of segmenting new
classes with limited annotated data, crucial in do-
mains like medicine and agriculture (Catalano and Matteucci, 2024). In FSS, a model learns to identify
pixels in a query image that belong to a specific object
class, guided by the segmentation masks from only
a small number of support images (Li et al., 2021).
Traditional semantic segmentation models typically
rely on a significant amount of labeled data to achieve
good results and generally struggle to adapt to unseen
classes without additional fine-tuning. In response,
several robust network architectures have been de-
veloped, incorporating key techniques like SegGPT
as a generalist segmentation model that unifies var-
ious segmentation tasks into an in-context learning
framework (Wang et al., 2023a), dilated convolutions
(Yu and Koltun, 2015), encoder-decoder frameworks
(Ronneberger et al., 2015), multi-level feature ag-
gregation (Lin et al., 2017), and attention modules
(Huang et al., 2019). Previous studies typically ap-
proach FSS as a guided segmentation task. For in-
stance in (Hu et al., 2018), a base learner (support
branch) processes the support information to generate
parameters that guide the meta-learning framework in
predicting the mask for query images. (Zhang et al.,
2020) introduced masked average pooling to extract
support features, which became the foundational tech-
nique in FSS tasks. Due to the success of prototypical networks, (Zhang et al., 2020) proposed dense prototype learning for segmentation and query mask prediction. In our work, we analyze how training
with augmented support samples influences robust-
ness and generalization abilities. For example, in (Hu
et al., 2018) a late fusion is proposed where the sup-
port image branch predicts the weights of the top layer
of the query image branch. The prototypical learning
approach is another method used for FSS which aims
to predict foreground and background classes by their
similarity to learned prototypes (Wang et al., 2019a).
The PPNet model (Liu et al., 2020) performs proto-
type learning based on a decomposition of the holistic
object class into a set of part-aware prototypes. The
BAM model (Lang et al., 2022) introduces a parallel base-learner branch alongside the meta learner, which identifies base-class regions, i.e., regions that do not need to be segmented during inference, and distinguishes them from novel-class regions.
2.3 Attention Module
Multiple approaches have been explored to improve CNN models, such as developing specialized optimizers (Rakelly et al., 2018), introducing adversarial training methods (Wang et al., 2023b), or designing specialized meta-architectures (Hu et al., 2018). Another approach is to use attention blocks that enhance performance by re-calibrating channel-wise feature responses through modeling channel inter-dependencies. A related lightweight module, the bottleneck attention module, enhances performance by introducing attention along both the channel and spatial axes. We propose to incorporate and adapt CBAM (Woo et al., 2018) in FSS models, which adjusts feature weights based on the input data.
3 METHODS
For assessing HSL, we perform a series of transformations on the support samples of two benchmark datasets, PASCAL-5i and COCO-20i. In this section we present the datasets, augmentations, and the training and testing paradigms that we propose to assess the attention module and the HSL task.
3.1 Half-Shot Learning
HSL introduces a more challenging scenario, where
the model is exposed to only a portion of the sup-
port objects, which are either partially visible, heav-
ily cropped, or aggressively transformed. While state-of-the-art FSS models such as CWT (Lu et al., 2021), BAM (Lang et al., 2022), PPNet (Liu et al., 2020), and PANet (Wang et al., 2019a) have shown significant progress on standard benchmarks, they remain sensitive to modifications of the support information. In
this work, we focus on exploring the reduction of sup-
port information, its effects, and the robustness of ex-
isting models. For investigating HSL the task of se-
mantic segmentation is a natural choice, since seg-
mentation aims to divide an image into segments or
regions, each of which represents a separate object
or part of the image. The HSL task allows us to ask
whether and how much the model really learns to rep-
resent transformable knowledge about general object
structure, or if it rather has a strong focus on align-
ment of the support mask to the image. We illustrate
the HSL semantic segmentation in Figure 1. HSL
goes beyond few-shot learning by simulating real-
world scenarios, where objects are rarely fully visi-
ble or perfectly captured. Models trained in HSL are
evaluated on their ability to:
• Ignore spurious correlations (e.g., background noise) and focus on core object features.
• Generalize effectively when object data is incomplete or adversarial.
• Handle occlusions and environmental noise, making them more robust and practical for deployment.
HSL provides a more challenging benchmark for
evaluating a model’s robustness and ability to learn
meaningful, core features under difficult, imperfect
conditions.
3.2 Datasets and Augmentations
PASCAL-5i and COCO-20i are benchmarks used for evaluation of FSS models. Both datasets are derived from larger, well-known datasets (PASCAL VOC and COCO), and they are restructured into subsets specifically designed for FSL tasks.
3.2.1 Datasets
PASCAL-5i is an extension of PASCAL VOC and also contains annotations from the Simultaneous Detection and Segmentation (SDS) dataset. The train and test sets contain 5,953 and 1,449 images, respectively. The 20 categories in the PASCAL-5i dataset are divided into four folds (0, 1, 2, 3), and each fold contains 5 disjoint classes. Data instances from three folds are used for model training, and testing is performed on the fourth fold in a cross-validation fashion.
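A minimal helper illustrating this fold convention; the exact class ordering (fold i holding the classes with 0-indexed ids 5i to 5i+4) follows common practice and is an assumption rather than a detail stated in the paper.

```python
def pascal5i_fold_classes(fold: int) -> tuple[list[int], list[int]]:
    """Return (train_classes, test_classes) for PASCAL-5i, class ids 0..19.
    Assumes the common convention: fold `fold` holds 5 consecutive test classes."""
    test_classes = list(range(fold * 5, (fold + 1) * 5))
    train_classes = [c for c in range(20) if c not in test_classes]
    return train_classes, test_classes

# Example: fold 0 -> test classes [0, 1, 2, 3, 4], train classes [5, ..., 19]
print(pascal5i_fold_classes(0))
```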
COCO-20i is a larger and more challenging dataset designed for tasks like segmentation, key-point detection, and captioning. The dataset is divided into four splits (COCO-20i, where i = 0, 1, 2, 3). The object categories are divided into four folds, each containing 20 distinct classes. It provides 82,081 and 40,137 images for training and evaluation, respectively.
Figure 1: Illustration of HSL for semantic segmentation. In typical FSL scenarios, the support set consists of complete, clean, and fully observable examples of each class. HSL introduces a more difficult scenario, where the model is exposed to half of the support objects, which are either partially viewed, significantly cropped, or noised. The goal is to test the model's ability to generalize under adversarial conditions. We propose to use a set of augmentations and perturbations on the support image and mask, while keeping the query in its original form.
3.2.2 Augmentations
We apply the following augmentations, as shown in Figure 1, to test HSL on both PASCAL-5i and COCO-20i (a pipeline sketch follows the list):
• Flip: Both Horizontal-Flip and Vertical-Flip are used, with probabilities p=0.5 and p=1, mirroring the objects along the horizontal axis, the vertical axis, or both.
• Rotate: We randomly apply affine transformations to scale, translate, and rotate the input images. Rotating images to various angles makes the dataset more diverse, helping models generalize better by learning rotational invariance. Specifically, we apply four rotate limits with angles of 10°, 20°, 45°, and 90°, respectively.
• Crop: The center-crop operation focuses on the center of the image, assuming that the most important or relevant information is likely to be in the middle. We apply the center-crop to support images and labels with four different variations: 20%, 40%, 60%, and 80%.
• Noise: Noise reduces image clarity, making it harder to distinguish details. Gaussian noise is sampled over all channels of the image, so the support still conveys imperfect information about the object class.
• Superpixels: We transform the input images to their superpixel representation, partially or completely, with p = 0.5 or 1.
• Irrelevant Support: We provide irrelevant support images to the model, i.e., support samples whose category differs from that of the query image; this also results in poor generalization to new data.
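As an illustration, the support-only augmentations above could be assembled with the albumentations library roughly as follows; the library choice, the image size, and the exact parameter values are our assumptions, since the paper does not list its implementation details.

```python
import albumentations as A

def support_augmentation(kind: str, image_size: int = 473) -> A.Compose:
    """Build one of the HSL support augmentations (applied to image AND mask)."""
    if kind == "hflip":
        tf = A.HorizontalFlip(p=1.0)
    elif kind == "shift_scale_rotate_90":
        tf = A.ShiftScaleRotate(shift_limit=0.1, scale_limit=0.1,
                                rotate_limit=90, p=1.0)
    elif kind == "center_crop_20":
        # interpretation: keep only the central 20% of the support image and mask
        side = int(0.2 * image_size)
        tf = A.CenterCrop(height=side, width=side, p=1.0)
    elif kind == "gauss_noise":
        tf = A.GaussNoise(p=0.5)
    elif kind == "superpixels":
        tf = A.Superpixels(p=0.5)
    else:
        raise ValueError(kind)
    return A.Compose([tf])

# Usage: augment the support pair only; the query image is left untouched.
# out = support_augmentation("center_crop_20")(image=support_img, mask=support_msk)
# aug_img, aug_msk = out["image"], out["mask"]
```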
3.3 Models
FSS is a deep learning technique that enables a pre-trained model to segment new categories of data that are unseen by the model. We chose the following four FSS models:
CWT: The Classifier Weight Transformer model was developed to make new-class adaptation more manageable by concentrating on the classifier part rather than meta-learning the entire complex model (Lu et al., 2021).
BAM: A new perspective on FSS that identifies the regions that do not need to be segmented; the authors propose an additional branch, namely a base learner, to specifically predict the base-class regions (Lang et al., 2022). In this way, irrelevant objects in the query images can be concealed to a large extent. A Gram matrix is used to differentiate image scenes, and the approach is extended to a setting called generalized FSL, which simultaneously identifies the targets of base and novel classes.
PPNet: The Part-aware Prototype Network decomposes the holistic class representation into a set of part-aware prototypes. The network consists of three parts: an Embedding Network that computes the convolutional feature maps of the images, a Prototypes Generation Network that extracts a set of part-aware prototypes, and a Part-aware Mask Generation Network that generates the final semantic segmentation of the query images (Liu et al., 2020).
PANet: The prototype alignment network learns class-specific, high-quality prototypes from a few support samples via non-parametric metric learning (Wang et al., 2019a). It also introduces a prototype alignment regularization between support and query images.
3.4 Training / Testing Paradigms
We use CWT, BAM, PPNet, and PANet as our baseline FSS models and evaluate their performance on the HSL task. Experiments are performed with pre-trained models as well as models retrained with our customized settings, which incorporate data augmentations.
3.4.1 Train Normal and Test with Augmentation
Models are trained using their default configurations, and we reproduce the original results for all tested standard FSS models. We subsequently apply various augmentations to the support images to evaluate the robustness and generalization capabilities of the FSS models.
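For concreteness, the following is a rough sketch of this test loop, reusing the hypothetical FSSEpisode structure and augmentation builder from the earlier sketches; the mIoU computation here is a simplified binary foreground IoU, not the exact evaluation code of the benchmarks.

```python
import torch

def hsl_test(model, episodes, support_aug):
    """Evaluate a trained FSS model under HSL: augment only the support pair
    (image, mask); the query image and its ground truth stay in original form."""
    ious = []
    for ep in episodes:
        aug_imgs, aug_msks = [], []
        for img, msk in zip(ep.support_images, ep.support_masks):
            out = support_aug(image=img.permute(1, 2, 0).numpy(),
                              mask=msk.numpy())
            aug_imgs.append(torch.from_numpy(out["image"]).permute(2, 0, 1))
            aug_msks.append(torch.from_numpy(out["mask"]))
        pred = model(ep.query_image, aug_imgs, aug_msks) > 0.5
        gt = ep.query_mask > 0.5
        inter = (pred & gt).sum().item()
        union = (pred | gt).sum().item()
        ious.append(inter / max(union, 1))          # binary foreground IoU
    return sum(ious) / len(ious)                    # mean IoU over episodes
```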
3.4.2 Train and Test with Augmentations
Models are trained using highly augmented support images, which provide limited information for the models to learn from, and are then also tested against the augmented dataset.
3.4.3 Train with Attention Module
To enhance the resilience and ability to learn from
limited support information, we suggest incorporating
the Convolutional Block Attention Module, CBAM
(Woo et al., 2018), which adjusts weights based on the
input features. This boosts the representation capacity
by using attention modules, emphasizing key features
while reducing the focus on less relevant ones (Xu
et al., 2015; Gregor et al., 2015). CBAM has two se-
quential components, channel and spatial, which dy-
namically refine the intermediate feature map at every
convolutional block. CBAM infers a 1D channel attention map M_c ∈ R^(C×1×1) and a 2D spatial attention map M_s ∈ R^(1×H×W) when given an intermediate feature map F ∈ R^(C×H×W) as input, as illustrated below:

F'  = M_c(F) ⊗ F,
F'' = M_s(F') ⊗ F',        (1)

where F'' is the final refined output and ⊗ denotes element-wise multiplication. For the channel attention, spatial information of the feature map is aggregated using both average-pooling and max-pooling, producing F^c_avg and F^c_max; the outputs of a shared MLP applied to these descriptors are merged by element-wise summation. The channel attention is computed as:

M_c(F) = σ(MLP(AvgPool(F)) + MLP(MaxPool(F)))
       = σ(W_1(W_0(F^c_avg)) + W_1(W_0(F^c_max))),        (2)

where W_0 and W_1 are the MLP weights, W_0 is followed by a ReLU activation, and σ denotes the sigmoid function. For the spatial attention, channel information of the feature map is aggregated by two pooling operations that generate the 2D maps F^s_avg and F^s_max; these maps are concatenated and convolved by a standard convolution layer to produce the 2D spatial attention map:

M_s(F) = σ(f^(7×7)([AvgPool(F); MaxPool(F)]))
       = σ(f^(7×7)([F^s_avg; F^s_max])),        (3)

where σ denotes the sigmoid function and f^(7×7) represents a convolution operation with a 7×7 kernel.
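To make Eqs. (1)-(3) concrete, below is a minimal PyTorch sketch of a CBAM block; the reduction ratio of 16 and the use of 1×1 convolutions for the shared MLP follow common practice for CBAM (Woo et al., 2018) and are assumptions rather than details taken from our training code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ChannelAttention(nn.Module):
    """Channel attention M_c (Eq. 2): shared MLP over avg- and max-pooled descriptors."""
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, kernel_size=1, bias=False),  # W_0
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, kernel_size=1, bias=False),  # W_1
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        avg = self.mlp(F.adaptive_avg_pool2d(x, 1))   # MLP(AvgPool(F))
        mx = self.mlp(F.adaptive_max_pool2d(x, 1))    # MLP(MaxPool(F))
        return torch.sigmoid(avg + mx)                # B x C x 1 x 1


class SpatialAttention(nn.Module):
    """Spatial attention M_s (Eq. 3): 7x7 conv over concatenated channel-wise avg/max maps."""
    def __init__(self, kernel_size: int = 7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        avg = x.mean(dim=1, keepdim=True)             # F^s_avg: B x 1 x H x W
        mx, _ = x.max(dim=1, keepdim=True)            # F^s_max: B x 1 x H x W
        return torch.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))


class CBAM(nn.Module):
    """Sequential channel-then-spatial refinement of a feature map (Eq. 1)."""
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.channel = ChannelAttention(channels, reduction)
        self.spatial = SpatialAttention()

    def forward(self, f: torch.Tensor) -> torch.Tensor:
        f_prime = self.channel(f) * f                 # F'  = M_c(F) (x) F
        return self.spatial(f_prime) * f_prime        # F'' = M_s(F') (x) F'
```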
4 EXPERIMENTAL RESULTS
The experiments apply the training and testing paradigms described above to investigate the influence of augmentations, using the benchmark datasets PASCAL-5i and COCO-20i described in Sec. 3.2.
4.1 Train Normal and Test with
Augmentation
Models are trained using standard configurations and tested under the HSL paradigm; the results are presented in Table 1.
All four models performed well in the flip exper-
iments; however, PPNet and PANet experienced ac-
curacy losses of 6% & 9% respectively. In the Shift-
Scale-Rotate experiments, CWT and BAM performed
better, while PPNet and PANet declined in accuracy as the rotation angle increased from 10° to 90°. All FSS
models experienced significant accuracy losses with
cropped support images. The BAM model, which
outperformed the other FSS models, lost approxi-
mately 14% accuracy. The models were unable to learn
sufficiently from the partial or incomplete support in-
formation. All FSS models experienced a drop in
performance when exposed to noisy data. Among
them, BAM lost 15% accuracy; however, it still out-
performed the other models. Compared to noise, all
models demonstrated better performance with the su-
perpixel representation of the support images. Models
tend to be more confused by the irrelevant support but
perform better when provided with partial views.
4.2 Train and Test with Augmentation
Training and testing are performed with highly augmented support information alongside standard query images (Table 1). Models retrained with augmentations demonstrate only a slight increase in accuracy, of 1-2% in some cases, compared to those that were not retrained with augmentations.
Table 1: Performance comparison of different few-shot segmentation (FSS) models on highly augmented PASCAL-5i data. The plain columns report the mIoU when models trained with standard configurations are tested on the augmented dataset, while the "w/Aug" columns report the performance when the models are both trained and tested on the augmented dataset.
# Augmentation CWT CWT w/Aug BAM BAM w/Aug PPNet PPNet w/Aug PANet PANet w/Aug
0 Baseline 56.40 – 67.81 – 55.16 – 48.10 –
1 Hor-Flip(p=0.5) 56.34 56.32 67.76 67.80 51.62 52.60 46.46 46.57
2 Hor-Flip(p=1) 55.30 56.24 67.56 67.61 46.88 46.54 38.08 38.41
3 Ver-Flip(p=0.5) 56.18 56.20 67.76 67.72 51.42 52.50 46.34 46.43
4 Ver-Flip(p=1) 55.48 56.71 66.39 67.62 44.39 44.61 37.28 38.74
5 Sh-Rot(L=10) 55.16 56.85 67.60 67.70 46.26 46.11 36.73 36.17
6 Sh-Rot(L=20) 54.50 55.81 66.49 67.50 45.67 45.72 35.46 35.50
7 Sh-Rot(L=45) 54.84 55.65 66.22 67.11 44.21 44.83 34.52 35.82
8 Sh-Rot(L=90) 53.71 55.62 64.40 66.83 44.19 44.27 35.11 36.22
9 C-Crop(20%) 37.20 39.19 53.59 55.82 26.13 26.71 23.34 23.54
10 C-Crop(40%) 46.41 48.76 64.00 65.29 41.56 41.37 31.66 32.91
11 C-Crop(60%) 51.80 53.80 66.83 67.91 42.77 42.61 36.05 37.16
12 C-Crop(80%) 55.49 56.29 67.66 67.41 42.65 42.51 37.0 37.84
13 GaussNoise(p=.5) 28.96 29.65 52.26 52.02 19.54 19.17 21.90 22.82
14 Superpixels(p=.5) 40.53 41.37 63.72 65.18 35.72 36.27 31.39 32.63
15 Irrel-Support 31.73 31.16 47.86 48.61 23.27 23.48 24.62 24.42
An interesting insight is that the models do not effectively utilize partial support views even when trained to do so, as also shown in Figure 2.
Figure 2: Segmentation results using the proposed method under HSL settings. The method is applied with the BAM model on the PASCAL-5i dataset. The first row displays samples without any augmentation; the remaining rows show results obtained with highly cropped support data, where the airplane and boat examples indicate that predictions are less accurate compared to the ground truth.
Table 2 presents the performance of the CWT and BAM models on another benchmark, COCO-20i. The experiments demonstrate that both models exhibit varying degrees of robustness when subjected to different types of augmentations. In the Flip and Rotate settings, the models performed better, with only about a 2% decrease in accuracy as the rotation angle varied from 10° to 90°.
Table 2: Training and testing of FSS models on the augmented COCO-20i dataset.
# Augmentation CWT CWT w/Aug BAM BAM w/Aug
0 Baseline 32.90 – 46.23 –
1 Hor-Flip(p=0.5) 31.64 32.21 45.27 46.22
2 Hor-Flip(p=1) 31.42 31.55 45.19 46.13
3 Ver-Flip(p=0.5) 32.12 32.43 45.20 46.26
4 Ver-Flip(p=1) 28.74 29.48 44.06 46.58
5 Sh-Rot(L=10) 30.50 30.13 45.36 46.29
6 Sh-Rot(L=20) 29.18 30.27 45.23 46.38
7 Sh-Rot(L=45) 29.55 29.52 45.07 46.87
8 Sh-Rot(L=90) 28.38 29.31 44.51 45.31
9 C-Crop(20%) 19.63 19.86 34.72 36.38
10 C-Crop(40%) 25.18 26.63 36.25 38.56
11 C-Crop(60%) 28.96 29.19 42.27 44.17
12 C-Crop(80%) 31.14 31.73 44.81 47.96
13 GaussNoise(p=.5) 22.06 22.19 36.70 38.47
14 Superpixels(p=.5) 25.10 25.41 37.36 38.82
15 Irrel-Support 17.66 17.89 28.89 30.31
Experiments #9-12 demonstrate that BAM consistently outperforms CWT when dealing with highly cropped data. Both models similarly struggled in the Gaussian noise experiment: CWT performs poorly with an IoU of 22.19, while BAM shows stronger performance with an IoU of 38.47. With superpixel representations, CWT achieves an IoU of 25.41, while BAM again outperforms it with an IoU of 38.82. This shows that both models can handle superpixel data better than noisy data, but BAM maintains a clear advantage. CWT struggles with irrelevant support, achieving an IoU of only 17.89, while BAM performs better with an IoU of 30.31. This indicates that both models are confused by irrelevant support, but BAM is more robust in these challenging conditions.
4.3 Train with Attention Module
Upon analysis, we observed that BAM consistently
demonstrates superior performance compared to the
other models, particularly when tested against highly
cropped images. This indicates BAM’s better gener-
alization abilities towards incomplete or partial sup-
port information. Therefore, for further experimenta-
tion, we selected the BAM model to extend with the inte-
gration of an attention module, CBAM, which serves
as a simple yet effective weight adjustment mecha-
nism based on the features of the input data. At-
tention modules have proven to be effective in var-
ious visual tasks such as image classification, ob-
ject detection, and semantic segmentation (Woo et al.,
2018). Models utilizing VGG/ResNet as a backbone, such as BAM, can jointly train the combined CBAM-enhanced networks; we integrate CBAM with the res-blocks in ResNet (He et al., 2016).
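A hedged sketch of one way such an integration could look with a torchvision ResNet backbone; the wrapper below attaches CBAM after each residual block of the last stage, which is an illustrative placement (Woo et al. (2018) insert CBAM inside the block, before the residual addition), not the exact hook point used in our BAM training code.

```python
import torch.nn as nn
from torchvision.models import resnet50

class BlockWithCBAM(nn.Module):
    """Wrap an existing residual block so CBAM refines its output features."""
    def __init__(self, block: nn.Module, channels: int):
        super().__init__()
        self.block = block
        self.cbam = CBAM(channels)   # CBAM module from the sketch in Sec. 3.4.3

    def forward(self, x):
        return self.cbam(self.block(x))

backbone = resnet50(weights=None)
# layer4 of ResNet-50 outputs 2048 channels; attach CBAM after each of its blocks.
backbone.layer4 = nn.Sequential(
    *[BlockWithCBAM(block, 2048) for block in backbone.layer4]
)
```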
Figure 3: BAM model with an attention module. The mod-
ule consists of two sequential sub-modules: channel and
spatial. The intermediate feature map is adaptively refined
through the CBAM module at each convolutional block of
deep networks.
Representation capacity is enhanced through attention modules, which prioritize important features while minimizing irrelevant ones (Xu et al., 2015; Gregor et al., 2015). CBAM adaptively refines the intermediate feature map at each convolutional block of deep networks (see Fig. 3). Table 3 clearly shows the improved generalization capability of the BAM model against augmented support information. Training BAM with the attention module achieves approximately 2%-5% higher accuracy when tested with rotated and highly cropped data (experiments #4 and #5 in Table 3).
We employ Grad-CAM (Selvaraju et al., 2020) as a visualization technique that leverages gradients to assess the importance of spatial locations within convolutional layers. Grad-CAM highlights specific regions of the input image with an attention heatmap, indicating the areas that are most crucial for detecting a particular class of interest (Selvaraju et al., 2020). For qualitative analysis, we compare the visualization results of the baseline BAM and BAM with the attention module.
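A minimal sketch of the Grad-CAM computation using forward/backward hooks; it assumes a classification-style scalar class score (for a segmentation head, the score would typically be summed over the predicted mask region) and is not the visualization code used for Figure 4.

```python
import torch
import torch.nn.functional as F

def grad_cam(model, image, target_layer, class_idx):
    """Minimal Grad-CAM: weight each channel of the target layer's activation
    by the spatial average of its gradient w.r.t. the class score."""
    feats, grads = [], []
    h1 = target_layer.register_forward_hook(lambda m, i, o: feats.append(o))
    h2 = target_layer.register_full_backward_hook(lambda m, gi, go: grads.append(go[0]))

    score = model(image.unsqueeze(0))[0, class_idx]   # scalar class score
    model.zero_grad()
    score.backward()
    h1.remove(); h2.remove()

    act, grad = feats[0], grads[0]                    # B x C x H x W
    weights = grad.mean(dim=(2, 3), keepdim=True)     # global-average-pooled gradients
    cam = F.relu((weights * act).sum(dim=1))          # weighted combination + ReLU
    cam = F.interpolate(cam.unsqueeze(1), size=image.shape[-2:],
                        mode="bilinear", align_corners=False)
    return (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)  # normalize to [0, 1]
```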
Table 3: Performance comparison of the BAM model and BAM with the attention module on the highly augmented PASCAL-5i dataset. The first column shows the mIoU performance of the BAM model when tested on the augmented dataset, while the second column presents the improved performance of the BAM+Attention model on the same test data.
# Augmentation BAM BAM w/Attention
0 Baseline 67.81 –
1 VerticalFlip(p=1) 66.39 67.81
2 Hor-Flip(p=1) 67.59 67.96
3 Sh-Rot(L=20) 66.53 67.29
4 Sh-Rot(L=90) 64.41 67.74
5 C-Crop(20%) 53.57 58.69
6 GaussNoise (p=1) 52.41 53.48
7 Superpixels (p=1) 63.39 65.71
Figure 4 illustrates the Grad-CAM results, demonstrating that the BAM model integrated with the attention module provides slightly more insightful explanations.
Figure 4: Examples of feature importance visualizations.
5 CONCLUSION
We designed and evaluated the HSL task, where support images contain only partial object information. In comparison to the control case, Irrelevant-Support, where no relevant support information is provided, all tested models exhibited improved performance. This indicates that all models were able to leverage support information, even when it was significantly limited. Variations in model performance were observed due to differences in model architectures and the various backbones employed in the implementations. In our case, the selected models use backbones ranging from VGG-16 to ResNet-101, which may differ in their capability to generalize from partial to full objects. In several experiments (e.g., Center-Crop(20%), Noise, and Irrelevant-Support), the tested models struggled to accurately identify the confusing areas in the support images, which hindered the improvement of the meta-learner's predictions. The
part-based and prototype models tested here struggled to extract robust prototypes from a support set that provides less or incomplete information about the object class. We proposed a new training paradigm, referred to as BAM with attention, in which the BAM model is retrained in conjunction with an attention module and evaluated using highly augmented support images. Although it still faces challenges in extracting robust features from the support set, it demonstrates less confusion and a greater capacity for generalization compared to the other models. We believe that our findings can inform future investigations into issues of bias and semantic ambiguity.
REFERENCES
Catalano, N. and Matteucci, M. (2024). Few shot seman-
tic segmentation: a review of methodologies, bench-
marks, and open challenges.
Chen, L.-C., Barron, J. T., Papandreou, G., Murphy, K., and
Yuille, A. L. (2016). Semantic image segmentation
with task-specific edge detection using cnns and a dis-
criminatively trained domain transform. In Proceed-
ings of the IEEE conference on computer vision and
pattern recognition, pages 4545–4554.
Cheng, J., Wang, P.-s., Li, G., Hu, Q.-h., and Lu, H.-
q. (2018). Recent advances in efficient computa-
tion of deep convolutional neural networks. Frontiers
of Information Technology & Electronic Engineering,
19:64–77.
Gregor, K., Danihelka, I., Graves, A., Rezende, D. J., and
Wierstra, D. (2015). Draw: A recurrent neural net-
work for image generation. arXiv:1502.04623.
Gu, J., Wang, Z., Kuen, J., Ma, L., Shahroudy, A., Shuai, B.,
Liu, T., Wang, X., Wang, G., Cai, J., et al. (2018). Re-
cent advances in convolutional neural networks. Pat-
tern recognition, 77:354–377.
He, K., Zhang, X., Ren, S., and Sun, J. (2016). Deep
Residual Learning for Image Recognition. In Pro-
ceedings of 2016 IEEE Conference on Computer Vi-
sion and Pattern Recognition, CVPR ’16, pages 770–
778. IEEE.
Hu, J., Shen, L., and Sun, G. (2018). Squeeze-and-
excitation networks. In Proceedings of the IEEE con-
ference on computer vision and pattern recognition,
pages 7132–7141.
Huang, Z., Wang, X., Huang, L., Huang, C., Wei, Y.,
and Liu, W. (2019). Ccnet: Criss-cross attention
for semantic segmentation. In Proceedings of the
IEEE/CVF international conference on computer vi-
sion, pages 603–612.
Janocha, K. and Czarnecki, W. M. (2017). On loss func-
tions for deep neural networks in classification. arXiv
preprint arXiv:1702.05659.
Lang, C., Cheng, G., Tu, B., and Han, J. (2022). Learning
what not to segment: A new perspective on few-shot
segmentation. In Proceedings of the IEEE/CVF Con-
ference on Computer Vision and Pattern Recognition
(CVPR), pages 8057–8067.
Li, G., Jampani, V., Sevilla-Lara, L., Sun, D., Kim, J., and
Kim, J. (2021). Adaptive prototype learning and al-
location for few-shot segmentation. In Proceedings
of the IEEE/CVF conference on computer vision and
pattern recognition, pages 8334–8343.
Lin, G., Milan, A., Shen, C., and Reid, I. (2017). Refinenet:
Multi-path refinement networks for high-resolution
semantic segmentation. In Proceedings of the IEEE
conference on computer vision and pattern recogni-
tion, pages 1925–1934.
Liu, Y., Zhang, X., Zhang, S., and He, X. (2020). Part-
aware prototype network for few-shot semantic seg-
mentation. In European Conference on Computer Vi-
sion, pages 142–158. Springer.
Lu, Z., He, S., Zhu, X., Zhang, L., Song, Y.-Z., and Xiang,
T. (2021). Simpler is better: Few-shot semantic seg-
mentation with classifier weight transformer. In ICCV.
Moradi, R., Berangi, R., and Minaei, B. (2020). A survey
of regularization strategies for deep models. Artificial
Intelligence Review, 53(6):3947–3986.
Rakelly, K., Shelhamer, E., Darrell, T., Efros, A. A.,
and Levine, S. (2018). Few-shot segmentation
propagation with guided networks. arXiv preprint
arXiv:1806.07373.
Ronneberger, O., Fischer, P., and Brox, T. (2015). U-
net: Convolutional networks for biomedical image
segmentation. In Medical image computing and
computer-assisted intervention–MICCAI 2015: 18th
international conference, Munich, Germany, October
5-9, 2015, proceedings, part III 18, pages 234–241.
Springer.
Selvaraju, R. R., Cogswell, M., Das, A., Vedantam, R.,
Parikh, D., and Batra, D. (2020). Grad-cam: visual
explanations from deep networks via gradient-based
localization. International journal of computer vision,
128:336–359.
Srivastava, R. K., Greff, K., and Schmidhuber, J. (2015).
Training very deep networks. Advances in neural in-
formation processing systems, 28.
Wang, K., Liew, J. H., Zou, Y., Zhou, D., and Feng, J.
(2019a). Panet: Few-shot image semantic segmen-
tation with prototype alignment. In The IEEE Inter-
national Conference on Computer Vision (ICCV).
Wang, X., Zhang, X., Cao, Y., Wang, W., Shen, C., and
Huang, T. (2023a). Seggpt: Segmenting everything in
context.
Wang, Y., Luo, N., and Zhang, T. (2023b). Focus on query:
Adversarial mining transformer for few-shot segmen-
tation. Advances in Neural Information Processing
Systems, 36:31524–31542.
Wang, Y.-X., Ramanan, D., and Hebert, M. (2019b). Meta-
learning to detect rare objects. In Proceedings of the
IEEE/CVF International Conference on Computer Vi-
sion, pages 9925–9934.
Woo, S., Park, J., Lee, J., and Kweon, I. S. (2018).
CBAM: convolutional block attention module. CoRR,
abs/1807.06521.
Xu, K., Ba, J., Kiros, R., Cho, K., Courville, A. C.,
Salakhutdinov, R., Zemel, R. S., and Bengio, Y.
(2015). Show, attend and tell: Neural image
caption generation with visual attention. CoRR,
abs/1502.03044.
Yu, F. and Koltun, V. (2015). Multi-scale context ag-
gregation by dilated convolutions. arXiv preprint
arXiv:1511.07122.
Zhang, X., Wei, Y., Yang, Y., and Huang, T. S. (2020). Sg-
one: Similarity guidance network for one-shot seman-
tic segmentation. IEEE transactions on cybernetics,
50(9):3855–3865.