Local Foreground Selection Aware Attentive Feature Reconstruction for
Few-Shot Fine-Grained Plant Species Classification
Aisha Zulfiqar (https://orcid.org/0000-0001-6647-0134) and Ebroul Izquierdo
School of Electronics Engineering and Computer Science, Queen Mary University of London, U.K.
Keywords:
Few-Shot Learning, Plant Species Classification, Fine-Grained Classification, Attention Mechanism.
Abstract:
Plant species exhibit subtle distinctions, requiring a reduction in intra-class variation and an increase in inter-
class differences to improve accuracy. This paper addresses plant species classification using a limited number
of labelled samples and introduces a novel Local Foreground Selection (LFS) attention mechanism. Built on
the proposed attention, the Local Foreground Selection Module (LFSM) is a straightforward module designed to
generate discriminative support and query feature maps. It operates by integrating two types of attention: local
attention, which captures local spatial details to enhance feature discrimination and increase inter-class differ-
entiation, and foreground selection attention, which emphasizes the foreground plant object while mitigating
background interference. By focusing on the foreground, the query and support features selectively highlight
relevant feature sequences and disregard less significant background sequences, thereby reducing intra-class
differences. Experimental results from three plant species datasets demonstrate the effectiveness of the pro-
posed LFS attention and its complementary advantages over previous feature reconstruction methods.
1 INTRODUCTION
Automatic plant species classification is essential for
ecology and biodiversity conservation. It is a fine-
grained image classification problem, characterized
by subtle differences between species. Large intra-
class variations arise from differences in background,
illumination and pose, while inter-class variations
tend to be minimal due to similar morphology. Conse-
quently, effective training must focus on learning dis-
criminative features to enhance classification perfor-
mance. Few-shot fine-grained image classification is
particularly challenging due to limited training data,
necessitating the optimization of both inter-class and
intra-class variations. Conventional few-shot learning
(Snell et al., 2017) (Sung et al., 2018) typically ad-
dresses only class-level differences, which limits its
effectiveness for fine-grained classification tasks. Re-
cent few-shot fine-grained classification methods use
feature reconstruction (Wertheimer et al., 2021) (Do-
ersch et al., 2020), where support features reconstruct
query features to enhance class separation and optimize
inter-class and intra-class variations.
In this study, we present a novel Local Foreground
Selection (LFS) attention mechanism, designed to lo-
calize discriminative regions through a dual-action at-
tention approach that generates distinct features. It
effectively reduces background effects while empha-
sizing the foreground, along with capturing essen-
tial local discriminative details. This is achieved by
combining local attention, which extracts local spa-
tial context to enhance inter-class variation, and fore-
ground selection attention, which minimizes intra-
class variation by highlighting foreground object and
reducing background effects. The novelty of the
LFS attention lies in its integration of local and fore-
ground selection attention, which together outperform
their individual results. This attention mech-
anism produces highly discriminative features, facili-
tating the reconstruction of support and query features
in existing few-shot fine-grained classification meth-
ods (Wertheimer et al., 2021) (Wu et al., 2023). Our
main contributions can be summarized as follows:
• In this study, we propose a novel Local Fore-
ground Selection attention designed to optimize
both inter-class and intra-class variations.
• The proposed attention is incorporated into a vi-
sion transformer encoder, creating the Local Fore-
ground Selection Module (LFSM), which pro-
duces discriminative feature maps.
• The LFSM module significantly improves perfor-
mance for plant images with natural background,
when combined with state-of-the-art feature recon-
struction approaches.
2 RELATED WORKS
Metric Based Few-Shot Learning. Metric learn-
ing is a widely used approach for few-shot learn-
ing. Prominent metric-based methods include Match-
ing Networks (Vinyals et al., 2016), Prototypical Net-
works (Snell et al., 2017), and Relation Networks
(Sung et al., 2018). These methods transform train-
ing data into feature vectors within a shared feature
space, where the similarity between two feature vec-
tors is measured by their distance. Metric learning ap-
proaches for fine-grained image classification lever-
age Euclidean or cosine distances as metrics. Recent
approaches include low-rank pairwise bilinear pool-
ing to learn effective distance metrics (Huang et al.,
2021b) and focus area location mechanisms for iden-
tifying similar regions among objects (Sun et al.,
2021). Techniques like multi-attention meta-learning
(MattML) (Zhu et al., 2020) and local descriptor-
based image-to-class measures (Li et al., 2019) en-
hance feature metrics for better classification. Non-
linear data projection networks (NDPNet) (Zhang
et al., 2021) improve similarity measures through ad-
vanced projections, while the bi-similarity network (Li
et al., 2020) employs dual similarity checks for more
discriminative features. Additionally, target-oriented
alignment networks (TOAN) (Huang et al., 2021a)
align support and query images to minimize intra-
class variance and maximize inter-class variance, con-
tributing to more robust classification performance.
Feature Reconstruction Method. While metric-
based methods collapse each image into a single vec-
tor, which does not preserve spatial locations, feature
reconstruction approaches address this limitation. DeepEMD
(Zhang et al., 2020) does not perform matching at the
image level like traditional metric learning methods;
instead, it partitions the image into a set of local rep-
resentations. Optimal matching is conducted on these
representations from two images to assess similarity,
using the Earth Mover’s Distance. In the Feature Recon-
struction Network (FRN) (Wertheimer et al., 2021),
feature maps are reconstructed by pooling support set
feature maps into a matrix, where each column rep-
resents the concatenated feature maps of a channel.
For classification, each location in the query image’s
feature map is reconstructed using a weighted sum
of the support features from the corresponding class.
The LCCRN (Li et al., 2023) network enhances local
information extraction by introducing a local content
extraction module, while a separate embedding mod-
ule preserves appearance details. Bi-FRN (Wu et al.,
2023) adopts a feature reconstruction strategy which
reconstructs query images from support images and
vice versa. This results in a feature pool derived from
both support and query images, with pooled features
mutually reconstructed from one another.
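To make the reconstruction idea concrete, the sketch below reconstructs query feature vectors as a weighted (ridge-regression style) combination of pooled support features and scores a class by the reconstruction error; it is an illustrative simplification under our own assumptions (tensor shapes, regularisation term lam), not the exact formulation of FRN or Bi-FRN.

```python
import torch

def reconstruct_query(support_feats, query_feats, lam=0.1):
    # support_feats: (ns, d) pooled support feature vectors of one class
    # query_feats:   (nq, d) query feature vectors, one per spatial location
    # Solve W = argmin ||W S - Q||^2 + lam ||W||^2, then reconstruct Q_hat = W S.
    S, Q = support_feats, query_feats
    gram = S @ S.t() + lam * torch.eye(S.size(0))        # (ns, ns)
    W = Q @ S.t() @ torch.linalg.inv(gram)               # (nq, ns) reconstruction weights
    return W @ S                                          # (nq, d) reconstructed query features

# Example: the class score is the (negative) distance between original and reconstruction.
query = torch.randn(25, 64)            # a 5x5 query feature map flattened to 25 vectors
support = torch.randn(5 * 25, 64)      # pooled feature vectors of one support class
score = -((query - reconstruct_query(support, query)) ** 2).sum()
```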
Attention Mechanism. The transformer self-
attention (Vaswani et al., 2017) has been incorporated
into several few-shot learning methods. The Few-shot
Embedding Adaptation with Transformer (FEAT)
(Ye et al., 2020) utilizes it to perform task-specific
embedding adaptation. This approach employs a
set-to-set function to derive more discriminative
instance representations and models interactions
among images within each set. The CTX model
(Doersch et al., 2020) proposes a self-supervised
learning framework combined with a Cross Trans-
former, representing images as spatial tensors and
generating query-aligned prototypes. CTX uses self-
attention to identify the spatial attention weights of
support and query images, facilitating the learning of
query-aligned class prototypes. The Few-shot Cosine
Transformer (FS-CT) (Nguyen et al., 2023) intro-
duces cosine attention, which produces an effective
correlation map between support and query images,
outperforming softmax attention. Additionally,
the Bi-directional Feature Reconstruction Network
(Bi-FRN) (Wu et al., 2023) utilizes self-attention to
generate discriminative query and support features.
3 METHOD
Classification of plant species represents a fine-
grained visual classification (FGVC) task, charac-
terized by subtle differences among various species.
This challenge is compounded by large intra-class
variations and minimal inter-class differences that
must be addressed. Additionally, plant classification
is conducted on images within their natural environ-
ments, necessitating the removal of background ef-
fects. We propose a novel attention-based module de-
signed to generate discriminative feature maps for use
with existing few-shot fine-grained feature reconstruction
methods. This novel mod-
ule can enhance the performance of prior works (Wu
et al., 2023) (Wertheimer et al., 2021). The over-
all framework of our method is illustrated in Figure
1. For an episode, the support and query samples
are passed through feature embedding where feature
maps are obtained. To obtain discriminative feature
maps suitable for fine-grained classification, we refine
them with our proposed attention mechanism and pass
them through the Local Foreground Selection mod-
ule (LFSM). Subsequently, the output of the LFSM
is fed into the feature reconstruction networks (Wu
et al., 2023) (Wertheimer et al., 2021), which recon-
struct query and support features. Finally, the similar-
ity metric computes the Euclidean distance between
the original and reconstructed feature maps.
Figure 1: Local Foreground Selection aware Attentive Fea-
ture Reconstruction Network.
Problem Formulation. A given dataset $D$ is divided into $D_{train}$, $D_{val}$ and $D_{test}$ such that the categories in these sets are disjoint. Few-shot classification performs a $C$-way, $K$-shot classification task on $D_{test}$ by learning knowledge from $D_{train}$ and $D_{val}$. The task is constructed by selecting $C$ classes from the test set. From each of these classes, only $K$ labelled samples are selected to serve as the support samples ($S$), and $M$ unlabelled samples are selected to form the query samples ($Q$). In this few-shot task, the classification accuracy is determined on $D_{test}$.
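For concreteness, a minimal sketch of how such a $C$-way, $K$-shot episode with $M$ queries per class could be sampled is given below; the `images_by_class` dictionary and function names are illustrative assumptions, not part of our released code.

```python
import random

def sample_episode(images_by_class, C=5, K=1, M=15):
    # images_by_class: dict mapping class label -> list of image paths in D_test
    classes = random.sample(list(images_by_class), C)        # pick C novel classes
    support, query = [], []
    for label in classes:
        picks = random.sample(images_by_class[label], K + M) # disjoint support/query samples
        support += [(p, label) for p in picks[:K]]           # K labelled support samples (S)
        query   += [(p, label) for p in picks[K:]]           # M query samples (Q) per class
    return support, query
```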
3.1 Feature Embedding
The first step is to extract feature maps of support and
query samples. Following the literature, we use the Conv-4
and ResNet-12 networks as backbones to obtain fea-
ture maps. The architectures of these networks are the
same as in (Wertheimer et al., 2021). For a C-way and
K-shot task C ×(K + M) images are input to the em-
bedding module where features are extracted.
3.2 Local Foreground Selection Module
The LFSM module is responsible for generating discriminative features for the support and query pools. For a $C$-way, $K$-shot task, the extracted features from the embedding module are represented as $u_i = f_{\theta}(x_i) \in \mathbb{R}^{d \times h \times w}$, where $f_{\theta}$ is the embedding network, $x_i$ an input image, $d$ denotes the number of channels, $h$ is the height of the feature maps, and $w$ represents the width. The input features to the LFSM are denoted as $u_i$, while the output of this module is $y_i \in \mathbb{R}^{d \times r}$. Within the LFSM, the input features $u_i$ are transformed into $r$ local features $[u_i^1, u_i^2, u_i^3, \ldots, u_i^r] + E_{pos}$, which are then fed into a vision transformer encoder.
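The tokenisation step can be sketched as follows, under the assumption that each of the $h \times w$ spatial locations becomes one of the $r = h \cdot w$ local features and a learned positional embedding $E_{pos}$ is added; module and variable names are illustrative.

```python
import torch
import torch.nn as nn

class LocalTokenizer(nn.Module):
    """Turn a (B, d, h, w) feature map into r = h*w tokens plus positional embeddings."""
    def __init__(self, d, h, w):
        super().__init__()
        self.pos_embed = nn.Parameter(torch.zeros(1, h * w, d))  # E_pos

    def forward(self, u):
        B, d, h, w = u.shape
        tokens = u.flatten(2).transpose(1, 2)     # (B, r, d) with r = h*w local features
        return tokens + self.pos_embed            # [u_i^1, ..., u_i^r] + E_pos
```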
The LFSM constructs discriminative feature maps
utilizing the proposed Local Foreground Selection
(LFS) attention embedded within a vision transformer
encoder. The vision transformer originally employs self-attention; we replace it with LFS attention, which performs two tasks: a local attention task and a foreground selection attention task. The local attention emphasizes the local details of the plant object, which increases inter-class differences. The presence of natural background in plant images increases intra-class variations; we mitigate its impact with foreground selection attention, which filters out sequences associated with the background of the target plant object, ensuring that only the important tokens relevant to the foreground are retained. The two attentions, local and foreground selection, are aggregated to generate distinct features for fine-grained classification. The architecture of the LFS module is presented in Figure 2, and a comparison of the heatmap result-
ing from the LFS attention with the output of the fea-
ture embedding is visualized in Figure 3.
Local Attention. The local attention mechanism
integrates convolutions into the vision transformer
block to effectively model local spatial context. In-
stead of position-wise linear projections for Multi-
Head Self-Attention we use convolutional projections
for the attention operation. Specifically, depth-wise
separable convolutions are utilized for these projec-
tions, which are applied to derive the queries (Q), keys
(K), and values (V). The feature maps, reshaped ear-
lier in the LFSM module, serve as input for the depth-
wise convolutional projections. This function returns
the attention scores to the vision transformer encoder.
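A minimal sketch of such convolutional projections is given below; the kernel size, normalisation layer and exact layer ordering are our assumptions for illustration.

```python
import torch
import torch.nn as nn

def conv_projection(d, kernel_size=3):
    # Depth-wise separable convolution: a depth-wise conv followed by a point-wise conv.
    return nn.Sequential(
        nn.Conv2d(d, d, kernel_size, padding=kernel_size // 2, groups=d),  # depth-wise
        nn.BatchNorm2d(d),
        nn.Conv2d(d, d, kernel_size=1),                                     # point-wise
    )

class LocalAttentionProjections(nn.Module):
    """Produce Q, K, V from a (B, d, h, w) feature map via convolutional projections."""
    def __init__(self, d):
        super().__init__()
        self.q_proj = conv_projection(d)
        self.k_proj = conv_projection(d)
        self.v_proj = conv_projection(d)

    def forward(self, x):
        # Keep the spatial layout during projection, then flatten to (B, h*w, d) token sequences.
        def flat(t):
            return t.flatten(2).transpose(1, 2)
        return flat(self.q_proj(x)), flat(self.k_proj(x)), flat(self.v_proj(x))
```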
Foreground Selection Attention. The purpose of
foreground selection attention is to enable a focus
on the plant object rather than the background, as it
does not contribute to fine-grained classification. In
transformer self-attention, all tokens are assigned rel-
evance weights that reduce the influence of less im-
portant tokens. Tokens not associated with the plant
object receive lower attention scores; however, these
background tokens still exert some influence. To fur-
ther diminish their impact, these tokens are discarded,
allowing the feature maps to concentrate on the plant
object, thereby aiding in the classification of subtle
class differences. The foreground selection attention
generates relevance scores by selecting tokens based
on a defined threshold, discarding those below it. The
VISAPP 2025 - 20th International Conference on Computer Vision Theory and Applications
534
Figure 2: Local Foreground Selection Module where features are fed into the vision transformer and outputs feature pool.
dot product of Q and K
T
creates an attention score
matrix with dimensions $m \times m$. The relevance of all tokens is then evaluated relative to one another and stored in the relevance matrix, where the relevance of the $i$-th token with respect to the $j$-th token is denoted by the matrix element $\mathrm{relevance}_{i,j}$. Relevance scores are calculated for each row of the attention matrix. Tokens representing the plant object yield high relevance scores, while those associated with the background have low scores. Consequently, relevance scores higher than the threshold are preserved, while lower relevance scores are set to zero. To implement this, we sort each row in descending order. After sorting the $i$-th row of the relevance matrix, we identify the $(m \times \mathrm{FS\text{-}ratio})$-th index, where the FS-ratio determines the number of tokens selected and ranges from 0.1 to 1.0. The relevance score at this index is recorded as the threshold for comparison. Each $\mathrm{relevance}_{i,j}$ is then compared with the threshold; if $\mathrm{relevance}_{i,j}$ exceeds the threshold it is retained, otherwise it is set to zero. The relevance matrix is then normalized by a softmax function, and the matrix multiplication of the resulting attention scores and $V$ gives the weighted sum over $V$. This process is described by the following equations:
$$\mathrm{relevance} = \frac{QK^{T}}{\sqrt{d_k}} \qquad (1)$$

$$\mathrm{index} = \mathrm{argsort}(\mathrm{relevance}_i)[\mathrm{FS\text{-}ratio} \times m] \qquad (2)$$

$$\mathrm{relevance}_i = \begin{cases} \mathrm{relevance}_i, & \text{if } \mathrm{relevance}_i > \mathrm{relevance}_i[\mathrm{index}] \\ 0, & \text{otherwise} \end{cases} \qquad (3)$$

$$\text{FS-Attention} = \mathrm{softmax}(\mathrm{relevance}_i)\,V \qquad (4)$$
After obtaining the Foreground Selection attention (FS-Attention), we replace all non-zero weights with one, so that the elements of the matrix are either zero or one: background-related tokens become zero and plant-object tokens become one. This takes the form of a binary matrix in which the relevance score positions that contribute to the object are represented by ones, and the non-contributing positions that represent the background are zero.
$$\text{FS-Attention} = \begin{cases} 1, & \text{if } \text{FS-Attention} > 0 \\ 0, & \text{otherwise} \end{cases}$$
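The following sketch illustrates Equations (1)-(4) together with the binarisation step in a simplified single-head form; we read the binary mask as being derived from the thresholded relevance matrix, and all names and shapes are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def foreground_selection_attention(Q, K, V, fs_ratio=0.3):
    """Single-head sketch of Eqs. (1)-(4) plus binarisation. Q, K, V: (B, m, d_k)."""
    d_k = Q.size(-1)
    relevance = Q @ K.transpose(-2, -1) / d_k ** 0.5             # Eq. (1): (B, m, m)

    keep = max(1, int(fs_ratio * relevance.size(-1)))             # tokens kept per row
    threshold = relevance.topk(keep, dim=-1).values[..., -1:]     # Eq. (2): score at the (m * FS-ratio)-th index
    kept = torch.where(relevance > threshold,
                       relevance, torch.zeros_like(relevance))    # Eq. (3): zero scores below the threshold

    fs_output = F.softmax(kept, dim=-1) @ V                       # Eq. (4): weighted sum over V
    fs_mask = (kept > 0).float()                                   # binary foreground mask, (B, m, m)
    return fs_output, fs_mask
```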
Local Foreground Selection Attention. The local
foreground selection attention is a bi-functional at-
tention mechanism designed to capture local details
of the plant object, as reflected in the local atten-
tion matrix. The foreground selection attention in-
dicates which locations within the attention matrix
have scores originating from the plant object ver-
sus the background. In the vision transformer en-
coder, the attention scores derived from local atten-
tion are element-wise multiplied with the foreground
selection attention matrix. This ensures that regions
with lower relevance score corresponding to the back-
ground are set to zero, effectively discarding them.
Consequently, the resulting attention matrix contains
values that specifically pertain to the plant object.
Utilizing attention scores from local attention, rather
than self-attention, enhances the representation of lo-
cal details of the plant object. This targeted attention
improves focus on the plant object while minimiz-
ing background interference, ultimately leading to in-
creased accuracy. After obtaining the attention weights, the generated features are computed through successive Layer Normalization and Multi-Layer Perceptron operations, producing feature pools as output.
$$\hat{y}_i = \mathrm{Attention}\left(y_i W^{Q}_{\alpha},\; y_i W^{K}_{\alpha},\; y_i W^{V}_{\alpha}\right), \quad \hat{y}_i \in \mathbb{R}^{r \times d} \qquad (5)$$

where $W^{Q}_{\alpha}$, $W^{K}_{\alpha}$ and $W^{V}_{\alpha}$ are learnable weight parameters of size $d \times d$. The $\hat{y}_i$ are then obtained from the Multi-Layer Perceptron (MLP) and Layer Normalization (LN) using the following equation:

$$\hat{y}_i = \mathrm{MLP}\left(\mathrm{LN}\left(y_i + \hat{y}_i\right)\right) \qquad (6)$$
Figure 3: Heat map comparing the original feature maps
to our proposed Local Foreground Selection attention on
Oxford Flower-102 dataset.
4 EXPERIMENTAL RESULTS
AND ANALYSIS
Datasets. The plant species classification problem
is considered with natural background images from
three datasets. The Oxford Flower-102 dataset (Nils-
back and Zisserman, 2008) is a benchmark for few-
shot fine-grained classification, featuring 102 flower
categories. From the iNaturalist 2019 dataset (Van Horn
et al., 2018), known for its long-tailed distribution, we
use a subset of 348 plant species from the Plantae
category, each containing 50–1000 images. Similarly,
from the Plant Net 300-K dataset (Garcin et al., 2021),
originally comprising 1081 plant species, we use a subset
of 252 species having 50–1000 images for few-shot clas-
sification. All images across the datasets are resized
to 84×84 without any bounding box cropping.
Implementation Details. The experiments are conducted on a GeForce 3090 GPU, implemented in PyTorch with Conv-4 and ResNet-12 backbones. All methods, including the baseline, the state-of-the-art methods and ours, are trained from scratch. Our method is trained for 1200 epochs using SGD with Nesterov momentum of 0.9. The initial learning rate is 0.1 and the weight decay is 5e-4. The learning rate is decreased by a factor of 10 every 400 epochs. Conv-4 models are trained with 30-way 5-shot episodes and 15 query images per class; ResNet-12 models are trained with 15-way 5-shot episodes. Testing for both backbones is performed in the 1-shot and 5-shot settings. Data augmentation methods such as centre crop, random horizontal flip and colour jitter are used. The best-performing model is selected based on the validation set, evaluated every 20 epochs. For all experiments, we report the mean accuracy over 10,000 randomly generated tasks on $D_{test}$ with 95% confidence intervals in the standard 1-shot and 5-shot settings.
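For reference, the optimizer and learning-rate schedule described above could be configured as in the following sketch; only the hyperparameters are taken from the text, and the placeholder model stands in for the backbone with the LFSM.

```python
import torch
import torch.nn as nn

model = nn.Linear(64, 30)  # placeholder; in practice the backbone with the LFSM module

# Hyperparameters stated above: SGD with Nesterov momentum 0.9, initial lr 0.1,
# weight decay 5e-4, lr divided by 10 every 400 epochs, 1200 epochs in total.
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9,
                            nesterov=True, weight_decay=5e-4)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=400, gamma=0.1)

for epoch in range(1200):
    # ... episodic training pass: 30-way 5-shot (Conv-4) or 15-way 5-shot (ResNet-12) ...
    scheduler.step()
```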
4.1 Comparison with State-of-the-Art
The accuracy of our method is tested for plant species
classification via few-shot learning. Experiments are
conducted on the three plant species datasets mentioned
above. We obtain results for the state-of-the-art
methods ourselves using their official code. The
state-of-the-art methods that we compare our method
to include ProtoNet (Snell et al., 2017), DN4 (Li et al.,
2019), CTX (Doersch et al., 2020), FRN (Wertheimer
et al., 2021), LCCRN (Li et al., 2023), BiFRN (Wu
et al., 2023), BSFA (Zha et al., 2023) and FSCT
(Nguyen et al., 2023). A comparison of the perfor-
mance of our method with the state-of-the-art meth-
ods is reflected in Table 1 for Conv-4 backbone and
Table 2 for the ResNet-12 backbone. For the Conv-4 backbone, our method outperforms all other state-of-the-art methods on all three plant species datasets. In particular, the performance of FRN (Wertheimer et al., 2021) improves by 2-7% when combined with our method. The 1-shot performance improves significantly for all three datasets, and the 5-shot accuracy also increases appreciably. When our method is used with BiFRN (Wu et al., 2023), our local foreground selection attention works well for both 1-shot and 5-shot, as it removes the effect of the background and focuses on local details. For the ResNet-12 backbone, the performance of our method is also better for 5-way 5-shot and 5-way 1-shot on all three datasets. The accuracy improves by 1-6% when our attention mechanism is introduced into FRN and BiFRN. The LFS attention used with FRN outperforms all state-of-the-art methods as well as our LFS with BiFRN. This shows that our attention mechanism yields highly discriminative feature maps that are suitable for feature reconstruction. Even though FRN performs unidirectional feature reconstruction, the LFS attention makes its performance comparable to the bi-directional feature reconstruction in BiFRN.
4.2 Ablation Study
The Effectiveness of Local Foreground Selection
Attention. Our proposed LFS attention improves
the overall accuracy when used with both FRN and
BiFRN. We therefore compare its effectiveness against self-attention, local attention and foreground selection attention (FS-attention). A comparison is presented with the baseline, i.e. ProtoNet (Snell et al., 2017), and the aforementioned attentions. Our lo-
Table 1: Comparison with state-of-the-art for 1-shot and 5-shot classification of plant species on Conv-4 backbone.
Method
Oxford Flower-102 iNaturalist19 Plant Net 300-K
1-shot 5-shot 1-shot 5-shot 1-shot 5-shot
Proto(Snell et al., 2017) 59.19±.22 83.10±.14 54.40±.23 76.21±.16 54.83±.22 77.60±.15
DN4(Li et al., 2019) 51.86±.21 69.13±.21 38.88±.15 49.23±.13 40.45±.16 55.41±.23
CTX(Doersch et al., 2020) 70.52±.91 80.82±.71 46.96±.94 63.18±.79 49.69±.88 65.67±.76
FRN(Wertheimer et al., 2021) 69.40±.20 87.45±.12 62.08±.22 80.19±.15 62.01±.22 80.81±.14
BiFRN(Wu et al., 2023) 75.11±.17 89.55±.15 64.54±.20 81.89±.12 62.65±.23 81.56±.17
LCCRN(Li et al., 2023) 71.38±.21 85.70±.14 64.47±.23 80.36±.16 63.50±.22 80.55±.15
BSFA(Zha et al., 2023) 71.42±.47 83.85±.33 60.51±.53 75.85±.40 62.57±.50 77.16±.36
FSCT-Cosine(Nguyen et al., 2023) 66.30±.87 82.04±.68 51.83±.96 64.46±.83 52.07±.96 66.03±.75
FSCT-Softmax(Nguyen et al., 2023) 62.96±.91 76.22±.70 50.09±.95 62.04±.74 49.84±.95 62.12±.82
Ours: LFS + BiFRN 75.71±.10 90.30±.13 65.36±.15 82.59±.11 63.51±.20 82.19±.14
Ours: LFS + FRN 76.46±.20 89.94±.13 66.62±.22 82.76±.15 64.67±.20 82.31±.14
Table 2: Comparison with state-of-the-art for 1-shot and 5-shot classification of plant species on ResNet-12 backbone.
Method
Oxford Flower-102 iNaturalist19 Plant Net 300-K
1-shot 5-shot 1-shot 5-shot 1-shot 5-shot
Proto(Snell et al., 2017) 75.23±.20 88.43±.13 72.44±.29 87.37±.21 69.96±.21 86.34±.13
DN4(Li et al., 2019) 68.62±.21 74.19±.21 58.08±.15 59.37±.13 56.45±.16 65.41±.13
CTX(Doersch et al., 2020) 77.12±.91 83.62±.71 55.36±.94 60.58±.79 59.61±.88 70.47±.76
FRN(Wertheimer et al., 2021) 79.20±.20 90.10±.11 74.62±.23 87.52±.14 72.52±.21 86.33±.14
BiFRN(Wu et al., 2023) 77.75±.17 89.94±.15 72.24±.16 87.20±.12 70.84±.15 86.52±.12
LCCRN(Li et al., 2023) 78.95±.19 92.60±.10 74.96±0.21 87.65±.12 72.63±.22 86.50±.15
BSFA(Zha et al., 2023) 77.02±.44 89.06±.26 76.23±.48 87.31±.30 74.21±.49 86.63±.29
FSCT-Cosine(Nguyen et al., 2023) 63.10±.87 85.16±.68 58.85±.96 61.86±.83 62.14±.96 71.83±.75
FSCT-Softmax(Nguyen et al., 2023) 59.29±.91 79.52±.70 59.79±.95 58.84±.74 60.01±.95 67.12±.82
Ours: LFS + BiFRN 79.77±.12 90.87±.10 72.95±.15 87.67±.11 71.04±.15 86.70±.14
Ours: LFS + FRN 79.85±.20 93.50±.11 79.61±.11 93.33±.15 77.42±.15 88.71±.14
cal foreground selection attention (LFS-attention) performs better than either local attention or foreground selection attention on its own; neither gives the best results when used alone, which is evidence that the combined LFS-attention is more effective. The results are shown in Table 3 for two datasets and reflect that our proposed LFS-attention is a superior approach for few-shot fine-grained classification.
The Effect of FS-Ratio. The FS-ratio, defined in the foreground selection attention, determines the number of tokens that are accepted. By varying the FS-ratio we can observe how the number of accepted tokens affects the results. Classification accuracy for different FS-ratios varies with the backbone and the dataset, as reflected in Table 4 and Table 5, which present evidence for the effect of the number of background tokens accepted in the local foreground selection attention. The tables report performance for FS-ratios ranging from 0.1 to 1.0.
The FS-ratio that gives the best performance is highlighted in bold for each method and the respective dataset. For Oxford Flower-102 the best results are given by FS-ratio 0.3 for FRN (Conv-4) and 0.5 for BiFRN (Conv-4). For the ResNet-12 backbone, FS-ratio 0.1 gives the best results for both FRN and BiFRN. For iNaturalist 2019, Conv-4 gives the best results with FS-ratio 0.3 for FRN and 0.5 for BiFRN, while for ResNet-12 an FS-ratio of 0.3 works
Table 3: Ablation study for effectiveness of LFS attention on Oxford Flowers102 and iNaturalist 2019 datasets.
Backbone Method
Oxford Flower-102 iNaturalist19
1-shot 5-shot 1-shot 5-shot
Conv-4 Baseline 59.19±.22 83.10±.14 54.40±.22 76.21±.17
FRN
(Conv-4)
Self-attention 74.20±.20 87.96±.13 64.21±.21 81.21±.14
Local attention 74.89±.18 88.20±.13 64.68±.25 81.51±.11
Selective attention 75.15±.17 88.65±.14 64.92±.20 81.70±.12
LFS-attention 76.46±.20 89.94±.13 66.62±.22 82.76±.15
BiFRN
(Conv-4)
Self-attention 75.11±.17 89.55±.15 64.54±.20 81.89±.12
Local attention 74.91±.17 89.60±.15 65.03±.20 82.28±.12
Selective attention 74.84±.17 89.79±.15 64.91±.20 82.36±.12
LFS-attention 75.71±.10 90.30±.13 65.36±.15 82.59±.11
ResNet-12 Baseline 70.99±.20 86.99±.13 64.59±.29 81.65±.21
FRN
(ResNet-12)
Self-attention 79.01±.20 91.50±.11 77.69±.11 90.52±.16
Local attention 78.88±.16 91.89±.12 78.12±0.15 91.26±.13
Selective attention 78.95±.21 92.26±.15 78.54±.12 91.98±.17
LFS-attention 79.85±.20 93.50±.11 79.61±.11 93.33±.15
BiFRN
(ResNet-12)
Self-attention 77.75±.17 89.94±.15 72.24±.16 87.20±.12
Local attention 78.31±.17 90.08±.15 71.97±.16 87.12±.12
Selective attention 77.75±.28 88.79±.15 72.44±.16 87.27±.12
LFS-attention 79.77±.12 90.87±.10 72.95±.15 87.67±.11
Table 4: Performance with different select ratios on Oxford
flowers-102 dataset for 5-shot and 1-shot.
FRN FS-ratio/ Oxford flower-102
0.1 0.3 0.5 0.7 0.9
Conv
5-shot 89.42 89.94 88.83 89.30 88.70
1-shot 76.03 76.46 75.26 75.16 74.20
ResNet
5-shot 93.50 93.37 93.04 92.88 93.12
1-shot 79.85 79.73 79.43 79.46 79.45
BiFRN FS-ratio/ Oxford flower-102
Conv
5-shot 89.69 89.32 90.30 89.44 89.91
1-shot 74.96 74.64 75.71 74.51 74.91
ResNet
5-shot 90.87 90.40 89.90 89.81 90.24
1-shot 79.77 78.97 77.83 78.55 78.51
best for both FRN and BiFRN. Thus, the most suitable FS-ratio differs across datasets, methods and backbones.
5 CONCLUSION
In this paper we propose a novel Local Foreground Selection (LFS) attention-based module, the Local Foreground Selection Module (LFSM), tailored
Table 5: Performance with different select ratios on iNatu-
ralist 19 dataset for 5-shot and 1-shot.
FRN FS-ratio/ iNaturalist 2019
0.1 0.3 0.5 0.7 0.9
Conv
5-shot 82.69 82.76 82.48 82.52 82.40
1-shot 66.15 66.62 66.14 65.83 66.01
ResNet
5-shot 93.21 93.33 92.82 93.06 92.82
1-shot 79.44 79.61 78.81 79.12 78.69
BiFRN FS-ratio/ iNaturalist 2019
Conv
5-shot 82.39 82.05 82.59 81.82 82.50
1-shot 65.00 64.43 65.36 64.63 65.18
ResNet
5-shot 86.68 87.67 86.59 87.37 85.92
1-shot 70.99 72.95 70.98 72.19 70.25
to generate discriminative query and support fea-
tures suitable for few-shot fine-grained classification. The proposed attention optimizes large intra-class and small inter-class variations for the plant species classification task. The foreground selection attention works by highlighting the foreground and reducing the effect of the background, which contributes to alleviating the high intra-class variations. Secondly,
we extract local spatial details to focus on the fine-
grained details of the plant object with the local atten-
tion. Combining the local and foreground selection attentions enhances accuracy. The proposed LFSM module complements feature reconstruction methods and improves performance on plant species datasets, and the ablation studies further validate the effectiveness of the LFS attention.
ACKNOWLEDGMENTS
This research utilized Queen Mary’s Apocrita
HPC facility, supported by QMUL Research-IT.
http://doi.org/10.5281/zenodo.438045.
The research leading to this paper was partially
supported by the EU Horizon Europe research and
innovation Programme under Grant Agreement No.
101086461.
REFERENCES
Doersch, C., Gupta, A., and Zisserman, A. (2020).
Crosstransformers: spatially-aware few-shot transfer.
Advances in Neural Information Processing Systems,
33:21981–21993.
Garcin, C., Joly, A., Bonnet, P., Lombardo, J.-C., Af-
fouard, A., Chouet, M., Servajean, M., Lorieul, T.,
and Salmon, J. (2021). Pl@ntnet-300k: a plant im-
age dataset with high label ambiguity and a long-tailed
distribution. In NeurIPS 2021-35th Conference on
Neural Information Processing Systems.
Huang, H., Zhang, J., Yu, L., Zhang, J., Wu, Q., and Xu,
C. (2021a). Toan: Target-oriented alignment network
for fine-grained image categorization with few labeled
samples. IEEE Transactions on Circuits and Systems
for Video Technology, 32(2):853–866.
Huang, H., Zhang, J., Zhang, J., Xu, J., and Wu, Q.
(2021b). Low-rank pairwise alignment bilinear net-
work for few-shot fine-grained image classification.
IEEE Transactions on Multimedia, 23:1666–1680.
Li, W., Wang, L., Xu, J., Huo, J., Gao, Y., and Luo, J.
(2019). Revisiting local descriptor based image-to-
class measure for few-shot learning. In Proceedings
of the IEEE/CVF conference on computer vision and
pattern recognition, pages 7260–7268.
Li, X., Song, Q., Wu, J., Zhu, R., Ma, Z., and Xue,
J.-H. (2023). Locally-enriched cross-reconstruction
for few-shot fine-grained image classification. IEEE
Transactions on Circuits and Systems for Video Tech-
nology.
Li, X., Wu, J., Sun, Z., Ma, Z., Cao, J., and Xue, J.-H.
(2020). Bsnet: Bi-similarity network for few-shot
fine-grained image classification. IEEE Transactions
on Image Processing, 30:1318–1331.
Nguyen, H. Q., Nguyen, C. Q., Le, D. D., and Pham, H. H.
(2023). Enhancing few-shot image classification with
cosine transformer. IEEE Access.
Nilsback, M.-E. and Zisserman, A. (2008). Automated
flower classification over a large number of classes.
In 2008 Sixth Indian conference on computer vision,
graphics & image processing, pages 722–729. IEEE.
Snell, J., Swersky, K., and Zemel, R. (2017). Proto-
typical networks for few-shot learning. In Guyon,
I., Luxburg, U. V., Bengio, S., Wallach, H., Fer-
gus, R., Vishwanathan, S., and Garnett, R., editors,
Advances in Neural Information Processing Systems,
volume 30. Curran Associates, Inc.
Sun, X., Xv, H., Dong, J., Zhou, H., Chen, C., and Li, Q.
(2021). Few-shot learning for domain-specific fine-
grained image classification. IEEE Transactions on
Industrial Electronics, 68(4):3588–3598.
Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P. H., and
Hospedales, T. M. (2018). Learning to compare: Re-
lation network for few-shot learning. In Proceedings
of the IEEE conference on computer vision and pat-
tern recognition, pages 1199–1208.
Van Horn, G., Mac Aodha, O., Song, Y., Cui, Y., Sun,
C., Shepard, A., Adam, H., Perona, P., and Belongie,
S. (2018). The inaturalist species classification and
detection dataset. In Proceedings of the IEEE con-
ference on computer vision and pattern recognition,
pages 8769–8778.
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones,
L., Gomez, A. N., Kaiser, Ł., and Polosukhin, I.
(2017). Attention is all you need. Advances in neural
information processing systems, 30.
Vinyals, O., Blundell, C., Lillicrap, T., Wierstra, D., et al.
(2016). Matching networks for one shot learning. Ad-
vances in neural information processing systems, 29.
Wertheimer, D., Tang, L., and Hariharan, B. (2021). Few-
shot classification with feature map reconstruction
networks. In Proceedings of the IEEE/CVF con-
ference on computer vision and pattern recognition,
pages 8012–8021.
Wu, J., Chang, D., Sain, A., Li, X., Ma, Z., Cao, J., Guo,
J., and Song, Y.-Z. (2023). Bi-directional feature re-
construction network for fine-grained few-shot image
classification. In Proceedings of the AAAI Conference
on Artificial Intelligence, volume 37, pages 2821–
2829.
Ye, H.-J., Hu, H., Zhan, D.-C., and Sha, F. (2020). Few-
shot learning via embedding adaptation with set-to-
set functions. In Proceedings of the IEEE/CVF con-
ference on computer vision and pattern recognition,
pages 8808–8817.
Zha, Z., Tang, H., Sun, Y., and Tang, J. (2023). Boosting
few-shot fine-grained recognition with background
suppression and foreground alignment. IEEE Trans-
actions on Circuits and Systems for Video Technology.
Zhang, C., Cai, Y., Lin, G., and Shen, C. (2020). Deepemd:
Few-shot image classification with differentiable earth
mover’s distance and structured classifiers. In Pro-
ceedings of the IEEE/CVF conference on computer vi-
sion and pattern recognition, pages 12203–12213.
Zhang, W., Liu, X., Xue, Z., Gao, Y., and Sun, C. (2021).
Ndpnet: A novel non-linear data projection network
for few-shot fine-grained image classification. arXiv
preprint arXiv:2106.06988.
Zhu, Y., Liu, C., and Jiang, S. (2020). Multi-attention meta
learning for few-shot fine-grained image recognition.
In IJCAI, pages 1090–1096.