Urban Re-Identification: Fusing Local and Global Features with
Residual Masked Maps for Enhanced Vehicle Monitoring in Small
Datasets
William A. Ramirez, Cesar A. Sierra Franco, Thiago R. da Motta and Alberto Raposo
Pontifical Catholic University of Rio de Janeiro, Brazil
Keywords:
Urban Re-Identification, Dilated Region Proposal, Local and Global Attribute Fusion, Residual Connection
Modules, Multi-Object Tracking.
Abstract:
This paper presents an optimized vehicle re-identification (Re-ID) approach focused on small datasets. While
most existing literature concentrates on deep learning techniques applied to large datasets, this work addresses
the specific challenges of working with smaller datasets, mainly when dealing with incomplete partitioning
information. Our approach explores automated regional proposal methods, examining residuality and uniform
sampling techniques for connected regions through statistical methods. Additionally, we integrate global and
local attributes based on mask extraction to improve the generalization of the learning process. This led to a
more effective balance between small and large datasets, achieving up to an 8.3% improvement in Cumulative
Matching Characteristics (CMC) at k=5 compared to attention-based methods for small datasets. We also improved
generalization to context changes by up to 13% in CMC for large datasets. The code, model, and
DeepStream-based implementations are available at https://github.com/will9426/will9426-automatic-Region-
proposal-for-cars-in-Re-id-models.
1 INTRODUCTION
Re-identification in urban areas, and generally in un-
common classes, often falls into small dataset scenar-
ios. In these cases, video frames with a temporal rela-
tionship and tracking information from one or multi-
ple cameras are commonly considered. This field can
involve various types of approaches categorized into
Mask-Guided Models (Song et al., 2018; Kalayeh
et al., 2018; Zhang et al., 2020; Lv et al., 2024),
Stripe-Based Methods (Luo et al., 2019b; Wang et al.,
2018; Fan et al., 2019), Attention-Based Methods
(Si et al., 2018; Chen et al., 2021), and GAN-Based
Methods (Jiang et al., 2021), often supplemented by
re-ranking methods (Luo et al., 2019a). The previ-
ously mentioned techniques result from challenges
focused on data availability and the need to extract in-
creasingly deeper and more representative attributes.
Mask-guided models often entail additional costs in
annotation or inference, as it is necessary to have a
process for detecting parts of the instances. (Zhang
et al., 2020) shows how this can involve an addi-
tional annotation task in exchange for optimizing
the identification process due to the availability of
more contextual information. On the other hand,
Stripe-Based Methods highlight another possible ap-
proach to global and localized attribute extraction, us-
ing statistical methods to propose areas that may be
more representative, shifting the annotation cost to a
method embedded within the model. Regarding the
use of attention modules along with partition-based
techniques, (Zhang et al., 2020) points out the fusion
of local and global attributes using squeeze and exci-
tation layers (SE layers) and attention, achieving im-
provements of up to 1.2% in the Cumulative Match-
ing Characteristics (CMC), a metric specialized in re-
identification. Each of these methods presents trade-
offs: Mask-guided models offer fine-grained detail
but require extensive annotation, increasing complex-
ity; Stripe-Based Methods simplify this at the risk of
oversimplifying regions of interest; Attention-Based
Methods balance attribute fusion but may overfit or
add computational overhead; and GAN-Based Meth-
ods provide flexibility in augmenting data but can be
computationally intensive and challenging to stabi-
lize.
This paper addresses the challenges associated
with small datasets by studying algorithms for effi-
cient region estimation and improving feature map
diversity to enhance model generalization. We ex-
plore using statistical methods and attribute extrac-
tion techniques to create connected areas, proposing
more representative regions or stripes that lead to fea-
ture maps with greater representativeness. The pro-
posed model consists of three branches: localized fea-
ture maps (using RPN), globalized features (leverag-
ing the backbone and our custom attribute extraction
block), and base features from the backbone. Our
case study uses the VRIC and VeRi datasets, includ-
ing cross-validation experiments with mixed training
and testing configurations, specifically VRIC→VeRi
and VeRi→VRIC. Our approach builds on the base-
line established in (Zhang et al., 2020; Luo et al.,
2019a), which emphasizes the use of camera-guided
partitioned attention and attribute fusion while dis-
cussing the limitations associated with this module.
In summary, the main contributions of our work
are as follows:
• We propose an optimized approach for region ex-
traction, demonstrating how the use of statistical
methods like Monte Carlo can be helpful for stripe
extraction while simultaneously showing how us-
ing the same backbone to propose regions can be
a successful path to achieving a more balanced
model in terms of response to variance.
• We propose a validation method for the trained
model, introducing the concept of cross-inference
between two datasets with the same instance category
but with apparent differences in context and resolution.
We aim to demonstrate
that the trained model can be used in other con-
texts.
• Based on the challenges discussed around re-
identification, we proposed a computationally bal-
anced and reliable model to establish a real-time
baseline using DeepStream.
1.1 Data Augmentation
Deep learning approaches are often highlighted for
their reliance on large datasets. Data augmentation
has emerged as a valuable strategy in the context of
re-identification with small datasets. Techniques such
as grayscale conversion, random erasing, image ori-
entation flips, and zooming generate synthetic data
with some variability (Gong et al., 2021; Jiang et al.,
2021). Similarly, Generative Networks (Zheng et al.,
2019; He et al., 2023; Karras et al., 2020; Karras et al.,
2021) have proven effective in increasing variability
in small datasets with low variability, facilitating data
augmentation in labeled Re-identification datasets.
In the context of GANs, DG-Net (Zheng et al.,
2019) emphasizes its ability to generalize key features
such as pose and clothing. The model incorporates a
feature disentangling module and reconstruction loss,
enhancing cross-domain generation. DG-GAN (He
et al., 2023) aims to learn from defects or irregular
regions. The DG-GAN architecture includes two genera-
tors and four discriminators across two domains, en-
hancing synthetic data generation for pattern recog-
nition in land surfaces, such as roads or open areas.
The StyleGAN architecture has gained attention in re-
cent years for its contributions to synthetic datasets
(Karras et al., 2020; Karras et al., 2021). StyleGAN3
(Karras et al., 2021) represents a significant advance-
ment in computer vision, albeit with high computa-
tional demands. StyleGAN3 employs components
like the Mapping Network, Synthesis Network, and
Weight Demodulation to provide fine control over im-
age style while effectively addressing aliasing, result-
ing in high-quality images without noise associated
with generative learning.
1.2 Baseline for ReID
In the past decade, ReID methods have evolved into
several categories: Mask-Guided Models, Stripe-
Based Methods, Pose-Guided Methods, Attention-
Based Methods, and GAN-Based Methods, often with
re-ranking techniques to address data limitations and
identification challenges. Mask-guided models use
instance-specific masks derived from segmentation or
detection (Song et al., 2018; Kalayeh et al., 2018;
Zhang et al., 2020; Lv et al., 2024), combining lo-
cal and global attributes. Despite the additional pro-
cessing costs, they enhance matching performance,
achieving up to 93% for CMC@5 (Zhang et al., 2020;
Lv et al., 2024). Stripe-based methods segment im-
ages to create local embeddings, as seen in (Luo et al.,
2019b; Wang et al., 2018; Fan et al., 2019; Fawad
et al., 2020). However, alignment issues challenge
these methods, leading to approaches like Aligne-
dReID++ (Luo et al., 2019b), which reported a 3%
improvement in Rank-1 accuracy compared to mod-
els without alignment. Attention-based methods ex-
tract discriminative features without explicit masks
(Si et al., 2018; Chen et al., 2021). While effec-
tive for large datasets, they face overfitting issues
in smaller ones. For instance, (Chen et al., 2021)
reported a 9% improvement in MAP@5 for larger
datasets. GAN-based methods address small dataset
limitations by generating synthetic data. For exam-
ple, (Jiang et al., 2021) enhanced person category
diversity using GANs, achieving a 1% improvement
in CMC@1. Batch Normalization improves training
stability and generalization for large datasets. (Luo
et al., 2019a) demonstrated up to a 6% enhancement
for CMC@1.
2 METHOD
2.1 Dilated Region Proposal for Cars
(DRPC)
We propose a module inspired by (Chen et al.,
2023; Lv et al., 2024), where we use the ResNet-
50 backbone to propose regions. Initially, we use
the first layer of our backbone, where we extract
low-level features. In this layer, with tensor dimen-
sions [B, C, H, W], a feature map is generated as
F = ResNet(input tensor). We then average across
the channel dimension to obtain a 2D representation
of the feature magnitudes, as shown in Eq. 1.
\[
M_{\mathrm{avg}}(b, h, w) = \frac{1}{C_{\mathrm{out}}} \sum_{c=1}^{C_{\mathrm{out}}} F(b, c, h, w) \tag{1}
\]
where M_avg is the average tensor with dimensions [B, 1, H_out, W_out].
To create a candidate region based on Eq. 2, we use an adaptive threshold over the global average value of M_avg:
\[
M_{\mathrm{avg}}(b, h, w) =
\begin{cases}
1 & \text{if } M_{\mathrm{avg}}(b, h, w) > \mu_{\mathrm{avg}} \\
0 & \text{if } M_{\mathrm{avg}}(b, h, w) \le \mu_{\mathrm{avg}}
\end{cases} \tag{2}
\]
where µ_avg is the global average value of M_avg, described in Eq. 3.
\[
\mu_{\mathrm{avg}} = \frac{1}{B \cdot H_{\mathrm{out}} \cdot W_{\mathrm{out}}} \sum_{b=1}^{B} \sum_{h=1}^{H_{\mathrm{out}}} \sum_{w=1}^{W_{\mathrm{out}}} M_{\mathrm{avg}}(b, h, w) \tag{3}
\]
Thus, we obtain a binary candidate mask that
highlights areas of interest based on the feature map
extracted. However, our goal is to create regions
around these initially highlighted characteristics in
stripes, aiming for the model to emphasize contours
and other low-level features of the pre-selected area.
To create the areas, we initialize a mask M_regions with dimensions [B, N, H_out, W_out], where N is the number of desired regions. Each vertical stripe is extracted and dilated. Consider a stripe R with dimensions [H_out, width], where "width" is the width of the stripe. The dilation is performed using a structural operation that expands the stripe. If R is the original stripe and D_dilated is the result after dilating R, as shown in Eq. 4:
\[
D_{\mathrm{dilated}} = R \oplus K \tag{4}
\]
where \oplus denotes the morphological dilation and K is a dilation kernel (in this case, a matrix of ones). Finally, the dilated stripe D_dilated is assigned to the region mask M_regions at the corresponding position; Eqs. 5 and 6 describe the proposed region and the resulting feature extraction within the proposed area.
\[
M_{\mathrm{regions}}(b, i, :, \mathrm{start}_i : \mathrm{end}_i) = D_{\mathrm{dilated}} \tag{5}
\]
\[
F^{\mathrm{prop}}_{n} = F \odot M_{\mathrm{regions}} \tag{6}
\]
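As a concrete illustration, the following PyTorch sketch implements our reading of Eqs. 1-6. The function name drpc_region_masks, the ones-kernel convolution used to approximate the morphological dilation, and the default stripe count and kernel size are illustrative assumptions, not the released implementation.

```python
import torch
import torch.nn.functional as F_nn

def drpc_region_masks(feat: torch.Tensor, num_regions: int = 8,
                      dilation: int = 3) -> torch.Tensor:
    """Sketch of the DRPC stripe proposal (Eqs. 1-6).

    feat: low-level feature map [B, C, H, W] from the first ResNet-50 layer.
    Returns stripe masks of shape [B, N, H, W].
    """
    B, C, H, W = feat.shape

    # Eq. 1: average over the channel dimension -> [B, 1, H, W]
    m_avg = feat.mean(dim=1, keepdim=True)

    # Eqs. 2-3: binarize using the global average as an adaptive threshold
    mu_avg = m_avg.mean()
    m_bin = (m_avg > mu_avg).float()

    # Split the width into N vertical stripes and dilate each one (Eqs. 4-5)
    masks = torch.zeros(B, num_regions, H, W, device=feat.device)
    stripe_w = W // num_regions
    kernel = torch.ones(1, 1, dilation, dilation, device=feat.device)
    for i in range(num_regions):
        start = i * stripe_w
        end = (i + 1) * stripe_w if i < num_regions - 1 else W
        stripe = m_bin[:, :, :, start:end]
        # morphological dilation approximated by a ones-kernel convolution
        dilated = (F_nn.conv2d(stripe, kernel, padding=dilation // 2) > 0).float()
        masks[:, i, :, start:end] = dilated[:, 0, :, : end - start]
    return masks

# Eq. 6: local features for each proposed region
# f_prop = feat.unsqueeze(1) * masks.unsqueeze(2)   # [B, N, C, H, W]
```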
2.2 Quasi-Monte Carlo for Proposal
Regions (QMCPR)
This section introduces the use of Quasi-Monte
Carlo (QMC) to generate proposal regions without
relying on pre-annotated masks or outputs from seg-
menters or detectors. Strips x_n are defined in a unit space [0, 1]^d using the Halton sequence, which distributes points uniformly. For the two-dimensional case d = 2, each strip x_n = (x_n, y_n) is derived using bases b_1 and b_2, as shown in Eq. 7.
\[
x_n = \left( \varphi_{b_1}(n), \; \varphi_{b_2}(n) \right) \tag{7}
\]
where \varphi_b(n) is the radical inverse function in base b, which converts an integer n into a fraction in base b. Initially, the most statistically discriminative areas are selected through a histogram analysis of the instance-level image to restrict the search space and highlight the discriminative areas. Based on this analysis, greater weight is given to the bases in the most relevant regions.
Subsequently, the QMC method is applied to dis-
tribute the strips in these areas, ensuring that the strips
cover at least 15% of the total width or length of the
image, which is crucial to guarantee that the strips
contain sufficient visual context. This strategy is ben-
eficial for providing variability of context or informa-
tion to the model, allowing the model to relate the
instance from a smaller context. The generated strips
x_n are originally fractions within the range [0, 1]. To adapt them to the image space of size H × W (where H and W are the dimensions of the image), these points are scaled to the appropriate range, as shown in Eq. 8.
\[
p_n = x_n \times \begin{pmatrix} W - w_m \\ H - h_m \end{pmatrix} \tag{8}
\]
where w_m and h_m are the dimensions of the proposed mask. The scaled values p_n = (p_x, p_y) are rounded to obtain the pixel indices (x_start, y_start) where the mask will begin in the image. For each generated strip p_n, a binary strip mask M_n of size w_m × h_m is defined in an image of size H × W. Eq. 9 shows the cases for generating the strip.
\[
M_n(i, j) =
\begin{cases}
1 & \text{if } x_{\mathrm{start}} \le i < x_{\mathrm{start}} + w_m \text{ and } y_{\mathrm{start}} \le j < y_{\mathrm{start}} + h_m \\
0 & \text{otherwise}
\end{cases} \tag{9}
\]
where (i, j) are the pixel indices in the image. The masks M_n are used to propose specific regions of the image that are then processed to extract a series of features. F is an image feature map obtained from an intermediate layer of the network, so by applying the mask M_n to F, a proposed region feature is generated. Eq. 10 describes how we use the proposed region to define our proposed feature map.
\[
F^{\mathrm{prop}}_{n} = F \odot M_n \tag{10}
\]
where \odot denotes the element-wise product. This proposed feature F_n^prop can be used to train the model by combining it with the global feature of the image, allowing the model to focus on both local and global features.
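A minimal sketch of the QMCPR strip generation (Eqs. 7-10) is shown below. The hand-written radical_inverse helper, the choice of bases 2 and 3, and fixing the strip size to the 15% minimum coverage mentioned above are assumptions made for illustration.

```python
import numpy as np

def radical_inverse(n: int, base: int) -> float:
    """Radical inverse phi_b(n) used by the Halton sequence (Eq. 7)."""
    inv, f = 0.0, 1.0 / base
    while n > 0:
        inv += f * (n % base)
        n //= base
        f /= base
    return inv

def qmcpr_strip_masks(H: int, W: int, num_strips: int = 8,
                      min_frac: float = 0.15) -> np.ndarray:
    """Illustrative QMCPR sketch: Halton-distributed binary strip masks (Eqs. 7-9)."""
    h_m, w_m = max(1, int(min_frac * H)), max(1, int(min_frac * W))
    masks = np.zeros((num_strips, H, W), dtype=np.float32)
    for n in range(1, num_strips + 1):
        # Halton point in the unit square with bases 2 and 3 (Eq. 7)
        x_n = np.array([radical_inverse(n, 2), radical_inverse(n, 3)])
        # Eq. 8: scale to pixel coordinates so the strip fits inside the image
        x_start = int(round(x_n[0] * (W - w_m)))
        y_start = int(round(x_n[1] * (H - h_m)))
        # Eq. 9: binary strip mask
        masks[n - 1, y_start:y_start + h_m, x_start:x_start + w_m] = 1.0
    return masks

# Eq. 10: element-wise product with an intermediate feature map F of shape [C, H, W]
# f_prop_n = F * qmcpr_strip_masks(H, W)[n][None, :, :]
```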
2.3 Feature Maps
The proposed model will address three branches of
the extraction and fusion of attributes: a base feature
map, a global feature map, and a local feature map.
The base map will describe the most superficial at-
tributes found in the initial layers of the network, ob-
tained through convolutions at different levels. In this
stage, our model aims to capture patterns that may be
associated with edges, textures, and colors (low di-
mensionality). Eq. 11 shows what this branch repre-
sents.
\[
f_{\mathrm{base}} = \sigma(W * X + b) \tag{11}
\]
where W is the convolution kernel, * represents the convolution operation, X is the input (image or previous features), b is the bias, and σ is an activation function (such as ReLU).
Concerning the global attributes, these are ob-
tained by aggregating all the spatial information of the
image into a single representative vector using Global
Average Pooling (GAP), just as shown in Eq. 12.
\[
f_{\mathrm{global}} = \frac{1}{H \times W} \sum_{i=1}^{H} \sum_{j=1}^{W} f_{ij} \tag{12}
\]
where f_{ij} is the feature vector at the spatial position (i, j), and H × W is the size of the spatial feature map.
In our small dataset context, f_global will contain representations that help prevent overfitting by not relying on specific details and providing a robust representation that tolerates minor variations in the image.
f_proposal will represent the local attributes extracted from the regions or stripes. In our case, M_regions will be the proposed stripes generated by the methods described above, which are used as delimiters to extract more localized attributes, f_proposal in Eq. 13.
\[
f_{\mathrm{proposal}} = \sum_{m=1}^{M} \alpha_m \cdot f_{(\mathrm{Input}, m)} \tag{13}
\]
where \alpha_m are the selection weights and f_{(\mathrm{Input}, m)} are the features in region m. f_proposal gathers the representative regions obtained through the outputs of the DRPC or QMCPR modules.
Subsequently, the reduced features are obtained
through a reduction of dimensionality for the feature
map of the proposed regions, performing an aggrega-
tion along with the activation, see Eq. 14.
\[
f_{\mathrm{reduce}} = \sigma\!\left( \frac{1}{F} \sum_{f=1}^{F} f_{\mathrm{proposal}}(b, f) \right) \tag{14}
\]
In this case, we calculate the mean of the values
across the feature dimension for each batch b.
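The three branches and the reduction step can be summarized in a PyTorch module as sketched below; the layer sizes, the learnable alpha weights, and the sigmoid used as the reduction activation are assumptions for illustration, not the released architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F_nn

class ThreeBranchHead(nn.Module):
    """Illustrative sketch of the base, global, and proposal branches (Eqs. 11-14)."""

    def __init__(self, in_channels: int = 2048, num_regions: int = 8):
        super().__init__()
        self.base_conv = nn.Conv2d(in_channels, in_channels, kernel_size=3, padding=1)
        # alpha_m: selection weights for the proposed regions (Eq. 13)
        self.alpha = nn.Parameter(torch.ones(num_regions) / num_regions)

    def forward(self, feat: torch.Tensor, region_masks: torch.Tensor):
        # feat: [B, C, H, W]; region_masks: [B, N, H, W] from DRPC or QMCPR
        # Eq. 11: base branch, convolution + activation
        f_base = F_nn.relu(self.base_conv(feat))

        # Eq. 12: global branch, Global Average Pooling -> [B, C]
        f_global = feat.mean(dim=(2, 3))

        # Eq. 13: proposal branch, weighted aggregation of region-masked features
        masked = feat.unsqueeze(1) * region_masks.unsqueeze(2)            # [B, N, C, H, W]
        region_feats = masked.mean(dim=(3, 4))                            # [B, N, C]
        f_proposal = (self.alpha.view(1, -1, 1) * region_feats).sum(dim=1)  # [B, C]

        # Eq. 14: reduction, mean over the feature dimension + activation
        f_reduce = torch.sigmoid(f_proposal.mean(dim=1, keepdim=True))    # [B, 1]

        return f_base, f_global, f_proposal, f_reduce
```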
2.4 Setup
Regarding our loss function, we consider a fusion of losses built around the triplet loss: a cross-entropy loss for the class, followed by a triplet loss for the base attributes, another for the proposed regions, and another for the global attributes. Figure 1 shows the meaning of each of these losses.
The total loss function L_total is given by Eq. 15, where the triplet loss L_triplet is defined in Eq. 16.
\[
\begin{aligned}
L_{\mathrm{total}} = {} & \lambda_{\mathrm{id}} \cdot L_{\mathrm{CE}}(\mathrm{cls}_i, y) \\
& + \lambda_{\mathrm{triplet}} \cdot L_{\mathrm{triplet}}(\mathrm{global}_i, y) \\
& + pr_1 \cdot L_{\mathrm{triplet}}(\mathrm{prop}_i, y) \\
& + pr_2 \cdot L_{\mathrm{triplet}}(\mathrm{base}_i, y)
\end{aligned} \tag{15}
\]
\[
L_{\mathrm{triplet}} = \max(0, \mathrm{dist}_{ap} - \mathrm{dist}_{an} + \mathrm{margin}) \tag{16}
\]
Where: λ_id is the weight assigned to the identification loss; λ_triplet is the weight assigned to the triplet loss; pr_1 is the weight assigned to the proposal loss; pr_2 is the weight assigned to the base loss; L_CE is the cross-entropy loss function, which measures the discrepancy between the class prediction and the target label; L_triplet is the triplet loss function, which measures the relative distance between an anchor, a positive, and a negative in the feature space; N is the number of layers or extracted features to which the loss is applied; cls_i refers to the classification outputs of each layer; global_i refers to the global features extracted from each layer; prop_i refers to the proposed features extracted from each layer; base_i refers to the base features extracted from each layer; y is the target label for classification; dist_ap is the anchor-positive distance; and dist_an is the anchor-negative distance.
Figure 1: ReID training pipeline.
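A minimal PyTorch sketch of Eqs. 15-16 follows; the default loss weights and the hypothetical triplet-mining helper passed as an argument are assumptions for illustration.

```python
import torch
import torch.nn.functional as F_nn

def triplet_loss(anchor, positive, negative, margin: float = 0.3):
    """Eq. 16: max(0, dist_ap - dist_an + margin), averaged over the batch."""
    dist_ap = F_nn.pairwise_distance(anchor, positive)
    dist_an = F_nn.pairwise_distance(anchor, negative)
    return F_nn.relu(dist_ap - dist_an + margin).mean()

def total_loss(cls_logits, f_global, f_prop, f_base, targets, mine_triplets,
               lambda_id: float = 1.0, lambda_triplet: float = 1.0,
               pr1: float = 1.0, pr2: float = 1.0):
    """Illustrative sketch of Eq. 15.

    mine_triplets is a hypothetical helper that returns (anchor, positive,
    negative) index tensors mined from the batch labels.
    """
    a, p, n = mine_triplets(targets)
    loss = lambda_id * F_nn.cross_entropy(cls_logits, targets)
    loss += lambda_triplet * triplet_loss(f_global[a], f_global[p], f_global[n])
    loss += pr1 * triplet_loss(f_prop[a], f_prop[p], f_prop[n])
    loss += pr2 * triplet_loss(f_base[a], f_base[p], f_base[n])
    return loss
```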
For our approach, we used the training configuration described in Table 1.
Table 1: Training Configuration.
Configuration  | Value
Input Size     | 3x224x224
Epochs         | 200
Early Stop     | CMC rank-1 tolerance (10 epochs)
Augmentation   | Flip, Rotation, Scaling, Grayscale
Learning Rate  | 0.001
Optimizer      | Adam
LR Schedule    | Reduce LR on plateau
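The configuration in Table 1 roughly maps to the following PyTorch setup; the augmentation parameter ranges, the scheduler arguments, and the placeholder model are assumptions made to keep the sketch self-contained.

```python
import torch
from torchvision import transforms

# Data augmentation listed in Table 1 (parameter ranges are assumptions)
train_transform = transforms.Compose([
    transforms.RandomHorizontalFlip(),
    transforms.RandomRotation(degrees=10),
    transforms.RandomResizedCrop(224, scale=(0.8, 1.0)),  # scaling to 224x224
    transforms.RandomGrayscale(p=0.1),
    transforms.ToTensor(),
])

model = torch.nn.Conv2d(3, 64, 3)  # placeholder for the actual Re-ID network
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode='max')
# Early stopping: monitor validation CMC rank-1 with a 10-epoch tolerance,
# calling scheduler.step(cmc_rank1) after each validation epoch.
```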
3 EXPERIMENTS AND RESULTS
3.1 Datasets and Metrics
We use the VRIC dataset (Kanaci et al., 2018) in
both small and large variants. The small version has
5,854 training samples (220 IDs, 10-30 samples/ID)
and 2,811 gallery samples (validation), maintaining
the original test set for consistent comparative anal-
ysis. The large version retains all 60,430 samples
(5,622 IDs). Additionally, we use the VeRi dataset
(Liu et al., 2016) to evaluate the model’s adaptabil-
ity to domain shifts. Our validation protocol consid-
ered cross-dataset evaluation of metrics to assess per-
formance in two different scenarios: one with richer
contextual information and the other with higher res-
olution.
The small dataset is designed to evaluate model
performance with fewer, more variable inputs, test-
ing generalization on unseen validation IDs. Data
augmentation includes flips, grayscale conversions,
area-specific and full rotations, and GAN-based view
changes.
Performance is measured using mean Average
Precision (mAP) and Cumulative Matching Char-
acteristic (CMC) metrics, commonly used in re-
identification tasks. mAP assesses retrieval effective-
ness (Eq. 17), while CMC evaluates ranking accuracy
(Eq. 18) for top k = 1, 5, 10.
\[
\mathrm{mAP} = \frac{1}{N} \sum_{i=1}^{N} \frac{1}{m_i} \sum_{k=1}^{m_i} P(k) \tag{17}
\]
\[
\mathrm{CMC}(r) = \frac{1}{N} \sum_{i=1}^{N} \delta(\mathrm{rank}_i \le r) \tag{18}
\]
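A minimal sketch of how these two metrics can be computed from a query-gallery distance matrix is given below; camera-id filtering, used by some evaluation protocols, is deliberately omitted here.

```python
import numpy as np

def evaluate_cmc_map(dist: np.ndarray, q_ids: np.ndarray, g_ids: np.ndarray,
                     ranks=(1, 5, 10)):
    """mAP (Eq. 17) and CMC at selected ranks (Eq. 18) from a [num_query, num_gallery]
    distance matrix and query/gallery identity labels."""
    num_q = dist.shape[0]
    cmc_hits = np.zeros(max(ranks))
    aps = []
    for i in range(num_q):
        order = np.argsort(dist[i])           # gallery sorted by distance
        matches = (g_ids[order] == q_ids[i])  # boolean relevance vector
        if not matches.any():
            continue
        # CMC: count queries whose first correct match appears within rank r
        first_hit = np.where(matches)[0][0]
        cmc_hits[first_hit:] += 1
        # AP: mean precision at each correct-match position
        hit_positions = np.where(matches)[0]
        precisions = np.arange(1, len(hit_positions) + 1) / (hit_positions + 1)
        aps.append(precisions.mean())
    cmc = cmc_hits / num_q
    mean_ap = float(np.mean(aps)) if aps else 0.0
    return mean_ap, {r: float(cmc[r - 1]) for r in ranks}
```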
3.2 Analysis of Attribute Extraction
Method
The present work evaluated two methods for region
generation. The first, DRPC, is based on generating
dilated regions using contours from the first layer of
the ResNet-50 backbone, which involves a computa-
tional cost of O(batch) × processing_1 + O(regions).
In the first stage, we iterate over the entire batch and
return the first layer in the forward pass, performing a
pass through the model and returning the first convo-
lution with binarization. The second stage encloses
each binarized area based on a dilation. The sec-
ond method, QMCPR, has a sublinear computational
cost and generates masks with an approximate com-
putational cost of O(batch × regions), approximately
O(8n), where "regions" is a parameter fixed at 8, and
"batch" is a variable input parameter. Both methods
did not drastically modify the computational cost dur-
ing training execution.
On the other hand, these two algorithms provide
a base approach using masks that can be generalized
to any ReID dataset, reducing the need for manual
annotations or training models for part segmentation.
The previously mentioned methods will enable the lo-
cal extraction of features. Since these do not rely on
pre-annotated masks, they will evaluate more repre-
sentative regions using the ResNet backbone’s upper
layers or by selecting regions based on distributions
within a unit field representing the area of interest.
Subsequently, using extraction blocks, the features
within these regions will be extracted and reduced
to more relevant values using activations suggested
in this work. The proposed use of regions enhances
feature diversity by leveraging incomplete partition-
ing, which fuses and refines attributes from local,
global, and base extractions. This approach enables
the model to learn more discriminative contextual and
instance-specific information in small datasets, mak-
ing it a more robust solution.
Table 2 provides an initial notion of what these
methods imply in small datasets. Here, we compare
the model with mask guidance, the GRFR, and the
DRPC model against the baseline PGAN model with-
out attention. We observe that the proposed methods
are quite close to the training done entirely with the
masks generated for the VRIC dataset, provided in
(Zhang et al., 2020), showing a correlation with the
extraction of local attributes, which is a positive in-
dicator that our method successfully proposes highly
discriminative regions.
Table 2: Results of Different Models on Small-dataset Re-
Identification Tasks.
Model mAP CMC Top 1 CMC Top 5
PGAN 45.8 33.7 60.1
model mask guided 53.9 43.2 66.2
model GRFR (ours) 53.5 43.1 67.3
model DRPC (ours) 56.0 44.5 68.3
model DRPC+AUG view (ours) 63.3 53.3 74.9
In the experimental stage, we observed that applying attention to small datasets tends to degrade the gradient, resulting in less model generalization. Com-
paring PGAN with and without attention (PGAN vs.
model mask guided), we see that in small datasets,
more globalized learning tends to bring better perfor-
mance in terms of the CMC Top K=5 metric (approx-
imately 6% improvement). In small datasets, the lim-
ited diversity and number of examples lead the atten-
tion mechanism to overfit specific, less generalizable
local features rather than capturing broader patterns.
This overfitting causes the model to become sensitive
to noise or small variations in the training data, which
in turn degrades the gradient during optimization and
hinders the model’s ability to generalize effectively
to unseen samples. Regarding other masking meth-
ods, our trend was quite in line with PGAN without
attention, in some cases having a better CMC K=5
(approximately 1.1% improvement).
3.3 Performance of Our Model
Our model is designed to handle small datasets with-
out region annotations, using a model masking ap-
proach based on the forward pass of the same model.
As depicted in Figure 1, we emphasize more repre-
sentative areas of the feature maps through a reduc-
tion facilitated by an activation function. This ap-
proach reduced training time by several seconds due
to its lower complexity, eliminating the need for at-
tention mechanisms and specialized refinement mod-
ules. Specifically, the training times were 66 seconds
per batch (size of 8 samples) for the base model, com-
pared to 62 seconds with the DRPC method and 60
seconds with the QMCPR method. These tests were
performed using an NVIDIA GeForce RTX 3060, a
13th Gen Intel® Core™ i5-13600KF (20 cores), and
32GB of RAM.
Table 3: Experimental results of Different Models on Large Dataset Re-Identification Tasks, VRIC → VRIC.
Model mAP CMC Top 1 CMC Top 5
PGAN 84.5 77.4 93.1
model DRPC (ours) 82.0 74.9 89.8
As shown in Table 3, our model lags behind
attention-based models on large datasets. This is be-
cause attention-based models typically involve deeper
feature extraction layers. In contrast, models based on
Squeeze-and-Excitation layers and attention mecha-
nisms achieved up to a 3.3% improvement.
However, as illustrated in Table 4, our model
demonstrated balanced performance across various
scenarios due to our generalized approach in extract-
ing local discriminative features and global attributes.
Our model had fewer parameters, reducing inference
times: 0.0302 seconds for the base model, 0.0095 sec-
onds for QMCPR, and 0.0216 seconds for DRPC.
Table 4 details the performance of our model in
mixed scenarios. The first scenario involves train-
ing and evaluating the VeRi model on the VeRi test
set. This setup shows that VeRi, a higher-quality
dataset, allows more straightforward methods that im-
prove variability by expanding the visual context to
Figure 2: Qualitative inference of our model, VRIC to VeRi vs. VRIC to VRIC vs. large VRIC to VeRi.
Table 4: Experimental results of Different Models on Large dataset Re-Identification Tasks, VRIC → VeRi vs. VeRi → VRIC.
Model CMC Top 1 CMC Top 5 CMC Top 10 Train Test
PGAN 95.2 97.5 98.7 VeRi VeRi
PGAN 14.3 26.1 33.0 VeRi VRIC
PGAN 44.9 56.9 66.2 VRIC large VeRi
PGAN 37.5 43.2 50.8 VRIC small VeRi
DRPC (ours) 94.8 97.2 98.4 VeRi VeRi
DRPC (ours) 39.6 53.3 60.9 VRIC small VeRi
DRPC (ours) 56.8 62.1 68.0 VRIC large VeRi
DRPC (ours) 21.8 38.2 46.4 VeRi VRIC
perform well in similar contexts. As observed, PGAN
and the proposed method are nearly identical regard-
ing the CMC curve for VeRi vs. VeRi, with only a 0.5
percentage point difference at K=5.
Regarding susceptibility to input variability, our
model’s ability to learn contextual regions allowed it
to achieve superior performance in the Cumulative
Matching Characteristic (CMC) metric with K = 5
when transitioning from the VeRi dataset to the VRIC
dataset (VeRi vs VRIC). Specifically, the model im-
proved up to 12.1 percentage points, indicating en-
hanced generalization capabilities. In contrast, the
model trained with a smaller dataset and data aug-
mentation fell 5.3 percentage points short compared
to the baseline model trained on the full dataset.
3.4 Discussion of Results
Our model focused on improving generalization and
performance on small datasets. This led to evaluating
the performance of our model in terms of changes in
the number of samples and context variations. Figure
2 shows how the matching works for the k=5 smallest
distances concerning our query, evaluated on a gallery
or query set.
Figure 2 shows some critical and common cases
regarding re-identification for VRIC and VeRi. Re-
garding the VRIC dataset, we found quite a few
matches for the top 5, considering the dimensionality
of the samples used in the training, which is quite pos-
itive. The shift in context to VeRi demonstrated that
part of the knowledge extracted from VRIC helped to
re-identify a wide variety of instances in this dataset.
The scores in the context shift were lower compared
to the base model trained with attention and the entire
VRIC set due to the inherent limitations of our small
dataset’s dimensionality. Regarding our model’s ef-
fectiveness in generalizing learning in a large dataset,
we also observed that our model not only reduced
the necessary training time but also generalized the
learning optimally, achieving a balance between per-
formance and susceptibility to variability.
4 COPYRIGHT FORM
This work is licensed under the Creative Com-
mons Attribution-NonCommercial-NoDerivatives 4.0
International License. To view a copy of this
license, visit http://creativecommons.org/licenses/
by-nc-nd/4.0/. This license permits non-commercial
use, distribution, and reproduction in any medium,
provided the original work is properly cited and is not
modified or adapted in any way.
5 CONCLUSIONS
Our approach demonstrated how to tackle a mask-
based method by creating regions through the same
activations proposed by the backbone, generating
partitions with greater context, and establishing a
baseline for scenarios with high-resolution instances
where emphasis is placed on the fusion of global and
local attributes. Along the same lines, we empha-
sized creating more discriminative feature maps by
reducing the feature maps of the proposed regions, leading to
more representative features per proposed region.
This made our approach more robust in domain
shift and small datasets. Providing greater context and
more localized information resulted in our model out-
performing our baseline by 8% in the small dataset
case, comparing our baseline PGAN result with our
DRPC-based model in the context of VRIC small. Under
domain shift, our model (DRPC) exceeded our base-
line (PGAN) by up to 11% in CMC k=1, showing the pos-
itive effect of proposing context-rich region parts when
using VRIC for training and VeRi for inference. Addi-
tionally, our approach aimed to verify the limitations
involved in small sample volumes and observed how
implementing classical techniques oriented towards
morphological transformations and using GANs for
simple changes like color and texture can help address
issues where our study instance is highly costly to an-
notate.
ACKNOWLEDGEMENTS
This study was financed in part by the Coordination of
Superior Level Staff Improvement - Brasil (CAPES)
- Finance Code 001.
REFERENCES
Chen, H., Zhao, Y., and Wang, S. (2023). Person re-
identification based on contour information embed-
ding. Sensors, 23(2):774.
Chen, X., Xu, H., Li, Y., and Bian, M. (2021). Person re-
identification by low-dimensional features and metric
learning. Future Internet, 13(11):289.
Fan, X., Luo, H., Zhang, X., He, L., Zhang, C., and
Jiang, W. (2019). Scpnet: Spatial-channel paral-
lelism network for joint holistic and partial person re-
identification. In Computer Vision–ACCV 2018: 14th
Asian Conference on Computer Vision, Perth, Aus-
tralia, December 2–6, 2018, Revised Selected Papers,
Part II 14, pages 19–34. Springer.
Fawad, Khan, M. J., and Rahman, M. (2020). Person re-
identification by discriminative local features of over-
lapping stripes. Symmetry, 12(4):647.
Gong, Y., Zeng, Z., Chen, L., Luo, Y., Weng, B., and Ye, F.
(2021). A person re-identification data augmentation
method with adversarial defense effect. arXiv preprint
arXiv:2101.08783.
He, X., Luo, Z., Li, Q., Chen, H., and Li, F. (2023). Dg-
gan: A high quality defect image generation method
for defect detection. Sensors, 23(13):5922.
Jiang, Y., Chen, W., Sun, X., Shi, X., Wang, F., and Li, H.
(2021). Exploring the quality of gan generated images
for person re-identification. In Proceedings of the 29th
ACM International Conference on Multimedia, pages
4146–4155.
Kalayeh, M. M., Basaran, E., Gökmen, M., Kamasak,
M. E., and Shah, M. (2018). Human semantic pars-
ing for person re-identification. In Proceedings of
the IEEE conference on computer vision and pattern
recognition, pages 1062–1071.
Kanaci, A., Zhu, X., and Gong, S. (2018). Vehicle re-
identification in context. In Pattern Recognition - 40th
German Conference, GCPR 2018, Stuttgart, Ger-
many, September 10-12, 2018, Proceedings.
Karras, T., Aittala, M., Laine, S., Härkönen, E., Hellsten, J.,
Lehtinen, J., and Aila, T. (2021). Alias-free generative
adversarial networks. Advances in neural information
processing systems, 34:852–863.
Karras, T., Laine, S., Aittala, M., Hellsten, J., Lehtinen, J.,
and Aila, T. (2020). Analyzing and improving the im-
age quality of StyleGAN. In Proc. CVPR.
Liu, X., Liu, W., Mei, T., and Ma, H. (2016). A deep
learning-based approach to progressive vehicle re-
identification for urban surveillance. In Computer
Vision–ECCV 2016: 14th European Conference, Am-
sterdam, The Netherlands, October 11-14, 2016, Pro-
ceedings, Part II 14, pages 869–884. Springer.
Luo, H., Jiang, W., Gu, Y., Liu, F., Liao, X., Lai, S., and
Gu, J. (2019a). A strong baseline and batch normal-
ization neck for deep person re-identification. IEEE
Transactions on Multimedia, 22(10):2597–2609.
Luo, H., Jiang, W., Zhang, X., Fan, X., Qian, J., and Zhang,
C. (2019b). Alignedreid++: Dynamically matching
local information for person re-identification. Pattern
Recognition, 94:53–61.
Lv, G., Ding, Y., Chen, X., and Zheng, Y. (2024).
Mp2pmatch: A mask-guided part-to-part matching
network based on transformer for occluded person re-
identification. Journal of Visual Communication and
Image Representation, 100:104128.
Si, J., Zhang, H., Li, C.-G., Kuen, J., Kong, X., Kot, A. C.,
and Wang, G. (2018). Dual attention matching net-
work for context-aware feature sequence based per-
son re-identification. In Proceedings of the IEEE con-
ference on computer vision and pattern recognition,
pages 5363–5372.
Song, C., Huang, Y., Ouyang, W., and Wang, L. (2018).
Mask-guided contrastive attention model for person
re-identification. In Proceedings of the IEEE con-
ference on computer vision and pattern recognition,
pages 1179–1188.
Wang, G., Yuan, Y., Chen, X., Li, J., and Zhou, X. (2018).
Learning discriminative features with multiple granu-
larities for person re-identification. In Proceedings of
the 26th ACM international conference on Multime-
dia, pages 274–282.
Zhang, X., Zhang, R., Cao, J., Gong, D., You, M., and Shen,
C. (2020). Part-guided attention learning for vehicle
instance retrieval. IEEE Transactions on Intelligent
Transportation Systems, 23(4):3048–3060.
Zheng, Z., Yang, X., Yu, Z., Zheng, L., Yang, Y., and Kautz,
J. (2019). Joint discriminative and generative learn-
ing for person re-identification. In proceedings of the
IEEE/CVF conference on computer vision and pattern
recognition, pages 2138–2147.