Table 2: Performance evaluation of softmax-trained DenseNet-121 models under different validation setups.

Method      Implementation      Small Subset    Medium Subset   Large Subset
                                Top-1  Top-5    Top-1  Top-5    Top-1  Top-5
L-O-O       (Zhu et al., 2019)  66.10  77.87    67.39  75.49    63.07  72.57
L-N-IDs-O   Ours                72.26  89.25    68.57  84.09    65.11  80.44
5.2 Triplet-Learnt Coarse-to-Fine Reranking Scheme
5.2.1 Coarse Triplet Learning
While the softmax loss preserves inter-class distances to retain effective decision boundaries, the triplet loss learns inter-class differences and intra-class similarities simultaneously. Accordingly, triplet-learnt vehicle embeddings are more representative for re-id, as intra-class variance is further minimized. The proposed coarse network is a DenseNet-121 trained with the triplet loss. To address the limitations of triplet optimization, we construct the triplets within the batch. Specifically, we resort to the batch-all (BA) batch-sampling strategy (Hermans et al., 2017), as it hardly depends on the choice of the margin α (see Eq. 3).
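The BA strategy can be sketched as follows. This is an illustrative plain-Python rendition, not the paper's implementation: the function names and the brute-force loops are our own, and a practical version would use batched tensor operations.

```python
def euclidean(u, v):
    """Euclidean distance between two embedding vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v)) ** 0.5

def batch_all_triplet_loss(embeddings, labels, margin=0.3):
    """Batch-all triplet loss: average the hinge loss
    max(0, d(a, p) - d(a, n) + margin) over every valid
    (anchor, positive, negative) triplet in the batch that
    violates the margin, i.e. yields a non-zero loss."""
    losses = []
    n = len(embeddings)
    for a in range(n):
        for p in range(n):
            if p == a or labels[p] != labels[a]:
                continue  # positive must share the anchor's identity
            for neg in range(n):
                if labels[neg] == labels[a]:
                    continue  # negative must be a different identity
                l = (euclidean(embeddings[a], embeddings[p])
                     - euclidean(embeddings[a], embeddings[neg]) + margin)
                if l > 0:  # retain only triplets violating the constraint
                    losses.append(l)
    return sum(losses) / len(losses) if losses else 0.0
```

With well-separated identities no triplet violates the margin and the loss is zero, which is why averaging over violating triplets only keeps the gradient signal informative.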
The experimental setup for triplet optimization is as follows. For each update, we sample P identities comprising K images each, composing a batch of PK images. Out of these, we construct all possible PK(PK-K)(K-1) triplets and retain the ones that violate the triplet constraint, that is, the ones that yield non-zero loss values. The batch size is set to 72 (P=12, K=4) and the margin α is set to either 0.3 or 0.7. Following common practice in the literature (Hermans et al., 2017; Kumar et al., 2019), Adam is used as the optimizer, augmented with an L2 regularization term fixed at 0.001. The learning rate is set to 0.0002 and is multiplied by 0.85 every 30 epochs, for 300 epochs in total. One epoch is one pass over all the identities (but not all images), and both our validation and data augmentation frameworks are identical to those of our classification approach described in subsection 5.1.
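As a sanity check on the PK(PK-K)(K-1) count, the candidate triplets within a batch can be enumerated directly. The snippet below is illustrative (the helper is ours, not the paper's code): each of the PK images serves as an anchor, each anchor has K-1 positives and PK-K negatives.

```python
def count_candidate_triplets(P, K):
    """Enumerate all (anchor, positive, negative) candidates in a
    batch of P identities with K images each; should equal
    PK * (PK - K) * (K - 1)."""
    labels = [i for i in range(P) for _ in range(K)]  # PK labels
    n = len(labels)
    count = 0
    for a in range(n):                      # PK anchor choices
        for p in range(n):
            if p != a and labels[p] == labels[a]:   # K-1 positives
                for neg in range(n):
                    if labels[neg] != labels[a]:    # PK-K negatives
                        count += 1
    return count

# For the paper's batch (P=12, K=4): 48 * 44 * 3 = 6336 candidates.
```

Only the subset of these 6336 candidates with non-zero loss contributes to each BA update.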
Results are shown in Table 3. Top-1 and Top-5 accuracy for DenseNet-121 initialized on ImageNet and trained with a 0.3 margin is 76.68% and 94.70% respectively on the small subset, a minor improvement over the 0.7 margin. Deviations on the larger subsets are of similar magnitude.
Compared to softmax, the triplet-trained coarse DenseNet-121 yields higher accuracy. Metric learning of the parameters f propagates more meaningful gradients for the re-identification task. Accordingly, triplet-learnt vehicle embeddings are robust, that is, they accommodate viewpoint-caused variability and are thus semantically more representative.
5.2.2 Viewpoint Classification
Conditioned on our coarse network, we show that viewpoint classification can be performed reliably, with only minor mistakes. To this end, we design a feed-forward fully connected network to classify vehicle images into two classes, namely, front and back-side views.
We annotate viewpoints on 5600 randomly selected vehicle images. The train, validation and test sets consist of 3250, 950 and 1400 images respectively. A fully connected network is attached to our coarse triplet network. The learning rate is set to 0 for the latter and 0.01 for the former; accordingly, only the newly-introduced fully connected weights are learnt. For all experiments, the learning rate is multiplied by 0.9 every 30 epochs, for 150 epochs in total. Four experiments are performed, connecting the triplet embedding layer (1024) to a penultimate layer of 32, 128, 256 or 512 nodes followed by a ReLU, and a two-node output layer followed by a sigmoid activation. A last sub-experiment is performed, connecting the embedding directly to the output layer. The minimized optimization objective is the softmax (cross-entropy) loss.
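The best-performing head can be sketched shape-wise as follows. This is a plain-Python illustration rather than a deep-learning-framework implementation; the helper names are ours, the weights are placeholders, and we fold the two-node output and its classification objective into a single softmax for simplicity.

```python
import math

def relu(x):
    return [max(0.0, v) for v in x]

def linear(x, W, b):
    """Affine layer: W is a list of rows, one per output node."""
    return [sum(wi * xi for wi, xi in zip(row, x)) + bi
            for row, bi in zip(W, b)]

def softmax(x):
    m = max(x)  # subtract the max for numerical stability
    e = [math.exp(v - m) for v in x]
    s = sum(e)
    return [v / s for v in e]

def viewpoint_head(embedding, W1, b1, W2, b2):
    """1024-d triplet embedding -> 256 hidden nodes (ReLU)
    -> 2 output nodes -> class probabilities (front, back-side)."""
    hidden = relu(linear(embedding, W1, b1))
    return softmax(linear(hidden, W2, b2))
```

Since the coarse network's learning rate is zero, only W1, b1, W2 and b2 would receive gradient updates during training.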
The results for our viewpoint classification model are reported in Table 4. A network with 256 hidden nodes is the best-performing fully connected component, with an accuracy of 98.34% on the test set. Architectures with fewer connections yield inferior results. Therefore, viewpoint classification can be performed accurately, and error propagation into the C2F-TriRe scheme is minimized.
5.2.3 Windshield Detection
A Faster R-CNN (Ren et al., 2015) object detector is trained to extract windshields from images of vehicles. For windshield detection, two classes are of interest: windshield and background. Faster R-CNN bears a ResNet-50 backbone network and is pre-trained on the COCO dataset (Lin et al., 2014), which covers 81 classes (80 object classes plus background). The detector is then fine-tuned to learn the two classes.
The experimental setup is the following. The 81-node classification layer is substituted by a two-node one. The regression layer is retained as such.
ICPRAM 2020 - 9th International Conference on Pattern Recognition Applications and Methods