Table 2: Performance evaluation of softmax-trained DenseNet-121 models under different validation setups.

Method      Implementation      Small Subset    Medium Subset   Large Subset
                                Top-1  Top-5    Top-1  Top-5    Top-1  Top-5
L-O-O       (Zhu et al., 2019)  66.10  77.87    67.39  75.49    63.07  72.57
L-N-IDs-O   Ours                72.26  89.25    68.57  84.09    65.11  80.44
5.2 Triplet-Learnt Coarse-to-Fine Reranking Scheme
5.2.1 Coarse Triplet Learning
While the softmax loss preserves inter-class distances to retain effective decision boundaries, the triplet loss learns inter-class differences and intra-class similarities simultaneously. Accordingly, triplet-learnt vehicle embeddings are more representative for re-id, as intra-class variance is further minimized. The proposed coarse network is a DenseNet-121 trained with the triplet loss. To address the limitations of triplet optimization, we construct the triplets within the batch. Specifically, we resort to the batch-all (BA) batch-sampling strategy (Hermans et al., 2017), as it hardly depends on the choice of the margin α (see Eq. 3).
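The BA strategy can be sketched as follows. This is an illustrative plain-Python rendition, not the paper's implementation: the function names and the brute-force loops are our own, and a practical version would use batched tensor operations.

```python
def euclidean(u, v):
    """Euclidean distance between two embedding vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v)) ** 0.5

def batch_all_triplet_loss(embeddings, labels, margin=0.3):
    """Batch-all triplet loss: average the hinge loss
    max(0, d(a, p) - d(a, n) + margin) over every valid
    (anchor, positive, negative) triplet in the batch that
    violates the margin, i.e. yields a non-zero loss."""
    losses = []
    n = len(embeddings)
    for a in range(n):
        for p in range(n):
            if p == a or labels[p] != labels[a]:
                continue  # positive must share the anchor's identity
            for neg in range(n):
                if labels[neg] == labels[a]:
                    continue  # negative must be a different identity
                l = (euclidean(embeddings[a], embeddings[p])
                     - euclidean(embeddings[a], embeddings[neg]) + margin)
                if l > 0:  # retain only triplets violating the constraint
                    losses.append(l)
    return sum(losses) / len(losses) if losses else 0.0
```

With well-separated identities no triplet violates the margin and the loss is zero, which is why averaging over violating triplets only keeps the gradient signal informative.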
The experimental setup for triplet optimization is as follows. For each update, we sample P identities comprising K images each, composing a batch of PK images. Out of these, we construct all possible PK(PK-K)(K-1) triplets and retain the ones that violate the triplet constraint, that is, the ones that yield non-zero loss values. The batch size is set to 72 (P=12, K=4) and the margin α is set to either 0.3 or 0.7. Following common practice in the literature (Hermans et al., 2017; Kumar et al., 2019), Adam is used as the optimizer, augmented with an L2 regularization term fixed at 0.001. The learning rate is set to 0.0002 and is multiplied by 0.85 every 30 epochs, for 300 epochs in total. One epoch is one pass over all the identities (but not all images), and both our validation and data augmentation frameworks are identical to those of our classification approach described in subsection 5.1.
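As a sanity check on the PK(PK-K)(K-1) count, the candidate triplets within a batch can be enumerated directly. The snippet below is illustrative (the helper is ours, not the paper's code): each of the PK images serves as an anchor, each anchor has K-1 positives and PK-K negatives.

```python
def count_candidate_triplets(P, K):
    """Enumerate all (anchor, positive, negative) candidates in a
    batch of P identities with K images each; should equal
    PK * (PK - K) * (K - 1)."""
    labels = [i for i in range(P) for _ in range(K)]  # PK labels
    n = len(labels)
    count = 0
    for a in range(n):                      # PK anchor choices
        for p in range(n):
            if p != a and labels[p] == labels[a]:   # K-1 positives
                for neg in range(n):
                    if labels[neg] != labels[a]:    # PK-K negatives
                        count += 1
    return count

# For the paper's batch (P=12, K=4): 48 * 44 * 3 = 6336 candidates.
```

Only the subset of these 6336 candidates with non-zero loss contributes to each BA update.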
Results are shown in Table 3. Top-1 and Top-5 accuracy for DenseNet-121 initialized on ImageNet and trained with a 0.3 margin is 76.68% and 94.70% respectively on the small subset, a minor improvement over the 0.7 margin. Deviations on the larger subsets are of similar magnitude.
Compared to softmax, the triplet-trained coarse DenseNet-121 yields higher accuracy. Metric learning of the parameters f propagates more meaningful gradients for the re-identification task. Accordingly, triplet-learnt vehicle embeddings are robust, that is, they accommodate viewpoint-caused variability and are thus semantically more representative.
5.2.2 Viewpoint Classification
Conditioned on our coarse network, we show that viewpoint classification can be performed reliably, with only minor mistakes. To this end, we design a feed-forward fully connected network to classify vehicle images into two classes, namely, front and back-side views.
We annotate viewpoints on 5600 randomly selected vehicle images. The train, validation and test sets consist of 3250, 950 and 1400 images respectively. A fully connected network is attached to our coarse triplet network. The learning rate is set to 0 for the latter and 0.01 for the former; accordingly, only the newly-introduced fully connected weights are learnt. For all experiments, the learning rate is multiplied by 0.9 every 30 epochs, for 150 epochs in total. Four experiments are performed, connecting the triplet embedding layer (1024) to a penultimate layer of 32, 128, 256 or 512 nodes followed by a ReLU, and a two-node output layer followed by a sigmoid activation. A last sub-experiment is performed, connecting the embedding directly to the output layer. The minimized optimization objective is the softmax (cross-entropy) loss.
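The best-performing head can be sketched shape-wise as follows. This is a plain-Python illustration rather than a deep-learning-framework implementation; the helper names are ours, the weights are placeholders, and we fold the two-node output and its classification objective into a single softmax for simplicity.

```python
import math

def relu(x):
    return [max(0.0, v) for v in x]

def linear(x, W, b):
    """Affine layer: W is a list of rows, one per output node."""
    return [sum(wi * xi for wi, xi in zip(row, x)) + bi
            for row, bi in zip(W, b)]

def softmax(x):
    m = max(x)  # subtract the max for numerical stability
    e = [math.exp(v - m) for v in x]
    s = sum(e)
    return [v / s for v in e]

def viewpoint_head(embedding, W1, b1, W2, b2):
    """1024-d triplet embedding -> 256 hidden nodes (ReLU)
    -> 2 output nodes -> class probabilities (front, back-side)."""
    hidden = relu(linear(embedding, W1, b1))
    return softmax(linear(hidden, W2, b2))
```

Since the coarse network's learning rate is zero, only W1, b1, W2 and b2 would receive gradient updates during training.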
The results for our viewpoint classification model are reported in Table 4. A network with 256 hidden nodes is the best-performing fully connected component, with an accuracy of 98.34% on the test set. Architectures with fewer connections yield inferior results. Therefore, viewpoint classification can be performed accurately, and error propagation into the C2F-TriRe scheme is minimized.
5.2.3 Windshield Detection
A Faster R-CNN (Ren et al., 2015) object detector is trained to extract windshields from images of vehicles. For windshield detection, two classes are of interest: windshield and background. Faster R-CNN bears a ResNet-50 backbone network and is pre-trained on the COCO dataset (Lin et al., 2014), which covers 81 classes (80 object classes plus background). The detector is then fine-tuned to learn the two classes.
The experimental setup is the following. The 81-node classification layer is substituted by a two-node one. The regression layer is retained as such.
ICPRAM 2020 - 9th International Conference on Pattern Recognition Applications and Methods