Deep Part Features Learning by a Normalised Double-Margin-Based
Contrastive Loss Function for Person Re-Identification
María José Gómez-Silva, José María Armingol and Arturo de la Escalera
Intelligent Systems Lab (LSI) Research Group, Universidad Carlos III de Madrid, Leganés, Madrid, Spain
{magomezs, armingol, escalera}@ing.uc3m.es
Keywords: Re-identification, Deep Learning, Siamese Network, Contrastive Loss Function.
Abstract: The selection of discriminative features that properly define a person's appearance is one of the current
challenges for person re-identification. This paper presents a three-dimensional representation to compare
person images, based on the similarity independently measured between the head, upper body, and legs of
two images. Three deep Siamese neural networks have been implemented to automatically find salient
features for each body part. One of the main problems in the learning of features for re-identification is the
presence of intra-class variations and inter-class ambiguities. This paper proposes a novel normalized double-
margin-based contrastive loss function for the training of Siamese networks, which not only improves the
robustness of the learned features against the mentioned problems but also reduces the training time. A
comparative evaluation over the challenging PRID 2011 dataset has been conducted, showing a remarkable
enhancement of the single-shot re-identification performance when our descriptor based on deeply learned
features is used instead of low-level features. The obtained results also show the improvement generated by
our normalized double-margin-based function with respect to the traditional contrastive loss function.
1 INTRODUCTION
The person re-identification problem consists of
visually associating people across camera views at
different locations and times, that is, recognizing an
individual across different images. Automating
re-identification is an essential task for large-scale
distributed multi-camera surveillance systems, whose
rapid expansion generates a vast quantity of visual
data to manage.
As with most object recognition mechanisms, the
re-identification process presents two main steps:
first, the selection and computation of features to
obtain a person representation, and second, the
matching of two samples of the same person by
measuring the similarity between them. However,
visual appearance based matching becomes a
remarkable challenge in unconstrained scenarios,
where large changes occur in view angle,
illumination, background, occlusion, and resolution,
producing huge differences in a person’s appearance
among different camera views.
In order to face this problem, a large amount of
research has been focused on the design of novel
visual features able to represent the most discriminant
aspects of an individual’s appearance, which are
invariant to pose, scale and illumination. Some of the
most commonly used representations, like RGB or
HSV histograms (Bazzani, 2013), Gabor filters
(Zhang, 2011) and HOG-based signatures (Oreifej,
2010), are based on low-level local features, such as
color, texture, and shape, respectively. With the aim
of integrating several types of features of
complementary nature, like the ones used in
(Farenzena, 2010), into a global signature, the
bag-of-words (BoW) model has been widely employed. In
(Ma, 2014), the BoW model is improved by means of
the Fisher Vector (Sánchez, 2013), which encodes
higher-order statistics of local features. Other
relevant methods to fuse different modalities of
features are those based on covariance descriptors,
extensively used in feature-oriented approaches, such
as the method proposed in (Corvee, 2010).
To improve the robustness to partial occlusions,
region-based approaches decompose a human body
image into different articulated parts and extract
features for each one. In that way, spatial information
is also integrated into the feature representation. In
(Bazzani, 2014), a symmetry-based silhouette
partition is used to detect salient body regions.
Recently, research has been focused on the reduction
of the number of high-level representations used to
describe a person. Instead of using traditional
dimensionality reduction methods, such as Principal
Component Analysis (PCA), many re-identification
algorithms choose an attributes-based approach,
allowing a description, which is semantically
meaningful to humans. In (Layne, 2014), attributes
such as hair or clothing style are combined with
low-level features.
An alternative solution to the feature selection
problem is the use of deep learning algorithms, which
provide a useful tool to automatically find salient
high-level representations from an image. These
algorithms usually train a neural network model with
a high number of layers by means of a supervised
learning process. Deep Convolutional Neural
Networks have commonly been trained to perform
object recognition or classification tasks, such as in
the work presented in (Krizhevsky, 2012).
Deep Siamese networks, recently employed in the
re-identification field, allow the learning of high-level
features by modelling the similarity
between a pair of images. The approach presented in
(Yi, 2014) not only uses Siamese networks to learn
deep features but also to jointly address distance
metric learning. A Siamese network consists of two
convolutional neural networks sharing parameters
and joined in the last layer. In this last layer, the loss
function leads the whole network to discriminate
between pairs of similar or dissimilar images.
Therefore, the re-identification task is treated as a
pairwise classification problem. In order to reduce
the intra-class variation and highlight the inter-class
variation, the contrastive loss function, described in
(Hadsell, 2006), has been widely employed as loss
function in the last layer of deep Siamese networks.
The matching process in re-identification consists
of recognising the person shown by an image that has
been selected in one view (probe image) in all the
images from another view (gallery images). This is
achieved by calculating the distances between the
probe image and all gallery images using the
extracted features, and returning those gallery images
with the smallest distances as potential matches.
In this paper, a novel three-dimensional
representation to describe the similarity between two
images is proposed. The human shape has been
divided into three parts: head, body, and legs. For
each part, a deep Siamese Network has been
implemented to model the appearance similarity
between this body part from different images. The
result is the learning of three similarity distances
(head, body, legs) whose computation gives the value
of each one of the components of our three-
dimensional descriptor.
With the aim of obtaining a single value metric to
perform the re-identification matching, the
comparison between two images has been calculated
as the module of our three-dimensional descriptor.
The main contributions presented in this paper are
the improvements performed over the contrastive loss
function originally employed in Siamese networks. In
the first place, a new formulation based on two
margins instead of only one is proposed.
Subsequently, a second enhancement has been
achieved by means of implicitly including the
normalization of the compared features in the loss
function. This novel loss function allows obtaining
more discriminative features, which are more robust
against intra-class variations and inter-class
ambiguities, as well as a faster training stage, due
to the reduction in the number of required iterations.
The re-identification capacity of our method has
been evaluated over the challenging PRID 2011
dataset (Hirzer, 2011), proving the improvement
obtained with the use of our normalized double
margin-based loss function, in comparison with the
traditional one. Furthermore, these results have been
compared with a re-identification method based on
low-level features, highlighting the advantages of
employing deep features.
The rest of the paper is organized as follows.
Section 2 describes the proposed approach, Section 3
presents the experimental results obtained, and some
concluding remarks are given in Section 4.
2 THE PROPOSED METHOD
Considering the person re-identification task as an
isolated module of a more complex surveillance
system, its objective is to recognize a certain person,
rendered in an image selected in one
camera view (probe image), in all the images from
another view (gallery images). Therefore, it is
assumed that these person images have been
previously detected in both non-overlapping camera
views.
In this paper, that re-identification goal is
achieved by calculating the distances between the
probe image and all gallery images using a 3D deep
descriptor. Our 3D descriptor measures the degree of
similarity between two images. Therefore,
understanding the descriptor as a vector, its module
should present the smallest values for matches
between images rendering the same person.
Figure 1: Re-identification Method Architecture.
The following sections present the architecture of this
re-identification method, the training of the deep
feature models, the formulation of the new loss layer
employed, and the interpretation of our descriptor as
a 3D point, or as a Euclidean distance.
2.1 Re-Identification Method Architecture
The proposed re-identification method follows the
architecture presented in Figure 1, as explained
below.
To measure the similarity between two person
images, firstly, each image is divided into three parts
roughly corresponding to the head, upper body, and
legs, using the human shape proportions. Secondly, a
Convolutional Neural Network (CNN) computes a
multi-dimensional descriptor, $F_i^{bp}$, for each one of
the mentioned parts, whose formulation is shown in
(1), where the index $bp$ represents one of the labels $h$,
$b$, and $l$, to refer to the head, upper body, or legs
descriptor, respectively. The index $i$ takes the values 1
or 2, corresponding to each one of the two input
images. Every element of the descriptor array
presents the form $f_{i,j}^{bp}$, where the index $j$ represents
each one of the elements of the descriptor array.
Therefore, $j$ can take values from 1 to $N$, being $N$ the
dimension of the descriptor. The value of $N$ is 100 for
$F_i^{h}$, and 200 for $F_i^{b}$ and $F_i^{l}$.

$$F_i^{bp} = \left( f_{i,1}^{bp}, \dots, f_{i,N}^{bp} \right) \qquad (1)$$
Subsequently, for every body part, $bp$, a comparison
function, $D^{bp}$, calculates the distance between the
multi-dimensional descriptors obtained for the two
images, as in (2).

$$D^{bp} = \sqrt{\sum_{j=1}^{N} \left( f_{1,j}^{bp} - f_{2,j}^{bp} \right)^{2}} \qquad (2)$$
The distance obtained for each body part comparison
is an element of a three-dimensional descriptor
$P = (P_h, P_b, P_l)$, as set in (3), (4), and (5). The
module of this vector is then calculated and employed
as the metric to measure the appearance differences
between the images.

$$P_h = D^{h} \qquad (3)$$
$$P_b = D^{b} \qquad (4)$$
$$P_l = D^{l} \qquad (5)$$
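To make this pipeline concrete, the following minimal sketch (in Python with NumPy, not part of the original Caffe implementation) computes the three-dimensional descriptor from already-extracted part features; the dictionary keys and the random vectors standing in for CNN outputs are illustrative assumptions.

```python
import numpy as np

def part_distance(f1, f2):
    # Comparison function D^bp of Eq. (2): Euclidean distance
    # between the two part descriptors.
    return np.sqrt(np.sum((f1 - f2) ** 2))

def descriptor_3d(feats1, feats2):
    # Three-dimensional descriptor P of Eqs. (3)-(5): one distance
    # per body part (head: N=100, upper body and legs: N=200).
    return np.array([part_distance(feats1[bp], feats2[bp])
                     for bp in ("head", "body", "legs")])

# Illustrative example with random vectors in place of CNN outputs.
rng = np.random.default_rng(0)
feats1 = {bp: rng.normal(size=n)
          for bp, n in (("head", 100), ("body", 200), ("legs", 200))}
feats2 = {bp: rng.normal(size=n)
          for bp, n in (("head", 100), ("body", 200), ("legs", 200))}
P = descriptor_3d(feats1, feats2)
appearance_difference = np.linalg.norm(P)  # module |P|, Eq. (13)
```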
2.2 Siamese Deep Neural Network
Training
In order to train each one of the three Convolutional
Neural Networks used in the proposed re-
identification method, a deep Siamese architecture
has been configured. For each body part, a training
process has been conducted using a Siamese network.
The training is supervised, therefore the images must
be labelled with an identification number, ID, which
represents the identity of the rendered person.
Figure 2: Siamese deep training algorithm for head CNN.
Figure 3: Convolutional Neural Network model.
Figure 2 shows the Siamese network employed to
train the head model. This model follows a CNN
architecture traditionally applied in computer vision
classification, which is shown in Figure 3. This neural
network is considered deep because of the relatively
high number of layers composing it; its
implementation has been performed using the Caffe
libraries (Jia, 2014). The essential layers of this
network are two convolutional layers, two max-pooling
layers, and an inner product layer, which is a
fully connected layer whose number of outputs is
equal to N, (1). The main parameters of the layers are
presented in Figure 3, where each layer is named with
a combination of several fields,
layernameR_i_bp. The layername field indicates the
layer type; R is only needed when more than one layer
of the same type is used in the model, and takes
different values to distinguish those layers; i
takes the labels a and b to refer to each one of
the two input images; and bp takes the values 1, 2, or
3, corresponding to each body part, i.e. head, upper
body, and legs, respectively.
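The original models were implemented in Caffe; purely as an illustration, the following PyTorch sketch reproduces the layer sequence described above (two convolutions, two max-poolings, one inner product layer with N outputs). The channel counts, kernel sizes, and the 48x48 input resolution are assumptions, since Figure 3 conveys these parameters graphically.

```python
import torch
import torch.nn as nn

class PartCNN(nn.Module):
    """Sketch of the per-part CNN of Figure 3; layer sizes are
    illustrative assumptions, not the paper's exact values."""
    def __init__(self, n_features):  # n_features: N = 100 (head) or 200 (body/legs)
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 20, kernel_size=5),   # conv1_i_bp
            nn.MaxPool2d(2),                   # pool1_i_bp
            nn.Conv2d(20, 50, kernel_size=5),  # conv2_i_bp
            nn.MaxPool2d(2),                   # pool2_i_bp
        )
        # Inner product (fully connected) layer with N outputs;
        # 50 * 9 * 9 assumes 3x48x48 input crops.
        self.fc = nn.Linear(50 * 9 * 9, n_features)

    def forward(self, x):
        x = self.features(x)
        return self.fc(torch.flatten(x, 1))
```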
With the aim of learning the CNN weight values,
this model is duplicated. Each of the two resulting
identical models takes as input one of the two
images to be compared. Therefore, two CNNs sharing
their parameters are joined by means of the
comparison function, (2). Moreover, a labeller layer
takes the identification numbers as inputs, and its
output, y, values 1 if the IDs are the same number and
0 otherwise.
During the training, by means of the forward and
back-propagation method (Rumelhart, 1988), the
parameters of both CNNs are led to values which
make the comparison function, $D^{bp}$, get closer to
certain objective values. The objective values for the
comparison function, both when y equals 1 and 0,
are defined in the loss function, $\mathcal{L}$. This function
measures the deviation of the comparison function
value from the objective ones.
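As a hedged illustration of this training scheme (again in PyTorch rather than the original Caffe, reusing the PartCNN sketch above), one training step embeds both crops with the same weight-sharing model, evaluates the comparison function, and back-propagates the loss through both branches; loss_fn is a placeholder for the functions defined in Section 2.3, and the optimiser settings are assumptions.

```python
import torch

model = PartCNN(n_features=100)  # head branch, from the sketch above
optimiser = torch.optim.SGD(model.parameters(), lr=0.01)

def train_step(img_a, img_b, y, loss_fn):
    # img_a, img_b: batches of crops of the same body part from the two
    # images; y: 1 if both IDs match (positive pair), 0 otherwise.
    d = torch.norm(model(img_a) - model(img_b), dim=1)  # comparison function, Eq. (2)
    loss = loss_fn(d, y.float())
    optimiser.zero_grad()
    loss.backward()  # back-propagation through both shared branches
    optimiser.step()
    return loss.item()
```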
2.3 Loss Function
Siamese networks have commonly been trained using
the contrastive loss function, presented in (6), as loss
function $\mathcal{L}$. Here $m$ is a constant parameter called
margin, and $B$ is the batch size, i.e. the number of
pairs used to compute the cost function in every
iteration. $\theta$ is an array whose elements, $d_n$, are the
distances computed by the comparison function for
each one of the image pairs of a batch of $B$ samples,
i.e. $d_n$ is the $D^{bp}$ of the sample $n$.
This equation measures the half average of the
error computed for every pair, taking into account
both cases: when the input image patches belong to
the same person, positive pair ($y=1$), or to different
ones, negative pair ($y=0$). The contributions of a
positive and a negative pair to the loss function are
described by (7) and (8), respectively.

$$\mathcal{L}(\theta) = \frac{1}{2B} \sum_{n=1}^{B} \left[ y_n \, d_n^{2} + (1 - y_n) \max(m - d_n, 0)^{2} \right] \qquad (6)$$

$$\mathcal{L}_P = d_n^{2} \qquad (7)$$

$$\mathcal{L}_N = \max(m - d_n, 0)^{2} \qquad (8)$$
According to (7), positive samples cause an error or
loss even when the measured distance is under the
value of the given margin, as shown in Figure 4a.
This loss leads the training to learn features which
make the distance between images null, causing at the
same time that the distance corresponding to some
negative samples also falls below the margin, m,
which further increases the total loss. Therefore, the
total loss value oscillates throughout the training
due to the difficulty of driving the distances of
positive samples near zero, as Figure 5a
shows.
One of the main contributions of this paper is the
design of a new contrastive loss function based on
two margins. These two margins establish the
separation between the objective values of the
distances (calculated by the comparison function) for
positive and negative samples. The result of using two
margins is an increment in the discriminative capacity
of the learned features.
The formulation of the double-margin-based
contrastive function is described in (9), where $m_1$
and $m_2$ are two constant parameters called margins, and
$B$ is the batch size, being each element, $d_n$, the
distance computed by the comparison function, $D^{bp}$,
for the sample $n$.

$$\mathcal{L}(\theta) = \frac{1}{2B} \sum_{n=1}^{B} \left[ y_n \max(d_n - m_1, 0)^{2} + (1 - y_n) \max(m_2 - d_n, 0)^{2} \right] \qquad (9)$$
Figure 4: Loss contribution of a positive, $\mathcal{L}_P$, and
negative, $\mathcal{L}_N$, single pair sample caused by the loss function
in the cases: (a) using the traditional contrastive loss
function, (6), with $m=1.5$; (b) using the double-margin
contrastive loss function, (9), with $m_1=1$ and $m_2=2$,
when the distances $d_n$ are not normalized; and (c) using
the double-margin contrastive loss function, (9), with
$m_1=0.3$ and $m_2=0.7$, when the normalized distances
$\hat{d}_n$ are used.
The contributions of a positive and a negative pair to
the loss function, (9), are described by (10) and (11),
respectively.

$$\mathcal{L}_P = \max(d_n - m_1, 0)^{2} \qquad (10)$$

$$\mathcal{L}_N = \max(m_2 - d_n, 0)^{2} \qquad (11)$$
According to (10) and (11), no loss is caused by
positive samples with a distance value lower than the
first margin, $m_1$, or by negative samples with a distance
higher than the second margin, $m_2$, as shown in
Figure 4b. The function forces the comparison values
obtained for positive pairs to be lower than the first
margin, and those obtained for negative pairs to be
higher than the second margin. This results in a
reduction in the amplitude of the oscillation of the
total loss value throughout the training process, as
well as in a decrease in the value around which the
loss function oscillates. Furthermore, such a value is
reached in a lower number of iterations, making
the learning process faster, as Figure 5b shows.
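The double-margin variant changes only the per-pair terms; a sketch consistent with the reconstruction of (9)-(11) above:

```python
import torch

def double_margin_loss(d, y, m1=1.0, m2=2.0):
    # Double-margin contrastive loss, Eq. (9): positive pairs are penalised
    # only above m1 (Eq. 10), negative pairs only below m2 (Eq. 11).
    pos = torch.clamp(d - m1, min=0) ** 2
    neg = torch.clamp(m2 - d, min=0) ** 2
    return torch.mean(y * pos + (1 - y) * neg) / 2
```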
Figure 5: Loss function evaluated over a cross-validation
set of samples throughout the training process of the head
part in the cases: (a) using the traditional contrastive loss
function, (6); and (b) using the double-margin contrastive
loss function, (9).
The selection of the margin values is a critical
issue which depends on the range of values presented
by the distances, $d_n$. However, this range varies
along the training process. For that reason, a new
intermediate normalization step has been added
with the goal of obtaining normalized distances,
$\hat{d}_n$, with a range of values $[0, 1)$. The distances
calculated by the comparison function, $d_n$, are
normalized by the function presented in (12).

$$\hat{d}_n = 2 \left( \frac{1}{1 + e^{-d_n}} - 0.5 \right) \qquad (12)$$
Subsequently, the normalized distances, $\hat{d}_n$, are
the inputs, $d_n$, of the loss function, (9), whose margin
parameters have been set with the values $m_1 = 0.3$
and $m_2 = 0.7$. The loss contributions of a positive
sample, (10), and a negative sample, (11), as a function
of the value of the distance $\hat{d}_n$ present a slightly
different shape, as Figure 4c shows.
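In code, the normalization of (12) simply composes with the double-margin loss sketched above; the following (assumed) helper fixes the margins to the values used for normalized distances, and could be passed as loss_fn to the train_step sketch of Section 2.2 to reproduce the DFL-N2M configuration evaluated in Section 3.

```python
import torch

def normalise(d):
    # Eq. (12): squashes a non-negative distance d_n into [0, 1).
    return 2 * (torch.sigmoid(d) - 0.5)

def normalised_double_margin_loss(d, y):
    # Margins 0.3 and 0.7, as set in the text for normalised distances.
    return double_margin_loss(normalise(d), y, m1=0.3, m2=0.7)
```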
2.4 Three-Dimensional Feature Module
Once the deep features for head, upper body, and legs
have been learned, the distances computed by the
comparison function for each one of them, $D^{h}$, $D^{b}$,
and $D^{l}$, respectively, constitute the elements of our
three-dimensional descriptor, P, according to Figure
1. The last step to obtain a single metric measuring
the appearance difference between person images is
the computation of the module of P, as in (13).

$$|P| = \sqrt{P_h^{2} + P_b^{2} + P_l^{2}} \qquad (13)$$
The descriptor P can be understood as a 3D point, and
consequently, its module, $|P|$, is the distance from the
origin of the coordinate system to the 3D point P.
Figure 6 represents that point for a test set of image
pairs. The point positions for positive samples are
nearer the origin than those for the negative ones,
proving that the training process has been effectively
conducted.
Figure 6: Three-dimensional representation of descriptor P,
in green for positive samples and in red for negative
samples, for a test set in the cases: (a) using the one-margin
contrastive loss function, (6), in the training, and (b) using
the double-margin contrastive loss function, (9), in the training.
The result of a proper training is the formation of
two separated clusters: one formed by the points
representing the positive samples and another by
those representing the negative ones. The
separating boundary between them is a
cube, whose dimension is the value of the margin
parameter employed in the loss function. Therefore,
if the loss function employed is based on only one
margin, m, like the traditional contrastive loss
function, (6), the points representing positive samples
should be inside a cube of dimension m, and the
negatives outside, as Figure 6a shows.
If the novel double-margin contrastive loss function,
(9), is employed, the resulting clusters present a
larger separation between them, which means that
more discriminative features have been learned. The
points representing positive samples should be inside
a cube of dimension $m_1$, and the negatives outside a
larger cube of dimension $m_2$, as Figure 6b shows.
2.5 Euclidean Distance as Metric
Distance
A common approach for the re-identification task
consists of the extraction of a number of features and
subsequently the computation of a metric distance to
measure the appearance similarity between two
person images from their feature vectors.
Following that structure, the feature vector
extracted for each image $i$ by the proposed method
is an array of 500 deep features, $F_i$, which is
formed by concatenating the feature vectors
calculated for head, upper body, and legs,
$F_i^{h}$, $F_i^{b}$, and $F_i^{l}$, respectively, as in (14).

$$F_i = \left( f_{i,1}^{h}, \dots, f_{i,100}^{h}, \; f_{i,1}^{b}, \dots, f_{i,200}^{b}, \; f_{i,1}^{l}, \dots, f_{i,200}^{l} \right) \qquad (14)$$
The metric distance used by our method is the
Euclidean distance, as can be deduced from the
computation of the module of P, (13). Taking into account
that the elements of P are the distances calculated by
the comparison function for each one of the body
parts, the module of P can be computed as (15)
presents. Using (2) to substitute each one of the
distances, the module computation takes the form
shown in (16), which is clearly the formulation of the
Euclidean distance between $F_1$ and $F_2$.

$$|P| = \sqrt{\left(D^{h}\right)^{2} + \left(D^{b}\right)^{2} + \left(D^{l}\right)^{2}} \qquad (15)$$

$$|P| = \sqrt{\sum_{j=1}^{100} \left( f_{1,j}^{h} - f_{2,j}^{h} \right)^{2} + \sum_{j=1}^{200} \left( f_{1,j}^{b} - f_{2,j}^{b} \right)^{2} + \sum_{j=1}^{200} \left( f_{1,j}^{l} - f_{2,j}^{l} \right)^{2}} \qquad (16)$$
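A quick numerical check of this equivalence (a sketch with random vectors in place of learned features):

```python
import numpy as np

rng = np.random.default_rng(0)
F1 = {bp: rng.normal(size=n) for bp, n in (("h", 100), ("b", 200), ("l", 200))}
F2 = {bp: rng.normal(size=n) for bp, n in (("h", 100), ("b", 200), ("l", 200))}

# Per-part distances (Eqs. 3-5) and module of P (Eq. 13).
P = np.array([np.linalg.norm(F1[bp] - F2[bp]) for bp in ("h", "b", "l")])
module_P = np.linalg.norm(P)

# Euclidean distance between the concatenated 500-D vectors (Eqs. 14, 16).
d_euclidean = np.linalg.norm(
    np.concatenate([F1[bp] - F2[bp] for bp in ("h", "b", "l")]))
assert np.isclose(module_P, d_euclidean)
```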
3 EXPERIMENTAL RESULTS
In this section, the evaluation method is described, as
well as the dataset used to perform the different tests.
Finally, the obtained results are presented and
discussed.
3.1 Evaluation
The Cumulative Matching Characteristic (CMC)
curve (Moon, 2001) has been calculated in order to
evaluate the performance of the proposed re-
identification method. The CMC, which is a standard
performance measurement, renders the matching rate
for each rank, r, i.e. the expectation of finding the
correct match within the top r matches.
To obtain the CMC curve, first, every image from
the probe set is matched with all the images from the
gallery set, considering as top matches those which
present the lowest values for the module of our
three-dimensional descriptor, $|P|$.
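As an illustration of this evaluation, the following sketch computes CMC scores from a distance matrix, assuming for simplicity that the correct match for probe i sits at gallery index i (the actual protocol of Section 3.2 uses a 649-image gallery):

```python
import numpy as np

def cmc_scores(dist, ranks=(1, 10, 20, 50, 100)):
    # dist[i, j]: module |P| between probe i and gallery j; the correct
    # match of probe i is assumed to be gallery image i.
    order = np.argsort(dist, axis=1)  # best matches first
    match_rank = np.array([int(np.where(order[i] == i)[0][0]) + 1
                           for i in range(dist.shape[0])])
    return {r: 100.0 * np.mean(match_rank <= r) for r in ranks}
```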
3.2 Dataset
The tests have been performed on one of the most
widely used datasets for evaluating re-identification
approaches, the PRID 2011 dataset (Hirzer, 2011).
This dataset is formed by person images recorded
from two different static cameras, presenting
substantial differences in camera parameters,
illumination, person poses, and background. Two
versions are provided, single-shot, that contains only
one image for each person from each camera, and
multi-shot that presents several images for each
person and each camera. In this work, the first one has
been selected, where camera view A contains 385
individuals, and camera B, 749. There are 200 of
these pedestrians appearing in both views, which are
randomly divided into two groups of 100 individuals,
one used in the training process, and the other in the
test, i.e. the CMC curve computation. For
evaluation on the test set, the procedure described in
(Hirzer, 2011) is followed, i.e., the images of view A
for the 100 individuals selected as the test set have been
used as the probe set, and the gallery set has been formed
by 649 images belonging to camera view B (all
images of view B except the 100 corresponding to the
training individuals).
3.3 Results
The tests performed involve the computation of the
CMC curve using the PRID 2011 dataset, as
explained above.
Three versions of the re-identification method
proposed in this paper have been tested, according to
the loss function used to learn the deep features:
Deep Features Learned by the One-Margin-based
contrastive loss function (6), (DFL-1M).
Deep Features Learned by the Double-Margin-based
contrastive loss function (9), (DFL-2M).
Deep Features Learned by the Normalized Double-
margin-based contrastive loss function (9) and (12),
(DFL-N2M).
The CMC scores are presented in Table 1, and the
corresponding curves are plotted in Figure 7.
Table 1: CMC scores (%) for three versions of the proposed
re-identification method.
Rank 1 10 20 50 100
DFL-1M 1 5 9 19 34
DFL-2M 3 15 25 37 56
DFL-N2M 6 20 26 38 63
The use of our novel double-margin-based
contrastive loss function causes a remarkable
improvement in the performance of the re-identification
method. In addition, the normalization
of the distances considered by that loss function also
slightly enhances the obtained results. For that reason,
this last version has been selected as the definitive one
to be implemented in our re-identification algorithm.
Figure 7: CMC curves for Three Versions of the Proposed
Re-Identification Method.
In order to evaluate the effect of using deeply learned
features instead of selected low-level features (LLF),
our method's performance has been compared
with an algorithm based on the Euclidean distance
between color and texture features, extracted as
(Hirzer, 2012) describes. This algorithm has been
selected because our algorithm can be understood as
the computation of the Euclidean distance between
the deep features computed for each image of a pair,
as explained in Section 2.5. In that way,
since the distance metric used by both of the
compared methods is the Euclidean distance, the
difference in performance is only due to the use
of deeply learned features in contrast to low-level
ones.
The CMC scores for both approaches are
presented in Table 2, and the corresponding curves
are plotted in Figure 8.
Table 2: CMC scores (%) for the proposed Re-
Identification method and a method based on Low-Level
Features.
Rank 1 10 20 50 100
DFL-N2M 6 20 26 38 63
LLF 3 10 14 28 45
The automatic selection of features provided by the
proposed deep learning algorithm results in a
considerable improvement of the re-identification
performance compared with the computation of low-
level features based on color and texture, which have
been traditionally employed.
Figure 8: CMC curves for the Proposed Re-Identification
method and other based on Low-Level Features.
4 CONCLUSIONS
This paper presents a re-identification approach based
on the learning of deep features for different body
parts, providing a three-dimensional descriptor which
results in a notable improvement in the performance
in comparison with an algorithm based on low-level
features.
In addition, the traditional contrastive loss
function employed in the learning process has been
enhanced by the design of a novel formulation based
on two margins and the normalization of the variable
on which it depends.
The evaluation of the proposed method has been
performed over a highly challenging dataset, the
PRID 2011 dataset, that presents many of the
variations occurring in a real world surveillance
scenario, such as changes in human pose,
illumination, background, and even camera
parameters. The evaluation results demonstrate that
deeply learned features provide more robustness
against these challenges than low-level features based
on color and texture.
The conducted tests have proved a remarkable
improvement in performance due to the use of the
new loss function. This normalized double-margin-
based loss function leads the training process to learn
more discriminative features, which reduces the intra-
class variation and highlights the inter-class variation.
Moreover, the proposed new loss function makes the
training process faster, since an acceptable model is
learned in a lower number of iterations, thanks to the
use of two margin parameters.
The obtained results present the normalized
double-margin contrastive loss function as a
potentially useful tool in the learning of appearance
similarity descriptors for multiple applications, as
well as in the learning of a distance metric to obtain
the proper weighting of the deep features in the
construction of an optimal discriminative descriptor
for re-identification.
ACKNOWLEDGEMENTS
This work was supported by the Spanish Government
through the CICYT projects (TRA2013-48314-C3-1-R
and TRA2015-63708-R) and the Ministerio de
Educación, Cultura y Deporte para la Formación de
Profesorado Universitario (FPU14/02143), and by the
Comunidad de Madrid through SEGVAUTO-TRIES
(S2013/MIT-2713).
REFERENCES
Bazzani, L., Cristani, M. and Murino, V., 2013. Symmetry-
driven accumulation of local features for human
characterization and re-identification. Computer Vision
and Image Understanding, 117(2), pp.130-144.
Bazzani, L., Cristani, M. and Murino, V., 2014. SDALF:
modeling human appearance with symmetry-driven
accumulation of local features. Person Re-
Identification, pp. 43-69. Springer London.
Corvee, E., Bremond, F. and Thonnat, M., 2010. Person re-
identification using spatial covariance regions of
human body parts. Advanced Video and Signal Based
Surveillance (AVSS), 2010 Seventh IEEE International
Conference on, pp. 435-440. IEEE.
Farenzena, M., Bazzani, L., Perina, A., Murino, V. and
Cristani, M., 2010. Person re-identification by
symmetry-driven accumulation of local features.
Computer Vision and Pattern Recognition (CVPR),
2010 IEEE Conference on, pp. 2360-2367. IEEE.
Hadsell, R., Chopra, S. and LeCun, Y., 2006.
Dimensionality reduction by learning an invariant
mapping. 2006 IEEE Computer Society Conference on
Computer Vision and Pattern Recognition (CVPR'06),
Vol. 2, pp. 1735-1742. IEEE.
Hirzer, M., Beleznai, C., Roth, P. M. and Bischof, H., 2011.
Person re-identification by descriptive and
discriminative classification. Scandinavian conference
on Image analysis, pp. 91-102. Springer Berlin
Heidelberg.
Hirzer, M., Roth, P. M., Köstinger, M. and Bischof, H.,
2012. Relaxed pairwise learned metric for person re-
identification. European Conference on Computer
Vision, pp. 780-793. Springer Berlin Heidelberg.
Jia, Y., Shelhamer, E., Donahue, J., Karayev, S., Long, J.,
Girshick, R., Guadarrama, S. and Darrell, T., 2014.
Caffe: Convolutional architecture for fast feature
embedding. Proceedings of the 22nd ACM
international conference on Multimedia, pp. 675-678.
ACM.
Krizhevsky, A., Sutskever, I. and Hinton, G. E., 2012.
Imagenet classification with deep convolutional neural
networks. Advances in neural information processing
systems, pp. 1097-1105.
Layne, R., Hospedales, T. M. and Gong, S., 2014.
Attributes-based re-identification. Person Re-
Identification, pp. 93-117. Springer London.
Ma, B., Su, Y. and Jurie, F., 2014. Discriminative Image
Descriptors for Person Re-identification. Person Re-
Identification, pp. 23-42. Springer London.
Moon, H. and Phillips, P. J., 2001. Computational and
performance aspects of PCA-based face-recognition
algorithms. Perception, 30(3), 303-321.
Oreifej, O., Mehran, R. and Shah, M. 2010. Human identity
recognition in aerial images. Computer Vision and
Pattern Recognition (CVPR), 2010 IEEE Conference
on, pp. 709-716. IEEE.
Rumelhart, D. E., Hinton, G. E. and Williams, R. J., 1988.
Learning representations by back-propagating
errors. Cognitive modeling, 5(3), 1.
Sánchez, J., Perronnin, F., Mensink, T. and Verbeek, J.,
2013. Image classification with the fisher vector:
Theory and practice. International journal of computer
vision, 105(3), pp.222-245.
Yi, D., Lei, Z., Liao, S. and Li, S. Z., 2014. Deep Metric
Learning for Person Re-identification. ICPR, Vol.
2014, pp. 34-39.
Zhang, Y. and Li, S., 2011. Gabor-LBP based region
covariance descriptor for person re-identification.
Image and Graphics (ICIG), 2011 Sixth International
Conference on, pp. 368-371. IEEE.