People Re-identification using Deep Convolutional Neural Network
Guanwen Zhang, Jien Kato, Yu Wang and Kenji Mase
Graduate School of Information Science, Nagoya University, Furo-cho, Chikusa-ku, Japan
Keywords:
People Re-identification, Deep Convolutional Neural Network, Linear SVM.
Abstract:
One key issue for people re-identification is to find good features or representations to bridge the gaps among the different appearances of the same person, which are introduced by large variances in viewpoint, illumination and non-rigid deformation. In this paper, we create a deep convolutional neural network (deep CNN) to solve this problem and integrate feature learning and re-identification into one framework. In order to deal with such a ranking-like comparison problem, we introduce a linear support vector machine (linear SVM) to replace the conventional softmax activation function. Instead of learning a cross-entropy loss, we adopt a margin-based loss on pair-wise images to measure the similarity of the comparing pair. Although the proposed model is quite simple, the experimental results show encouraging performance.
1 INTRODUCTION
People re-identification refers to the problem of recognizing a person when he/she leaves one camera view and enters another, or reappears in the same field of view. It is crucial for inter-camera tracking and for understanding people's behavior across a camera network, and it is a valuable task in video surveillance systems that receives more and more attention with the spreading installation of camera networks (Zheng et al., 2013)(Kviatkovsky et al., 2013)(Zhao et al., 2013)(Farenzena et al., 2010).
Due to low image resolution and the long distance between people and the camera, biometric information such as a person's face or gait is generally unavailable for people re-identification. In addition, because of the cross-camera setting, continuous visual tracking and intra-camera motion information cannot be directly utilized. Therefore, in the current literature, studies on person re-identification mainly focus on analyzing people's appearance, under the acceptable assumption that people do not change their clothing during the observation period. The challenge in such appearance-based approaches principally comes from the appearance variance induced by illumination, camera views, and non-rigid deformation of posture. This leads to the intra-personal variance (the same person seen across cameras) being even larger than the inter-personal variance; that is, the same person can look considerably different in videos captured by different cameras, whereas different people can look extremely similar in videos captured by the same camera.

Figure 1: Selected images from the VIPeR and CAVIAR4REID datasets. People's appearance changes with variations in posture, illumination, and resolution; even the same person can look considerably different.
Existing research that tries to bridge the "gap" between the different appearances of the same person can be roughly divided into two groups. The first group focuses on extracting discriminative appearance cues to form a stable feature representation. In these studies, global and local features, such as color (Gray and Tao, 2008), shape (Wang et al., 2007), or texture (Bazzani et al., 2012), are integrated over images. Spatial information, such as pictorial structures (Cheng et al., 2011), co-occurrence representations (Wang et al., 2007), symmetry factors (Farenzena et al., 2010) and salience (Zhao et al.,
2013), are also incorporated to deal with the lack of spatial information in histogram descriptions. The second group, comprising fewer studies, works on measuring the similarity between representations. These methods, such as the well-known relative distance comparison (RDC) (Zheng et al., 2013), locally aligned feature transforms (Li and Wang, 2013), local distance comparison (Zhang et al., 2012) and large margin nearest neighbor with rejection (Dikmen et al., 2010), also achieve good performance.
All of the existing methods build higher-level and complex models on top of hand-crafted features, and a great amount of time and manual work is needed for each specific re-identification task. An end-to-end solution for the re-identification task would therefore be very valuable. As great improvements have been achieved by deep learning in various tasks (Krizhevsky et al., 2012)(Sermanet et al., 2012), representation learning from raw input images seems to be a promising technique for people re-identification.
In this paper, we utilize a deep convolutional neural network (deep CNN) to solve the people re-identification problem, and thus incorporate feature learning and re-identification into one framework. In practical use, people re-identification requires performing ranking-like comparisons. In order to measure the similarity of the comparing pair, we introduce a linear support vector machine (linear SVM) to replace the traditional softmax activation function. Instead of learning a cross-entropy loss for predicting class labels, we measure the distance to the decision boundary, which is more suitable for the re-identification task. Our work follows conventional approaches and pre-trains the layers with an unsupervised learning method in a greedy layer-wise manner. The dropout technique is also adopted in the supervised learning phase to overcome the overfitting problem. Although the model we use is quite simple and the results have not reached those of the state-of-the-art methods, the experiments on public datasets still show encouraging performance.
This paper makes three contributions: (1) we propose a simple deep CNN architecture for the people re-identification problem, which has not been addressed before; (2) we introduce a linear SVM on top of the network to perform the ranking comparison needed by people re-identification; (3) we give a detailed discussion of the limitations of using deep learning for the re-identification problem and of the potential for further improvement.
The rest of the paper is organized as follows. The details of our approach are discussed in Section 2: we first explain the architecture of the deep CNN (2.1), then introduce the linear SVM (2.2), and at the end of the section briefly describe the unsupervised learning and dropout techniques (2.3). Experimental results are discussed in Section 3. Finally, we present our conclusions and future perspectives in Section 4.
2 METHODOLOGY
People re-identification requires measuring the similarity of the compared images. Multiple comparisons result in a ranking list, and those with the highest similarity are selected as the results. The deep CNNs reported so far generally work on classification problems rather than comparison problems. In order to apply a deep CNN to the re-identification problem, we design the architecture with two input branches, and introduce a linear support vector machine to measure the similarity of the two input images.
In the following sections, we give the details of the architecture of our proposed network, and describe the techniques we use in network training.
2.1 Architecture
The architecture of our deep CNN is summarized in Fig. 2. It contains 8 layers: three convolutional layers (C1, C3 and C5), three subsampling layers (S2, S4 and S6), one fully-connected layer (F7) and one weighting layer (W8). Each subsampling layer follows a convolutional layer with local connections; S6 and F7 are fully connected, and F7 and W8 together play the role of a linear SVM. In the training phase, our deep CNN minimizes the squared hinge loss of the linear SVM, which is equivalent to finding the maximum margin between true matches (+1) and false matches (-1) over the training sample pairs. In the testing phase, the similarity of an input image pair is measured by its distance to the decision boundary.
The two input images, each with R, G and B channels, are locally connected to the first convolutional layer (C1) through two different branches. Each kernel in this convolutional layer works on the three channels simultaneously and produces one output channel; there are 32 kernels per branch in this layer, producing 32 output channels for each branch. The neurons in the second convolutional layer (C3) are connected to both branches of the previous subsampling layer (S2), which follows the first convolutional layer (C1); the outputs of the kernels of the two branches are simply added together to form the kernel maps. Another convolutional and subsampling pair (C5 and S6), with more kernels, follows the second subsampling layer (S4).
PeopleRe-identificationusingDeepConvolutionalNeuralNetwork
217
Figure 2: An illustration of the architecture of our deep CNN. There are two branches for the input of the comparing pair. One subsampling layer follows each convolutional layer, working as a pair. The last two layers are fully connected. A linear SVM classifier is added on top to replace the softmax layer for measuring the similarity of the input image pair.
The neurons in the fully-connected layer (F7) are connected to all neurons in the third subsampling layer (S6). A rectified linearity is applied to the neurons of every convolutional layer. We use max-pooling in both branches of the first subsampling layer (S2), and average-pooling in the second and third subsampling layers (S4 and S6).
The first convolutional layer (C1) filters the 32x32x3 input images with 32 kernels of size 5x5x3 with a stride of 1 pixel (for every convolution, we pad the previous kernel map with a 2-pixel border of zeros). The input of each branch of the first subsampling layer (S2) is thus 32 kernel maps of size 32x32. Subsampling is performed per channel, so the number of output channels equals the number of input channels; we use a 3x3 filter with a 2-pixel stride to reduce each kernel map to half its size. The second convolutional layer (C3) takes the outputs of the subsampling layer (S2) on both branches as input and filters them with 32 kernels of size 16x16x2 with a stride of 1 pixel. The third convolutional layer (C5) has 128 kernels of size 8x8x32, and the fully-connected layer (F7) has 2048 nodes fully connected to the 4x4x128 neurons in the third subsampling layer (S6). The processing in the second and third subsampling layers (S4 and S6) is performed in the same way as in the first subsampling layer (S2), down-sampling by a factor of 2.
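For concreteness, a minimal sketch of this two-branch architecture in a modern framework might look as follows. Our implementation is in C++/CUDA; this PyTorch reimplementation is only illustrative, and where the kernel sizes in the text above are ambiguous, the sketch assumes 5x5 kernels with 2-pixel zero padding throughout.

```python
import torch
import torch.nn as nn

class ReIDCNN(nn.Module):
    """Illustrative sketch of the two-branch deep CNN of Section 2.1."""

    def __init__(self):
        super().__init__()
        # C1: 32 kernels per branch, 3 -> 32 channels, stride 1,
        # 2-pixel zero padding keeps the 32x32 spatial size.
        self.c1a = nn.Conv2d(3, 32, kernel_size=5, stride=1, padding=2)
        self.c1b = nn.Conv2d(3, 32, kernel_size=5, stride=1, padding=2)
        # S2: max-pooling, 3x3 window with 2-pixel stride halves each map.
        self.s2 = nn.MaxPool2d(kernel_size=3, stride=2, padding=1)
        # C3: connected to both branches; branch outputs are summed.
        self.c3 = nn.Conv2d(32, 32, kernel_size=5, stride=1, padding=2)
        self.s4 = nn.AvgPool2d(kernel_size=3, stride=2, padding=1)
        # C5: 32 -> 128 channels.
        self.c5 = nn.Conv2d(32, 128, kernel_size=5, stride=1, padding=2)
        self.s6 = nn.AvgPool2d(kernel_size=3, stride=2, padding=1)
        # F7: 2048 nodes fully connected to the 4x4x128 = 2048 neurons of S6.
        self.f7 = nn.Linear(4 * 4 * 128, 2048)
        # W8: weighting layer; together with F7 it acts as a linear SVM.
        self.w8 = nn.Linear(2048, 1, bias=False)
        self.relu = nn.ReLU()
        self.drop = nn.Dropout(0.5)  # dropout on C5 and F7 (Section 2.3.2)

    def forward(self, xa, xb):
        # Branch-wise C1 + S2 on each 3x32x32 input image.
        a = self.s2(self.relu(self.c1a(xa)))
        b = self.s2(self.relu(self.c1b(xb)))
        # C3 sees both branches; its kernel maps are the sum of the two outputs.
        x = self.s4(self.relu(self.c3(a) + self.c3(b)))
        x = self.drop(self.s6(self.relu(self.c5(x))))
        x = self.drop(self.relu(self.f7(x.flatten(1))))
        return self.w8(x).squeeze(1)  # signed distance to the decision boundary
```

Note how C1 and S2 run separately per branch, after which the branch outputs are merged by summation, matching the description above.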
2.2 Softmax vs. Linear SVM
In conventional methods, it is popular to use the softmax activation function in the top layer to predict the classes. Assume there is a full connection between the softmax layer and the penultimate layer, and that there are $N$ classes to predict; the softmax layer then has the same number $N$ of nodes. Let $h_j$ be the activation of node $j$ in the penultimate layer, and $W_{ji}$ the weight of the connection between node $j$ in the penultimate layer and node $i$ in the softmax layer. Since the connection is full, the input to node $i$ of the softmax layer is given by $a_i = \sum_j h_j W_{ji}$. The probability of class $i$, i.e. the output of node $i$, is defined as:

$$p_i = \frac{\exp(a_i)}{\sum_{k=1}^{N} \exp(a_k)}, \qquad (1)$$

with $\sum_{i=1}^{N} p_i = 1$. In this case, the predicted class label would be $\hat{i} = \arg\max_i p_i$.
In this paper, in order to measure the similarity of the comparing pair, we introduce a linear support vector machine to replace the softmax layer. Given the training data $\{x_n, t_n\}_{n=1}^{N}$, where $x_n \in \mathbb{R}^D$ and $t_n \in \{-1, +1\}$, the linear support vector machine (linear SVM) can be formulated as the following optimization problem:

$$\mathrm{obj}(w) = \frac{1}{2} w^T w + C \sum_{n=1}^{N} \big( \max(1 - w^T x_n t_n, 0) \big)^2. \qquad (2)$$

This is known as the L2-SVM, a popular variant of the SVM objective because it is differentiable and penalizes violating samples more heavily. The predicted class label can be obtained by $\hat{t} = \arg\max_t \, (w^T x)\, t$.
In order to use this objective function as the supervised criterion for training the parameters of the lower layers, we need to back-propagate the gradients of the linear SVM. The derivative of the linear SVM objective with respect to $w$, for a training sample $(x_n, t_n)$, is given by

$$\frac{\partial\, \mathrm{obj}(w)}{\partial w} = w - 2C\, x_n t_n \max(1 - w^T x_n t_n, 0). \qquad (3)$$
VISAPP2014-InternationalConferenceonComputerVisionTheoryandApplications
218
By introducing the activation $h = (h_1, \ldots, h_{2048})^T$ of the neurons in the penultimate layer (F7), the derivative with respect to each activation $h_i$ is given by

$$\frac{\partial\, \mathrm{obj}(w)}{\partial h_i} = -2C\, w_i t_n \max(1 - w^T h\, t_n, 0), \qquad (4)$$

where $t_n$ indicates the true or false match of the input pair. By using this gradient in place of the gradient of the softmax function, we can use the same back-propagation algorithm as in traditional deep learning methods to train the parameters of each layer.
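As an illustration, the squared hinge loss of Eq. (2) takes only a few lines; with an automatic-differentiation framework, back-propagating it reproduces the gradients of Eqs. (3) and (4) without coding them by hand. The following is a sketch under these assumptions, not our original implementation:

```python
import torch

def l2_svm_loss(score, t, w, C=1.0):
    """Squared hinge (L2-SVM) loss of Eq. (2).
    score: (batch,) tensor of w^T h values, the signed distances of the
           image pairs to the decision boundary;
    t:     (batch,) labels in {-1, +1} (true match / false match);
    w:     weight vector of the top layer (W8)."""
    margin = torch.clamp(1.0 - score * t, min=0.0)          # max(1 - w^T h t, 0)
    return 0.5 * w.pow(2).sum() + C * margin.pow(2).sum()   # Eq. (2)

# Calling loss.backward() propagates the gradients of Eqs. (3) and (4)
# through the chain rule, so the usual back-propagation applies unchanged.
```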
We note that a similar strategy is also used in (Zhong et al., 2000)(Nagi et al., 2012)(Tang, 2013). However, rather than the class label of the input data, we focus on the distance of the input pair to the decision boundary, where a larger signed distance indicates a higher similarity.
2.3 Unsupervised Learning and Dropout
Generally, only a small amount of training data is available for the re-identification task, while our neural network architecture has thousands of parameters. It is therefore difficult to learn so many parameters without considerable overfitting. Below, we describe the two primary ways in which we combat the overfitting problem.
2.3.1 Unsupervised Learning
In conventional approaches to training a deep network, the unknown parameters are first randomly initialized and then learned by directly performing gradient descent on a supervised objective function. However, this kind of method often gets stuck in poor local minima, and its performance degrades as the network depth increases. Hinton et al. proposed greedy layer-wise unsupervised pre-training to deal with this problem (Hinton and Salakhutdinov, 2006). They used an unsupervised learning algorithm to pre-train each layer, where the output of the previous layer is used as the input for training the following layer. After the pre-training stage, the whole network is fine-tuned to optimize a global supervised objective function. In this way, unsupervised learning, acting as a parameter initialization phase, leads to a much better solution in terms of generalization performance.
In this paper, following the idea of (LeCun et al., 2010)(Sermanet et al., 2012), we use Predictive Sparse Decomposition (PSD) as the unsupervised learning method to pre-train each layer of our deep CNN. PSD approximates the inputs as a sparse linear transformation on a dictionary. Similarly to sparse coding, the sparse representation $Z$ can be obtained by minimizing the following energy function:

$$E(Z, W, K) = \|X - WZ\|_2^2 + \lambda \|Z\|_1 + \|Z - C(X, K)\|_2^2, \qquad (5)$$

where $X$ is the input image, $K$ denotes the filters of the current layer that we want to learn, and the matrix $W$ is the dictionary, which is randomly initialized. The unsupervised learning proceeds in two steps: in the first step, we find the sparse representation $Z$; in the second step, we update the dictionary matrix $W$ and the filters $K$. For details, see (Sermanet et al., 2012) and (LeCun et al., 2010).
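As a rough sketch of this two-step procedure, the following alternates ISTA inference of the sparse code $Z$ with a gradient step on $W$ and $K$, for a simplified, fully-connected version of Eq. (5) with a linear predictor $C(X, K) = XK$. The actual pre-training is convolutional; all names and step sizes here are illustrative assumptions.

```python
import torch

def psd_step(x, W, K, lam=0.1, n_ista=50, lr=0.01):
    """One PSD update for Eq. (5) on a batch x of shape (batch, d_in).
    W: (d_in, d_code) dictionary; K: (d_in, d_code) predictor filters."""
    with torch.no_grad():
        pred = x @ K                      # feed-forward prediction C(X, K)
        # Step 1: find the sparse code Z by ISTA on the smooth part of Eq. (5).
        z = pred.clone()
        step = 1.0 / (2.0 * (torch.linalg.svdvals(W)[0] ** 2 + 1.0))
        for _ in range(n_ista):
            grad = 2.0 * (z @ W.T - x) @ W + 2.0 * (z - pred)
            u = z - step * grad
            z = torch.sign(u) * torch.clamp(u.abs() - lam * step, min=0.0)
    # Step 2: with Z fixed, take a gradient step on W and K.
    W = W.detach().requires_grad_(True)
    K = K.detach().requires_grad_(True)
    energy = ((x - z @ W.T) ** 2).sum() + ((z - x @ K) ** 2).sum()
    energy.backward()
    with torch.no_grad():
        W -= lr * W.grad
        K -= lr * K.grad
    return W.detach(), K.detach(), z
```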
2.3.2 Dropout
Dropout is an efficient technique introduced in (Hinton et al., 2012) which can reduce the generalization error of deep neural network architectures. Similarly to the denoising AutoEncoder, dropout randomly selects a fraction of the neurons in the hidden layers and forces them to be inactive by setting their outputs to zero. The selected neurons do not contribute to the forward pass and do not participate in back-propagation. However, unlike in the denoising AutoEncoder, dropout is performed during supervised training and can be used in all layers of a deep neural network for different purposes (Hinton et al., 2012). In this way, the neural network samples sub-networks with different connectivity patterns, and all these architectures share weights through the neurons that are not dropped out. Through this kind of random sampling, the network is effectively forced to learn an averaged model, and finally achieves a robust model that combats overfitting. We use dropout in the third convolutional layer (C5) and the fully-connected layer (F7), with a dropout rate of 0.5.
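A minimal sketch of the dropout operation just described, assuming the now-common "inverted" variant, which rescales the surviving activations at training time rather than rescaling the weights at test time as in (Hinton et al., 2012):

```python
import torch

def dropout(h, p=0.5, training=True):
    """Randomly inactivate each neuron with probability p by zeroing its
    output; survivors are scaled by 1/(1-p) so the expected activation
    is unchanged at test time."""
    if not training:
        return h
    mask = (torch.rand_like(h) > p).to(h.dtype)
    return h * mask / (1.0 - p)
```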
3 EXPERIMENT
3.1 Dataset Description
We evaluated our proposed deep CNN by applying it to two public datasets: VIPeR (Gray et al., 2007) and CAVIAR4REID (Cheng et al., 2011). These datasets cover different genres and include different people postures under a variety of illumination conditions, with various degrees of occlusion and camera resolution; they are therefore very challenging. VIPeR was originally created for viewpoint-invariant pedestrian recognition. There are 632 image pairs in this dataset, captured by two camera views in an outdoor environment.
PeopleRe-identificationusingDeepConvolutionalNeuralNetwork
219
Figure 3: Evaluation of CMC curves (rank score vs. recognition percentage) on the VIPeR and CAVIAR4REID datasets, with gallery sizes of 316 and 36, respectively (compared methods on VIPeR: PRC, PLS, Bhat., Li's, ITML, OURS; on CAVIAR4REID: L2, Li's, ITML, OURS). The state-of-the-art and baseline methods are used for comparison; the corresponding results are taken from the published papers.
Table 1: Matching rate of top ranking (%) on the VIPeR dataset, gallery size 316. The bold and red typefaces highlight the best results and the baseline results. Our results are shown in the first row.

Method    Top 1  Top 10  Top 20  Top 30  Top 40  Top 50  Top 60  Top 80  Top 100
OURS      12.5   26.3    39.7    52.6    62.1    67.9    74.5    85.0    90.4
RDC       15.7   53.9    70.1    79.4    83.5    87.4    90.2    92.7    96.7
PLS        2.7   10.9    17.3    24.3    28.6    32.2    38.1    48.4    53.8
Xing's     4.6   16.6    24.4    30.4    33.9    39.2    44.5    54.8    61.2
L1-norm    4.2   16.5    23.8    29.7    32.4    37.7    42.6    50.1    56.7
Li's      29.6   69.3    82.3    91.7    94.6    96.8    97.6    99.1   100
ITML      12.4   39.7    55.2    66.3    72.9    78.7    82.5    87.2    91.3
Due to arbitrary viewpoint changes under varying illumination between the two camera views, the appearance of the samples in an image pair differs greatly. CAVIAR4REID was built specifically for person re-identification tasks. It consists of 72 people with 1,220 images; 50 people are captured in two camera views, and the remaining 22 in a single camera view. The images in this dataset were selected by maximizing the variance with respect to resolution, illumination, occlusion and posture (Cheng et al., 2011).
3.2 Evaluation Method
We randomly divide each dataset into a training set and a testing set according to person identities. The training samples in each training set consist of two kinds of pairs: true pairs and false pairs. Given an image from one camera view, a true pair is created by selecting an image of the same person in the other view, while a false pair is created by randomly selecting an image of a different person from the other view. Since there are multiple images per person in the CAVIAR4REID dataset, its training pairs are created by selecting all images of each person. In this way, we have 632 and around 500 training pairs for the VIPeR and CAVIAR4REID datasets, respectively.
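A hedged sketch of this pair construction follows; the data layout and helper are hypothetical, and the exact numbers of pairs depend on the dataset as noted above.

```python
import random

def make_pairs(images_by_id):
    """Build true/false training pairs from a dict {person_id: [images]}:
    each image is paired with the other images of the same person (+1)
    and with an image of one randomly chosen different person (-1)."""
    ids = list(images_by_id)
    pairs = []
    for pid in ids:
        imgs = images_by_id[pid]
        for i, anchor in enumerate(imgs):
            for pos in imgs[i + 1:]:          # all same-person images: true pairs
                pairs.append((anchor, pos, +1))
            neg_id = random.choice([q for q in ids if q != pid])
            pairs.append((anchor, random.choice(images_by_id[neg_id]), -1))
    return pairs
```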
Each testing set is composed of a gallery set and a probe set. The gallery set consists of one image of each person, and the remaining images are used as the probe set. Re-identification is conducted by finding a match in the gallery for each probe. During the experiments, this procedure is repeated 5 times to obtain the average performance. The popular evaluation criterion, the Cumulative Matching Characteristic (CMC), which represents the expectation of finding the correct match within the top n candidates, is used to measure matching rates.
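For reference, a CMC curve can be computed from a probe-gallery similarity matrix as follows (a sketch; the names are illustrative):

```python
import numpy as np

def cmc(scores, gallery_ids, probe_ids):
    """Cumulative Matching Characteristic: fraction of probes whose correct
    gallery identity appears within the top-n ranked candidates.
    scores: (n_probe, n_gallery) similarity matrix (higher = more similar)."""
    order = np.argsort(-scores, axis=1)            # rank gallery per probe
    ranked_ids = np.asarray(gallery_ids)[order]    # identities in rank order
    hits = ranked_ids == np.asarray(probe_ids)[:, None]
    first_hit = hits.argmax(axis=1)                # rank of the correct match
    curve = np.zeros(scores.shape[1])
    for r in first_hit:
        curve[r:] += 1                             # a hit at rank r counts for all n >= r
    return curve / len(probe_ids)
```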
Our proposed method is implemented in C++ and CUDA with a Python wrapper. The convolution code is based on the CUDA kernels published by Alex Krizhevsky (http://code.google.com/p/cuda-convnet). The graphics card used in the experiments is an NVIDIA GTX 660.

In the training phase, the initial parameters are drawn from a uniform distribution in the range [-0.05, 0.05]. We also initialize the biases in the convolutional layers with the constant 0.002, and set the momentum to 0.9, the weight decay to 0.004, and the batch size to 50.
Table 2: Matching rate of top ranking (%) on the CAVIAR4REID dataset, gallery size 36. The bold and red typefaces highlight the best results and the baseline results. Our results are shown in the first row.

Method   Top 1  Top 5  Top 10  Top 15  Top 20  Top 25  Top 30
OURS      7.2   26.9   42.6    57.5    76.25   86.5    92.6
L2        4.1   21.5   37.8    44.6    53.2    61.4    68.6
PS        8.5   32     48.8    59.7    66.4    79.7    86.6
SDALF     6.8   25     45      55      64.5    74      83
Li's     10.2   39     59      71.4    79.5    84.4    88.2
ITML      7.3   32.49  50.5    61.3    70.4    77.8    82
Figure 4: Training and testing error rates of the identification on the VIPeR and CAVIAR4REID datasets (epochs vs. recognition error rate), evaluated both USING pre-trained unsupervised learning and dropout (Y) and NOT USING these techniques (N). The testing errors are computed every 10 epochs.
We use all training pairs of the two datasets for unsupervised pre-training. The dropout technique is only used in the fine-tuning stage, with the dropout rate set to 0.5. During the fine-tuning stage on each specific target training set, we follow a strategy similar to (Krizhevsky et al., 2012) and manually adjust the learning rate throughout training.
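For illustration, the reported hyper-parameters can be wired into a modern training loop as follows. This sketch builds on the ReIDCNN and l2_svm_loss sketches above; the learning-rate value is an assumption (we adjust it manually), and the dummy batch stands in for the real pair loader.

```python
import torch

def init_params(model):
    # Reported initialization: weights ~ U(-0.05, 0.05), conv biases = 0.002.
    for m in model.modules():
        if isinstance(m, (torch.nn.Conv2d, torch.nn.Linear)):
            torch.nn.init.uniform_(m.weight, -0.05, 0.05)
            if m.bias is not None:
                torch.nn.init.constant_(m.bias, 0.002)

model = ReIDCNN()
init_params(model)
# Momentum 0.9, weight decay 0.004; the w^T w regularizer of Eq. (2) is
# delegated to the optimizer's weight decay in this sketch.
opt = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9, weight_decay=0.004)

# Dummy batch standing in for the real pair loader (50 pairs of 3x32x32 images).
loader = [(torch.randn(50, 3, 32, 32), torch.randn(50, 3, 32, 32),
           torch.randint(0, 2, (50,)).float() * 2 - 1)]

for xa, xb, t in loader:
    score = model(xa, xb)
    loss = torch.clamp(1.0 - score * t, min=0.0).pow(2).sum()  # squared hinge
    opt.zero_grad()
    loss.backward()
    opt.step()
```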
3.3 Results
In the experiments, we only choose learning-based methods that have been evaluated on the two datasets. We compare our proposed method with the baseline methods, partial least squares (PLS) (Schwartz and Davis, 2009) and Bhattacharyya distance learning (Bhat.), on VIPeR, and with the baseline method, Euclidean distance learning (L2), on CAVIAR4REID. We also compare our method with some well-known state-of-the-art methods such as relative distance comparison (RDC) (Zheng et al., 2013), information-theoretic metric learning (ITML), and the method proposed by Li et al. (Li's) (Li and Wang, 2013). The results of the above-mentioned methods are taken from the published papers (Zheng et al., 2013)(Li and Wang, 2013).
The comparison results are shown in Fig. 3 and Tables 1 and 2. It can be seen that our proposed model outperforms the baseline methods on these datasets, but is still worse than the state-of-the-art methods. Notice that the CMC curve of our method becomes competitive after rank 15 on CAVIAR4REID. The numbers of training pairs for VIPeR and CAVIAR4REID are 632 and around 500 (people with 10 images may be selected for the training set), respectively, whereas the numbers of people to be re-identified are 316 and 36. Therefore, the relative number of training pairs per person in VIPeR is 2 (= 632/316), much smaller than the roughly 14 (≈ 500/36) in CAVIAR4REID. This explains why the performance on CAVIAR4REID is better than that on VIPeR.
We also notice that, although we have introduced unsupervised pre-training and the dropout technique, overfitting is still severe. To show the details, we give a comparison of the training and testing errors obtained with and without unsupervised learning and dropout. The testing errors are obtained from a validation set, which is created in the same way as the training set but using the images of the people in the testing set. In Fig. 4, it can be seen clearly that, without unsupervised learning and dropout, the training errors on both datasets are close to zero while the testing errors are quite high. By introducing unsupervised learning and dropout, the divergence between training and testing errors is reduced on both datasets. Nevertheless, the high and wildly swinging testing errors still reveal that the networks suffer from overfitting.
Increasing the training data is an essential way to reduce overfitting and improve the performance of the network. However, for the re-identification task, since a positive sample (true match) consists of images of the same person, the number of people in the dataset restricts the number of positive samples. This makes the number of positive samples usually much smaller than that of negative samples. In the training phase, the number of negative samples should therefore be limited to a certain amount to avoid overfitting.
Multiple-shot re-identification datasets are a good choice for alleviating this problem. As shown in the experimental results, by creating more positive samples across multiple images, the performance of the network on CAVIAR4REID is better than that on VIPeR. During the experiments, we observed that the performance of the network on some multiple-shot datasets, such as ETHZ (Ess et al., 2007) and Person Re-ID 2011 (Hirzer et al., 2011), is not impressive. By further inspecting these datasets, we found that the multiple images of each person are extracted from video sequences; the difference between the images is small, so they contribute little to training.
From another point of view, similarly to the denoising AutoEncoder, it is possible to increase the number of positive samples by applying transformations to the original images, such as partially corrupting the input image pairs or extracting random patches, as sketched below. With such a strategy, we can not only increase the number of positive samples to combat the overfitting issue, but also improve the robustness of the network against noise.
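A minimal sketch of such a transformation (random patch extraction plus additive corruption; the crop size and noise level are illustrative assumptions):

```python
import torch

def augment(img, crop=28, noise=0.1):
    """Create an extra positive sample from a (C, H, W) image tensor by
    random cropping and additive corruption."""
    c, h, w = img.shape
    top = torch.randint(0, h - crop + 1, (1,)).item()
    left = torch.randint(0, w - crop + 1, (1,)).item()
    patch = img[:, top:top + crop, left:left + crop]
    patch = patch + noise * torch.randn_like(patch)   # partial corruption
    return patch.clamp(0.0, 1.0)
```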
4 CONCLUSIONS
How to find a good feature representation to bridge the "gap" between appearances of the same person is a very challenging task. Existing methods either employ hand-crafted features or use machine learning methods on existing features to form a specific representation. However, there is a lot of uncertainty in these methods due to human factors and specific applications. Deep learning, with the ability to learn a proper feature representation directly from raw images, seems to be a promising solution for people re-identification tasks.
In this paper, we utilize a deep convolutional neural network to solve the people re-identification problem. We integrate feature learning and re-identification into one framework, and accomplish learning and re-identification simultaneously. In order to deal with the ranking-like comparison problem, we introduce a linear support vector machine to replace the softmax layer for measuring the similarity of the comparing images. Since a large number of network parameters need to be estimated while only a small amount of training data is available, unsupervised pre-training and the dropout technique are used to reduce overfitting.
Although the proposed model is quite simple, we still achieve encouraging performance compared with the baseline methods, which gives us great confidence. However, compared with the state-of-the-art methods, our performance needs to be further improved. Careful analysis of the results shows that severe overfitting, caused by the lack of positive training samples, seems to be the main reason. Addressing this is our future work.
ACKNOWLEDGEMENTS
This research was supported by the National Institute of Information and Communications Technology (NICT), and by the Strategic Information and Communications R&D Promotion Programme (No. 131306004). Yu Wang is supported by a Grant-in-Aid from the Japan Society for the Promotion of Science, and Guanwen Zhang is supported by the Fund of the China Scholarship Council.
REFERENCES
Bazzani, L., Cristani, M., Perina, A., and Murino, V. (2012).
Multiple-shot Person Re-identification by Chromatic
and Epitomic Analyses, volume 33.
Cheng, D. S., Cristani, M., Stoppa, M., Bazzani, L., and
Murino, V. (2011). Custom Pictorial Structures for
Re-identification.
Dikmen, M., Akbas, E., Huang, T. S., and Ahuja, N. (2010).
Pedestrian Recognition with a Learned Metric. Proc.
Asia Conf. Computer Vision, pages 501–512.
Ess, A., Leibe, B., and van Gool, L. (2007). Depth and
Appearance for Mobile Scene Analysis. Proc. Int’l
Conf. Computer Vision, pages 1–8.
Farenzena, M., Bazzani, L., Perina, A., Murino, V., and
Cristani, M. (2010). Person Re-Identification by
Symmetry-Driven Accumulation of Local Features.
Gray, D., Brennan, S., and Tao, H. (2007). Evaluating ap-
pearance models for recognition, reacquisition, and
tracking. In 10th IEEE Int’l Workshop on Perfor-
mance Evaluation of Tracking and Surveillance.
Gray, D. and Tao, H. (2008). Viewpoint Invariant Pedestrian
Recognition with an Ensemble of Localized Features.
Proc. European Conf. Computer Vision, pages 262–
275.
Hinton, G. E. and Salakhutdinov, R. R. (2006). Reducing
the dimensionality of data with neural networks, vol-
ume 313.
Hinton, G. E., Srivastava, N., Krizhevsky, A., Sutskever, I.,
and Salakhutdinov, R. (2012). Improving neural net-
works by preventing co-adaptation of feature detec-
tors, volume abs/1207.0580.
VISAPP2014-InternationalConferenceonComputerVisionTheoryandApplications
222
Hirzer, M., Beleznai, C., Roth, P. M., and Bischof, H.
(2011). Person re-identification by descriptive and
discriminative classification. Proc. Scandinavian
Conf. on Image Analysis.
Krizhevsky, A., Sutskever, I., and Hinton, G. E. (2012). Im-
agenet classification with deep convolutional neural
networks. In Proc. Advances in Neural Information
Processing Systems 25, pages 1106–1114.
Kviatkovsky, I., Adam, A., and Rivlin, E. (2013). Color
Invariants for Person Reidentification, volume 35.
LeCun, Y., Kavukcuoglu, K., and Farabet, C. (2010). Con-
volutional networks and applications in vision. pages
253–256.
Li, W. and Wang, X. (2013). Locally Aligned Feature Transforms across Views.
Nagi, J., Di Caro, G. A., Giusti, A., Nagi, F., and Gambardella, L. (2012). Convolutional Neural Support Vector Machines: Hybrid visual pattern classifiers for multi-robot systems. pages 27–34.
Schwartz, W. R. and Davis, L. S. (2009). Learning Dis-
criminative Appearance-Based Models Using Partial
Least Squares. Proc. 2009 XXII Brazilian Symposium
on Computer Graphics and Image Processing, pages
322–329.
Sermanet, P., Kavukcuoglu, K., Chintala, S., and Le-
Cun, Y. (2012). Pedestrian Detection with Un-
supervised Multi-Stage Feature Learning, volume
abs/1212.0142.
Tang, Y. (2013). Deep Learning using Support Vector Ma-
chines, volume abs/1306.0239.
Wang, X., Doretto, G., Sebastian, T., Rittscher, J., and Tu,
P. (2007). Shape and Appearance Context Modeling.
Proc. Int’l Conf. Computer Vision, pages 1–8.
Zhang, G., Wang, Y., Kato, J., Marutani, T., and Mase, K.
(2012). Local Distance Comparison for Multiple-shot
People Re-identification. Proc. Asia Conf. Computer
Vision, pages 677–690.
Zhao, R., Ouyang, W., and Wang, X. (2013). Unsupervised
Salience Learning for Person Re-identification.
Zheng, W.-S., Gong, S., and Xiang, T. (2013). Re-
identification by Relative Distance Comparison, vol-
ume 99.
Zhong, S. and Ghosh, J. (2000). Decision Boundary Focused Neural Network Classifier.
PeopleRe-identificationusingDeepConvolutionalNeuralNetwork
223