Figure 1: Left: example of the capabilities of the Kinect SDK. Two individuals are tracked, and their skeletons are superimposed on the image. The upper-right box shows the segmentation in the depth domain. Right: system architecture (see text).
$\{I_1, \ldots, I_M\}$, the corresponding MCD descriptor is obtained as the concatenation of the $M$ dissimilarity vectors $I^D = [I^D_1, \ldots, I^D_M]$, where:
$$I^D_m = \big[\, d(I_m, P_{m,1}), \ldots, d(I_m, P_{m,N_m}) \,\big], \quad m = 1, \ldots, M,$$
and $d(\cdot, \cdot)$ is a dissimilarity measure between two sets of components, e.g., the $k$-th Hausdorff Distance (Satta et al., 2012). To match two dissimilarity vectors $I^D_1$ and $I^D_2$, a weighted Euclidean distance is used: higher weights are assigned to the most significant prototypes (see (Satta et al., 2012) for further details).
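To make the descriptor construction concrete, the following is a minimal Python sketch under our own assumptions (not taken from (Satta et al., 2012)): components are NumPy feature vectors, the component-level distance is Euclidean, and the set dissimilarity $d(\cdot, \cdot)$ is the symmetric $k$-th Hausdorff distance; all function names are hypothetical.

```python
import numpy as np

def kth_hausdorff(A, B, k=1):
    # Symmetric k-th Hausdorff distance between two component sets,
    # A of shape (n_a, dim) and B of shape (n_b, dim); for k = 1 it
    # coincides with the standard Hausdorff distance.
    d = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=2)
    h_ab = np.sort(d.min(axis=1))[-min(k, len(A))]  # k-th largest of A->B minima
    h_ba = np.sort(d.min(axis=0))[-min(k, len(B))]  # k-th largest of B->A minima
    return max(h_ab, h_ba)

def mcd_descriptor(parts, prototypes, k=1):
    # parts: list of M component sets I_m; prototypes: list of M lists
    # of prototype sets [P_{m,1}, ..., P_{m,N_m}]. Returns the
    # concatenation of the M dissimilarity vectors I^D_m.
    return np.concatenate([
        np.array([kth_hausdorff(I_m, P, k) for P in protos_m])
        for I_m, protos_m in zip(parts, prototypes)
    ])

def match_distance(d1, d2, weights):
    # Weighted Euclidean distance between two dissimilarity vectors;
    # higher weights correspond to more significant prototypes.
    return float(np.sqrt(np.sum(weights * (d1 - d2) ** 2)))
```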
A specific implementation of MCD (MCDimpl) was proposed in (Satta et al., 2012). It subdivides the body into torso and legs ($M = 2$), and uses as components patches randomly extracted from each body part, described by their HSV colour histograms. Prototypes are constructed using a two-stage clustering algorithm, and each prototype is then defined as the patch nearest to the centroid of its cluster. In (Satta et al., 2012) it was shown that MCDimpl allows several thousand matchings per second, since each matching reduces to comparing two real-valued vectors. Moreover, although prototype construction can be time-consuming, prototypes can be obtained off-line from any gallery of individuals that exhibits a reasonable variety of clothing. In particular, such a gallery can be different from the template gallery of the system, and thus does not need to be updated as new templates are added during operation.
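The sketch below illustrates the idea behind MCDimpl components and prototypes. The patch size, histogram binning, and the single k-means stage (the paper uses a two-stage clustering algorithm) are our simplifications, and all names are hypothetical.

```python
import numpy as np
import cv2
from sklearn.cluster import KMeans

def random_patch_histograms(bgr_image, part_mask, n_patches=80, size=16):
    # Extract random patches inside a body-part mask (torso or legs)
    # and describe each by its flattened, L1-normalised HSV histogram.
    hsv = cv2.cvtColor(bgr_image, cv2.COLOR_BGR2HSV)
    ys, xs = np.where(part_mask)
    feats = []
    for _ in range(n_patches):
        i = np.random.randint(len(ys))
        y, x = ys[i], xs[i]
        patch = hsv[y:y + size, x:x + size]
        h = cv2.calcHist([patch], [0, 1, 2], None, [8, 8, 4],
                         [0, 180, 0, 256, 0, 256]).ravel()
        feats.append(h / max(h.sum(), 1e-9))
    return np.array(feats)

def build_prototypes(gallery_feats, n_prototypes=30):
    # Cluster the pooled patch histograms of an off-line gallery and
    # keep, for each cluster, the patch nearest to its centroid.
    km = KMeans(n_clusters=n_prototypes, n_init=10).fit(gallery_feats)
    protos = []
    for c in range(n_prototypes):
        members = gallery_feats[km.labels_ == c]
        nearest = members[np.argmin(
            np.linalg.norm(members - km.cluster_centers_[c], axis=1))]
        protos.append(nearest[None, :])  # singleton component set
    return protos
```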
2.2 The Kinect Device
The Kinect platform was originally developed for the home entertainment market. Due to its low cost, it is currently gaining much interest in the computer vision community. The device provides: (i) an RGB camera (1280 × 960 pixels at 30 fps); (ii) an IR depth sensor based on coded structured light, which constructs a 640 × 480 pixel depth map at 30 fps, with an effective range of 0.7 to 6 meters. The Kinect SDK also provides reliable tracking, segmentation and skeletonisation (Shotton et al., 2011), based on depth and RGB data (see Fig. 1, left).
The technology adopted by the Kinect device suffers from two limitations. First, the maximum distance at which a person can be detected (around 5–6 m) is relatively low. However, ad hoc sensors (probably more costly) based on the same technology could be developed to deal with larger distances. Second, the use of IR projectors and sensors to build the depth map prevents outdoor usage, because of the interference in the IR band caused by sunlight. Nevertheless, indoor environments include typical video-surveillance scenarios (e.g., offices, airports).
3 SYSTEM IMPLEMENTATION
Our prototype tracks all the individuals seen by a network of Kinect cameras, adds to a template database an appearance descriptor (template) of each acquired track, and re-identifies online each new individual by matching his/her descriptor with all current templates. After a track is acquired, the operator is shown the list of templates, ranked according to their matching score with that track.
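As a minimal sketch of this ranking step, the following Python fragment matches a new track's descriptor against every stored template and sorts the templates by increasing distance; the Template record layout and all names are our assumptions for illustration, not the paper's.

```python
from dataclasses import dataclass
from datetime import datetime
import numpy as np

@dataclass
class Template:
    acquired_at: datetime  # acquisition date and time
    descriptors: list      # one MCDimpl descriptor per stored frame

def rank_templates(track_descriptor, templates, weights):
    # Sort templates by the best (minimum) weighted Euclidean distance
    # between the track descriptor and any of the template's frames;
    # the resulting ranked list is what the operator is shown.
    def score(t):
        return min(float(np.sqrt(np.sum(weights * (track_descriptor - d) ** 2)))
                   for d in t.descriptors)
    return sorted(templates, key=score)
```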
Our prototype architecture is shown in Fig. 1
(right). It consists of a network of Kinect cameras,
connected to a PC (Fig. 1, right (a)). First, detection,
tracking, segmentation (i.e., silhouette extraction) and
skeletonisation of each individual seen by the network
are carried out (Fig. 1, right (b)), exploiting the Kinect
SDK (other detection techniques based on RGB and
range data can also be used, e.g., (Salas and Tomasi,
2011; Spinello and Arras, 2011)). Each individual
is associated with a track, i.e., the sequence of regions
of the RGB frames containing him/her, extracted by
the detector, and the corresponding skeletons. After
a track T is acquired, a template is created by the re-identification module (Fig. 1, right (c)), and is added to the Template DB. A template is made up of the acquisition date and time, and of the 5 frames $\{T_1, \ldots, T_5\}$ of the track exhibiting the largest silhouette area, with the corresponding MCDimpl descriptors.
Online re-identification is performed for each new track T, with respect to all current templates. The first frame of T is initially matched to the templates, and then subsequent frames sampled every $t_{\mathrm{acquire}} = 1$ sec., but only if the corresponding silhouette area