$N_{bound}$ as follows:

$$N_{bound} = \frac{T}{V.F.I.} \qquad (1)$$
where V.F.I. is the Valid Frame Interval, i.e. the minimum reasonable lapse of time for which a face should be present in order to be taken into consideration, and T is the total video length (both measured in frames). In our tests, V.F.I. is defined as four times the video frame rate (i.e. a persistence of at least four seconds).
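For instance, for a video of 3600 frames recorded at 30 fps (the first test video of Section 3), $V.F.I. = 4 \times 30 = 120$ frames and hence $N_{bound} = 3600/120 = 30$ clusters at most.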
In particular, the best k is chosen by evaluating several internal statistical indices, i.e. indices computed only from the observations used to create the clusters. Note that external indices cannot be used, since neither a priori knowledge nor a pre-specified data structure, such as a set of true known labels, is available. In this paper, the investigated indices are the following: Average Silhouette, Davies-Bouldin (DB), Calinski-Harabasz (CH), Krzanowski and Lai (KL), Hartigan index, weighted inter-intra (Wint) cluster ratio, and Homogeneity-Separation. For a more comprehensive treatment, refer to (Kaufman and Rousseeuw, 2009; Davies and Bouldin, 1979; Caliński and Harabasz, 1974; Krzanowski and Lai, 1988; Hartigan, 1975). The chosen criterion will be presented in Section 3.
At the end of this step, it is possible that the best number of clusters is still undefined, because no satisfactory index values were obtained during the whole iterative process. In that case, the hypothesis is made that only one person is present in the scene.
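As an illustration, the following sketch shows how such a selection loop could be implemented with scikit-learn, using three of the listed indices (Average Silhouette, DB and CH). The acceptance threshold and the combination rule are placeholders, since the paper's actual criterion is the one presented in Section 3.

```python
from sklearn.cluster import KMeans
from sklearn.metrics import (silhouette_score, davies_bouldin_score,
                             calinski_harabasz_score)

def choose_k(X, n_bound, sil_threshold=0.5):
    """Select the number of clusters in [2, n_bound] by internal indices.

    `sil_threshold` is a hypothetical acceptance level; the paper's
    actual selection criterion is presented in Section 3.
    """
    indices = {}
    for k in range(2, n_bound + 1):
        labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
        indices[k] = {
            "silhouette": silhouette_score(X, labels),              # higher is better
            "davies_bouldin": davies_bouldin_score(X, labels),      # lower is better
            "calinski_harabasz": calinski_harabasz_score(X, labels) # higher is better
        }
    if indices:
        # Illustrative rule: pick the k with the best silhouette,
        # provided it clears the acceptance threshold.
        best_k = max(indices, key=lambda k: indices[k]["silhouette"])
        if indices[best_k]["silhouette"] >= sil_threshold:
            return best_k
    # Fallback from the paper: if no satisfactory values are obtained
    # during the whole iterative process, assume a single person.
    return 1
```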
2.4 Post-processing
After clustering, each detected facial image is labeled as belonging to one of the k clusters found. Nevertheless, some errors can occur: on the one hand, the algorithm could create very small clusters, for example in correspondence with one or more false positive facial images returned by the Viola-Jones detector. On the other hand, a segment could be split when the face detector misses some detections. To overcome these problems, and hence to correctly determine the intervals of frames in which each person appears in the video, a proper post-processing step is introduced. It operates in a twofold manner (at the cluster level and, for a given cluster, at the segment level), as follows (a code sketch of these rules is given after the list):
1. a cluster is considered consistent if it corresponds to a person that is present in the scene for at least 4 seconds; all inconsistent clusters are removed;
2. two segments whose temporal distance is lower than 1.2 seconds are merged;
3. if a segment lasts less than 1.2 seconds and its neighboring segments are more than 1.2 seconds away (in frames), it is dropped from the segment list.
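A minimal sketch of these three rules follows, assuming segments are represented as (start_frame, end_frame) pairs grouped per cluster; the function name and the data layout are illustrative, not taken from the paper.

```python
def postprocess(clusters, fps):
    """Apply the three rules to a dict mapping cluster id -> list of
    (start_frame, end_frame) segments. The 4 s and 1.2 s constants
    come from the paper; names and layout are an assumed representation."""
    min_presence = 4 * fps        # rule 1: at least 4 s in the scene
    gap = int(round(1.2 * fps))   # rules 2 and 3: 1.2 s, in frames

    result = {}
    for cid, segments in clusters.items():
        if not segments:
            continue
        segments = sorted(segments)

        # Rule 2: merge segments whose temporal distance is below 1.2 s.
        merged = [segments[0]]
        for start, end in segments[1:]:
            if start - merged[-1][1] < gap:
                merged[-1] = (merged[-1][0], end)
            else:
                merged.append((start, end))

        # Rule 3: after merging, every remaining neighbour is at least
        # 1.2 s away, so short segments can simply be dropped.
        kept = [(s, e) for s, e in merged if e - s >= gap]

        # Rule 1: remove inconsistent clusters (less than 4 s overall).
        if kept and sum(e - s for s, e in kept) >= min_presence:
            result[cid] = kept
    return result
```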
3 EXPERIMENTAL RESULTS
The proposed framework has been tested on several videos. The videos differ in the number of people, the recurrence of people, lighting conditions, camera resolution, camera movements (quasi-static or continuously moving, as in the case of a mobile phone in the user's hand) and acquisition environment (indoor or outdoor). Each video was first processed by the face detector; the facial images were then scaled, radiometrically equalized and finally projected, by Principal Component Analysis, onto a feature space in which the component with the greatest variance lies on the first axis, the second greatest on the second axis, and so on. At this point, for each video, the minimum number of components to be retained for further processing was set as the one able to preserve at least 95% of the total variance of the data. For example, for the fourth video, the first 100 components exceed the threshold and are selected, as shown in Fig. 3. The reduced data are finally given as input to the generalized version of the k-means algorithm that, by evaluating a set of statistical indices, provides the expected outcomes (i.e. the number of people and the intervals of frames in which each person appears).
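The variance-retention step can be reproduced directly with scikit-learn, which keeps the smallest number of leading components reaching a given variance fraction when `n_components` is passed as a float. A brief sketch, assuming the equalized face crops have already been flattened into a one-row-per-face matrix:

```python
from sklearn.decomposition import PCA

# faces: (n_faces, n_pixels) matrix of scaled, radiometrically
# equalized facial images, one flattened crop per row (assumed
# preprocessing, as described in the text above).
def project_faces(faces):
    # n_components=0.95 retains the smallest number of leading
    # components that preserve at least 95% of the total variance.
    pca = PCA(n_components=0.95, svd_solver="full")
    reduced = pca.fit_transform(faces)
    print(f"retained {pca.n_components_} components "
          f"({pca.explained_variance_ratio_.sum():.1%} of variance)")
    return reduced
```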
In the first experimental phase, the ability of the proposed framework to correctly detect the number of persons in the videos is tested. Table 1 reports the detailed results obtained for the videos processed in this phase. Each row lists a short description of the video (environment conditions, i.e. indoor/outdoor; acquisition device, i.e. mobile phone/camera; camera movements, i.e. M if the camera is in the hands of the operator and thus continuously moves during recording, or S if the camera is quasi-static), the spatial resolution of the acquired images, the length of the video (in frames), the temporal resolution (fps), the total number of people appearing in the video and, in the last column, the number of people actually detected by the proposed algorithm.
For the videos in rows 1-3, the proposed approach correctly detects the number of people that appear in them. In particular, the first one is a video of size 1440 × 1080, with a frame rate of 30 fps and 3600 frames. There are 8 persons, each one occurring once in the video. The video was acquired by a cam-