[Figure 6 plot: classification error (y-axis) vs. number of neurons in the hidden layer (x-axis).]
Figure 6: Classification error of the NN classifier for various parameter configurations, during validation stage.
[Figure 7 plot: classification error (y-axis) vs. dimension count (x-axis), for linear, polynomial, radial, and sigmoid kernels.]
Figure 7: Classification error of the SVM classifier for various parameter configurations, during validation stage.
nodes. Although it is impossible to assess the role of each neuron, and thus to provide a solid explanation for the correlation between the number of neurons and the performance of the network, one can argue that the optimal size of the hidden layer is influenced by the number of relevant features in the data, similarly to the minimum number of dimensions that yields reasonably good results. Should that be the case, the activation of each neuron would be more heavily influenced by one of these implicit relevant features.
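To illustrate this validation step, the following minimal sketch sweeps the hidden layer size over the range shown in Figure 6 and records the validation error. It assumes scikit-learn's MLPClassifier and hypothetical arrays X_train, y_train, X_val, y_val holding HoG features and labels; the paper does not specify its actual implementation.

# Sketch: sweep the hidden layer size and record the validation error.
# X_train, y_train, X_val, y_val are hypothetical HoG feature arrays
# and label vectors, not taken from the paper.
from sklearn.neural_network import MLPClassifier

def validate_hidden_sizes(X_train, y_train, X_val, y_val,
                          sizes=range(10, 101, 10)):
    errors = {}
    for n in sizes:
        clf = MLPClassifier(hidden_layer_sizes=(n,),
                            max_iter=500, random_state=0)
        clf.fit(X_train, y_train)
        # Classification error = 1 - accuracy on the validation set.
        errors[n] = 1.0 - clf.score(X_val, y_val)
    return errors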
SVM Validation. For the SVM classifier, our initial
intention was to also employ dimensionality reduc-
tion on the features, to obtain faster training times.
However, after assessing the performance for various
dimensions, as shown in Figure 7, and considering
manageable training durations, we decided to use all
2268 HoG dimensions for the SVM classifier.
The plot from Figure 7 shows the evolution of the SVM classification error for various dimensions and several kernel functions; the best performing kernel proved to be the linear one.
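The sketch below reproduces the shape of this experiment under the same assumptions as above: PCA reduces the 2268-dimensional HoG features to each candidate dimension, and an SVM is validated with each kernel ('rbf' standing in for the radial kernel). scikit-learn is assumed; the original implementation is not specified.

# Sketch: validation error of the SVM for several kernels and
# PCA-reduced dimensions of the 2268-dimensional HoG features.
from sklearn.decomposition import PCA
from sklearn.svm import SVC

def validate_svm(X_train, y_train, X_val, y_val,
                 dims=(100, 500, 1000, 1500, 2000, 2268),
                 kernels=('linear', 'poly', 'rbf', 'sigmoid')):
    errors = {}
    for d in dims:
        if d < X_train.shape[1]:
            pca = PCA(n_components=d).fit(X_train)
            Xt, Xv = pca.transform(X_train), pca.transform(X_val)
        else:
            Xt, Xv = X_train, X_val  # keep all HoG dimensions
        for k in kernels:
            clf = SVC(kernel=k).fit(Xt, y_train)
            errors[(d, k)] = 1.0 - clf.score(Xv, y_val)
    return errors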
4.1 Dataset Description
During the training of the classifiers we used several datasets, in order to cover a greater variety of appearances. This, in turn, helps the classifiers generalize beyond the training data and better exploit the existing patterns. Some characteristics of the datasets used during training are given in Table 1.
For testing the proposed method, we used video sequences from the Collective Activity dataset (Choi et al., 2011), which depict multiple human targets moving freely in an urban environment. Ground truth annotations are available once every 10 frames.
4.2 Results and Discussion
The results of the experiments to evaluate the perfor-
mance of our method are presented in Table 2.
Overall, the performances of the individual classifiers vary to some extent. This variation is what allows a combination of classifiers to yield better results. A certain dependence on the video sequence can also be observed, as all the classifiers obtained better results on Seq 42 than on Seq 15. Since these classifiers take into consideration only the visual appearance of the targets, modeled by the HoG descriptors, the most plausible explanation for this behaviour is that the targets from Seq 42 more closely resemble the targets used for training the classifiers. This visual resemblance can in turn be explained by a closer similarity of the camera angle at which the images were captured, as well as a similarity of the image resolution.
The error obtained by combining the responses of multiple classifiers proved to be lower than that of the individual responses. Thus, in the case of Seq 42, all the combined responses yielded better results than the individual ones. As expected, the combinations including the more robust classifiers, such as GMM+NN, outperform those including the lower performing ones, such as GMM+SVM. In the case of Seq 15, the markedly poorer result of the SVM classifier has a detrimental impact on the combined responses. Thus, only the GMM+NN combination performs better than any of its components, all the others being roughly similar to or even worse than the individual components.
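One plausible fusion rule for such combinations, averaging the normalized per-class scores of the individual classifiers, is sketched below; the paper's exact combination scheme is not restated here, and all names are hypothetical.

import numpy as np

# Sketch: fuse per-class scores from several classifiers by averaging.
# `scores` maps a classifier name to an array of shape
# (n_samples, n_classes) holding normalized class scores.
def combine_responses(scores):
    fused = np.mean(np.stack(list(scores.values())), axis=0)
    return np.argmax(fused, axis=1)  # predicted class per sample

# Hypothetical usage for the GMM+NN combination:
# labels = combine_responses({'gmm': gmm_scores, 'nn': nn_scores})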
The second goal of these experiments was to assess the impact of the individual cues considered. The performance of the method when only the velocity cue is used proves to be better than the response of any of the individual or combined HoG-based classifiers on the considered video sequences, Seq 15 and Seq 42, thus highlighting the importance of this additional cue. However, one might expect that for video sequences in which the targets are mostly stationary, the velocity cue would provide less information and thus yield poorer results. The next configuration tested was the combination of the response of