The best ratio of combined channels {c_k, c_l, ..., c_z} ⊆ C for a given training set is estimated using a coordinate descent method. The set of input channels has to be specified outside of the training process. Besides this, the SVM building procedure requires a number of input parameters that affect the classifier accuracy; these parameters are evaluated automatically using the cross-validation approach (Hsu et al., 2010). Within the whole procedure, the classifier creation process can therefore be treated as a black-box unit which, for a given input, automatically creates the best-performing classifier.
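The text does not give the details of this search; a minimal sketch of what such a coordinate descent over channel weights can look like, using scikit-learn's precomputed-kernel SVM, is shown below. The weight grid, the number of sweeps, and the `cv_accuracy` helper are illustrative assumptions, not the authors' implementation.

```python
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

def cv_accuracy(kernels, weights, labels, C=1.0):
    """Cross-validated accuracy of an SVM on the weighted kernel sum."""
    K = sum(w * k for w, k in zip(weights, kernels))  # combined kernel matrix
    svm = SVC(kernel="precomputed", C=C)
    return cross_val_score(svm, K, labels, cv=3).mean()

def coordinate_descent(kernels, labels,
                       grid=(0.0, 0.25, 0.5, 0.75, 1.0), sweeps=3):
    """Optimize one channel weight at a time, keeping the others fixed,
    stopping early when a full sweep brings no improvement."""
    weights = [1.0] * len(kernels)            # start with all channels equal
    best = cv_accuracy(kernels, weights, labels)
    for _ in range(sweeps):
        improved = False
        for i in range(len(weights)):         # one coordinate per channel
            for w in grid:
                trial = list(weights)
                trial[i] = w
                acc = cv_accuracy(kernels, trial, labels)
                if acc > best:
                    best, weights, improved = acc, trial, True
        if not improved:
            break
    return weights, best
```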
3 EXPERIMENTS
The purpose of the experiments performed in the presented work was to determine the minimum length of a video shot containing the action to be recognized for which the recognition accuracy is still comparable to the state-of-the-art solutions. Such experiments correspond to real-time search in video or, for example, to a situation where a video recording is searched for the action at positions determined by the application (e.g. randomly selected ones).
We used the pipeline presented in Section 2 and the Hollywood2 dataset presented in Section 1.1 (Marszalek et al., 2009). As mentioned earlier, this dataset contains twelve action classes from Hollywood movies, namely: answering the phone, driving a car, eating, fighting, getting out of the car, hand shaking, hugging, kissing, running, sitting down, sitting up, and standing up.
We investigated the behavior of the recognition algorithm by presenting it with pieces of video taken at randomly selected positions inside the actions. In other words, the actions were known to start before the beginning of the processed piece of video and to end only after its end. For this purpose, we had to re-annotate the Hollywood2 dataset (all three of its parts: train, autotrain, and test) to obtain precise beginning and ending frames of the actions.
In our experiments, we aimed to capture the dependency between the length of the video shot given as input to the processing and the accuracy of the output. We set the minimum shot length to 5 frames, more precisely to the 5 frames from which the space-time point features are extracted. The maximum shot length was set to 100 frames and the step was set to 5 frames.
The space-time feature extractor processes N preceding and N subsequent frames of the video sequence in order to evaluate the points of interest for a single frame. Therefore, 2*N should be added to every figure concerning the number of frames to obtain the total number of frames of the video sequence that have to be processed. In our case, N was equal to 4, so that, for example, the 5 frames reported in Figure 3 correspond to 13 frames of the video.
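As a trivial illustration of this bookkeeping, the following sketch uses the values stated above (N = 4, shot lengths from 5 to 100 frames in steps of 5):

```python
N = 4  # frames the extractor needs on each side of the shot

def total_frames(shot_length, n=N):
    """Total video frames consumed for one shot: the shot itself
    plus N frames before it and N frames after it."""
    return shot_length + 2 * n

shot_lengths = range(5, 105, 5)            # 5, 10, ..., 100 frames
print(total_frames(5), total_frames(100))  # 13 108
```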
A classifier was constructed for every considered video shot length. The training samples were obtained from the training part of the dataset in the following way: using the annotated start and stop positions of the currently processed sample, a large number of randomly selected subshots was extracted. The training dataset has 823 video samples in total, and from each sample we extracted 6 subshots on average.
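A possible way to draw such random subshots from an annotated action interval is sketched below; the function and variable names are illustrative, not the authors' code.

```python
import random

def sample_subshots(action_start, action_end, shot_len, count=6, seed=None):
    """Draw `count` random subshots of `shot_len` frames lying entirely
    inside the annotated action interval [action_start, action_end]."""
    rng = random.Random(seed)
    last_start = action_end - shot_len + 1    # last valid starting frame
    if last_start < action_start:
        return []                             # action too short for shot_len
    starts = [rng.randint(action_start, last_start) for _ in range(count)]
    return [(s, s + shot_len - 1) for s in starts]

# e.g. six 20-frame subshots inside an action spanning frames 100..400
print(sample_subshots(100, 400, 20, seed=0))
```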
The actual evaluation of the classifier was performed four times in order to obtain information about the reliability of the solution. Moreover, the above-mentioned publications used the 823 samples for evaluation purposes, and we wanted our results to be directly comparable. The results shown in Table 1 and Figure 3 present the average of the results of the four runs. For this purpose, we randomly determined the position of the starting frame of a testing subshot within each testing sample four times. This approach brings two benefits: the accuracy of the final solution can be measured using the average precision metric, and the results obtained through the testing can be directly compared to the published state-of-the-art solutions. The results were compared with the accuracy achieved on video sequences of completely unrestricted length, which is close to the state of the art (Reznicek and Zemcik, 2013).
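A sketch of this four-run protocol for a single (one-vs-rest) class might look as follows; `classify` and the annotation fields on the samples are placeholders for the pipeline described above, not part of the original code.

```python
import numpy as np
from sklearn.metrics import average_precision_score

def evaluate_runs(classify, test_samples, shot_len, runs=4, seed=0):
    """Average precision for one class, averaged over `runs` evaluations
    with freshly randomized test-subshot starting positions."""
    rng = np.random.default_rng(seed)
    aps = []
    for _ in range(runs):
        scores, labels = [], []
        for sample in test_samples:
            # random start so the subshot lies inside the annotated action
            start = rng.integers(sample.action_start,
                                 sample.action_end - shot_len + 2)
            scores.append(classify(sample, start, shot_len))  # SVM score
            labels.append(sample.label)                       # 1 or 0
        aps.append(average_precision_score(labels, scores))
    return float(np.mean(aps))
```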
The parameters for feature processing and classification were as follows: the tested feature extractor was the dense trajectories extractor (Wang et al., 2011), which produces four types of descriptors, namely HOG, HOF, DT, and MBH. These four feature vectors were used separately. For each descriptor, a vocabulary of 4000 words was built using the k-means method, and the bag-of-words representation was produced with the following parameters: σ = 1, and the number of closest vectors searched was 16; these values and the codebook size were evaluated in (Reznicek and Zemcik, 2013) and are suitable for bag-of-words creation from space-time low-level features. In the multi-kernel SVM creation process, all four channels (the bag-of-words representations of the HOG, HOF, DT, and MBH descriptors) are combined together; no search for a better combination is performed.
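The exact soft-assignment weighting is not spelled out in the text; one common variant consistent with the stated parameters (σ = 1, 16 nearest codewords, a 4000-word vocabulary) is sketched below as an assumption.

```python
import numpy as np

def soft_bow(features, codebook, sigma=1.0, knn=16):
    """Soft-assignment bag-of-words: each local descriptor votes for its
    `knn` nearest codewords with Gaussian weights exp(-d^2 / (2*sigma^2))."""
    hist = np.zeros(len(codebook))
    for f in features:
        d = np.linalg.norm(codebook - f, axis=1)   # distances to all words
        nearest = np.argsort(d)[:knn]              # 16 closest codewords
        w = np.exp(-d[nearest] ** 2 / (2 * sigma ** 2))
        hist[nearest] += w / w.sum()               # normalized soft votes
    return hist / max(len(features), 1)            # average over descriptors
```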
The evaluation procedure described above was repeated for every class contained in the Hollywood2 dataset. For each class, we present the graph of