thresholding a feature vector that combines the number
of keypoints of a frame and the number of matches
between two consecutive frames. The keyframe of a
shot is the one that maximizes the number of
keypoints since, according to the authors, it is the
one that contains the most information. In this paper
we considered our implementation of (Liu et al., 2009),
with two differences with respect to the original
algorithm:
- we used the SURF algorithm rather than SIFT, for
computational convenience (see the sketch after this
list);
- we introduced our shot selection step (see par. 2.3),
which selects the most significant shot of the clip
(w.r.t. the keyword), between the video segmentation
and keyframe extraction steps of the reference
algorithm, since it has no counterpart in the original
algorithm.
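For concreteness, the following is a minimal sketch of this keypoint-based pipeline in Python with OpenCV, assuming a build that includes the non-free SURF module (cv2.xfeatures2d). The boundary rule, threshold values, and helper names are illustrative choices of ours; the exact feature-vector thresholding of (Liu et al., 2009) is not reproduced here.

```python
# Sketch of keypoint-based shot segmentation and keyframe selection.
# Assumes opencv-contrib-python built with the non-free SURF module.
import cv2

surf = cv2.xfeatures2d.SURF_create(hessianThreshold=400)
matcher = cv2.BFMatcher(cv2.NORM_L2)

def read_frames(path):
    """Yield grayscale frames from a video file."""
    cap = cv2.VideoCapture(path)
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        yield cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    cap.release()

def segment_and_pick(path, match_ratio=0.3):
    """Cut a shot when consecutive frames share too few SURF matches
    (an illustrative stand-in for the paper's feature-vector threshold),
    then return, for each shot, the frame with the most keypoints."""
    keyframes, best, prev_desc, in_shot = [], None, None, False
    for frame in read_frames(path):
        kps, desc = surf.detectAndCompute(frame, None)
        n_kp, n_match = len(kps), 0
        if prev_desc is not None and desc is not None:
            # ratio-test matching between consecutive frames
            pairs = matcher.knnMatch(prev_desc, desc, k=2)
            n_match = sum(1 for p in pairs
                          if len(p) == 2 and p[0].distance < 0.75 * p[1].distance)
        if in_shot and n_match < match_ratio * n_kp:
            keyframes.append(best[0])   # close the shot; keep its keyframe
            best, in_shot = None, False
        if best is None or n_kp > best[1]:
            best = (frame, n_kp)        # frame maximizing the keypoint count
        in_shot = True
        prev_desc = desc
    if best is not None:
        keyframes.append(best[0])
    return keyframes
```

The keyframe of each shot is then the frame that maximizes the number of detected interest points, as in the reference method.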
Furthermore, we took into account for comparison
the central frame of the selected shot, but we
decided not to show those results, as we observed
that they are very similar to the ones obtained with
the reference method in terms of the metrics
described in Section 3.3. We also studied the
algorithm proposed in (Guan et al., 2013), but it is
much slower than the chosen reference method, so we
did not consider it, for efficiency reasons.
3.3 Evaluation Metrics
In our tests we were not interested in evaluating
the performance of the video segmentation part of
the algorithm separately, but rather that of the whole
keyframe extraction process. Since it is impossible
to define an objective metric to evaluate the
performance of a keyframe extraction method, we
adopted a subjective, comparative approach. We
asked 5 testers to evaluate the “proposed” keyframe
in comparison, separately, with the “central” and the
“reference” keyframes, in terms of Significance and
Quality. A keyframe is more significant than another
if its visual content is more representative of the
input keyword. The Quality concept is highly
subjective and involves many aspects, but a blurred
or motion-blurred frame is typically considered a
poor-quality frame.
regard to the Significance evaluation, the testers
have three options:
1. frame F1 is more significant than frame F2;
2. frame F2 is more significant than frame F1;
3. frames F1 and F2 are equally significant;
and the additional option:
4. none of the frames is significant.
If more than half of the testers select this last
option, both keyframes are labeled as insignificant.
With regard to the Quality evaluation,
the testers have three options:
1. frame F1 has better quality than frame F2;
2. frame F2 has better quality than frame F1;
3. frames F1 and F2 have the same quality.
For each test, decisions are taken by majority vote.
In case of a draw between options 1 and 2, the two
frames are considered equally significant (or of the
same quality). In case of a draw between option 1
(or 2) and option 3, option 1 (or 2) wins.
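These rules can be expressed compactly; below is a minimal sketch in Python, where the numeric option codes follow the lists above, and the treatment of minority “option 4” votes (simply discarding them before the majority count) is our assumption, not stated in the protocol.

```python
# Sketch of the majority-vote aggregation with the tie-breaking rules above.
from collections import Counter

def aggregate(votes):
    """votes: one option code per tester, e.g. [1, 3, 2, 1, 1].
    1 = F1 wins, 2 = F2 wins, 3 = equally significant / same quality."""
    counts = Counter(votes)
    top = max(counts.values())
    leaders = sorted(opt for opt, c in counts.items() if c == top)
    if len(leaders) == 1:
        return leaders[0]
    if 1 in leaders and 2 in leaders:
        return 3            # draw between 1 and 2 -> frames considered equal
    return min(leaders)     # draw of 1 (or 2) with 3 -> 1 (or 2) wins

def significance(votes):
    """Adds option 4 (neither frame is significant): if more than half of
    the testers pick it, both keyframes are labeled insignificant."""
    if votes.count(4) > len(votes) / 2:
        return "insignificant"
    return aggregate([v for v in votes if v != 4])  # assumption: drop 4s
```

For example, with 5 testers, aggregate([1, 3, 3, 1, 2]) returns 1, since the draw between options 1 and 3 is resolved in favour of option 1.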
3.4 Experimental Results
Table 1 shows the results obtained for the different
domains and languages. The first result is that many
retrieved clips (about 50%) contain information that
the testers evaluated as not significant with respect
to the input keyword. This is typically the case of a
person speaking about “something” without showing
it; in this case the extracted frame is not relevant
for the input keyword. In our tests we measured the
Significance metric by comparing only frames that are
part of significant clips, while we compared all the
retrieved frames in terms of Quality.
Analyzing only the significant clips, we observed
that in many cases all the methods give the same
result. Indeed, when the retrieved caption is related
to a single-shot sequence (the most frequent case),
all the frames of the sequence have the same visual
content, and selecting one frame or another makes
little difference, especially in terms of Significance.
Moreover, our method and the reference one use a
similar video segmentation algorithm, and the shot
selection step is the same in both algorithms. Thus
the two methods select, in almost all cases, the same
shot.
In terms of Quality, our method achieves better
results, especially for the “recipes” domain. This
means that selecting the frame with the highest
informative content (the largest number of interest
points) does not ensure the choice of a good-quality
frame, which is, in our opinion, a very important
feature for better understanding the content of an
image. With respect to the “central” method, the
improvements of our method are more evident, both
in terms of Significance and Quality. In fact, the
selection of the “central” frame is almost a random
choice, and there is no guarantee that the selected
frame is correlated to the input keyword, nor that it
is a good-quality frame. We observed no relevant
differences between the results obtained using the
Italian and English languages.