
thresholding a feature vector that is a combination of 
the number of keypoints of a frame and the number of 
matches between two consecutive frames. The 
keyframe of a shot is the one that maximizes the 
number of keypoints as, according to the authors, it 
is the one that contains the maximum information. In 
this paper we used our own implementation of 
(Liu et al., 2009), with two differences with respect to 
the original algorithm: 
•  we used the SURF algorithm, rather than SIFT, for 
computational convenience; 
•  we introduced our shot selection step (see par. 
2.3) between the video segmentation and keyframe 
extraction steps of the reference algorithm, to 
select the most significant shot of the clip 
(w.r.t. the keyword); this step has no counterpart 
in the original algorithm. 
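The keyframe rule of the reference method (within a shot, pick the frame with the largest number of keypoints) can be sketched as follows. This is only a minimal illustration: it assumes that the per-frame keypoint counts have already been computed by some detector (e.g. SURF), and the function name, the input layout and the tie-breaking rule are hypothetical, not taken from (Liu et al., 2009).

```python
def select_keyframe(keypoint_counts):
    """Return the index of the frame with the most keypoints.

    keypoint_counts: list of ints, one per frame of a single shot
    (hypothetical input; the keypoint detector is not shown here).
    Ties go to the earliest frame, which is an assumption of this sketch.
    """
    if not keypoint_counts:
        raise ValueError("the shot contains no frames")
    # max() returns the first index that reaches the maximum count.
    return max(range(len(keypoint_counts)), key=keypoint_counts.__getitem__)


counts = [120, 340, 335, 90]    # illustrative values, not measured data
print(select_keyframe(counts))  # prints 1
```

The sketch returns a frame index rather than the frame itself, so that it stays independent of how frames are decoded and stored.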
Furthermore, we also considered for comparison 
the central frame of the selected shot, but we 
decided not to show the results, as we observed that 
they are very similar to those obtained with the 
reference method in terms of the metrics described 
in section 3.3. We also studied the algorithm 
proposed in (Guan et al., 2013), but it is much 
slower than the chosen reference method, so we 
did not consider it, for efficiency reasons. 
3.3 Evaluation Metrics 
In our tests we were not interested in evaluating 
the performance of the video segmentation part of 
the algorithm on its own, but that of the whole 
keyframe extraction process. Since it is impossible 
to define an objective metric to evaluate the 
performance of a keyframe extraction method, we 
adopted a subjective comparative approach. We 
asked 5 testers to evaluate the “proposed” 
keyframe in comparison, separately, with the 
“central” and the “reference” keyframes, in terms of 
Significance and of Quality. A keyframe is more 
significant than another if its visual content is more 
representative of the input keyword. The Quality 
concept is highly subjective and involves many 
aspects, but a blurred or motion-blurred frame 
is typically considered a poor-quality frame. With 
regard to the Significance evaluation, the testers 
have three options:  
1.  frame F1 is more significant than frame F2;  
2.  frame F2 is more significant than frame F1; 
3.  frames F1 and F2 are equally significant; 
and the additional option:  
4.  none of the frames is significant.  
If more than half of the testers select this last 
option, both keyframes are labeled as 
insignificant.  With regard to the Quality evaluation, 
the testers have three options: 
1.  frame F1 has better quality than frame F2;  
2.  frame F2 has better quality than frame F1; 
3.  frames F1 and F2 have the same quality. 
For each test the decision is taken by majority. 
In case of a tie between options 1 and 2, the two 
frames are considered equally significant (or of the 
same quality). In case of a tie between option 1 
(or 2) and option 3, option 1 (or 2) wins. 
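The voting rules above can be sketched in a few lines of code. This is a minimal illustration, not the procedure actually used in the tests: the function and constant names are hypothetical, "majority" is read here as the most voted option, option-4 votes that do not reach a strict majority are ignored, and a three-way tie is treated as "equal". The last three points are assumptions that the text does not settle.

```python
from collections import Counter

# Option codes as numbered in the text (option 4 exists only for Significance).
F1_BETTER, F2_BETTER, EQUAL, NONE_SIGNIFICANT = 1, 2, 3, 4


def aggregate_votes(votes):
    """Combine the testers' votes for one pair of frames into one decision."""
    if not votes:
        raise ValueError("at least one vote is required")
    counts = Counter(votes)
    # Option 4 wins only if more than half of the testers chose it.
    if counts[NONE_SIGNIFICANT] > len(votes) / 2:
        return NONE_SIGNIFICANT
    # Assumption: minority option-4 votes are ignored from here on.
    options = (F1_BETTER, F2_BETTER, EQUAL)
    top = max(counts[o] for o in options)
    leaders = {o for o in options if counts[o] == top}
    # Tie between options 1 and 2: the frames are considered equal.
    if F1_BETTER in leaders and F2_BETTER in leaders:
        return EQUAL
    # Tie between option 1 (or 2) and option 3: option 1 (or 2) wins.
    if F1_BETTER in leaders:
        return F1_BETTER
    if F2_BETTER in leaders:
        return F2_BETTER
    return EQUAL


print(aggregate_votes([1, 1, 3, 3, 2]))  # prints 1
```

The same function covers the Quality evaluation, where option 4 simply never appears among the votes.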
3.4 Experimental Results 
Table 1 shows the results obtained for the different 
domains and the different languages. The first result 
is that many of the retrieved clips (about 50%) contain 
content that the testers judged as not significant 
for the input keyword. This is 
typically the case of a person speaking about 
“something” without showing it; in 
this case, the extracted frame is not relevant to the 
input keyword. In our tests we measured the 
Significance metric comparing only frames that are 
part of significant clips, while we compared all the 
retrieved frames in terms of Quality. Analyzing only 
the significant clips, we observed that in many cases 
all the methods give the same results. In fact, when 
the retrieved caption is related to a single-shot 
sequence (the most frequent case), all the frames of 
the sequence have the same visual content, and 
choosing one frame rather than another makes little 
difference, mainly in terms of Significance. Moreover, 
our method and the reference one use a similar video 
segmentation algorithm, and the shot 
selection step is the same in both algorithms. 
Thus the two methods select, in almost all cases, the 
same shot. 
In terms of Quality, our method achieves better 
results, above all for the “recipes” domain. This 
means that selecting the frame which has the 
highest informative content (the largest number of 
interest points) does not ensure the choice of a 
good-quality frame, which is, in our opinion, a very 
important feature for better understanding the content 
of an image. With respect to the “central” method, the 
improvements of our method are more evident, both 
in terms of Significance and Quality. In fact the 
selection of the “central” frame is almost a random 
choice, and there is no guarantee that the selected 
frame is correlated to the input keyword, nor that the 
frame is a “good quality” frame. We observed no 
relevant differences between the results obtained for 
Italian and English. 
ICPRAM 2015 - International Conference on Pattern Recognition Applications and Methods