Table 1 shows the relative increase in both measures for Sets 2-5 compared to Set 1. A general improvement is clearly visible, in certain cases reaching almost 18%, although there are significant differences between the various classes. For the Swimming Pool class, for example, which contains shots of swimming and water polo, motion information gives a powerful boost; for the Vehicles class, on the other hand, which is extremely varied and is not characterized by any specific motion pattern, the overall increase is small and the increases from individual features are insignificant. Similar explanations can be given for all entries of Table 1. Overall, however, the addition of motion information expressed through our descriptors has clearly boosted the classification performance of the system.
5 CONCLUSIONS AND FUTURE WORK
We have demonstrated the ability of motion
descriptors to boost classification performance
compared to a system based solely on spatial
descriptors. It can be observed, however, that not all
descriptors are equally effective for all classes. More
in-depth study would help clarify the strengths and
limitations of each motion descriptor presented here.
Furthermore, the descriptors presented here
could be enhanced in various ways. A dense motion
vector field could transform the PMES into a longer
and more powerful descriptor which would give a
detailed map of foreground activity. It would also
allow the local BoMF patches to be expanded
beyond size 2×2, thus giving a more detailed
description of local motion. An increase in the number of raw motion vectors would result in an increase in the number of patches; as a result, salient feature detection could also be performed on the motion patches prior to local descriptor extraction. In such an approach, the tradeoff between any gain in performance and the cost of no longer being able to take direct advantage of MPEG-encoded vector fields should be taken into account.
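The patch-based direction discussed above can be illustrated with a minimal sketch. The function name, array layout, and patch handling below are our own assumptions for illustration, not the actual BoMF implementation: a dense motion vector field is assumed to be an H×W×2 array of (dx, dy) vectors, split into non-overlapping local patches that are flattened into descriptor vectors.

```python
import numpy as np

def extract_motion_patches(field, patch_size=2):
    """Split a dense motion vector field (H x W x 2, one (dx, dy)
    vector per position) into non-overlapping patches, each
    flattened into a local descriptor vector (BoMF-style sketch).

    A denser field yields more patches; a larger patch_size
    yields longer, more detailed local descriptors."""
    h, w, _ = field.shape
    patches = []
    for y in range(0, h - patch_size + 1, patch_size):
        for x in range(0, w - patch_size + 1, patch_size):
            patch = field[y:y + patch_size, x:x + patch_size, :]
            patches.append(patch.ravel())  # e.g. 2x2 patch -> 8-dim vector
    return np.array(patches)
```

With a 4×4 field and the default 2×2 patch size, this yields four 8-dimensional local vectors; increasing the field density or the patch size directly controls the number and length of the patches, which is the tradeoff discussed above.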
Finally, rotation invariance should be tried for the BoMF descriptor and compared to its current non-invariant design, while a reduction in the temporal distance used during motion field extraction would lead to a more detailed DDH descriptor, possibly also allowing a narrowing of the angle bins so as to extend the descriptor length. Furthermore, the DDH descriptor could be made more descriptive by transforming it from a simple histogram of the dominant vector directions into a histogram of actual camera activity, as derived from the motion vector field (Tan, Saur, Kulkarni and Ramadge, 2000).
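The relation between angle-bin width and descriptor length mentioned above can be sketched as follows. This is a simplified stand-in, not the actual DDH computation; the function name and normalization are our own assumptions, and the input is assumed to be an N×2 array of (dx, dy) motion vectors.

```python
import numpy as np

def direction_histogram(vectors, n_bins=8):
    """Normalized histogram of motion vector directions over
    [0, 2*pi), a simplified DDH-style sketch. Narrowing the
    angle bins (raising n_bins) extends the descriptor length."""
    dx, dy = vectors[:, 0], vectors[:, 1]
    angles = np.arctan2(dy, dx) % (2 * np.pi)  # map directions to [0, 2*pi)
    hist, _ = np.histogram(angles, bins=n_bins, range=(0.0, 2 * np.pi))
    total = hist.sum()
    return hist / total if total > 0 else hist.astype(float)
```

For instance, with n_bins=8 each bin covers 45 degrees, so all purely rightward vectors fall into the first bin; doubling n_bins to 16 halves the bin width and doubles the descriptor length.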
These further steps in our research would both
lead to the formation of more expressive and
powerful descriptors for automatic video indexing,
and grant us valuable insights into the ways motion patterns differ between video shots of different semantic content. It is our hope that research on the
semantic indexing and retrieval of video shots can
eventually reach a point where motion will be
considered as important as spatial information for
the success of these tasks.
REFERENCES
Bovik, A. C., Clark, M., Geisler, W. S. (1990).
Multichannel Texture Analysis Using Localized
Spatial Filters. IEEE Transactions on Pattern Analysis
and Machine Intelligence, 12, 55-73.
Canny, J. (1986). A computational approach to edge
detection. IEEE Transactions on Pattern Analysis and
Machine Intelligence, 8, 679–698.
Cao, J., Lan, Y., Li, J., Li, Q., Li, X., Lin, F., Liu, X., Luo,
L., Peng, W., Wang, D., Wang, H., Wang, Z., Xiang,
Z., Yuan, J., Zheng, W., Zhang, B., Zhang, J., Zhang,
L., Zhang, X. (2006). Intelligent Multimedia Group of
Tsinghua University at TRECVID 2006. In Proc. 2006
NIST TREC Video Retrieval Evaluation Workshop.
Hao, S., Yoshizawa, Y., Yamasaki, K., Shinoda, K., Furui,
S. (2008). Tokyo Tech at TRECVID 2008. In Proc.
2008 NIST TREC Video Retrieval Evaluation
Workshop.
Huang, J., Kumar, S. R., Mitra, M., Zhu, W. J., Zabih, R.
(1999). Spatial Color Indexing and Applications.
International Journal of Computer Vision, 35, 245-
268.
Jain, A. K., Vailaya, A. (1996). Image retrieval using color
and shape. Pattern Recognition, 29, 1233-1244.
Jurie, F., Triggs, B. (2005). Creating Efficient Codebooks
for Visual Recognition, Proc. ICCV '05, 10th IEEE
International Conference on Computer Vision, 1, 604-610.
Liu, A., Tang, S., Zhang, Y., Song, Y., Li, J., Yang, Z.
(2007). TRECVID 2007 High-Level Feature
Extraction By MCG-ICT-CAS. In Proc. 2007 NIST
TREC Video Retrieval Evaluation Workshop.
Lowe, D. G. (2004). Distinctive Image Features from
Scale-Invariant Keypoints, International Journal of
Computer Vision, 60, 91–110.
Ma, Y. F., Zhang, H. J. (2001). A new perceived motion
based shot content representation. In Proc. ICIP '01, 8th IEEE International Conference on Image Processing, 3, 426-429.