Next Generation TV through Automatic Multimedia Annotation Systems - A Hybrid Approach

Joël Dumoulin, Marco Bertini, Alberto Del Bimbo, Elena Mugellini, Omar Abou Khaled, Maria Sokhn


After the advent of smartphones, it is time for television to see its next big evolution and become smart TV. To provide a richer television user experience, however, multimedia content first has to be enriched. In recent years, advances in technology have made it much easier to capture and store multimedia assets such as photographs and videos. The resulting growth in content makes multimedia retrieval harder, mainly because annotation systems and search engines lack methods that handle non-textual features. Moreover, multimedia sharing websites like Flickr or YouTube, together with the information provided by Wikipedia, offer a tremendous source of knowledge worth exploring. In this position paper, we address the problem of automatic multimedia annotation by proposing a hybrid approach. We want to use unsupervised methods to find relationships between multimedia elements, referred to as hidden topics, and then take advantage of social knowledge to label these relationships. The resulting enriched multimedia content will enable new user experiences for next generation television, for instance recommender systems that merge this information with user profiles and behavior analysis.
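
The labeling half of the hybrid idea can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes that some unsupervised model has already assigned each multimedia item to a hidden topic, and it uses a plain tag-frequency vote as a stand-in for "social knowledge". The function name `label_hidden_topics`, the parameter `top_k`, and the toy clips and tags are all hypothetical.

```python
from collections import Counter, defaultdict


def label_hidden_topics(topic_of_item, tags_of_item, top_k=2):
    """Label each hidden topic with the social tags most frequent among its items.

    topic_of_item: item id -> hidden topic id (assumed output of an unsupervised step)
    tags_of_item:  item id -> user-provided tags (assumed to come from a sharing site)
    """
    tag_counts = defaultdict(Counter)
    for item, topic in topic_of_item.items():
        # Count each tag once per item so one heavily tagged item cannot dominate.
        tag_counts[topic].update(set(tags_of_item.get(item, ())))

    labels = {}
    for topic, counts in tag_counts.items():
        # Rank by descending count, then alphabetically so ties are deterministic.
        ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
        labels[topic] = [tag for tag, _ in ranked[:top_k]]
    return labels


# Hypothetical toy data: four clips grouped into two hidden topics.
topic_of_item = {"clip1": 0, "clip2": 0, "clip3": 1, "clip4": 1}
tags_of_item = {
    "clip1": ["football", "goal"],
    "clip2": ["football", "stadium"],
    "clip3": ["cooking", "pasta"],
    "clip4": ["cooking"],
}
labels = label_hidden_topics(topic_of_item, tags_of_item)
print(labels)
```

Keeping the topic assignments as an input separates the two stages, so the unsupervised step and the source of social tags can each be swapped without touching the other.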


Paper Citation

in Harvard Style

Dumoulin J., Bertini M., Del Bimbo A., Mugellini E., Abou Khaled O. and Sokhn M. (2012). Next Generation TV through Automatic Multimedia Annotation Systems - A Hybrid Approach. In Proceedings of the International Conference on Signal Processing and Multimedia Applications and Wireless Information Networks and Systems - Volume 1: SIGMAP, (ICETE 2012) ISBN 978-989-8565-25-9, pages 192-197. DOI: 10.5220/0004128101920197
