ACTIVE, an Extensible Cataloging Platform for Automatic Indexing of Audiovisual Content

Maurizio Pintus, Maurizio Agelli, Felice Colucci, Nicola Corona, Alessandro Sassu, Federico Santamaria

2016

Abstract

Producing metadata manually is costly, especially for audiovisual content, where identifying the most appropriate annotations usually requires time-consuming inspection. Digital content industries increasingly need solutions that automate this process. In this work we present ACTIVE, a platform for indexing and cataloging audiovisual collections through the automatic recognition of faces and speakers. We describe the adopted algorithms and present our main contributions on people clustering and caption-based people identification. We report and analyze the results of experiments carried out on a set of TV shows and audio files. We also give an overview of the whole architecture, focusing on the solutions chosen to make the platform easily extensible (plug-ins) and to distribute CPU-intensive calculations across a network of computers.



Paper Citation


in Harvard Style

Pintus M., Agelli M., Colucci F., Corona N., Sassu A. and Santamaria F. (2016). ACTIVE, an Extensible Cataloging Platform for Automatic Indexing of Audiovisual Content. In Proceedings of the 11th Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications - Volume 4: VISAPP, (VISIGRAPP 2016) ISBN 978-989-758-175-5, pages 574-581. DOI: 10.5220/0005722205740581


in Bibtex Style

@conference{visapp16,
author={Maurizio Pintus and Maurizio Agelli and Felice Colucci and Nicola Corona and Alessandro Sassu and Federico Santamaria},
title={ACTIVE, an Extensible Cataloging Platform for Automatic Indexing of Audiovisual Content},
booktitle={Proceedings of the 11th Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications - Volume 4: VISAPP, (VISIGRAPP 2016)},
year={2016},
pages={574-581},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0005722205740581},
isbn={978-989-758-175-5},
}


in EndNote Style

TY - CONF
JO - Proceedings of the 11th Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications - Volume 4: VISAPP, (VISIGRAPP 2016)
TI - ACTIVE, an Extensible Cataloging Platform for Automatic Indexing of Audiovisual Content
SN - 978-989-758-175-5
AU - Pintus M.
AU - Agelli M.
AU - Colucci F.
AU - Corona N.
AU - Sassu A.
AU - Santamaria F.
PY - 2016
SP - 574
EP - 581
DO - 10.5220/0005722205740581