Dynamic Subtitle Placement Considering the Region of Interest and Speaker Location

Wataru Akahori, Tatsunori Hirai, Shigeo Morishima

Abstract

This paper presents a subtitle placement method that reduces unnecessary eye movements. Although methods that vary the position of subtitles have been discussed in a previous study, such subtitles may still overlap the region of interest (ROI). We therefore propose a dynamic subtitling method that uses eye-tracking data to prevent subtitles from overlapping important regions. The proposed method estimates the ROI from the eye-tracking data of multiple viewers and positions subtitles immediately below the ROI so that they do not occlude it. Furthermore, we detect the speaker in a scene from audio and visual information and position subtitles near the speaker to help viewers identify who is talking. Experimental results show that the proposed method enables viewers to watch the ROI and the subtitles for longer durations than traditional subtitles, and that it enhances the comfort and utility of the viewing experience.
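The core placement idea described above, estimating an ROI from pooled gaze samples and anchoring the subtitle just below it, can be sketched minimally as follows. This is not the authors' implementation; it assumes gaze data arrives as per-frame `(x, y)` samples pooled across viewers, models the ROI as the gaze mean plus/minus `k` standard deviations, and the function names (`roi_from_gaze`, `subtitle_y`) are hypothetical.

```python
from statistics import mean, stdev

def roi_from_gaze(points, k=1.0):
    """Estimate a rectangular ROI as mean gaze +/- k standard deviations.

    points: list of (x, y) gaze samples pooled from multiple viewers
    for a single frame. Returns (left, top, right, bottom) in pixels.
    """
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    mx, my = mean(xs), mean(ys)
    sx = stdev(xs) if len(xs) > 1 else 0.0
    sy = stdev(ys) if len(ys) > 1 else 0.0
    return (mx - k * sx, my - k * sy, mx + k * sx, my + k * sy)

def subtitle_y(roi, sub_height, frame_height, margin=8):
    """Place the subtitle's top edge immediately under the ROI.

    If the subtitle would fall outside the frame, fall back to
    placing it just above the ROI instead.
    """
    left, top, right, bottom = roi
    y = bottom + margin
    if y + sub_height > frame_height:
        y = max(0.0, top - margin - sub_height)
    return y

# Example: three viewers' gaze samples clustered near (110, 105)
gaze = [(100, 100), (120, 110), (110, 105)]
roi = roi_from_gaze(gaze)
y = subtitle_y(roi, sub_height=40, frame_height=720)
```

A per-frame ROI computed this way would jitter with the gaze data; in practice one would smooth it over time (and reset at shot boundaries, as the paper's use of shot segmentation suggests) before deriving subtitle positions.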

References

  1. Akahori, W., Hirai, T., Kawamura, S., and Morishima, S. (2016). Region-of-interest-based subtitle placement using eye-tracking data of multiple viewers. In Proceedings of the ACM International Conference on Interactive Experiences for TV and Online Video, pages 123-128. ACM.
  2. Apostolidis, E. and Mezaris, V. (2014). Fast shot segmentation combining global and local visual descriptors. In 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 6583-6587. IEEE.
  3. Cao, Y., Lau, R. W., and Chan, A. B. (2014). Look over here: Attention-directing composition of manga elements. ACM Transactions on Graphics (TOG), 33(4):94.
  4. Cerf, M., Harel, J., Einhäuser, W., and Koch, C. (2008). Predicting human gaze using low-level saliency combined with face detection. In Advances in neural information processing systems, pages 241-248.
  5. Chun, B.-K., Ryu, D.-S., Hwang, W.-I., and Cho, H.-G. (2006). An automated procedure for word balloon placement in cinema comics. In International Symposium on Visual Computing, pages 576-585. Springer.
  6. Danelljan, M., Häger, G., Khan, F., and Felsberg, M. (2014). Accurate scale estimation for robust visual tracking. In British Machine Vision Conference, Nottingham, September 1-5, 2014. BMVA Press.
  7. Everingham, M., Sivic, J., and Zisserman, A. (2006). "Hello! My name is... Buffy" - automatic naming of characters in TV video. In BMVC, volume 2, page 6.
  8. Harel, J., Koch, C., and Perona, P. (2006). Graph-based visual saliency. In Advances in neural information processing systems, pages 545-552.
  9. Hong, R., Wang, M., Yuan, X.-T., Xu, M., Jiang, J., Yan, S., and Chua, T.-S. (2011). Video accessibility enhancement for hearing-impaired users. ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM), 7(1):24.
  10. Hou, X. and Zhang, L. (2007). Saliency detection: A spectral residual approach. In 2007 IEEE Conference on Computer Vision and Pattern Recognition, pages 1-8. IEEE.
  11. Hu, Y., Kautz, J., Yu, Y., and Wang, W. (2015). Speaker-following video subtitles. ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM), 11(2):32.
  12. Itti, L., Koch, C., Niebur, E., et al. (1998). A model of saliency-based visual attention for rapid scene analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(11):1254-1259.
  13. Jain, E., Sheikh, Y., Shamir, A., and Hodgins, J. (2015). Gaze-driven video re-editing. ACM Transactions on Graphics (TOG), 34(2):21.
  14. Kanan, C., Tong, M. H., Zhang, L., and Cottrell, G. W. (2009). SUN: Top-down saliency using natural statistics. Visual cognition, 17(6-7):979-1003.
  15. Katti, H., Rajagopal, A. K., Kankanhalli, M., and Kalpathi, R. (2014). Online estimation of evolving human visual interest. ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM), 11(1):8.
  16. King, D. E. (2015). Max-margin object detection. arXiv preprint arXiv:1502.00046.
  17. Kurlander, D., Skelly, T., and Salesin, D. (1996). Comic chat. In Proceedings of the 23rd annual conference on Computer graphics and interactive techniques, pages 225-236. ACM.
  18. McConkie, G. W., Kerr, P. W., Reddix, M. D., Zola, D., and Jacobs, A. M. (1989). Eye movement control during reading: II. Frequency of refixating a word. Perception & Psychophysics, 46(3):245-253.
  19. Rayner, K. (1975). The perceptual span and peripheral cues in reading. Cognitive Psychology, 7(1):65-81.
  20. Rudoy, D., Goldman, D. B., Shechtman, E., and Zelnik-Manor, L. (2012). Crowdsourcing gaze data collection. arXiv preprint arXiv:1204.3367.
  21. San Agustin, J., Skovsgaard, H., Mollenbach, E., Barret, M., Tall, M., Hansen, D. W., and Hansen, J. P. (2010). Evaluation of a low-cost open-source gaze tracker. In Proceedings of the 2010 Symposium on Eye-Tracking Research & Applications, pages 77-80. ACM.
  22. Uřičář, M., Franc, V., and Hlaváč, V. (2012). Detector of facial landmarks learned by the structured output SVM. VISAPP, 12:547-556.
  23. Yang, J. and Yang, M.-H. (2012). Top-down visual saliency via joint CRF and dictionary learning. In Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on, pages 2296-2303. IEEE.

Paper Citation


in Harvard Style

Akahori W., Hirai T. and Morishima S. (2017). Dynamic Subtitle Placement Considering the Region of Interest and Speaker Location. In Proceedings of the 12th International Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications - Volume 6: VISAPP, (VISIGRAPP 2017), ISBN 978-989-758-227-1, pages 102-109. DOI: 10.5220/0006262201020109


in Bibtex Style

@conference{visapp17,
author={Wataru Akahori and Tatsunori Hirai and Shigeo Morishima},
title={Dynamic Subtitle Placement Considering the Region of Interest and Speaker Location},
booktitle={Proceedings of the 12th International Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications - Volume 6: VISAPP, (VISIGRAPP 2017)},
year={2017},
pages={102-109},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0006262201020109},
isbn={978-989-758-227-1},
}


in EndNote Style

TY - CONF
JO - Proceedings of the 12th International Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications - Volume 6: VISAPP, (VISIGRAPP 2017)
TI - Dynamic Subtitle Placement Considering the Region of Interest and Speaker Location
SN - 978-989-758-227-1
AU - Akahori W.
AU - Hirai T.
AU - Morishima S.
PY - 2017
SP - 102
EP - 109
DO - 10.5220/0006262201020109