we would like to improve the precision of speaker
detection by combining content-aware analysis
with eye-tracking data for better subtitle
presentation.
ACKNOWLEDGEMENTS
We thank S. Kawamura, T. Kato, and T. Fukusato
(Waseda University, Japan) for their advice. This
research was supported by JST ACCEL and CREST.