Finally, we would also like to build a robust validation framework for the whole pipeline, which would require acquiring and annotating more concerts.
ACKNOWLEDGEMENTS
This work was supported by the Secretaria d'Universitats i Recerca de la Generalitat de Catalunya (SGR-2017-1669), by the Agència de Gestió d'Ajuts Universitaris i de Recerca (AGAUR) through the Doctorat Industrial programme, and by the CERCA Programme/Generalitat de Catalunya.
The authors wish to thank the Auditori de Sant Cugat and L'Auditori de Barcelona for granting access to recordings of music concerts.