Feng, L., Liu, S., and Yao, J. (2017). Music genre classifi-
cation with paralleling recurrent convolutional neural
network. CoRR, abs/1712.08370.
Güler, R. A., Neverova, N., and Kokkinos, I. (2018). Dense-
pose: Dense human pose estimation in the wild. In
Proceedings of the IEEE Conference on Computer Vi-
sion and Pattern Recognition, pages 7297–7306.
Hershey, S., Chaudhuri, S., Ellis, D. P. W., Gemmeke,
J. F., Jansen, A., Moore, R. C., Plakal, M., Platt,
D., Saurous, R. A., Seybold, B., Slaney, M., Weiss,
R. J., and Wilson, K. W. (2016). CNN architec-
tures for large-scale audio classification. CoRR,
abs/1609.09430.
Hinton, G. E., Srivastava, N., Krizhevsky, A., Sutskever,
I., and Salakhutdinov, R. (2012). Improving neural
networks by preventing co-adaptation of feature de-
tectors. CoRR, abs/1207.0580.
Ilg, E., Mayer, N., Saikia, T., Keuper, M., Dosovitskiy,
A., and Brox, T. (2016). Flownet 2.0: Evolution of
optical flow estimation with deep networks. CoRR,
abs/1612.01925.
Kingma, D. P. and Ba, J. (2014). Adam: A method for
stochastic optimization. CoRR, abs/1412.6980.
Kleiman, Y. and Cohen-Or, D. (2018). Dance to the beat:
Enhancing dancing performance in video.
Krizhevsky, A., Sutskever, I., and Hinton, G. E. (2012). Im-
agenet classification with deep convolutional neural
networks. In Proceedings of the 25th International
Conference on Neural Information Processing Sys-
tems - Volume 1, NIPS’12, pages 1097–1105, USA.
Curran Associates Inc.
Lee, H.-Y., Yang, X., Liu, M.-Y., Wang, T.-C., Lu, Y.-D.,
Yang, M.-H., and Kautz, J. (2019). Dancing to mu-
sic. In Wallach, H., Larochelle, H., Beygelzimer, A.,
d’Alché-Buc, F., Fox, E., and Garnett, R., editors, Ad-
vances in Neural Information Processing Systems 32,
pages 3581–3591. Curran Associates, Inc.
Liu, C., Feng, L., Liu, G., Wang, H., and Liu, S. (2019).
Bottom-up broadcast neural network for music genre
classification. CoRR, abs/1901.08928.
Madison, G., Gouyon, F., Ullén, F., and Hörnström, K.
(2011). Modeling the tendency for music to induce
movement in humans: first correlations with low-level
audio descriptors across music genres. Journal of Experimental Psychology: Human Perception and Performance, 37(5):1578–1594.
Marchand, U. and Peeters, G. (2016). The Extended Ballroom Dataset. In Late-Breaking Demo Session of the 17th International Society for Music Information Retrieval Conference.
Moon, S., Kim, S., and Wang, H. (2014). Multimodal trans-
fer deep learning for audio visual recognition. CoRR,
abs/1412.3121.
Ofli, F., Erzin, E., Yemez, Y., and Tekalp, A. (2012).
Learn2dance: Learning statistical music-to-dance
mappings for choreography synthesis. IEEE Transactions on Multimedia, 14:747–759.
Owens, A. and Efros, A. A. (2018). Audio-visual scene
analysis with self-supervised multisensory features.
In The European Conference on Computer Vision
(ECCV).
Paszke, A., Gross, S., Chintala, S., Chanan, G., Yang, E.,
DeVito, Z., Lin, Z., Desmaison, A., Antiga, L., and
Lerer, A. (2017). Automatic differentiation in pytorch.
In NIPS-W.
Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V.,
Thirion, B., Grisel, O., Blondel, M., Prettenhofer,
P., Weiss, R., Dubourg, V., Vanderplas, J., Passos,
A., Cournapeau, D., Brucher, M., Perrot, M., and
Duchesnay, E. (2011). Scikit-learn: Machine learning
in Python. Journal of Machine Learning Research,
12:2825–2830.
Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S.,
Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bern-
stein, M. S., Berg, A. C., and Li, F. (2014). Ima-
genet large scale visual recognition challenge. CoRR,
abs/1409.0575.
Sachs, C. and Schönberg, B. (1963). World History of the
Dance. The Norton Library. Allen & Unwin.
Samanta, S., Purkait, P., and Chanda, B. (2012). Indian
classical dance classification by learning dance pose
bases. In Proceedings of the 2012 IEEE Workshop
on the Applications of Computer Vision, WACV ’12,
pages 265–270, Washington, DC, USA. IEEE Com-
puter Society.
Simonyan, K. and Zisserman, A. (2014). Two-stream
convolutional networks for action recognition in
videos. In Ghahramani, Z., Welling, M., Cortes, C.,
Lawrence, N. D., and Weinberger, K. Q., editors, Ad-
vances in Neural Information Processing Systems 27,
pages 568–576. Curran Associates, Inc.
Soomro, K., Zamir, A. R., and Shah, M. (2012). UCF101:
A dataset of 101 human actions classes from videos in
the wild. CoRR, abs/1212.0402.
Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S. E.,
Anguelov, D., Erhan, D., Vanhoucke, V., and Rabi-
novich, A. (2014). Going deeper with convolutions.
CoRR, abs/1409.4842.
Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., and Wojna,
Z. (2015). Rethinking the inception architecture for
computer vision. CoRR, abs/1512.00567.
Toshev, A. and Szegedy, C. (2014). Deeppose: Human pose
estimation via deep neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
van der Maaten, L. and Hinton, G. (2008). Visualizing data
using t-SNE. Journal of Machine Learning Research,
9:2579–2605.
Wang, L., Xiong, Y., Wang, Z., and Qiao, Y. (2015). To-
wards good practices for very deep two-stream con-
vnets. CoRR, abs/1507.02159.
Zhang, H.-B., Zhang, Y.-X., Zhong, B., Lei, Q., Yang, L.,
Du, J.-X., and Chen, D.-S. (2019). A comprehen-
sive survey of vision-based human action recognition
methods. Sensors, 19:1005.