dimensional sequences: Application to polyphonic
music generation and transcription. In Proceedings
of the International Conference on Machine Learning
(ICML), pages 1159–1166, Edinburgh, Scotland.
Cho, K., van Merriënboer, B., Gulcehre, C., Bahdanau,
D., Bougares, F., Schwenk, H., and Bengio, Y.
(2014). Learning phrase representations using RNN
encoder–decoder for statistical machine translation. In
Empirical Methods in Natural Language Processing
(EMNLP), pages 1724–1734, Doha, Qatar.
Choi, K., Fazekas, G., Sandler, M., and Cho, K. (2017).
Convolutional recurrent neural networks for music
classification. In IEEE International Conference on
Acoustics, Speech and Signal Processing (ICASSP),
pages 2392–2396, New Orleans, Louisiana.
Chung, J., Gülçehre, Ç., Cho, K., and Bengio, Y. (2014).
Empirical evaluation of gated recurrent neural net-
works on sequence modeling. CoRR, abs/1412.3555.
http://arxiv.org/abs/1412.3555.
Defferrard, M., Benzi, K., Vandergheynst, P., and Bresson,
X. (2017). FMA: A dataset for music analysis. In Pro-
ceedings of the International Society for Music Infor-
mation Retrieval Conference (ISMIR), pages 316–323,
Suzhou, China.
Delbouys, R., Hennequin, R., Piccoli, F., Royo-Letelier,
J., and Moussallam, M. (2018). Music mood detec-
tion based on audio and lyrics with deep neural net.
In Proceedings of the International Society for Music
Information Retrieval Conference (ISMIR), pages 370–375, Paris,
France.
Ding, N. and Soricut, R. (2017). Cold-start reinforcement
learning with softmax policy gradient. In Guyon,
I., Luxburg, U. V., Bengio, S., Wallach, H., Fer-
gus, R., Vishwanathan, S., and Garnett, R., editors,
Advances in Neural Information Processing Systems
(NIPS), pages 2817–2826. Curran Associates, Inc.
Dua, D. and Graff, C. (2017). UCI machine learning repos-
itory. http://archive.ics.uci.edu/ml.
Elbayad, M., Besacier, L., and Verbeek, J. (2018). Per-
vasive attention: 2D convolutional neural networks
for sequence-to-sequence prediction. In Proceedings
of the 22nd Conference on Computational Natural
Language Learning (CoNLL), pages 97–107, Brus-
sels, Belgium.
Fang, H., Gupta, S., Iandola, F., Srivastava, R. K., Deng, L.,
Dollár, P., Gao, J., He, X., Mitchell, M., Platt, J. C.,
Zitnick, C. L., and Zweig, G. (2015). From captions
to visual concepts and back. In The IEEE Conference
on Computer Vision and Pattern Recognition (CVPR),
pages 1473–1482.
Germain, F. G., Chen, Q., and Koltun, V. (2018).
Speech denoising with deep feature losses. CoRR,
abs/1806.10522. https://arxiv.org/abs/1806.10522.
Hamel, P., Lemieux, S., Bengio, Y., and Eck, D. (2011).
Temporal pooling and multiscale learning for auto-
matic annotation and ranking of music audio. In Pro-
ceedings of the International Society for Music Infor-
mation Retrieval Conference (ISMIR), pages 729–734,
Miami, United States.
He, K., Zhang, X., Ren, S., and Sun, J. (2016). Deep resid-
ual learning for image recognition. In 2016 IEEE Con-
ference on Computer Vision and Pattern Recognition
(CVPR), pages 770–778, Las Vegas, Nevada.
Hershey, S., Chaudhuri, S., Ellis, D. P. W., Gemmeke,
J. F., Jansen, A., Moore, R. C., Plakal, M., Platt,
D., Saurous, R. A., Seybold, B., Slaney, M., Weiss,
R. J., and Wilson, K. (2017). CNN architectures for
large-scale audio classification. In International Con-
ference on Acoustics, Speech and Signal Processing
(ICASSP), pages 131–135, New Orleans, Louisiana.
Hochreiter, S. and Schmidhuber, J. (1997). Long short-term
memory. Neural Computation, 9(8):1735–1780.
Hossain, M. Z., Sohel, F., Shiratuddin, M. F., and Laga, H.
(2019). A comprehensive survey of deep learning for
image captioning. ACM Computing Surveys, 51(6):1–
36.
Ioffe, S. and Szegedy, C. (2015). Batch normalization: Ac-
celerating deep network training by reducing internal
covariate shift. In International Conference on Ma-
chine Learning (ICML), pages 448–456, Lille, France.
Jehan, T. and Whitman, B. (2005). Echo Nest.
https://developer.spotify.com.
Kim, C. D., Kim, B., Lee, H., and Kim, G. (2019). Au-
dioCaps: Generating captions for audios in the wild.
In Conference of the North American Chapter of the
Association for Computational Linguistics (NAACL),
pages 119–132, Minneapolis, Minnesota.
Kingma, D. P. and Ba, J. (2014). Adam: A method
for stochastic optimization. CoRR, abs/1412.6980.
http://arxiv.org/abs/1412.6980.
Lin, T., Maire, M., Belongie, S. J., Bourdev, L. D., Gir-
shick, R. B., Hays, J., Perona, P., Ramanan, D., Dollár,
P., and Zitnick, C. L. (2014). Microsoft COCO:
common objects in context. CoRR, abs/1405.0312.
http://arxiv.org/abs/1405.0312.
Oramas, S., Barbieri, F., Nieto, O., and Serra, X. (2018).
Multimodal deep learning for music genre classifica-
tion. Transactions of the International Society for Mu-
sic Information Retrieval (TISMIR), 1(1):4–21.
Ordonez, V., Kulkarni, G., and Berg, T. L. (2011). Im2text:
Describing images using 1 million captioned pho-
tographs. In Shawe-Taylor, J., Zemel, R. S., Bartlett,
P. L., Pereira, F., and Weinberger, K. Q., editors,
Advances in Neural Information Processing Systems
(NIPS), pages 1143–1151. Curran Associates, Inc.
Papineni, K., Roukos, S., Ward, T., and Zhu, W.-J. (2002).
BLEU: a method for automatic evaluation of machine
translation. In Annual Meeting of the Association
for Computational Linguistics (ACL), pages 311–318,
Philadelphia, Pennsylvania.
Paszke, A., Gross, S., Chintala, S., Chanan, G., Yang, E.,
DeVito, Z., Lin, Z., Desmaison, A., Antiga, L., and
Lerer, A. (2017). Automatic differentiation in Py-
Torch. In Workshop on Autodiff, Advances in Neu-
ral Information Processing Systems (NIPS), Long
Beach, California.
Pennington, J., Socher, R., and Manning, C. (2014). GloVe:
Global vectors for word representation. In Empirical
Methods in Natural Language Processing (EMNLP),
pages 1532–1543, Doha, Qatar.
Pons, J., Lidy, T., and Serra, X. (2016). Experimenting with
musically motivated convolutional neural networks.
In 14th International Workshop on Content-Based
Multimedia Indexing (CBMI), pages 1–6, Bucharest,
Romania.