Predicting a Song Title from Audio Embeddings on a Pretrained Image-captioning Network

Avi Bleiweiss

2020

Abstract

Finding the name of a song from a piece played without the lyrics remains a long-standing challenge for music recognition services. In this work, we propose a neural architecture that combines deep learned image features and sequence modeling to automate the task of predicting the song title from an audio time series. To feed our network with a visual representation, we transform the sound signal into a two-dimensional spectrogram. Our novelty lies in training the model on the state-of-the-art Conceptual Captions dataset to generate image descriptions, jointly with inference on the Million Song and Free Music Archive test sets to produce song titles. We present extensive quantitative analysis of our experiments and show that, using k-beam search, our model achieved an out-of-domain BLEU score of 45.1, compared with an in-domain score of 61.3.
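The audio-to-image step mentioned in the abstract can be illustrated with a minimal sketch. It assumes a plain log-magnitude short-time Fourier transform; the abstract does not state which spectrogram variant or parameters the paper uses, so the function name `spectrogram`, the Hann window, and the `frame_len` and `hop` values are all hypothetical choices for illustration only.

```python
import numpy as np


def spectrogram(signal, frame_len=256, hop=128):
    """Return a log-magnitude spectrogram of shape (freq_bins, frames).

    Illustrative only: frame_len and hop are assumed values,
    not parameters taken from the paper.
    """
    signal = np.asarray(signal, dtype=np.float64)
    if signal.size < frame_len:
        raise ValueError("signal is shorter than one frame")

    # Slice the 1-D time series into overlapping, Hann-windowed frames.
    window = np.hanning(frame_len)
    n_frames = 1 + (signal.size - frame_len) // hop
    frames = np.stack(
        [signal[i * hop : i * hop + frame_len] * window for i in range(n_frames)]
    )

    # One real FFT per frame gives the magnitude per frequency bin.
    magnitude = np.abs(np.fft.rfft(frames, axis=1))

    # Convert to decibels and put frequency on the vertical axis,
    # so the result can be treated as a 2-D single-channel image.
    return 20.0 * np.log10(magnitude + 1e-10).T
```

For a one-second 1000 Hz tone sampled at 8000 Hz, the resulting array has one row per frequency bin and one column per frame, with the energy concentrated in the row that corresponds to 1000 Hz. An array of this kind is what an image network could consume in place of a photograph.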

Paper Citation


in Harvard Style

Bleiweiss A. (2020). Predicting a Song Title from Audio Embeddings on a Pretrained Image-captioning Network. In Proceedings of the 12th International Conference on Agents and Artificial Intelligence - Volume 2: ICAART, ISBN 978-989-758-395-7, pages 483-493. DOI: 10.5220/0008939004830493


in Bibtex Style

@conference{icaart20,
author={Avi Bleiweiss},
title={Predicting a Song Title from Audio Embeddings on a Pretrained Image-captioning Network},
booktitle={Proceedings of the 12th International Conference on Agents and Artificial Intelligence - Volume 2: ICAART},
year={2020},
pages={483-493},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0008939004830493},
isbn={978-989-758-395-7},
}


in EndNote Style

TY - CONF
JO - Proceedings of the 12th International Conference on Agents and Artificial Intelligence - Volume 2: ICAART
TI - Predicting a Song Title from Audio Embeddings on a Pretrained Image-captioning Network
SN - 978-989-758-395-7
AU - Bleiweiss A.
PY - 2020
SP - 483
EP - 493
DO - 10.5220/0008939004830493