Authors:
Wenming Gui 1; Zeyu Xia 2; Rubin Gong 1; Gui Wang 1; Bingxu Chen 1 and Donghui Zhang 1
Affiliations:
1 Jinling Institute of Technology, Nanjing 211169, Jiangsu, China; 2 Queensland University of Technology, Brisbane City QLD 4000, Australia
Keyword(s):
Singing voice detection, deeper convolutional neural network, recurrent neural network, squeeze-and-excitation residual convolutional network.
Abstract:
Singing voice detection is a fundamental task in music information retrieval that benefits other tasks such as singing voice separation. We propose a new algorithm based on a deeper convolutional neural network, fed with a logarithmic mel-scaled spectrogram, that extracts and integrates features from the different layers of the network and then discriminates the singing voice. We demonstrate that this deeper network achieves good performance and can, to some extent, be designed efficiently. The experiments use three public datasets, Jamendo, MIR-1K, and RWC Pop, together with their combined dataset. We also study which network depth suits this task. The experiments show that the optimal depth on all four datasets is 152.
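
The network input named in the abstract, a logarithmic mel-scaled spectrogram, can be sketched in a few lines of NumPy. This is only an illustrative sketch, not the authors' preprocessing: the sample rate, FFT size, hop length, number of mel bands, the Hann window, the mel formula, and the function names (hz_to_mel, mel_to_hz, mel_filterbank, log_mel_spectrogram) are all assumptions chosen for the example, since the abstract does not state them.

```python
import numpy as np

def hz_to_mel(f):
    # One common mel formula (an assumption; the abstract does not specify one).
    return 2595.0 * np.log10(1.0 + np.asarray(f, dtype=float) / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (np.asarray(m, dtype=float) / 2595.0) - 1.0)

def mel_filterbank(sr, n_fft, n_mels):
    # Triangular filters whose edges are evenly spaced on the mel scale.
    hz_pts = mel_to_hz(np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2))
    fft_freqs = np.linspace(0.0, sr / 2.0, n_fft // 2 + 1)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(n_mels):
        lo, ctr, hi = hz_pts[i], hz_pts[i + 1], hz_pts[i + 2]
        rising = (fft_freqs - lo) / (ctr - lo)
        falling = (hi - fft_freqs) / (hi - ctr)
        fb[i] = np.maximum(0.0, np.minimum(rising, falling))
    return fb

def log_mel_spectrogram(signal, sr=16000, n_fft=1024, hop=512, n_mels=80):
    # All default parameter values are hypothetical, chosen for illustration.
    signal = np.asarray(signal, dtype=float)
    if len(signal) < n_fft:
        raise ValueError("signal must contain at least n_fft samples")
    window = np.hanning(n_fft)
    n_frames = 1 + (len(signal) - n_fft) // hop
    frames = np.stack([signal[t * hop: t * hop + n_fft] * window
                       for t in range(n_frames)])
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2    # (n_frames, n_fft//2 + 1)
    mel = power @ mel_filterbank(sr, n_fft, n_mels).T   # (n_frames, n_mels)
    return np.log(mel + 1e-10)                          # small offset avoids log(0)

# Usage: one second of a 440 Hz tone gives a (frames, mel bands) feature matrix.
t = np.arange(16000) / 16000.0
features = log_mel_spectrogram(np.sin(2 * np.pi * 440.0 * t))
```

Each row of the resulting matrix is one time frame and each column one mel band. A matrix of this kind is the sort of time-frequency input a convolutional network would consume.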