suggest the Mel-Frequency Cepstral Coefficients (MFCC) as the main feature for the classification of vocal/instrumental parts in the MUSCONTENT database. Three machine learning techniques were used for the classification: the Gaussian Mixture Model (GMM), an Artificial Neural Network (ANN) with the feed-forward backpropagation algorithm, and Learning Vector Quantization (LVQ). From their results, they claim that LVQ yields the highest classification accuracy. More precisely, they report a classification accuracy of 77% for the ANNs, 77.6% for the LVQ and 60.24% for the GMM. In our work, we included additional low-level features and achieved higher accuracy by modeling our data with ANNs.
In this paper, we introduce a two-stage approach for (a) the classification of an unknown song as “monophonic” or “polyphonic” and (b) the segmentation of a polyphonic song into positions of interest. Such positions include the boundaries of parts of a song in which only instruments are performing and no singing voice is present. We used low-level timbre features and trained artificial neural networks that discriminate with high accuracy between “monophonic” and “polyphonic” songs, and between “vocal” and “instrumental” parts of polyphonic music.
The main contribution of our work is the use of ANNs and their application to audio thumbnailing. ANNs offer numerous advantages in a wide range of applications (Benediktsson et al., 1990). Their ability to capture complex nonlinear relationships between variables arises from the imitation of the biological function of the human brain. Disadvantages include longer training times and the empirical nature of model development. The authors have demonstrated how these disadvantages can be minimized across a wide range of applications, including medicine (Neocleous et al., 2011).
While the interest of the MIR community in audio thumbnailing has focused on popular music, little work has been done on folk music. The main differences between popular and folk music include Western versus non-Western instrumentation as well as fundamental rules of music theory. For instance, the use of traditional instruments in folk music creates a significantly different sound compared to popular music. One common feature of folk music is monophonic performance, carried either by a single musical instrument or by a singing voice alone. In this paper, we report our contribution of a system that classifies a song as “monophonic” or “polyphonic”.
We compare our results with two other methods, namely Support Vector Machines and statistical Bayesian classification.
2 METHODS
2.1 Overview
The database we used contains the audio signals of 98 Cypriot folk songs. Each audio signal was extracted from the original CDs and encoded at a sampling frequency of 44100 Hz with 16-bit amplitude resolution, the quality typically used for audio CDs.
From this database, we isolated 24 songs to create a training set, while the remaining 74 songs were used for the validation of our system. The training set comprises 17 monophonic and 7 polyphonic songs. The monophonic songs were 6 vocal songs sung by male performers, 6 vocal songs sung by female performers, and 5 songs performed on the traditional Cypriot instrument called “pithkiavli”.
The main idea of our method is illustrated in Figure 1. The first system takes an unknown song as input and predicts whether it is monophonic or polyphonic. The second system takes a polyphonic song and predicts the boundaries of the parts in which only instruments are performing (instrumental parts) and the parts in which a singing voice is present (vocal parts).
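As an illustration of this two-stage flow, the following is a minimal sketch in Python; system1, system2, and extract_features are hypothetical stand-ins for the components described in the remainder of this section (a feature-extraction sketch follows below).

```python
# A hypothetical sketch of the two-stage pipeline of Figure 1.
# system1, system2 and extract_features are assumed stand-ins for the
# components described in the rest of this section.
def analyse(path, system1, system2, extract_features):
    features = extract_features(path)    # frame-level feature matrix
    label = system1.classify(features)   # "monophonic" or "polyphonic"
    if label == "polyphonic":
        # system 2: boundaries between vocal and instrumental parts
        return label, system2.segment(features)
    return label, None
```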
Each audio signal was segmented into a sequence of overlapping audio frames of 2048 samples (46 ms) with an overlap of 512 samples (12 ms). From each of these audio frames we extracted the following audio features: Zero Crossing Rate, Spectral Centroid, and MFCC (13 coefficients).
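The paper does not name a feature-extraction toolkit; the following is a minimal sketch using librosa, under the assumption that “overlapping by 512 samples” means a 512-sample overlap, i.e. a hop of 2048 - 512 = 1536 samples.

```python
# A minimal feature-extraction sketch using librosa (an assumed toolkit;
# the original implementation is not specified).
import numpy as np
import librosa

FRAME = 2048       # 46 ms at 44100 Hz
HOP = FRAME - 512  # 512-sample overlap -> 1536-sample hop (assumption)

def extract_features(path):
    """Return a (n_frames, 15) matrix: ZCR, spectral centroid, 13 MFCCs."""
    y, sr = librosa.load(path, sr=44100)
    zcr = librosa.feature.zero_crossing_rate(y, frame_length=FRAME, hop_length=HOP)
    cen = librosa.feature.spectral_centroid(y=y, sr=sr, n_fft=FRAME, hop_length=HOP)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13, n_fft=FRAME, hop_length=HOP)
    n = min(zcr.shape[1], cen.shape[1], mfcc.shape[1])
    return np.vstack([zcr[:, :n], cen[:, :n], mfcc[:, :n]]).T
```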
For the first system, the mean and standard deviation of each feature were calculated and used to build three feed-forward ANNs. Each of them has 20 neurons in its hidden layer and was trained for 200 epochs. The ANNs were built using monophonic songs for the first class and polyphonic songs for the second class. The three ANNs differ in the instrument performing in the monophonic songs, which is different for each network. This system classifies an unknown song into the class “monophonic” or “polyphonic”. Both systems 1 and 2 require audio frame segmentation and feature extraction.
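A minimal sketch of system 1 follows, assuming scikit-learn's MLPClassifier as a stand-in for the feed-forward backpropagation networks (the original implementation is not specified); song_vector and train_system1 are illustrative names.

```python
# A minimal sketch of system 1, assuming scikit-learn's MLPClassifier as
# a stand-in for the feed-forward backpropagation networks described
# above (20 hidden neurons, trained for 200 epochs).
import numpy as np
from sklearn.neural_network import MLPClassifier

def song_vector(frames):
    """Collapse a (n_frames, 15) feature matrix into a 30-d mean/std vector."""
    return np.concatenate([frames.mean(axis=0), frames.std(axis=0)])

def train_system1(songs, labels):
    """songs: frame-feature matrices; labels: 0 = monophonic, 1 = polyphonic."""
    X = np.array([song_vector(f) for f in songs])
    # With the default 'adam' solver, max_iter counts training epochs.
    clf = MLPClassifier(hidden_layer_sizes=(20,), max_iter=200)
    return clf.fit(X, np.array(labels))
```

The paper trains three such networks, differing only in the monophonic instrument present in the training songs; the sketch above covers a single one.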
For the second system, the entire feature vectors
were used to build one ANN that predicts a value in