and 15 ms, respectively. Therefore, a fragment of
60 seconds of music is represented by a 4000 × 6
matrix after MFCC feature extraction.
• Temporal Feature Integration. It is well known that the direct use of MFCCs does not provide an adequate representation for music genre identification. Thus, a time integration process based on a Multivariate Autoregressive (MAR) model (Meng et al., 2007) is applied to recover the temporal information that is more relevant for this task. For each block of consecutive MFCC vectors, we fit an MAR model of lag three:
z_j = \sum_{p=1}^{3} B_p z_{j-p} + e_j,
where z_j denotes the MFCCs extracted at the jth window, e_j is the prediction error, and B_p are the model parameters. The values of the matrices B_p, p = 1, ..., 3, together with the mean and covariance of the residuals e_j, are concatenated into a
135 × 1 single feature vector (MAR vector). For
this temporal integration phase, we have considered a window size and hop size of 2 and 1 seconds, respectively. Thus, an audio fragment of 60 seconds is represented by a matrix of size 60 × 135 after time integration; a sketch of this step is given below.
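The following sketch fits the MAR(3) model by ordinary least squares and packs the coefficient matrices together with the mean and the upper triangle of the residual covariance into a single vector (3·6·6 + 6 + 21 = 135 values). The least-squares estimator, the covariance packing and the frame counts for the 2 s window and 1 s hop are assumptions on our part, since the paper defers the exact estimator to (Meng et al., 2007).

```python
# Sketch of the MAR temporal integration described above (assumptions:
# plain least-squares fitting and upper-triangular covariance packing).
import numpy as np

def mar_features(mfcc, order=3):
    """mfcc: (T, 6) block of consecutive MFCC frames inside one texture window."""
    T, d = mfcc.shape
    # Regression problem z_j = sum_{p=1..3} B_p z_{j-p} + e_j.
    X = np.hstack([mfcc[order - p:T - p] for p in range(1, order + 1)])  # (T-3, 3*d)
    Y = mfcc[order:]                                                     # (T-3, d)
    B, *_ = np.linalg.lstsq(X, Y, rcond=None)        # stacked B_p coefficients
    resid = Y - X @ B                                # prediction errors e_j
    iu = np.triu_indices(d)
    return np.concatenate([B.ravel(),                          # 3*6*6 = 108 values
                           resid.mean(axis=0),                 # 6 values
                           np.cov(resid, rowvar=False)[iu]])   # 21 values -> 135 total

def song_mar_matrix(mfcc_song, win=133, hop=67):
    """Slide a 2 s window with a 1 s hop over the song's MFCC matrix
    (with a 15 ms MFCC hop, 2 s is roughly 133 frames and 1 s roughly 67)."""
    starts = range(0, mfcc_song.shape[0] - win + 1, hop)
    return np.vstack([mar_features(mfcc_song[s:s + win]) for s in starts])

# Toy example: random data standing in for the 4000 x 6 MFCC matrix of a 60 s clip.
rng = np.random.default_rng(0)
print(song_mar_matrix(rng.standard_normal((4000, 6))).shape)  # roughly (60, 135)
```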
3.2 Song Level Dynamical Features
This section collects the main contributions of this paper. The key component of the classification system is the set of song-level features, which determines the information that is fed into the classification stage in the form of a kernel for songs.
Our previous work (García-García et al., 2010)
shows that the time evolution of the MAR coefficients is highly relevant for determining the genre. This paper proposes to learn a common HMM with all the training songs and to use the SSD metric (Section 2.2) to characterize each song by its transition profile across all the hidden states. Such a strategy yields significantly better classification rates than either discarding the time-dynamics information or using the steady-state probability of each song visiting each state (computed by considering (Ã_n)^∞ instead of Ã_n as the induced transition matrix for song S_n).
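For reference, the steady-state alternative replaces each song's induced transition matrix Ã_n by its limit (Ã_n)^∞, whose rows (for an ergodic chain) all coincide with the stationary distribution. The sketch below, with a toy transition matrix, shows how that distribution can be obtained; it illustrates only this baseline, not the SSD metric itself.

```python
# Stationary distribution of a song-induced transition matrix A_n, i.e. the
# common row of (A_n)^inf for an ergodic chain (toy illustration only).
import numpy as np

def stationary_distribution(A):
    """A: (K, K) row-stochastic transition matrix induced by one song."""
    vals, vecs = np.linalg.eig(A.T)                  # left eigenvectors of A
    pi = np.real(vecs[:, np.argmax(np.real(vals))])  # eigenvector for eigenvalue ~1
    pi = np.abs(pi)
    return pi / pi.sum()                             # probability of visiting each state

# Toy 5-state example.
rng = np.random.default_rng(0)
A = rng.random((5, 5))
A /= A.sum(axis=1, keepdims=True)
print(stationary_distribution(A))
```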
From this previous experience we identify two
critical elements that determine the quality of the
genre classification:
• What is the best way to define the hidden states?
• Which information encoded in the HMMs is actu-
ally relevant for the genre discrimination task?
With respect to the first question, (García-García et al., 2010) points out that a single HMM with a sufficiently large number of hidden states already yields a useful set of hidden states. The alternative would be to
learn different hidden states for each genre and merge
them in the common model. The former guarantees a
larger number of examples to train each hidden state,
whilst the latter indirectly helps genre discrimination
since hidden states learned from different genres will
be more separated.
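The second way of defining the hidden states (per-genre learning followed by a merge into the common model) could be prototyped, for instance, with hmmlearn as sketched below: Gaussian states are learned separately for each genre, their emission parameters are stacked into one larger HMM, and only the initial-state and transition probabilities are re-estimated on all the songs. The use of hmmlearn, the function name and the number of states per genre are assumptions for illustration, not the authors' implementation.

```python
# Hypothetical sketch of merging per-genre hidden states into a common HMM.
import numpy as np
from hmmlearn.hmm import GaussianHMM

def merge_genre_states(songs_by_genre, states_per_genre=8, n_iter=20):
    """songs_by_genre: one list of (T_i, 135) MAR matrices per genre."""
    genre_models = []
    for songs in songs_by_genre:
        m = GaussianHMM(n_components=states_per_genre,
                        covariance_type="full", n_iter=n_iter)
        m.fit(np.vstack(songs), [len(s) for s in songs])   # per-genre hidden states
        genre_models.append(m)

    # Common model: emission parameters are the union of all genre states;
    # only startprob_ and transmat_ are initialized ("st") and re-estimated ("st").
    K = states_per_genre * len(genre_models)
    common = GaussianHMM(n_components=K, covariance_type="full",
                         n_iter=n_iter, init_params="st", params="st")
    common.means_ = np.vstack([m.means_ for m in genre_models])
    common.covars_ = np.concatenate([m.covars_ for m in genre_models], axis=0)

    all_songs = [s for songs in songs_by_genre for s in songs]
    common.fit(np.vstack(all_songs), [len(s) for s in all_songs])
    return common
```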
With respect to the second question, this paper proposes a third alternative lying between the transition profile explored in (García-García et al., 2010) and the stationary frequency of each hidden state: a dynamically computed bag-of-acoustic-words song representation. In this approach, each hidden state plays a role analogous to that of a word in the bag-of-words parameterization of document collections (Fu et al., 2011). The frequencies of these acoustic words are computed dynamically by evolving the song across the common HMM. The resulting bags of words are fed into the SVM through a standard Gaussian RBF kernel
\kappa(S_n, S_m) = \exp\left(-\gamma \lVert z_n - z_m \rVert^2\right)
where z_n and z_m are the bags of words corresponding to songs S_n and S_m, respectively.
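To make the bag-of-acoustic-words representation and the kernel concrete, the sketch below averages, for each song, the per-frame posterior probabilities of the hidden states of the common HMM (one plausible reading of evolving the song across the model; the authors' exact computation may differ) and plugs the resulting vectors into the Gaussian RBF kernel of an SVM. The hmm argument is assumed to be a fitted hmmlearn model such as the common model sketched earlier.

```python
# Sketch of the dynamically computed bag of acoustic words and the RBF kernel.
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics.pairwise import rbf_kernel

def bag_of_acoustic_words(hmm, song):
    """song: (T, 135) MAR matrix; returns a K-dim vector of state frequencies."""
    posteriors = hmm.predict_proba(song)     # (T, K) forward-backward posteriors
    return posteriors.mean(axis=0)           # relative frequency of each acoustic word

def train_genre_svm(hmm, train_songs, train_labels, gamma=1.0, C=1.0):
    Z = np.vstack([bag_of_acoustic_words(hmm, s) for s in train_songs])
    K_train = rbf_kernel(Z, Z, gamma=gamma)  # kappa(S_n, S_m) = exp(-gamma ||z_n - z_m||^2)
    clf = SVC(kernel="precomputed", C=C).fit(K_train, train_labels)
    return clf, Z

def predict_genres(clf, hmm, Z_train, test_songs, gamma=1.0):
    Z_test = np.vstack([bag_of_acoustic_words(hmm, s) for s in test_songs])
    return clf.predict(rbf_kernel(Z_test, Z_train, gamma=gamma))
```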
We aim to answer these questions by studying the impact of the following five sets of song-level features on the genre discrimination accuracy:
1HMM+SSD. Learn a single HMM with all the
songs and use the SSD metric to form the ker-
nel for the SVM. This is the approach of (García-García et al., 2010).
4HMM+SSD. Learn a separate HMM with the train-
ing songs of each genre. Merge their states into a single HMM and use all the songs to learn the transition matrices and initial state probabilities.
Then form the SVM kernel with the SSD metric
in the common HMM (but where the hidden states
were learned independently).
1HMM+BoW. Learn a single HMM with all the
songs as in (1HMM+SSD) but instead of the SSD
metric, use the dynamically computed bag of
acoustic words as features for the SVM.
4HMM+BoW. Learn the hidden states separately as
in (4HMM+SSD) and replace the SSD metric with
the dynamically computed bag of acoustic words.
4HMM+4BoW. Learn one independent complete
HMM per genre (i.e. the hidden states will not be
shared and the transition probabilities will also be
learned independently for each genre). The bag of
acoustic words that is passed to the classification
stage results from the concatenation of all the bags
of words from all the models. Note that when one