MUSIC GENRE CLASSIFICATION BASED ON DYNAMICAL
MODELS
Alberto García-Durán (1), Jerónimo Arenas-García (1), Darío García-García (2) and Emilio Parrado-Hernández (1)
(1) Dept. of Signal Processing and Communications, Universidad Carlos III de Madrid, 28911 Leganés, Madrid, Spain
(2) Research School for Computer Science, Australian National University, Canberra, Australia
Keywords: Genre classification, HMMs, Dynamical features, Music retrieval.
Abstract: This paper studies several alternatives for extracting dynamical features from hidden Markov models (HMMs) that are meaningful for supervised music genre classification. Songs are modelled using a three-scale approach: a first stage of short-term (milliseconds) features, followed by two layers of dynamical models: a multivariate AR model that provides mid-term (seconds) features for each song, and an HMM stage that captures long-term (song-level) features shared among similar songs. We study, from an empirical point of view, which features are relevant for the genre classification task. Experiments on a database including pieces of heavy metal, punk, classical and reggae music illustrate the advantages of each set of features.
1 INTRODUCTION
Automatic music classification (Tzanetakis and Cook, 2002; Meng et al., 2007; Guaus, 2009; McKinney and Breebaart, 2003) has become a hot topic in the machine learning community due to the recent widespread adoption of personal music repositories and players. Automatic music classification helps users organize and efficiently browse their growing collections, as well as discover new music that may be of interest to them. Trivial approaches to music classification rely on the metadata associated with each item in the collection, such as composer, performers, style, year, genre and so on. More elaborate content-based approaches, which rely on the analysis of musical features extracted from the song waveform, are better suited to music discovery and to the automatic compilation of playlists from examples. Among all the possible criteria for classifying music for these purposes, genre-based classification is the most widely used, since user preferences are generally identified with particular musical genres. At the same time, genre is often a subjective and imprecisely defined attribute, especially in overlapping cases such as techno vs. electronic or rock vs. alternative rock. Therefore, content-based automatic classification of music can alleviate the need for each user to tediously and carefully label their complete music collection.
Some reasonably successful approaches to genre classification with machine learning discard the time information. They model songs as sets of i.i.d. feature vectors and classify each of these vectors individually. Finally, the genre receiving the majority of votes among a song's vectors determines its overall classification (Fu et al., 2011).
Our previous work (García-García et al., 2010) points out that features exploiting the time dynamics of songs through sequential modeling can significantly improve the genre classification rate. These features come from the transition matrix induced by each song in a common hidden Markov model that represents the complete song collection. This paper extends that study in the following directions:
- Analyse the impact of learning the hidden states from a global model trained with songs from all genres versus from individual models trained only with songs from a given genre. This is critical for the trade-off between the scalability of the training and the accuracy of the final model.
- Study which features are relevant for the genre classification task: on the one hand, (García-García et al., 2010) shows that genre is captured in each song's profile of transitions across the hidden states; on the other hand, common bag-of-features representations (Fu et al., 2011) look at the frequency of permanence in each hidden state.
The experimental section of the paper gives some insight into the advantages and differences of these two sets of features.
The remainder of the paper is organized as follows: Section 2 briefly reviews some background material, including HMMs, used to model the song collection, and Support Vector Machines (SVMs), which serve as the final classifier. Section 3 describes in detail the genre classification scheme with all the analyzed alternatives. Section 4 illustrates the capabilities of each set of features with some experiments on a real dataset with four genres of different a priori separability: classical, punk, heavy metal and reggae. Finally, Section 5 draws the main conclusions of this work and suggests some lines for future research.
2 BACKGROUND
In this section we briefly review the basic background needed to understand the genre classification method. We focus on HMMs, the core of the sequential processing; a recent and probably not widely known metric for sequences based on the transition matrices of HMMs, presented in (García-García et al., 2011); and SVMs, the final classifiers.
2.1 Hidden Markov Models
Hidden Markov models (HMMs) (Rabiner, 1989)
are a type of parametric, discrete state-space model
widely used in applications concerning sequential
data. Their main assumptions are the independence
of the observations given the hidden states and that
these states follow a Markov chain.
Consider a sequence S of T observation vectors, S = {x_1, …, x_T}. The HMM assumes that x_t, the t-th observation of the sequence, is generated according to the conditional emission density p(x_t | q_t), with q_t being the hidden state at time t. The state q_t can take values from a discrete set {s_1, …, s_K} of size K. The hidden states evolve following a time-homogeneous first-order Markov chain, so that p(q_t | q_{t−1}, q_{t−2}, …, q_0) = p(q_t | q_{t−1}).
An HMM is completely defined in terms of the following distributions:
- The initial probability vector π = {π_i}_{i=1}^{K}, where π_i = p(q_0 = s_i).
- The state transition probabilities, encoded in a matrix A = {a_{ij}}_{i,j=1}^{K} with a_{ij} = p(q_{t+1} = s_j | q_t = s_i), 1 ≤ i, j ≤ K.
- The emission pdf of each hidden state, p(x_t | q_t = s_i), 1 ≤ i ≤ K.
From these definitions, the likelihood of a sequence S = {x_1, …, x_T} can be written in the following factorized way:

    p(S | θ) = Σ_{q_0,…,q_T} π_{q_0} p(x_0 | q_0) ∏_{t=1}^{T} p(x_t | q_t) a_{q_{t−1}, q_t}.   (1)
Training this kind of model in a maximum likelihood setting is usually accomplished using the Baum-Welch method (Rabiner, 1989), a particularization of the well-known EM algorithm. The E-step finds the expected state occupancy and transition probabilities, which can be done efficiently using the forward-backward algorithm (Rabiner, 1989). Then, the M-step updates the parameters to maximize the likelihood given the expected hidden state sequence. These two steps are iterated until convergence. It is worth noting that the likelihood function can have many local maxima, and this algorithm does not guarantee convergence to the global optimum. Because of this, it is common practice to repeat the training several times using different initializations and then select the run providing the largest likelihood.
The forward-backward algorithm involves the calculation of both the forward (α) and backward (β) variables, defined as follows:

    α_k(t) = p(x_1, …, x_t, q_t = s_k)   (2)
    β_k(t) = p(x_{t+1}, …, x_T | q_t = s_k).   (3)

These variables can be obtained in O(K²T) time through a recursive procedure and can be used to rewrite the likelihood from Eq. (1) in the following manner:

    p(S | θ) = Σ_{k=1}^{K} α_k(t) β_k(t),   (4)

which holds for all values of t ∈ {1, …, T}.
Given a previously estimated A, the state transition probabilities can be updated using the forward/backward variables and that previous estimate, yielding:

    ã_{ij} ∝ Σ_{t′=1}^{T−1} α_i(t′) a_{ij} p(x_{t′+1} | q_{t′+1} = s_j) β_j(t′ + 1).   (5)
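
As an illustration, the recursions behind Eqs. (2)-(5) can be sketched in a few lines of numpy. This is an unscaled, didactic version (variable names are ours); practical implementations rescale α and β at every step to avoid numerical underflow.

    import numpy as np

    def forward_backward(pi, A, B):
        """Unscaled forward/backward recursions for one sequence.
        pi: (K,) initial probabilities; A: (K, K) transition matrix;
        B:  (T, K) emission likelihoods, B[t, k] = p(x_t | q_t = s_k).
        Returns alpha[t, k] = p(x_1..x_t, q_t = s_k) and
        beta[t, k] = p(x_{t+1}..x_T | q_t = s_k)."""
        T, K = B.shape
        alpha = np.zeros((T, K))
        beta = np.zeros((T, K))
        alpha[0] = pi * B[0]
        for t in range(1, T):
            alpha[t] = (alpha[t - 1] @ A) * B[t]
        beta[-1] = 1.0
        for t in range(T - 2, -1, -1):
            beta[t] = A @ (B[t + 1] * beta[t + 1])
        # Eq. (4): (alpha[t] * beta[t]).sum() equals p(S|theta) for any t.
        return alpha, beta

    def induced_transitions(pi, A, B):
        """Single M-step update of the transition matrix (Eq. (5))."""
        alpha, beta = forward_backward(pi, A, B)
        T = B.shape[0]
        xi = np.zeros_like(A)
        for t in range(T - 1):
            # xi[i, j] accumulates alpha_i(t) * a_ij * p(x_{t+1}|s_j) * beta_j(t+1)
            xi += np.outer(alpha[t], B[t + 1] * beta[t + 1]) * A
        return xi / xi.sum(axis=1, keepdims=True)  # row-normalize into probabilities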
2.2 State-space Dynamics Metric for Sequences

Sometimes the relevant information to be extracted from sequences modeled with HMMs does not lie in how often each state is visited, but in the visiting pattern: which hidden states usually precede each state, and which are the most probable next states. The State-Space Dynamics (SSD) metric (García-García et al., 2011) is aimed at capturing such information.
Let us assume we have an HMM with K states, Θ = {π, A, p(x_t | q_t = s_i)}, that models the complete set of training sequences (songs in our case). From the transition matrix A we obtain, for each particular sequence S_n, an induced transition matrix Ã_n by running a single M-step of the forward-backward algorithm (Equation (5) with α_i(t′) and β_i(t′) particularized for sequence S_n). The SSD metric expresses the similarity between two sequences S_n and S_m as a distance between their induced matrices Ã_n and Ã_m. For this purpose, each row a^n_k of Ã_n is regarded as a discrete probability distribution over the transitions from the k-th hidden state to the other states in S_n. Therefore, one can compute the similarity between rows corresponding to the same hidden state through any divergence between discrete probability distributions. In this paper we adopt the Bhattacharyya affinity (Bhattacharyya, 1943):

    D_B(a^n_k, a^m_k) = Σ_{i=1}^{K} √(a^n_{ki} a^m_{ki})   (6)
The distance between Ã_n and Ã_m is computed from the mean affinity between their rows as follows:

    d_{nm} = −log( (1/K) Σ_{k=1}^{K} D_B(a^n_k, a^m_k) )   (7)

This distance can be further transformed into a scale-sensitive kernel for songs by exponentiation:

    κ(S_n, S_m) = exp(−γ d_{nm})   (8)

where γ is a scale parameter that has to be either fixed using domain knowledge or cross-validated.
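
A minimal numpy sketch of the SSD kernel of Eqs. (6)-(8), assuming the induced row-stochastic matrices Ã_n have already been computed (for instance with one M-step per song, as sketched in Section 2.1):

    import numpy as np

    def ssd_kernel(A_list, gamma):
        """SSD kernel between songs from their induced transition matrices.
        A_list: list of row-stochastic (K, K) matrices, one per song."""
        n = len(A_list)
        D = np.zeros((n, n))
        for a in range(n):
            for b in range(n):
                # Eq. (6): Bhattacharyya affinity between matching rows,
                # Eq. (7): minus log of the mean affinity over the K states.
                aff = np.sqrt(A_list[a] * A_list[b]).sum(axis=1).mean()
                D[a, b] = -np.log(aff)
        return np.exp(-gamma * D)  # Eq. (8)

Note that two identical matrices give an affinity of 1 for every row, hence a distance of 0 and a kernel value of 1, as expected of a similarity measure.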
2.3 Support Vector Machines
The supervised genre classifier is based on Support
Vector Machines (SVM) (Boser et al., 1992) endowed
with the kernel matrices that incorporate similarities
between sequences. The multiclass classifier is im-
plemented by a pool of one-versus-all binary SVMs
(Rifkin and Klautau, 2004), each one trained to dis-
criminate between one of the genres and the rest.
Given a kernel function on songs, κ(S_1, S_2), for each genre we wish to construct a scoring function f_c(S) that takes highly positive values (greater than one) when S is a positive example of genre c, and highly negative values otherwise. This scoring function is

    f_c(S) = Σ_{i=1}^{l} y^c_i α^c_i κ(S_i, S)   (9)

where {S_i, y^c_i}_{i=1}^{l} are the song/label pairs in the training set. Label y^c_i ∈ {−1, 1} marks S_i as a positive or negative example of genre c. The classifier is then defined by the weights α^c_i, which result from the following optimization:
    max_{α^c_1,…,α^c_l}  Σ_{i=1}^{l} α^c_i − (1/2) Σ_{i,j=1}^{l} y^c_i y^c_j α^c_i α^c_j κ(S_i, S_j)

    subject to 0 ≤ α^c_i ≤ C,  i = 1, …, l   (10)

where C is a regularization parameter that has to be fixed using prior knowledge or cross-validated.
After all the scoring functions are determined, the overall classification consists in assigning each song to the genre achieving the highest value of its scoring function:

    ĉ(S) = argmax_c f_c(S).
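
This scheme is straightforward to sketch with scikit-learn's support for precomputed kernels. In the snippet below, K_train (n×n) and K_test (n_test×n) are hypothetical kernel matrices between songs, built e.g. from Eq. (8), and labels is an array of genre names:

    import numpy as np
    from sklearn.svm import SVC

    def train_one_vs_all(K_train, labels, genres, C=1.0):
        """One one-vs-all SVM per genre on a precomputed song kernel."""
        svms = {}
        for g in genres:
            y = np.where(labels == g, 1, -1)  # +1 for genre g, -1 for the rest
            svms[g] = SVC(kernel="precomputed", C=C).fit(K_train, y)
        return svms

    def predict_genres(svms, K_test, genres):
        """Assign each test song to the genre with the highest score f_c(S)."""
        scores = np.column_stack([svms[g].decision_function(K_test)
                                  for g in genres])
        return [genres[i] for i in scores.argmax(axis=1)]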
3 GENRE CLASSIFICATION
SYSTEM BASED ON
DYNAMICAL FEATURES
The musical genre classification system studied in this work is a generalization of that presented in (García-García et al., 2010). It consists of a multiclass pool of one-vs-all SVMs (see Section 2.3) endowed with a kernel that incorporates dynamic features of the songs relevant to genre identification. We use a three-level feature extraction scheme to capture such information. The first two levels are based on the song representation of (Meng et al., 2007), which provides features describing intervals of 1-2 seconds. These subsong features are then completed with dynamical features extracted from the HMMs (Section 2.1). It is this third level that actually captures the information relevant for genre identification.
3.1 Subsong Level Features
Following (Meng et al., 2007), audio features at two different time levels are extracted from each song (a sketch of this two-stage pipeline is given after this list):
- Short-time Feature Extraction. First, MFCCs are extracted over overlapping windows of short duration. These parameters were originally developed for automatic speech recognition tasks, but they have also been extensively applied to Music Information Retrieval (MIR) tasks (Sigurdsson et al., 2006) with generally good results.
In this work we follow (Sigurdsson et al., 2006), using a bank of 30 filters and keeping just the first 6 coefficients (however, the very first coefficient, which is associated with the perceptual dimension of loudness, is discarded (Meng et al., 2007)). The window size and hop size have been fixed to 30
and 15 ms, respectively. Therefore, a fragment of
60 seconds of music is represented by a 4000 × 6
matrix after MFCC feature extraction.
- Temporal Feature Integration. It is well known that the direct use of MFCCs does not provide an adequate representation for music genre identification. Thus, a time integration process based on a Multivariate Autoregressive (MAR) model (Meng et al., 2007) recovers more relevant information. For a set of consecutive MFCC vectors, we fit a MAR model of lag three:

    z_j = Σ_{p=1}^{3} B_p z_{j−p} + e_j,

where z_j are the MFCCs extracted at the j-th window, e_j is the prediction error, and B_p are the model parameters. The values of the matrices B_p, p = 1, …, 3, together with the mean and covariance of the residuals e_j, are concatenated into a single 135 × 1 feature vector (MAR vector). For this temporal integration phase, we have considered a window size and hop size of 2 and 1 seconds, respectively. Thus, an audio fragment of 60 seconds is represented by a matrix of size 60 × 135 after time integration.
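
A rough sketch of this two-stage front end is given below, using librosa for the MFCC extraction with the settings above. The least-squares MAR fit is only indicative: the exact estimator of (Meng et al., 2007) may differ in details such as the handling of the intercept term.

    import numpy as np
    import librosa

    def mfcc_frames(y, sr):
        """30-filter bank, keep coefficients 2-7 (the first, loudness-related
        coefficient is discarded); 30 ms windows with a 15 ms hop."""
        m = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=7, n_mels=30,
                                 win_length=int(0.030 * sr),
                                 hop_length=int(0.015 * sr))
        return m[1:].T  # (n_frames, 6)

    def mar_vector(Z, lag=3):
        """Fit a lag-3 MAR model to a 2 s block of MFCC frames Z (n x 6) and
        return the 135-dim MAR vector (3*36 coefficients + 6 means + 21
        upper-triangular covariances)."""
        n, d = Z.shape
        X = np.hstack([Z[lag - p: n - p] for p in range(1, lag + 1)])  # past frames
        Y = Z[lag:]
        B, *_ = np.linalg.lstsq(X, Y, rcond=None)  # stacked B_p, least squares
        E = Y - X @ B                              # residuals e_j
        C = np.cov(E, rowvar=False)
        iu = np.triu_indices(d)                    # 21 upper-triangular entries
        return np.concatenate([B.ravel(), E.mean(axis=0), C[iu]])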
3.2 Song Level Dynamical Features
This section collects the main contributions of this paper. The key component of the classification system is the set of song level features, which determines the information fed into the classification stage in the form of a kernel for songs.
Our previous work (García-García et al., 2010) shows that the time evolution of the MAR coefficients is quite relevant for determining the genre. That work proposed learning a common HMM from all the training songs and using the SSD metric (Section 2.2) to characterize each song by its transition profile across all the hidden states. Such a strategy yields significantly better classification rates than ignoring the time dynamics or using the steady-state probability of each song visiting each state (computed by considering (Ã_n)^∞ instead of Ã_n as the induced transition matrix for song S_n).
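
For reference, the steady-state occupancy just mentioned is the stationary distribution of Ã_n, i.e., the common row of (Ã_n)^∞ for an ergodic chain. A minimal numpy sketch:

    import numpy as np

    def stationary_distribution(A):
        """Stationary distribution pi of a row-stochastic matrix A (pi = pi A),
        obtained as the left eigenvector associated with eigenvalue 1."""
        w, v = np.linalg.eig(A.T)
        pi = np.abs(np.real(v[:, np.argmin(np.abs(w - 1.0))]))
        return pi / pi.sum()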
From this previous experience we identify two critical elements that determine the quality of the genre classification:
- What is the best way to define the hidden states?
- Which information encoded in the HMMs is actually relevant for the genre discrimination task?
With respect to the first question, (García-García et al., 2010) points out that a single HMM with a sufficiently large number of hidden states yields a quite useful set of hidden states. The alternative would be to learn different hidden states for each genre and merge them into a common model. The former guarantees a larger number of examples to train each hidden state, whilst the latter indirectly helps genre discrimination, since hidden states learned from different genres will be more clearly separated.
With respect to the second question, this paper proposes a third alternative between the transition profile explored in (García-García et al., 2010) and the stationary frequency of each hidden state: a dynamically computed bag-of-acoustic-words song representation. In this approach, each hidden state is considered an acoustic word, analogous to the role of words in the bag-of-words approach to parameterizing document collections (Fu et al., 2011). The word frequencies are computed dynamically by evolving the song through the common HMM. The bags of words are fed into the SVM through a standard Gaussian RBF kernel

    κ(S_n, S_m) = exp(−γ ‖z_n − z_m‖²)

where z_n and z_m are the bags of words corresponding to songs S_n and S_m.
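
One plausible realization of this computation, assuming the "dynamic" word frequencies are taken as the time-averaged state occupancy posteriors γ_k(t) ∝ α_k(t)β_k(t) obtained from the forward-backward variables of Section 2.1:

    import numpy as np

    def dynamic_bow(alpha, beta):
        """Bag of acoustic words: average posterior occupancy of each hidden
        state, with gamma[t, k] = p(q_t = s_k | S) from forward/backward."""
        gamma = alpha * beta
        gamma /= gamma.sum(axis=1, keepdims=True)
        return gamma.mean(axis=0)  # (K,) word frequencies z_n

    def bow_kernel(Z, gamma_rbf):
        """Gaussian RBF kernel between bag-of-words vectors (rows of Z)."""
        sq = ((Z[:, None, :] - Z[None, :, :]) ** 2).sum(axis=-1)
        return np.exp(-gamma_rbf * sq)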
We aim at answering these questions by studying the impact on genre discrimination accuracy of the following five sets of song level features:
- 1HMM+SSD. Learn a single HMM from all the songs and use the SSD metric to form the kernel for the SVM. This is the approach of (García-García et al., 2010).
- 4HMM+SSD. Learn a separate HMM from the training songs of each genre. Merge their states into a single HMM and use all the songs to learn the transition matrix and initial state probabilities. Then form the SVM kernel with the SSD metric in the common HMM (whose hidden states were learned independently). A sketch of this merging step is given after this list.
- 1HMM+BoW. Learn a single HMM from all the songs as in 1HMM+SSD, but instead of the SSD metric, use the dynamically computed bag of acoustic words as features for the SVM.
- 4HMM+BoW. Learn the hidden states separately as in 4HMM+SSD and replace the SSD metric with the dynamically computed bag of acoustic words.
- 4HMM+4BoW. Learn one complete independent HMM per genre (i.e., the hidden states are not shared and the transition probabilities are also learnt independently for each genre). The bag of acoustic words passed to the classification stage results from the concatenation of the bags of words from all the models. Note that when one learns an independent HMM per genre, the kernel based on transition matrices is pointless.
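
The state-merging step of 4HMM+SSD and 4HMM+BoW could be implemented, for instance, with the hmmlearn package, freezing the emission parameters of the merged states and letting EM re-estimate only the initial and transition probabilities. The sketch below assumes per-genre models trained with diagonal covariances (the paper itself uses spherical ones) and follows hmmlearn's documented attribute conventions:

    import numpy as np
    from hmmlearn.hmm import GaussianHMM

    def merge_genre_hmms(genre_models, X_all, lengths_all):
        """Stack the states of per-genre HMMs into one common HMM and
        re-learn only its transition matrix and initial probabilities."""
        means = np.vstack([m.means_ for m in genre_models])        # (24, 135)
        covars = np.vstack([np.diagonal(m.covars_, axis1=1, axis2=2)
                            for m in genre_models])                # (24, 135)
        K = means.shape[0]
        common = GaussianHMM(n_components=K, covariance_type="diag",
                             params="st", init_params="st")  # EM touches startprob/transmat only
        common.means_ = means
        common.covars_ = covars
        common.fit(X_all, lengths_all)  # Baum-Welch with frozen emissions
        return common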
4 EXPERIMENTS
The ability of the song level features presented in Section 3.2 to discriminate musical genre is evaluated in the following classification task. We use a subset of the garageband dataset described in (Arenas-García et al., 2007). The dataset consists of 60-second snippets of songs downloaded (in November 2005) from the online music site http://www.garageband.com. The songs are in MP3 format and belong to different genres. For the experiments we consider a simplified problem where the goal is to discriminate between four genres: "Punk", "Heavy Metal", "Classical", and "Reggae". The dataset includes genres that are a priori hard to distinguish, like Punk and Heavy Metal, plus others that are easily separated. Each genre is represented by a subset of 300 songs. MFCC and MAR extraction proceed as described in Section 3.1. For completeness, we have included in the comparison the results of a classifier that assigns each song to the genre whose HMM yields the maximum likelihood (with no SVM as final classifier). This baseline classifier is named 4HMM in the tables; a sketch of it follows.
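
A minimal sketch of this 4HMM baseline with hmmlearn, where train_seqs is a hypothetical mapping from genre to the list of per-song MAR sequences:

    import numpy as np
    from hmmlearn.hmm import GaussianHMM

    genres = ["classical", "punk", "reggae", "heavy"]
    models = {}
    for g in genres:
        X = np.vstack(train_seqs[g])               # stacked MAR vectors, one row per window
        lengths = [len(s) for s in train_seqs[g]]  # per-song sequence lengths
        models[g] = GaussianHMM(n_components=6,
                                covariance_type="spherical").fit(X, lengths)

    def classify_4hmm(song):
        """Assign the song to the genre whose HMM gives the highest log-likelihood."""
        return max(genres, key=lambda g: models[g].score(song))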
For each experiment we adopted repeated random sub-sampling validation as our evaluation scheme. The training and test subsets are composed of 175 and 25 songs, respectively. The hyperparameters γ and C of the SVMs are determined by 5-fold cross-validation over the training set. The presented results correspond to the average over 10 different random training/test partitions.
In order to ensure a fair comparison, we have chosen the numbers of hidden states for the HMMs so that the resulting hidden state spaces have the same size. Thus, the single HMMs trained with all the songs are endowed with 24 hidden states, whilst the independent HMMs trained only with the songs of a single genre have 6 hidden states each. The emission densities are Gaussians with spherical covariance.
Table 1 shows the average accuracy achieved by each feature set together with the standard deviation. The best average performance is obtained by the 4HMM+BoW features, although the higher stability (smaller standard deviation) of the SSD-based features is remarkable. In fact, the performance of the 1HMM+SSD features is almost as good, in spite of the emission pdfs being learned from all the songs. It seems that the SSD's focus on the transition probabilities compensates for the less discriminative hidden states. The worse performance of 4HMM+4BoW brings out the advantage of jointly learning the transition probabilities.
The individual confusion matrices corresponding to each feature set, shown in Tables 2–7, support a more detailed genre-wise discussion. Moreover, Figure 1 shows a Hinton plot of the average occupancy frequency of each hidden state for each genre in the 4HMM cases. The bigger the rectangle, the more frequently that state is visited by the songs belonging to that genre. States are sorted according to their HMM (states 1–6 come from the Classical HMM, states 7–12 from the Punk HMM, states 13–18 from the Reggae HMM, and states 19–24 from the Heavy Metal HMM). Finally, Figure 2 shows the Hinton plot of the average transition matrices for the four genres. The bigger the rectangle in position (i, j), the more probable the transition from state s_i to state s_j. The states follow the same order as in Figure 1.
Table 1: Comparison among all the strategies on the garageband dataset with the same experimental setup.

Strategy       Accuracy
4HMM + 4BoW    65.0 ± 3.10 %
1HMM + SSD     75.30 ± 0.04 %
4HMM + BoW     78.0 ± 3.70 %
4HMM + SSD     72.20 ± 0.03 %
1HMM + BoW     71.40 ± 3.40 %
4HMM           69.5 ± 3.7 %
Table 2: Confusion matrix for 4HMM + 4BoW.
Classical Punk Reggae Heavy
Classical 0.70 0.12 0.12 0.06
Punk 0.06 0.57 0.01 0.36
Reggae 0.05 0.02 0.79 0.14
Heavy 0.02 0.36 0.08 0.54
[Figure 1: Hinton diagram of the visiting frequency of each hidden state (x-axis: states 1-24) for the songs of each genre (y-axis: Classical, Punk, Reggae, Heavy).]
Classical is the easiest genre to discriminate, regardless of the feature set. Figures 1 and 2 show that this genre occupies states separate from those of the rest.
[Figure 2: Hinton diagrams of the transition matrices (state vs. state, 24 × 24) for the songs of each genre; panels: Classical songs, Punk songs, Reggae songs, Heavy Metal songs.]
Reggae is also easy to discriminate, although there is greater overlap with the Punk and Heavy Metal states.
With respect to Heavy Metal and Punk, there is greater overlap in their bags of words; therefore, the independent learning of the hidden states followed by 4HMM offers a certain advantage. In the case of Punk, Figure 1 shows enough spatial separability from the Heavy Metal states, so 4HMM+BoW yields better performance than 4HMM+SSD. However, the Heavy Metal transition matrices are more distinctive than the Punk ones, as shown in Figure 2, making the SSD kernel better suited than the BoW for their separation.
Table 3: Confusion matrix for 1HMM + SSD.
Classical Punk Reggae Heavy
Classical 0.89 0.05 0.05 0.01
Punk 0.04 0.70 0.02 0.24
Reggae 0.04 0.06 0.84 0.06
Heavy 0.03 0.32 0.06 0.59
5 CONCLUSIONS
This paper has studied the suitability of several feature sets, extracted from an HMM-based dynamical model of a song collection, for discriminating musical genre.
Table 4: Confusion matrix for 4HMM + BoW.
Classical Punk Reggae Heavy
Classical 0.88 0.05 0.06 0.01
Punk 0.04 0.78 0.03 0.15
Reggae 0.04 0.07 0.84 0.05
Heavy 0.01 0.30 0.07 0.62
Table 5: Confusion matrix for 4HMM + SSD.
Classical Punk Reggae Heavy
Classical 0.82 0.09 0.05 0.04
Punk 0.05 0.67 0.04 0.24
Reggae 0.06 0.10 0.76 0.08
Heavy 0.02 0.26 0.08 0.64
Table 6: Confusion matrix for 1HMM + BoW.
Classical Punk Reggae Heavy
Classical 0.87 0.04 0.04 0.05
Punk 0.03 0.69 0.04 0.24
Reggae 0.04 0.03 0.80 0.13
Heavy 0.01 0.42 0.08 0.49
The best classification rates are obtained when the hidden states of the model are learned independently for each genre but then merged into a single overall HMM, where the probabilities of transition between any pair of states are more precisely acquired.
Table 7: Confusion matrix for 4HMM.
Classical Punk Reggae Heavy
Classical 0.71 0.21 0.02 0.06
Punk 0.03 0.82 0.01 0.14
Reggae 0.01 0.22 0.63 0.14
Heavy 0.00 0.38 0.00 0.62
These transition probabilities carry relevant information for the genre discrimination task, as evidenced by the good results achieved by the SSD kernel when the states are learned in a common model. In this sense, this information somehow compensates for the lack of discriminative learning of the hidden states.
Future work will focus on the extension to more musical genres and to other families of dynamical models beyond HMMs. Another interesting line of research is the combination of the features related to the frequency of hidden state occupancy with those related to the dynamics of the transitions between hidden states, in a multiple view learning framework.
ACKNOWLEDGEMENTS
This work has been partially supported by the Re-
gional Government of Madrid through grant CCG10-
UC3M/TIC-5511 and by the IST Programme of the
European Community under the PASCAL2 Network
of Excellence IST-2007-216886.
REFERENCES
Arenas-García, J., Parrado-Hernández, E., Meng, A., Hansen, L. K., and Larsen, J. (2007). Discovering music structure via similarity fusion. In Music, Brain and Cognition Workshop, NIPS'07.
Bhattacharyya, A. (1943). On a measure of divergence between two statistical populations defined by their probability distributions. Bull. Calcutta Math. Soc.
Boser, B. E., Guyon, I. M., and Vapnik, V. N. (1992). A
training algorithm for optimal margin classifiers. In
Proceedings of the fifth annual workshop on Compu-
tational learning theory, COLT ’92, pages 144–152.
Fu, Z., Lu, G., Ting, K. M., and Zhang, D. (2011). Music classification via the bag-of-features approach. Pattern Recognition Letters, 32(14):1768–1777.
García-García, D., Arenas-García, J., Parrado-Hernández, E., and Díaz-de-María, F. (2010). Music genre classification using the temporal structure of songs. In Machine Learning for Signal Processing (MLSP), 2010 IEEE International Workshop on, pages 266–271.
García-García, D., Parrado-Hernández, E., and Díaz-de-María, F. (2011). State-space dynamics distance for clustering sequential data. Pattern Recognition, 44:1014–1022.
Guaus, E. (2009). Audio content processing for automatic
music genre classification: descriptors, databases,
and classifiers. PhD thesis, Universitat Pompeu Fabra,
Spain.
McKinney, M. and Breebaart, J. (2003). Features for audio and music classification. In Proceedings of the International Symposium on Music Information Retrieval, pages 151–158.
Meng, A., Ahrendt, P., Larsen, J., and Hansen, L. (2007).
Temporal feature integration for music genre classi-
fication. Audio, Speech, and Language Processing,
IEEE Transactions on, 15(5):1654 –1664.
Rabiner, L. R. (1989). A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE, 77(2):257–286.
Rifkin, R. and Klautau, A. (2004). In defense of one-vs-all
classification. J. Mach. Learn. Res., 5:101–141.
Sigurdsson, S., Petersen, K. B., and Lehn-Schiøler, T. (2006). Mel frequency cepstral coefficients: An evaluation of robustness of MP3 encoded music. In Proceedings of the Seventh International Conference on Music Information Retrieval (ISMIR), pages 286–289.
Tzanetakis, G. and Cook, P. (2002). Musical genre classifi-
cation of audio signals. Speech and Audio Processing,
IEEE Transactions on, 10(5):293 – 302.