and 15 ms, respectively. Therefore, a fragment of
60 seconds of music is represented by a 4000 × 6
matrix after MFCC feature extraction.
• Temporal Feature Integration. It is well known that the direct use of MFCCs does not provide an adequate representation for music genre identification. Thus, a time integration process based on a Multivariate Autoregressive (MAR) model (Meng et al., 2007) is applied to recover the temporal information that is more relevant for this task. For each block of consecutive MFCC vectors, we fit an MAR model of lag three:
z_j = \sum_{p=1}^{3} B_p z_{j-p} + e_j,
where z_j denotes the MFCCs extracted at the jth window, e_j is the prediction error, and B_p are the model parameters. The values of the matrices B_p, p = 1, ..., 3, together with the mean and covariance of the residuals e_j, are concatenated into a
135 × 1 single feature vector (MAR vector). For
this temporal integration phase, we have considered a window size and hop size of 2 and 1 seconds, respectively. Thus, an audio fragment of 60 seconds is represented by a matrix of size 60 × 135 after time integration; a sketch of this step is given below.
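The following sketch fits the MAR(3) model by ordinary least squares and packs the coefficient matrices together with the mean and the upper triangle of the residual covariance into a single vector (3·6·6 + 6 + 21 = 135 values). The least-squares estimator, the covariance packing and the frame counts for the 2 s window and 1 s hop are assumptions on our part, since the paper defers the exact estimator to (Meng et al., 2007).

```python
# Sketch of the MAR temporal integration described above (assumptions:
# plain least-squares fitting and upper-triangular covariance packing).
import numpy as np

def mar_features(mfcc, order=3):
    """mfcc: (T, 6) block of consecutive MFCC frames inside one texture window."""
    T, d = mfcc.shape
    # Regression problem z_j = sum_{p=1..3} B_p z_{j-p} + e_j.
    X = np.hstack([mfcc[order - p:T - p] for p in range(1, order + 1)])  # (T-3, 3*d)
    Y = mfcc[order:]                                                     # (T-3, d)
    B, *_ = np.linalg.lstsq(X, Y, rcond=None)        # stacked B_p coefficients
    resid = Y - X @ B                                # prediction errors e_j
    iu = np.triu_indices(d)
    return np.concatenate([B.ravel(),                          # 3*6*6 = 108 values
                           resid.mean(axis=0),                 # 6 values
                           np.cov(resid, rowvar=False)[iu]])   # 21 values -> 135 total

def song_mar_matrix(mfcc_song, win=133, hop=67):
    """Slide a 2 s window with a 1 s hop over the song's MFCC matrix
    (with a 15 ms MFCC hop, 2 s is roughly 133 frames and 1 s roughly 67)."""
    starts = range(0, mfcc_song.shape[0] - win + 1, hop)
    return np.vstack([mar_features(mfcc_song[s:s + win]) for s in starts])

# Toy example: random data standing in for the 4000 x 6 MFCC matrix of a 60 s clip.
rng = np.random.default_rng(0)
print(song_mar_matrix(rng.standard_normal((4000, 6))).shape)  # roughly (60, 135)
```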
3.2 Song Level Dynamical Features
This section collects the main contributions of this paper. The key component of the classification system is the set of song-level features, which determines the information that is fed into the classification stage in the form of a kernel for songs.
Our previous work (García-García et al., 2010)
shows that the time evolution of the MAR coefficients is highly relevant for determining the genre. This paper proposes to learn a common HMM with all the training songs and to use the SSD metric (Section 2.2) to characterize each song by its transition profile across all the hidden states. Such a strategy yields significantly better classification rates than either discarding the time-dynamics information or using the steady-state probability of each song visiting each state (computed by considering (Ã_n)^∞ instead of Ã_n as the induced transition matrix for song S_n).
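For reference, the steady-state alternative replaces each song's induced transition matrix Ã_n by its limit (Ã_n)^∞, whose rows (for an ergodic chain) all coincide with the stationary distribution. The sketch below, with a toy transition matrix, shows how that distribution can be obtained; it illustrates only this baseline, not the SSD metric itself.

```python
# Stationary distribution of a song-induced transition matrix A_n, i.e. the
# common row of (A_n)^inf for an ergodic chain (toy illustration only).
import numpy as np

def stationary_distribution(A):
    """A: (K, K) row-stochastic transition matrix induced by one song."""
    vals, vecs = np.linalg.eig(A.T)                  # left eigenvectors of A
    pi = np.real(vecs[:, np.argmax(np.real(vals))])  # eigenvector for eigenvalue ~1
    pi = np.abs(pi)
    return pi / pi.sum()                             # probability of visiting each state

# Toy 5-state example.
rng = np.random.default_rng(0)
A = rng.random((5, 5))
A /= A.sum(axis=1, keepdims=True)
print(stationary_distribution(A))
```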
From this previous experience we identify two
critical elements that determine the quality of the
genre classification:
• What is the best way to define the hidden states?
• Which information encoded in the HMMs is actu-
ally relevant for the genre discrimination task?
With respect to the first question, (García-García et al., 2010) points out that a single HMM with a sufficiently large number of hidden states already yields a useful set of hidden states. The alternative would be to
learn different hidden states for each genre and merge
them in the common model. The former guarantees a
larger number of examples to train each hidden state,
whilst the latter indirectly helps genre discrimination
since hidden states learned from different genres will
be more separated.
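The second way of defining the hidden states (per-genre learning followed by a merge into the common model) could be prototyped, for instance, with hmmlearn as sketched below: Gaussian states are learned separately for each genre, their emission parameters are stacked into one larger HMM, and only the initial-state and transition probabilities are re-estimated on all the songs. The use of hmmlearn, the function name and the number of states per genre are assumptions for illustration, not the authors' implementation.

```python
# Hypothetical sketch of merging per-genre hidden states into a common HMM.
import numpy as np
from hmmlearn.hmm import GaussianHMM

def merge_genre_states(songs_by_genre, states_per_genre=8, n_iter=20):
    """songs_by_genre: one list of (T_i, 135) MAR matrices per genre."""
    genre_models = []
    for songs in songs_by_genre:
        m = GaussianHMM(n_components=states_per_genre,
                        covariance_type="full", n_iter=n_iter)
        m.fit(np.vstack(songs), [len(s) for s in songs])   # per-genre hidden states
        genre_models.append(m)

    # Common model: emission parameters are the union of all genre states;
    # only startprob_ and transmat_ are initialized ("st") and re-estimated ("st").
    K = states_per_genre * len(genre_models)
    common = GaussianHMM(n_components=K, covariance_type="full",
                         n_iter=n_iter, init_params="st", params="st")
    common.means_ = np.vstack([m.means_ for m in genre_models])
    common.covars_ = np.concatenate([m.covars_ for m in genre_models], axis=0)

    all_songs = [s for songs in songs_by_genre for s in songs]
    common.fit(np.vstack(all_songs), [len(s) for s in all_songs])
    return common
```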
With respect to the second question, this paper proposes a third alternative lying between the transition profile explored in (García-García et al., 2010) and the stationary frequency of each hidden state: a dynamically computed bag-of-acoustic-words song representation. In this approach, each hidden state plays a role analogous to that of a word in the bag-of-words parameterization of document collections (Fu et al., 2011). The frequencies of these acoustic words are computed dynamically by evolving the song across the common HMM. The resulting bags of words are fed into the SVM through a standard Gaussian RBF kernel
\kappa(S_n, S_m) = \exp\left(-\gamma \lVert z_n - z_m \rVert^2\right)
where z_n and z_m are the bags of words corresponding to songs S_n and S_m, respectively.
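To make the bag-of-acoustic-words representation and the kernel concrete, the sketch below averages, for each song, the per-frame posterior probabilities of the hidden states of the common HMM (one plausible reading of evolving the song across the model; the authors' exact computation may differ) and plugs the resulting vectors into the Gaussian RBF kernel of an SVM. The hmm argument is assumed to be a fitted hmmlearn model such as the common model sketched earlier.

```python
# Sketch of the dynamically computed bag of acoustic words and the RBF kernel.
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics.pairwise import rbf_kernel

def bag_of_acoustic_words(hmm, song):
    """song: (T, 135) MAR matrix; returns a K-dim vector of state frequencies."""
    posteriors = hmm.predict_proba(song)     # (T, K) forward-backward posteriors
    return posteriors.mean(axis=0)           # relative frequency of each acoustic word

def train_genre_svm(hmm, train_songs, train_labels, gamma=1.0, C=1.0):
    Z = np.vstack([bag_of_acoustic_words(hmm, s) for s in train_songs])
    K_train = rbf_kernel(Z, Z, gamma=gamma)  # kappa(S_n, S_m) = exp(-gamma ||z_n - z_m||^2)
    clf = SVC(kernel="precomputed", C=C).fit(K_train, train_labels)
    return clf, Z

def predict_genres(clf, hmm, Z_train, test_songs, gamma=1.0):
    Z_test = np.vstack([bag_of_acoustic_words(hmm, s) for s in test_songs])
    return clf.predict(rbf_kernel(Z_test, Z_train, gamma=gamma))
```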
We aim to answer these questions by studying the impact of the following five sets of song-level features on the genre discrimination accuracy:
1HMM+SSD. Learn a single HMM with all the
songs and use the SSD metric to form the ker-
nel for the SVM. This is the approach of (García-García et al., 2010).
4HMM+SSD. Learn a separate HMM with the train-
ing songs of each genre. Merge their states into a single HMM and use all the songs to learn the transition matrices and initial state probabilities.
Then form the SVM kernel with the SSD metric
in the common HMM (but where the hidden states
were learned independently).
1HMM+BoW. Learn a single HMM with all the
songs as in (1HMM+SSD) but instead of the SSD
metric, use the dynamically computed bag of
acoustic words as features for the SVM.
4HMM+BoW. Learn the hidden states separately as
in (4HMM+SSD) and replace the SSD metric with
the dynamically computed bag of acoustic words.
4HMM+4BoW. Learn one independent complete
HMM per genre (i.e. the hidden states will not be
shared and the transition probabilities will also be
learned independently for each genre). The bag of
acoustic words that is passed to the classification
stage results from the concatenation of all the bags
of words from all the models. Note that when one