language
(Knees et al., 2000). Another work, described in
(Harb and Chen, 2003), is based on an automatic
segmentation of the soundtrack into music or
speech, using a sentence segmentation technique.
The music segments are indexed in a way that
allows search by similarity. Other works that
classify songs according to their mood are
described in (Dang and Shirai, 2009; Kanters, 2009;
Laurier et al., 2008). Classification according to
mood does not seem well suited to a music search
engine, because mood is subjective metadata and
the words of songs are short and contain many
metaphors that only humans can readily understand.
In this work, we introduce a new dimension of
classification that takes into account contextual
information about the artist. Each artist tends to
sing songs with a specific emotion: for example,
Eric Clapton often sings sad songs, whereas Bob
Marley tends to sing happy songs.
3 CONSTRUCTION OF TRAINING DATA
In this section we describe how we prepare our
training data: a collection of songs, each tagged with
a theme and described by its title and artist features.
We use the large blog site LiveJournal
(www.livejournal.com); each blog entry is labeled
with the theme of the song, as given by the song's
title. The song title and artist features can be
obtained by simple string matching against an artist
database built from the open artist data of the music
site (www.musicmoz.org). The lyrics may be obtained
from the site (www.lyrics.com).
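
The string-matching step can be illustrated with a minimal sketch. It assumes the artist database is available as a plain list of names and that a blog entry is a free-text string; the entry format, the `ARTISTS` list and the function names below are hypothetical and are not taken from LiveJournal or MusicMoz.

```python
import re

def normalize(text):
    """Lowercase and collapse punctuation/whitespace so matching is tolerant."""
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()

def find_artist(entry_text, artists):
    """Return the first known artist whose normalized name occurs in the entry.

    Longer names are tried first so that a more specific name wins over a
    shorter name it contains. Returns None when no artist matches.
    """
    padded = f" {normalize(entry_text)} "
    for name in sorted(artists, key=len, reverse=True):
        key = normalize(name)
        # Match on whole words only; skip names that normalize to nothing.
        if key and f" {key} " in padded:
            return name
    return None

# Hypothetical artist list standing in for the artist database.
ARTISTS = ["Eric Clapton", "Bob Marley"]

# Hypothetical blog entry text.
entry = "Listening to: Eric Clapton - Tears in Heaven"
matched = find_artist(entry, ARTISTS)
```

Matching on normalized whole words avoids spurious hits inside longer words, at the cost of missing misspelled artist names.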
4 SONG CATEGORIZATION
Research in the field of automatic categorization
remains relevant today, since the results can still
be improved. For some tasks, automatic classifiers
perform almost as well as humans, but for others
the gap remains large. At first glance, the main
problem is easy to grasp. On one hand, we are
dealing with a bank of songs and, on the other, with
a set of categories. The goal is to build a computer
application that can determine which category a
song belongs to, based on its contents. The set of
categories is determined in advance. The problem is
to group the songs by their similarity. There are two
approaches to solving the problem of song
categorization, using either acoustic information or
verbal information. In this paper we focus on the
words of the song title to determine its theme, and
on the characteristics of the artist to determine the
kind of music.
The categorization process includes the
construction of a prediction model that receives the
title of the song as input and associates it with one
or more labels as output.
Prior coding of the songs is necessary because there
is currently no learning method that can directly
handle unstructured data, either in the model
construction stage or when the model is used for
classification.
For most learning methods, all texts must be
converted into an "individuals-variables" table.
In song categorization, we transform the title of
the song into a vector
$d_j = (w_{1j}, w_{2j}, \ldots, w_{|T|j})$,
where $T$ is the set of terms (descriptors) that appear
at least once in the training corpus (the collection).
The weight $w_{kj}$ corresponds to the contribution of
term $t_k$ to the semantics of the song title $d_j$.
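
As an illustration of this representation, the following minimal sketch builds the term set T from a few titles and maps one title to its vector. The titles are made up, tokenization by whitespace is an assumption, and raw term counts are used only as the simplest possible stand-in for the weights.

```python
from collections import Counter

def tokenize(title):
    """Split a title into lowercase terms (whitespace tokenization is assumed)."""
    return title.lower().split()

def build_vocabulary(titles):
    """T: sorted list of the terms that occur at least once in the titles."""
    return sorted({term for title in titles for term in tokenize(title)})

def to_vector(title, vocabulary):
    """d_j as a list with one coordinate per term of T (raw counts as weights)."""
    counts = Counter(tokenize(title))
    return [counts[term] for term in vocabulary]

# Made-up training titles, for illustration only.
titles = ["tears in heaven", "one love", "love in vain"]
T = build_vocabulary(titles)
d_0 = to_vector(titles[0], T)
```

Every title is mapped to a vector of the same length |T|, which gives the fixed-width table of individuals and variables that most learning methods expect.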
Once we have chosen the components of the vector
representing song $j$, we must decide how to
encode each coordinate of the vector $d_j$. There are
different methods for calculating the weight $w_{kj}$.
These methods are based on two observations:
The more frequently the term $t_k$ occurs in a song
title $d_j$, the more relevant it is to the subject of
this song.
The more often the term $t_k$ occurs in the collection,
the less useful it is for discriminating between songs.
The most widely used weighting schemes are term
frequency x inverse document frequency (TF-IDF)
coding and TFC coding.
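
The two codings can be sketched as follows. This is a minimal illustration, not the exact variant used in the paper: it assumes the common definitions in which the TF-IDF weight is the term count multiplied by log(N / df), and TFC is that same weight divided by the Euclidean norm of the title's vector. The example titles are made up.

```python
import math
from collections import Counter

def tfidf_weights(tokenized_titles):
    """Return one {term: weight} dict per title, with weight = tf * log(N / df)."""
    n_docs = len(tokenized_titles)
    # Document frequency: number of titles in which each term appears.
    df = Counter(term for tokens in tokenized_titles for term in set(tokens))
    weights = []
    for tokens in tokenized_titles:
        tf = Counter(tokens)
        weights.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return weights

def tfc_weights(tokenized_titles):
    """TFC (assumed definition): TF-IDF weights with cosine normalization."""
    result = []
    for w in tfidf_weights(tokenized_titles):
        norm = math.sqrt(sum(v * v for v in w.values()))
        # A title whose terms all occur in every title has norm 0.
        result.append({t: (v / norm if norm else 0.0) for t, v in w.items()})
    return result

# Made-up tokenized titles, for illustration only.
titles = [["one", "love"], ["love", "in", "vain"], ["tears", "in", "heaven"]]
weights = tfc_weights(titles)
```

In the first title, "one" occurs in a single title while "love" occurs in two, so "one" receives the larger weight. The normalization in TFC keeps longer titles from receiving larger weights simply because they contain more terms.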
5 NAIVE BAYES ALGORITHM
In machine learning, different types of classifiers
have been developed to achieve the highest possible
precision and efficiency, each with its advantages
and disadvantages. However, they share common
characteristics (Sebastiani, 2002).
The naive Bayes classifier is one of the most
commonly used algorithms. When we apply naive
Bayes to a song categorization task, we look for the
classification that maximizes the probability of
observing the words of the song titles.
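
A minimal sketch of such a classifier over song titles is given below. It is an illustration under stated assumptions rather than the implementation used in this work: it uses the multinomial form of naive Bayes, add-one (Laplace) smoothing, and log probabilities to avoid numerical underflow, none of which is specified in the text. The class name, titles and labels are made up.

```python
import math
from collections import Counter, defaultdict

class NaiveBayesTitles:
    """Multinomial naive Bayes over title words (assumed variant)."""

    def fit(self, titles, labels):
        self.class_counts = Counter(labels)       # training songs per category
        self.word_counts = defaultdict(Counter)   # word counts per category
        for title, label in zip(titles, labels):
            self.word_counts[label].update(title.lower().split())
        self.vocab = {w for c in self.word_counts.values() for w in c}
        self.n_docs = len(labels)
        return self

    def log_posteriors(self, title):
        """Unnormalized log P(category | title) for every category."""
        scores = {}
        for c, n_c in self.class_counts.items():
            total = sum(self.word_counts[c].values())
            score = math.log(n_c / self.n_docs)   # log prior
            for w in title.lower().split():
                if w in self.vocab:               # ignore words unseen in training
                    # Add-one smoothing (an assumption) avoids zero probabilities.
                    p = (self.word_counts[c][w] + 1) / (total + len(self.vocab))
                    score += math.log(p)
            scores[c] = score
        return scores

    def predict(self, title):
        scores = self.log_posteriors(title)
        return max(scores, key=scores.get)

# Made-up training data, for illustration only.
train_titles = ["tears in heaven", "lonely tears", "one love", "happy love song"]
train_labels = ["sad", "sad", "happy", "happy"]
model = NaiveBayesTitles().fit(train_titles, train_labels)
prediction = model.predict("lonely heaven")
```

The predicted category is the one with the highest score, which corresponds to choosing the classification that maximizes the probability of the observed title words.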
During the training phase, the classifier estimates,
for each category, the probability that a new song
belongs to this category, based on the proportion of
training songs belonging to it. It also estimates the
probability that a given word is present in a song
title, given that the song belongs to this category.
Then, when a new song is to be classified, we
calculate the probability that it belongs to each
class using Bayes' rule and the probabilities
KDIR 2011 - International Conference on Knowledge Discovery and Information Retrieval