the contained audio words. Although the BOAW model
makes it possible to represent, index, and retrieve
songs like documents, it suffers from audio word
ambiguity and feature quantization error. These
unavoidable problems greatly decrease retrieval
precision and recall, since different features may be
quantized to the same audio word, causing many
false local matches between songs. Moreover, as the
song database to be indexed grows (e.g., beyond
one million songs), the discriminative power of
audio words decreases sharply.
To reduce the quantization error, one solution is
to exploit the location information of local features
in songs to improve retrieval precision, since the
location relationship among audio words plays a key
role in identifying songs. However, performing
location verification efficiently at large scale is
very challenging, given the expensive computational
cost of full location verification.
In this paper, we propose to address large-scale
similar song retrieval using effective music
features and location coding strategies. We define
two songs as similar when they share some similar
beat-chroma patches with the same or very similar
location relationships. Our approach is based on the
Bag-of-Audio-Words model. To improve the
discriminative power of audio words, we utilize the
beat-aligned chroma feature for codebook
construction. All Western music can be represented
through a set of 12 pitches, or semitones.
Beat-aligned chroma features record the intensity
associated with each of the 12 semitones of a single
octave during a defined time frame. We measure the
intensity of the 12 semitones over the time frame of
a beat. To verify the matched local parts of two
songs, we apply a location coding scheme that
encodes the relative positions of local features in
songs as location maps. Then, through location
verification based on these location maps, false
matches of local features can be removed effectively
and efficiently, yielding good retrieval precision.
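As a concrete illustration, the beat-aligned chroma extraction described above can be sketched as follows. This is only a minimal sketch, assuming a frame-level chromagram and detected beat positions (as frame indices) are already available; the function and variable names are illustrative, not taken from our implementation:

```python
import numpy as np

def beat_aligned_chroma(chroma, beat_frames):
    """Average a frame-level chromagram over each beat interval.

    chroma      : 12 x T array, one 12-semitone intensity column per frame
    beat_frames : sorted frame indices of detected beats
    Returns a 12 x (len(beat_frames) - 1) beat-synchronous chromagram.
    """
    patches = []
    for start, end in zip(beat_frames[:-1], beat_frames[1:]):
        patches.append(chroma[:, start:end].mean(axis=1))
    return np.stack(patches, axis=1)

# Toy example: 12 semitones x 8 frames, beats at frames 0, 4, 8.
chroma = np.tile(np.arange(8, dtype=float), (12, 1))
beats = beat_aligned_chroma(chroma, [0, 4, 8])
print(beats.shape)  # (12, 2)
print(beats[0])     # [1.5 5.5]
```

Each column of the result is one beat-chroma vector: the mean 12-semitone intensity over one beat, which is the unit later quantized against the codebook.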
The contributions of this paper can be
summarized as follows: 1) we apply beat-chroma
patterns to build a descriptive audio codebook for
large-scale similar song retrieval; 2) we utilize a
location coding method to encode the relative
location relationships among audio words into
location maps; 3) we apply a location verification
algorithm that removes false matches based on the
location maps for similar song search.
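To make the location coding and verification ideas concrete, a simplified one-dimensional sketch follows. It is only an illustration under simplifying assumptions (each match reduced to a single position per song, order-based binary location maps, and greedy removal of the most inconsistent match), not our exact algorithm; all names are hypothetical:

```python
import numpy as np

def location_map(positions):
    """Binary map M[i][j] = 1 iff match i occurs before match j in the song."""
    p = np.asarray(positions)
    return (p[:, None] < p[None, :]).astype(int)

def verify_matches(query_pos, cand_pos):
    """Drop matches whose relative order disagrees between the two songs.

    Iteratively removes the match with the most order inconsistencies,
    in the spirit of spatial-coding style geometric verification.
    """
    keep = list(range(len(query_pos)))
    while len(keep) > 1:
        mq = location_map([query_pos[i] for i in keep])
        mc = location_map([cand_pos[i] for i in keep])
        inconsist = (mq ^ mc).sum(axis=1)  # per-match disagreement count
        if inconsist.max() == 0:
            break                          # all remaining matches consistent
        keep.pop(int(inconsist.argmax()))  # remove the worst offender
    return keep

# Matches 0-3 appear in the same order in both songs; match 4 is out of order.
q = [10, 20, 30, 40, 50]
c = [5, 15, 25, 35, 1]       # last match appears first in the candidate song
print(verify_matches(q, c))  # [0, 1, 2, 3]
```

The false match (index 4) violates the relative ordering shared by the other matches and is pruned, leaving only location-consistent matches to contribute to the similarity score.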
The rest of the paper is organized as follows. In
Section 2, related work is introduced. Then, our
approach is illustrated in Section 3. In Section 4,
some preliminary experimental results are provided.
Finally, we conclude in Section 5.
2 RELATED WORK
The codebook approach for large-scale music
retrieval was proposed in (Seyerlehner et al., 2008).
Using vector quantization, a large set of local
spectral features is divided into groups. Each group
corresponds to a sub-space of the feature space and
is represented by its center, which is called an audio
word. All audio words constitute an audio codebook.
With local features quantized to audio words, the
song representation is very compact. Moreover, with
an inverted-file index, all the songs can be efficiently
indexed as bags of audio words, achieving fast
search response.
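The quantization and inverted-file indexing steps can be sketched as follows, assuming the codebook centers (audio words) have already been learned, e.g., by k-means over local features; the names and toy data are illustrative only:

```python
import numpy as np
from collections import defaultdict

def quantize(features, codebook):
    """Assign each feature vector to its nearest codebook center (audio word)."""
    d = ((features[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
    return d.argmin(axis=1)

def build_inverted_index(songs, codebook):
    """Map each audio word id to the set of songs containing it."""
    index = defaultdict(set)
    for song_id, feats in songs.items():
        for word in quantize(feats, codebook):
            index[int(word)].add(song_id)
    return index

# Toy codebook of 2 audio words in a 2-D feature space.
codebook = np.array([[0.0, 0.0], [10.0, 10.0]])
songs = {
    "song_a": np.array([[0.1, 0.2], [9.8, 10.1]]),  # uses both audio words
    "song_b": np.array([[0.3, 0.1]]),               # only audio word 0
}
index = build_inverted_index(songs, codebook)
print(sorted(index[0]))  # ['song_a', 'song_b']
print(sorted(index[1]))  # ['song_a']
```

At query time, only the inverted lists of the query's audio words need to be visited, which is what makes bag-of-audio-words retrieval fast on large collections.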
Recently, the beat-chroma feature has become a
popular and useful local feature for music retrieval
and classification. By learning a codebook from
millions of beat-chroma features, the common
patterns in the beat-synchronous chromagrams of all
the songs can be obtained (Bertin-Mahieux et al.,
2010). Each individual codeword consists of short
beat-chroma patches of between 1 and 8 beats,
optionally aligned to bar boundaries (Bertin-
Mahieux et al., 2010). This approach uncovers
deeper common patterns underlying different pieces
of songs than the earlier "shingles" approach (Casey
and Slaney, 2007).
However, beat-chroma feature quantization
reduces the discriminative power of local
descriptors: different beat-chroma features may be
quantized to the same audio word and can no longer
be distinguished from each other. On the other hand,
due to audio word ambiguity, descriptors from local
patches with different semantics may also be very
similar to each other. Such quantization error and
audio word ambiguity cause many false matches of
local features between songs and therefore decrease
retrieval precision and recall.
To reduce the quantization error, one solution is
to utilize the location information of local features
in songs to improve retrieval precision. Many
geometric verification approaches (Wu et al., 2009;
Zhou et al., 2010) have been proposed for image
retrieval. Among them, the spatial coding approach
(Zhou et al., 2010) is an efficient global geometric
verification method that verifies the spatial
consistency of features across the entire image,
relying on the distribution of visual words in
two-dimensional images. Motivated by the spatial
coding algorithm (Zhou et al., 2010), we apply a
simpler version of it to encode the relative location
relationships
Large Scale Similar Song Retrieval using Beat-aligned Chroma Patch Codebook with Location Verification