VISUAL PITCH CLASS PROFILE
A Video-based Method for Real-time Guitar Chord Identification
M. Cicconet, P. Carvalho, L. Velho and M. Gattass
Vision and Graphics Laboratory at IMPA and Computer Science Department at PUC-Rio, Brazil
Keywords:
Guitar Chord Identification, Chord Detection, Pitch Class Profile, Chroma Vector.
Abstract:
We propose a video-based method for real-time guitar chord identification which is analogous to the state-of-
the-art audio-based method. While the method based on audio data uses the Pitch Class Profile feature and
supervised Machine Learning techniques to “teach” the machine about the chord “shape”, we use as feature
the approximated positions of fingertips in the guitar fretboard (what we call Visual Pitch Class Profile),
captured using special hardware. We show that the visual- and audio-based methods have similar classification performance, but the former outperforms the latter with respect to immunity to the noise caused by strumming.
1 INTRODUCTION
Despite being a predominantly auditory activity, music also has a visual component which is worth noting. In the case of guitar playing, for instance, it is common to identify chords by visual inspection instead of by auditory perception.
Most studies in Musician-Computer Interaction,
however, have been done by the Computer Music
community, and there are still few Computer Vision
approaches.
In this work we propose a new method for real-time guitar chord identification using only Computer
Vision techniques. The method is analogous to the
state-of-the-art audio-based method, which uses a su-
pervised Machine Learning technique to train the ma-
chine with the patterns of different chords, using as
training feature the Pitch Class Profile. We kept the
Machine Learning part and replaced the auditory fea-
ture by a visual one, namely the approximate posi-
tions of fingertips at the guitar fingerboard.
2 PREVIOUS WORK
In (Burns and Wanderley, 2006) a camera is mounted
on the guitar headstock in order to capture the first
five frets. The Linear Hough Transform is used to de-
tect strings and frets, and the Circular Hough Trans-
form is used to locate the fingertips. The purpose is
to identify chords and note sequences in real time by detecting fingertip positions in guitar-fretboard
coordinates.
The work of (Kerdvibulvech and Saito, 2007) is
more ambitious. They use stereo cameras and aug-
mented reality fiducial markers to locate the guitar fin-
gerboard in 3D, and distinctly colored markers attached to the fingertips to determine their three-dimensional positions relative to the fretboard.
The two works just mentioned use only visual infor-
mation. In (Quested et al., 2008), visual information
is used to enhance the performance of an audio-based
musical information retrieval method. The point is
that once the fundamental frequency of the played
note is known (via audio), the video information helps resolve the ambiguity regarding which string was actually fingered or plucked. The same idea is used in
(Paleari et al., 2008), but their system is not designed
to work in real-time.
3 THE AUDIO-BASED METHOD
According to (Cabral, 2008), most of the audio-based
chord recognition methods rely on the use of the Pitch
Class Profile (PCP) audio feature along with some su-
pervised Machine Learning method. We now define
the PCP, also known as the Chroma Vector, as de-
Figure 1: Capture hardware. On the left, an infrared camera surrounded by four infrared light sources. In the center, a hollow disk made of reflective material; four of these are used to locate the plane containing the ROI. On the right, middle-phalanges gloves with small rods coated so as to easily reflect light.
scribed by (Jehan, 2005).
At regular intervals of the audio file an audio
chunk is taken, multiplied by a Hann-window, and the
magnitudes of the Discrete Fourier Transform coef-
ficients are computed. Then the 84 amplitudes cor-
responding to MIDI notes ranging from 24 to 107
are captured and a 12-dimensional vector is obtained
by summing the amplitudes corresponding to musical notes of the same pitch class in different octaves. The elements of this vector are normalized to the range [0, 1] so that loudness differences between windows are not taken into account.
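For concreteness, the computation just described can be sketched as follows in Python with NumPy; the sample rate and the mapping of each note to its nearest FFT bin are illustrative assumptions on our part, not details fixed by the method.

```python
import numpy as np

def pitch_class_profile(chunk, sample_rate=44100):
    """12-dimensional chroma vector of one audio chunk."""
    windowed = chunk * np.hanning(len(chunk))       # Hann window
    spectrum = np.abs(np.fft.rfft(windowed))        # DFT magnitudes
    pcp = np.zeros(12)
    for midi in range(24, 108):                     # MIDI notes 24 to 107
        freq = 440.0 * 2.0 ** ((midi - 69) / 12.0)  # note frequency in Hz
        bin_index = int(round(freq * len(chunk) / sample_rate))
        if bin_index < len(spectrum):
            # accumulate into the note's pitch class (C=0, C#=1, ...)
            pcp[midi % 12] += spectrum[bin_index]
    peak = pcp.max()
    return pcp / peak if peak > 0 else pcp          # normalize to [0, 1]
```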
In the seminal work of (Fujishima, 1999) the
Nearest Neighbor method is used, and the machine
is trained with “ideal” chroma vectors: those whose
entries are 1’s in the notes of the chord and 0’s other-
wise.
For the purpose of evaluating our visual method
and comparing its performance with the analogous
audio-based technique, we have implemented an
audio-based chord detection algorithm similar to the one introduced by Fujishima, except that instead of using “ideal” chords, the machine was trained with samples of real recorded chords, and in the classification phase we used the K-Nearest-Neighbor algorithm with K > 1.
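A minimal sketch of this classification step follows; the Euclidean metric and the simple majority vote are assumptions on our part, as the description above leaves these details open.

```python
import numpy as np
from collections import Counter

def knn_classify(query, train_features, train_labels, k=20):
    """Label a feature vector by majority vote among its k nearest samples."""
    distances = np.linalg.norm(train_features - query, axis=1)
    nearest = np.argsort(distances)[:k]
    votes = Counter(train_labels[i] for i in nearest)
    return votes.most_common(1)[0][0]
```

Here `train_features` holds one chroma vector per row and `train_labels` the corresponding chord names; the same routine applies unchanged to the visual feature of Section 4.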
4 PROPOSED VIDEO-BASED
METHOD
Let us define the Region of Interest (ROI) in the scene of a person playing guitar as the region containing the strings, from the nut to the bridge.
Figure 1 shows the equipment that supports our
method. We use an infrared camera to capture the scene, which is properly illuminated with infrared light. Special markers (fiducials) are attached to the guitar in order to easily locate the instrument, and reflective gloves cover the middle phalanges of the fingers.
The pipeline of our chord detection method is il-
lustrated in Figure 2. The software takes advantage of robust algorithms implemented in OpenCV, an open-source Computer Vision library (Bradski and Kaehler, 2008).
Figure 2: Chord detection pipeline, from top to bottom. (1) A threshold is applied to keep only the guitar and finger markers. (2) Guitar fiducials and finger rods are detected using a contour detection algorithm. (3) A projective transformation “immobilizes” the guitar, regardless of the movement caused by the musician. (4) The projective transform is applied to the north-most extreme of each finger rod in order to roughly locate the fingertips in guitar-fretboard coordinates.
First, a threshold is applied to the input image, so
that the only non-null pixels are those of the guitar
and finger markers. Then, using the contour detec-
tion algorithm and contour data structure provided by
OpenCV, guitar and finger markers can be separated.
Note that guitar fiducials and finger markers are, re-
spectively, contours with and without a hole. Once
the positions of the four guitar fiducials are known in
the image, by using their actual positions in guitar fin-
gerboard coordinates a projective transformation (ho-
mography) can be determined and applied in order to
“immobilize” the guitar and easily extract the ROI.
This homography is then applied to the north-most extremes of the finger rods, giving the rough positions of the fingertips in guitar-fretboard coordinates, since the distal phalanges are, in general, nearly perpendicular to the fingerboard.
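The steps above can be sketched with OpenCV's Python interface as follows; the threshold value, the consistent ordering of the four fiducials, and the fretboard reference coordinates (`fiducials_fretboard`) are illustrative assumptions.

```python
import cv2
import numpy as np

def locate_fingertips(frame, fiducials_fretboard, thresh=200):
    """Map finger-rod tips from image to fretboard coordinates."""
    # (1) Keep only the bright guitar fiducials and finger-rod markers.
    _, binary = cv2.threshold(frame, thresh, 255, cv2.THRESH_BINARY)

    # (2) Two-level contour retrieval (OpenCV 4.x API): fiducials are
    # contours with a hole (a child contour); finger rods have none.
    contours, hierarchy = cv2.findContours(
        binary, cv2.RETR_CCOMP, cv2.CHAIN_APPROX_SIMPLE)
    fiducials, rods = [], []
    for i, c in enumerate(contours):
        if hierarchy[0][i][3] != -1:
            continue  # inner (hole) contour, not a marker itself
        if hierarchy[0][i][2] != -1:  # has a hole: guitar fiducial
            fiducials.append(c.reshape(-1, 2).mean(axis=0))
        else:                         # solid: finger rod
            rods.append(c)

    # (3) Homography that "immobilizes" the guitar, from the four
    # fiducials' known positions in fretboard coordinates (both point
    # lists are assumed to be in a consistent order).
    H = cv2.getPerspectiveTransform(np.float32(fiducials[:4]),
                                    np.float32(fiducials_fretboard))

    # (4) Map the north-most point of each rod, a proxy for the fingertip.
    tips = np.float32([r[r[:, 0, 1].argmin(), 0] for r in rods])
    return cv2.perspectiveTransform(tips.reshape(-1, 1, 2), H).reshape(-1, 2)
```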
We use a supervised Machine Learning technique
to train the machine with the guitar chords we want
it to identify. The chord a musician plays is viewed by the system as an eight-dimensional vector composed of the coordinates (after the projective transformation) of the four fingertips, from the little finger to the index finger. By analogy with the PCP, we call this
eight-dimensional vector the Visual Pitch Class Pro-
file (VPCP).
In summary, the proposed algorithm for real-time guitar chord detection has two phases. In the first (the training phase), the musician chooses the chords to be identified and takes some samples of each one, where by sample we mean the eight-dimensional vector formed by the positions of the north-most extremes of the finger rods, i.e.,
the VPCP. In the second (the identification phase), the
system receives the vector corresponding to the chord
to be identified and classifies it using the K Nearest
Neighbor algorithm.
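A minimal sketch of the two phases, reusing the `locate_fingertips` and `knn_classify` sketches above; sorting the detected rods by horizontal position as a stand-in for the little-to-index finger order is our assumption about the camera setup.

```python
import numpy as np

training_vectors, training_labels = [], []

def vpcp(fingertips):
    """Flatten four fingertip positions into the 8-dimensional feature."""
    pts = sorted(fingertips, key=lambda p: p[0])  # assumed finger order
    return np.float32(pts).reshape(8)

def record_sample(fingertips, chord_name):
    """Training phase: store one labeled VPCP sample."""
    training_vectors.append(vpcp(fingertips))
    training_labels.append(chord_name)

def identify(fingertips, k=20):
    """Identification phase: classify the incoming VPCP with k-NN."""
    return knn_classify(vpcp(fingertips), np.vstack(training_vectors),
                        training_labels, k=k)
```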
5 COMPARISON AND
DISCUSSION
Before turning to quantitative comparisons, let us address some theoretical aspects. Video methods, even knowledge-based ones, are immune to incorrect tuning of the instrument. Although playing an out-of-tune instrument is not desirable, this feature helps beginners, who may not be able to keep the guitar precisely tuned. On the other hand, it can be argued that knowledge-based methods only work properly when trained by the end user, since the shape of a given chord differs slightly from person to person. This is true, but knowledge-based techniques using audio data face the same problem, since different instruments, with different strings, produce slightly different sounds for the same chord shape.
Seeking quantitative comparisons, we took 100 samples of each of the 14 major and minor chords in the keys of C, D, E, F, G, A and B, choosing just one shape per chord (on the guitar there are many realizations of the same chord). The video samples were taken by holding a given chord and, while moving the guitar slightly, waiting until 100 samples had been saved. For the audio samples, for each chord we recorded nearly 10 seconds of a track consisting of strumming in some rhythm while keeping the chord fixed. The audio data was then pre-processed to remove the parts corresponding to strumming (where the noise is high). Then, at regular intervals of about 12 milliseconds, an audio chunk of about 45 milliseconds was processed to obtain its Pitch Class Profile, as described in Section 3.
These audio and video samples tend to form clusters in R^12 and R^8, respectively. Figure 3 provides
some analysis of them. Note that in both cases the
samples are placed very close to the mean of the re-
spective cluster, but there are more outliers in the au-
dio data.
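The statistics plotted in Figure 3 can be computed along the following lines, assuming `samples` maps each chord label to an array of its feature vectors (12-dimensional for audio, 8-dimensional for video):

```python
import numpy as np

def cluster_statistics(samples):
    """Per-class average/maximum spread and nearest-cluster distance."""
    means = {label: vecs.mean(axis=0) for label, vecs in samples.items()}
    stats = {}
    for label, vecs in samples.items():
        # distances from each sample to its own class mean
        spread = np.linalg.norm(vecs - means[label], axis=1)
        # distance from this class mean to the nearest other class mean
        nearest = min(np.linalg.norm(means[label] - m)
                      for other, m in means.items() if other != label)
        stats[label] = (spread.mean(), spread.max(), nearest)
    return stats
```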
Regarding classification performance, both meth-
ods behaved similarly in the tests we have conducted.
The difference is that the audio-based algorithm is sensitive to the noise caused by strumming, while the video-based method is unaffected by it. This is illustrated in Figure 4, where the same chord sequence (played twice) was performed and analyzed by the two methods, using 20 Nearest Neighbors for classification. Note how much more stable the video-based method is. It can also be seen that both algorithms have problems with chord transitions.
6 CONCLUSIONS AND FUTURE
WORK
We have seen that both methods have similar classifi-
cation performance, but the VPCP algorithm is more
stable in the sense that (1) the clusters formed at the
training phase are better defined and (2) the visual
method is not sensitive to the noise caused by strum-
ming.
Given the high similarity between the classical
audio-based method and our proposed video-based al-
gorithm, a natural direction of research is to combine
both classifiers using some data fusion technique.
Figure 3: Analysis of the audio and video sample clusters. A square (respectively, a triangle) represents the average (respectively, the maximum) distance between the class samples and the class mean vector. The asterisk represents the distance between the cluster mean vector and the nearest other cluster mean vector. This shows that the clusters of video samples are better defined than those of audio samples.

There are also some issues with the VPCP method that must be addressed. The first is to eliminate the need for the middle-phalanges gloves. Although they
are unobtrusive for the user, an unencumbered hand is highly desirable. Also, since the rods lie off the fretboard plane and we apply a perspective (plane-to-plane) transformation, rotating the guitar about the neck axis causes a vertical shift in the transformed north-most points of the rods.
To cope with these issues we plan to use two cal-
ibrated cameras, one working in the infrared range
to capture the guitar (which may have some infrared
LEDs), and the other in the visible range to capture
the fingertips using traditional hand- and contour-detection methods.
Figure 4: The same chord sequence, played twice, is analyzed by the traditional audio-based algorithm and our proposed video-based method. While the former needs some extra processing to cope with the noise caused by strumming, our video-based method is immune to it. However, both techniques have problems with chord transitions.

REFERENCES

Bradski, G. and Kaehler, A. (2008). Learning OpenCV: Computer Vision with the OpenCV Library. O'Reilly.

Burns, A. and Wanderley, M. (2006). Visual methods for the retrieval of guitarist fingering. In Conference on New Interfaces for Musical Expression.
Cabral, J. (2008). Harmonisation Automatique en Temps Réel. PhD thesis, Université Pierre et Marie Curie.
Fujishima, T. (1999). Real-time chord recognition of musical sound: A system using Common Lisp Music. In International Computer Music Conference.
Jehan, T. (2005). Creating Music by Listening. PhD thesis,
Massachusetts Institute of Technology.
Kerdvibulvech, C. and Saito, H. (2007). Vision-based guitarist fingering tracking using a Bayesian classifier and particle filters. Advances in Image and Video Technology.
Paleari, M., Huet, B., Schutz, A., and Slock, D. (2008). A
multimodal approach to music transcription. In 15th
International Conference on Image Processing.
Quested, G., Boyle, R., and Ng, K. (2008). Polyphonic note
tracking using multimodal retrieval of musical events.
In International Computer Music Conference.