Figure 2: Loudness and chroma features of a 30-second portion of the song Anything Goes, by AC/DC. The features
are extracted from windows of 4096 frames, with an overlap of 2048 frames. The audio sample rate is 44100 frames
per second.
2 RELATED WORK
When speaking about audio visualization, it is worth
clarifying which kind of visualization is meant.
There are at least three cases where terms like
music visualization or audio visualization apply.
First, there are the online visualization algorithms.
They show in real-time (or as close to real-time as
possible) some feature extracted from a small window
around the audio portion being played or sung. Visualizers
of this kind can serve an aesthetic purpose,
enhancing audio playback software (Lee et al.,
2007); act as a tool for musical training (Ferguson et al.,
2005); help the hearing impaired by enhancing
their awareness of their surroundings (Azar et al.,
2007); or serve as an interface to control real-time
composition parameters (Sedes et al., 2004).
Second, in works like (Kolhoff et al., 2006), a picture
representing the entire audio file is rendered with
the purpose of facilitating search in large databases,
or simply serving as a personalized icon for the file. In (Verbeeck
and Solum, 2009) the goal is to present a high-level
segmentation of the audio data in terms of intro,
verse, chorus, break, bridge and coda, the segments
being arranged as concentric rings, ordered from the
center in the same order the corresponding parts occur
in the piece.
Finally, there are the methods aiming at rendering
a picture (matrix) of the entire file in which at least
one dimension is time-related with the audio. Perhaps
the most important example is (Foote, 1999), where
a square matrix of pairwise similarity between small
consecutive and overlapping audio chunks is com-
puted. (We will come back to this kind of matrix later,
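A minimal sketch of such a pairwise-similarity matrix, assuming feature vectors have already been extracted from the consecutive chunks (cosine similarity is one common choice; the function name is ours, not from the cited work):

```python
import numpy as np

def self_similarity(features):
    """Pairwise cosine similarity between per-chunk feature vectors.

    features: array of shape (n_chunks, n_dims), one row per audio chunk.
    Returns an (n_chunks, n_chunks) matrix S with S[i, j] in [-1, 1].
    """
    norms = np.linalg.norm(features, axis=1, keepdims=True)
    unit = features / np.maximum(norms, 1e-12)  # guard against zero vectors
    return unit @ unit.T

# Toy example: three 2-D feature vectors.
F = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
S = self_similarity(F)
print(S.shape)  # (3, 3); diagonal entries are 1.0
```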
in section 4.) Another very common example consist
of plotting time against some 1- or n-dimensional au-
dio feature, like, say, the magnitudes of the Discrete
Fourier Transform coefficients of the audio chunk.
For n-dimensional features the resulting matrix is nor-
malized so that the minimum value is 0 and the max-
imum 255, resulting in a grayscale image. Figure 2
shows this kind of plot in the case of the loudness (1-
dimensional) and chroma (12-dimensional) features.
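The min-max normalization just described can be sketched as follows (a sketch under our own naming, not code from the paper):

```python
import numpy as np

def to_grayscale(feature_matrix):
    """Linearly rescale a feature matrix so its minimum maps to 0
    and its maximum to 255, yielding a grayscale image."""
    m = feature_matrix.astype(float)
    lo, hi = m.min(), m.max()
    if hi == lo:                      # constant input: map to mid-gray
        return np.full(m.shape, 128, dtype=np.uint8)
    return np.round(255 * (m - lo) / (hi - lo)).astype(np.uint8)

chroma = np.random.rand(12, 600)      # e.g. 12 chroma bins over 600 windows
img = to_grayscale(chroma)
print(img.min(), img.max())           # 0 255
```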
3 PROPOSED METHOD
The musical content of a piece can essentially be divided
into two categories: the harmonic (or melodic)
and the rhythmic (or percussive). Strictly speaking,
such a division is impossible, since
melody and rhythm are (sometimes strongly) correlated.
There are, however, ways to extract most of the
content of one category or the other, which is a central
issue in the Music Information Retrieval community.
The problem with these methods is that they are
usually computationally expensive. In fact, in a previous
work (Cicconet and Carvalho, 2009) we
tried to segment the music piece and to perform some
kind of clustering. But good clustering comes at a high
computational cost, meaning the user has to
wait a long time for the result, which is not desirable.
Thus, we decided to use two simple audio descriptors,
each related to a different component
of the audio (rhythm or melody), and to combine them
in such a way that the corresponding visual features
complement each other.
Another important point behind this decision is
the segmentation and clustering capability of
the human visual system, which consists of billions of
neurons working in parallel to extract features from
every part of the visual field simultaneously (Ware,
2004).
Therefore, instead of spending computational
time segmenting and clustering the audio data, we
chose to take advantage of the human visual
system's ability to do so, facilitating
the analysis by presenting significant audio information.
The visualization algorithm we propose is as fol-
lows.
We assume that the audio file is mono-channel and
sampled at 44100 frames per second; otherwise we
apply some preprocessing (adding the channels and/or
resampling) to obtain audio data with the desired properties.
These properties are not critical: the computations
could be done for each channel separately,
and a different sample rate could be used.
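This preprocessing step can be sketched as below; we average the channels and use a simple linear-interpolation resampler purely for illustration (a production resampler would be band-limited):

```python
import numpy as np

TARGET_RATE = 44100

def preprocess(samples, rate):
    """Mix a (frames, channels) signal down to mono and resample
    to TARGET_RATE by linear interpolation (illustrative only)."""
    if samples.ndim == 2:                       # multi-channel: average channels
        samples = samples.mean(axis=1)
    if rate != TARGET_RATE:
        n_out = int(round(len(samples) * TARGET_RATE / rate))
        x_old = np.linspace(0.0, 1.0, num=len(samples))
        x_new = np.linspace(0.0, 1.0, num=n_out)
        samples = np.interp(x_new, x_old, samples)
    return samples

stereo = np.random.randn(22050, 2)              # 1 s of stereo audio at 22050 Hz
mono = preprocess(stereo, 22050)
print(len(mono))  # 44100
```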
At regular intervals of 2048 frames, an audio
chunk of length 4096 frames is multiplied by a Hann
window, and its Discrete Fourier Transform (DFT)
is computed. We choose 2048 as hop size to obtain
IVAPP 2010 - International Conference on Information Visualization Theory and Applications
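The analysis step just described (4096-frame chunks taken every 2048 frames, Hann-windowed, then transformed) can be sketched as:

```python
import numpy as np

WINDOW = 4096
HOP = 2048

def windowed_dft_frames(signal):
    """Slide a Hann-windowed 4096-frame chunk over the signal in
    2048-frame hops and return the DFT of each chunk."""
    hann = np.hanning(WINDOW)
    frames = []
    for start in range(0, len(signal) - WINDOW + 1, HOP):
        chunk = signal[start:start + WINDOW] * hann
        frames.append(np.fft.rfft(chunk))   # one-sided spectrum of a real signal
    return np.array(frames)

audio = np.random.randn(44100)              # 1 s of audio at 44100 Hz
spectra = windowed_dft_frames(audio)
print(spectra.shape)  # (20, 2049)
```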