A COMPARATIVE ANALYSIS OF TIME-FREQUENCY
DECOMPOSITIONS IN POLYPHONIC PITCH ESTIMATION
F. J. Cañadas-Quesada, P. Vera-Candeas, N. Ruiz-Reyes, J. Carabias, P. Cabañas and F. Rodriguez
Telecommunication Engineering Department, University of Jaén, Polytechnic School, Linares, Jaén, Spain
Keywords:
Polyphonic signal, Time-frequency decomposition, Fundamental frequency, Constant Q Transform, STFT,
Note-event, Candidate, Overlapped partial, Spectral modeling.
Abstract:
In a monaural polyphonic music context, the time-frequency information used by most multiple fundamental frequency estimation systems, extracted from the time domain of the polyphonic signal, is mainly computed using fixed-resolution or variable-resolution time-frequency decompositions. This time-frequency information is crucial in the polyphonic estimation process because it must clearly represent all useful information in order to find the set of active pitches. In this paper, we present a preliminary study analyzing two different decompositions, the Constant Q Transform and the Short Time Fourier Transform, integrated into the same multiple fundamental frequency estimation system, with the aim of determining which decomposition is more suitable for polyphonic musical signal analysis and how each of them influences the accuracy of the polyphonic estimation, considering a low-middle-high frequency evaluation.
1 INTRODUCTION
Multiple fundamental frequency (multiple-F0) estimation has long been an interesting subject in signal processing, and it remains a challenging task in the field of monaural polyphonic musical signals. The goal of a multiple-F0 estimation system is to find both the number of active pitches (polyphony) and the frequencies associated with these active pitches in a piece of music at a given time. Multiple-F0 estimation systems cover a wide range of recent audio applications: musical transcription (Marolt, 2004) (Poliner and Ellis, 2007) (Emiya et al., 2008), bass-melody detection (Goto, 2004), sound manipulation (Neubacker, 2009), content-based music retrieval (IDMT, 2009) or sound source separation (Burred and Sikora, 2007) (Li et al., 2009).
Most multiple-F0 estimation systems perform a preprocessing stage in which a time-frequency decomposition is computed from the time-domain input signal. This time-frequency decomposition provides useful information for estimating the spectral content of a polyphonic signal, and it is usually computed using a transformation with fixed resolution (Klapuri, 2003) (Yeh et al., 2005) (Bello et al., 2006) (Carabias et al., 2008) (Cañadas Quesada et al., 2008) or variable resolution (Kameoka et al., 2007) (Saito et al., 2008) (Smaragdis, 2009).
In this paper, we present a comparative study analyzing two different time-frequency decompositions, specifically the Constant Q Transform (variable resolution) and the Short Time Fourier Transform (fixed resolution), integrated in a joint multiple fundamental frequency estimation system, with the aims of determining which decomposition is more suitable for estimating the set of active pitches in a polyphonic musical signal and of analyzing the performance of each time-frequency decomposition in the low, middle and high frequency regions.
The remainder of this paper is structured as fol-
lows. In Section 2 we briefly review the Con-
stant Q Transform and Short Time Fourier Transform
describing the advantages/disadvantages of each of
them. In Section 3 the multiple-F0 estimation system
is described. Experimental results are shown in Sec-
tion 4. Finally, conclusions are presented in Section
5.
J. Cañadas-Quesada F., Vera-Candeas P., Ruiz-Reyes N., Carabias J., Cabañas P. and Rodriguez F. (2010). A COMPARATIVE ANALYSIS OF TIME-FREQUENCY DECOMPOSITIONS IN POLYPHONIC PITCH ESTIMATION. In Proceedings of the International Conference on Signal Processing and Multimedia Applications (SIGMAP 2010), pages 145-150. DOI: 10.5220/0002955601450150. Copyright © SciTePress.
2 DECOMPOSITIONS OF
MUSICAL SIGNALS
2.1 Short Time Fourier Transform of
Musical Signals
The Short Time Fourier Transform (STFT) is the stan-
dard tool to perform signal analysis in the frequency
domain. Considering a frame-by-frame evaluation,
the STFT of the signal x[n] for the t-th frame can be formulated as in equation (1),

X_{STFT}(t, k) = \sum_{d=0}^{M-1} x[(t-1)J + d] \, w[d] \, e^{-j \frac{2\pi}{M} d k}    (1)

with k = 0, ..., M-1, where w[d] is an N-sample Hamming window and J = N/8 samples is the time shift between consecutive frames. The length N of each frame can be extended to M using a zero-padding technique. Specifically, we have used three values of M, referred to as STFT-1N (M = N), STFT-2N (M = 2N) and STFT-8N (M = 8N). Each STFT has been computed using a sampling rate f_s = 44100 Hz, N = 4096 (92.9 ms) and J = 512 (11.6 ms).
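The frame extraction, windowing and zero padding of equation (1) can be sketched as follows. The direct DFT shown here is for clarity only (a real implementation would use an FFT), and the default parameter values are those stated above.

```python
import cmath
import math

def stft_frame(x, t, N=4096, J=512, M=8192):
    """STFT of the t-th frame (t = 1, 2, ...) of signal x, as in eq. (1):
    N-sample Hamming window, hop size J, zero padding from N up to M samples."""
    start = (t - 1) * J
    frame = x[start:start + N]
    # N-sample Hamming window
    w = [0.54 - 0.46 * math.cos(2 * math.pi * d / (N - 1)) for d in range(N)]
    # window the frame, then zero-pad it from N up to M samples
    windowed = [frame[d] * w[d] for d in range(N)] + [0.0] * (M - N)
    # direct M-point DFT of the zero-padded frame (an FFT would be used in practice)
    return [sum(windowed[d] * cmath.exp(-2j * math.pi * d * k / M)
                for d in range(M))
            for k in range(M)]
```

Zero padding increases the number of spectral samples (M bins instead of N) without changing the underlying 10.8 Hz analysis resolution, which is fixed by the N-sample window.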
Several researchers claim that the STFT is not suitable for analyzing musical signals because each frequency bin is computed on a linear frequency scale (fixed resolution), which provides too little resolution at low frequencies and more resolution than needed at high frequencies. Consider as the musical unit a semitone, which presents a constant ratio of 2^{1/12} between consecutive fundamentals. Using a fixed frequency resolution f_s/N = 10.8 Hz, if we analyze the note-events C2 (65.4 Hz) and C6 (1047.0 Hz), we can observe that there is sufficient resolution to discriminate F0_{C#6} - F0_{C6} = 62 Hz, but not to discriminate F0_{C#2} - F0_{C2} = 3.9 Hz. In this last case, all information belonging to three note-events (C2, C#2 and D2) is represented in the same frequency bin.
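The figures in this example can be checked directly from the equal-tempered tuning formula. The helper `f0` (MIDI note number to fundamental frequency, with A4 = MIDI 69 = 440 Hz) is introduced here only for illustration.

```python
import math

fs, N = 44100, 4096
bin_width = fs / N                 # fixed STFT resolution, approx. 10.8 Hz
semitone_ratio = 2 ** (1 / 12)     # constant frequency ratio between semitones

def f0(midi):
    """Equal-tempered fundamental frequency of a MIDI note (A4 = 69 = 440 Hz)."""
    return 440.0 * 2 ** ((midi - 69) / 12)

# C2 (MIDI 36) vs C#2 (MIDI 37): spacing approx. 3.9 Hz, well below one 10.8 Hz bin
spacing_c2 = f0(37) - f0(36)
# C6 (MIDI 84) vs C#6 (MIDI 85): spacing approx. 62 Hz, several bins apart
spacing_c6 = f0(85) - f0(84)
```

The comparison `spacing_c2 < bin_width < spacing_c6` reproduces the point made above: a fixed 10.8 Hz grid separates adjacent semitones at C6 but not at C2.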
2.2 Constant Q Transform of Musical
Signals
The Constant Q Transform (CQT) (Brown, 1991) uses a varying time window to calculate a logarithmic frequency spectrum. Each octave is composed of the same number of frequency bins b. In this manner, each frequency f_k can be formulated as in equation (2),

f_k = f_{min} \cdot 2^{k/b}  Hz    (2)
with k = 0, ..., N-1. The frequency bin k associated with f_k can be calculated using equation (3),

k = \frac{b}{\log_{10} 2} \log_{10} \frac{f_k}{f_{min}}    (3)
Moreover, in order to achieve a constant Q factor, both the resolution \Delta f_k and the number of samples N_k of the window vary inversely with the frequency f_k,

\Delta f_k = f_{k+1} - f_k = f_k (2^{1/b} - 1)  Hz    (4)

Q = \frac{f_k}{\Delta f_k} = \frac{1}{2^{1/b} - 1}    (5)

N_k = \frac{Q f_s}{f_k}    (6)
Considering the t-th frame, the Constant Q Transform of the signal x[n] is defined in equation (7),

X_{CQT}(t, k) = \frac{1}{N_k} \sum_{d=0}^{N_k - 1} x[(t-1)J + d] \, w[d, k] \, e^{-j \frac{2\pi Q}{N_k} d}    (7)

where w[d, k] is a window of length N_k.
In this paper, we have evaluated the direct calculation of the CQT (Brown, 1991) with the following parameters: b = 48 bins per octave (four bins per semitone) and f_{min} = 32.7 Hz.
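A minimal sketch of this direct CQT calculation, combining equations (2), (5), (6) and (7), might look as follows. The total number of bins K and the hop size J are illustrative assumptions, since the paper does not state them explicitly.

```python
import cmath
import math

def cqt_frame(x, t, fs=44100.0, fmin=32.7, bins=48, K=192, J=512):
    """Direct constant-Q transform of the t-th frame (Brown, 1991), eq. (7):
    K log-spaced bins, `bins` per octave, variable window length N_k.
    K and J are assumptions; the paper only fixes bins = 48 and fmin = 32.7 Hz."""
    Q = 1.0 / (2 ** (1.0 / bins) - 1)              # constant Q factor, eq. (5)
    start = (t - 1) * J
    X = []
    for k in range(K):
        fk = fmin * 2 ** (k / bins)                # bin center frequency, eq. (2)
        Nk = int(round(Q * fs / fk))               # window length, eq. (6)
        # N_k-sample Hamming window for bin k
        w = [0.54 - 0.46 * math.cos(2 * math.pi * d / (Nk - 1)) for d in range(Nk)]
        acc = sum(x[start + d] * w[d] * cmath.exp(-2j * math.pi * Q * d / Nk)
                  for d in range(Nk))
        X.append(acc / Nk)                         # 1/N_k normalization, eq. (7)
    return X
```

Note how low bins get long windows (high frequency resolution, poor time resolution) while high bins get short ones, which is exactly the behavior compared against the fixed-resolution STFT.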
Although the CQT is more computationally expensive than the STFT, it is an interesting and desirable tool for musical signals because it exhibits two main advantages. First, it provides more resolution in the low frequency range. Second, the spectral localization of a note-event depends on the spectral localization of its fundamental frequency (F0), but the relative spectral localizations of the harmonics with respect to each other are invariant, drawing the same spectral pattern for all note-events (see Figure 1).
Figure 1: Spectral pattern (amplitude versus frequency bin) computed by the CQT of a harmonic signal composed of 12 harmonics with equal amplitude.
Figure 2 and Figure 3 exhibit the visual differences between the STFT and the CQT when an excerpt of a monophonic signal composed of nine note-events (piano) is analyzed. Figure 2 indicates that the spectral spacing between harmonics changes as a function of each fundamental frequency; it can be observed that this spacing grows with the frequency. However, Figure 3 shows how the logarithmic frequency spectrum maintains the relative spacing between harmonics independently of the fundamental frequency, creating the same frequency distribution for all note-events.
Figure 2: Short Time Fourier Transform (STFT) of the excerpt (frequency bins versus frames). The energy of each frequency is represented by the gray level.
Figure 3: Constant Q Transform (CQT) of the excerpt (frequency bins versus frames). The energy of each frequency is represented by the gray level.
3 MULTIPLE-F0 ESTIMATION
SYSTEM
In order to evaluate the two time-frequency transformations (STFT and CQT), we have simplified the multiple-F0 estimation system proposed in (Cañadas Quesada et al., 2010), removing both the temporal information extracted from λ previous frames and the HMM stage. The reason is that we are only interested in analyzing the performance of the transformation in the multiple-F0 estimation, at each frame, without any other kind of complementary information. Next, the most relevant stages of this system are described; for more details of the multiple-F0 estimation system, see (Cañadas Quesada et al., 2010). The block diagram of the integrated multiple-F0 estimation system is shown in Figure 4.
Figure 4: Multiple-F0 estimation system.
In the spectral analysis stage, both the STFT and the CQT are calculated to search for meaningful spectral peaks in the input musical signal. To facilitate the multiple-F0 estimation, a large number of spurious peaks are removed using a frequency-dependent threshold (Every and Szymanski, 2006), obtaining from the original input spectrum a new spectrum composed of only significant spectral peaks.
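A generic version of this peak-picking step can be sketched as follows. The specific frequency-dependent threshold of Every and Szymanski (2006) is not reproduced here, so the per-bin threshold curve is assumed to be supplied by the caller.

```python
def significant_peaks(mag, threshold):
    """Keep only local maxima of a magnitude spectrum `mag` that exceed a
    frequency-dependent threshold (one value per bin). Returns a dict mapping
    bin index to peak magnitude. The exact threshold curve of Every and
    Szymanski (2006) is not reproduced; any per-bin curve, e.g. a smoothed
    and scaled copy of the spectrum itself, can be plugged in."""
    peaks = {}
    for k in range(1, len(mag) - 1):
        # a significant peak is a local maximum above the per-bin threshold
        if mag[k] > mag[k - 1] and mag[k] >= mag[k + 1] and mag[k] > threshold[k]:
            peaks[k] = mag[k]
    return peaks
```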
The next stage is the search for F0-candidates. A significant spectral peak from the previous stage is regarded as an F0-candidate if its frequency falls within the interval defined by note-events ranging from C2 (MIDI note 36) to B6 (MIDI note 95) in a well-tempered musical scale. This interval has been selected because it is a typical analysis interval considered in multiple-F0 estimation systems (Klapuri, 2003) (Emiya et al., 2008) (Cañadas Quesada et al., 2008). For each F0-candidate, a harmonic spectral pattern is estimated in the logarithmic frequency domain.
After all possible harmonic patterns belonging to
F0-candidates have been defined at frame level, an ex-
haustive search for all possible combinations of these
patterns is performed.
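Such an exhaustive search amounts to enumerating the subsets of the F0-candidate set. A minimal sketch, where the maximum polyphony cap is an assumption (the paper does not state one, but some bound is needed to keep the 2^n search space tractable):

```python
from itertools import combinations

def all_pattern_combinations(candidates, max_polyphony=6):
    """Exhaustively enumerate every combination of F0-candidate harmonic
    patterns, from a single pitch up to `max_polyphony` simultaneous pitches.
    The max_polyphony value is an illustrative assumption."""
    combos = []
    for r in range(1, min(max_polyphony, len(candidates)) + 1):
        combos.extend(combinations(candidates, r))
    return combos
```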
Assuming that the amplitude spectrum of a polyphonic music signal is additive, each combination of harmonic patterns is modeled as a sum of weighted Gaussian spectral models, in which the amplitudes of overlapped partials are inferred by linearly interpolating the nearest non-overlapped partials.
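The interpolation of overlapped partials described above can be sketched as follows. The function name and the representation of a pattern as a list of per-harmonic amplitudes are illustrative assumptions.

```python
def infer_overlapped(amplitudes, overlapped):
    """Replace each overlapped partial amplitude (0-based harmonic indices in
    `overlapped`) by linear interpolation between the nearest non-overlapped
    partials on either side. Assumes at least one non-overlapped partial."""
    out = list(amplitudes)
    clean = [i for i in range(len(out)) if i not in overlapped]
    for i in overlapped:
        left = max((j for j in clean if j < i), default=None)
        right = min((j for j in clean if j > i), default=None)
        if left is None:           # no clean partial below: copy nearest above
            out[i] = out[right]
        elif right is None:        # no clean partial above: copy nearest below
            out[i] = out[left]
        else:                      # linear interpolation between clean neighbours
            frac = (i - left) / (right - left)
            out[i] = out[left] + frac * (out[right] - out[left])
    return out
```

For example, if the second harmonic of a pattern collides with a partial of another candidate, its amplitude is re-estimated from the first and third harmonics rather than taken from the mixed spectrum.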
The next stage consists of computing a spectral distance measure between the current audio frame and the Gaussian spectral models of each combination. Although many other distances exist, we decided to use the Euclidean spectral distance, because alternatives such as the Kullback-Leibler (KL) divergence did not show significant differences in accuracy results with respect to the Euclidean distance. Moreover, the Euclidean distance provides higher computational efficiency because it requires a smaller number of operations.

The optimum combination, composed of the most likely set of active pitches, minimizes the Euclidean spectral distance, that is, it achieves the highest spectral similarity and explains most of the harmonic peaks present in the signal.
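The selection of the optimum combination by Euclidean distance minimization can be sketched as below, assuming the Gaussian spectral model of each combination has already been computed and sampled on the same frequency grid as the observed frame.

```python
import math

def best_combination(frame_spectrum, models):
    """Pick the combination of harmonic patterns whose summed Gaussian spectral
    model is closest (Euclidean distance) to the observed frame spectrum.
    `models` maps each candidate pitch combination to its modelled spectrum,
    sampled on the same bins as `frame_spectrum` (an illustrative interface)."""
    def dist(model):
        # Euclidean spectral distance between observation and model
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(frame_spectrum, model)))
    return min(models, key=lambda combo: dist(models[combo]))
```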
4 EXPERIMENTAL RESULTS
The performance of four time-frequency decompositions (STFT-1N, STFT-2N, STFT-8N and CQT) has been evaluated using three excerpts of monaural polyphonic music signals played by a Yamaha Disklavier playback grand piano (Poliner and Ellis, 2007): F1 (Sonata No. 8 in C minor, Pathetique), F2 (Piano Sonata in G major, Hoboken XVI:40) and F3 (Sonata No. 13 in B-flat major, KV 333). For each excerpt, the first 20 seconds were analyzed taking into account three frequency regions: low (MIDI 36 - MIDI 60), middle-high (MIDI 60 - MIDI 95) and low-middle-high (MIDI 36 - MIDI 95). Accuracy and error measures were computed using the metrics proposed in (Poliner and Ellis, 2007). A user interface has been implemented to show the polyphonic estimation (a screenshot of the GUI can be seen in Figure 5).
Figure 5: GUI for visualization of input parameters and
pitch estimations of the system. Each black rectangle rep-
resents a reference note-event. Each white rectangle repre-
sents an estimated note-event provided by our system.
For each frequency range, accuracy results are shown in Figure 6. As can be seen, the STFT-8N presents the best accuracy performance among the STFT variants evaluated, followed by the STFT-2N and the STFT-1N. Since the STFT-2N and the STFT-8N exhibit similar results, in the following we only compare the STFT-8N and the CQT. Figure 6(a) indicates that the CQT, compared to the STFT-8N, achieves better accuracy rates in the low frequency range, but this improvement does not exist in the middle-high frequency range (see Figure 6(b)). A possible reason is that the variable resolution of the CQT provides higher spectral discrimination in the low frequency range, where the fundamental frequencies of note-events are closer; this variable resolution avoids allocating adjacent note-events to the same frequency bin. Nevertheless, analyzing the overall frequency range (see Figure 6(c)), we can observe that the performance of the CQT and the STFT-8N is approximately the same.
Figure 6: Frame-level accuracy results, Acc (%) per evaluation file (F1, F2, F3) for STFT-1N, STFT-2N, STFT-8N and CQT. (a) Low frequency range. (b) Middle-high frequency range. (c) Low-middle-high frequency range.
Figure 7, Figure 8 and Figure 9 show the error distribution into different categories for each frequency range and time-frequency transformation. The CQT provides the best substitution error (E_sub) and miss error (E_miss) rates in the low frequency range (see Figure 7(a) and Figure 8(a)), but this does not occur in the middle-high frequency range. However, the CQT provides the best E_sub and E_miss rates when the low-middle-high frequency region is analyzed (see Figure 7(c) and Figure 8(c)). Considering false alarm error (E_fa) rates, the CQT exhibits worse performance than the STFT-8N in the low frequency range (see Figure 9(a)). A possible reason for these results is the higher resolution of the CQT at low frequencies, which generates a larger number of spurious F0-candidates.
Figure 7: Frame-level substitution error results, E_sub (%) per evaluation file (F1, F2, F3). (a) Low frequency range. (b) Middle-high frequency range. (c) Low-middle-high frequency range.

Figure 8: Frame-level miss error results, E_miss (%) per evaluation file (F1, F2, F3). (a) Low frequency range. (b) Middle-high frequency range. (c) Low-middle-high frequency range.

In order to compare the complexity of each transformation, a set of files of random duration has been transformed into the time-frequency domain by means of the analyzed transformations. The results report that the CQT requires the highest computational cost among the evaluated transformations. Specifically, the computational cost of the CQT is proportional to the duration of the input signal multiplied by a factor of approximately 50. This factor decreases for the other transformations: 0.26 (STFT-1N), 0.38 (STFT-2N) and 0.78 (STFT-8N).
5 CONCLUSIONS
This paper presents a preliminary analysis of two
different time-frequency decompositions, Constant
Q Transform and Short Time Fourier Transform,
in a polyphonic pitch estimation context applied to
monaural music signals. As shown in the results, the CQT provides better accuracy rates in the low frequency range, which makes it attractive for applications focused on low frequency regions, such as accompaniment or bass line detection. However, this improved performance is approximately the same as that of the STFT-8N when the overall frequency range (low-middle-high) is analyzed. In order to analyze polyphonic signals over a wider spectral range, the STFT-8N is more suitable because it exhibits the best trade-off between accuracy and computational cost, allowing the implementation of possible real-time applications.

Figure 9: Frame-level false alarm error results, E_fa (%). (a) Low frequency range. (b) Middle-high frequency range. (c) Low-middle-high frequency range.
ACKNOWLEDGEMENTS
This work was supported in part by the Spanish
Ministry of Education and Science under Project
TEC2009-14414-C03-02 and the Andalusian Council
under project P07-TIC-02713.
REFERENCES
Bello, J., Daudet, L., and Sandler, M. (2006). Automatic
piano transcription using frequency and time-domain
information. IEEE Transactions on Speech and Audio
Processing, 14(6):2242–2251.
Brown, J. (1991). Calculation of a constant q spectral trans-
form. Journal of the Acoustical Society of America,
89(1):425–434.
Burred, J. and Sikora, T. (2007). Monaural source separa-
tion from musical mixtures based on time-frequency
timbre models. Proc. International Conference on
Music Information Retrieval (ISMIR). Vienna, Aus-
tria.
Cañadas Quesada, F., Ruiz-Reyes, N., Vera-Candeas, P.,
Carabias-Orti, J., and Maldonado, S. (2010). A
multiple-f0 estimation approach based on gaussian
spectral modeling for polyphonic music transcription.
accepted to appear in Journal of New Music Research.
Cañadas Quesada, F., Vera-Candeas, P., Ruiz-Reyes, N.,
Mata-Campos, R., and Carabias-Orti, J. (2008). Note-
event detection in polyphonic musical signals based
on harmonic matching pursuits and spectral smooth-
ness. Journal of New Music Research, 89(8):1653–1660.
Carabias, J., Vera, P., Ruiz, N., Mata, R., and Canadas,
F. (2008). Polyphonic piano transcription based on
spectral separation. 124th Audio Engineering Society
(AES). Amsterdam, The Netherlands, 2008.
Emiya, V., Badeau, R., and David, B. (2008). Automatic
transcription of piano music based on hmm tracking of
jointly-estimated pitches. Proc. European Conference
on Signal Processing (EUSIPCO).
Every, M. and Szymanski, J. (2006). Separation of syn-
chronous pitched notes by spectral filtering of har-
monics. IEEE Transactions on Audio, Speech, and
Language Processing, 14(5):1845–1856.
Goto, M. (2004). A real-time music-scene-description sys-
tem: Predominant-f0 estimation for detecting melody
and bass lines in real-world audio signals. Speech Communication, 43(4):311–329.
IDMT, F. (2009). Musicline. http://www.musicline.de/de/
melodiesuche/input.
Kameoka, H., Nishimoto, T., and Sagayama, S. (2007).
A multipitch analyzer based on harmonic temporal
structured clustering. IEEE Trans. on Audio, Speech
and Language Processing, 15(3):982–994.
Klapuri, A. (2003). Multiple fundamental frequency esti-
mation by harmonicity and spectral smoothness. IEEE
Trans. Speech and Audio Processing, 11(6):804–816.
Li, Y., Woodruff, J., and Wang, D. (2009). Monaural musi-
cal sound separation based on pitch and common am-
plitude modulation. IEEE Trans. on Audio, Speech
and Language Processing, 17(-):1361–1371.
Marolt, M. (2004). A connectionist approach to automatic
transcription of polyphonic piano music. IEEE Trans-
actions on Multimedia, 6(3):439–449.
Neubacker, P. (2009). Celemony. http://www.celemony.
com.
Poliner, G. and Ellis, D. (2007). A discriminative model for
polyphonic piano transcription. EURASIP Journal on
Advances in Signal Processing, 2007(1):154–162.
Saito, S., Kameoka, H., Takahashi, K., Nishimoto, T., and
Sagayama, S. (2008). Specmurt analysis of poly-
phonic music signals. IEEE Trans. on Audio, Speech
and Language Processing, 16(3):639–650.
Smaragdis, P. (2009). Relative pitch tracking of multiple
arbitrary sounds. Journal of the Acoustical Society of
America, 125(5):3406–3413.
Yeh, C., Roebel, A., and Rodet, X. (2005). Multiple funda-
mental frequency estimation of polyphonic music sig-
nals. in Proc. International Conference on Acoustics,
Speech, and Signal Processing (ICASSP). Philadel-
phia, USA.