2.2 ERB (Equivalent Rectangular
Bandwidth) Filterbank
We use frame-based processing to allow for the
dynamic nature of hum signals. The normalized audio
is split into frames of 30 ms with an overlap of
10 ms to ensure a smooth variation in signal
characteristics. This signal is passed through an ERB
filterbank spanning 50 Hz to 8 kHz. There are 126
filters in this frequency range, uniformly spaced
0.25 ERB apart on the ERB scale. Signal rectification
and energy integration within the 30 ms window
are performed to simulate the workings of the inner
ear (B.C.J.Moore et al., 1997). Each frame of audio
now has 126 excitation energy features, which are fed
to the range adaptation block that simulates a
time-localized dynamic range adaptation.
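The filter spacing described above can be illustrated with a minimal sketch. It assumes the ERB-number formula of Glasberg and Moore (1990), which the text does not name explicitly, and the function names are hypothetical. Under that assumption, stepping by 0.25 ERB from 50 Hz up to 8 kHz yields 126 centre frequencies, in line with the count quoted in the text.

```python
import math

def hz_to_erb_number(f_hz):
    # ERB-number scale of Glasberg and Moore (1990); an assumed choice,
    # the paper does not state which ERB formula was used.
    return 21.4 * math.log10(4.37 * f_hz / 1000.0 + 1.0)

def erb_number_to_hz(e):
    # Inverse of hz_to_erb_number.
    return (10.0 ** (e / 21.4) - 1.0) * 1000.0 / 4.37

def erb_centre_frequencies(f_lo=50.0, f_hi=8000.0, step=0.25):
    # Centre frequencies spaced `step` ERB apart, starting at f_lo and
    # never exceeding f_hi.
    lo = hz_to_erb_number(f_lo)
    hi = hz_to_erb_number(f_hi)
    n = int(math.floor((hi - lo) / step)) + 1
    return [erb_number_to_hz(lo + i * step) for i in range(n)]

centres = erb_centre_frequencies()
print(len(centres), "filters between",
      round(centres[0], 1), "Hz and", round(centres[-1], 1), "Hz")
```

The sketch only places the filters; the filter shapes, rectification and energy integration are not modelled here.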
2.3 Dynamic Range Adaptation
A large window of 5 s is chosen to adjust the
dynamic range of hearing. Within each 5 s
window there are 2500 frames of audio. Each frame
of audio has 126 bins, which we call time-frequency
(T-F) bins. To simulate the dynamic range adaptation,
we choose the T-F bin that has the maximum energy
over the 5 s window. Even though human hearing has a
huge dynamic range of over 100 dB, the dynamic range
within the 5 s window is restricted to 35 dB by
neglecting all T-F bins whose energy is more than
35 dB below that of the maximum energy bin. This
enables us to neglect low energy bins that may
experience a substantial change in partial loudness
but would be inconsequential for the total loudness
that is finally perceived (see Sec. 2.4 for details).
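A minimal sketch of this window-level restriction follows. It assumes that the energies are linear (so a level difference in dB is 10 log10 of the energy ratio) and that "neglecting" a bin means setting it to zero; the function name and the list-of-lists layout are hypothetical.

```python
def restrict_window_range(energies, range_db=35.0):
    # energies: one 5 s window as a list of frames, each frame a list of
    # linear sub-band energies (assumed layout).
    peak = max(max(frame) for frame in energies)
    if peak <= 0.0:
        # Silent window: nothing to restrict.
        return [list(frame) for frame in energies]
    # Energy that lies range_db below the window maximum.
    floor = peak * 10.0 ** (-range_db / 10.0)
    # Assumption: neglected bins are set to zero.
    return [[e if e >= floor else 0.0 for e in frame] for frame in energies]

window = [[1.0, 1e-2, 1e-4],
          [0.5, 1e-3, 1e-5]]
restricted = restrict_window_range(window)
print(restricted)
```

In this toy window the bins at 1e-4 and 1e-5 lie more than 35 dB below the peak of 1.0 and are zeroed, while the others are kept.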
Furthermore, for each frame in this 5 s window, we
find the maximum energy T-F bin and retain only those
T-F bins whose energies are within 25 dB of this
maximum, setting the remaining T-F bin energies to
zero. This step prevents low energy sub-bands from
contributing to the actual onset detection process.
We clarify this step with an example. Suppose
sub-band j of frame i has a loudness of 0.05 sones,
while the maximum loudness in frame i is 1 sone,
contributed by sub-band k. For frame i + 1, let the
loudness in sub-bands j and k be 0.1 and 1.5 sones
respectively. Then, as explained in Sec. 2.5, if we
consider relative changes without weighting the
sub-bands, the loudness change in sub-band j (a
doubling) appears more significant than that in
sub-band k (a 50% increase). However, sub-band k
clearly contributes more than sub-band j to the total
loudness at frame i, and hence the changes occurring
there are the more appropriate ones to consider.
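The per-frame step and the numbers in the example can be sketched as follows. As before, linear energies and a 10 log10 decibel convention are assumed, and the helper names are hypothetical.

```python
def gate_frame(frame, range_db=25.0):
    # frame: linear sub-band energies of one frame (assumed layout).
    peak = max(frame)
    if peak <= 0.0:
        return list(frame)
    # Energy that lies range_db below the frame maximum.
    floor = peak * 10.0 ** (-range_db / 10.0)
    # Bins below the floor are set to zero, as described in the text.
    return [e if e >= floor else 0.0 for e in frame]

def relative_change(prev, curr):
    # Fractional increase from prev to curr.
    return (curr - prev) / prev

gated = gate_frame([1.0, 0.01, 0.001])

# Numbers from the example: sub-band j goes from 0.05 to 0.1 sones,
# sub-band k from 1 to 1.5 sones.
change_j = relative_change(0.05, 0.1)
change_k = relative_change(1.0, 1.5)
print(gated, change_j, change_k)
```

The relative change in sub-band j is larger than in sub-band k even though k carries far more of the frame's loudness, which is the situation the gating is meant to avoid.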
This dynamic range adaptation gives around a 7%
improvement in onset detection for polyphonic audio
over a previous version of the same algorithm
(Thoshkahna and K.R.Ramakrishnan, 2008), and hence
this step was retained even though, to our knowledge,
the ear does not display such a short-term adaptation
phenomenon. The empirical values of 35 dB and 25 dB
were arrived at after testing on a variety of audio.
This modified audio signal is used as the excitation
signal to the loudness model (B.Moore and
B.Glasberg, 1983). We use the model of loudness
for the human auditory system proposed by Moore et
al. (B.C.J.Moore et al., 1997) to detect onsets in
polyphonic audio.
2.4 Moore’s Model of Loudness
We use the modifications made by Timoney et al.
(J.Timoney et al., 2004) to Moore's loudness
model, with certain changes as explained below. The
loudness within each subband is computed as follows:
$$L_i(k) = C \left( E_{sig}(i,k)^{\alpha} - E_{th}(i)^{\alpha} \right) \qquad (4)$$
where $L_i(k)$ is the partial loudness in the $i$-th
subband of the ERB filterbank for the $k$-th frame,
$E_{sig}(i,k)$ is the excitation of the $i$-th subband
of the $k$-th frame, and $E_{th}(i)$ is the excitation
due to the threshold of hearing at the $i$-th subband.
We obtain $E_{th}(i)$ by passing pure sinusoids (with
rms values equal to the MAF (Minimum Audible Field)
values at the filter centres) through the ERB
filterbank. The constant $\alpha$ models the audibility
range compression that occurs in the human auditory
system and has a value of 0.093, and the constant $C$
is used to calibrate the model and has a value of 0.583.
Calibration followed the procedure given in
(J.Timoney et al., 2004), except that the model is
adapted to our requirements of a higher sampling rate
and a smaller ERB filter spacing. The model finally
provides the loudness in sones; the positive values
(i.e. only $L_i > 0$) from each subband are weighted
by the ERB distance and added to give the total
loudness of the frame.
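Equation (4) and the summation over subbands can be sketched as below, using the constants quoted in the text. The function names are hypothetical, and reading "weighted by the ERB distance" as multiplication by the 0.25 ERB filter spacing is an assumption on our part.

```python
ALPHA = 0.093    # compression exponent quoted in the text
C = 0.583        # calibration constant quoted in the text
ERB_STEP = 0.25  # filter spacing in ERB, used as the weight (assumption)

def partial_loudness(e_sig, e_th, c=C, alpha=ALPHA):
    # Equation (4) for a single subband and frame.
    return c * (e_sig ** alpha - e_th ** alpha)

def total_loudness(frame_excitation, threshold_excitation, erb_step=ERB_STEP):
    # Sum of the positive partial loudness values of one frame, each
    # weighted by the ERB spacing (assumed interpretation).
    total = 0.0
    for e_sig, e_th in zip(frame_excitation, threshold_excitation):
        loudness = partial_loudness(e_sig, e_th)
        if loudness > 0.0:
            total += erb_step * loudness
    return total

# Toy values: the first subband was zeroed by the range adaptation,
# the second sits exactly at threshold, the third is above threshold.
frame = [0.0, 1.0, 4.0]
threshold = [1.0, 1.0, 1.0]
print(total_loudness(frame, threshold))
```

A subband whose energy was set to zero by the range adaptation yields a negative partial loudness here and therefore drops out of the sum.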
2.5 Using the Loudness Model for Onset
Detection
As noted in Section 1, the output loudness of each
subband is used to find the potential onsets. Since the
loudness in each subband is specified in sones, an
onset will be seen as a sudden change in the partial
loudness. Thus we find the increase in subband loudness
SIGMAP 2009 - International Conference on Signal Processing and Multimedia Applications