Figure 3: Rough estimate with OSMS vs. final estimate
with RARS. Results for frequency bin k=8.
3.2 Speech Presence Probability
To calculate the speech presence probability, the
idea proposed by Cohen (Cohen and Berdugo, 2002) is
used. First, the a posteriori SNR is calculated
using the OSMS-estimated noise power as
$$\zeta(k, m) = \frac{R_x(k, m)}{R_{\mathrm{OSMS}}(k, m)}. \tag{2}$$
Since ζ(k, m) is computed using overestimated noise
power, it cannot be used directly. To overcome this ef-
fect the a posteriori SNR is smoothed over the neigh-
boring frequency bins to take into account the strong
correlation of speech presence across the frequency
bins in the same frame (Cohen and Berdugo, 2002).
Smoothed SNR is given by
$$\tilde{\zeta}(k, m) = \sum_{i=-j}^{j} w(i) \cdot \zeta(k - i, m) \tag{3}$$
where,
$$\sum_{i=-j}^{j} w(i) = 1 \tag{4}$$
and 2j + 1 is a window length for the frequency
smoothing. $\tilde{\zeta}(k, m)$ is then compared with a
threshold ∆ to derive a VAD index I(k, m) as follows,
$$I(k, m) = \begin{cases} 1, & \text{if } \tilde{\zeta}(k, m) > \Delta \\ 0, & \text{otherwise,} \end{cases} \tag{5}$$
where ∆ is an empirically determined threshold and
I(k, m) = 1 marks a speech-present bin. ∆ = 4.7
was proposed by Cohen (Cohen and Berdugo, 2002).
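Eqs. (2) to (5) can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `vad_index`, the three-tap window in the usage lines, and the clamping of out-of-range bins at the spectrum edges are all hypothetical choices that the text does not specify.

```python
def vad_index(r_x, r_osms, window, delta=4.7):
    """Return (smoothed_snr, vad) for one frame.

    r_x    : per-bin corrupted speech power R_x(k, m)
    r_osms : per-bin OSMS noise power R_OSMS(k, m), same length as r_x
    window : the 2j + 1 weights w(-j), ..., w(j), which must sum to 1 (Eq. 4)
    delta  : VAD threshold (4.7 is the value quoted in the text)
    """
    if len(r_x) != len(r_osms):
        raise ValueError("r_x and r_osms must have the same length")
    if len(window) % 2 != 1:
        raise ValueError("window length must be odd (2j + 1)")
    if abs(sum(window) - 1.0) > 1e-9:
        raise ValueError("window weights must sum to 1 (Eq. 4)")

    j = len(window) // 2
    n_bins = len(r_x)

    # Eq. (2): a posteriori SNR per frequency bin.
    snr = [x / n for x, n in zip(r_x, r_osms)]

    # Eq. (3): weighted average over the 2j + 1 neighbouring bins.
    smoothed = []
    for k in range(n_bins):
        acc = 0.0
        for i in range(-j, j + 1):
            # Assumption: bins outside the spectrum are clamped to the edge bin.
            idx = min(max(k - i, 0), n_bins - 1)
            acc += window[i + j] * snr[idx]
        smoothed.append(acc)

    # Eq. (5): hard decision against the threshold.
    vad = [1 if s > delta else 0 for s in smoothed]
    return smoothed, vad


# Usage with made-up powers: two strong bins in the middle of the frame.
smoothed, vad = vad_index(
    r_x=[1.0, 1.0, 20.0, 20.0, 1.0, 1.0],
    r_osms=[1.0] * 6,
    window=[0.25, 0.5, 0.25],
)
print(vad)  # [0, 1, 1, 1, 1, 0]
```

With this window the two strong bins also raise the VAD index of their immediate neighbours, which is the effect of the frequency smoothing in Eq. (3).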
Based on the VAD index, the speech presence
probability is then given by
p(k, m) = γ · p(k, m − 1)+ (1 − γ) · I(k, m), (6)
where γ is a constant determined empirically. Values
of γ ≤ 0.2 are suggested for a better estimate (Cohen
and Berdugo, 2002). p(k, m) is the probability that the
bin contains speech. If I(k, m) = 1, the value of p(k, m)
increases; if I(k, m) = 0, it decreases. It should be
pointed out that Eq. (3) implicitly takes the correlation
of speech presence in adjacent bins into consideration.
Note also that the threshold ∆ in Eq. (5) plays an
important role in speech detection. If the threshold ∆ is
low, speech presence can be detected with higher
confidence, thus avoiding overestimation (Cohen and
Berdugo, 2002).
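The recursion in Eq. (6) is a first-order smoother of the VAD index over frames. The sketch below illustrates it for a single bin. The function name `update_speech_presence`, the initial value p = 0 and the sequence of VAD decisions are assumptions made for the example; the text does not specify them.

```python
def update_speech_presence(p_prev, vad, gamma=0.2):
    """Eq. (6): p(k, m) = gamma * p(k, m - 1) + (1 - gamma) * I(k, m).

    p_prev : list of p(k, m - 1), one value per frequency bin
    vad    : list of I(k, m) in {0, 1}, same length as p_prev
    gamma  : smoothing constant (the text suggests gamma <= 0.2)
    """
    if not 0.0 <= gamma <= 1.0:
        raise ValueError("gamma must lie in [0, 1]")
    if len(p_prev) != len(vad):
        raise ValueError("p_prev and vad must have the same length")
    return [gamma * p + (1.0 - gamma) * i for p, i in zip(p_prev, vad)]


# Usage: one bin, three speech frames followed by two noise-only frames.
p = [0.0]  # assumed initial probability
history = []
for decision in [1, 1, 1, 0, 0]:
    p = update_speech_presence(p, [decision])
    history.append(round(p[0], 4))
print(history)  # approximately [0.8, 0.96, 0.992, 0.1984, 0.0397]
```

With a small γ the probability reacts within one or two frames to a change in the VAD index, rising towards 1 while I(k, m) = 1 and decaying towards 0 once I(k, m) = 0.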
3.3 Smoothing Parameter
Using the speech presence probability derived above,
a time-frequency dependent smoothing
parameter
η(k, m) = β + (1 − β) · p(k, m) (7)
is updated, where β is a constant. Values of β ≥ 0.85
yield a better estimate of η, as proposed in (Cohen
and Berdugo, 2002). η(k, m) increases linearly with
p(k, m) and takes values in the range β ≤ η(k, m) ≤ 1,
so the smoothing parameter is expected to be close to 1
in speech presence regions.
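Eq. (7) is a linear map from the speech presence probability to the smoothing parameter. A short sketch, assuming β = 0.85 (the smallest value in the suggested range) and a hypothetical function name `smoothing_parameter`:

```python
def smoothing_parameter(p, beta=0.85):
    """Eq. (7): eta(k, m) = beta + (1 - beta) * p(k, m), per frequency bin.

    p    : list of speech presence probabilities p(k, m), each in [0, 1]
    beta : constant lower bound of eta (the text suggests beta >= 0.85)
    """
    if not 0.0 <= beta <= 1.0:
        raise ValueError("beta must lie in [0, 1]")
    if any(pk < 0.0 or pk > 1.0 for pk in p):
        raise ValueError("probabilities must lie in [0, 1]")
    return [beta + (1.0 - beta) * pk for pk in p]


# Usage: noise-only bin, uncertain bin, speech-dominated bin.
eta = smoothing_parameter([0.0, 0.5, 1.0])
print([round(e, 3) for e in eta])  # [0.85, 0.925, 1.0]
```

The output stays within β ≤ η(k, m) ≤ 1 for any valid probability, matching the range stated above.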
3.4 Tracking Fast Changes
An algorithm to track fast changes in the noise power
is proposed here. The adaptation time of the proposed
algorithm is around 0.5 sec, close to that of Rapid
Adaptation for Highly Non-Stationary Environments
(RAHNSE approach) (Rangachari and Loizou, 2006). A
simple and effective idea proposed in (Erkelens and
Heusdens, 2008a) is applied here, which ensures that
the proposed approach can quickly track changes in the
noise power. First, a reference noise power estimate
is computed using OSMS with a short window (0.5 sec).
The corrupted speech power is then smoothed with a
small smoothing constant. The idea is to push the
noise estimate in the right direction when there is an
increase in noise power. The smoothed corrupted speech
power is given by
$$P(k, m) = \alpha \cdot P(k, m - 1) + (1 - \alpha) \cdot R_x(k, m), \tag{8}$$
where values of α ≤ 0.2 are suggested for better
smoothing. From the smoothed power spectrum, the
minimum $P_{\min}$ is found over a window of at least
0.5 sec. Because the smoothing constant is small, the
smoothed power spectrum closely follows the corrupted speech
NOISE POWER ESTIMATION USING RAPID ADAPTATION AND RECURSIVE SMOOTHING PRINCIPLES