Table 1: EER (%) obtained from speaker verification tests with MFCC feature vectors and the GMM classifier.

Noise        SNR                                 Average
             20 dB    15 dB    10 dB    5 dB
Clean         1.48
Babble        2.85     5.06    11.20    25.00    11.03
Destroyer     4.84    12.14    23.70    37.16    19.46
Factory       5.04    10.13    19.94    30.98    16.52
Leopard       4.43     8.35    14.92    23.92    12.91
Volvo         4.60     7.40    13.26    20.51    11.44
Average       4.35     8.62    16.60    27.51    14.27

For each speaker $S$, multiple copies of the clean training utterance $\Phi_S$ are corrupted by the artificial colored noises, resulting in the multicondition data sets $\Phi_S^l$ ($l = 1, \ldots, m$). Following the procedure described in Section 2.2, $m$ α-GMMs ($\lambda_S^l$) for speaker $S$ are obtained from the corrupted data sets $\Phi_S^l$. In analogy to (10), each of these models is parametrized by
$$\lambda_S^l = \{ w_i^l, \vec{\mu}_i^l, K_i^l \mid i = 1, \ldots, M \}, \quad l = 1, \ldots, m. \tag{12}$$
The colored multicondition training model ($\Lambda_S$) of speaker $S$ is given by the collection of all the parameters estimated in (12), i.e.,
$$\Lambda_S = \{ w_i^l, \vec{\mu}_i^l, K_i^l \mid l = 1, \ldots, m;\ i = 1, \ldots, M \}. \tag{13}$$
In order to adapt the Colored-MT to the α-GMM classifier, the probability $p(\vec{x} \mid \lambda_S)$ is adjusted to follow the α-integration of all $m \times M$ Gaussian densities:
$$p(\vec{x} \mid \Lambda_S) = c' \left[ \sum_{l=1}^{m} \sum_{i=1}^{M} w_i^l \, b_i^l(\vec{x})^{\frac{1-\alpha}{2}} \right]^{\frac{2}{1-\alpha}}, \tag{14}$$
where $c'$ is a new normalization constant.
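As an illustration of Eq. (14), the following minimal sketch evaluates the α-integration over all m × M Gaussian densities for a single feature vector. It is not the authors' implementation: the function names, the array layout (weights of shape (m, M), means of shape (m, M, d), covariances of shape (m, M, d, d)) and the choice c′ = 1 are assumptions made here for illustration, and no protection against numerical underflow is included.

```python
import numpy as np

def gaussian_log_density(x, mean, cov):
    """Log of a multivariate Gaussian density b(x) with full covariance."""
    d = mean.shape[0]
    diff = x - mean
    _, logdet = np.linalg.slogdet(cov)
    maha = diff @ np.linalg.solve(cov, diff)
    return -0.5 * (d * np.log(2.0 * np.pi) + logdet + maha)

def alpha_integrated_likelihood(x, weights, means, covs, alpha=-1.0):
    """Unnormalized p(x | Lambda_S) following Eq. (14), assuming c' = 1.

    weights: (m, M), means: (m, M, d), covs: (m, M, d, d).
    The exponent 2 / (1 - alpha) is undefined for alpha = 1.
    """
    if alpha == 1.0:
        raise ValueError("alpha = 1 is not covered by this sketch")
    p = (1.0 - alpha) / 2.0
    m, M = weights.shape
    total = 0.0
    for l in range(m):          # noise-specific models
        for i in range(M):      # Gaussian components of each model
            log_b = gaussian_log_density(x, means[l, i], covs[l, i])
            total += weights[l, i] * np.exp(p * log_b)  # w * b(x)^p
    return total ** (1.0 / p)
```

With α = −1 the exponent (1 − α)/2 equals 1, so the expression reduces to the usual weighted sum of Gaussian densities of a conventional GMM.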
3 EXPERIMENTS AND RESULTS
The speaker verification experiments are conducted with a subset of 168 speakers (106 males and 62 females) of the TIMIT database (Fisher et al., 1986). The speech database contains ten utterances per speaker, with a sampling rate of 16 kHz and an average duration of 3 seconds. The speech segments of ten speakers (5 males and 5 females) are concatenated to obtain the UBM. For each of the 158 remaining speakers, eight utterances are set aside to train the models, and the other two are used for tests.
Five environmental acoustic noises (Babble, Destroyer, Factory, Leopard and Volvo), collected from the NOISEX-92 database (Varga and Steeneken, 1993), are used to corrupt the test speech utterances. The tests use SNR values of 5, 10, 15 and 20 dB, as well as clean speech.
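The corruption of a test utterance at a given SNR can be sketched as follows. The text does not state how the SNR was measured, so this example assumes a global SNR defined as the ratio of average speech power to average noise power over the whole utterance; the function name and the looping of a short noise segment are likewise assumptions.

```python
import numpy as np

def add_noise_at_snr(speech, noise, snr_db):
    """Return speech + scaled noise so that the global SNR equals snr_db.

    Assumption: SNR is the ratio of the mean power of the speech to the
    mean power of the (scaled) noise, computed over the whole utterance.
    """
    speech = np.asarray(speech, dtype=float)
    noise = np.asarray(noise, dtype=float)
    # Repeat the noise if it is shorter than the utterance, then truncate.
    reps = int(np.ceil(len(speech) / len(noise)))
    noise = np.tile(noise, reps)[:len(speech)]
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2)
    # Gain that brings the noise power to p_speech / 10^(snr_db / 10).
    gain = np.sqrt(p_speech / (p_noise * 10.0 ** (snr_db / 10.0)))
    return speech + gain * noise
```

Calling `add_noise_at_snr(utterance, noise_segment, 10)` would produce one 10 dB test condition; the four SNR values and five noise sources give 20 noisy conditions in addition to clean speech.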

Table 2: EER (%) obtained from speaker verification tests with MFCC + pH feature vectors and the GMM classifier.

Noise        SNR                                 Average
             20 dB    15 dB    10 dB    5 dB
Clean         1.31
Babble        2.85     4.97    11.53    23.55    10.72
Destroyer     4.75    11.17    22.73    35.76    18.60
Factory       3.91     7.38    13.92    25.63    12.71
Leopard       4.11     7.09    14.44    22.45    12.02
Volvo         3.16     5.78     9.49    16.14     8.65
Average       3.76     7.28    14.42    24.71    12.54

Two sets of experiments are presented in this work. In the first one, the speaker verification task is evaluated with the α-GMM classifiers, considering the MFCC and the fusion of MFCC and pH as speech feature vectors. All the α-GMMs are obtained with 32 Gaussian densities. The conventional GMM is a particular case of the α-GMM classifier (α = −1). The second set of experiments is conducted with the MFCC + pH feature fusion combined with the Colored-MT technique and the α-GMM classifier.
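Results in both sets of experiments are reported as equal error rate (EER), the operating point at which the false acceptance and false rejection rates coincide. The scoring procedure is not detailed in this section, so the following is only a minimal sketch, assuming that higher scores indicate the target speaker and that the EER is approximated at the threshold where the two error rates are closest.

```python
import numpy as np

def equal_error_rate(target_scores, impostor_scores):
    """Approximate EER (%) by scanning all observed scores as thresholds.

    Assumption: a trial is accepted when its score is >= threshold.
    """
    target = np.asarray(target_scores, dtype=float)
    impostor = np.asarray(impostor_scores, dtype=float)
    thresholds = np.sort(np.concatenate([target, impostor]))
    best_gap, eer = np.inf, None
    for t in thresholds:
        frr = np.mean(target < t)      # false rejection rate
        far = np.mean(impostor >= t)   # false acceptance rate
        gap = abs(far - frr)
        if gap < best_gap:
            best_gap, eer = gap, (far + frr) / 2.0
    return 100.0 * eer
```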
3.1 MFCC and pH Fusion
The MFCC feature matrix is composed of 12-dimensional vectors, obtained from frames of 20 ms with 50% frame overlap. A Mel-scale filterbank composed of 26 filters and a pre-emphasis factor of 0.97 are adopted. The pH values are estimated from three consecutive speech frames using Daubechies wavelet filters (Daubechies, 1992) with 12 coefficients, using a scale range from 2 to 8. A total of J = 8 decomposition scales are considered to obtain the $H_j$ values. Including the estimated values of $H_0$ from the original speech signal, 9-dimensional pH vectors are extracted to compose the feature matrices. Thus, in the experiments with the MFCC + pH fusion, the speech feature vectors have 12 + 9 = 21 components.
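The framing parameters and the dimensionality of the fused vectors can be checked with a short sketch. The MFCC and pH extraction themselves are not reproduced here; the function names are hypothetical, and the assumption that one pH vector is available per MFCC frame (so the two matrices can be concatenated row by row) is not stated in the text.

```python
import numpy as np

SAMPLE_RATE = 16000                      # Hz, as in the TIMIT subset
FRAME_LEN = SAMPLE_RATE * 20 // 1000     # 20 ms -> 320 samples
HOP = FRAME_LEN // 2                     # 50% overlap -> 160 samples

def pre_emphasize(signal, coeff=0.97):
    """First-order pre-emphasis: y[n] = x[n] - coeff * x[n-1]."""
    signal = np.asarray(signal, dtype=float)
    return np.append(signal[0], signal[1:] - coeff * signal[:-1])

def frame_signal(signal, frame_len=FRAME_LEN, hop=HOP):
    """Split a signal (at least frame_len samples) into overlapping frames."""
    signal = np.asarray(signal, dtype=float)
    n_frames = 1 + (len(signal) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    return signal[idx]

def fuse_features(mfcc, ph):
    """Concatenate per-frame MFCC (n, 12) and pH (n, 9) into (n, 21).

    Assumption: both matrices are aligned to the same frames.
    """
    if mfcc.shape[0] != ph.shape[0]:
        raise ValueError("MFCC and pH matrices must have the same number of frames")
    return np.hstack([mfcc, ph])
```

For a 3-second utterance at 16 kHz (48000 samples), this framing yields 299 frames of 320 samples each.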
3.1.1 GMM
Tables 1 and 2 show the EER results obtained from the speaker verification experiments considering the GMM with single MFCC and MFCC + pH feature vectors, respectively. Note that, compared to single MFCC, the MFCC + pH fusion achieves better accuracy, i.e., lower average EER values, for all five noise sources and also for clean speech. The largest contribution of the pH feature is an absolute EER reduction of 6.02 percentage points (from 19.94% to 13.92%) for test utterances corrupted by the Factory noise with SNR of 10 dB. The average EER over all five noises is reduced from 14.27% to 12.54%, an absolute improvement of 1.73 percentage points. Fig. 3 illustrates the DET curves obtained
BIOSIGNALS 2012 - International Conference on Bio-inspired Systems and Signal Processing