In our study the signal was divided into 30 ms long
frames with 1/3 overlap. MECB of orders p = 0.5, 0.6, ..., 1.0
were extracted. DMECB were calculated for a fixed p_1 of 1.0
and p_2 = 0.5, ..., 0.9.
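The framing step can be illustrated with a minimal sketch. It assumes the 8 kHz sampling rate given in Section 4, so that a 30 ms frame holds 240 samples and a 1/3 overlap corresponds to a hop of 160 samples. The function name `frame_signal` and the choice to drop a trailing partial frame are assumptions made for illustration, not details taken from the study.

```python
def frame_signal(samples, sample_rate_hz=8000, frame_ms=30, overlap_fraction=1 / 3):
    """Split a sequence of samples into fixed-length overlapping frames.

    Assumption: a trailing partial frame is discarded.
    """
    frame_len = round(sample_rate_hz * frame_ms / 1000)      # 240 samples at 8 kHz
    hop = frame_len - round(frame_len * overlap_fraction)    # 240 - 80 = 160 samples
    frames = []
    start = 0
    while start + frame_len <= len(samples):
        frames.append(samples[start:start + frame_len])
        start += hop
    return frames


# One second of dummy audio at 8 kHz.
frames = frame_signal(list(range(8000)))
```

Consecutive frames produced this way share their last and first 80 samples.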
3 COMBINATION OF FEATURES
Combining different acoustic features can be per-
formed in a number of ways. One way is to concate-
nate the feature vectors of the corresponding frames.
However, this leads to feature vectors of very high
dimensionality, which means much more data is re-
quired for reliable training of a classifier. Thus the
concatenation was only done for low-dimensional
feature vectors pH, while for the high-dimensional
features another method was used. A GMM models the
class-conditional probability density function in the
feature space for each class. A GMM classifier returns
a score for each given pattern, which is an estimate of
the log-likelihood ratio for the hypothesis that the
speaker is who he claims to be (Reynolds and Rose, 1995).
The scores from the GMM classifiers for each of the
acoustic features were used as features, and the resulting
score feature vectors were used with an SVM classifier.
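The score-level combination described above can be sketched as follows. The sketch assumes diagonal-covariance GMMs stored as `(weights, means, variances)` tuples and a log-likelihood ratio averaged over frames against a background model. The averaging and all function names are illustrative assumptions. The SVM stage itself is not shown; the returned list is the score vector that would be passed to it.

```python
import math


def log_gauss_diag(x, mean, var):
    """Log density of a diagonal-covariance Gaussian at vector x."""
    return -0.5 * sum(
        math.log(2 * math.pi * v) + (xi - m) ** 2 / v
        for xi, m, v in zip(x, mean, var)
    )


def gmm_log_likelihood(x, weights, means, variances):
    """Log likelihood of x under a GMM, computed with the log-sum-exp trick."""
    logs = [
        math.log(w) + log_gauss_diag(x, m, v)
        for w, m, v in zip(weights, means, variances)
    ]
    top = max(logs)
    return top + math.log(sum(math.exp(value - top) for value in logs))


def llr_score(frames, speaker_gmm, background_gmm):
    """Average per-frame log-likelihood ratio (the averaging is an assumption)."""
    total = sum(
        gmm_log_likelihood(x, *speaker_gmm) - gmm_log_likelihood(x, *background_gmm)
        for x in frames
    )
    return total / len(frames)


def score_vector(frames_per_feature, speaker_gmms, background_gmms):
    """One GMM score per acoustic feature type, stacked into a single vector."""
    return [
        llr_score(frames, spk, bkg)
        for frames, spk, bkg in zip(frames_per_feature, speaker_gmms, background_gmms)
    ]


# Toy one-dimensional, single-component models (values are purely illustrative).
speaker = ([1.0], [[0.0]], [[1.0]])
background = ([1.0], [[2.0]], [[1.0]])
scores = score_vector([[[0.0], [0.0]], [[1.0]]], [speaker, speaker], [background, background])
```

Each trial is thus represented by a vector with one entry per acoustic feature type, which keeps the input of the second-stage classifier low-dimensional.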
4 EXPERIMENTS AND RESULTS
All experiments were conducted on the single-speaker
files of the NIST 2001 Speaker Recognition Evaluation
(SRE) database. The audio files, sampled at 8 kHz,
were pre-emphasised with a filter coefficient of 0.97
and divided into frames as described above. For all
features a Gaussian Mixture Model (GMM) classifier
with 512 multivariate normal components with diagonal
covariance matrices was used (Reynolds and Rose,
1995). The Universal Background Models (UBM)
were trained on samples from 82 male and 56 female
speakers. The resulting Detection Error Trade-off
(DET) curves and the Equal Error Rates (EER)
are shown in Fig. 1(a)–(g).
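The pre-emphasis step mentioned above is commonly realised as the first-order filter y[n] = x[n] − 0.97 x[n−1]. The sketch below assumes that form. Passing the first sample through unchanged is an assumption, as is the function name.

```python
def pre_emphasise(samples, coefficient=0.97):
    """Apply y[n] = x[n] - coefficient * x[n-1].

    Assumption: the first sample is passed through unchanged.
    """
    if not samples:
        return []
    output = [samples[0]]
    for previous, current in zip(samples, samples[1:]):
        output.append(current - coefficient * previous)
    return output


emphasised = pre_emphasise([0.0, 1.0, 0.0])
flat = pre_emphasise([1.0, 1.0, 1.0])
```

A constant input is attenuated to about 3% of its amplitude after the first sample, which shows the high-pass character of the filter.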
Individual Features. The results achieved with
MFCC features with the first and second differences
were taken as the baseline (Fig. 1(a)). As seen from
the DET curves in Fig. 1(b), adding the first difference
to MLSF improves the speaker verification accuracy,
which is in agreement with the results in (Cordeiro
and Ribeiro, 2006). Adding the second difference improves
the accuracy further. Because of the high dimensionality
of the resulting feature vectors (48), more
training data may lead to better system performance.
Fig. 1(c) shows the DET curves for Residual
Phase features with LP filters of two different orders. The
difference in the LP filter order does not result in a
significant difference in the speaker verification accuracy.
It was also found that adding the first difference
features does not change the system performance either,
so the second difference was not tried.
Features pH_{4+6+12} were obtained by concatenating
feature vectors pH_4, pH_6, and pH_12 for each
frame. It was found that performance of the speaker
verification system is similar when any one of pH_4,
pH_6, pH_12 is used. Concatenating them into
12-dimensional pH_{4+6+12} vectors leads to a dramatic
improvement in the accuracy with EER dropping from
29.0% to 20.8% (Fig. 1(d)).
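Frame-wise concatenation of this kind can be sketched as below. The per-feature dimension of 4 is only an illustrative assumption, chosen so that the total matches the 12 dimensions stated in the text. The numeric values and the function name are placeholders.

```python
def concatenate_frames(*feature_streams):
    """Concatenate the vectors of corresponding frames from several feature streams."""
    frame_counts = {len(stream) for stream in feature_streams}
    if len(frame_counts) != 1:
        raise ValueError("all feature streams must have the same number of frames")
    return [
        [value for vector in frame_vectors for value in vector]
        for frame_vectors in zip(*feature_streams)
    ]


# Two frames per stream; the dimensions (4 each) are assumed for illustration.
ph4 = [[0.1] * 4, [0.2] * 4]
ph6 = [[0.3] * 4, [0.4] * 4]
ph12 = [[0.5] * 4, [0.6] * 4]
combined = concatenate_frames(ph4, ph6, ph12)
```

The frame count is unchanged, while the dimensionality of each frame vector is the sum of the input dimensionalities.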
The accuracy of speaker verification for MECB_p
features declines as the FrFT order p decreases
(Fig. 1(e), Table 1). This is in accordance with the
results reported in (Wang and Wang, 2005), while the
results for DMECB_{1.0−p_2} features with various p_2
(Fig. 1(f), Table 2) differ from those reported in that
paper: the highest speaker verification accuracy was
achieved for p_2 = 0.5, and for p_2 = 0.6, ..., 0.9 the
accuracy decreased as p_2 increased. Adding the
difference features to MECB and DMECB did not lead
to accuracy improvement.
Table 1: Equal error rates for MECB features of different orders.
MECB_p, p    1.0    0.9    0.8    0.7    0.6    0.5
EER, %       17.6   18.7   21.2   24.2   27.5   31.4
Table 2: Equal error rates for DMECB features of different orders.
DMECB_{1.0−p_2}, p_2    0.9    0.8    0.7    0.6    0.5
EER, %                  19.7   19.4   18.9   18.3   17.8
Table 3: Summary of equal error rates for different feature types and their SVM combination.
Feature type    EER, %    Feature type       EER, %
MFCC+∆+∆∆       9.5       Residual phase     21.5
MLSF+∆+∆∆       16.0      pH_{4+6+12}        20.8
MECB_{1.0}      17.6      DMECB_{1.0−0.5}    17.8
Combined        8.7
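For reference, an equal error rate of the kind reported in the tables can be estimated from two lists of trial scores. The following is a minimal sketch, assuming that higher scores favour the target hypothesis and that the EER is approximated at the threshold where the false acceptance and false rejection rates are closest. The function name and the toy scores are illustrative and are not data from the experiments.

```python
def equal_error_rate(target_scores, impostor_scores):
    """Approximate the EER by scanning all observed scores as thresholds.

    A trial is accepted when its score is greater than or equal to the threshold.
    """
    best_gap = None
    best_eer = None
    for threshold in sorted(set(target_scores) | set(impostor_scores)):
        false_reject = sum(s < threshold for s in target_scores) / len(target_scores)
        false_accept = sum(s >= threshold for s in impostor_scores) / len(impostor_scores)
        gap = abs(false_accept - false_reject)
        if best_gap is None or gap < best_gap:
            best_gap = gap
            best_eer = (false_accept + false_reject) / 2
    return best_eer


# Toy scores for illustration only.
eer = equal_error_rate([2.0, 3.0, 4.0, 5.0], [0.0, 1.0, 2.5, 3.5])
```

With few trials the two error rates may never coincide exactly, so the sketch reports their mean at the closest point.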
Combination of Features. To make the results
comparable to those of the acoustic features alone, a
5-fold cross-validation scheme was applied. The test
set of speakers was divided into 5 approximately
equal parts. Each time a different part was held out
for testing and the other four were used for training
the SVM, resulting in 5 experiments in total. The
SVM was designed to produce a soft decision, which
BIOSIGNALS 2008 - International Conference on Bio-inspired Systems and Signal Processing