NOISE ROBUST SPEAKER VERIFICATION BASED ON THE MFCC AND pH FEATURES FUSION AND MULTICONDITION TRAINING

L. Zão

and R. Coelho

Graduate Program in Defense Engineering, Military Institute of Engineering (IME), Rio de Janeiro, Brazil

Electrical Engineering Department, Military Institute of Engineering (IME), Rio de Janeiro, Brazil

Keywords:

Colored noise, Multicondition training, Speaker veriﬁcation, α-GMM.

Abstract:

This paper investigates the fusion of Mel-frequency cepstral coefﬁcients (MFCC) and pH features, combined

with the multicondition training (MT) technique based on artiﬁcial colored spectra noises, for noise robust

speaker veriﬁcation. The α-integrated Gaussian mixture models (α-GMM), an extension of the conventional

GMM, are used in the speaker veriﬁcation experiments. Five real acoustic noises are used to corrupt the

speech signals at different signal-to-noise ratios (SNR) for tests. The experimental results show that the use of

MFCC + pH feature vectors improves the accuracy of speaker verification systems based on MFCC alone. It

is also shown that the speaker veriﬁcation system with the MFCC + pH fusion and the α-GMM with the MT

technique achieves the best performance for the speaker veriﬁcation task in noisy environments.

1 INTRODUCTION

Over the last decades, automatic speaker veriﬁca-

tion or authentication has been demonstrated to be

an interesting solution for applications with security

concerns, such as access control, data security and

forensic investigations (Naik, 1990) (Campbell et al.,

2009). The main goal of a speaker veriﬁcation task is

to accept or reject a claimed identity.

Speaker veriﬁcation systems are composed of a

training and a testing phase. The training phase has

three steps: speech acquisition/pre-processing, fea-

tures extraction and speaker modeling. In the test-

ing phase, the pre-processing and features extraction

steps are also present. Then, the extracted features are

compared to the speakers models and the appropriate

decision is taken.

The MFCC (Davis and Mermelstein, 1980)

and GMM-UBM (universal background model)

(Reynolds and Rose, 1995) based system achieves

high recognition accuracies for clean speech

(Reynolds, 1995). However, its performance can

be severely degraded when the speech signals are

corrupted by acoustic noise (Ming et al., 2007).

This paper proposes the fusion of the MFCC and

pH (Sant’Ana et al., 2006) features combined

with a colored-noise-based multicondition training

(Colored-MT) technique (Zão and Coelho, 2011) to

improve the noise robustness of speaker veriﬁcation

tasks. The proposed solution is evaluated without any

speech enhancement (Boll, 1979), orthogonalization

(Fukunaga, 1990), missing-feature (Cooke et al.,

2001) or score-normalization (Bimbot et al., 2004)

techniques. The results are presented for the GMM

and α-GMM (Wu et al., 2009) classiﬁers.

For the veriﬁcation experiments the speech utter-

ances are collected from the TIMIT database (Fisher

et al., 1986). The speech signals are corrupted by the

acoustic noises (Babble, Destroyer, Factory, Leop-

ard and Volvo) obtained from the NOISEX-92 (Varga

and Steeneken, 1993) database, considering SNR values of 5, 10, 15 and 20 dB. The experimental results show that the proposed solution reduces the average EER over the five noises from 14.27% (MFCC with GMM) to 6.66% (MFCC + pH with Colored-MT and α-GMM, α = −6).

The remainder of this work is organized as fol-

lows. Section 2 provides the basic concepts of a

speaker veriﬁcation system, including the speech fea-

tures and classiﬁers adopted in this work. This Sec-

tion also presents the colored-noise-based multicon-

dition training technique for the α-GMM classiﬁer.

Section 3 describes the speaker verification experiments conducted in different noisy environments. The

results are presented and discussed in the same Sec-

tion. Finally, Section 4 concludes this work.


Zão L. and Coelho R..

NOISE ROBUST SPEAKER VERIFICATION BASED ON THE MFCC AND pH FEATURES FUSION AND MULTICONDITION TRAINING.

DOI: 10.5220/0003890501370143

In Proceedings of the International Conference on Bio-inspired Systems and Signal Processing (BIOSIGNALS-2012), pages 137-143

ISBN: 978-989-8425-89-8

 2012 SCITEPRESS (Science and Technology Publications, Lda.)

2 SPEAKER VERIFICATION

Given a claimed identity of a speaker S and an ob-

served speech segment (Y), the veriﬁcation task can

be stated as a hypothesis test. In fact, the system ac-

cepts one of the following statements as true:



$H_0$: Y belongs to speaker S.

$H_1$: Y does not belong to speaker S.

To decide whether the observed speech segment

belongs or not to the claimed speaker, the following

log-likelihood ratio test is generally applied:

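$$\log p(Y \mid \lambda_S) - \log p(Y \mid \lambda_{UBM}) \; \begin{cases} \geq \theta, & \text{accept } H_0 \\ < \theta, & \text{accept } H_1 \end{cases} \tag{1}$$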

In (1), $p(Y \mid \lambda_S)$ is the probability density function (pdf) of Y given it was spoken by the claimed speaker S, modeled by $\lambda_S$. In the same way, $p(Y \mid \lambda_{UBM})$ is the pdf of Y given that it is not from the claimed speaker, i.e., the speech segment belongs to an intruder. $\lambda_{UBM}$ is generally modeled by GMM-UBM.

The choice of the decision threshold θ is a tradeoff

between the false rejection (FR) and false acceptance

(FA) errors. These probabilities are usually evaluated

by detection error tradeoff (DET) curves. The equal

error rate (EER) corresponds to the point where the

FR and FA probabilities are equal.
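As an illustration of the decision rule in (1) and of how the EER can be read from a set of scores, the following is a minimal sketch. It assumes the log-likelihood ratios have already been computed; the function names `accept_claim` and `equal_error_rate` and the threshold sweep are hypothetical choices made for this example, not the procedure used by the authors.

```python
import numpy as np

def accept_claim(llr, theta):
    """Decision rule of Eq. (1): accept H0 when the log-likelihood ratio is >= theta."""
    return llr >= theta

def equal_error_rate(target_scores, impostor_scores):
    """Sweep the threshold over all observed scores and return (EER, threshold)."""
    target_scores = np.asarray(target_scores, dtype=float)
    impostor_scores = np.asarray(impostor_scores, dtype=float)
    thresholds = np.sort(np.concatenate([target_scores, impostor_scores]))
    # False rejection: a true-speaker trial scored below the threshold.
    fr = np.array([(target_scores < th).mean() for th in thresholds])
    # False acceptance: an intruder trial scored at or above the threshold.
    fa = np.array([(impostor_scores >= th).mean() for th in thresholds])
    best = np.argmin(np.abs(fr - fa))  # point where FR and FA are closest
    return (fr[best] + fa[best]) / 2.0, thresholds[best]
```

With a finite set of trials the FR and FA curves are step functions, so the sketch reports the midpoint of the two rates at the closest crossing rather than an interpolated value.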

2.1 Speech Features

Speech features are generally computed or extracted

using Hamming windows with length of 20 to 30 ms

and 50% of frame period overlapping. From each

frame, a set of coefﬁcients is obtained to form a

speech feature vector.
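The framing step described above can be sketched as follows. This is only an assumed implementation: the function name `frame_signal`, the default of 20 ms and the choice to discard a trailing partial frame are illustrative and are not specified in the text.

```python
import numpy as np

def frame_signal(signal, sample_rate, frame_ms=20.0, overlap=0.5):
    """Split a signal into overlapping frames and apply a Hamming window to each."""
    signal = np.asarray(signal, dtype=float)
    frame_len = int(round(sample_rate * frame_ms / 1000.0))
    hop = int(round(frame_len * (1.0 - overlap)))  # 50% overlap -> hop of half a frame
    if len(signal) < frame_len:
        return np.empty((0, frame_len))
    n_frames = 1 + (len(signal) - frame_len) // hop  # trailing partial frame is dropped
    window = np.hamming(frame_len)
    frames = np.stack([signal[i * hop : i * hop + frame_len] for i in range(n_frames)])
    return frames * window
```

For a 16 kHz signal, 20 ms frames contain 320 samples and consecutive frames start 160 samples apart.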

2.1.1 MFCC

MFCC are commonly used as speech features in speaker recognition systems, since they are considered a good representation of the human auditory system. They are extracted using Mel-scale band filters. The Mel-frequency scale is related to the linear-frequency scale as:

$$f_{MEL} = 1127 \cdot \log\left( 1 + \frac{f_{LIN}}{700} \right) \tag{2}$$

The MFCC coefﬁcients are then calculated by the

discrete cosine transform (DCT):

$$c_d = \sum_{k=1}^{F} S_k \cdot \cos\left[ d \left( k - \frac{1}{2} \right) \frac{\pi}{F} \right], \quad d = 1, 2, \ldots, D, \tag{3}$$

where F is the number of filters in the Mel-frequency filterbank, $S_k$ is the log-energy output of the $k$-th filter, and D is the number of cepstrum coefficients. The MFCC extraction schematic is depicted in Fig. 1.

[Figure 1 block diagram: Speech Signal, Pre-emphasis, FFT, Mel-Frequency Filterbank, log, DCT, MFCC Coefficients]

Figure 1: Representation of the MFCC extraction (FFT: fast Fourier transform).
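The Mel mapping in (2) and the DCT in (3) can be sketched as below. The sketch assumes that the logarithm in (2) is the natural logarithm (which is what the constant 1127 corresponds to) and that the filterbank log-energies are already available; the function names are hypothetical.

```python
import numpy as np

def hz_to_mel(f_hz):
    """Eq. (2): linear frequency in Hz to Mel, natural logarithm assumed."""
    return 1127.0 * np.log(1.0 + np.asarray(f_hz, dtype=float) / 700.0)

def mel_to_hz(f_mel):
    """Inverse of Eq. (2), useful for placing filterbank center frequencies."""
    return 700.0 * (np.exp(np.asarray(f_mel, dtype=float) / 1127.0) - 1.0)

def cepstrum_from_log_energies(log_energies, n_ceps=12):
    """Eq. (3): DCT of the F filterbank log-energies, for d = 1, ..., D."""
    s = np.asarray(log_energies, dtype=float)
    n_filters = s.shape[-1]                       # F
    k = np.arange(1, n_filters + 1)               # filter index k = 1..F
    d = np.arange(1, n_ceps + 1)[:, None]         # cepstral index d = 1..D
    basis = np.cos(d * (k - 0.5) * np.pi / n_filters)  # shape (D, F)
    return s @ basis.T
```

Because the sum in (3) starts at d = 1, a flat log-energy spectrum yields cepstral coefficients that are all zero; the overall energy level would only appear in a d = 0 term, which (3) does not include.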

2.1.2 pH

The pH feature was proposed in (Sant’Ana et al.,

2006) and consists of a vector of Hurst (H) param-

eters. The Hurst parameter (0 ≤ H ≤ 1) expresses the

time-dependence or scaling degree of the speech sig-

nal.

Let the speech signal be represented by a stochas-

tic process X(t), with the normalized autocorrelation

coefﬁcient function deﬁned by

$$\rho(k) = \frac{\mathrm{Cov}[X(t), X(t+k)]}{\mathrm{Var}[X(t)]}. \tag{4}$$

The Hurst parameter is deﬁned by the decaying rate

of ρ(k), whose asymptotic behavior is given by

$$\rho(k) \sim H(2H - 1)\, k^{2H-2}, \quad k \to \infty. \tag{5}$$

The Wavelet-based Multi-dimensional Estimator (M-dim-wavelets) (Sant’Ana et al., 2006) was proposed as a pH feature extractor and is based on the estimator described in (Veitch and Abry, 1999). It uses the discrete wavelet transform (DWT) to successively decompose a sequence of speech samples into the approximation ($a(j,k)$) and detail ($d(j,k)$) coefficients, where $j$ is the decomposition scale and $k$ is the coefficient index of each scale. From each detail sequence, $d(j,k)$, generated by the filter bank in a given scale $j$, a Hurst parameter $H_j$ is estimated. The set of $H_j$ values and the value obtained for the entire speech signal ($H_0$) compose the pH feature. Fig. 2 shows an example of the M-dim-wavelets estimator considering 3 decomposition stages. The M-dim-wavelets estimator can be described in the following steps (Sant’Ana et al., 2006):

1. Wavelet decomposition: the DWT is applied to the speech samples, generating the detail sequences $d(j,k)$.

2. Variance estimation of the detail coefficients: for each scale $j$, the variance $\sigma_j^2 = (1/n_j) \sum_k d(j,k)^2$ is evaluated, where $n_j$ is the number of available coefficients for each scale $j$. It can be shown (Veitch and Abry, 1999) that $E[\sigma_j^2] = c\, 2^{j(2H-1)}$, where $c$ is a constant.

3. pH estimation: plot $y_j = \log_2(\sigma_j^2)$ versus $j$. Using a weighted linear regression, one gets the slope $a$ of the plot, and the Hurst parameter is estimated as $H = (1 + a)/2$. Apply the Hurst estimator to the entire speech signal ($H_0$) and then to each of the $J$ detail sequences obtained in the first step (see Fig. 2). The resulting $(J + 1)$ $H$ values compose the pH feature.

[Figure 2: block diagram with three decomposition stages, each containing a band-pass filter, a low-pass filter, decimators and a Hurst estimator; the detail sequences are labeled d(1,k), d(2,k) and d(3,k).]

Figure 2: An example of the pH M-dim-wavelets estimator with 3 decomposition stages.

The Daubechies wavelet filters (Daubechies, 1992) are used in the estimation of the pH vectors. The multi-resolution analysis (Vetterli and Kovacevic, 1995) adopted in the DWT of the Hurst estimator enables the detail and approximation coefficients to be easily computed by a simple discrete-time convolution. It is important to note that the computational complexity of the pyramidal algorithm used to obtain the DWT is linear, O(n), where n is the number of signal samples, while that of the FFT (fast Fourier transform), used to obtain the Mel-cepstral coefficients, is O(n log n).
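The three estimation steps listed above can be sketched in Python as follows. To keep the example self-contained, it uses an orthonormal Haar wavelet instead of the Daubechies filters adopted in the paper, and regression weights proportional to the number of coefficients per scale; both are assumptions of this sketch, as are the function names and the default numbers of scales.

```python
import numpy as np

def haar_details(x, n_scales):
    """Step 1: detail sequences d(j, k), j = 1..n_scales, from a Haar pyramid (O(n))."""
    approx = np.asarray(x, dtype=float)
    details = []
    for _ in range(n_scales):
        approx = approx[: len(approx) // 2 * 2]      # drop an odd trailing sample
        even, odd = approx[0::2], approx[1::2]
        details.append((even - odd) / np.sqrt(2.0))  # detail branch, decimated by 2
        approx = (even + odd) / np.sqrt(2.0)         # approximation branch, decimated by 2
    return details

def hurst_from_details(details):
    """Steps 2 and 3: regress log2 of the detail variance on the scale index j."""
    j = np.arange(1, len(details) + 1, dtype=float)
    n_j = np.array([len(d) for d in details], dtype=float)
    var_j = np.array([np.mean(d ** 2) for d in details])   # sigma_j^2
    y_j = np.log2(var_j)
    # np.polyfit squares the weights, so sqrt(n_j) weighs each squared residual by n_j.
    slope = np.polyfit(j, y_j, 1, w=np.sqrt(n_j))[0]
    return (1.0 + slope) / 2.0                              # H = (1 + a) / 2

def ph_vector(x, n_detail=3, n_scales=4):
    """pH feature: H_0 from the whole signal, then H_j from each detail sequence."""
    h = [hurst_from_details(haar_details(x, n_scales))]
    for d in haar_details(x, n_detail):
        h.append(hurst_from_details(haar_details(d, n_scales)))
    return np.array(h)
```

For white Gaussian noise the detail variance is the same at every scale, so the regression slope is close to zero and every estimated H is close to 0.5, which gives a simple sanity check for the estimator.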

2.2 α-GMM

The α-integrated GMM was proposed in (Wu et al.,

2009) as an extension of the conventional GMM for

speaker classiﬁcation. The authors were motivated by

the fact that human brains must use complex ways of

information integration, such as the α-integration, and

not only the linear combination.

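Given a set of Gaussian densities $b_i(\vec{x})$ and corresponding weights $w_i$, $i = 1, \ldots, M$, the α-GMM is defined as the α-integration of the densities (Wu et al., 2009):

$$p(\vec{x} \mid \lambda_S) = c\, f_\alpha^{-1}\left( \sum_{i=1}^{M} w_i\, f_\alpha[b_i(\vec{x})] \right), \tag{6}$$

where

$$f_\alpha[b_i(\vec{x})] = \begin{cases} \dfrac{2}{1-\alpha}\, b_i(\vec{x})^{(1-\alpha)/2}, & \alpha \neq 1 \\ \log[b_i(\vec{x})], & \alpha = 1 \end{cases}, \tag{7}$$

$$f_\alpha^{-1}(y) = \begin{cases} \left( \dfrac{1-\alpha}{2}\, y \right)^{2/(1-\alpha)}, & \alpha \neq 1 \\ \exp(y), & \alpha = 1 \end{cases}, \tag{8}$$

and $c$ is a normalization constant.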

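Note that, for $\alpha \neq 1$, (6) can be rewritten as

$$p(\vec{x} \mid \lambda_S) = c \left[ \sum_{i=1}^{M} w_i\, b_i(\vec{x})^{(1-\alpha)/2} \right]^{2/(1-\alpha)}. \tag{9}$$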

As in the regular GMM, the α-GMM of each speaker S is completely parametrized by the mean vectors ($\vec{\mu}_i$), covariance matrices ($K_i$) and weights ($w_i$) of the Gaussian densities:

$$\lambda_S = \{ w_i, \vec{\mu}_i, K_i \mid i = 1, \ldots, M \}. \tag{10}$$

Let $\Phi_S$ denote the training speech segment of speaker S, and X the extracted feature matrix composed of feature vectors $\vec{x}_t$, $t = 1, \ldots, Q$. The parameters of $\lambda_S$ are estimated using the adapted expectation-maximization (EM) algorithm (Wu, 2009) so as to maximize the likelihood function

$$p(X \mid \lambda_S) = \prod_{t=1}^{Q} p(\vec{x}_t \mid \lambda_S). \tag{11}$$

It can be noticed from (9) that the GMM is a

particular case of the α-GMM, which corresponds to

α = −1. By choosing values of α smaller than -1, the

α-GMM classiﬁer emphasizes the larger probability

values, and de-emphasizes the smaller ones. The idea

of this work is to use this property to compensate the

training and testing mismatch caused by environmen-

tal acoustic noises.
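To make the role of α concrete, the following sketch evaluates the α-integrated mixture in the form (Σ w_i b_i^((1−α)/2))^(2/(1−α)), which is what (6) to (8) reduce to for α ≠ 1. It assumes diagonal covariance matrices and sets the normalization constant c to 1; both simplifications, and the function names, are assumptions of this example rather than details given in the text.

```python
import numpy as np

def gaussian_diag(x, mean, var):
    """Gaussian density with diagonal covariance (an assumption of this sketch)."""
    x, mean, var = (np.asarray(a, dtype=float) for a in (x, mean, var))
    norm = np.prod(2.0 * np.pi * var) ** -0.5
    return norm * np.exp(-0.5 * np.sum((x - mean) ** 2 / var))

def alpha_gmm_density(x, weights, means, variances, alpha=-1.0):
    """Unnormalized alpha-GMM value (c = 1) for a single feature vector x."""
    b = np.array([gaussian_diag(x, m, v) for m, v in zip(means, variances)])
    w = np.asarray(weights, dtype=float)
    if alpha == 1.0:
        # alpha = 1 branch of Eqs. (7) and (8): weighted geometric mean.
        return float(np.exp(np.sum(w * np.log(b))))
    p = (1.0 - alpha) / 2.0
    return float(np.sum(w * b ** p) ** (1.0 / p))
```

With α = −1 the exponent p equals 1 and the function returns the ordinary GMM value. For α < −1 the exponent is larger than 1, so the result is a weighted power mean that lies between the GMM value and the largest component density, which is the emphasis on larger probability values mentioned above.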

2.3 Multicondition Training based on

Colored Noises

This Section presents the colored-noise-based multicondition training technique adopted in this work for the speaker verification task. As introduced in (Zão and Coelho, 2011), artificial noises are generated with Gaussian distribution and power spectral densities (PSD) characterized by the shape $S(f) \propto 1/f^{\beta}$, with $\beta \in [0, 2]$. The PSD shapes are obtained by filtering a Gaussian white noise sequence using the Al-Alaoui (Al-Alaoui, 1993) transfer function.
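A simple way to experiment with noises of this kind is sketched below. Note that it does not implement the Al-Alaoui transfer function used in the paper: it swaps in direct spectral shaping of a white Gaussian sequence in the FFT domain, which also produces a PSD proportional to 1/f^β. The function name `colored_noise` and the unit-variance normalization are assumptions of this example.

```python
import numpy as np

def colored_noise(n_samples, beta, rng=None):
    """Gaussian noise with PSD ~ 1/f^beta via FFT shaping (not the Al-Alaoui filter)."""
    rng = np.random.default_rng() if rng is None else rng
    white = rng.standard_normal(n_samples)
    spectrum = np.fft.rfft(white)
    freqs = np.fft.rfftfreq(n_samples)
    shaping = np.ones_like(freqs)
    # Amplitude ~ f^(-beta/2) gives power ~ 1/f^beta.
    shaping[1:] = freqs[1:] ** (-beta / 2.0)
    shaping[0] = 0.0                          # remove the DC term
    noise = np.fft.irfft(spectrum * shaping, n=n_samples)
    return noise / np.std(noise)              # normalize to unit variance

# beta = 0 gives white, beta = 1 pink and beta = 2 brown noise.
```

Since the shaping is a linear operation on a Gaussian sequence, the output remains Gaussian; only its spectral shape changes.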


Table 1: EER (%) obtained from speaker verification tests with MFCC feature vectors and the GMM classifier.

| Noise | SNR 20 dB | SNR 15 dB | SNR 10 dB | SNR 5 dB | Average |
|---|---|---|---|---|---|
| Clean | - | - | - | - | 1.48 |
| Babble | 2.85 | 5.06 | 11.20 | 25.00 | 11.03 |
| Destroyer | 4.84 | 12.14 | 23.70 | 37.16 | 19.46 |
| Factory | 5.04 | 10.13 | 19.94 | 30.98 | 16.52 |
| Leopard | 4.43 | 8.35 | 14.92 | 23.92 | 12.91 |
| Volvo | 4.60 | 7.40 | 13.26 | 20.51 | 11.44 |
| Average | 4.35 | 8.62 | 16.60 | 27.51 | 14.27 |

For each speaker S, multiple copies of the clean training utterance $\Phi_S$ are corrupted by the artificial colored noises, resulting in multicondition data sets $\Phi_{S,l}$ ($l = 1, \ldots, m$). Following the procedure addressed in Section 2.2, $m$ α-GMM ($\lambda_{S,l}$) for speaker S are obtained from the corrupted data sets $\Phi_{S,l}$. In analogy to (10), each of these models is parametrized by

$$\lambda_{S,l} = \{ w_{i,l}, \vec{\mu}_{i,l}, K_{i,l} \mid i = 1, \ldots, M \}, \quad l = 1, \ldots, m. \tag{12}$$

The colored multicondition training model ($\Lambda_S$) of speaker S is given by the collection of all the parameters estimated in (12), i.e.,

$$\Lambda_S = \{ w_{i,l}, \vec{\mu}_{i,l}, K_{i,l} \mid l = 1, \ldots, m;\; i = 1, \ldots, M \}. \tag{13}$$

In order to adapt the Colored-MT to the α-GMM classifier, the probability $p(\vec{x} \mid \lambda_S)$ is adjusted to follow the α-integration of all $m \times M$ Gaussian densities:

$$p(\vec{x} \mid \Lambda_S) = c' \left[ \sum_{l=1}^{m} \sum_{i=1}^{M} w_{i,l}\, b_{i,l}(\vec{x})^{(1-\alpha)/2} \right]^{2/(1-\alpha)}, \tag{14}$$

where $c'$ is a new normalization constant.
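As a quick consistency check (a worked special case, not taken from the source), consider the α-integration of the $m \times M$ densities for α = −1. Here $w_{i,l}$ and $b_{i,l}$ are assumed to denote the weight and Gaussian density of component $i$ in the model trained on the $l$-th corrupted data set. Both exponents become one, so the Colored-MT score reduces to a conventional mixture with $m \times M$ components:

```latex
\alpha = -1 \;\Rightarrow\; \frac{1-\alpha}{2} = \frac{2}{1-\alpha} = 1,
\qquad
p(\vec{x} \mid \Lambda_S) = c' \sum_{l=1}^{m} \sum_{i=1}^{M} w_{i,l}\, b_{i,l}(\vec{x}).
```

If the weights of each of the $m$ models sum to one, choosing $c' = 1/m$ makes this special case integrate to one, i.e., it is the equally weighted average of the $m$ condition-specific GMMs.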

3 EXPERIMENTS AND RESULTS

The speaker veriﬁcation experiments are conducted

with a subset composed of 168 speakers (106 males

and 62 females) of the TIMIT database (Fisher et al.,

1986). The speech database is composed of ten utter-

ances per speaker, with sampling rate of 16 kHz and

average duration of 3 seconds. The speech segments

of ten speakers (5 males and 5 females) are concate-

nated to obtain the UBM. From each of the 158 re-

maining speakers, eight utterances are separated to

train the models, and the other two are used for tests.

Five environmental acoustic noises (Babble, De-

stroyer, Factory, Leopard and Volvo), collected from

NOISEX-92 database (Varga and Steeneken, 1993),

are used to corrupt the test speech utterances. The tests are carried out with SNR values of 5, 10, 15 and 20 dB, and also with clean speech.
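Corrupting an utterance at a prescribed SNR amounts to scaling the noise before adding it. The sketch below shows one common way to do this; the function name `add_noise_at_snr` and the use of average power over the whole utterance are assumptions of this example, since the text does not detail how the SNR is measured.

```python
import numpy as np

def add_noise_at_snr(speech, noise, snr_db):
    """Scale `noise` so that the speech-to-noise power ratio equals snr_db, then add it."""
    speech = np.asarray(speech, dtype=float)
    noise = np.asarray(noise, dtype=float)[: len(speech)]  # trim noise to speech length
    if len(noise) < len(speech):
        raise ValueError("noise must be at least as long as speech")
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2)
    # Solve 10*log10(p_speech / (gain^2 * p_noise)) = snr_db for gain.
    gain = np.sqrt(p_speech / (p_noise * 10.0 ** (snr_db / 10.0)))
    return speech + gain * noise
```

For example, calling the function with `snr_db=15` on a training utterance and a colored noise sequence would produce one corrupted copy of the kind used in the multicondition training of Section 2.3.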

Table 2: EER (%) obtained from speaker verification tests with MFCC + pH feature vectors and the GMM classifier.

| Noise | SNR 20 dB | SNR 15 dB | SNR 10 dB | SNR 5 dB | Average |
|---|---|---|---|---|---|
| Clean | - | - | - | - | 1.31 |
| Babble | 2.85 | 4.97 | 11.53 | 23.55 | 10.72 |
| Destroyer | 4.75 | 11.17 | 22.73 | 35.76 | 18.60 |
| Factory | 3.91 | 7.38 | 13.92 | 25.63 | 12.71 |
| Leopard | 4.11 | 7.09 | 14.44 | 22.45 | 12.02 |
| Volvo | 3.16 | 5.78 | 9.49 | 16.14 | 8.65 |
| Average | 3.76 | 7.28 | 14.42 | 24.71 | 12.54 |

Two sets of experiments are presented in this

work. In the ﬁrst one, the speaker veriﬁcation task

is evaluated with the α-GMM classiﬁers considering

the MFCC and the fusion of MFCC and pH as speech

feature vectors. All the α-GMM are obtained with

32 Gaussian densities. The conventional GMM is a

particular case of the α-GMM classiﬁer (α = −1).

The second set of experiments is conducted with

the MFCC + pH features fusion combined with the

Colored-MT technique and the α-GMM classiﬁer.

3.1 MFCC and pH Fusion

The MFCC feature matrix is composed of 12-dimensional vectors, obtained from frames of 20 ms with 50% frame overlap. A Mel-scale filterbank composed of 26 filters and a pre-emphasis factor of 0.97 are adopted. The pH vectors are estimated from three consecutive speech frames using Daubechies wavelet filters (Daubechies, 1992) with 12 coefficients, using a scale range from 2 to 8. A total of J = 8 decomposition scales are considered to obtain the $H_j$ values. Including the value of $H_0$ estimated from the original speech signal, 9-dimensional pH vectors are extracted to compose the feature matrices. Thus, in the experiments with the MFCC + pH fusion, the speech feature vectors have 21 components.

3.1.1 GMM

Tabs. 1 and 2 show the EER results obtained from the speaker verification experiments considering the GMM with single MFCC and MFCC + pH feature vectors, respectively. Compared to single MFCC, the MFCC + pH fusion achieves better accuracy, i.e., a lower average EER, for all five noise sources and also for clean speech. The largest gain from the pH feature is an absolute EER reduction of 6.02 percentage points, for test utterances corrupted by the Factory noise with SNR of 10 dB. The average EER over all five noises is reduced from 14.27% to 12.54%, an absolute improvement of 1.73 percentage points. Fig. 3 illustrates the DET curves obtained with the MFCC + pH fusion (blue lines) and with the single MFCC (red lines) for the Leopard noise and SNR values of 20, 15 and 10 dB.

[Figure 3: DET plot; axes: False Acceptance Rate (%) versus False Rejection Rate (%); curves for SNR = 20 dB, SNR = 15 dB and SNR = 10 dB.]

Figure 3: The DET curves obtained with 12 MFCC (red lines) and 12 MFCC + 9 pH (blue lines) with the GMM classifier for test speech signals corrupted by the Leopard noise with SNR of 20, 15 and 10 dB.

Table 3: EER (%) obtained from speaker verification tests with the α-GMM classifier for different values of α.

| Noise | SNR | MFCC, α = −4 | MFCC, α = −6 | MFCC, α = −8 | MFCC + pH, α = −4 | MFCC + pH, α = −6 | MFCC + pH, α = −8 |
|---|---|---|---|---|---|---|---|
| Babble | 20 dB | 2.97 | 3.52 | 2.54 | 3.05 | 3.00 | 3.48 |
| Babble | 15 dB | 4.94 | 5.06 | 4.55 | 5.06 | 5.06 | 5.35 |
| Babble | 10 dB | 12.03 | 12.12 | 11.08 | 11.08 | 11.39 | 12.44 |
| Babble | 5 dB | 26.58 | 25.00 | 24.37 | 22.47 | 24.03 | 24.07 |
| Babble | Average | 11.63 | 11.43 | 10.63 | 10.41 | 10.87 | 11.34 |
| Destroyer | 20 dB | 5.29 | 5.38 | 4.65 | 5.25 | 4.72 | 5.45 |
| Destroyer | 15 dB | 12.08 | 11.70 | 11.71 | 12.34 | 10.30 | 12.34 |
| Destroyer | 10 dB | 23.42 | 22.54 | 22.15 | 22.47 | 23.56 | 23.32 |
| Destroyer | 5 dB | 34.81 | 34.49 | 34.72 | 35.76 | 38.03 | 37.44 |
| Destroyer | Average | 18.90 | 18.53 | 18.31 | 18.96 | 19.15 | 19.64 |
| Factory | 20 dB | 5.06 | 5.18 | 5.06 | 4.18 | 4.04 | 4.98 |
| Factory | 15 dB | 10.50 | 10.13 | 10.35 | 8.22 | 7.79 | 8.49 |
| Factory | 10 dB | 20.57 | 19.30 | 19.94 | 15.05 | 15.19 | 15.82 |
| Factory | 5 dB | 30.66 | 30.35 | 29.11 | 25.26 | 26.27 | 25.85 |
| Factory | Average | 16.70 | 16.24 | 16.12 | 13.18 | 13.32 | 13.79 |
| Leopard | 20 dB | 4.71 | 4.75 | 4.36 | 4.11 | 4.75 | 4.84 |
| Leopard | 15 dB | 9.81 | 8.93 | 8.82 | 7.41 | 7.59 | 8.23 |
| Leopard | 10 dB | 17.24 | 17.12 | 16.46 | 14.45 | 14.24 | 14.56 |
| Leopard | 5 dB | 25.85 | 24.68 | 25.58 | 21.20 | 22.15 | 23.10 |
| Leopard | Average | 14.40 | 13.87 | 13.80 | 11.79 | 12.18 | 12.68 |
| Volvo | 20 dB | 4.75 | 4.75 | 4.53 | 3.80 | 3.16 | 3.85 |
| Volvo | 15 dB | 8.21 | 8.10 | 9.28 | 5.70 | 6.09 | 6.75 |
| Volvo | 10 dB | 13.93 | 13.24 | 15.11 | 10.24 | 10.38 | 10.79 |
| Volvo | 5 dB | 20.25 | 20.57 | 21.20 | 16.77 | 16.38 | 16.77 |
| Volvo | Average | 11.79 | 11.91 | 12.53 | 9.13 | 9.01 | 9.54 |
| Average | | 14.68 | 14.40 | 14.28 | 12.69 | 12.91 | 13.40 |

3.1.2 α-GMM

This Section presents the results obtained with the single MFCC and MFCC + pH feature vectors considering the α-GMM with α = −4, −6 and −8.

Tab. 3 shows the EER values obtained with the testing speech utterances corrupted by the five acoustic noises. It can be seen that the best overall average accuracy (12.69%) was achieved with α = −4 and the MFCC + pH fusion. With α = −4, the fusion lowers the average EER relative to the single MFCC for all acoustic noises except Destroyer. The best average EER improvement, 3.52 percentage points, was achieved for the Factory noise. It is important to notice that the α-GMM-based system does not outperform the conventional GMM (α = −1) approach (refer to Tabs. 1 and 2).

Fig. 4 illustrates the DET curves for the Fac-

tory noise with SNR of 15 dB obtained for the GMM

(α = −1) and α-GMM classiﬁers. The dashed (bot-

tom) lines indicate the operating points obtained with

the fusion of MFCC + pH features, while the continu-

ous (top) lines are related to the single MFCC feature.

Note that, considering each set of speech features, the

GMM-based systems (red curves) achieve better per-

formance than those based on the α-GMM classiﬁer.


[Figure 4: DET plot; axes: False Acceptance Rate (%) versus False Rejection Rate (%); curves for α = −1, α = −4, α = −6 and α = −8, grouped as MFCC + pH and MFCC.]

Figure 4: The DET curves obtained with 12 MFCC (continuous lines) and 12 MFCC + 9 pH (dashed lines) with the α-GMM classifier for test speech corrupted with Factory noise and SNR of 15 dB.

3.2 Colored-MT Technique

Following the procedure defined in (Zão and Coelho,

2011), three artiﬁcial noises are generated for the

Colored-MT technique, with colored spectra deﬁned

by the PSD decaying rate: β = 0 (white), β = 1 (pink)

and β = 2 (brown). These noises are used to corrupt

all the speech segments available for training with

SNR of 15 dB, including the UBM. The MFCC + pH

feature matrices, extracted from each of the corrupted

training utterances, are used to obtain the α-GMM.

Thus, a total of 3 × 32 = 96 Gaussian densities are

stored for each speaker. Tab. 4 presents the EER re-

sults obtained in the experiments with the Colored-

MT technique with α-GMM classiﬁer. The results are

presented considering the values of α: -1, -4, -6 and

-8.

The use of GMM with the Colored-MT leads to an average EER of 6.96% (Tab. 4). This is an absolute improvement of 5.58 percentage points in the EER when compared to the result with the MFCC + pH vectors and the GMM without multicondition training (12.54%, Tab. 2). It can also be observed that, with the Colored-MT, the α-GMM classifier with α = −6 achieves the best overall average EER (6.66%) over the five noise sources, although the best value of α varies from noise to noise.

Table 4: EER (%) of speaker verification experiments with MFCC + pH features with the Colored-MT technique and the α-GMM classifier.

| Noise | SNR | α = −1 | α = −4 | α = −6 | α = −8 |
|---|---|---|---|---|---|
| Babble | 20 dB | 3.80 | 2.85 | 3.48 | 3.03 |
| Babble | 15 dB | 4.11 | 3.82 | 3.98 | 3.48 |
| Babble | 10 dB | 6.52 | 6.33 | 5.92 | 6.27 |
| Babble | 5 dB | 12.34 | 12.34 | 12.28 | 12.97 |
| Babble | Average | 6.69 | 6.33 | 6.41 | 6.44 |
| Destroyer | 20 dB | 6.95 | 6.65 | 6.01 | 6.26 |
| Destroyer | 15 dB | 11.25 | 11.08 | 10.37 | 10.44 |
| Destroyer | 10 dB | 19.43 | 18.35 | 18.04 | 17.72 |
| Destroyer | 5 dB | 30.38 | 30.91 | 29.69 | 28.39 |
| Destroyer | Average | 17.00 | 16.75 | 16.03 | 15.70 |
| Factory | 20 dB | 1.58 | 1.58 | 1.90 | 1.75 |
| Factory | 15 dB | 1.58 | 2.17 | 1.74 | 1.77 |
| Factory | 10 dB | 3.16 | 3.34 | 2.95 | 3.41 |
| Factory | 5 dB | 7.02 | 6.96 | 6.96 | 6.88 |
| Factory | Average | 3.34 | 3.51 | 3.39 | 3.45 |
| Leopard | 20 dB | 2.41 | 2.45 | 2.25 | 2.85 |
| Leopard | 15 dB | 2.90 | 3.28 | 2.99 | 3.14 |
| Leopard | 10 dB | 5.56 | 6.52 | 5.44 | 6.01 |
| Leopard | 5 dB | 12.26 | 13.61 | 11.70 | 13.29 |
| Leopard | Average | 5.78 | 6.46 | 5.59 | 6.32 |
| Volvo | 20 dB | 1.82 | 1.56 | 1.58 | 1.90 |
| Volvo | 15 dB | 1.36 | 1.26 | 1.54 | 2.14 |
| Volvo | 10 dB | 1.58 | 1.77 | 1.58 | 1.90 |
| Volvo | 5 dB | 3.16 | 2.85 | 2.85 | 3.16 |
| Volvo | Average | 1.98 | 1.86 | 1.89 | 2.28 |
| Average | | 6.96 | 6.98 | 6.66 | 6.84 |

4 CONCLUSIONS

This paper examined the use of the fusion of the MFCC and pH speech features and the colored-noise-based multicondition training technique for noise robust speaker verification. The GMM and α-GMM were considered for the speaker and intruder modeling. The experiments were conducted with a subset of the TIMIT database corrupted with five acoustic noises from NOISEX-92, with different values of SNR. The results showed that the MFCC + pH vectors and the α-GMM under multicondition training achieved the best improvement for the speaker verification task in noisy environments.

ACKNOWLEDGEMENTS

This work was partially supported by the Univer-

sal/CNPq (472461/2009-5) research grant.

REFERENCES

Al-Alaoui, M. (1993). Novel digital integrator and differ-

entiator. Electronics Letters, 29(4):376–378.

Bimbot, F., Bonastre, J.-F., Fredouille, C., Gravier, G., Magrin-Chagnolleau, I., Meignier, S., Merlin, T., Ortega-García, J., Petrovska-Delacrétaz, D., and Reynolds, D. A. (2004). A Tutorial on Text-Independent Speaker Verification. EURASIP Journal on Applied Signal Processing, 4:430–451.


Boll, S. (1979). Suppression of Acoustic Noise in Speech

Using Spectral Subtraction. IEEE Transactions on

Acoustics, Speech and Signal Processing, 27:113–

120.

Campbell, J., Shen, W., Campbell, W., Schwartz, R., Bonas-

tre, J.-F., and Matrouf, D. (2009). Forensic Speaker

Recognition. IEEE Signal Processing Magazine,

26:95–103.

Cooke, M., Green, P., Josifovski, L., and Vizinho, A.

(2001). Robust Automatic Speech Recognition with

Missing and Unreliable Acoustic Data. Speech Com-

munication, 34:267–285.

Daubechies, I. (1992). Ten lectures on wavelets. Society

for Industrial and Applied Mathematics, Philadelphia,

USA.

Davis, S. and Mermelstein, P. (1980). Comparison of para-

metric representations for monosyllabic word recog-

nition in continuously spoken sentences. IEEE Trans-

actions on Acoustics, Speech and Signal Processing,

28(4):357–366.

Fisher, W. M., Doddington, G. R., and Goudie-Marshall,

K. M. (1986). The DARPA Speech Recognition Re-

search Database: Speciﬁcations and Status. Pro-

ceedings of DARPA Workshop on Speech Recognition,

pages 93–99.

Fukunaga, K. (1990). Introduction to Statistical Pattern

Recognition (2nd ed.). Academic Press Professional,

Inc., San Diego, CA, USA.

Ming, J., Hazen, T., Glass, J., and Reynolds, D. (2007). Ro-

bust speaker recognition in noisy conditions. IEEE

Transactions on Audio, Speech, and Language Pro-

cessing, 15(5):1711–1723.

Naik, J. (1990). Speaker Veriﬁcation: A Tutorial. IEEE

Communications Magazine, pages 42–48.

Reynolds, D. and Rose, R. (1995). Robust text independent

speaker identiﬁcation using gaussian mixture speaker

models. IEEE Trans. on Speech and Audio Process-

ing, 3:72–82.

Reynolds, D. A. (1995). Speaker identiﬁcation and veriﬁca-

tion using gaussian mixture speaker models. Speech

Communication, 17:91–108.

Sant’Ana, R., Coelho, R., and Alcaim, A. (2006). Text-

Independent Speaker Recognition Based on the Hurst

Parameter and the Multidimensional Fractional Brow-

nian Motion Model. IEEE Transactions on Audio,

Speech and Language Processing, 14(3):931–940.

Varga, A. and Steeneken, H. (1993). Assessment for au-

tomatic speech recognition ii: Noisex-92: a database

and an experiment to study the effect of additive noise

on speech recognition systems. Speech Communica-

tions, 12(3):247–251.

Veitch, D. and Abry, P. (1999). A wavelet-based joint es-

timator of the parameters of long-range dependence.

IEEE Transactions on Information Theory, 45(3):878

–897.

Vetterli, M. and Kovacevic, J. (1995). Wavelets and sub-

band coding. Englewood Cliffs: Prentice-Hall.

Wu, D. (2009). Parameter Estimation for α-GMM Based on

Maximum Likelihood Criterion. Neural Computation,

21(6):1776–1795.

Wu, D., Li, J., and Wu, H. (2009). α-Gaussian Mixture

Modelling for Speaker Recognition. Pattern Recogni-

tion Letters, 30(6):589–594.

Zão, L. and Coelho, R. (2011). Colored noise based multicondition training technique for robust speaker identification. IEEE Signal Processing Letters, 18(11):675–678.
