score level fusion on different features.
The outline of the paper is as follows. In Section 2,
we describe the different feature vectors used in this
work. Section 3 gives the experimental protocol
adopted, and the results are presented in Section 4.
Finally, a conclusion is given in Section 5.
2 FEATURE EXTRACTION
OVERVIEW
The speech signal continuously changes due to
articulatory movements and therefore, the signal
must be analyzed within short frames of about 20–
30 ms duration. Within this interval, the signal is
assumed to remain stationary and a spectral feature
vector is provided for each frame.
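As an illustration of this short-time analysis, the framing step can be sketched as follows. The 25 ms frame length, 10 ms shift, and function name are illustrative assumptions, not values prescribed by the paper:

```python
import numpy as np

def frame_signal(x, fs, frame_ms=25.0, hop_ms=10.0):
    """Split a 1-D signal x into overlapping short-time frames.

    frame_ms and hop_ms are example choices within the ranges
    usually used for speech (20-30 ms frames, 10-15 ms shift).
    """
    frame_len = int(round(fs * frame_ms / 1000.0))
    hop_len = int(round(fs * hop_ms / 1000.0))
    n_frames = 1 + max(0, (len(x) - frame_len) // hop_len)
    frames = np.stack([x[i * hop_len: i * hop_len + frame_len]
                       for i in range(n_frames)])
    return frames  # shape: (n_frames, frame_len)

# usage: 1 s of a synthetic 440 Hz tone sampled at 16 kHz
fs = 16000
x = np.sin(2 * np.pi * 440 * np.arange(fs) / fs)
frames = frame_signal(x, fs)
print(frames.shape)  # (98, 400): 25 ms at 16 kHz = 400 samples per frame
```

Each frame is then treated as stationary, and one spectral feature vector is computed from it.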
2.1 Mel-Frequency Cepstral
Coefficients (MFCCs)
and Linear Frequency Cepstral
Coefficients (LFCCs)
The mel-frequency cepstral coefficients (MFCCs)
(Harris, 1978) were introduced in the early 1980s
for speech recognition applications and have since
also been adopted for speaker identification. A
frame of the speech signal is first extracted
through a window. Typically, two
parameters are important for the windowing
procedure: the duration of the window (typically in
the range 20–30 ms) and the shift between two
consecutive windows (typically 10–15 ms) (Harris,
1978). These values correspond to the average
duration over which the speech signal can be assumed
to be stationary, i.e., over which its statistical
and spectral properties do not change significantly.
The speech samples are then
weighted by a suitable windowing function, such as
the Hamming or Hanning window (Harris, 1978), both
of which are extensively used in speaker verification.
The weighting reduces artifacts (such as side lobes
and spectral leakage) caused by the finite duration
of the analysis window. The magnitude spectrum of
the speech sample is then computed using a fast
Fourier transform (FFT). For a discrete signal
{x[n]} with 0 ≤ n < N, where N is the number of
samples of an analysis window and $f_s$ is the
sampling frequency, the discrete Fourier transform
(DFT) is used and is given by the equation below:

$S(f) = \sum_{t=0}^{N-1} w(t)\, x(t)\, e^{-i 2\pi f t / N}$   (1)

where $i = \sqrt{-1}$ is the imaginary unit and
$f = 0, 1, \ldots, N-1$ denotes the discrete
frequency index. Here, $w = [w(0) \ldots w(N-1)]^{T}$
is a time-domain window function which usually is
symmetric and decreases towards the frame
boundaries. Then, $S(f)$ is processed by a bank of
band-pass filters. The filters that are generally used
in MFCC computation are triangular filters (Moore,
1995), and their center frequencies are chosen
according to a logarithmic frequency scale, also
known as the Mel-frequency scale. The filter bank is
then used
to transform the frequency bins to Mel-scale bins by
the following equation:

$m(b) = \sum_{f} w_b(f)\, |S(f)|^{2}$   (2)

where $w_b(f)$ is the $b$-th Mel-scale filter's
weight for the frequency $f$ and $S(f)$ is the FFT
of the
windowed speech signal. The rationale for choosing
a logarithmic frequency scale conforms to the
response observed in the human auditory system that
has been validated through several biophysical
experiments (Moore, 1995). The Mel-frequency
weighted magnitude spectrum is processed by a
compressive non-linearity (typically a logarithmic
function) which also models the observed response
in a human auditory system. The last step in MFCC
computation is a discrete cosine transform (DCT)
which is used to de-correlate the Mel-scale filter
outputs. A subset of the DCT coefficients is chosen
(typically the first and the last few coefficients
are ignored) to represent the MFCC features used in
the enrollment and the verification phases. The
Linear Frequency Cepstral Coefficients (LFCCs)
(Xing et al., 2009) are similar to MFCCs, with a
difference in the structure of the filter bank: in
the high-frequency region, the Mel filters are
replaced by a linear filter bank in order to capture
more spectral detail in this region.
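To make the pipeline described above concrete (windowing, power spectrum, triangular Mel filter bank, log compression, DCT), a minimal numpy-only sketch for a single frame is given below. The filter count, number of retained coefficients, and all function names are illustrative assumptions, not choices taken from this paper:

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, fs):
    """Triangular filters with centers equally spaced on the Mel scale."""
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(fs / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / fs).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for b in range(1, n_filters + 1):
        left, center, right = bins[b - 1], bins[b], bins[b + 1]
        for k in range(left, center):          # rising edge of triangle b
            fb[b - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):         # falling edge of triangle b
            fb[b - 1, k] = (right - k) / max(right - center, 1)
    return fb

def mfcc_frame(frame, fs, n_filters=26, n_ceps=13):
    """MFCCs for one frame: window -> |S(f)|^2 -> Mel filters -> log -> DCT."""
    n = len(frame)
    windowed = frame * np.hamming(n)              # taper w(t), as in Eq. (1)
    spec = np.abs(np.fft.rfft(windowed)) ** 2     # power spectrum |S(f)|^2
    mel_energies = mel_filterbank(n_filters, n, fs) @ spec   # m(b), Eq. (2)
    log_m = np.log(mel_energies + 1e-10)          # compressive non-linearity
    # DCT-II to de-correlate the Mel filter outputs
    b_idx = np.arange(n_filters)
    dct = np.cos(np.pi * np.outer(np.arange(n_ceps), b_idx + 0.5) / n_filters)
    return dct @ log_m                            # first n_ceps coefficients

# usage: one 25 ms frame of a 1 kHz tone at 16 kHz
fs = 16000
frame = np.sin(2 * np.pi * 1000 * np.arange(400) / fs)
coeffs = mfcc_frame(frame, fs)
print(coeffs.shape)  # (13,)
```

An LFCC variant would only change `mel_filterbank` so that the filter centers are spaced linearly in Hz rather than on the Mel scale.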
2.2 MFCCs based on Asymmetric
Tapers
Usually, speaker/speech recognition systems use
standard symmetric tapers, such as the Hamming and
Hanning windows, for short-time analysis of the
speech signal.
These tapers have a poor magnitude response under
mismatched conditions and a larger time delay
(Alam et al., 2012). One elegant technique for
reducing the time delay and enhancing the
magnitude response under noisy conditions is to
SIGMAP 2013 - International Conference on Signal Processing and Multimedia Applications