where the $w_i$ are weights for different features. As
mentioned in the feature extraction section, we
extract 15 features for each frame and 76 frames for
each utterance. Hence we have a feature vector of
length 1140. To weigh features, we notice that
among 15 coefficients extracted from each frame,
the lower ones are more important than the higher
coefficients, hence we must assign bigger weights to
them. To accomplish this, we assign weights to each
feature according to its index in the feature vector
modulo 15. For a linear weighting we have
$$w_i = (i - 1) \bmod 15 \qquad (10)$$
And for exponentially increasing weight, we have
$$w_i = e^{(i - 1) \bmod 15} \qquad (11)$$
We experimented with both weighting schemes and
concluded that linear weighting outperforms the
exponential scheme.
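The index-modulo-15 weighting described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes 1-based feature indices, and it assumes the linear weight is read as (i - 1) mod 15 and the exponential weight as exp((i - 1) mod 15). The function names and the two profiles are hypothetical; any other 15-entry profile can be passed in the same way.

```python
import math

N_COEFFS = 15  # coefficients per frame (from the text)
N_FRAMES = 76  # frames per utterance (from the text)


def position_in_frame(i):
    """Return the 0-based coefficient position of the 1-based feature index i."""
    return (i - 1) % N_COEFFS


def tile_weights(profile):
    """Repeat a 15-entry per-coefficient weight profile across all 76 frames."""
    if len(profile) != N_COEFFS:
        raise ValueError("profile must have one weight per coefficient")
    total = N_COEFFS * N_FRAMES
    return [profile[position_in_frame(i)] for i in range(1, total + 1)]


# Assumed profiles: one reading of the linear and exponential schemes.
linear_profile = [float(k) for k in range(N_COEFFS)]
exponential_profile = [math.exp(k) for k in range(N_COEFFS)]

linear_weights = tile_weights(linear_profile)
exponential_weights = tile_weights(exponential_profile)
```

Because the weight depends only on the position within a frame, the same 15 values repeat every 15 entries of the 1140-dimensional vector.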
3.2 Experimental Setup
The database used in this paper is the set of
digits of TIMIT. TIMIT contains broadband
recordings of 630 speakers from eight major dialect
regions of American English. We use 2700 digits (0
to 9) from this database. Figure 4 shows a
two-dimensional visualization of the TIMIT dataset
obtained with Principal Component Analysis.
The MFCC and the MFDWC are used to extract the
feature vectors of the spoken digits. Every
frame of each digit has 15 features. First, the MFCC
feature vectors are used to train and test the
HMM-based and SVDD-based digit recognizers. Then the
MFDWC feature vectors are used. The MFDWCs
consist of 15 coefficients obtained by the DWT with
scales of 1, 2, 4 and 8. With this arrangement of
coefficients, the lower coefficients are more
important than the higher ones. We therefore
evaluate feature vectors built from all 15
coefficients per frame and from only the 5 lowest
coefficients per frame. With 15 coefficients, each
digit is represented by a 1140-dimensional feature
vector (76 frames of 15 coefficients); with 5
coefficients, the dimension of the feature vector
is 380 (76 frames of 5 coefficients).
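The dimensions quoted above follow from concatenating the per-frame coefficients of all 76 frames. The sketch below illustrates this under the assumption of frame-major ordering (all coefficients of frame 1, then frame 2, and so on); the function name and the placeholder data are hypothetical and stand in for real MFCC or MFDWC values.

```python
N_FRAMES = 76  # frames per utterance (from the text)
N_COEFFS = 15  # coefficients per frame (from the text)


def flatten_utterance(frames, keep=N_COEFFS):
    """Concatenate the first `keep` coefficients of every frame into one vector."""
    if len(frames) != N_FRAMES:
        raise ValueError("expected 76 frames per utterance")
    vector = []
    for frame in frames:
        if len(frame) != N_COEFFS:
            raise ValueError("expected 15 coefficients per frame")
        vector.extend(frame[:keep])
    return vector


# Placeholder data: entry (f, c) holds f * 15 + c so positions are easy to check.
frames = [[float(f * N_COEFFS + c) for c in range(N_COEFFS)]
          for f in range(N_FRAMES)]

full_vector = flatten_utterance(frames)             # all 15 coefficients
reduced_vector = flatten_utterance(frames, keep=5)  # 5 lowest coefficients

print(len(full_vector), len(reduced_vector))  # 1140 380
```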
In SVM classification the "one-against-all"
approach is used, so 10 classes of digits are
obtained. The Weighted Polynomial and
Exponentially Weighted Polynomial kernel
functions are used with both the first 5 and all 15
coefficients as the feature vectors. The Simple
Polynomial and Gaussian kernel functions are used
with all 15 coefficients as the feature vector.
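The "one-against-all" arrangement mentioned above can be illustrated with a short sketch. It only shows how the 10 per-digit problems are formed and how one digit might be chosen at test time; choosing the digit with the highest classifier score is an assumption, since the text does not state the decision rule, and the function names and example scores are hypothetical.

```python
def one_vs_all_targets(labels, positive_class):
    """Binary targets: +1 for the chosen digit, -1 for every other digit."""
    return [1 if y == positive_class else -1 for y in labels]


def predict_digit(scores):
    """Pick the digit whose classifier returns the highest score.

    `scores` maps each digit (0 to 9) to the real-valued output of that
    digit's classifier. The argmax rule is an assumption.
    """
    return max(scores, key=scores.get)


# Toy labels for six utterances and the binary problem for digit 3.
labels = [0, 3, 3, 7, 9, 0]
targets_for_3 = one_vs_all_targets(labels, 3)

# Hypothetical classifier outputs for one test utterance.
scores = {digit: -1.0 for digit in range(10)}
scores[7] = 0.4
predicted = predict_digit(scores)
```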
The digit recognition systems are also tested on
noisy speech, obtained by adding white noise to the
speech signal at SNR = 5 dB. This
kind of noise can severely deteriorate
the performance of speech recognition. The accuracy
rates of HMM-based digit recognition are shown in
Table 1. Both the HMM-based and the SVDD-based
digit recognizers perform better with MFDWC than
with MFCC feature vectors. We attribute this to the
localization and multi-resolution characteristics of
the Wavelet Transform (WT). Table 2 shows the
accuracy rates of SVDD-based digit recognition
using MFDWCs, separately for each class of digits
(zero to nine). The accuracy rates show that the
Weighted Polynomial kernel functions are better
than the other kernel functions. When the Weighted
Polynomial kernel functions are used, appropriate
coefficients can be applied to each feature
(Equation 9). Comparing the 5-dimensional
and 15-dimensional feature vectors,
we infer that the 5-dimensional feature
vectors allow more efficient learning, because their
time and space complexity is much lower than that of
the 15-dimensional feature vectors.
Table 3 presents the accuracy rates of SVDD-based
digit recognition using MFCC feature vectors.
Comparing Table 1 with Tables 2 and 3, we
infer that HMM-based digit recognition is
better than SVDD-based recognition on the MFCC feature
vectors, but that the SVDD can compete with the HMM
classifier in speech recognition on the MFDWC
feature vectors.
Figure 4: A 2-D visualization of TIMIT dataset using
PCA.
Comparing Table 1 with Table 4, which shows the
accuracy rates of SVDD-based digit recognition
on noisy (SNR = 5 dB) test data using the MFDWC
feature vectors, we observe that the SVDD
handles noisy environments in speech
recognition better than the HMM classifier.
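The noisy test condition used in these experiments (speech signal plus white noise at SNR = 5 dB) can be sketched as follows. This is a minimal illustration under stated assumptions: the noise is Gaussian white noise, SNR is the ratio of average signal power to average noise power over the whole utterance, and a sine wave stands in for a real speech signal. The function names are hypothetical.

```python
import math
import random


def add_white_noise(signal, snr_db, rng):
    """Add Gaussian white noise scaled so the result has the requested SNR in dB."""
    signal_power = sum(s * s for s in signal) / len(signal)
    noise = [rng.gauss(0.0, 1.0) for _ in signal]
    noise_power = sum(n * n for n in noise) / len(noise)
    # SNR_dB = 10 * log10(signal_power / noise_power), solved for the noise power.
    target_noise_power = signal_power / (10.0 ** (snr_db / 10.0))
    scale = math.sqrt(target_noise_power / noise_power)
    return [s + scale * n for s, n in zip(signal, noise)]


def measured_snr_db(clean, noisy):
    """Compute the SNR in dB between a clean signal and its noisy version."""
    signal_power = sum(s * s for s in clean)
    noise_power = sum((y - s) ** 2 for s, y in zip(clean, noisy))
    return 10.0 * math.log10(signal_power / noise_power)


# Placeholder "speech": a 440 Hz sine sampled at 16 kHz (assumed values).
clean = [math.sin(2.0 * math.pi * 440.0 * t / 16000.0) for t in range(1600)]
noisy = add_white_noise(clean, snr_db=5.0, rng=random.Random(0))
```

Scaling the noise from its measured power, rather than from its nominal variance, makes the SNR of each generated utterance match the target up to floating-point error.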