training set. In this section we focus on this training
set.
A word-based approach requires a very large
training set to achieve optimal speech
recognition. The size of this set depends on the
chosen vocabulary. An ideal recognition system uses
the whole vocabulary of the spoken language, which
may contain thousands of words. Most speech
recognition systems use small vocabularies of
tens of words, or medium-size vocabularies
of hundreds of words.
A large vocabulary size may represent a
disadvantage for word-based speech recognition
techniques. The vocabulary size equals the number
of classes, because each word corresponds to a class in
the recognition process. It is obvious that a
phoneme-based recognition system uses a much
smaller number of classes, because the number of
words of a language (tens of thousands) is much
greater than the number of phonemes of that
language (usually 30-40, depending on the language).
We consider creating a vocabulary that initially
contains only a few words and extending it over
time. Let N be our vocabulary size. For each word
of the vocabulary we consider a set of speakers,
each of them recording a spoken utterance of that
word.
Thus, we obtain a set of digital audio signals for
each word. All these recorded sounds represent the
prototypes of the system. For each $i$ we get a set of
signal prototypes $\{S_1^i, \ldots, S_{n_i}^i\}$, where
$i = \overline{1,N}$, $n_i$ is the number of speakers who utter the $i$th word,
and $S_j^i$ represents the audio signal of the spoken
word recorded by the $j$th speaker, $j = \overline{1,n_i}$. The
sequence $\{S_1^1, \ldots, S_{n_1}^1, \ldots, S_1^N, \ldots, S_{n_N}^N\}$ represents
the training set of our recognition system.
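As a minimal illustration (not from the paper), the prototype sets and their concatenation into a training set can be sketched as one list of recordings per vocabulary word; the word strings and file names below are hypothetical placeholders for real audio signals:

```python
# Hypothetical data: each string stands in for a recorded utterance S_j^i;
# a real system would store sampled audio arrays instead.
prototypes = {
    "yes": ["yes_speaker1.wav", "yes_speaker2.wav", "yes_speaker3.wav"],
    "no":  ["no_speaker1.wav", "no_speaker2.wav"],
}

N = len(prototypes)                                   # vocabulary size N
n = {w: len(sigs) for w, sigs in prototypes.items()}  # n_i per word

# The training set {S_1^1, ..., S_{n_N}^N} is the concatenation
# of all per-word prototype lists.
training_set = [s for sigs in prototypes.values() for s in sigs]

print(N)                  # 2
print(n["no"])            # 2
print(len(training_set))  # 5
```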
Also, we set class labels for all these signals. The
label of a signal of a spoken word is its
transcript (the written word). Therefore, for each
$i = \overline{1,N}$ and $j = \overline{1,n_i}$, we set a signal label $l(S_j^i)$.
Obviously, it results that:
$$l(i) = l(S_1^i) = \ldots = l(S_{n_i}^i), \quad \forall i = \overline{1,N}, \qquad (1)$$
where $l(i)$ represents the label of the class
related to the $i$th word.
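The labeling rule above can be sketched in a few lines (an assumed structure, not the paper's code): every recording of the same word receives its transcript as the class label, so all prototypes of one word share one label, as equation (1) states.

```python
# Hypothetical prototype sets; strings stand in for audio signals.
prototypes = {
    "yes": ["yes_speaker1.wav", "yes_speaker2.wav"],
    "no":  ["no_speaker1.wav"],
}

# l(S_j^i) = transcript of the i-th word, for every speaker j.
labels = {s: word for word, sigs in prototypes.items() for s in sigs}

# All recordings of "yes" carry the same label, per equation (1).
print(labels["yes_speaker1.wav"] == labels["yes_speaker2.wav"])  # True
```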
The prototype vectors, representing the feature
vectors of the training set, are then computed. We
perform the training feature extraction by applying a
mel cepstral analysis to the signals $S_j^i$, the Mel
Frequency Cepstral Coefficients (MFCC) being the
dominant features used for speech recognition (Minh
2000, Furui 1986, Logan 2000).
A short-time signal analysis is performed on each
of these vocal sounds. Each signal is divided into
overlapping segments of 256 samples, with overlaps
of 128 samples. Then, each resulting signal segment
is windowed by multiplying it with a Hamming
window of length 256. We compute the spectrum of
each windowed sequence by applying the DFT
(Discrete Fourier Transform) to it, and obtain the
acoustic vectors of the current signal $S_j^i$. The mel
spectrum of these vectors is computed by converting
them to the mel scale, which is described as:
$$mel(f) = 2595 \log_{10}(1 + f/700), \qquad (2)$$
where $f$ represents the physical frequency and $mel(f)$
is the mel frequency. The mel cepstral acoustic
vectors are obtained by applying first the logarithm,
then the DCT (Discrete Cosine Transform), to the
mel spectral acoustic vectors.
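The pipeline described above (framing, Hamming window, DFT, mel conversion per equation (2), logarithm, DCT) can be sketched as follows. This is an illustrative reading, not the paper's implementation: standard MFCC implementations insert a triangular mel filterbank at the mel-conversion step, and the filterbank size (26 filters) and sampling rate below are assumptions, whereas the paper appears to keep 256-sample vectors throughout.

```python
import numpy as np

def mel(f):
    # Equation (2): physical frequency -> mel frequency.
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_filterbank(n_filters, n_fft, fs):
    # Triangular filters equally spaced on the mel scale (a standard
    # construction; the paper does not detail this step).
    pts = np.linspace(0.0, mel(fs / 2.0), n_filters + 2)
    hz = 700.0 * (10.0 ** (pts / 2595.0) - 1.0)   # inverse of eq. (2)
    bins = np.floor((n_fft + 1) * hz / fs).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for m in range(1, n_filters + 1):
        lo, c, hi = bins[m - 1], bins[m], bins[m + 1]
        for k in range(lo, c):
            fb[m - 1, k] = (k - lo) / max(c - lo, 1)
        for k in range(c, hi):
            fb[m - 1, k] = (hi - k) / max(hi - c, 1)
    return fb

def mfcc_matrix(signal, fs=8000, frame=256, hop=128, n_filters=26):
    # 1) overlapping 256-sample segments with 128-sample overlap
    frames = np.array([signal[i:i + frame]
                       for i in range(0, len(signal) - frame + 1, hop)])
    frames = frames * np.hamming(frame)            # 2) Hamming window
    spectrum = np.abs(np.fft.rfft(frames, axis=1)) # 3) DFT magnitude
    mel_spec = spectrum @ mel_filterbank(n_filters, frame, fs).T  # 4) mel
    log_mel = np.log(mel_spec + 1e-10)             # 5) logarithm
    # 6) DCT-II along the filter axis gives the mel cepstral coefficients
    k = np.arange(n_filters)
    basis = np.cos(np.pi * np.outer(k, 2 * k + 1) / (2 * n_filters))
    return log_mel @ basis.T                       # one MFCC row per frame

# Example: one second of a synthetic 440 Hz tone sampled at 8 kHz.
fs = 8000
t = np.arange(fs) / fs
coeffs = mfcc_matrix(np.sin(2 * np.pi * 440 * t), fs=fs)
print(coeffs.shape)   # (number of frames, number of filters)
```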
Then we compute the delta mel frequency
cepstral coefficients (DMFCC), as the first-order
derivatives of the MFCC, and the delta-delta mel
frequency cepstral coefficients (DDMFCC), as the
second-order derivatives of the MFCC. We prefer to use
the delta-delta mel cepstral acoustic vectors for
describing the speech content. These acoustic vectors
have a dimension of 256 samples. To reduce this
size, we truncate each acoustic vector to its first 12
coefficients, which we consider sufficient for
speech featuring. Then we create a 12-row matrix by
positioning these truncated delta-delta mel cepstral
vectors as columns. The resulting DDMFCC-based
matrix represents the final speech feature vector.
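A small sketch of this step, reading "derivative" as a finite difference along the frame axis (the paper does not give the exact delta formula, so this choice is an assumption), followed by truncation to 12 coefficients and stacking frames as columns:

```python
import numpy as np

def deltas(c):
    # First-order difference between consecutive frames (rows).
    return np.diff(c, axis=0)

# Hypothetical MFCC sequence: 20 frames of 256 coefficients each.
rng = np.random.default_rng(0)
mfcc = rng.standard_normal((20, 256))

dmfcc = deltas(mfcc)    # DMFCC:  first-order differences, (19, 256)
ddmfcc = deltas(dmfcc)  # DDMFCC: second-order differences, (18, 256)

# Keep the first 12 coefficients of each frame; frames become columns,
# giving the 12-row DDMFCC-based feature matrix V(S).
feature_matrix = ddmfcc[:, :12].T

print(feature_matrix.shape)  # (12, 18)
```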
Thus, the training feature set becomes
$\{V(S_1^1), \ldots, V(S_{n_1}^1), \ldots, V(S_1^N), \ldots, V(S_{n_N}^N)\}$,
where each feature vector $V(S_j^i)$ represents a 12-row
matrix whose number of columns depends on the length of $S_j^i$.
3 INPUT SPEECH ANALYSIS
In this section we focus on the analysis of the input
vocal sound. As we mentioned in the introduction, we
consider only discrete speech sounds to be
recognized by our system. Also, we set the condition
that the words of the input spoken utterance belong to
the given vocabulary.
ICETE 2004 - WIRELESS COMMUNICATION SYSTEMS AND NETWORKS