engine are described.
3.1 Database
As the focus of this work is on quantifying the
performance degradation caused by the acoustic
mismatch between training and testing materials,
two databases were used: a clean speech database
and a noise-only database. This arrangement makes
it possible to precisely control the type and amount
of noise added in each condition. These two
databases are described below.
3.1.1 Clean Speech Corpus
The speech corpus comprises 40 adult speakers (20
male and 20 female) (Ynoguti, 1999). Each of these
speakers recorded 40 phonetically balanced
sentences in Brazilian Portuguese. Therefore, this
corpus contains 1600 utterances. Thirty speakers (15 of
each gender) were used to train the system (1200
utterances), and the remaining 10 were reserved for
the performance tests (400 utterances).
The sentences were drawn from (Alcaim,
Solewicz and Moraes, 1992) and comprise 694
different words. The database was thus built for
speaker-independent continuous speech recognition
with a medium-sized vocabulary.
All the utterances were manually transcribed
using a set of 36 phonemes. The recordings were
made in a low-noise environment, at a sampling rate
of 11025 Hz, and coded as 16-bit linear PCM. For
this work, the sampling frequency was lowered to
8 kHz, because the noise database was acquired at
that rate.
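This rate conversion corresponds to the rational ratio 8000/11025 = 320/441, so it can be done with a polyphase resampler. The following is a minimal sketch, assuming SciPy; the paper does not state which resampling method was actually used.

```python
# Illustrative downsampling from 11025 Hz to 8000 Hz; the rational
# ratio 8000/11025 reduces to 320/441.
from scipy.signal import resample_poly

def downsample_to_8khz(x):
    # resample_poly applies an anti-aliasing filter before the
    # sample-rate conversion.
    return resample_poly(x, up=320, down=441)
```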
3.1.2 Noisy Speech Corpus
To generate the noise-corrupted versions of the
speech utterances, the noises from the Aurora
database (Pearce and Hirsch, 2000) were used. That
database is actually a noise-corrupted speech corpus,
but it also provides recordings of the noises alone.
The available noise types are airport, exhibition,
restaurant, street, subway, train, babble and car. All
eight noise types were used to train the system; of
these, only the car noise was used to evaluate its
performance, in order to reduce the total simulation
time. For each clean utterance of the training speech
corpus, 16 noise-corrupted versions were created,
combining each of the 8 noise types with
signal-to-noise ratios of 15 and 20 dB. Therefore,
the noise-corrupted training speech corpus contains
1200 clean speech recordings × 8 noise types × 2
SNR levels = 19200 utterances. Similarly, for each
clean utterance of the testing speech corpus, 21
noise-corrupted versions were created, combining
the car noise with signal-to-noise ratios from 0 to
20 dB in steps of 1 dB. Therefore, the noise-corrupted
testing speech corpus contains 400 clean speech
recordings × 21 SNR levels = 8400 utterances.
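The corpus generation described above reduces to scaling each noise recording so that it corrupts a clean utterance at a prescribed SNR. A minimal sketch follows, assuming 1-D NumPy arrays at the same sampling rate; the authors' actual mixing code is not described in the paper.

```python
# Illustrative mixing of a noise recording into a clean utterance at a
# target signal-to-noise ratio.
import numpy as np

def add_noise(speech: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    # Loop (or trim) the noise so it covers the whole utterance.
    reps = int(np.ceil(len(speech) / len(noise)))
    noise = np.tile(noise, reps)[:len(speech)]
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2)
    # Scale the noise so that 10*log10(p_speech / p_scaled_noise) == snr_db.
    gain = np.sqrt(p_speech / (p_noise * 10.0 ** (snr_db / 10.0)))
    return speech + gain * noise

# Testing corpus: car noise only, SNRs from 0 to 20 dB in 1 dB steps.
# for snr in range(21):
#     noisy = add_noise(clean_utterance, car_noise, snr)
```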
3.1.3 Speech Recognition Engine
A continuous-density HMM-based recognition
engine developed by Ynoguti and Violaro (2000)
was used for the tests. This system uses the One-Pass
search algorithm (Ney, 1984) and context-independent
phones as its fundamental units; each phone is
modeled by a 3-state Markov chain, as shown in
Figure 2. For each HMM state, a mixture of 10
multidimensional Gaussian distributions with
diagonal covariance matrices was used.
Figure 2: Markov chain for each phone model.
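The per-phone topology can be made concrete with a short sketch. The following builds a 3-state left-to-right model with 10 diagonal-covariance Gaussians per state, using the third-party hmmlearn package for illustration; this is a reconstruction under stated assumptions, not the authors' engine, and the transition probabilities shown are placeholders.

```python
# Illustrative 3-state left-to-right phone model with 10 diagonal-covariance
# Gaussians per state (hmmlearn is an assumed dependency; the transition
# probabilities below are placeholders, not trained values).
import numpy as np
from hmmlearn.hmm import GMMHMM

def make_phone_model() -> GMMHMM:
    model = GMMHMM(n_components=3,          # 3 emitting states (Figure 2)
                   n_mix=10,                # 10 Gaussians per state
                   covariance_type="diag")  # diagonal covariance matrices
    # Left-to-right topology: each state either loops or advances.
    model.startprob_ = np.array([1.0, 0.0, 0.0])
    model.transmat_ = np.array([[0.5, 0.5, 0.0],
                                [0.0, 0.5, 0.5],
                                [0.0, 0.0, 1.0]])
    return model
```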
As acoustic parameters, 12 mel-cepstral coefficients
together with their first and second derivatives were
used, so the feature vectors have 36 dimensions.
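A feature extraction step matching this description can be sketched as follows, here using the librosa package; the frame settings and exact mel-cepstral definition are assumptions, since the paper does not specify them.

```python
# Illustrative 36-dimensional feature extraction: 12 mel-cepstral
# coefficients plus their first and second derivatives. Frame settings
# are librosa defaults, not values taken from the paper.
import librosa
import numpy as np

def extract_features(wav_path: str) -> np.ndarray:
    y, sr = librosa.load(wav_path, sr=8000)            # corpus sampling rate
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=12)
    d1 = librosa.feature.delta(mfcc, order=1)          # first derivatives
    d2 = librosa.feature.delta(mfcc, order=2)          # second derivatives
    return np.vstack([mfcc, d1, d2]).T                 # shape: (frames, 36)
```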
Finally, a bigram language model was used to
improve the recognition rates.
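A bigram model assigns each word a probability conditioned only on the preceding word. The following minimal sketch estimates such a model by relative frequency; the paper does not describe the authors' estimation procedure, so both the implementation and the toy sentences are purely illustrative.

```python
# Illustrative bigram language model estimated by relative frequency:
# P(w2 | w1) = count(w1 w2) / count(w1).
from collections import Counter

def train_bigram(sentences):
    unigrams, bigrams = Counter(), Counter()
    for words in sentences:
        for w1, w2 in zip(words, words[1:]):
            unigrams[w1] += 1
            bigrams[(w1, w2)] += 1
    return lambda w1, w2: bigrams[(w1, w2)] / unigrams[w1] if unigrams[w1] else 0.0

p = train_bigram([["o", "gato", "dorme"], ["o", "gato", "come"]])
print(p("o", "gato"))  # 1.0: "gato" always follows "o" in the toy data
```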
The modeling choices above were based on previous
tests (Ynoguti and Violaro, 2000).
3.2 Performance Evaluation Method
The recognition performance can be determined by
comparing the hypothesis transcription (recognized
by the speech recognizer) with the reference
transcription (correct sentence).
Different metrics are used to evaluate the
performance of an automatic speech recognition
system; the most common are the following:
Sentence error rate: the number of incorrectly
recognized sentences divided by the total number
of sentences;
Word error rate: for this metric, the word
sequences are compared using a dynamic
alignment algorithm over the word chains in order
to find the deletion (D), substitution (S) and
insertion (I) errors (a sketch of this computation is
given below).
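A minimal sketch of this alignment, implemented as a standard edit-distance dynamic program over words; the exact alignment algorithm used by the authors is not detailed in the paper.

```python
# Illustrative word error rate: a dynamic-programming alignment between
# reference and hypothesis word sequences counting substitutions (S),
# deletions (D) and insertions (I); WER = (S + D + I) / N, where N is
# the number of reference words.
def word_error_rate(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = minimum edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i                          # i deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j                          # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub,               # substitution or match
                          d[i - 1][j] + 1,   # deletion
                          d[i][j - 1] + 1)   # insertion
    return d[len(ref)][len(hyp)] / len(ref)

print(word_error_rate("o gato dorme", "o gato come bem"))  # 2/3
```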