PROGRESSES IN CONTINUOUS SPEECH RECOGNITION

BASED ON STATISTICAL MODELLING

FOR ROMANIAN LANGUAGE

Corneliu Octavian Dumitru

1,2

University Politehnica Bucharest, Faculty of Electronics Telecommunications and Information Technology, Romania

ARTEMIS Department, GET/INT, Evry, France

Inge Gavat, Diana Militaru

University Politehnica Bucharest, Faculty of Electronics Telecommunications and Information Technology, Romania

Keywords: MFCC, LPC, PLP, statistical modelling, monophone, triphone.

Abstract: In this paper we will present progresses made in Automatic Speech Recognition (ASR) for Romanian

language based on statistical modelling with hidden Markov models (HMMs). The progresses concern

enhancement of modelling by taking into account the context in form of triphones, improvement of speaker

independence by applying a gender specific training and enlargement of the feature categories used to

describe speech sequences derived not only from perceptual cepstral analysis but also from perceptual linear

prediction.

1 INTRODUCTION

After years of research and development, the

problem of automatic speech recognition and

understanding is still an open issue. The end goal of

translation into text, accurate and efficient,

unaffected by speaker, environment or equipment

used is very difficult to achieve and many challenges

are to be faced.

Some factors which make difficult the problem

of automatic speech recognition (ASR) are voice

variations in context or in environment, syntax, the

size of vocabulary. How this problems can easy be

accommodated by humans, to continue studies in

human speech perception can be very important to

improve performance in speech recognition by

machine. As alternative, human performance can be

regarded as guide for ASRs and in this moment

neither the best systems built for English language

can not reach human performance.

In this paper we will present progresses made in

ASR for Romanian language based on statistical

modelling with hidden Markov models (HMMs)

using the HTK toolkit, developed by the Cambridge

University Engineering Department (Woodland,

1994). We started in this domain with a system for

continuous speech recognition (Gavat, 2003) based

on monophone models, regarding words like phone

sequences, neglecting co-articulation. The obtained

results challenged us to improvements that

addressed especially acoustical context modelling

but also speaker independence enhancement and

optimization of features choice describing speech.

The remainder of this paper will be also

structured as follows: Section 2 presents an

overview of the built ASR system. Section 3

describes the context based acoustical modelling,

realized with triphones. Section 4 gives an overview

of the Rumanian database and the changes made in

view of the experiments for speaker independence

enhancement. Tests are described and discussed in

section 5. Finally, conclusions and future works are

given in section 6.

2 SYSTEM OVERWIEV

The functional schema of our speech recognition

system is given in figure 1. The system acts in three

262

Octavian Dumitru C., Gavat I. and Militaru D. (2007).

PROGRESSES IN CONTINUOUS SPEECH RECOGNITION BASED ON STATISTICAL MODELLING FOR ROMANIAN LANGUAGE.

In Proceedings of the Fourth International Conference on Informatics in Control, Automation and Robotics, pages 262-267

DOI: 10.5220/0001619802620267

 SciTePress

main phases: training, testing and evaluation of

results.

In the training phase the current parameters of

HMMs are established by incrementally refining an

initial set of “white” acoustical HMM models.

In the testing phase the test data are verified with

the training database using the grammar, the lexicon,

the word network and the testing dictionary.

In the evaluation phase, by comparing the

transcription of the test speech sequences with the

reference transcription, the recognition correctness

and accuracy are calculated.

In each phase a data preparation stage is

introduced in order to build a set of speech data files

and their transcription in the required format. An

important step in this stage is feature extraction.

Feature extraction is the lowest level of

automatic speech recognition and it lies in the task

of extracting the limited amount of useful

information from high-dimensional data. The feature

extractors maintain much of the characteristics of the

original speech and eliminate much of the

extraneous information.

After feature extraction, a sequence O of feature

vectors are the input data for the testing phase, in

order to establish the correct uttered sentence W. We

need to estimate therefore the probability of acoustic

features given the phonetic model, so that we can

recognize the input data for the correct sentence.

This probability is referred to as acoustic probability

P (O|W).

HMMs are the most common models used in

automatic speech recognition systems to model the

joint probability distribution of feature vectors for a

given utterance model (Gavat, 2000).

Mathematically the problem in continuous

speech recognition is to find a sequence of words

such that

)()(maxarg

WPWOPW

(1)

The most probable sentence W given the observation

sequence O can be computed by taking this product

of two probabilities for each sentence and choosing

the sentence for which the product is the highest.

Evaluation

of results

Testing

Training

Phonetic transcription

dictionary

Training speech

transcription list

Training

dictionary

Phonems list Labeling

Phonetic

transcription of

training sequences

Training

database

Features

extraction

Models initialisation

Parameters estimation

Models refinement

Alignment

Sequences labels

Maxim plausibility

decision

Training models for

phonems

Features

extraction

Testing database

Speech

aquisition

Grammar and lexic

are “correct”?

Phonetic

transcription of

recognized

speech sequences

Phonetic transcription of

speech testing sequences

Word

network

Testing

dictionary

Aplication

grammar

Aplication

vocabulary

Phonetic transcription

database

Testing

sequences list

yes

Comparison

Correctness,

accuracy

Speech

sequences

recognized

Figure 1: The speech recognition system.

PROGRESSES IN CONTINUOUS SPEECH RECOGNITION BASED ON STATISTICAL MODELLING FOR

ROMANIAN LANGUAGE

263

The prior probability P (W) is calculated using a

language model appropriate for the recognition task,

and the acoustic probability P (O|W), is calculated

by concatenating the HMMs of the words in the

sequence W and using the Viterbi algorithm for

decoding. A silence or a ’short pause’ model is

usually inserted between the HMMs to be

concatenated.

3 TRIPHONE MODELLING

For small vocabulary recognition, word models are

widely used, since they are accurate and trainable. In

the situation of a specific and limited task they

become valid if enough training data are available,

but they are typically not generalizable. Therefore,

usually for not very limited tasks are preferred

phonetic models based on monophones, because the

phones, as smallest linguistic units, are easy

generalizable and of course also trainable.

Monophones constitute the foundation of any

training method, in any language, and we also

started with them. But a refinement of this initial

step was necessary because in real speech, the words

are not simple strings of independent phonemes.

The realization of a phoneme is strongly affected by

its immediately neighboring phonemes by co-

articulation. Because of this, monophone models

have been changed in time with triphone models that

became the actual state of the art in automatic

speech recognition with large vocabularies (Young,

1992), (Young, 1994).

A triphone model is a phonetic model that takes

into consideration the left and the right neighbouring

phones. This immediate neighbour – phonemes are

called respectively the left and the right context; a

phoneme constitutes with the left and right context a

triphone. For example in the SAMPA (Speech

Assessment Methods Phonetic Alphabet)

transcription “m - a + j” of the Romanian word

”mai”, regarded as triphone, the phoneme “m” has

as left context “a” and as right context “j”.

For each such a triphone a model must be trained: in

Romanian that will give a number which equals

40,000 models, situation totally unacceptable for a

real world system. In our speech recognition task we

have modelled only internal – word triphones and

the adopted state tying procedure has conducted to a

controllable situation.

Figure 2: Different models for triphones around the phoneme “a”.

)

ICINCO 2007 - International Conference on Informatics in Control, Automation and Robotics

264

If triphones are used in place of monophonemes,

the number of needed model increases and it may

occur the problem of insufficient training data. To

solve this problem, tying of acoustically similar

states of the models built for triphones

corresponding to each context is an efficient

solution. For example, in figure 2b, four models are

represented for different contexts of the phoneme

“a”, namely the triphones “k – a + S”, “g – a + z”, “n

– a + j”, “m – a + j”. In figure 2c, 2d, there are

represented the clusters formed with acoustically

similar states of the corresponding HMMs.

The choice of the states and the clustering in

phonetic classes are achieved by mean of phonetic

decision trees. A phonetic decision tree built as a

binary tree, is shown in figure 3 and has in the root

node all the training frames to be tied, in other words

all the contexts of a phoneme. To each node of the

tree, beginning with the parent – nodes, a question is

associated concerning the contexts of the phoneme.

Possible questions are, for example: is the right

context a vowel (R = Consonant?), is the left context

a phoneme “a” (L = a?); the first answer designates a

large class of phonemes, the second only a single

phonetic element. Depending on the answer, yes or

no, child nodes are created and the frames are placed

in them. New questions are further made for the

child nodes, and the frames are divided again.

The questions are chosen in order to increase the

log likelihood of the data after splitting. Splitting is

stopped when increasing in log likelihood is less

than an imposed threshold resulting a leaf node. In

such leaf nodes are concentrated all states having the

same answer to the question made along the path

from the root node and therefore states reaching the

same leaf node can be tied as regarded acoustically

similar. For each leaf node pair the occupancy must

be calculated in order to merge insufficient occupied

leaf nodes.

A decision tree is built for each state of each

phoneme. The sequential top down construction of

the decision trees was realized automatically, with

an algorithm selecting the questions to be answered

from a large set of 130 questions, established after

knowledge about phonetic rules for Romanian

language.

4 DATABASE

The data are sampled by 16 kHz, quantified with 16

bits, and recorded in a laboratory environment.

For continuous speech recognition, database for

training is constituted by 3300 phrases, uttered by 11

speakers, 7 males and 4 females, each speaker

reading 300 phrases.

The databases for testing contained 220 phrases

uttered by 11 speakers, each of them reading 20

phrases.

The training database contains over 3200 distinct

words; the testing database contains 900 distinct

words.

In order to carry out our experiments about

speaker independence, the database was reorganized

as follows: one database for male speakers (MS),

one database for female speakers (FS) and one

database for male and female speakers (MS and FS).

In all cases we have excluded one MS and one FS

from the training and used for testing.

Figure 3: Phonetic tree for phoneme m in state 2.

R=Consonant?

=Vowel?

L=a

L=Vowel_medial?

=p?

PROGRESSES IN CONTINUOUS SPEECH RECOGNITION BASED ON STATISTICAL MODELLING FOR

ROMANIAN LANGUAGE

265

The speech files from these databases were analysed

in order to extract the interesting features. The

feature extraction methods used are based on linear

predictive coding (LPC), perceptual linear prediction

(PLP) (Hermansky, 1990) mel-frequency cepstral

coefficients (MFCC).

5 EXPERIMENTAL RESULTS

To assess the progresses made with our ASR system

we initiated comparative tests for the performance

expressed in word recognition rate (WRR) to

establish the values under the new conditions versus

the preceding ones. The comparison is made for the

following situations:

 Triphone modelling/monophone modelling

 Gender based training/mixed training

 LPC and PLP/MFCC.

The results obtained in the experiments realized

under these conditions are summarized in Table 1,

Table 2, and Table 3.

Table 1: Word Recognition Rate: Training MS, testing MS

or FS.

WRR (%)

Training

Type

MFCC_

D_A

LPC PLP

Monophone 56.33 30.85 34.02

Testing

Triphone 81.02 49.73 68.10

Monophone 40.98 23.23 25.12

Testing

Triphone 72.86 47.68 59.00

The WRR are:

 For 12 LPC coefficients the word recognition

rates are low: 30.85% (monophone) training

and testing with MS and 49.73% (triphone);

31.11% (monophone) training and testing with

FS and 61.15% (triphone); 26.10%

(monophone) training MS and FS and testing

with MS and 51.5% (triphone).

 For 5 PLP coefficients the obtained results are

very promising, giving word recognition rates

about 58.55% (triphone training and testing

FS), 68.10% (triphone training and testing MS)

and 70.11% (triphone training MS and FS and

testing MS).

 For 36 MFCC_D_A coefficients (mel-cepstral

coefficients with first and second order

variation) we obtained the best results, as we

expected: monophone 56.33% and triphone

81.02%, training and testing with MS;

monophone 56.67% and triphone 78.43%,

training and testing with FS; monophone

57.44% and triphone 78.24%, training MS and

FS and testing with MS.

Table 2: Word Recognition Rate: Training FS, testing MS

or FS.

WRR (%)

Training

Type

MFCC_

D_A

LPC PLP

Monophone 53.56 26.72 23.78

Testing

Triphone 69.23 49.73 53.02

Monophone 56.67 31.11 34.22

Testing

Triphone 78.43 61.15 58.55

Table 3: Word Recognition Rate: Training MS and FS,

testing MS or FS.

WRR (%)

Training

MS and

Type

MFCC_

D_A

LPC PLP

Monophone 57.44 26.10 47

Testing

Triphone 78.24 51.50 70.11

Monophone 49.89 24.06 41.22

Testing

Triphone 74.95 50.49 69.65

6 CONCLUSION

After the experiments made we have following

conclusions:

 The triphone modelling is effective, conducting

to increasing in WRR between 15% and 30%

versus the monophone modelling. The maximal

enhancement exceeds 30% for training MS and

testing FS for MFCC_D_A (see figure 4.a).

 A gender based training conduct to good result

for test made with speakers from the same

gender (training MS / testing MS: 81.02%,

testing FS: 72.86%; training FS/testing FS:

78.43%, testing MS: 69.23%); changing gender

in testing versus training leads to a decrease in

WRR around 10%. For a mixed trained data

base changing gender determines only

variations around 5% in WRR (see figure 4.b).

ICINCO 2007 - International Conference on Informatics in Control, Automation and Robotics

266

123

MFCC LPC PLP

WRR differencies (%) triphone/monophone

trai ni ng MS /testing

training FS/testing

trai ni ng

mixed/testing MS

trai ni ng

mixed/testing FS

123

MFCC LPC PLP

WRR (% ) triphone

training MS/testing

training FS/testing

training

mixed/testing MS

training

mixed/testing FS

Figure 4: a) Chart 1; b) Chart 2.

REFERENCES

Woodland P.C., Odell J.J., Valtchev V., Young S.J, 1994,

Large Vocabulary Continuous Speech Recognition

Using HTK, Proceedings of ICASSP 1994, Adelaide.

Gavat I., Dumitru C.O., Costache G., Militaru D., 2003,

Continuous Speech Recognition Based on Statistical

Methods, Proceedings of SPED 2003, pp. 115-126,

Romania.

Young, S.J., 1992, The General Use of Tying in Phoneme-

Based HMM Speech Recognizers, Proceedings of

ICASSP 1992, vol. 1, pp. 569-572.

Young S.J., Odell J.J., Woodland P.C., 1994, Tree Based

State Tying for High Accuracy Modelling, ARPA

Workshop on Human Language Technology.

SAMPA (Speech Assessment Methods Phonetic

Alphabet),

http://www.phon.ucl.ac.uk/home/sampa/home.htm

Hermansky H., 1990, Perceptual Linear Predictive

Analysis of Speech, J. Acoust. Soc. America, vol.87,

no.4, pp. 1738-1752.

Oancea E., Gavat I., Dumitru C.O., Munteanu D., 2004,

Continuous speech recognition for Romanian language

based on context-dependent modelling, Proceedings of

COMMUNICATION 2004, pp. 221-224, Romania.

Dumitru C.O., Gavat I., 2005, Features Extraction,

Modelling and Training Strategies in Continuous

Speech Recognition for Romanian Language, Proc.

EUROCON 2005, pp. 1425-1428, Serbia &

Montenegro.

Gavat I., Zirra, M., Grigore O., Sabac B, Valsan Z., Cula

O., Pascu A., 2000, Elemente de sinteza si

recunoasterea vorbirii, Ed. Printech, Bucharest,

Romania.

Lupu, E., Pop, G. Petre, 2004, Prelucrarea numerica a

semnalului vocal. Elemente de analiza si

recunoastere, Ed. Risoprint , Cluj-Napoca, Romania.

PROGRESSES IN CONTINUOUS SPEECH RECOGNITION BASED ON STATISTICAL MODELLING FOR

ROMANIAN LANGUAGE

267