NN and Hybrid Strategies for Speech Recognition in
Romanian Language
Corneliu-Octavian Dumitru and Inge Gavat
Faculty of Electronics, Telecommunications and Information Technology
Politehnica University of Bucharest, 313 Splaiul Independenței, Bucharest, Romania
Abstract. In this paper we present results obtained with learning structures
more human-like than the very effective and widely used hidden Markov
model. Good results were obtained with simple artificial neural networks like
the multilayer perceptron or the Kohonen maps. Hybrid structures have also
proven their efficiency: the applied neuro-statistical hybrid enhances the digit
recognition rate of the initial HMM, and fuzzy variants of the MLP and the
HMM gave good results in the tested vowel recognition tasks.
1 Introduction
As a first step towards bringing the performance of ASRU (Automatic Speech
Recognition and Understanding) closer to that of HSR (Human Speech Recognition),
it could be important to compare how speech recognition, as the receiving part of
verbal communication, is realised by machines and by humans [8]. Even though the
process of verbal communication is a very natural one and thus seems fairly easy for
humans, there are several underlying operations that need to be carried out before the
communication can be considered successful. The operations designed for ASR try to
model what we know about speech and language, or what we assume about it. The
models are often much simpler than the reality, and thus are imperfect.
Fig. 1 shows a schematic overview of both speech recognition processes, HSR
and ASRU [6]. The first operation in human communication comprises the
hearing of the message. We first have to realise that somebody is talking to us and
then we have to listen to what he is saying. In ASR, the equivalent process consists
of recording the message with a microphone.
Both systems, human and automatic, need to have some knowledge about the
sounds that are used. If one of the talkers uses sounds the other talker does not know,
they cannot understand each other. For this we need a vocabulary, that is, a set of
words.
When humans process the message, they extract the meaning out of what was
said. They can do so by inferring the meaning from the actual sequence of words that
they recognised, because they can directly associate meanings to words and, more
importantly, to word sequences.
The system, however, searches for the word or word sequence that was most likely
spoken, given the acoustic signal. Even if this search is successful, the system is still
far away from understanding what the meaning of this sequence of words is. Of
course, approaches already exist that try to extract the meaning out of recognised
word sequences.
Although the procedures in the human and the automatic systems seem to be very
similar, the results are very different, the differences being pointed out by Lippmann
in his well-known study [11]. For complicated tasks, involving sentence recognition,
the performance difference is not so surprising and can mainly be explained by the
advantage humans gain from their natural use of context [3]. For simple tasks, like
vowel recognition, where context does not favour humans, they are still better than
the machine, and that remains surprising [4].
Because of the good human performance in the speech recognition task, it could be
interesting to mimic the most important human action in this process, namely
learning, and to do it in a more human-like manner by involving neural and
fuzzy techniques.
Fig. 1. Overview of a HSR system (right side) and an ASRU system (left side).
The paper has the following structure. Section 2 describes the basic learning
structures applied to build acoustic models. We investigate classical neural
strategies, like the multilayer perceptron (MLP) and the Kohonen maps (KM), but
also hybrid strategies, like a neuro-statistical model and fuzzy variants of a neural
structure, namely the MLP, and of a statistical structure, namely the HMM. Section 3
presents the large variety of conditions under which the basic experiments can be
carried out with our Automatic Speech Recognition System for Romanian Language
(ASRS_RL) in vowel recognition and digit recognition tasks. Section 4 concludes the
paper.
2 Learning Strategies
Learning is the basic process by which humans acquire knowledge, and it has been
successfully mimicked in technical systems. Artificial neural networks, which
generate the neural strategies, are a good example. They lead to good recognition
performance and can also improve, through hybridization, the performance of the
statistical method based on HMMs.
2.1 Neural Strategies
The neural strategies can model very well an important characteristic of human
learning, namely associativity: all inputs of the ANN concur to obtain the resulting
output, and therefore the recognition performance is high. Due to their fixed number
of inputs, their flexibility in accommodating time sequences is too low for more
complicated recognition tasks, like continuous speech recognition, but appropriate
for recognizing vowels or digits. In the following we discuss two fundamental
artificial networks, namely the MLP and the KM.
Multilayer Perceptron (MLP)
The MLP is the most common ANN architecture used for speech recognition.
Typically, MLPs have a layered feed-forward architecture, with an input layer, one or
more intermediate (hidden) layers, and one output layer. The structure without a
hidden layer is called a Boolean network and is a simple perceptron [13].
Each layer computes a set of linear discriminative functions, followed by a
nonlinear function, which is often a sigmoid.
The number of neurons in the hidden layer was experimentally determined, trying
to achieve an optimum between two opposite requirements: (a) a lower computing
volume and a more rapid convergence in the learning phase; (b) better performance
in terms of the percentage of correctly classified input patterns.
In the learning phase, the optimum values [14] of the weights connecting pairs of
neurons in adjacent layers are determined, in the input-output direction, using the
Back-Propagation algorithm.
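As an illustration only (not the original ASRS_RL implementation), the following
minimal Python sketch builds such an MLP with scikit-learn, assuming 12 MFCC
inputs, 100 hidden nodes, and 5 vowel classes as in the experiments of Section 3;
the training data here are random placeholders.

```python
# Hypothetical sketch of the MLP described above: 12 MFCC features,
# one hidden layer of 100 sigmoid nodes, 5 vowel classes.
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
X_train = rng.normal(size=(500, 12))     # placeholder MFCC vectors
y_train = rng.integers(0, 5, size=500)   # placeholder vowel labels (a, e, i, o, u)

mlp = MLPClassifier(hidden_layer_sizes=(100,),  # one hidden layer, 100 nodes
                    activation="logistic",       # sigmoid nonlinearity
                    solver="sgd",                # gradient descent (back-propagation)
                    learning_rate_init=0.1,
                    max_iter=500)
mlp.fit(X_train, y_train)
print(mlp.predict(X_train[:3]))
```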
Kohonen Maps
Kohonen maps are competitive neural networks with a topological character: the
winner neurons at the output are established while preserving the topological
relations between the input vectors.
That is the reason why this type of neural network is successfully used in pattern
recognition [9]. In the learning phase the structures are trained and the weights of the
networks are established in two steps: (a) the determination of the winner neurons;
(b) the adaptation of the weights of the winner neurons and of the neurons in a
certain neighborhood. Important in this second step are: (a) the neighborhood
dimension $r(t)$, which decreases during learning; (b) the learning rate $\eta(t)$,
following in our experiments one of the laws [5]:

$$\eta(t) = t^{-1} \quad \text{or} \quad \eta(t) = t^{-1/2} \qquad (1)$$
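A minimal sketch of one Kohonen training step, assuming a 5x5 output grid over
three formant inputs and the law $\eta(t) = t^{-1}$ from Eq. (1); the neighborhood
shrinking schedule below is an assumption, since the paper only states that $r(t)$
decreases during learning.

```python
import numpy as np

def som_step(weights, x, t, r0=3.0, t_max=1000):
    """One Kohonen training step. weights: (rows, cols, dim); x: (dim,); t >= 1."""
    eta = 1.0 / t                          # learning rate law of Eq. (1)
    r = max(r0 * (1.0 - t / t_max), 1.0)   # shrinking neighborhood (assumed schedule)
    # (a) determine the winner neuron: the closest weight vector to x
    dist = np.linalg.norm(weights - x, axis=2)
    wi, wj = np.unravel_index(np.argmin(dist), dist.shape)
    # (b) adapt the winner and the neurons in its grid neighborhood
    for i in range(weights.shape[0]):
        for j in range(weights.shape[1]):
            if np.hypot(i - wi, j - wj) <= r:
                weights[i, j] += eta * (x - weights[i, j])
    return weights

# Usage: a 5x5 map over 3 formant-frequency inputs (placeholder data).
w = np.random.default_rng(0).random((5, 5, 3))
w = som_step(w, np.array([0.2, 0.5, 0.7]), t=1)
```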
2.2 Hybrid Strategies
Hybrid strategies can improve the performance of the strategies they combine. In
order to make them more human-like, we have applied a neuro-statistical hybrid,
adding to an HMM an MLP acting as an a posteriori probability estimator, and we
have realized fuzzy variants of the MLP and of the HMM.
Neuro-statistic Hybrid (HMM –MLP)
HMM-based speech recognition methods make use of a probability estimator in
order to approximate the emission probabilities $p(x_n|q_k)$, where $x_n$ represents
the observed feature vector and $q_k$ is the hypothesized HMM state. These
probabilities are used by the basic HMM equations, and because the HMM is based
on a strict formalism, when the HMM is modified there is a great risk of losing the
theoretical foundations or the efficiency of the training and recognition algorithms.
Fortunately, a proper use of MLPs can yield probabilities that are related to the
HMM emission probabilities [10].
In particular, MLPs can be trained to produce the a posteriori probability
$p(q_k|x_n)$, that is, the a posteriori probability of the HMM state given the acoustic
data, when each MLP output is associated with a specific HMM state. Many authors
have shown that the outputs of an ANN used as described above can be interpreted
as estimates of the a posteriori probabilities of the output classes conditioned on the
input, so we will not insist on this matter, but we will mention an important
condition, useful for finding an acceptable connectionist probability estimator: the
system must contain enough parameters to be trained to a good approximation of
the mapping function between the input and the output classes [14].
Thus, the a posteriori probabilities estimated by MLPs can be converted into
emission probabilities by applying Bayes' rule (2) to the MLP outputs:

$$\frac{p(x_n|q_k)}{p(x_n)} = \frac{p(q_k|x_n)}{p(q_k)} \qquad (2)$$

That is, the (scaled) emission probabilities are obtained by dividing the a posteriori
estimates from the MLP outputs by estimates of the frequencies of each class, while
the scaling factor $p(x_n)$ is considered constant for all classes and does not modify
the classification.
This is the idea that led to hybrid neuro-statistical methods, that is, hybrid
MLP-HMM methods, applied to the speech recognition problem.
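A small sketch of the conversion in Eq. (2): the per-frame MLP posteriors
$p(q_k|x_n)$ are divided by the class priors $p(q_k)$, estimated as training-set
frequencies, to obtain scaled emission likelihoods for the HMM; the numbers below
are placeholders.

```python
import numpy as np

def scaled_likelihoods(mlp_posteriors, state_priors):
    """mlp_posteriors: (T, K) per-frame state posteriors from the MLP;
    state_priors: (K,) relative frequency of each state in the training data."""
    # Eq. (2): p(x_n|q_k) / p(x_n) = p(q_k|x_n) / p(q_k);
    # p(x_n) is constant per frame and is dropped.
    return mlp_posteriors / state_priors

posteriors = np.array([[0.7, 0.2, 0.1],
                       [0.1, 0.6, 0.3]])
priors = np.array([0.5, 0.3, 0.2])
print(scaled_likelihoods(posteriors, priors))
```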
Fuzzy Variants
Human judgement is rarely binary, and therefore binary logic, even if very simple, is
not the best solution for modelling human behaviour in speech classification tasks.
Fuzzy logic, capable of nuanced, shaded processing, seems much more suitable for
machines performing such tasks. In this subsection we introduce fuzzy logic in two
ways: by fuzzifying the input parameters, as in the fuzzy-MLP, and by replacing the
probabilistic similarity measure of the usual HMM with a fuzzy similarity measure,
as in the fuzzy (generalized) HMM.
Fuzzy-MLP
Introducing fuzzy processing of the input features of the MLP is a solution for
improving the MLP performance [16].
First, each input value is described through a combination of three membership
values in the linguistic property sets low, medium and high. For this, the $\pi$
membership function is used:

$$\pi(r; c, \lambda) = \begin{cases} 2\left(1 - \|r - c\|/\lambda\right)^2 & \text{for } \lambda/2 \le \|r - c\| \le \lambda \\ 1 - 2\left(\|r - c\|/\lambda\right)^2 & \text{for } 0 \le \|r - c\| \le \lambda/2 \\ 0 & \text{otherwise} \end{cases} \qquad (3)$$

where $\lambda > 0$ is the radius of the $\pi$ function, $c$ is its central point, and
$\|\cdot\|$ denotes the Euclidean norm.
For each component $F_{ji}$ of the input vector $F_j$, the parameters of the $\pi$
membership function for each linguistic property, low ($l$), medium ($m$) and high
($h$), are computed using the relations:

$$\lambda_{m}(F_{ji}) = \left(F_{ji\max} - F_{ji\min}\right)/2, \qquad c_{m}(F_{ji}) = F_{ji\min} + \lambda_{m}(F_{ji})$$
$$\lambda_{l}(F_{ji}) = \left(c_{m}(F_{ji}) - F_{ji\min}\right)/f_{dn}, \qquad c_{l}(F_{ji}) = c_{m}(F_{ji}) - 0.5\,\lambda_{l}(F_{ji})$$
$$\lambda_{h}(F_{ji}) = \left(F_{ji\max} - c_{m}(F_{ji})\right)/f_{dn}, \qquad c_{h}(F_{ji}) = c_{m}(F_{ji}) + 0.5\,\lambda_{h}(F_{ji}) \qquad (4)$$

where $F_{ji\max}$ and $F_{ji\min}$ denote the upper and lower bounds of the
observed range of the feature $F_{ji}$, and $f_{dn}$ is a parameter controlling the
extent of overlapping.
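As an illustration, a minimal Python sketch of the $\pi$ function of Eq. (3) and of the
low/medium/high fuzzification of one scalar feature following Eq. (4); the value of
$f_{dn}$ and the test inputs below are placeholders.

```python
import numpy as np

def pi_membership(r, c, lam):
    """Pi membership function of Eq. (3) for scalar inputs."""
    d = abs(r - c)
    if d <= lam / 2:
        return 1 - 2 * (d / lam) ** 2
    if d <= lam:
        return 2 * (1 - d / lam) ** 2
    return 0.0

def fuzzify(x, x_min, x_max, f_dn=2.0):
    """Low/medium/high membership values for one feature, per Eq. (4)."""
    lam_m = (x_max - x_min) / 2
    c_m = x_min + lam_m
    lam_l = (c_m - x_min) / f_dn
    c_l = c_m - 0.5 * lam_l
    lam_h = (x_max - c_m) / f_dn
    c_h = c_m + 0.5 * lam_h
    return [pi_membership(x, c, lam) for c, lam in
            ((c_l, lam_l), (c_m, lam_m), (c_h, lam_h))]  # (low, medium, high)

print(fuzzify(0.7, 0.0, 1.0))
```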
After this fuzzification, the structure of the fuzzy neural network, like the classical
one, is composed of a hidden layer and an output layer.
The output vector is defined by the fuzzy class-membership values. The membership
value of the training vector $F_i = (F_{i1}, F_{i2}, \ldots, F_{in})^T$ to class $C_k$
is computed using:

$$\mu_k(F_i) = \frac{1}{1 + \left(z_{ik}/f_d\right)^{f_c}} \qquad (5)$$

where $f_d$ and $f_c$ are constants controlling the amount of fuzziness in the
class-membership set, and $z_{ik}$ is the weighted distance between the input vector
$F_i$ and the mean $O_k = (O_{k1}, \ldots, O_{kn})^T$ of the $k$-th class, defined as:

$$z_{ik} = \sum_{j=1}^{n} \left[\left(F_{ij} - O_{kj}\right)/v_{kj}\right]^2 \qquad (6)$$

where $v_{kj}$ is the standard deviation of the $j$-th component of the vectors from
class $C_k$.
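A short sketch of the class-membership computation of Eqs. (5)-(6): $z_{ik}$ is the
variance-weighted distance to the class mean, squashed into $[0, 1]$ by Eq. (5); the
values of $f_d$, $f_c$ and the vectors below are assumed placeholders.

```python
import numpy as np

def class_membership(F_i, class_mean, class_std, f_d=5.0, f_c=2.0):
    """Fuzzy membership of input F_i to one class, per Eqs. (5)-(6)."""
    z_ik = np.sum(((F_i - class_mean) / class_std) ** 2)   # Eq. (6)
    return 1.0 / (1.0 + (z_ik / f_d) ** f_c)               # Eq. (5)

F_i = np.array([0.6, 0.2, 0.8])
print(class_membership(F_i,
                       class_mean=np.array([0.5, 0.3, 0.7]),
                       class_std=np.array([0.1, 0.1, 0.1])))
```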
In the training stage, the Back-Propagation algorithm is used to determine the
weights which minimize the mean squared error (mse) between the real output $d_j$
and the desired one $y_j$:

$$mse = \sum_{F \in \text{train}} \sum_{j=1}^{n} \left(d_j - y_j\right)^2 \qquad (7)$$
During training, the learning rate is gradually decreased in discrete steps {1, 0.5,
0.3, 0.1, 0.05, 0.03, 0.01}, until the network converges to a minimum error solution.
Fuzzy-HMM
The generalized model $\lambda = (A, B, \pi)$ can be characterized by the same
parameters [7] as the classical, well-known model. The major difference in the fuzzy
variant is the interpretation of the probability densities of the classical HMM as
fuzzy densities. In this way, the probabilistic similarity measure applied in the
classical HMM is replaced by a more suitable fuzzy similarity measure.
The succession of parameter vectors, called the observation sequence $O$, produces
the state sequence $S$ of the model; visiting, for example, the state $q_{t+1} = S_j$
at moment $t+1$, the symbol $b_j$ is generated. The corresponding symbol fuzzy
density $b_j(O_t)$ measures the grade of certainty of the statement that we observed
$O_t$ given that we are visiting state $S_j$. To perform classification tasks, the
fuzzy similarity measure must be calculated. Based on the fuzzy forward and
backward variables, a fuzzy Viterbi algorithm is proposed in [12] for the case of the
Choquet integral with respect to a fuzzy measure and multiplication as the
intersection operator.
The fuzzy formulation of the forward variable $\alpha$ brings an important
relaxation of the assumption of statistical independence. The joint measure
$\alpha\left(\{O_1, \ldots, O_t\} \times \{y_j\}\right)$ can be written as a
combination of two measures defined on $O_1, O_2, \ldots, O_t$ and on the states
respectively, no assumption about the decomposition of this measure being
necessary, where $Y = \{y_1, y_2, \ldots, y_N\}$ represents the states at time $t+1$
($\Omega$ is the space of observation vectors).
For the standard HMM, the joint measure $P(O_1, O_2, \ldots, O_t, q_{t+1} = S_j)$
can be written as the product $P(O_1, O_2, \ldots, O_t)\,P(q_{t+1} = S_j)$, so that
two assumptions of statistical independence must be made: the observation at time
$t+1$, $O_{t+1}$, is independent of the previous observations $O_1, O_2, \ldots,
O_t$, and the states at time $t+1$ are independent of the same observations.
These conditions are poorly matched by speech signals, and we therefore hope for
improvements due to the relaxation permitted by the fuzzy measure. Training of the
generalized model can be performed with the re-estimation formulas also given in
[12] for the Choquet integral. For each classical model we have trained, with these
re-estimation formulas, the corresponding generalized model (GHMM) with 3-5
states, analogous to the classical case.
After the training, we have calculated the fuzzy measure $P(O|\lambda)$ with the
fuzzy Viterbi algorithm and made the recognition decisions in the same manner as
for the classical HMM: the correct decision corresponds to the model for which the
calculated measure is maximum.
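A sketch of this decision rule, shared by the classical and fuzzy variants: score the
observation sequence $O$ against each trained model and pick the model with the
maximum measure. Here `score_model` is a hypothetical stand-in for the (fuzzy)
Viterbi computation of $P(O|\lambda)$, and the models are dummy placeholders.

```python
import numpy as np

def recognize(O, models, score_model):
    """Return the label of the model maximizing the similarity measure."""
    scores = {label: score_model(O, lam) for label, lam in models.items()}
    return max(scores, key=scores.get)

# Usage with a dummy scorer (a real system would run Viterbi here):
dummy = lambda O, lam: -np.sum((O - lam) ** 2)
models = {"a": np.array([1.0]), "e": np.array([2.0])}
print(recognize(np.array([1.2]), models, dummy))
```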
3 Experimental Results
The experiments are carried out with ASRS_RL, which offers multiple options for a
large variety of speech recognition experiments [1].
In the next two sub-sections, the learning strategies presented in Section 2 are
applied to vowel and digit recognition, and the obtained performance is evaluated.
The databases used (for vowels and digits) are sampled at 16 kHz, quantized with
16 bits, and recorded in a laboratory environment.
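As a hedged sketch of such a front end (the original ASRS_RL system uses its own
parameterization, so the librosa library and the file name below are assumptions):
12 MFCCs extracted from 16 kHz speech.

```python
import librosa

# Load a hypothetical recording at the 16 kHz rate used by the databases.
y, sr = librosa.load("vowel.wav", sr=16000)
# 12 mel-frequency cepstral coefficients per frame, shape (12, n_frames).
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=12)
print(mfcc.shape)
```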
Vowel Recognition
The learning strategies applied in our recognition experiments are: the Kohonen
maps, the MLP, the fuzzy-MLP and the fuzzy-HMM.
In the First Experiment, the vowel recognition rate was determined using the VDRL
database (Vowel Database for Romanian Language) and the MFCC parameterization
(in the form of 12 mel-frequency cepstral coefficients); the results obtained with the
MLP and the HMM as learning strategies are comparatively presented in Table 1.
Table 1. Vowel recognition rate in the case of training with MS and testing with MS and FS.
Vowel    MLP (MS)    MLP (FS)    HMM (MS)    HMM (FS)
a 100.00% 80.71% 100.00% 82.28%
e 85.81% 43.25% 94.82% 50.67%
i 85.71% 85.71% 95.15% 92.41%
o 90.90% 51.33% 97.00% 52.78%
u 88.88% 71.42% 94.83% 77.85%
Mean 91.26% 66.48% 96.36% 71.20%
The VDRL database contains speech data from 19 speakers (9 males and 10
females), each reading the same 5 vowels (a, e, i, o, u). The database was organized
as follows: one database for male speakers (MS) and one database for female
speakers (FS). In both cases, one male speaker (MS) and one female speaker (FS)
were excluded from the training database and their data used for testing.
The MLP experimented with is a two-layer perceptron trained with the
Back-Propagation algorithm, having 5 nodes in the output layer, corresponding to
the 5 vowels to be classified, and 100 nodes in the hidden layer (experimentally
chosen). The number of input nodes is equal to the number of features (12 MFCC).
The HMMs chosen for comparison are Bakis (left-right) structures with five states,
one model being created for each vowel [15].
Table 1 displays only the results for training with MS and testing with MS and FS.
Similar results were obtained for training with FS [2].
57
In the Second Experiment, the vowels were described by three formant frequencies,
and the error rates obtained with different learning strategies are given in Table 2.
The formant database contains 500 formant vectors (100 for each vowel) for training
and 250 formant vectors (50 for each vowel) for testing.
The learning structures applied [5], [14] for these investigations are:
(1) KM with an input layer of 3 neurons, corresponding to the three formant
frequencies, and three variants for the output layer: unidimensional with 25 neurons,
bidimensional with 5×5 neurons, and toroidal with 25 neurons.
(2) MLP with 3 layers organized as follows: (a) the input layer with 3 neurons,
corresponding to the three formant frequencies; (b) the hidden layer with 0 (Boolean
network) or 4 neurons; (c) the output layer with 5 neurons, each corresponding to a
processed class (in our case the vowels a, e, i, o, u).
(3) Fuzzy-MLP.
Table 2. Error rates (%) for different learning strategies in the case of formant features.
Vowel    KM 1-dim    KM 2-dim    KM toroidal    Boolean    MLP    Fuzzy-MLP
a 2.60% 1.20% 2.00% 4.00% 0.00% 1.50%
e 3.20% 2.40% 2.40% 6.00% 2.50% 1.00%
i 2.20% 1.60% 1.20% 4.00% 1.00% 0.50%
o 1.80% 1.20% 1.20% 10.00% 1.00% 1.00%
u 2.20% 2.10% 1.80% 10.00% 1.00% 0.00%
Mean 2.40% 1.70% 1.72% 6.80% 1.10% 0.80%
In the Third Experiment, the parameterization is realized with the mel-cepstral
coefficients and the first- and second-order differences of these coefficients, deduced
from homomorphic filtering. The error rates obtained in the vowel recognition tests
are given in Table 3, comparatively for the generalized HMM (fuzzy-HMM) and the
classical HMM.
Table 3. Error rates (%) for generalized and for classical HMMs.
Vowel    Fuzzy-HMM    Classical HMM
a        5.10%        6.90%
e        2.40%        4.80%
i        3.80%        7.30%
o        2.50%        5.90%
u        0.70%        3.90%
Mean     2.90%        5.78%
Digit Recognition
In this second sub-section, the performances obtained in digit recognition are
evaluated with the hybrid strategy (HMM-MLP) for unenrolled and enrolled
speakers.
The DDRL database (Digit Database for Romanian Language) contains speech data
from 9 speakers (6 males and 3 females), each speaker reading the 9 digits (unu, doi,
trei, patru, cinci, şase, şapte, opt, nouă). We excluded two MS and one FS from the
training database and used their data for testing.
The digit parameters were extracted by cepstral analysis, in the form of 12 mel-
frequency cepstral coefficients (MFCC). The hybrid system (HMM-MLP) consists
of 9 hybrid models corresponding to the 9 digits. Each hybrid model is made of 5
states, each state being associated with one output node of the MLP. The MLP has
one hidden layer (100 nodes, experimentally chosen) and an input layer consisting
of 12 nodes.
We compare two kinds of tests (using the DDRL database): first, with enrolled
speakers, which means that the speakers were involved both in training and testing,
and second, with unenrolled speakers, where the testing speakers were not involved
in the training.
The results obtained with the hybrid strategy are compared with those of other
learning strategies (hidden Markov models, Support Vector Machines), the
performance being assessed by the recognition rate and by the generalization
capacity [5]. The word recognition rate (WRR) is reported in Table 4.
Table 4. The WRR (%) for different learning strategies.
Learning strategy    Enrolled speakers    Unenrolled speakers
HMM-MLP    98.50%    98.30%
HMM        98.00%    97.50%
SVM        97.70%    91.70%
4 Conclusions
This paper reports a study focused on the learning strategies applied in speech
recognition, for vowel and digit recognition tasks.
(1) For vowel recognition in the Romanian language, the following conclusions can
be reported:
a) The recognition rates in the case of the MLP are higher than in the case of the
HMM. A possible explanation is the fact that the MLP training is discriminative,
while the HMM training is not, which represents a disadvantage of HMM
utilization.
b) The 2-dimensional and toroidal KM structures have the same performance; the
performance of the 1-dimensional structure is weaker. The best balanced situation
corresponds to the 2-dimensional 5×5 map, in which all neurons are associated with
a vowel to be recognized.
The performance obtained in the case of the Boolean network is unacceptable, but
the MLP performs well.
Using the fuzzy-MLP structure yields a mean improvement of 0.30% compared
with the non-fuzzy structure.
c) A mean decrease of nearly 3% in the error rate is achieved by adopting the
fuzzy-HMM instead of the probabilistic one.
(2) For digit recognition in the Romanian language, using our database (DDRL),
the following observations can be reported:
a) Choosing this approach, which combines the HMM with the MLP into a hybrid
system, is a very good solution, because the results are higher than those obtained
for the HMM and the SVM.
b) It can be seen that the SVM performance is slightly below that of the HMM, but
it is really promising, taking into account that the HMM has had the benefit of such
a long refinement time.
References
1. Dumitru, C.O.: Modele neurale si statistice pentru recunoasterea vorbirii. Ph.D. thesis,
Bucharest (2006).
2. Dumitru, C.O., Gavat, I.: Vowel, Digit and Continuous Speech Recognition Based on
Statistical, Neural and Hybrid Modelling by Using ASRS_RL. EUROCON 2007, Warsaw,
Poland (2007), 856-863.
3. Gavat, I., et al.: Elemente de sinteza si recunoasterea vorbirii. Ed. Printech, Bucharest
(2000).
4. Gavat, I., Dumitru, C.O., Costache, G.: Speech Signal Variance Reduction by Means of
Learning Systems. MEDINF 2003, Craiova-Romania (2003), 68-69.
5. Gavat, I., Dumitru, O., Iancu, C., Costache, G.: Learning Strategies in Speech Recognition.
The 47th International Symposium ELMAR 2005, Zadar-Croatia (2005), 237-240.
6. Goronzy, S.: Robust Adaptation to Non-Native Accents in Automatic Speech Recognition.
Springer - Verlag Berlin Heidelberg, Germany (2002).
7. Huang, X., Acero, A., Hon, H.W.: Spoken Language Processing – A Guide to Theory,
Algorithm, and System Development. Prentice Hall (2001).
8. Juang, B.H., Furui, S.: Automatic Recognition and Understanding of Spoken Language–A
First Step Toward Natural Human–Machine Communication. Proc. IEEE, Vol.88, No.8,
(2000), 1142-1165.
9. Kohonen T.: Adaptive, Associative and Self-Organizing Function in Neural Computing.
Artificial Neural Networks, IEEE Press, Piscataway- NJ (1992), 42-51.
10. Lippmann, R. and Singer, E.: Hybrid neural network/HMM approaches to word spotting.
Proc. ICASSP ’93. Minneapolis (1993), 565-568.
11. Lippmann, R.P.: Human and Machine Performance in Speech Recognition Tasks. Speech
Communications, Vol.22, No.1 (1997), 1-15.
12. Mohamed, M. and Gader, P.: Generalized hidden Markov models. IEEE Transactions on
Fuzzy Systems. (2000), 67-93.
13. Morgan, D.P., Scofield, C.L.: Neural Networks. Prentice Hall, New York (1992).
14. Valsan, Z., Gavat, I., Sabac, B., Cula, O., Grigore, O., Militaru, D., Dumitru, C.O.: Statisti-
cal and Hybrid Methods for Speech Recognition in Romanian. International Journal of
Speech Technology, Vol.5, No.3 (2002), 259-268.
15. Young, S.J.: The general use of tying in phoneme-based HMM speech recognizers. Proc.
ICASSP’92, Vol.1, San Francisco (1992), 569-572.
16. Wang, Z. and Klir, G.: Fuzzy Measure Theory. New York: Plenum (1992).