Influence of Different Phoneme Mappings on the Recognition Accuracy of Electrolaryngeal Speech

Petr Stanislav and Josef V. Psutka

Department of Cybernetics, University of West Bohemia, Univerzitní 8, 306 14 Pilsen, Czech Republic

Keywords:

Automatic Speech Recognition, Laryngectomees, Electrolaryngeal Speech, Phoneme Mapping.

Abstract:

This paper presents the initial steps towards building a speech recognition system that is able to efficiently process electrolaryngeal substitute speech produced by laryngectomees. Speakers after total laryngectomy are characterized by restricted aero-acoustic properties in comparison with normal speakers, and their speech is therefore far less intelligible. We proposed and tested several approaches to acoustic modeling within the ASR system that would be able to cope with this lower intelligibility. Comparative experiments were also performed on a healthy speaker. We tried several mappings that unify unvoiced phonemes with their voiced counterparts in the acoustic modeling process at both the monophone and the triphone level. Systems using zerogram and trigram language models were evaluated and compared in order to increase the credibility of the results.

1 INTRODUCTION

Malignant disease of the vocal folds does not occur as often as, for example, breast cancer or lung cancer. However, if treatment is not successful, the consequences of this illness can be very serious. In extreme cases, a total laryngectomy (which includes the removal of the vocal folds) is performed. A person who undergoes this surgery is therefore not able to speak in the standard way.

There are several methods of restoring speech for total laryngectomees. Esophageal speech is among the most common methods used for speech restoration. The idea is based on releasing gas from the esophagus instead of the lungs. Another method uses a tracheoesophageal prosthesis that connects the trachea with the esophagus. The air passing into the pharynx causes the required vibrations and the utterance can be created. Another option for producing the necessary excitation is to use an external device, the electrolarynx.

This paper describes the influence of Czech phoneme/triphone mapping on the speech recognition accuracy for a total laryngectomee. The results are compared with the speech recognition accuracy for a healthy person using the same phoneme/triphone mapping. Section 2 describes the difference between the speech production of a nonlaryngectomee and a laryngectomee speaker. Section 3 explains the process of acoustic model creation and the principle of the phoneme mapping. Section 4 presents the obtained results and Section 5 concludes the paper.

2 TOTAL LARYNGECTOMEES

Total laryngectomy is a surgery during which the vocal folds affected by cancer are removed. The differences between a healthy speaker and a speaker without vocal folds are shown in Figure 1 and Figure 2, respectively. Healthy vocal folds excite a stream of air from the lungs, and the excited stream is then modulated in the nasal and oral cavities. The modulated stream comes out of the mouth as speech. In total laryngectomees, however, there is no connection between the trachea and the oral cavity. The air therefore does not flow from the lungs to the mouth but to the tracheostoma, which is used for breathing. Speech thus cannot be produced in the same way as by nonlaryngectomees (Nakamura, 2010).

One way of replacing the removed vocal folds is to use an electromechanical device called an electrolarynx. The electrolarynx is useful for total laryngectomees who have not obtained a tracheoesophageal prosthesis and have not been able to learn esophageal speech.

The basic part of this device is a battery-powered electric motor that excites a vibration plate. The patient attaches the electrolarynx either to the soft parts of the neck or to the lower jaw, and the vibrating plate substitutes for the missing vocal fold vibrations, as shown in Figure 3. This method is easy to master: after a very short time, the speaker is able to produce continuous speech. Yet the method still has some flaws, for example the monotonous mechanical voice of the speaker or poor speech intelligibility in a noisy environment due to the constant volume level. Moreover, the electrolarynx creates a continuous sound that might be irritating, and one hand of the speaker is always occupied holding the device while speaking.

Figure 1: Scheme of ingestion and breathing for nonlaryngectomees.

Figure 2: Scheme of ingestion and breathing for total laryngectomees.

Stanislav P. and V. Psutka J.
Influence of Different Phoneme Mappings on the Recognition Accuracy of Electrolaryngeal Speech.
DOI: 10.5220/0004129502040207
In Proceedings of the International Conference on Signal Processing and Multimedia Applications and Wireless Information Networks and Systems (SIGMAP-2012), pages 204-207
ISBN: 978-989-8565-25-9
© 2012 SCITEPRESS (Science and Technology Publications, Lda.)

3 ACOUSTIC MODELS

The aim of this paper was to verify the assumption that laryngectomees who use an electrolarynx produce voiced phonemes only (because the electrolarynx provides continuous excitation). Source data were obtained from two female speakers: one with a healthy voice and one total laryngectomee. The latter had undergone total laryngectomy 10 years earlier and was able to speak with the electrolarynx in long utterances.

Figure 3: Usage of electrolarynx.

Both women read 5000 sentences, which amounts to more than 10 hours of speech from each speaker. The source texts were selected from a database created from the web pages of Czech newspaper publishers (Radová and Psutka, 2000). Special consideration was given to the sentence selection, as we wanted to have a representative distribution of the more frequent triphone sequences (reflecting their relative occurrence in natural speech). The corpus was recorded in an office where only the speaker was present.

The analogue signal was captured with a DPA miniature omnidirectional microphone and digitized at a 44.1 kHz sample rate with 16-bit resolution. The front-end used MFCC parameterization with 26 filters and 12 cepstral coefficients plus energy, together with delta and delta-delta features (see (Psutka et al., 2007) for the methodology). One feature vector therefore contains 39 coefficients. Feature vectors are computed every 10 milliseconds (100 frames per second).
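The arithmetic behind the 39-dimensional feature vector and the frame rate can be sketched as follows. This is only an illustrative calculation; the function names are hypothetical, and the frame count ignores the analysis window length, which the text does not state.

```python
def feature_dim(n_cepstra=12, use_energy=True, n_derivative_orders=2):
    """Size of one feature vector: static coefficients plus their derivatives.

    n_derivative_orders=2 means delta and delta-delta are appended.
    """
    static = n_cepstra + (1 if use_energy else 0)  # 12 MFCCs + energy = 13
    return static * (1 + n_derivative_orders)      # 13 * 3 = 39


def frames_for(duration_s, frame_shift_ms=10):
    """Approximate number of feature vectors for a recording.

    Edge effects of the analysis window are ignored (assumption).
    """
    return int(duration_s * 1000 // frame_shift_ms)


print(feature_dim())      # 39
print(frames_for(1))      # 100
```

With these settings, one hour of speech corresponds to roughly 360,000 feature vectors.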

A laryngectomee who uses an electrolarynx for speech production is not able to speak when the device is off. And since the vibrating plate provides constant excitation, it is not possible for the speaker to produce unvoiced phonemes. This assumption was verified by recording isolated Czech words that differ only in voicing: 'koza' and 'kosa' ('goat' and 'scythe' in English), where 'z' is voiced and 's' is unvoiced. There was no audible difference between the two utterances. A comparison of the acoustic properties did not reveal any significant difference between the analyzed words either. Therefore, selected unvoiced triphones/phonemes were replaced by the corresponding voiced ones (see Table 1) in the acoustic modeling process.

InfluenceofDifferentPhonemeMappingsontheRecognitionAccuracyofElectrolaryngealSpeech

205

Table 1: Corresponding pairs of phonemes.

Unvoiced phoneme    Voiced phoneme
f                   v
k                   g
s                   z
š                   ž
t                   d
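The replacement of unvoiced units by voiced ones can be illustrated with a small sketch. It is not the authors' tooling: the phoneme symbols are plain orthographic stand-ins, the pair 'ť' → 'ď' is taken from the mapping list in Section 4 rather than from Table 1, the triphone label format `l-c+r` is the HTK-style convention and is assumed here, and all function names are hypothetical.

```python
# Individual mappings as tested in the paper; keys are unvoiced, values voiced.
MAPPINGS = {
    "baseline": {},
    "f->v": {"f": "v"},
    "k->g": {"k": "g"},
    "s,š->z,ž": {"s": "z", "š": "ž"},
    "t,ť->d,ď": {"t": "d", "ť": "ď"},
}
# "All voiced" is the union of the individual mappings.
MAPPINGS["all voiced"] = {u: v for m in list(MAPPINGS.values()) for u, v in m.items()}


def map_phonemes(phonemes, mapping):
    """Replace each phoneme that has an entry in `mapping`; keep the rest."""
    return [mapping.get(p, p) for p in phonemes]


def map_triphone(label, mapping):
    """Map every phoneme in a label of the form 'l-c+r', 'c+r', 'l-c' or 'c'."""
    left, has_left, rest = label.rpartition("-")   # no '-' gives ('', '', label)
    centre, has_right, right = rest.partition("+")
    out = mapping.get(centre, centre)
    if has_left:
        out = mapping.get(left, left) + "-" + out
    if has_right:
        out = out + "+" + mapping.get(right, right)
    return out


all_voiced = MAPPINGS["all voiced"]
print(map_phonemes(list("kosa"), all_voiced))  # ['g', 'o', 'z', 'a']
print(map_phonemes(list("koza"), all_voiced))  # ['g', 'o', 'z', 'a']
print(map_triphone("k-o+s", all_voiced))       # g-o+z
```

Under the full mapping, 'kosa' and 'koza' receive the same phoneme sequence, which mirrors the observation that the two words were indistinguishable in electrolaryngeal speech.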

4 EXPERIMENTS

Two different approaches were tested and compared (for both training corpora). In the first one the basic speech unit was the monophone, in the second one the triphone. In all our experiments each basic speech unit was represented by a three-state HMM with a continuous output probability density function assigned to each state. As the number of Czech triphones is large, phonetic decision trees were used to tie the states of the triphones. Several experiments were performed to determine the best recognition results with respect to the number of clustered states and to the number of mixtures. The prime Gaussian triphone/monophone acoustic models, trained with the Maximum Likelihood (ML) criterion, were built with the HTK Toolkit v.3.4.
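To give a feeling for why the triphone inventory is large and how merging phonemes shrinks it, the following sketch counts all logical triphones over a phoneme set. The six-phoneme toy set is hypothetical and is not the Czech inventory; no phonotactic filtering or state tying is modeled.

```python
from itertools import product


def triphone_inventory(phonemes):
    """All logical triphone labels 'l-c+r' over a phoneme set."""
    return {f"{l}-{c}+{r}" for l, c, r in product(sorted(phonemes), repeat=3)}


def merged_inventory(phonemes, mapping):
    """Triphone inventory after replacing phonemes according to `mapping`."""
    reduced = {mapping.get(p, p) for p in phonemes}
    return triphone_inventory(reduced)


toy = {"a", "o", "k", "g", "s", "z"}            # toy set, assumption
full = triphone_inventory(toy)
merged = merged_inventory(toy, {"k": "g", "s": "z"})
print(len(full), len(merged))                    # 216 64
```

The count grows with the cube of the number of phonemes, so removing even a few units reduces the number of triphones that must be modeled or tied.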

Special systems using phoneme mapping were built for the recognition tests. The main idea of the experiment rests on the assumption that all phonemes produced with the electrolarynx are voiced. In that case, no difference should be detected between the results of the system without mapping and the systems with phoneme mapping. In specific cases the recognition accuracy could even improve, because the system does not use the full phonetic set and its perplexity is therefore reduced.

Conversely, for nonlaryngectomees the reduction of the phonetic set could reduce the accuracy. Recall that the source data were chosen with an emphasis on including all Czech triphones/monophones in a corresponding representation.

The test set consists of 500 sentences for each corpus (nonlaryngectomee and laryngectomee speech). This portion of sentences (10% of the original training set) contains approximately 1 hour of speech for each speaker. In all recognition experiments, both a zerogram-based and a trigram-based language model were applied in order to judge the quality of the developed acoustic models. The perplexity of the zerogram language model was 2885 (in other words, the recognition lexicon contained 2885 words) and there were no OOV words. The trigram language models were trained with the SRI Language Modeling Toolkit (Stolcke, 2002) using modified Kneser-Ney smoothing, which proved to be efficient in our previous language modeling experiments (Pražák et al., 2008). We collected a large corpus containing data from newspapers (520 million tokens), web news (350 million tokens), subtitles (200 million tokens) and transcriptions of some TV programs (175 million tokens). The model contained the 360K most frequent words, with OOV amounting to 3.8%. The perplexity of the recognition task was 3380.
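The statement that the zerogram perplexity equals the lexicon size follows directly from the definition of perplexity, since a zerogram model assigns the same probability 1/|V| to every word. A minimal numerical check, with hypothetical function names and an arbitrary test length of 50 words:

```python
import math


def perplexity(word_probs):
    """Perplexity of a word sequence given the model probability of each word."""
    n = len(word_probs)
    log_sum = sum(math.log2(p) for p in word_probs)
    return 2 ** (-log_sum / n)


def zerogram_probs(n_words, vocab_size):
    """A zerogram model gives every word the uniform probability 1/|V|."""
    return [1.0 / vocab_size] * n_words


pp = perplexity(zerogram_probs(50, 2885))
print(round(pp))  # 2885
```

The result does not depend on the length or content of the test text, which is why the zerogram setting isolates the contribution of the acoustic model.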

The assumption was verified using triphone- and monophone-based acoustic models for speech recognition. All models were created for both speakers. First, the baseline acoustic model without mapping was created. Then a model that maps all unvoiced triphones/monophones to their voiced counterparts was built. To identify the influence of each phoneme on system accuracy, four more models were built:

• acoustic model with mapping 'f' on 'v';

• acoustic model with mapping 'k' on 'g';

• acoustic model with mapping 's', 'š' on 'z', 'ž';

• acoustic model with mapping 't', 'ť' on 'd', 'ď'.

To verify our assumptions, 24 acoustic models were created (6 monophone models and 6 triphone models for each speaker). The recognition accuracy obtained with the monophone models is given in Table 2 for the zerogram-based language model and in Table 3 for the trigram language model with the 360K-word lexicon. Tables 4 and 5 give the corresponding results for the triphone models.
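The experimental grid can be enumerated as a quick consistency check: 2 speakers, 2 unit types and 6 mapping variants give the 24 acoustic models, and evaluating each with 2 language models gives the 48 accuracy values reported in Tables 2 to 5. The labels below are illustrative names only.

```python
from itertools import product

SPEAKERS = ("laryngectomee", "nonlaryngectomee")
UNITS = ("monophone", "triphone")
MAPPINGS = ("baseline", "f->v", "k->g", "s,š->z,ž", "t,ť->d,ď", "all voiced")
LANGUAGE_MODELS = ("zerogram", "trigram")

# One acoustic model per (speaker, unit, mapping) combination.
models = list(product(SPEAKERS, UNITS, MAPPINGS))
# Each acoustic model is decoded with both language models.
runs = list(product(models, LANGUAGE_MODELS))

print(len(models), len(runs))  # 24 48
```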

From these tables it can be seen that every change of the phonetic set reduces the recognition accuracy for the nonlaryngectomee. For the total laryngectomee, however, this assumption cannot be clearly confirmed. The computed results show that accuracy decreases when all unvoiced monophones/triphones are replaced by voiced ones. Results of the same character were obtained for the phoneme mappings 't', 'ť' → 'd', 'ď' and 'f' → 'v'. Conversely, when 'k' was replaced by 'g', a higher recognition accuracy was obtained than for the baseline model. For the replacement 's', 'š' → 'z', 'ž' the results were not clear. Further work will therefore focus on this problem.
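The comparison against the baseline can be made explicit by computing accuracy differences from the values in Tables 2 to 5. The sketch below only rearranges the published numbers; the dictionary keys and the function name are illustrative.

```python
# Accuracy [%] copied from Tables 2 to 5 as (laryngectomee, nonlaryngectomee).
ROWS = ["Baseline", "f->v", "k->g", "s,š->z,ž", "t,ť->d,ď", "All voiced"]
TABLES = {
    "monophone/zerogram": [(83.05, 91.35), (83.05, 89.96), (83.10, 90.58),
                           (83.71, 88.77), (82.47, 90.05), (82.78, 86.58)],
    "monophone/trigram": [(84.92, 87.47), (84.51, 87.42), (85.50, 86.36),
                          (84.75, 84.81), (84.38, 86.38), (84.34, 83.77)],
    "triphone/zerogram": [(82.60, 92.66), (82.23, 92.41), (83.30, 92.57),
                          (83.28, 92.28), (82.13, 92.28), (82.18, 91.03)],
    "triphone/trigram": [(87.65, 95.80), (87.51, 95.46), (88.38, 95.55),
                         (88.31, 95.07), (87.60, 95.39), (86.97, 94.53)],
}


def deltas(table):
    """Difference to the baseline row, in percentage points, per mapping."""
    base_l, base_n = table[0]
    return {
        name: (round(l - base_l, 2), round(n - base_n, 2))
        for name, (l, n) in zip(ROWS[1:], table[1:])
    }


for setup, table in TABLES.items():
    print(setup, deltas(table))
```

In all four setups every mapping lowers the accuracy for the nonlaryngectomee, while 'k' → 'g' raises it for the laryngectomee.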

Table 2: Accuracy of the ASR system with monophone acoustic models and zerogram-based language model for the laryngectomee and nonlaryngectomee speakers.

Acoustic model               Laryng. [%]   Nonlaryng. [%]
Baseline                     83.05         91.35
'f' → 'v'                    83.05         89.96
'k' → 'g'                    83.10         90.58
's', 'š' → 'z', 'ž'          83.71         88.77
't', 'ť' → 'd', 'ď'          82.47         90.05
All voiced                   82.78         86.58

Table 3: Accuracy of the ASR system with monophone acoustic models and trigram-based language model containing 360k words for the laryngectomee and nonlaryngectomee speakers.

Acoustic model               Laryng. [%]   Nonlaryng. [%]
Baseline                     84.92         87.47
'f' → 'v'                    84.51         87.42
'k' → 'g'                    85.50         86.36
's', 'š' → 'z', 'ž'          84.75         84.81
't', 'ť' → 'd', 'ď'          84.38         86.38
All voiced                   84.34         83.77

Table 4: Accuracy of the ASR system with triphone acoustic models and zerogram-based language model for the laryngectomee and nonlaryngectomee speakers.

Acoustic model               Laryng. [%]   Nonlaryng. [%]
Baseline                     82.60         92.66
'f' → 'v'                    82.23         92.41
'k' → 'g'                    83.30         92.57
's', 'š' → 'z', 'ž'          83.28         92.28
't', 'ť' → 'd', 'ď'          82.13         92.28
All voiced                   82.18         91.03

Table 5: Accuracy of the ASR system with triphone acoustic models and trigram-based language model containing 360k words for the laryngectomee and nonlaryngectomee speakers.

Acoustic model               Laryng. [%]   Nonlaryng. [%]
Baseline                     87.65         95.80
'f' → 'v'                    87.51         95.46
'k' → 'g'                    88.38         95.55
's', 'š' → 'z', 'ž'          88.31         95.07
't', 'ť' → 'd', 'ď'          87.60         95.39
All voiced                   86.97         94.53

5 CONCLUSIONS

We have presented our initial investigations into the challenging problem of transcribing electrolaryngeal substitute speech of total laryngectomees. We have focused on the problem of voiced and unvoiced phonemes. The test results for both monophone- and triphone-based acoustic models showed that the substitution of all unvoiced phonemes by voiced ones decreased recognition accuracy for both language models. On the other hand, there were phoneme substitutions (e.g. 'k' → 'g') that increased the accuracy. An interesting question is why, for instance, the substitution 's', 'š' → 'z', 'ž' can give better recognition results, relative to the baseline acoustic models, with monophone-based than with triphone-based acoustic models. This may be due to the more complex phonetic structure of the triphone-based acoustic model, which can represent small differences between phonemes in different contexts even if they are pronounced as voiced sounds. We would like to investigate such issues in our future work.

ACKNOWLEDGEMENTS

This work was supported by the European Regional Development Fund (ERDF), project "New Technologies for Information Society" (NTIS), European Centre of Excellence, ED1.1.00/02.0090, and by the grant of the University of West Bohemia, project No. SGS-2010-054.

REFERENCES

Nakamura, K. (2010). Doctoral Thesis: Speaking Aid System Using Statistical Voice Conversion for Electrolaryngeal Speech. PhD thesis, Japan.

Pražák, A., Ircing, P., Švec, J., and Psutka, J. V. (2008). Efficient combination of n-gram language models and recognition grammars in real-time LVCSR decoder. In 9th International Conference on Signal Processing Proceedings, pages 587–591, Peking, China. IEEE.

Psutka, J. V. et al. (2007). Searching for a robust MFCC-based parameterization for ASR application. In SIGMAP 2007: Proceedings of the Second International Conference on Signal Processing and Multimedia Applications, pages 196–199.

Radová, V. and Psutka, J. (2000). UWB-S01 corpus: A Czech read-speech corpus. In Proceedings of the 6th International Conference on Spoken Language Processing.

Stolcke, A. (2002). SRILM - an extensible language modeling toolkit. In International Conference on Spoken Language Processing.

InfluenceofDifferentPhonemeMappingsontheRecognitionAccuracyofElectrolaryngealSpeech

207