FOUR-PHASE RE-SPEAKER TRAINING SYSTEM
Aleš Pražák, Zdeněk Loose, Josef Psutka, Vlasta Radová
Department of Cybernetics, University of West Bohemia, Plzeň, Czech Republic
Luděk Müller
SpeechTech s.r.o., Plzeň, Czech Republic
Keywords: Speech recognition, LVCSR, Online captioning, Re-speaker training, Application.
Abstract: Since the re-speaker approach to automatic captioning of TV broadcasts using large vocabulary
continuous speech recognition (LVCSR) is on the rise, there is a growing demand for training systems
that allow new speakers to learn the procedure. This paper describes a specially designed re-speaker
training system that provides a gradual four-phase tutoring process with quantitative indicators of the
trainee's progress, enabling faster (and thus cheaper) training of re-speakers. A performance evaluation
of three re-speakers who were trained on the proposed system is also reported.
1 INTRODUCTION
With the rise of computer technology and the progress of the probabilistic approach to large vocabulary continuous speech recognition (LVCSR), this technology suggests itself for automatic or semi-automatic captioning of live TV broadcasting. There are, in general, two ways to use speech recognition for live TV captioning.
The first is direct recognition of the audio stream of a TV program. This approach is usable only for very specific TV programs with defined acoustic characteristics, minimal non-speech sounds, a limited domain of discourse and a specific manner of speech. Typical tasks for this approach are fully automatic captioning of parliament meetings or broadcast news. Parliamentary procedure, as well as broadcast news design, in almost all countries ensures a stable acoustic environment and the cultivated speech of only one person at a time, so the recognition accuracy can be high enough for trouble-free reading and understanding of the captions (Neto et al. 2008). We have operated such a system together with Czech Television, the public service broadcaster in the Czech Republic, for almost four years (Pražák et al. 2007).
The second approach to semi-automatic captioning of live TV broadcasting uses the so-called "re-speaker" (or shadow-speaker) technique. The re-speaker is a skilled and specifically trained speaker who listens to the original dialogues of a TV program and re-speaks them, optionally in his/her own words. This approach is suitable for arbitrary TV programs, especially programs with several speakers talking simultaneously or with a noisy acoustic environment, such as TV debates or sports programs. It also simplifies the LVCSR task significantly: the re-speaker works in a quiet environment, uses a well-defined acoustic channel and produces refined speech. Moreover, the acoustic model of the LVCSR system can be personalized for the given speaker. Because the re-speaker is allowed to use his/her own words, the final captions can be shorter (easily readable) and more comprehensible for the hearing-impaired. In effect, this represents a translation from one language into itself.
Probably the first broadcasting company to introduce LVCSR technology into the real caption generation process was the BBC in 2003 (Evans 2003). Since then, similar systems have been developed and put into production use in several countries around the world (Boulianne et al. 2006; Homma et al. 2008).
2 TRAINING SYSTEM
It is impractical to bring a re-speaker to the real
captioning system and let him/her do the real job of a skilled re-speaker, with the resulting captions discarded, for a few months. It is more effective to develop a training system that shortens (and thus cheapens) the training process. We have developed a special training system for re-speakers that provides a gradual training process under the surveillance of a skilled supervisor, with quantitative indicators of the trainees' progress, enabling an easier and more objective decision about their suitability for re-speaking.
2.1 Overview
The proposed training system is a multi-user system that uses a real-time LVCSR system to create captions from the re-speaker's dictation and keyboard commands. We use an in-house LVCSR system based on Hidden Markov Models (HMMs), lexical (phonetic prefix) trees and a trigram language model. The implementation is focused on low-latency real-time operation with very large vocabularies on multi-core systems. Thanks to highly efficient decoder parallelization and graphics processing unit (GPU) utilization, we can recognize more than 500,000 words in real time on a four-core notebook. This is very important for allowing intensive re-speaker training under all conditions with the required recognition accuracy. The system also supports word-graph generation for confidence measure computation.
After the system starts, the re-speaker chooses his/her profile and the microphone volume is set automatically. This is accomplished in two steps: in the first step, the volume on silence is measured (to filter out the background noise); in the second step, the optimal volume of speech is set. The soundtrack of a TV program is played during the whole process to simulate the real training conditions that influence the re-speaker's utterance.
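For illustration, a minimal sketch of such a two-step calibration in Python follows; the 6 dB gate margin, the -20 dBFS speech target and all function names are our own assumptions, not values taken from the actual system.

import numpy as np

def rms_dbfs(samples):
    """RMS level of a float signal in dB relative to full scale (dBFS)."""
    rms = np.sqrt(np.mean(np.square(samples)) + 1e-12)
    return 20.0 * np.log10(rms)

def calibrate(silence, speech, target_speech_dbfs=-20.0):
    """Two-step microphone calibration: (1) derive a noise gate from a
    recording of silence, (2) derive an input gain so that speech reaches
    the target level. Returns (gate_threshold_dbfs, gain_db)."""
    gate_threshold = rms_dbfs(silence) + 6.0   # gate slightly above the noise floor
    gain_db = target_speech_dbfs - rms_dbfs(speech)
    return gate_threshold, gain_db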
The re-speaker then chooses one of four training phases according to his/her training progress. The first training phase trains the re-speaker's skill to listen and speak simultaneously. The second phase assists in optimizing the re-speaker's utterance to the demands of the LVCSR system. The third training phase enables free re-speaking with the support of some keyboard commands, and finally, the fourth phase simulates the real captioning system with all its features, such as manual punctuation.
2.2 Training Phases
The first training phase is intended to train the re-speaker's skill to listen and speak simultaneously. The re-speaker opens a prepared video file and practices speaking while playing any part of the video. The aim is not to re-speak word by word, but to become accustomed to speaking meaningfully while listening to and perceiving the original soundtrack. This phase does not employ the LVCSR system, but all utterances are recorded for later playback by the re-speaker. Each time the re-speaker starts playing the video, a new recording is created and metadata, such as the timestamp, the position in the video and the repetition count of each segment, are logged. This allows the supervisor to trace the re-speaker's training process. Recorded utterances are played back simultaneously with the video, just as they were recorded, but the volume balance between the recorded utterances and the original soundtrack can be set at will.
The second phase of the training system assists in optimizing the re-speaker's utterance to the demands of the LVCSR system, so this phase integrates the LVCSR system and displays its output to the re-speaker. The main objective of the re-speaker is to re-speak the utterances in the original soundtrack word by word so that the recognition accuracy is as high as possible. The re-speaker just mechanically re-speaks what he/she hears, so he/she can focus on altering the utterance (mainly the pronunciation) and observing its influence on the recognition results. This implies that the video files for the second phase should contain neither overlapping speakers nor slips of the tongue nor out-of-vocabulary (OOV) words. Still, the speech rate of some speakers can be too high for some re-speakers at the beginning of the training. That is why there is an option to slow down the playback rate of the video file in real time. The WSOLA algorithm ensures that the pitch remains the same even for large changes of the playback rate (Verhelst 2000).
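The paper does not include an implementation, but the idea of WSOLA can be sketched as follows: output frames are overlap-added from input segments chosen near their nominal (rate-scaled) positions, where the exact position is the one most similar to the natural continuation of the previously copied segment. The frame size, search tolerance and coarse search step below are illustrative choices, not the system's parameters.

import numpy as np

def wsola(x, rate, frame=1024, tol=256):
    """Pitch-preserving time scaling; rate < 1 slows the playback down."""
    hop = frame // 2                    # 50% overlap with a Hann window
    win = np.hanning(frame)
    n_out = int(len(x) / rate)
    y = np.zeros(n_out + frame)
    y[:frame] = x[:frame] * win         # seed with the first input frame
    prev = 0                            # start of the last copied segment
    for out_pos in range(hop, n_out - frame, hop):
        if prev + hop + frame > len(x):
            break
        # The "natural continuation" of the previously copied segment.
        natural = x[prev + hop : prev + hop + frame]
        target = int(out_pos * rate)    # nominal analysis position
        hi = min(len(x) - frame, target + tol)
        lo = max(0, min(target - tol, hi))
        best, best_score = lo, -np.inf
        for cand in range(lo, hi + 1, 32):   # coarse step keeps the demo fast
            score = np.dot(natural, x[cand:cand + frame])
            if score > best_score:
                best_score, best = score, cand
        y[out_pos : out_pos + frame] += x[best : best + frame] * win
        prev = best
    return y[:n_out]

For example, wsola(x, rate=0.75) stretches the signal to 4/3 of its length, slowing the playback to 75% of the original speed without shifting the pitch.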
We have implemented two quantitative indicators of the re-speaker's training progress. The first one is the common recognition accuracy, supplemented with highlighting of misrecognitions in the recognized text. A Levenshtein alignment is carried out during recognition in real time, so that substitutions (underlining), insertions (strikeout) and deletions (omission triangle) can be highlighted (see Figure 1). By definition, the transcription of the video file has to be available for the second phase.
Figure 1: Phase two of the training system.
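The alignment itself is standard; the training system runs it incrementally during recognition, but a compact offline version of the computation, with the usual accuracy definition Acc = (N - S - D - I) / N, might look as follows.

import numpy as np

def align(ref, hyp):
    """Levenshtein alignment of the recognized words (hyp) against the
    reference transcription (ref); returns the edit tags to highlight
    and the recognition accuracy."""
    n, m = len(ref), len(hyp)
    d = np.zeros((n + 1, m + 1), dtype=int)
    d[:, 0] = np.arange(n + 1)
    d[0, :] = np.arange(m + 1)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = d[i - 1, j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i, j] = min(sub, d[i - 1, j] + 1, d[i, j - 1] + 1)
    tags, i, j = [], n, m
    S = D = I = 0
    while i > 0 or j > 0:   # backtrace to recover per-word edit operations
        if i > 0 and j > 0 and d[i, j] == d[i - 1, j - 1] + (ref[i - 1] != hyp[j - 1]):
            if ref[i - 1] != hyp[j - 1]:
                tags.append("SUB")      # rendered as underlining
                S += 1
            else:
                tags.append("OK")
            i, j = i - 1, j - 1
        elif j > 0 and d[i, j] == d[i, j - 1] + 1:
            tags.append("INS")          # rendered as strikeout
            I, j = I + 1, j - 1
        else:
            tags.append("DEL")          # rendered as an omission triangle
            D, i = D + 1, i - 1
    accuracy = (n - S - D - I) / n if n else 0.0
    return list(reversed(tags)), accuracy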
The second indicator of the training system is a so-called "suitability measure", expressing the suitability of the re-speaker's utterance to the LVCSR acoustic model. The suitability measure is computed only from correctly recognized words and the language model is omitted, so it expresses the re-speaker's ability to comply with the LVCSR system almost independently of the particular recognized text. In the same way as misrecognitions, words with a low suitability measure are highlighted (in a different word color) in the recognized text (see Figure 1). During the playback of the recorded utterances, the words just being played are highlighted (with a different background color), so the relationship between the played sound and the misrecognitions and suitability measures of words can be tracked. This indicator should show an increasing tendency throughout the re-speaker's training.
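The paper does not publish the exact formula of the suitability measure; the sketch below only illustrates one plausible reading, mapping the language-model-free, frame-averaged acoustic log-likelihood of each correctly recognized word onto [0, 1]. All constants are invented for the illustration.

def word_suitability(acoustic_loglik, n_frames, floor=-120.0, ceil=-60.0):
    """Map a word's frame-averaged acoustic log-likelihood (no language
    model) to a [0, 1] score; floor/ceil are illustrative constants."""
    avg = acoustic_loglik / max(n_frames, 1)
    return min(1.0, max(0.0, (avg - floor) / (ceil - floor)))

def utterance_suitability(words):
    """Average suitability over correctly recognized words only;
    `words` is a list of (tag, acoustic_loglik, n_frames) tuples."""
    scores = [word_suitability(ll, nf) for tag, ll, nf in words if tag == "OK"]
    return sum(scores) / len(scores) if scores else 0.0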
Another important task of the second phase of the training system is acoustic data gathering and acoustic model adaptation to the speaker's voice characteristics. The acoustic model adaptation is applied in two stages. The first adaptation stage is applied iteratively during recognition in real time, so the recognized text can be improved immediately. An unsupervised incremental fMLLR adaptation is carried out in the background, so the re-speaker's effort is not disturbed (Pražák et al. 2009). In the second adaptation stage, all the data gathered during the whole of the re-speaker's training are used for MAP adaptation enhanced with the SAT approach. The acoustic model adapted in this manner is then used in the real captioning system.
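A rough sketch of the data flow of the first, incremental stage is given below. Only the application of the current transform and the accumulation of statistics are shown; the actual re-estimation (the row-wise maximization of the fMLLR auxiliary function, see Pražák et al. 2009) is left as a placeholder, and all interfaces are our assumptions.

import numpy as np

class IncrementalFMLLR:
    """Unsupervised incremental fMLLR, data-flow sketch: features are
    decoded through the current affine transform x' = A @ x + b, while
    statistics gathered from the recognizer's own confident output are
    used to re-estimate the transform in the background."""

    def __init__(self, dim):
        self.A = np.eye(dim)      # start from the identity transform
        self.b = np.zeros(dim)
        self.stats = []           # accumulated adaptation statistics

    def transform(self, x):
        """Applied to every feature vector during decoding."""
        return self.A @ x + self.b

    def accumulate(self, frames, state_alignment):
        """Collect statistics from frames aligned to the HMM states of
        confidently recognized words."""
        self.stats.append((frames, state_alignment))

    def reestimate(self):
        """Placeholder: a real implementation maximizes the fMLLR
        auxiliary function over self.stats with iterative row updates."""
        ...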
The third phase of the training system is very close to the real captioning system, in order to make the re-speaker ready for real captioning. The re-speaker re-speaks video files in his/her own words and learns to use some of the features that improve the final captions. The recognition accuracy indicator is not available, since transcriptions of the re-speaker's utterances cannot be known in advance, whereas the suitability measure is still displayed to check the re-speaker's training progress. As the suitability measure should be computed only from correctly recognized words, a confidence measure estimated from the LVCSR system's output is used to guess the correctly recognized words instead of an exact transcription.
It follows from the principle of the LVCSR system that the last few words of the recognized text change as new acoustic signal is received and the best hypothesis based on the acoustic and language models is recomputed. Since only static closed captions are still required in TV broadcasting, these last so-called "pending" words (four at maximum) are ignored during caption generation, but they can be displayed to the re-speaker highlighted (in a different word color), so he/she knows which words can still be corrected (see Figure 1). To quickly correct any of the pending words (not only the last one), the best method is to erase them all and re-speak them. The best way is to use the re-speaker's idle hands and keyboard commands: the re-speaker presses a dedicated key and fluently re-speaks the erased words (Wald et al. 2007). This feature should not be ignored, because it can dramatically decrease the error rate of the final captions.
Another feature that allows the generation of high-quality captions during a real captioning session is the possibility to dispatch the pending words to the captioner to be broadcast immediately. This is very important when the re-speaker does not speak for a longer time (because of a TV jingle, or because he/she is listening ahead) and the pending words would remain unsent. It can significantly reduce the delay between the words uttered in the original soundtrack and the corresponding words in the captions.
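Both pending-word operations reduce to simple bookkeeping over the decoder's current hypothesis. The sketch below shows one possible shape of that logic; the class and its interface are our illustration, with only the four-word pending limit taken from the description above.

class PendingBuffer:
    MAX_PENDING = 4               # pending words withheld from the captions

    def __init__(self, send_caption):
        self.sent = 0             # number of words already dispatched
        self.hypothesis = []      # current decoder hypothesis
        self.send_caption = send_caption

    def update(self, hypothesis):
        """Decoder callback: everything but the last MAX_PENDING words is
        considered stable and is sent to the captioner automatically."""
        self.hypothesis = list(hypothesis)
        stable = max(len(hypothesis) - self.MAX_PENDING, self.sent)
        if stable > self.sent:
            self.send_caption(hypothesis[self.sent:stable])
            self.sent = stable

    def erase_pending(self):
        """Correction key: drop the pending words so the re-speaker can
        re-speak them fluently (the recognizer would be reset as well)."""
        self.hypothesis = self.hypothesis[:self.sent]

    def dispatch_pending(self):
        """Dispatch key: force the pending words out immediately, e.g.
        during a TV jingle when no new speech would push them out."""
        if len(self.hypothesis) > self.sent:
            self.send_caption(self.hypothesis[self.sent:])
            self.sent = len(self.hypothesis)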
The fourth phase of the training system simulates real captioning with all the features needed. In addition to erasing or dispatching the pending words, the re-speaker uses his/her hands to insert punctuation and to invoke other special functions. The mechanism for punctuation indication is closely connected with the LVCSR system. A key pressed during an inter-word pause is processed by the system, which places the punctuation symbol directly into its output. This approach allows the LVCSR system to benefit from the extra information, using a language model that treats punctuation symbols as words. A similar method is used for speaker-change marking. Other keys are reserved for special announcements and situations.
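As a sketch, the keyboard handler may be as simple as mapping keys to punctuation tokens that the language model already knows as words; the token names, the key bindings and the inject_token interface are assumptions for illustration.

PUNCT_KEYS = {",": "<comma>", ".": "<full-stop>", "?": "<question-mark>"}

def on_key_press(key, recognizer):
    """A key pressed during an inter-word pause injects the corresponding
    token into the recognizer's output stream, where the language model
    treats it as an ordinary word."""
    if key in PUNCT_KEYS:
        recognizer.inject_token(PUNCT_KEYS[key])
    elif key == ">":                          # illustrative binding
        recognizer.inject_token("<speaker-change>")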
The re-speaker is trained on real video files containing OOV words (mainly named entities) that have to be added to the LVCSR system during the captioning itself. This is a crucial procedure that must not take much time. We have implemented a method for adding words in real time during LVCSR recognition. The simplest way is to type a word (or multi-word expression) and confirm it. The system searches for the word in special large lists of named entities and, if it succeeds, adds the word with the correct phonetic transcription(s) to the LVCSR system. In the case of an unknown word, an automatic phonetic transcription is proposed and can be modified by the re-speaker. This approach minimizes the time needed for word addition in most cases.
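A sketch of this procedure follows, under assumed interfaces: the named-entity lexicon, the grapheme-to-phoneme converter g2p and recognizer.add_word are not the system's real API.

def confirm_or_edit(proposal):
    """Stand-in for the GUI dialogue where the re-speaker can correct the
    proposed transcription; here it is simply accepted."""
    return proposal

def add_word(spelling, named_entity_lexicon, g2p, recognizer):
    """Real-time vocabulary addition: look the typed word (or multi-word)
    up in the large named-entity lists; if found, use the stored
    pronunciation(s), otherwise propose an automatic grapheme-to-phoneme
    transcription for the re-speaker to confirm or edit."""
    pronunciations = named_entity_lexicon.get(spelling)
    if pronunciations is None:
        pronunciations = [confirm_or_edit(g2p(spelling))]
    for pron in pronunciations:
        recognizer.add_word(spelling, pron)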
3 EVALUATION
To assess the training capabilities of the described system, we evaluated three re-speakers who went through all the training phases. The re-speakers produced captions for the same real TV debate (61 minutes) and the final captions were evaluated. The recognition accuracy (including the pending-word corrections) and some statistics collected during the captioning are presented in Table 1.
Table 1: Evaluation of re-speakers on the TV debate.
                       NZ          EK          PZ
Training time          138 hours   98 hours    78 hours
Recognition accuracy   97.37 %     97.34 %     94.32 %
Suitability measure    85.66 %     84.41 %     82.38 %
Words                  4294        2960        4217
Word additions         8           12          12
Word corrections       83          66          73
Word dispatches        60          136         100
Commas                 319         170         287
Full stops             341         248         335
Question marks         49          48          57
New speakers           109         103         120
4 CONCLUSIONS
The proposed system facilitates re-speaker training by decomposing the education process into four gradual phases, making the training easier and thus faster. The overall training time is highly individual, but based on our experience, we expect an intensive training plan to take from 2 to 3 months (100 training hours at minimum).
As the development of our training system continues, we want to enable the synchronous training of caption correctors, who are indispensable for error-free captioning of some critical TV broadcasts.
ACKNOWLEDGEMENTS
This work was supported by the Ministry of
Education of the Czech Republic under the projects
MŠMT 2C06020 and MŠMT LC536 and by the
grant of The University of West Bohemia, project
No. SGS-2010-054.
REFERENCES
Boulianne, G., Beaumont, J.-F., Boisvert, M., Brousseau, J., Cardinal, P., Chapdelaine, C., Comeau, M., Ouellet, P., Osterrath, F., 2006. Computer-assisted closed-captioning of live TV broadcasts in French. In International Conference on Spoken Language Processing.
Evans, M. J., 2003. Speech Recognition in Assisted and
Live Subtitling for Television. WHP 065. BBC R&D
White Papers.
Homma, S., Kobayashi, A., Oku, T., Sato, S., Imai, T.,
Takagi, T., 2008. New Real-Time Closed-Captioning
System for Japanese Broadcast News Programs. In
Computers Helping People with Special Needs.
Springer.
Neto, J., Meinedo, H., Viveiros, M., Cassaca, R., Martins,
C., Caseiro, D., 2008. Broadcast news subtitling
system in Portuguese. In IEEE International
Conference on Acoustics, Speech and Signal
Processing.
Pražák, A., Müller, L., Psutka, J. V., Psutka, J., 2007.
LIVE TV SUBTITLING - Fast 2-pass LVCSR System
for Online Subtitling. In International Conference on
Signal Processing and Multimedia Applications.
Pražák, A., Zajíc, Z., Machlica, L., Psutka, J. V., 2009.
Fast Speaker Adaptation in Automatic Online
Subtitling. In International Conference on Signal
Processing and Multimedia Applications.
Verhelst, W., 2000. Overlap-add methods for time-scaling
of speech. In Speech Communication. Elsevier.
Wald, M., Bell, J.-M., Boulain, P., Doody, K., Gerrard, J., 2007. Correcting automatic speech recognition captioning errors in real time. In International Journal of Speech Technology.