states and mixtures based on manual word transcrip-
tions (supervised approach). In the case of online adaptation, where no manual transcriptions are available, the recognized word sequence is used instead (unsupervised approach). Since the recognition process is not error-free, a confidence tagging technique should be used to select only well-recognized segments of speech for speaker adaptation.
3.1 Confidence Measure
To apply online speaker adaptation as soon as possible, word confidences have to be evaluated very quickly for partial word sequences generated periodically along with the incoming acoustic signal. We use posterior word probabilities computed on the word graph as a confidence measure (Wessel et al., 2001). For fast evaluation of word confidences, the size of partial word graphs is reduced along the time axis to limit the time needed for the confidence measure evaluation. In addition, a special modification of the word graph topology is applied at the beginning and at the end of the graph for correct estimation of word confidences near the word graph ends.
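The core of such lattice-based posteriors can be illustrated with a forward-backward pass over a word graph. The sketch below is a simplification, not the paper's implementation: Wessel et al. additionally accumulate posteriors over edges carrying the same word with overlapping time spans, which is omitted here, and the function name and edge-tuple format are illustrative assumptions.

```python
import math
from collections import defaultdict

def word_posteriors(edges, start, end):
    """Compute edge (word) posteriors on a word-graph DAG.

    edges: list of (from_node, to_node, word, log_score) tuples, where
    log_score combines acoustic and language-model scores and node ids
    are assumed to be integers in topological (time) order.
    Returns {edge_index: posterior probability}.
    """
    def logsumexp(xs):
        m = max(xs)
        return m + math.log(sum(math.exp(x - m) for x in xs))

    out_edges = defaultdict(list)
    in_edges = defaultdict(list)
    for i, (u, v, _, _) in enumerate(edges):
        out_edges[u].append(i)
        in_edges[v].append(i)

    nodes = sorted({u for u, _, _, _ in edges} | {v for _, v, _, _ in edges})

    fwd = {start: 0.0}  # log probability mass of all partial paths reaching a node
    for n in nodes:
        if n == start or not in_edges[n]:
            continue
        fwd[n] = logsumexp([fwd[edges[i][0]] + edges[i][3] for i in in_edges[n]])

    bwd = {end: 0.0}    # log probability mass from a node to the graph end
    for n in reversed(nodes):
        if n == end or not out_edges[n]:
            continue
        bwd[n] = logsumexp([edges[i][3] + bwd[edges[i][1]] for i in out_edges[n]])

    total = fwd[end]    # log mass of all complete paths through the graph
    return {i: math.exp(fwd[u] + s + bwd[v] - total)
            for i, (u, v, _, s) in enumerate(edges)}
```

On a toy graph with two competing words between the same nodes, the posteriors split according to the path mass through each edge.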
3.2 Forced Alignment
The forced alignment of adaptation utterances to the HMM states and mixtures is performed only for well-recognized segments of speech. To use only trustworthy segments, we apply a rather strict word-selection criterion: only words whose confidence exceeds 0.99 and whose neighboring words also have confidence above 0.99 are selected. This ensures that the word boundaries of the selected words are assigned correctly. The forced alignment is then performed in three steps. In the first step, a state network is constructed from the phonetic transcriptions of the recognized words; a lexical tree structure is used when a word has multiple phonetic transcriptions, to reduce the network size. In the second step, a Viterbi search with beam pruning is applied to the state network to produce a state sequence corresponding to the selected words. Finally, feature vectors are assigned to the HMM state mixtures based on their posterior probability densities.
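The word-selection criterion above can be sketched as a small filter over per-word confidences. The function name is illustrative; treating a record-initial or record-final word (which has only one neighbor) as acceptable when that single neighbor passes the threshold is an assumption not stated in the text.

```python
def select_reliable_words(confidences, threshold=0.99):
    """Return indices of words usable for adaptation: the word itself
    and its existing neighbors must all exceed the confidence
    threshold, so the word's boundaries can be trusted.

    Assumption: a missing neighbor (first/last word) does not
    disqualify a word.
    """
    selected = []
    for i, c in enumerate(confidences):
        if c <= threshold:
            continue
        left_ok = i == 0 or confidences[i - 1] > threshold
        right_ok = i == len(confidences) - 1 or confidences[i + 1] > threshold
        if left_ok and right_ok:
            selected.append(i)
    return selected
```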
4 EXPERIMENTS
We performed experiments on automatic online subtitling related to a real task running at the Czech public service television. The task concerns subtitling of live transmissions of Czech Parliament meetings without the use of a shadow speaker; hence, the original speech signal was recognized directly.
4.1 Experimental Setup
An acoustic model was trained on 100 hours of parlia-
ment speech records with manual transcriptions. We
use three-state HMMs and 8 mixtures of multivariate
Gaussians for each state. The total number of 43 080
Gaussians is used for the SI model. In addition,
discriminative training techniques were used (Povey,
2003). The analogue input speech signal is digitized at a 44.1 kHz sampling rate with 16-bit resolution. We use PLP parameterization with 19 filters and
12 PLP cepstral coefficients with both delta and delta-
delta sub-features. Feature vectors are computed at
the rate of 100 frames per second.
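Appending delta and delta-delta sub-features to the 12 cepstral coefficients can be sketched with the standard regression formula d_t = sum_n n (c_{t+n} - c_{t-n}) / (2 sum_n n^2); each final vector then has three times the base dimensionality. The regression window width and the replication of edge frames are assumptions here, not details taken from the paper.

```python
def add_deltas(cepstra, window=2):
    """Append delta and delta-delta sub-features to a list of cepstral
    vectors (each a list of floats), using the standard regression
    formula over a +/-window frame context.  Frames beyond the record
    boundaries are replicated (an assumed convention)."""
    def deltas(frames):
        T, dim = len(frames), len(frames[0])
        denom = 2 * sum(n * n for n in range(1, window + 1))
        out = []
        for t in range(T):
            d = [0.0] * dim
            for n in range(1, window + 1):
                fwd = frames[min(t + n, T - 1)]  # replicate last frame at the edge
                bwd = frames[max(t - n, 0)]      # replicate first frame at the edge
                for k in range(dim):
                    d[k] += n * (fwd[k] - bwd[k]) / denom
            out.append(d)
        return out

    d1 = deltas(cepstra)        # delta sub-features
    d2 = deltas(d1)             # delta-delta sub-features
    return [c + a + b for c, a, b in zip(cepstra, d1, d2)]
```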
A language model was trained on about 24M to-
kens of normalized Czech Parliament meeting tran-
scriptions (Chamber of Deputies only) from different
electoral periods. To allow subtitling of an arbitrary (including future) electoral period, five classes for representative names in all grammatical cases were created; see (Pražák et al., 2007) for details. The vocabulary size is 177 125 words. For fast online recognition, we use a class-based bigram language model with Good-Turing discounting trained with the SRI Language Modeling Toolkit. For a more accurate confidence measure of recognized words, a class-based trigram language model is used.
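The idea behind Good-Turing discounting can be illustrated in a few lines: an n-gram observed r times receives the adjusted count r* = (r + 1) N_{r+1} / N_r, where N_r is the number of distinct n-grams seen exactly r times. The sketch below is not the SRILM implementation; the cutoff max_r above which counts are left unchanged is an assumed convention.

```python
from collections import Counter

def good_turing_discounted(counts, max_r=5):
    """Apply Good-Turing discounting to a dict of n-gram counts.

    r* = (r + 1) * N_{r+1} / N_r for small r; counts above max_r are
    treated as reliable and left unchanged, as is customary.  This is
    an illustrative sketch, not a full language-model estimator
    (e.g. it does not redistribute mass to unseen n-grams).
    """
    freq_of_freq = Counter(counts.values())  # N_r: how many n-grams occur r times
    discounted = {}
    for ngram, r in counts.items():
        if 0 < r <= max_r and freq_of_freq.get(r + 1, 0) > 0:
            discounted[ngram] = (r + 1) * freq_of_freq[r + 1] / freq_of_freq[r]
        else:
            discounted[ngram] = float(r)
    return discounted
```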
The experiments were performed on 12 test
records from different parliament speakers, 5 min-
utes each, 6 612 words in total. To simulate conditions during real subtitling, the adaptation data were accumulated from the beginning of each test record, and individual adaptation steps were performed iteratively whenever the amount of adaptation data reached a pre-specified level. Recognition accuracy was evaluated on the whole test records; thus, the influence of each adaptation step manifested itself only on the parts of a record after its application.
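This evaluation protocol can be sketched as a simple accumulation loop. The function and parameter names are hypothetical, and measuring adaptation data in frames per utterance is an assumption; the paper does not specify the unit of the pre-specified levels.

```python
def simulate_online_adaptation(frame_counts, levels):
    """Simulate the evaluation protocol: adaptation data are
    accumulated utterance by utterance from the start of a record,
    and an adaptation step fires whenever the accumulated amount
    reaches the next pre-specified level.

    frame_counts: frames of well-recognized speech per utterance.
    levels: increasing thresholds (same unit) triggering adaptation.
    Returns a list of (utterance_index, accumulated_frames) pairs
    marking where each adaptation step was applied.
    """
    adaptation_points = []
    accumulated = 0
    next_level = 0
    for utt_index, n_frames in enumerate(frame_counts):
        accumulated += n_frames
        while next_level < len(levels) and accumulated >= levels[next_level]:
            adaptation_points.append((utt_index, accumulated))
            next_level += 1
    return adaptation_points
```

From each adaptation point onward, the recognizer would use the newly updated transforms, which is why a step can only influence the parts of a record after its application.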
4.2 Online Adaptation Strategy
There are several online adaptation strategies that
come into question. Firstly, the incremental fMLLR approach should be used, since it requires only a moderate amount of adaptation data. Moreover, the number of transformation matrices should be continuously increased as the amount of adaptation data grows.
The optimum adaptation strategy should generate