3.1 Acoustic Processing
The analogue input speech signal is digitized at a 44.1
kHz sampling rate with 16-bit resolution. The
aim of the front-end processor is to convert the continuous
acoustic signal into a sequence of feature vectors.
We performed experiments with MFCC and PLP
parameterizations; see (J. Psutka, 2001) for the methodology.
The best results were achieved using 27 filters
and 12 PLP cepstral coefficients with both delta
and delta-delta sub-features. Feature vectors are computed
at a rate of 100 frames per second.
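As an illustration, the following sketch computes feature vectors with analogous settings (27 filter-bank channels, 12 cepstral coefficients, delta and delta-delta sub-features, 100 frames per second). It is not the system's actual front end: it uses MFCC via librosa as a stand-in, since a PLP implementation is not assumed to be available, and the window length is an assumption.

import librosa
import numpy as np

SR = 44100            # 44.1 kHz sampling rate
HOP = SR // 100       # 441-sample hop -> 100 frames per second

def extract_features(wav_path):
    y, _ = librosa.load(wav_path, sr=SR)
    # 12 cepstral coefficients from a 27-channel filter bank
    c = librosa.feature.mfcc(y=y, sr=SR, n_mfcc=12, n_mels=27,
                             n_fft=1024, hop_length=HOP)
    d = librosa.feature.delta(c)             # delta sub-features
    dd = librosa.feature.delta(c, order=2)   # delta-delta sub-features
    return np.vstack([c, d, dd]).T           # one 36-dimensional vector per frame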
Each basic speech unit is represented
by a three-state HMM with a continuous output probability
density function assigned to each state. In this
task, we use a mixture of only 8 multivariate Gaussians
for each state. The choice of an appropriate basic
speech unit with respect to the recognition network
structure and its decoding is discussed later.
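For illustration only, the output probability of a single HMM state modelled by a mixture of 8 multivariate Gaussians can be evaluated as sketched below; diagonal covariances are assumed here, which the text does not specify.

import numpy as np
from scipy.special import logsumexp

def state_log_likelihood(x, weights, means, variances):
    """x: (D,) feature vector; weights: (8,); means, variances: (8, D)."""
    log_norm = -0.5 * np.log(2.0 * np.pi * variances).sum(axis=1)
    log_exp = -0.5 * (((x - means) ** 2) / variances).sum(axis=1)
    return logsumexp(np.log(weights) + log_norm + log_exp)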
3.2 Recognition Network
Our LVCSR system uses a lexical tree (phonetic prefix
tree) structure to represent the acoustic baseforms
of all words in the system vocabulary. In a lexical
tree, words that share the same initial portion of their
phonetic transcriptions share the corresponding tree nodes.
This can dramatically reduce the search space for a large vocabulary,
especially for inflectional languages such as Czech, with
many words derived from the same stem. Automatic
phonetic transcription (with pronunciation exceptions
defined separately) is applied to all words of the system
vocabulary, and the resulting word baseforms for all
pronunciation variants are added to the lexical tree.
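A minimal sketch of such a prefix tree is given below; the class and field names are illustrative, not those of our implementation. Words whose phonetic transcriptions start with the same phones share the corresponding nodes.

class TreeNode:
    def __init__(self, phone=None):
        self.phone = phone
        self.children = {}   # next phone -> TreeNode
        self.words = []      # words whose baseform ends in this node

class LexicalTree:
    def __init__(self):
        self.root = TreeNode()

    def add_baseform(self, word, phones):
        node = self.root
        for p in phones:
            node = node.children.setdefault(p, TreeNode(p))
        node.words.append(word)   # one pronunciation variant of `word`

# Inflected Czech-like forms share a long common prefix of phones:
tree = LexicalTree()
tree.add_baseform("praha", ["p", "r", "a", "h", "a"])
tree.add_baseform("prahy", ["p", "r", "a", "h", "i"])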
To model the pronunciation of words more accurately,
we use triphones (context-dependent phonemes) as the
basic speech units. With a triphone lexical tree
structure, the in-word triphone context can be
implemented easily. However, the
full triphone cross-word context requires a fan-out
implementation in which all cross-word context
triphones are generated for every tree leaf. This results in enormous
memory requirements and vast computational
demands. To meet the real-time operation
requirement, we have proposed an approximation of the
triphone cross-word context.
One possible approach is to use monophones
(context-independent phonemes) instead of
triphones at the word boundaries. However, this
requires training two different types of
acoustic model units and mutually normalizing the
monophone and triphone likelihoods. To cope with
this problem, we use only triphone state likelihoods
and merge the triphone states corresponding to the
same monophone within a given phone context. As
the system vocabulary is limited, not all right and
left cross-word contexts have to be modeled. This
approach results in so-called biphones that represent
merged triphone states with only one given context:
the right context in the root and the left context in the leaves. The biphone
likelihood is computed as the mean of the likelihoods
of the merged triphone states. The proposed biphone
cross-word context is a better approximation
than a simple replacement of triphones by
monophones at the word boundaries. Moreover,
this approach increases neither the recognition network
complexity nor the decoding time; it only lengthens
the offline creation of the recognition network.
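The following sketch illustrates the state-level merging described above: a biphone state stands for all triphone states that share the same centre phone and the same fixed context (right context in a root, left context in a leaf), and its likelihood is the mean of their likelihoods, computed here in the log domain. This is an illustration of the idea, not our implementation.

import numpy as np
from scipy.special import logsumexp

def biphone_state_loglik(merged_triphone_logliks):
    """Log-likelihoods (for one frame) of the triphone states merged into this
    biphone state, i.e. triphones differing only in the free cross-word context.
    Returns the log of the mean of their likelihoods."""
    lls = np.asarray(merged_triphone_logliks)
    return float(logsumexp(lls) - np.log(len(lls)))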
3.3 Recognition Network Decoder
Since a bigram language model is applied in the
first pass, a lexical tree copy is required for each predecessor
word. The lexical tree decoder uses a time-synchronous
Viterbi search with token passing and effective
beam pruning techniques applied to re-entrant
copies of the lexical tree. Beam pruning is used
both inside and at the level of the lexical tree copies,
but a sudden increase of hypothesis log-likelihoods
occurs when language model probabilities are applied
at word-to-word (lexical tree to lexical tree) transitions.
Fortunately, the language model knowledge can be applied
early by factorizing language model probabilities along
the lexical tree. Because several words share
the same initial part of their phonetic transcriptions,
only the maximum of their language model
probabilities is propagated towards the root of the
lexical tree during the factorization. In addition, the commonly
used linear transformation of language model
log-likelihoods is carried out for optimal weighting of
the language and acoustic models.
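As an illustration of this factorization, the sketch below (reusing the TreeNode fields from the lexical tree sketch above) stores in every node the maximum LM log-probability of all words reachable below it and derives the increment applied when a token enters that node; the maximum stored in the root is applied once when a token enters the tree copy. The lm_scale parameter stands for the linear weighting of LM log-likelihoods; the exact form of the transformation is an assumption here.

import math

def factorize(node, lm_logprob, lm_scale=1.0):
    """lm_logprob: word -> LM log-probability given this tree copy's predecessor."""
    best = max((lm_logprob[w] for w in node.words), default=-math.inf)
    for child in node.children.values():
        best = max(best, factorize(child, lm_logprob, lm_scale))
    node.lm_max = best
    for child in node.children.values():
        # increment added to a hypothesis when its token enters `child`
        child.lm_increment = lm_scale * (child.lm_max - node.lm_max)
    return best

def word_end_correction(node, word, lm_logprob, lm_scale=1.0):
    # replace the smeared maximum by the exact LM score of the finished word
    return lm_scale * (lm_logprob[word] - node.lm_max)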
To meet the real-time operation requirement,
an effective method for managing lexical tree copies
is implemented. The algorithm controls transitions
between lexical tree copies as well as the creation
and discarding of the copies. The number of lexical tree copies
that can be decoded in real time is limited, so the control
algorithm keeps only the most promising hypotheses and
avoids their undesirable alternation, which protects
the decoding process from the time-consuming creation
of lexical tree copies. The algorithm also manages
and records the tokens passed among lexical tree copies
in order to identify the best path at the end of the
decoding. In addition, for word graph generation
not only the best but several (n-best) word-to-word
transitions are stored. The HTK Standard Lattice Format
(S. Young, 1999) is used to store the word graph.
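The sketch below is only an illustrative cap-and-reuse scheme, not the algorithm actually used: a tree copy is keyed by its predecessor word and reused when it already exists, a new copy is created only while an assumed limit on simultaneously decoded copies permits, and otherwise either the least promising copy is discarded or the incoming hypothesis is pruned.

MAX_COPIES = 64   # assumed limit on tree copies decodable in real time

def request_tree_copy(active, predecessor_word, entry_score, make_copy):
    """active: predecessor word -> (best entry log-score, tree copy)."""
    if predecessor_word in active:
        score, copy = active[predecessor_word]
        active[predecessor_word] = (max(score, entry_score), copy)
        return copy
    if len(active) >= MAX_COPIES:
        worst = min(active, key=lambda w: active[w][0])
        if active[worst][0] >= entry_score:
            return None                  # incoming hypothesis is pruned
        del active[worst]                # discard the least promising copy
    copy = make_copy(predecessor_word)
    active[predecessor_word] = (entry_score, copy)
    return copy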