2 LANGUAGE MODEL TUNING
Each sport is unique, with specific expressions, phrases and a characteristic manner of speech of the TV commentator. A language model should therefore be based on many transcriptions of the TV commentary of the given sport.
We manually transcribed 90 ice-hockey matches, both international and Czech league matches. These transcriptions contain the names of the players and teams involved in each match, but also other names that the commentators mentioned. To make a single general language model suitable for all ice-hockey matches, we would need to add the names of all ice-hockey players in the world. This would swell the vocabulary of the recognition system and slow it down; moreover, the accuracy of transcription would drop. The only practical way is to prepare a language model specifically for each match by adding only the names of the players of the two competing teams. A class-based language model serves this purpose.
During the manual transcription of the TV ice-hockey commentaries, some words were labelled with tags representing several semantic classes. The first class represents the names of the players taking part in the match. The second class is used for the names of the competing teams or countries, and the next class for the designations of sport venues (stadium, arena, etc.). Names that do not relate to the transcribed ice-hockey match (for example, legendary players like "Jagr") were not labelled, because they are more or less independent of the match. Since Czech is a highly inflectional language, a further 27 classes were used for the names in other grammatical cases and for their possessive forms.
Finally, two class-based trigram language models were trained on the above-mentioned tags instead of the individual names: one for in-game commentary and one for studio interviews.
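The tag substitution step can be sketched as follows. This is a minimal illustration, not the authors' implementation; the tag names (`<PLAYER_NOM>`, `<TEAM>`) and the tiny lexicon are hypothetical.

```python
from collections import Counter

# Hypothetical tag lexicon: labelled names map to a class token,
# unlabelled names (e.g. legendary players) map to None and stay as words.
CLASS_TAGS = {"jagr": None, "hasek": "<PLAYER_NOM>", "sparta": "<TEAM>"}

def to_class_tokens(tokens):
    """Replace tagged names with their class token; leave other words as-is."""
    return [CLASS_TAGS.get(t) or t for t in tokens]

def count_trigrams(sentences):
    """Count trigrams over the class-mapped token streams."""
    counts = Counter()
    for sent in sentences:
        toks = ["<s>", "<s>"] + to_class_tokens(sent) + ["</s>"]
        for i in range(len(toks) - 2):
            counts[tuple(toks[i:i + 3])] += 1
    return counts

sents = [["hasek", "saves", "for", "sparta"]]
trigrams = count_trigrams(sents)
print(trigrams[("<s>", "<s>", "<PLAYER_NOM>")])  # 1
```

Counting over class tokens instead of surface names is what lets the trained trigram statistics transfer to any match, once the classes are filled with that match's names.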
The manual transcriptions of the 90 commentaries contain 750k tokens with 25k unique words. These data alone cannot cover the commentary of forthcoming ice-hockey matches. To make the vocabulary and language model more robust, additional data from newspapers (175M tokens) and TV news transcriptions (9M tokens) were used. Only data with an automatically detected sport topic (Skorkovská et al., 2011) were used and mixed with the ice-hockey commentaries, with mixture weights based on the perplexity of the test data. For the in-game language model, the weights were 0.65 for ice-hockey commentaries, 0.30 for newspaper data and 0.05 for TV news transcriptions; for studio interviews, the weights were 0.20, 0.65 and 0.15, respectively. The final vocabulary of the recognition system contains 455k unique words (517k baseforms).
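The mixing described above amounts to linear interpolation of the component models. A minimal sketch, using the in-game weights from the text; the component probabilities for the example word are made-up placeholders.

```python
def interpolate(p_by_source, weights):
    """Linearly interpolate word probabilities from several language models."""
    assert abs(sum(weights.values()) - 1.0) < 1e-9
    return sum(weights[s] * p for s, p in p_by_source.items())

# Interpolation weights reported for the in-game model.
IN_GAME = {"commentary": 0.65, "newspaper": 0.30, "tv_news": 0.05}

# Hypothetical component probabilities of one word in one trigram context.
p = interpolate({"commentary": 0.02, "newspaper": 0.001, "tv_news": 0.0005},
                IN_GAME)
print(round(p, 6))  # 0.013325
```

In practice such weights are tuned so that the interpolated model minimizes perplexity on held-out data, which matches the procedure the text describes.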
Finally, before the recognition of each ice-hockey match, the language model classes have to be filled with the actual words. The names of the players of the two competing teams (the line-ups) are acquired and automatically declined into all possible word forms. Since a player can be referred to by either his full name or his surname alone, both representations are generated. The other language model classes are filled with the names of the teams and the designations of the sport venues corresponding to the given ice-hockey match.
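The class-filling step can be illustrated with a toy sketch. In the real system the word forms come from an automatic Czech declension module; here a fixed, illustrative ending table stands in for it, and the player name is invented.

```python
def decline_surname(surname):
    """Produce a few Czech case forms for a regular masculine surname (sketch).

    The ending table is illustrative only; a real morphological generator
    would cover all cases and possessive forms (the paper uses 27 extra
    classes for these).
    """
    endings = ["", "a", "ovi", "em", "ův"]  # nom., gen./acc., dat., instr., poss.
    return [surname + e for e in endings]

def fill_player_class(line_up):
    """Fill the player class with surname-only and full-name variants."""
    entries = []
    for first, last in line_up:
        for form in decline_surname(last):
            entries.append(form)               # surname alone
            entries.append(f"{first} {form}")  # full name
    return entries

forms = fill_player_class([("Tomas", "Novak")])
print(len(forms))  # 10
```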
3 DIRECT RECOGNITION
The acoustic data for direct subtitling (subtitling from the original audio track) was collected over several years, mainly from the Ice-hockey World Championships, but also from the Winter Olympic Games and Czech Ice-hockey League matches. All these matches were broadcast by Czech Television. Sixty-nine matches were transcribed for acoustic modelling purposes. All of them were manually annotated and carefully revised using the annotation software Transcriber. The total amount of data was more than 100 hours of speech.
The analogue signal was digitized at a 44.1 kHz sample rate with 16-bit resolution. The front-end processor was based on PLP parameterization (Hermansky, 1990) with 27 band-pass filters and 16 cepstral coefficients, augmented with both delta and delta-delta sub-features; one feature vector therefore contains 48 coefficients. The feature vectors were calculated every 10 ms. Many noise reduction techniques were tested to compensate for the very intense background noise; J-RASTA (Koehler et al., 1994) proved to be the best noise-reduction technique in our case (see details in Psutka et al., 2003).
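How 16 static cepstra grow into a 48-dimensional vector can be shown with the standard regression formula for delta features. This is a generic sketch (window size and placeholder frame values are assumptions), not the exact front-end code.

```python
def deltas(frames, window=2):
    """Regression-based delta features over a list of equal-length frames."""
    n, dim = len(frames), len(frames[0])
    denom = 2 * sum(t * t for t in range(1, window + 1))
    out = []
    for i in range(n):
        row = []
        for k in range(dim):
            # Edge frames are clamped, the usual convention.
            num = sum(t * (frames[min(i + t, n - 1)][k] - frames[max(i - t, 0)][k])
                      for t in range(1, window + 1))
            row.append(num / denom)
        out.append(row)
    return out

# Five frames of 16 static PLP cepstra (placeholder values), one per 10 ms.
static = [[float(i + k) for k in range(16)] for i in range(5)]
d1 = deltas(static)        # delta
d2 = deltas(d1)            # delta-delta
full = [s + a + b for s, a, b in zip(static, d1, d2)]
print(len(full[0]))  # 48  (16 static + 16 delta + 16 delta-delta)
```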
Each basic speech unit in all our experiments was represented by a three-state HMM with a continuous output probability density function assigned to each state. As the number of possible Czech triphones is too large, phonetic decision trees were used to tie the states of the triphones. Several experiments were performed to determine the best recognition results depending on the number of clustered states and on the number of mixture components per state. The best recognition results were achieved with 16 multivariate Gaussian mixture components for each of 7,700 states (see Psutka, 2007 for the methodology).
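The output density of one tied state, a 16-component Gaussian mixture over the 48-dimensional feature vectors, can be sketched as below. Diagonal covariances and the toy parameter values are assumptions made for the example; the paper only states "multivariate Gaussians".

```python
import math

def log_gauss_diag(x, mean, var):
    """Log density of a diagonal-covariance multivariate Gaussian."""
    return sum(-0.5 * (math.log(2 * math.pi * v) + (xi - m) ** 2 / v)
               for xi, m, v in zip(x, mean, var))

def state_log_likelihood(x, weights, means, variances):
    """Log output probability of one tied HMM state (a Gaussian mixture)."""
    logs = [math.log(w) + log_gauss_diag(x, m, v)
            for w, m, v in zip(weights, means, variances)]
    top = max(logs)  # log-sum-exp for numerical stability
    return top + math.log(sum(math.exp(l - top) for l in logs))

# Toy state: 16 equally weighted standard-normal components in 48 dimensions.
x = [0.0] * 48
w = [1.0 / 16] * 16
means = [[0.0] * 48] * 16
variances = [[1.0] * 48] * 16
ll = state_log_likelihood(x, w, means, variances)
```

During decoding, each frame's feature vector is scored against such densities for the active states; the 7,700 tied states keep this tractable despite the huge number of possible Czech triphones.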
SIGMAP 2013 - International Conference on Signal Processing and Multimedia Applications