We supplemented formula (1) with the semantic probability of the hypothesis h, P_sem(h), weighted by γ. We propose to estimate this semantic probability with a DNN-based model.
We propose to go beyond a simple score combination, as in eq. (2). We design a DNN-based rescoring model that estimates P_sem(h) as follows: the model takes acoustic, linguistic, and textual information as input. We believe that the acoustic and linguistic information should be trained jointly with the semantic information to obtain an accurate rescoring model.
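Eqs. (1) and (2) are not reproduced in this section; the following display is only a plausible log-linear form of the combined score implied by the description above, where λ denotes the language-model weight and γ the semantic weight (this exact formulation is an assumption, not a quotation of eq. (2)):

\mathrm{score}(h) = \log P_{\mathrm{ac}}(x \mid h) + \lambda \, \log P_{\mathrm{lm}}(h) + \gamma \, \log P_{\mathrm{sem}}(h)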
2.2 N-best Rescoring Procedure
To keep the input vectors of the rescoring DNN at a tractable size, the rescoring is based on a pairwise comparison of ASR hypotheses. Our proposed DNN model therefore takes a pair of hypotheses as input. For each hypothesis pair (h_i, h_j), the expected DNN output v is: (a) 1, if the WER of h_i is lower than the WER of h_j; (b) 0 otherwise.
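As a minimal illustration, assuming per-hypothesis WER values are available, the training target for a pair can be built as follows (the function name is hypothetical):

def pair_target(wer_i: float, wer_j: float) -> float:
    # Expected DNN output for the pair (h_i, h_j):
    # 1.0 if h_i has a strictly lower WER than h_j, 0.0 otherwise.
    return 1.0 if wer_i < wer_j else 0.0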
The algorithm of the N-best list rescoring is as follows. For a given sentence, for each hypothesis h_i we want to compute the cumulative score score_sem(h_i). The obtained cumulative score score_sem(h_i) is used as a pseudo-probability P_sem(h_i) and combined with the acoustic and linguistic likelihoods with the proper weighting factor (to be optimized) according to eq. (2). In the end, the hypothesis that obtains the best score is chosen as the recognized sentence.
To compute the cumulative score score_sem(h_i), for each hypothesis pair (h_i, h_j) of the N-best list of this sentence:
- we apply the DNN semantic model and obtain the output value v_ij (between 0 and 1). A value v_ij close to 1 means that h_i is better than h_j;
- we update the scores of both hypotheses as: score_sem(h_i) += v_ij; score_sem(h_j) += 1 - v_ij (a code sketch of this accumulation is given below).
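A minimal sketch of this pairwise accumulation, assuming a function dnn_score(h_i, h_j) that returns v_ij (the name is hypothetical and stands for the model described in Section 2.3):

from itertools import combinations

def rescore_nbest(hypotheses, dnn_score):
    # hypotheses: list of N-best hypotheses for one sentence.
    # dnn_score(h_i, h_j): returns v_ij in [0, 1], close to 1 when h_i is better.
    score_sem = {i: 0.0 for i in range(len(hypotheses))}
    for i, j in combinations(range(len(hypotheses)), 2):
        v_ij = dnn_score(hypotheses[i], hypotheses[j])
        score_sem[i] += v_ij        # reward h_i when it is judged better
        score_sem[j] += 1.0 - v_ij  # complementary reward for h_j
    return score_sem

The resulting values score_sem(h_i) are then used as the pseudo-probabilities P_sem(h_i) and combined according to eq. (2).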
2.3 DNN-Based Semantic Model
The proposed model takes as input feature vectors that include acoustic information (the likelihood given by the acoustic model), linguistic information (the probability given by the language model), and textual information (the text of the hypotheses). We hypothesize that training on all this information jointly is better than combining the probabilities obtained by separate models.
The text of the hypothesis pair is given to the BERT model. The token embeddings of BERT representing this pair are then passed to a bi-LSTM layer, followed by max pooling and average pooling, and then by a fully connected (FC) layer with a ReLU (Rectified Linear Unit) activation function. The output of this FC layer is concatenated with the acoustic and linguistic information of the hypothesis pair and passed through a second FC layer followed by a sigmoid activation function (to obtain a value between 0 and 1), which yields the output v_ij.
Figure 1 presents the architecture of the proposed
rescoring model.
Figure 1: Architecture of the proposed rescoring model.
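A minimal PyTorch sketch of this architecture, under stated assumptions: the BERT encoder is loaded through Hugging Face transformers, the hidden sizes and the number of acoustic/linguistic score features are illustrative, and the two hypotheses are assumed to be concatenated into a single BERT input sequence.

import torch
import torch.nn as nn
from transformers import BertModel

class PairwiseRescorer(nn.Module):
    def __init__(self, bert_name="bert-base-uncased", lstm_dim=256, n_scores=4):
        super().__init__()
        self.bert = BertModel.from_pretrained(bert_name)
        self.bilstm = nn.LSTM(self.bert.config.hidden_size, lstm_dim,
                              batch_first=True, bidirectional=True)
        self.fc_text = nn.Linear(4 * lstm_dim, lstm_dim)  # max pool + avg pool concatenated
        self.fc_out = nn.Linear(lstm_dim + n_scores, 1)

    def forward(self, input_ids, attention_mask, scores):
        # scores: acoustic and linguistic scores of the hypothesis pair, shape (batch, n_scores)
        emb = self.bert(input_ids=input_ids,
                        attention_mask=attention_mask).last_hidden_state
        out, _ = self.bilstm(emb)                      # (batch, seq_len, 2 * lstm_dim)
        pooled = torch.cat([out.max(dim=1).values,     # max pooling over time
                            out.mean(dim=1)], dim=-1)  # average pooling over time
        text_feat = torch.relu(self.fc_text(pooled))   # FC + ReLU on the textual branch
        fused = torch.cat([text_feat, scores], dim=-1)
        return torch.sigmoid(self.fc_out(fused)).squeeze(-1)  # v_ij in [0, 1]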
3 EXPERIMENTAL CONDITIONS
3.1 Corpus Description
For this study, we use the publicly available TED-
LIUM corpus (Fernandez et al., 2018), which
contains recordings from TED conferences. Each
conference of the corpus is focused on a particular
subject. We chose this corpus because it contains
enough data to train a reliable acoustic model. We use
the train, development, and test partitions provided
within the TED-LIUM corpus: 2351 conferences for
training (452 hours), 8 conferences (1h36) for
development, and 11 conferences (2h27) for the test
set. We use the development set to choose the best
parameter configuration, and the test set to evaluate
the proposed methods with the best configuration. We
compute the WER to measure the performance.
This research work was carried out as part of an industrial project studying the recognition of speech in noisy conditions, more precisely in fighter aircraft. As it is very difficult to have access to real data recorded in a fighter aircraft, we add noise to the train, development and test sets to get closer to the actual