that to continue in the same way with both sub-clusters. The number of final clusters can then naturally only be a power of two. This approach produces more size-balanced clusters and requires less computation time than the first, direct approach, but the final clusters need not be as compact.
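For illustration only, the binary splitting scheme can be sketched as follows; split_into_two is a hypothetical two-class clustering routine (e.g. the algorithm of Section 2.1 with n = 2) supplied by the caller, so the sketch makes no assumptions about its internals:

def hierarchical_split(utterances, depth, split_into_two):
    # Recursively split into at most 2**depth clusters (binary-tree splitting).
    # split_into_two: hypothetical routine returning two lists of utterances.
    if depth == 0 or len(utterances) < 2:
        return [utterances]
    left, right = split_into_two(utterances)
    return (hierarchical_split(left, depth - 1, split_into_two)
            + hierarchical_split(right, depth - 1, split_into_two))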
2.1 Algorithm Description
The algorithm is based on a criterion similar to that of the main training algorithm: maximizing the likelihood L of the training data given the reference transcriptions and the models. The result of the algorithm is a set of trained acoustic models and a set of lists in which every utterance is assigned to exactly one cluster. The number of clusters (classes) n has to be set in advance; for gender-dependent modeling or for hierarchical splitting it is naturally n = 2. The process is a modification of the Expectation-Maximization (EM) algorithm; the unmodified EM algorithm is applied to estimate the acoustic model parameters. The clustering algorithm proceeds as follows (an illustrative code sketch is given after the list):
1. Randomly split the training utterances into n clusters. The clusters should have similar size. In the case of two initial classes, it is reasonable to start the algorithm from gender-based clusters.
2. Train (retrain) acoustic models for all clusters.
3. The posterior probability density P(u|M) of each utterance u with its reference transcription is computed for all models M (so-called forced alignment).
4. Each utterance is assigned to the cluster with the maximal score P(u|M) computed in the previous step:

M_{t+1}(u) = \arg\max_{M} P(u \mid M). \qquad (1)
5. If the clusters have changed, then go back to step 2. Otherwise the algorithm terminates.
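The following is a minimal sketch of steps 1-5, not the implementation used in our system; train_models (which (re)trains one acoustic model per cluster) and score (which returns the forced-alignment log-likelihood log P(u|M)) are hypothetical callables supplied by the caller, and utterances are assumed to be hashable utterance identifiers:

import random

def cluster_utterances(utterances, n, train_models, score,
                       init=None, max_iterations=20):
    # Step 1: random (or supplied, e.g. gender-based) initial assignment.
    if init is None:
        assignment = {u: random.randrange(n) for u in utterances}
    else:
        assignment = dict(init)
    models = None
    for _ in range(max_iterations):
        clusters = [[u for u in utterances if assignment[u] == c]
                    for c in range(n)]
        models = train_models(clusters)      # Step 2: (re)train one model per cluster.
        new_assignment = {                   # Steps 3-4: forced alignment, then eq. (1).
            u: max(range(n), key=lambda c: score(u, models[c]))
            for u in utterances
        }
        if new_assignment == assignment:     # Step 5: stop when nothing is reassigned.
            break
        assignment = new_assignment
    # A final retraining on the last assignment is omitted for brevity.
    return models, assignment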
The optimality of the clustering result is not guaranteed, and the algorithm depends on the initial clustering. Furthermore, even convergence of the algorithm is not guaranteed, because a few utterances may be reassigned back and forth indefinitely. Therefore, it is suitable to apply a small threshold as a final stopping condition or to use a fixed number of iterations. Thus, if we would like to verify that the gender-dependent splitting is "optimal", we use the male/female distribution as the initial clustering and start the algorithm. The intention is that the algorithm terminates with more refined clusters, in which "masculine" female and "feminine" male voices, as well as errors in the manual male/female annotations, are reclassified. This should improve the performance of the recognizer.
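With the sketch above, such a verification run could look as follows; gender_labels (the manual male/female annotation mapped to cluster indices 0 and 1), train_models and score are illustrative names only:

models, assignment = cluster_utterances(
    utterances, n=2, train_models=train_models, score=score,
    init=gender_labels,      # start from the manual male/female clusters
    max_iterations=10)       # fixed number of iterations as the stopping condition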
3 DISCRIMINATIVE TRAINING
Discriminative training (DT) has been developed over the last decade and provides better recognition results than classical training based on the Maximum Likelihood (ML) criterion (Povey, 2003; McDermott, 2006). In principle, ML-based training is a machine learning method that uses positive examples only. DT, on the contrary, uses both positive and negative examples in learning and can be based on various objective functions, e.g. Maximum Mutual Information (MMI) (Bahl et al., 1986), Minimum Classification Error (MCE) (McDermott, 2006), or Minimum Word/Phone Error (MWE/MPE) (Povey, 2003). Most of them require the generation of lattices or a multiple-hypothesis recognition run with an appropriate language model. Lattice generation is highly time-consuming. Furthermore, these methods require a good correspondence between the training and testing dictionaries and language models. If the correspondence is weak, e.g. many words appear only in the test dictionary, then the results of these methods are poor. In this case, we can employ Frame-Discriminative training, which is independent of the dictionary and language model used (Kapadia, 1998). In addition, this approach is much faster. In the lattice-based method with the MMI objective function, the training algorithm seeks to maximize the posterior probability of the correct utterance given the models (Bahl et al., 1986):
F_{MMI}(\lambda) = \sum_{r=1}^{R} \log \frac{P_\lambda(O_r \mid s_r)^{\kappa}\, P(s_r)^{\kappa}}{\sum_{s} P_\lambda(O_r \mid s)^{\kappa}\, P(s)^{\kappa}}, \qquad (2)
where λ represents the acoustic model parameters, O_r is the feature sequence of the r-th training utterance, s_r is its correct transcription, κ is the acoustic scale, which is used to amplify confusions and thereby increases the test-set performance, and P(s) is the language model probability.
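For illustration only, the following sketch evaluates equation (2) from per-hypothesis acoustic and language model log-probabilities; the flat data layout (one explicit list of hypotheses per utterance) is an assumption made for this sketch and does not correspond to the lattice-based implementation:

import math

def mmi_objective(data, kappa):
    # data: list of dicts with 'ref' (index of the correct hypothesis s_r) and
    # 'hyps' (list of (acoustic_logprob, lm_logprob) pairs, one per hypothesis s).
    total = 0.0
    for utt in data:
        scaled = [kappa * (ac + lm) for ac, lm in utt['hyps']]  # log P_lambda(O_r|s)^k P(s)^k
        m = max(scaled)
        log_den = m + math.log(sum(math.exp(x - m) for x in scaled))  # log of the denominator sum
        total += scaled[utt['ref']] - log_den                   # log of the fraction in eq. (2)
    return total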
Optimization of the MMI objective function uses the Extended Baum-Welch update equations and requires two sets of statistics. The first set, corresponding to the numerator (num) of equation (2), is accumulated on the correct transcription. The second one corresponds to the denominator (den) and is accumulated on a recognition/lattice model containing all possible words. The accumulation of statistics is done by the forward-backward algorithm on the reference transcriptions (numerator) as well as on the generated lattices (denominator). The Gaussian means and variances are updated as follows (Kapadia, 1998):
\hat{\mu}_{jm} = \frac{\Theta_{jm}^{num}(O) - \Theta_{jm}^{den}(O) + D_{jm}\, \mu_{jm}^{0}}{\gamma_{jm}^{num} - \gamma_{jm}^{den} + D_{jm}}, \qquad (3)
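A minimal sketch of the mean update in equation (3) for a single Gaussian mixture component, assuming the numerator and denominator statistics have already been accumulated; argument names and the example statistics are illustrative only:

import numpy as np

def ebw_mean_update(theta_num, theta_den, gamma_num, gamma_den, mu_old, d_jm):
    # theta_num, theta_den: first-order statistics Theta_jm(O) from the numerator/denominator pass
    # gamma_num, gamma_den: occupation counts gamma_jm from the same passes
    # mu_old: current mean mu^0_jm;  d_jm: per-Gaussian smoothing constant D_jm
    return (theta_num - theta_den + d_jm * mu_old) / (gamma_num - gamma_den + d_jm)

# Example with 2-dimensional features and made-up statistics:
new_mu = ebw_mean_update(np.array([4.0, 2.0]), np.array([1.0, 0.5]),
                         gamma_num=3.0, gamma_den=1.0,
                         mu_old=np.array([1.2, 0.6]), d_jm=2.0)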