symmetric. Over the last several years, various
measures to symmetrize the KL divergence have
been introduced in the literature. Among these
measures, we adopt the simplest, summing the two directed divergences to define the KL distance:
d_{KL}(f, g) = D(f \| g) + D(g \| f)    (4)
Although Jeffreys (1946) did not develop Eq. (4) to symmetrize the KL divergence, the so-called J-divergence equals the sum of the two possible KL divergences between a pair of probability distributions. Because using a full covariance matrix causes the number of parameters to grow with the square of the feature dimension, a diagonal covariance matrix is generally adopted, in which the off-diagonal elements are taken to be zero. In this case, the Gaussian distributions have independent and uncorrelated dimensions, so Eq. (4) can be written as the following closed-form expression:
d_{KL}(f, g) = \frac{1}{2} \sum_{d=1}^{D} \left[ \frac{\sigma_{f,d}^2}{\sigma_{g,d}^2} + \frac{\sigma_{g,d}^2}{\sigma_{f,d}^2} + (\mu_{f,d} - \mu_{g,d})^2 \left( \frac{1}{\sigma_{f,d}^2} + \frac{1}{\sigma_{g,d}^2} \right) - 2 \right]    (5)
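To make the computation concrete, the following Python sketch (illustrative only; the function names and the use of NumPy are assumptions, not part of the system described here) evaluates Eq. (5) for two diagonal-covariance Gaussians and checks that it equals the sum of the two directed divergences of Eq. (4).

# A minimal sketch (not the original implementation) of the closed-form
# symmetric KL distance of Eq. (5) for diagonal-covariance Gaussians.
import numpy as np

def kl_gauss_diag(mu_f, var_f, mu_g, var_g):
    """One-directional KL divergence D(f||g) for diagonal Gaussians."""
    return 0.5 * np.sum(
        np.log(var_g / var_f) + (var_f + (mu_f - mu_g) ** 2) / var_g - 1.0
    )

def kl_distance_diag(mu_f, var_f, mu_g, var_g):
    """Symmetric KL distance, Eqs. (4)/(5): D(f||g) + D(g||f)."""
    return 0.5 * np.sum(
        var_f / var_g + var_g / var_f
        + (mu_f - mu_g) ** 2 * (1.0 / var_f + 1.0 / var_g)
        - 2.0
    )

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    mu_f, mu_g = rng.normal(size=13), rng.normal(size=13)     # e.g. 13-dim features
    var_f, var_g = rng.uniform(0.5, 2.0, 13), rng.uniform(0.5, 2.0, 13)
    # Eq. (5) equals the sum of the two directed divergences of Eq. (4).
    d_sum = kl_gauss_diag(mu_f, var_f, mu_g, var_g) + kl_gauss_diag(mu_g, var_g, mu_f, var_f)
    d_closed = kl_distance_diag(mu_f, var_f, mu_g, var_g)
    assert np.isclose(d_sum, d_closed)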
3.2 Approximation by the Nearest Pair
In speech recognition, the KL distance must be calculated between GMMs. However, the KL distance between two GMMs cannot be determined analytically: it has no closed-form expression such as the one shown in Eq. (5). For this reason, approximation methods have been introduced. The simple method adopted here uses the nearest pair of mixture components (Hershey and Olsen, 2007):
d_{KL2}(f, g) = \min_{i, j} d_{KL}(f_i, g_j)    (6)
where i and j index the components of mixture M. As shown in Eqs. (5) and (6), the mixture weights are not considered at this stage, so this closed-form approximation is still based on single Gaussian distributions. In our experiments, the average (d_{KL2ave}) and the maximum (d_{KL2max}) over the component pairs are also evaluated.
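A minimal Python sketch of this nearest-pair approximation follows, covering Eq. (6) as well as the d_{KL2ave} and d_{KL2max} variants; the function names and the (M, D) array layout are illustrative assumptions.

# A minimal sketch (assumed interface) of the nearest-pair approximation of
# Eq. (6): the distance between two GMMs is approximated by the minimum
# (or average / maximum) of the closed-form distances of Eq. (5) over all
# component pairs, ignoring the mixture weights.
import numpy as np

def kl_distance_diag(mu_f, var_f, mu_g, var_g):
    """Symmetric KL distance of Eq. (5), as in the previous sketch."""
    return 0.5 * np.sum(
        var_f / var_g + var_g / var_f
        + (mu_f - mu_g) ** 2 * (1.0 / var_f + 1.0 / var_g) - 2.0
    )

def kl2_gmm(mus_f, vars_f, mus_g, vars_g, mode="min"):
    """d_KL2 between two GMMs given as (M, D) arrays of means and variances."""
    d = np.array([
        [kl_distance_diag(mf, vf, mg, vg) for mg, vg in zip(mus_g, vars_g)]
        for mf, vf in zip(mus_f, vars_f)
    ])
    # d_KL2min of Eq. (6), d_KL2ave, and d_KL2max over all component pairs (i, j)
    return {"min": d.min(), "ave": d.mean(), "max": d.max()}[mode]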
3.3 Approximation by the Monte Carlo Method
In addition to the approximation based on the closed-form expression, the KL distance can be approximated from pseudo-samples using the Monte Carlo method. Monte Carlo simulation is a suitable method for estimating the KL distance of high-dimensional GMMs. The expectation of a function over a mixture distribution f(x) = \sum_m \pi_m N(x; \mu_m, \sigma_m^2) can be approximated by drawing samples from f(x) and averaging the values of the function at those samples. By drawing samples x_1, \ldots, x_N \sim f(x), we obtain the following approximation (Bishop, 2006):
D_{MC}(f \| g) \equiv \frac{1}{N} \sum_{n=1}^{N} \log \frac{f(x_n)}{g(x_n)}    (7)
In this approximation, D_{MC}(f||g) of Eq. (7) converges to D(f||g) as N → ∞. To draw x from the GMM f(x), the number of samples for each component is first determined on the basis of its prior probability \pi_m, and then the samples are generated from each single Gaussian distribution.
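The sampling scheme and the estimator of Eq. (7) can be sketched as follows for diagonal-covariance GMMs; the (weights, means, variances) interface and all function names are illustrative assumptions rather than the implementation used here.

# A minimal sketch of the Monte Carlo approximation of Eq. (7): component
# counts are chosen from the mixture weights pi_m, Gaussian draws are made
# per component, and the divergence is the average log-ratio log f(x)/g(x).
import numpy as np

def gmm_logpdf(x, weights, mus, vars_):
    """Log density of a diagonal-covariance GMM at samples x of shape (N, D)."""
    # per-component log N(x; mu_m, sigma_m^2), shape (N, M)
    log_comp = -0.5 * (
        np.log(2.0 * np.pi * vars_).sum(axis=1)
        + (((x[:, None, :] - mus[None]) ** 2) / vars_[None]).sum(axis=2)
    )
    return np.logaddexp.reduce(np.log(weights) + log_comp, axis=1)

def sample_gmm(weights, mus, vars_, n, rng):
    """Draw n samples from the GMM: pick components by pi_m, then sample."""
    counts = rng.multinomial(n, weights)
    draws = [rng.normal(mus[m], np.sqrt(vars_[m]), size=(c, mus.shape[1]))
             for m, c in enumerate(counts)]
    return np.concatenate(draws, axis=0)

def kl_mc(f, g, n=10_000, seed=0):
    """D_MC(f||g) of Eq. (7); f and g are (weights, mus, vars) tuples."""
    rng = np.random.default_rng(seed)
    x = sample_gmm(*f, n, rng)
    return np.mean(gmm_logpdf(x, *f) - gmm_logpdf(x, *g))

The estimator is one-directional, so D_MC(f||g) and D_MC(g||f) are computed separately and combined afterwards, as described in Section 3.4.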
3.4 Approximation by Gibbs Sampler
Furthermore, for sampling from multivariate probability distributions, the Markov chain Monte Carlo (MCMC) method has been widely applied to simulate the desired distribution. Each Gibbs sample is drawn such that it depends only on the previously drawn variable. The conditional distribution of the current variable x_f given the previous variable x_g is the following normal distribution:
p(x_f \mid x_g) = N\!\left( x_f;\ \mu_f + \rho \frac{\sigma_f}{\sigma_g} (x_g - \mu_g),\ (1 - \rho^2)\, \sigma_f^2 \right)    (8)
where ρ is the correlation coefficient. Herein, the full covariance matrix cannot be calculated because of the insufficient training data in our experiments; therefore, we adopt a single set of correlation coefficients estimated from the full training data. The first 10,000 (10K) samples of the chain, the so-called burn-in period, are removed. In our experiments, we generate samples of size 10K and 100K for the MC and MCMC methods. To obtain a symmetric measure, we calculate the arithmetic mean (AM), geometric mean (GM), and harmonic mean (HM) of the KL divergences obtained with MC and MCMC sampling (Johnson and Sinanović, 2001). The maximum and minimum of the two divergences, D(f||g) and D(g||f), are also calculated for comparison.
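As an illustration only, the following sketch shows one possible form of such a chain: each dimension is drawn from the conditional of Eq. (8) given the previously drawn dimension with a fixed correlation coefficient, the burn-in samples are discarded, and the symmetrized scores are formed from the two directed divergence estimates. The wrap-around treatment of the first dimension and all names are assumptions made for this sketch, not the exact procedure used in the experiments.

# A minimal sketch of a Gibbs-style chain using the conditional of Eq. (8)
# with a fixed per-dimension correlation coefficient rho, burn-in removal,
# and the AM / GM / HM (plus max and min) symmetrization of Section 3.4.
import numpy as np

def gibbs_chain(mu, sigma, rho, n_samples, burn_in=10_000, seed=0):
    """Draw samples; each dimension depends only on the previously drawn one."""
    rng = np.random.default_rng(seed)
    D = len(mu)
    total = burn_in + n_samples
    samples = np.empty((total, D))
    x = rng.normal(mu, sigma)                       # arbitrary starting point
    for t in range(total):
        for d in range(D):
            prev = d - 1                            # "previous variable" x_g;
                                                    # d = 0 wraps to the last
                                                    # dimension of the prior sweep
            cond_mean = mu[d] + rho[d] * (sigma[d] / sigma[prev]) * (x[prev] - mu[prev])
            cond_std = sigma[d] * np.sqrt(1.0 - rho[d] ** 2)
            x[d] = rng.normal(cond_mean, cond_std)  # conditional draw, Eq. (8)
        samples[t] = x
    return samples[burn_in:]                        # drop the burn-in period

def symmetrized(d_fg, d_gf):
    """AM, GM, HM, max, and min of the two directed divergence estimates."""
    return {
        "AM": 0.5 * (d_fg + d_gf),
        "GM": np.sqrt(d_fg * d_gf),
        "HM": 2.0 * d_fg * d_gf / (d_fg + d_gf),
        "max": max(d_fg, d_gf),
        "min": min(d_fg, d_gf),
    }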
3.5 Bhattacharyya Distance and Others
The Bhattacharyya distance, which is another