2 THE BASELINE DIARIZATION SYSTEM
In the ICSI-SRI Fall 2004 diarization system, an
initial guess is made for the number of speakers (K);
this guess must be much greater than the actual
number of speakers. The audio file is divided into
60 millisecond windows, each overlapping the previous
one by 20 milliseconds. For each window, nineteen
mel-frequency cepstral coefficients (MFCCs) are
extracted to form the acoustic feature vector for that
window. The feature vectors are assigned sequentially
to the K speakers; each such group of feature vectors
is called a segment (so there are K segments).
Wooters et al. (2004) report that this speaker change
detection initialization method is as effective as
those based on distance measures (Barras et al., 2004)
or BIC (Zhou and Hansen, 2000).
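As a rough illustration of the framing and initial assignment described above, the following sketch cuts a signal into 60 ms windows that overlap by 20 ms and splits the resulting frame indices into K contiguous groups. The 16 kHz sample rate, the helper names, and the reading of "assigned sequentially" as K near-equal contiguous blocks are assumptions made for illustration only; MFCC extraction itself is omitted.

```python
import numpy as np

def frame_starts(n_samples, sample_rate, win_ms=60, overlap_ms=20):
    """Return the start sample of every full window and the window length."""
    win = int(sample_rate * win_ms / 1000)
    # A 20 ms overlap between 60 ms windows means a 40 ms step.
    hop = int(sample_rate * (win_ms - overlap_ms) / 1000)
    if n_samples < win:
        return np.array([], dtype=int), win
    return np.arange(0, n_samples - win + 1, hop), win

def sequential_init(n_frames, k):
    """Split frame indices 0..n_frames-1 into K contiguous initial segments."""
    return np.array_split(np.arange(n_frames), k)

# Hypothetical example: 10 s of audio at an assumed 16 kHz, K = 16.
sample_rate = 16000
starts, win = frame_starts(n_samples=10 * sample_rate, sample_rate=sample_rate)
segments = sequential_init(len(starts), k=16)
```

Each entry of `segments` holds the frame indices that would initially belong to one of the K states.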
A K-state Hidden Markov Model (HMM) is created in
which each state acoustically models a single
potential speaker. A Gaussian Mixture Model (GMM) is
established for each state to initialize the HMM. The
Viterbi decoding algorithm is then used to re-assign
feature vectors to states, and the GMMs are updated
accordingly. Each of the K states is linked to several
sub-states, which share the state's probability
density function (pdf). Once a state has been entered,
the decoding path cannot move to another state until
it has passed through all of that state's sub-states
one by one. This imposes a minimum number of features
(equivalent to more than 0.9 seconds) that are
assigned to a state on each visit. Repeating these
steps iteratively refines the segment boundaries
assigned to each state. This approach was first
reported by Ajmera et al. (2002).
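The minimum-duration constraint described above can be sketched as a Viterbi pass over an expanded state space in which every speaker state is replaced by a chain of sub-states with tied emissions. This is a minimal sketch, not the ICSI-SRI implementation: the function names are hypothetical, transitions are unweighted (allowed or forbidden), and the number of sub-states `s` is left as a parameter rather than derived from the 0.9 second figure.

```python
import numpy as np

def min_duration_mask(k, s):
    """Boolean matrix: entry [i, j] says whether a move i -> j is allowed."""
    allowed = np.zeros((k * s, k * s), dtype=bool)
    for spk in range(k):
        first, last = spk * s, spk * s + s - 1
        for sub in range(first, last):
            allowed[sub, sub + 1] = True         # must advance through the chain
        allowed[last, last] = True               # may stay in the final sub-state
        for other in range(k):
            if other != spk:
                allowed[last, other * s] = True  # or enter another speaker's chain
    return allowed

def viterbi_min_duration(frame_loglik, s):
    """frame_loglik: (T, K) log-likelihoods. Returns one speaker label per frame."""
    t_len, k = frame_loglik.shape
    log_trans = np.where(min_duration_mask(k, s), 0.0, -np.inf)
    emit = np.repeat(frame_loglik, s, axis=1)    # sub-states share the state's pdf
    score = np.full(k * s, -np.inf)
    score[::s] = emit[0, ::s]                    # paths start in a first sub-state
    back = np.zeros((t_len, k * s), dtype=int)
    for t in range(1, t_len):
        cand = score[:, None] + log_trans        # cand[i, j]: best path taking i -> j
        back[t] = np.argmax(cand, axis=0)
        score = cand[back[t], np.arange(k * s)] + emit[t]
    path = [int(np.argmax(score))]
    for t in range(t_len - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return np.array(path[::-1]) // s             # map sub-state index to speaker

# Toy input: frames 0-5 favour speaker 0, frames 6-11 favour speaker 1.
toy = np.full((12, 2), -10.0)
toy[:6, 0] = 0.0
toy[6:, 1] = 0.0
labels = viterbi_min_duration(toy, s=3)
```

With this topology every run of a speaker label lasts at least `s` frames, except possibly the final run, which may be cut short by the end of the file.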
Wooters et al. (2004) advise that an agglomerative
clustering technique with BIC merging and stopping
criteria (Ajmera and Lapidot, 2002) always gives the
best performance for clustering segments. The Bayesian
Information Criterion (BIC) (Schwarz, 1978) is a
model selection criterion that prefers models with
large log-likelihood values but penalizes them
according to model complexity (the number of
parameters in the model). For a pair of segments x
and y that are assigned to different states, the BIC
merging score is computed according to Eq. 1.
\[
\mathrm{BIC}_{\mathrm{score}} = L_z - (L_x + L_y) - \tfrac{1}{2}\,\alpha\,\bigl(P_z \log(n_z) - P_x \log(n_x) - P_y \log(n_y)\bigr) \tag{1}
\]
where L_z is the log-likelihood of the merged model,
P is the number of parameters used in the model and
n is the number of features in the segment.
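Eq. 1 translates directly into a few lines of code. The sketch below assumes the log-likelihoods and parameter counts have already been obtained from trained models, and that the merged segment z contains all features of x and y (so n_z = n_x + n_y); the function name and argument names are illustrative only.

```python
import math

def bic_merge_score(ll_z, ll_x, ll_y, p_z, p_x, p_y, n_x, n_y, alpha=1.0):
    """BIC merging score of Eq. 1 for segments x and y and their merger z."""
    n_z = n_x + n_y  # assumption: z pools the features of x and y
    penalty = p_z * math.log(n_z) - p_x * math.log(n_x) - p_y * math.log(n_y)
    return ll_z - (ll_x + ll_y) - 0.5 * alpha * penalty

# Hypothetical numbers, chosen only to exercise the function.
score = bic_merge_score(ll_z=-100.0, ll_x=-40.0, ll_y=-50.0,
                        p_z=10, p_x=10, p_y=10, n_x=100, n_y=100)
```

Passing `alpha=0.0` drops the complexity term and leaves only the log-likelihood comparison.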

[Figure 1: The original ICSI method compared with the
new system. Flowchart. Shared steps: run
speech/non-speech detection; extract feature vectors;
initialize K-state HMM model; run Viterbi decoding to
reassign the features. Original ICSI branch:
initialize GMM for each state; update the GMM for each
state; select the pair of segments with largest BIC
score; merge them if BIC score > 0; stop if BIC
score < 0; output result. New system branch: build the
UBM based on the audio file itself and set the model
complexity automatically; use MAP-adaptation to
initialize the GMM model for each state; adapt the GMM
from the UBM for each state; compute the CLR for all
pairs of states; use normalized cuts to merge the
states; compute the intra-cluster/inter-cluster ratio;
branch on K = 1 or K > 1; output the result with
minimum intra-cluster/inter-cluster ratio.]

The pair of states whose segments have the highest
BIC score is merged, and the state model is retrained.
The merging process continues until no pair of states
has a BIC score larger than zero; the clustering then
stops. In the ICSI-SRI diarization system, the number
of parameters used in the merged model is set equal to
the sum of the numbers of parameters used in the two
individual models, so the α parameter is not required.
The states that remain in the HMM are the potential
speakers; the segments are thus indexed and
categorised. Ajmera and Wooters (2003) have created an
alternative algorithm that integrates segmentation
and clustering.
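The merge-and-stop rule described above (merge the best-scoring pair, stop once no pair scores above zero) can be sketched as a generic loop. The scoring and merging functions are passed in as parameters; the toy score used in the example is a hypothetical stand-in for the BIC score, and the retraining and Viterbi re-segmentation that the real system performs after each merge are not modelled.

```python
from itertools import combinations

def agglomerate(clusters, score_fn, merge_fn):
    """Repeatedly merge the highest-scoring pair while its score is positive."""
    clusters = list(clusters)
    while len(clusters) > 1:
        (i, j), best = max(
            ((pair, score_fn(clusters[pair[0]], clusters[pair[1]]))
             for pair in combinations(range(len(clusters)), 2)),
            key=lambda item: item[1],
        )
        if best <= 0:
            break  # no pair scores above zero: clustering stops
        merged = merge_fn(clusters[i], clusters[j])
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)]
        clusters.append(merged)
    return clusters

def toy_score(a, b):
    """Stand-in score (NOT BIC): positive when the cluster means are close."""
    return 1.0 - abs(sum(a) / len(a) - sum(b) / len(b))

initial = [[0.0, 0.1], [0.2], [5.0], [5.3, 5.1]]
result = agglomerate(initial, toy_score, merge_fn=lambda a, b: a + b)
```

The number of clusters left in `result` plays the role of the estimated number of speakers.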
Sinha and Tranter (2005) and Barras et al. (2006)
include a post-processing step in the speaker
diarization system to improve performance. This step
involves a Universal Background Model (UBM), which is
pre-trained either on other audio files or on the data
itself; Maximum a Posteriori (MAP) mean adaptation
(Barras and Gauvain, 2003) of the UBM is then applied
to each cluster to give the state model. The
Cross-Likelihood Ratio (CLR) (Sinha et al., 2005) is
applied as the merging criterion instead of BIC.
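To make the post-processing step more concrete, the sketch below shows MAP mean adaptation and a CLR computation in a deliberately simplified setting. It assumes a single diagonal Gaussian in place of each GMM, a relevance factor of 16, and one commonly used form of the CLR (the sum of the two length-normalized cross log-likelihood ratios against the UBM); none of these choices is taken from the cited systems, and all function names are hypothetical.

```python
import numpy as np

def make_loglik(mean, var):
    """Return a function giving the total log-likelihood of (T, D) frames."""
    def loglik(frames):
        diff = frames - mean
        per_frame = -0.5 * np.sum(np.log(2 * np.pi * var) + diff ** 2 / var, axis=1)
        return float(per_frame.sum())
    return loglik

def map_adapt_mean(frames, ubm_mean, relevance=16.0):
    """Interpolate between the cluster mean and the UBM mean (single component)."""
    n = len(frames)
    w = n / (n + relevance)
    return w * frames.mean(axis=0) + (1.0 - w) * ubm_mean

def clr(frames_x, frames_y, loglik_x, loglik_y, loglik_ubm):
    """Cross-likelihood ratio between clusters x and y (assumed formulation)."""
    term_x = (loglik_y(frames_x) - loglik_ubm(frames_x)) / len(frames_x)
    term_y = (loglik_x(frames_y) - loglik_ubm(frames_y)) / len(frames_y)
    return term_x + term_y

# Synthetic clusters: a and b share a source, c comes from a different one.
rng = np.random.default_rng(0)
a = rng.normal(0.0, 1.0, size=(200, 2))
b = rng.normal(0.0, 1.0, size=(200, 2))
c = rng.normal(4.0, 1.0, size=(200, 2))

# UBM trained "on the data itself": one Gaussian over all frames.
pooled = np.vstack([a, b, c])
ubm_mean, ubm_var = pooled.mean(axis=0), pooled.var(axis=0)
ubm = make_loglik(ubm_mean, ubm_var)
models = {name: make_loglik(map_adapt_mean(x, ubm_mean), ubm_var)
          for name, x in (("a", a), ("b", b), ("c", c))}

clr_same = clr(a, b, models["a"], models["b"], ubm)
clr_diff = clr(a, c, models["a"], models["c"], ubm)
```

In this toy setting a higher CLR indicates that two clusters are more likely to belong to the same speaker, so `clr_same` should exceed `clr_diff`.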
Figure 1 contrasts the original ICSI-SRI system with
the new system described by the authors.
SIGMAP 2007 - International Conference on Signal Processing and Multimedia Applications