IMPROVEMENTS IN SPEAKER DIARIZATION SYSTEM

Rong Fu and Ian D. Benest

Department of Computer Science, the University of York, YO10 5DD, York, UK

Keywords:

Speaker Diarization, Model Complexity Selection, Universal Background Model.

Abstract:

This paper describes an automatic speaker diarization system for natural, multi-speaker meeting conversations

using one central microphone. It is based on the ICSI-SRI Fall 2004 diarization system (Wooters et al., 2004),

but it has a number of signiﬁcant modiﬁcations. The new system is robust to different acoustic environments

- it requires neither pre-training models nor development sets to initialize the parameters. It determines the

model complexity automatically. It adapts the segment model from a Universal Background Model (UBM),

and uses the cross-likelihood ratio (CLR) instead of the Bayesian Information Criterion (BIC) for merging.

Finally it uses an intra-cluster/inter-cluster ratio as the stopping criterion. Altogether this reduces the speaker

diarization error rate from 25.36% to 21.37% compared to the baseline system (Wooters et al., 2004).

1 INTRODUCTION

For the purposes of this paper, speaker diarization is

the process by which an audio recording of a meeting

is indexed according to the speakers who made oral

contributions. The NIST Rich Transcription Eval-

uations (NIST, 2004) refers to this as “who spoke

when”. Research in speaker diarization currently fo-

cuses on three main areas. First, through its indexing

and modelling, diarization enables audio databases

to be searched for particular individuals. Second,

the provision of diarization improves the success rate

of automatic speech recognisers by enabling them

to adapt to different speakers. The third application

is in the provision of more structured transcripts of

recorded meetings, news broadcasts and telephone

conversations (Tranter and Reynolds, 2006). It is the

ﬁrst area on which this paper focuses. The process

reported here is performed without the knowledge of,

for example, the number of speakers involved, their

gender, and the positions of extraneous noises such as

laughter, coughing, paper shufﬂing and so on. While

meetings can be recorded using either a single cen-

tral microphone or multiple-microphones (where each

person has their own microphone) (Jin et al., 2004),

this work concentrates on single-microphone record-

ings of meetings between a number of individuals. Of

course the system needs to be robust with regard to the

acoustic environment - implying that there should be

no pre-training of the acoustic models, and the tuning

of parameters should be automatic.

The system described in this paper is based on

the ICSI-SRI Fall 2004 diarization system (Wooters

et al., 2004). This was selected because it adopts

the single microphone speaker diarization task, and

performs acoustic modelling using only the audio ﬁle

itself. In contrast with their system (Wooters et al.,

2004), this new one adopts an alternative approach to

determining the model complexity parameters auto-

matically, using the cross log-likelihood ratio rather

than the Bayesian Information Criterion (BIC) as the

merging rule; in addition it uses the intra-cluster/inter-

cluster ratio as the criterion for identifying the number

of speakers.

The paper is structured as follows. Section 2 de-

scribes the basic diarization system adopted by Woot-

ers et al.(2004), which is taken to be the baseline for

comparison. In section 3 the techniques used to im-

prove on the basic system are described. The experi-

mental arrangement and results are presented in sec-

tion 4 and conclusions are offered in section 5.

317

Fu R. and D. Benest I. (2007).

IMPROVEMENTS IN SPEAKER DIARIZATION SYSTEM.

In Proceedings of the Second International Conference on Signal Processing and Multimedia Applications, pages 313-319

DOI: 10.5220/0002140703130319

 SciTePress

2 THE BASELINE DIARIZATION

SYSTEM

In the ICSI-SRI Fall 2004 diarization system a guess

is made as to the number of individual speakers (K);

that guess must be much greater than the number of

actual speakers. The audio ﬁle is divided up into

60 millisecond windows with each window overlap-

ping the previous one by 20 milliseconds. For each

window, nineteen mel-frequency cepstral coefﬁcients

(MFCC) are extracted as acoustic feature vectors for

that window. These feature vectors are assigned se-

quentially to the K speakers; this grouping of fea-

ture vectors is called a segment (and there are K seg-

ments). Wooters et al. (2004) report that this speaker

change detection initialization method is as effective

as those based on distance measures (Barras et al.,

2004) or BIC (Zhou and Hansen, 2000).

A K state Hidden Markov Model (HMM) is cre-

ated where, of course, each of its states acoustically

models a single potential speaker. Gaussian Mixture

Models (GMM) are established to initialize the states

of the HMM. The Viterbi decoding algorithm is used

to re-assign feature vectors to other states and the

GMM is thus updated. Several sub-states are linked

to each K state and these share the state’s probability

density function (pdf). Upon entering a state, the fea-

ture vectors cannot change to another state unless they

have travelled through all the sub-states one-by-one.

This imposes a minimum number of features (equiva-

lent to more than 0.9 seconds), which are assigned to

a state each time. This iteratively reﬁnes the segment

boundary assigned to each state. This approach was

ﬁrst reported by Ajmera et al.(2002).

Wooters et al. (2004) advise that an agglomerative

clustering technique with BIC merging and stopping

criteria (Ajmera and Lapidot, 2002) always gives the

best performance for clustering segments. Bayesian

Information Criterion (BIC) (Schwarz, 1978) is a

model selection criterion which prefers those models

that have large log-likelihood values, but penalizes it

with model complexity (the number of parameters in

the model) (Schwarz, 1978). For a pair of segments x

and y which are assigned to different states, their BIC

merging score is computed according to Eq.1.

BIC

score

= L

− (L

+ L

) −1/2α(P

log(n

) − P

log(n

) − P

log(n

)), (1)

where L

is the log-likelihood function for the merg-

ing model, P is the number of parameters used in the

model and n is the number of features in the segment.

The pair of states whose segments have the highest

BIC score will be merged, and the state model re-

trained. The merging process continues until there are

run speech/non-speech detection

extract feature vectors

initialize K-state HMM model

initialize

GMM for

each state

build the UBM based

on the audio file itself

and set the model

complexity

automatically

use MAP-adaptation to

initialize the GMM

model for each state

run Viterbi decoding to reassign the features

adapt the GMM from the UMB

for each state

compute the CLR for all pairs of

states

use normalized cuts to merge the

states

compute the

intra-cluster/inter-cluster ratio

if K = 1 if K > 1

output the result

with minimum

intra-cluster/inter-cluster ratio

update the GMM for each state

select the pair of segments with

largest BIC score

merge them if

BIC score > 0

stop if BIC

score < 0

output result

original ICSI

original

ICSI

new

system

new system

Figure 1: The original ICSI method compared with the new

system.

no pairs of states whose BIC score is larger than zero;

the clustering then stops. In the ICSI-SRI diarization

system, the number of parameters used in the merging

model is set to be equal to the sum of the number of

parameters used in each model, so the α parameter is

not required. The states that remain in the HMM are

potential speakers; the segments are thus indexed and

categorised. Ajmera and Wooters (2003) have created

an alternative algorithm which integrates the segmen-

tation and clustering together.

Sinha and Tranter (2005) and Barras et al. (2006)

have included a post-processing step in the speaker

diarization system in order to improve the perfor-

mance. This involves a Universal Background Model

(UBM), which is pre-trained either with other audio

ﬁles or with the data itself, and a Maximum a Pos-

teriori (MAP) mean-adaptation (Barras and Gauvain,

2003) is then applied to each cluster from the UBM

to give the state model. The Cross-Likelihood Ratio

(CLR) (Sinha et al., 2005) instead of BIC is applied

as the merging criterion.

Figure 1 illustrates and contrasts the original ICSI-

SRI system with that described by the authors.

SIGMAP 2007 - International Conference on Signal Processing and Multimedia Applications

318

3 THE NEW APPROACH

3.1 Model Complexity Selection

Model complexity is deﬁned as the number of param-

eters within the model. For the Gaussian Mixture

Model with a ﬁxed dimension and covariance type

(diagonal or full), the parameter which determines the

model complexity is the number of components in the

mixture model. In speaker diarization systems, there

are two steps which are inﬂuenced by the model com-

plexity; which pair of segments are to be merged, and

how should the model for a new segment be estab-

lished. A GMM with a small number of components

is too general and tends to over-merge the segments,

while a GMM with a large number of components is

too speciﬁc and tends to under-merge the segments.

Usually GMMs are trained by the EM algorithm and

unfortunately this is sensitive to the initialization en-

vironment. As a result an inappropriate estimate of

the number of components reduces the accuracy of

the model.

Anguera et al. (2005), and Wooters et al. (2004)

pre-determine this complexity with a ﬁxed value; so

the model complexity cannot be changed according

to segment length and model similarity. The problem

is that a small segment with high complexity loses its

generalization. Anguera et al. (2006) automatically

determined the number of components depending on

the length of the segments. However, there was still

a need for a parameter that was adjusted by external

training sets. A problem also arose from the merging

process: the model of a populated segment doubles its

size after merging two segments of the same length,

ignoring both the similarity between the two segments

and their common speaker characteristics. The prob-

lem becomes more serious when UBM is introduced

to derive the segment models - as discussed in section

3.2.

The approach developed by Figueiredo and Jain

(2002), which overcomes the sensitivity limitation of

the EM algorithm, was adopted in the new system, but

with a modiﬁed stopping criterion. It automatically

determines the model complexity, and gives a model

that better ﬁts the data.

Consider a ﬁnite data set X = {x

,...x

} ⊂ ℜ

and an M-component GMM ﬁnite mixture distribu-

tion of dataset X, its pdf can be written as:

p(x|θ) =

∑

m=1

p(x|µ

,σ

). (2)

where

∑

m=1

= 1 and π

≤ 1,m = 1,··· ,M. π

is the mixing parameter of the mixture model, and µ

and σ

are the mean and covariance parameters of the

Gaussian component m. Then logp(X|θ) is deﬁned

as:

logp(X|θ) =

∑

i=1

log

∑

m=1

p(x

(i)

|µ

,σ

). (3)

θ = {π,µ,σ} (4)

Usually an EM algorithm is applied to obtain the max-

imum likelihood (ML) estimate

= argmax

{logp(X|θ)} (5)

The EM algorithm (McLachlan and Krishnan, 1997)

runs iteratively with an E-step followed by an M-step.

The E-step computes the conditional expectation of

the complete log-likelihood, given the dataset and

current estimate of all parameters - the so-called Q-

function. The M-step updates the estimate of θ in or-

der for it to maximize the Q-function (McLachlan and

Krishnan, 1997). The EM algorithm will monotoni-

cally increase the logp(x|θ) value until it reaches one

of the local maxima. Using the EM algorithm to train

the GMM model, label variables Z = {z

,··· ,z

} are

introduced so as to indicate which component pro-

duces which sample. The conditional expectations of

, where z

is the label which shows whether data

is produced by component m at iteration t, is given

by:

) = E[z

] =

p(x

)

∑

j=1

p(x

)

(6)

and π

t+1

will be updated thus:

∑

i=1

)/n; (7)

is the posteriori probability of Z

, given the ob-

servation x

. Since the EM algorithm theoretically

searches the local maxima, its result is sensitive to

its initialization values. The results can always be

adjusted to get a better global maximum by split-

ting or merging the components (Ueda et al., 2000).

Figueiredo and Jain (2002) developed an algorithm

that successfully determines the number of compo-

nents used in building the GMM, and, at the same

time, optimizes the distribution of components. It

seamlessly integrates the Minimum Message Length

(MML) criterion into the EM algorithm. MML, like

BIC, is a model selection criterion. It is based on

information coding theory and was ﬁrst developed

by (Wallace and Dowe, 1987). It prefers the model

which minimizes the right-hand side of Eq.8:

Length(

θ,X) = −logp(

θ) − logp(X|

θ)

log|I(

θ)| + p/2(1+ log(1/12)), (8)

IMPROVEMENTS IN SPEAKER DIARIZATION SYSTEM

319

where p is the number of parameters in the model.

Figueiredo and Jain (2002) assumed an independence

between the mixing parameter π and the compo-

nent parameter

θ(m), and expresses a standard non-

informative Jeffreys’s prior for π, and

θ,

) ∝

(1)

(

)|; (9)

p(π

,··· ,π

) =

|H| = (π

π2· ··π

)

−1/2

; (10)

and approximates |I(θ)| to |I

(θ)|, the complete-data

Fisher information matrix, which is described by

Eq.11:

(θ) = n∗ block− diag{π

(

),··· ,π

(

),H},

(11)

where I

(θ

) is the Fisher matrix for a single observa-

tion produced by the mth component, with |H| deﬁned

in Eq.10. Then Eq.8 can be rewritten as:

Length(

θ,X) =

∑

m=1

log(nπ

) +

(1+ log

)

log(n),

(12)

where p in Eq.8 is equal to NM + M, where N is the

number of parameters in each component. When the

of a component m is equal to 0, the component

will no longer contribute to the model, so then M is

the number of components whose mixing parameter

is larger than 0. Eq.12 is the criterion provided by

Figueiredo and Jain (2002), which can be integrated

into the EM algorithm in a closed form. In Eq.12,

the term

∑

m=1

log(nπ

) containing the variable π

is combined with Eq.5; then this is maximized:

∂(logp(X|

θ))

∂π

+ λ(

∑

m=1

− 1) +

∑

m=1

log(nπ

)) = 0,

(13)

Instead of Eq.7, Eq.14 is calculated as the update

value for π

n− MN/2

(

∑

i=1

max(o

) − N/2,0)) (14)

where o

) is deﬁned as in Eq.6.

By initializing the model complexity to a large

value M, this algorithm will reduce the complexity

value to M

by removing the components that have

insufﬁcient evidence to support them. But there may

be an additional decrease in Eq.12 caused by a de-

crease in M

. So the algorithm needs to compute

the criterion value for each possible M

in order to

get the best result. This is computationally expensive

when the number of components is large. An alter-

native stopping criterion is the local Kullback-Leibler

divergence criterion in which those components with

the lowest local Kullback-Leibler divergence are re-

moved and the model retrained until the value of the

right side of Eq.12 no longer decreases. The local

Kullback-Leibler divergence criterion of component

m is described by Eq.15:

merge

(m|

θ) =

(x|

θ)log

(x|

θ)

′

(x|

θ)

dx, (15)

where f

is the current density function g, and f

′

the density function g

′

, whose component m has been

removed, all weighted by the posteriori function of m:

(x|

θ) = g(x

)p(m|x

θ); (16)

′

(x|

θ) = g

′

(m|x

θ); (17)

3.2 UBM and MAP Adaptation

Sometimes the segment is short and may be disturbed

by the acoustic environment so that the model on

which it is based is not sufﬁcient to represent the

speaker characteristics. Then the Universal Back-

ground Model (UBM) is introduced to derive the seg-

ment models as referred to earlier in section 2.

UBM is always pre-trained from other speech cor-

pora. The audio ﬁle itself can be used to train the

UBM, or a combination of corpora and the speech ﬁle

may be used (Sinha et al., 2005) (Barras et al., 2006).

The authors adopted the audio ﬁle itself to build the

UBM and used the technique described in the last sec-

tion to develop the background model. A mean-only

MAP adaptation is always used to build the segment

model from the UBM. It updates the components’

means in the segment model by adapting them from

the UBM gradually. However, sometimes the seg-

ments produced by the system are very short and not

sufﬁcient to cover the space modelled by the UBM.

As a result there may be some loss of segment char-

acteristics - hence the reason why this technique is

always applied as a secondary stage.

The authors’ system adjusts the weight parameter

in the UBM and removes those components for which

there is insufﬁcient evidence (according to Eq.14) for

their retention. This step is used at the beginning

of the process, so determining the complexity of the

UBM model is important. The adapted weight esti-

mator is described by Eq.18

max(

∑

i=1

p(m|x

) − N/2,0)

∑

m=1

max(

∑

i=1

p(m|x

) − N/2,0)

, (18)

SIGMAP 2007 - International Conference on Signal Processing and Multimedia Applications

320

where N is the number of parameters in each compo-

nent, and p(m|x

) is the posteriori probability of com-

ponent m given x

. Usually the mean-MAP adaptation

is then performed (Barras and Gauvain, 2003):

ˆµ

(

∑

i=1

p(m|x

)

∑

i=1

p(m|x

)

, (19)

This adaptation is performed for two iterations.

3.3 Cross-likelihood Ratio

CLR is used in place of BIC as the merging measure

adopted in most post-processing stages. Between any

two given segments s

and s

, CLR is deﬁned by:

CLR(s

) = log(

L(x

|θ

)L(x

|θ

)

L(x

|θ

ubm

)L(x

|θ

ubm

)

), (20)

where L(x

|θ

) is the average likelihood of the acous-

tic feature being in segment i given the model j, thus

removing the inﬂuence of the length of the segments

(Sinha et al., 2005). The CLR values are computed

for each pair of segments to form the CLR matrix.

3.4 Intra-cluster/Inter-cluster Ratio

Using BIC to merge and judge the stopping of the

speaker diarization task is, computationally, a local

solution. It considers those pairs of segments that

have the highest BIC score, without a global view of

the overall similarity between segments. The authors

found that it always under-estimated the number of

speakers appearing in the audio ﬁle. Futhermore in

the new system, CLR is applied as the merging cri-

terion instead of BIC; and BIC is no longer the natu-

ally stopping criterion. So in place of BIC, an intra-

cluster/inter-cluster ratio was used as the stopping cri-

terion. Intra-cluster/inter-cluster ratio compares the

way in which a state model represents its features,

with how the models of other states represent those

features; in so doing, it takes a global view of all the

states. Assuming that there are k clusters left in the

HMM, the intra-cluster/inter-cluster ratio is computed

by Eq.21:

intra− cluster

inter − cluster

∑

i=1

CLR(s

)

∑

j, j6=i

CLR(s

)

. (21)

The new system iteratively merges the segments un-

til there is only one segment left. Then the speaker

diarization solution is the cluster solution with mini-

mum intra-cluster/inter-cluster ratio.

4 EXPERIMENT

4.1 Data

The experiments reported in this paper used 36

recorded meetings as evaluation data, 18 from the

Interactive Systems Laboratories (ISL) part A meet-

ing corpus (ISL, 2004) and 18 from the International

Computer Science Institute (ICSI) meeting speech

corpus (ICSI, 2004). The average meeting length

was around 40 minutes, and the number of speakers

present in the meetings varied from 3 to 9. In the

meeting audio ﬁles, speaker changes are frequent, and

most have a short duration. Many exchanges overlap

with each other, hindering the speaker diarization.

4.2 Speaker Diarization Error

The baseline metric of performance is the diarization

error rate (DER), which is deﬁned by NIST (2004).

It is obtained by comparing the results of diarization,

with a manually labelled transcript. The DER process

recognises three error types: the missed rate (speech

in the transcript that is not found by the system un-

der test), false alarm rate (speech found by the system

that is not in the transcript) and speaker error rate (the

sound is assigned to the wrong speaker). These are

computed by matching the speakers assigned by the

system to those in the transcript, using a one-to-one

mapping that maximises the total overlap between the

transcript and system speakers (as explained in (NIST,

2004)).

The main application of the new system is to index

and cluster the audio ﬁles. Usually the type of non-

speech appearing in the meeting audio ﬁles is silence

and noise. The false alarm rate does not inﬂuence the

indexing and clustering, nor does the missed speech

fragments as these last for less than 0.3 seconds. So an

energy based speech/non-speech detection is enough.

Since DER is a time-weighted evaluation measure, it

is primarily driven by dominant speakers. To reduce

the DER error rate, it is more important to ﬁnd the

main speakers completely and correctly rather than

accurately ﬁnd the speakers who do not speak very

often. Therefore when there is a dominant speaker

in the audio ﬁle, the DER measure fails to evaluate

how well the sytem ﬁnds other speakers. So segment

purity (SEGP) and speaker weighted speaker purity

(SPKP) are also introduced to evaluate the system.

Segment purity is a measure of how much speech in a

segment belongs to the correct speaker. Speaker pu-

rity measures how much speech is assigned correctly

to a speaker, weighted by the number of speakers.

IMPROVEMENTS IN SPEAKER DIARIZATION SYSTEM

321

In the new system, the 19

order MFCC was used

as acoustic feature vectors, extracted from 30 mil-

lisecond analysis windows each overlapping the pre-

vious one by 20 milliseconds. The initial guess at the

number of speakers was 40, and the minimum length

constraint was 0.9 seconds.

4.3 Conversation Overlap

Since there are many oral exchanges that overlap each

other in the audio ﬁles, it is possible that these re-

duce system performance, and this was investigated

in three experiments. First, the overlapping parts

were removed from the audio ﬁle and the diarization

system applied. Second, the overlapping parts were

included in the speaker diarization process, but ig-

nored in the evaluation process. Third, the overlap-

ping parts were included in the speaker diarization

process and, during evaluation, those parts were as-

signed to a speaker. Perhaps surprisingly, both the

second and the third experiments gave better results

than the ﬁrst experiment. It seems that the overlap-

ping parts actually help to build the speaker models

and thus do not degrade the recognition rate. How-

ever, sometimes the overlapped parts are regarded as

new speakers, making the number of people detected

more than those actually present. The third approach

gives the best result because the overlapping speech

assigned to either of the speakers is thought to be cor-

rect.

4.4 Model Complexity

Three different methods for determining the model

complexity were investigated: ﬁrst, ﬁxing the num-

ber of Gaussians in the GMM; second, automatically

determining the value for model complexity depend-

ing on the segment length - the method adopted by

Anguera et al. (2006); and third, doing the same as

number two, but using the approach introduced for

the new system. A diagonal GMM was used and over-

lapped speech was ignored during the evaluation. All

three were integrated into the original ICSI system,

and their results are compared in Table 1. The ap-

proach used in the new system performed well when

the segments were short. However, it is computation-

ally expensive to ﬁnd the number of components in

a long segment, and only gives slightly better results.

Since the approach is robust to the initialization envi-

ronment, its results are stable.

Table 1: Model-Complexity-Determining Methods.

method DER(%) SEGP(%) SPKP(%)

5 Gaussian/GMM 25.36 26.83 32.76

8 Gaussian/GMM 26.35 28.76 34.76

Anguera et al 22.25 22.76 28.76

authors’ approach 22.74 24.76 30.76

4.5 Applying the UBM

This part provides the biggest improvement. The new

EM approach used to train the GMM can automati-

cally remove the inefﬁcient components, helping to

adapt the segment model from the UBM, especially

for short segments. Table 2 shows the difference be-

tween the segment model adapted from the UBM,

and that obtained from the ICSI system. Again, the

overlapped speech was ignored during the evaluation.

The UBM was built from the audio data itself and its

model complexity was automatically determined. The

segment models were then adapted from the UBM by

weight and mean-MAP. BIC was retained for deter-

mining the merging and stopping criteria.

Table 2: Effect of UBM-MAP Adaptation.

method DER(%) SEGP(%) SPKP(%)

UBM-MAP 19.74 21.85 28.36

Non UBM-MAP 25.36 26.83 32.76

4.6 Merging and Stopping Criteria

Finally, CLR and intra-cluster/inter-cluster ratio were

used instead of BIC for the merging and stopping

criteria. CLR seems to work well with the UBM

adapted segment model. It considers the similar-

ity between the segments and their similarity with

the UBM. The intra-cluster/inter-cluster stopping cri-

terion still under-estimates the number of speakers;

this may happen when there is a dominant speaker

at the meeting or the utterances from some speakers

are short. This stopping criterion almost correctly de-

tected the number of speakers, but the overall speaker

error rate was degraded. As shown in Table 3, the new

system reduces the error rate from 25.36% to 21.37%,

an improvement of about 19%. These results are ob-

Table 3: Overall Performance (without the overlap speech).

method DER(%) SEGP(%) SPKP(%)

Baseline System 25.36 26.83 32.76

New System 21.37 22.56 27.40

tained when the overlapped parts were ignored during

SIGMAP 2007 - International Conference on Signal Processing and Multimedia Applications

322

the evaluation. The best results, for both the baseline

system and the new one, were achieved when the re-

sults were evaluated using the approach that included

the overlapped parts, as shown in Table 4.

Table 4: Overall Performance (including the overlap).

method DER(%) SEGP(%) SPKP(%)

Baseline System 21.76 23.23 29.76

New System 17.21 18.56 26.40

In comparison with the ICSI system the diariza-

tion error rate was reduced from 21.76% to 17.21%.

5 CONCLUSION

This paper has described a new speaker diarization

system for natural, multi-speaker meeting conversa-

tions based on a single microphone. The system re-

quires no prior training, and the experiments showed

that the performance achieved 21.37% speaker di-

arization error rate. It is expected that different

speech/non-speech detection techniques and further

purity tests will improve the performance of the sys-

tem.

REFERENCES

Ajmera, J. and Lapidot, I. (2002). Improved unknown-

multiple speaker clustering using hmm. In IDIAP RR.

pp.02–23.

Barras, C. and Gauvain, J. L. (2003). Feature and score

normalization for speaker veriﬁcation of cellular data.

In ICASSP Proc.

Barras, C., Zhu, X., Meignier, S., and Gauvain, J. L. (2004).

Improving speaker diarization. In Fall 2004 Rich tran-

scription Workshop (RT-04) Proc.

Barras, C., Zhu, X., Meignier, S., and Gauvain, J. L. (2006).

Multistage speaker diarization of broadcast news. In

IEEE Trans. SL Proc. pp.1505–1512.

ICSI (2004). Icsi meeting speech. In In-

ternational Computer Science Institute.

http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?

catalogId=LDC2004S02.

ISL (2004). Isl meeting speech part 1,

2004. In Interactive Systems Laboratories.

http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?

catalogId=LDC2004S05.

Jin, Q., Laskowski, K., Schultz, T., and Waibel, A. (2004).

Speaker segmentation and clustering in meetings. In

International Conference on Acoustics, Speech, and

Signal Processing (ICASSP), Meeting Recognition

Workshop Proc.

McLachlan, G. and Krishnan, T. (1997). The EM algorithm

and extensions. John Wiley & Sons, New York, 1st

edition.

NIST (2004). Fall 2004 rich transcription (rt-

04) evaluation plan, 2004. In National In-

stitute of Standards and Technology. Avail-

able:http://www.nist.gov/speech/tests/rt/rt04

/fall/docs/rt04f-eval-plan-v14.pdf.

Schwarz, G. (1978). Estimating the dimension of a model.

In Annals of Statistics Proc. Vol.6, pp.461–464.

Sinha, R., Tranter, S. E., Gales, M. J., and Woodland, P. C.

(2005). The cambridge university march 2005 speaker

diarization system. In Eur. Conf. Speech Communica-

tion Technology, Proc. pp.2437–2440.

Tranter, S. E. and Reynolds, D. A. (2006). An overview of

automatic speaker diarization systems. In IEEE Trans.

Speech and Language (SL) Proc. Vol.14, pp.1557–

1565.

Ueda, N., Nakano, R., Gharhamani, Z., and Hinton, G.

(2000). Smem algorithm for mixture models. In Neu-

ral Computation Proc. Vol.12, pp.2109–2128.

Wallace, C. and Dowe, D. (1987). Estimation and inference

via compact coding. In J. Royal Statistical Soc. (B).

Vol.49, pp.241–252.

Wooters, C., Fung, J., Peskin, B., and Anguera, X. (2004).

Toward robust speaker segmentation: Icsi-sri fall 2004

diarization system. In Fall 2004 Rich transcription

Workshop (RT-04) Proc.

Zhou, B. and Hansen, J. (2000). Improving speaker diariza-

tion. In Int. Conf. Spoken Langrage Process Proc.

Vol.3, pp.714–717.

IMPROVEMENTS IN SPEAKER DIARIZATION SYSTEM

323