and includes the speech of several speakers in one audio channel (Hub, 1997). To evaluate speaker changing point detection performance, two criteria were used: the precision of the detected speaker changing points and the number of missed changing points. Precision is the percentage of detected turning points that are true turning points. Recall is the percentage of true turning points that were actually detected, i.e., it reflects how few of them were missed. The two are combined into an F-score, which characterizes how good a system is: it is high only when both precision and recall are high, and it is low when either of them is low (Nishida and Kawahara, 2003).
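As a concrete illustration of this scoring (not part of the evaluation protocol above), the following Python sketch computes precision, recall and F-score for a list of detected changing points against reference points. It assumes a detection is counted as correct when it falls within a fixed tolerance of a still-unmatched true changing point; the 0.5 s tolerance is an assumed value.

```python
def score_change_points(detected, reference, tolerance=0.5):
    """Return (f_score, precision, recall) for lists of change-point times in seconds."""
    matched_refs = set()
    true_positives = 0
    for t in detected:
        # Greedily match the detection to the closest still-unmatched reference point
        # that lies within the tolerance window (0.5 s is an assumed value).
        candidates = [(abs(t - r), i) for i, r in enumerate(reference)
                      if i not in matched_refs and abs(t - r) <= tolerance]
        if candidates:
            _, best = min(candidates)
            matched_refs.add(best)
            true_positives += 1
    precision = true_positives / len(detected) if detected else 0.0
    recall = true_positives / len(reference) if reference else 0.0
    f_score = (2 * precision * recall / (precision + recall)
               if precision + recall > 0 else 0.0)
    return f_score, precision, recall


# Example: three detections, two of which fall close to true turning points.
print(score_change_points(detected=[3.1, 7.8, 12.0], reference=[3.0, 8.0, 15.2]))
```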
Table 1: F-score, precision and recall for different features and their combination via SVM. d is the dimensionality of the acoustic feature vectors.

Feature        d    F-score  Precision  Recall
MFCC           26   0.62     0.61       0.62
MLSF           10   0.42     0.29       0.80
pH_4           5    0.52     0.67       0.43
pH_6           4    0.53     0.67       0.44
pH_12          3    0.55     0.68       0.46
HOCOR_1        6    0.42     0.54       0.35
HOCOR_2        5    0.37     0.47       0.30
HOCOR_3        4    0.31     0.39       0.26
HOCOR_4        3    0.30     0.38       0.25
FrFTMFCC_0.9   12   0.61     0.73       0.56
SVM_1          10   0.64     0.72       0.58
SVM_2          6    0.65     0.75       0.58
Table 1 (except for the two bottom rows) shows the speaker changing point detection results achieved when different acoustic features were used to calculate the variance BIC and the peak detection algorithm was applied to the BIC values to locate speaker changing points. It is worth noting that the pH features give F-scores comparable to those of the MFCC features, even though the dimensionality of the pH feature vectors is far lower than that of MFCC. This suggests that pH features may be a better choice when the training data set is small.
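A minimal sketch of this detection pipeline is given below. It assumes that "variance BIC" refers to the ΔBIC criterion of Chen and Gopalakrishnan (1998) computed with diagonal (variance-only) Gaussians; the window length, penalty weight and peak threshold are illustrative values, not the settings used in the experiments.

```python
import numpy as np

def delta_bic(left, right, lam=1.0):
    """Delta-BIC for a candidate change point between two blocks of feature vectors.

    left, right: arrays of shape (n_frames, d). Diagonal ("variance only")
    Gaussians are assumed; a positive value favours a speaker change.
    """
    both = np.vstack([left, right])
    n1, n2, n = len(left), len(right), len(both)
    d = both.shape[1]

    def log_det(x):
        # Log-determinant of a diagonal covariance: sum of per-dimension log variances.
        return np.sum(np.log(np.var(x, axis=0) + 1e-10))

    # BIC penalty for 2*d free parameters (means and variances of one diagonal Gaussian).
    penalty = 0.5 * (2 * d) * np.log(n)
    return (0.5 * n * log_det(both)
            - 0.5 * n1 * log_det(left)
            - 0.5 * n2 * log_det(right)
            - lam * penalty)

def detect_changes(features, window=100, step=10, threshold=0.0):
    """Slide a fixed window pair over the frames and keep local BIC maxima as changes."""
    times, scores = [], []
    for t in range(window, len(features) - window, step):
        scores.append(delta_bic(features[t - window:t], features[t:t + window]))
        times.append(t)
    changes = []
    for i in range(1, len(scores) - 1):
        # Simple peak picking: a local maximum above the threshold is a changing point.
        if scores[i] > threshold and scores[i] > scores[i - 1] and scores[i] > scores[i + 1]:
            changes.append(times[i])
    return changes
```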
The features used for SVM combination 1 (SVM_1) are the 10 variance BIC values resulting from the 10 acoustic features. The results in Table 1 show that the proposed SVM speaker changing point detection scheme improves detection performance compared to each of the individual acoustic features, reaching a higher F-score of 0.64. This means that acoustic features originally proposed for the speaker recognition problem can be used for the speaker segmentation problem as well. Because both precision and recall were low for the HOCOR features, a combination of the acoustic features without the HOCOR features was also attempted. The results (SVM_2 in Table 1) were comparable with those of SVM_1. However, eliminating any of the other acoustic features from the combination degraded the speaker segmentation performance.
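The sketch below illustrates one way such a combination could be implemented: the per-feature ΔBIC values at each candidate changing point are stacked into a vector and classified by an SVM. The scikit-learn API, the RBF kernel and the placeholder training data are assumptions for illustration, not details reported here.

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical training data: each row holds the variance-BIC values computed at one
# candidate changing point from the 10 acoustic feature streams (MFCC, MLSF, pH,
# HOCOR, FrFTMFCC, ...); the labels mark true speaker changing points.
X_train = np.random.randn(200, 10)            # placeholder stacked BIC vectors
y_train = np.random.randint(0, 2, size=200)   # placeholder 0/1 labels

clf = SVC(kernel="rbf", C=1.0)                # kernel and C are illustrative choices
clf.fit(X_train, y_train)

# At test time, candidate points whose stacked BIC vector is classified as 1
# are reported as speaker changing points.
X_test = np.random.randn(5, 10)
print(clf.predict(X_test))
```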
This study demonstrates that the new features do carry information about speaker differences beyond that in MFCC features, and some of them are also attractive because of their low dimensionality. Further study may find better ways to integrate the complementary information about speaker differences contained in the new features with traditional features such as MFCC and LPCC.
REFERENCES

(1997). NIST HUB-4E Broadcast News Evaluation.

Ajmera, J., McCowan, I., and Bourlard, H. (2004). Robust speaker change detection. IEEE Signal Process. Lett., 11(8).

Chen, S. and Gopalakrishnan, P. (1998). Speaker, environment and channel change detection and clustering via the Bayesian Information Criterion. In DARPA Speech Recognition Workshop, pages 127–132.

Cordeiro, H. and Ribeiro, C. (2006). Speaker characterization with MLSF. In Odyssey 2006: The Speaker and Language Recognition Workshop, San Juan, Puerto Rico.

Nishida, M. and Kawahara, T. (2003). Unsupervised speaker indexing using speaker model selection based on Bayesian Information Criterion. In Proc. IEEE ICASSP, volume 1, pages 172–175.

Oppenheim, A. and Schafer, R. (2004). From frequency to quefrency: a history of the cepstrum. IEEE Signal Processing Magazine, (5):95–106.

Sant'Ana, R., Coelho, R., and Alcaim, A. (2006). Text-independent speaker recognition based on the Hurst parameter and the multidimensional fractional Brownian motion model. IEEE Trans. Acoust., Speech, Signal Process., 14(3):931–940.

Veith, D. and Abry, P. (1998). A wavelet-based joint estimator of the parameters of long-range dependence. IEEE Trans. Inf. Theory, 45(3):878–897.

Wan, V. and Campbell, M. (2000). Support vector machines for speaker verification and identification. pages 775–784.

Zheng, N. and Ching, P. (2004). Using Haar transformed vocal source information for automatic speaker recognition. In IEEE ICASSP, pages 77–80, Montreal, Canada.