results show a significant improvement in the
classification accuracy obtained by combining the
new feature with the HZCRR, the LSTER, or both.
Table 2: Classification errors (percentage) with different
combinations of the three features using SVM and GMM.
As observed, the total errors obtained with all three
features are 4.69% and 3.17% with the SVM and the
GMM classifiers, respectively. To verify the
effectiveness of the proposed features, we extended
the evaluation from the segment level (one-second
windows), described earlier, to the file level, where
each file is labeled by a majority vote over its
segments. Using the same speech-music database, the
file-level error was only 1.63%, i.e. one misclassified
speech file out of 61 speech-music test files.
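The file-level majority vote described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the label strings, and the tie-breaking rule are assumptions introduced here, since the text does not specify them.

```python
from collections import Counter


def file_level_label(segment_labels):
    """Return the majority label among one file's per-segment predictions."""
    if not segment_labels:
        raise ValueError("a file needs at least one segment prediction")
    counts = Counter(segment_labels)
    # Ties go to the label seen first; this rule is an assumption,
    # the paper does not say how ties are handled.
    return counts.most_common(1)[0][0]


def file_level_error(files):
    """Percentage of files whose majority label differs from the true label.

    `files` is a list of (true_label, segment_labels) pairs (hypothetical layout).
    """
    wrong = sum(1 for true, segs in files if file_level_label(segs) != true)
    return 100.0 * wrong / len(files)


# 60 correctly voted files plus one speech file voted as music.
files = [("speech", ["speech", "speech", "music"])] * 60
files.append(("speech", ["music", "music", "speech"]))
error_percent = file_level_error(files)  # equals 100 / 61
```

With one misclassified file out of 61, the error is 100/61, which corresponds to the 1.63% figure reported in the text.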
As shown in Table 2, when the BDFV is used, better
classification results are achieved for music files
than for speech. Most sounds generated by
musical instruments have a harmonic structure,
whereas speech signals may have a mixed
harmonic/non-harmonic structure due to their
diverse voicing characteristics. This diversity
is well captured by the sinusoidal model, which
measures the harmony of the audio signals.
Nevertheless, the BDFV feature of the sinusoidal
model, together with the HZCRR and the LSTER, forms a
powerful feature set for speech/music discrimination.
Further performance improvement can be expected
from combining other features of the sinusoidal
model, as an extension to this work.
6 CONCLUSIONS
In this study, we have proposed a new feature based
on the sinusoidal model, called BDFV, for classifying
audio into speech and music. This feature is the
variance of the birth-death frequencies in the
sinusoidal model of an audio signal and serves as a
measure of the harmony. Our classification results
show that this feature has a high discriminating
performance compared to typical features such as the
HZCRR and the LSTER, which are widely used for audio
classification. A higher classification performance
is achieved by combining this new feature with the
HZCRR and the LSTER, as evaluated using the
model-based, threshold-insensitive GMM and SVM
classifiers. This work shows that the sinusoidal
model features are very effective in audio
classification, owing to the capability of the model
to identify the harmonic structure.
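As a rough illustration of the definition above (the variance of the birth-death frequencies), the sketch below computes a variance over the frequencies at which sinusoidal tracks start and end within one analysis window. The track representation (a list of per-frame frequencies in Hz), the function names, the use of the population variance, and the value returned for an empty window are all assumptions made here; the sinusoidal tracking step itself is not shown, and the authors' exact formulation may differ.

```python
from statistics import pvariance


def birth_death_frequencies(tracks):
    """Collect the first (birth) and last (death) frequency of each track.

    `tracks` is a hypothetical representation: one list of per-frame
    frequencies in Hz for every sinusoidal track found in the window.
    """
    freqs = []
    for track in tracks:
        if track:
            freqs.append(track[0])   # frequency at which the track is born
            freqs.append(track[-1])  # frequency at which the track dies
    return freqs


def bdfv(tracks):
    """Population variance of the birth-death frequencies in one window."""
    freqs = birth_death_frequencies(tracks)
    # Returning 0.0 for a window without tracks is an assumption.
    return pvariance(freqs) if freqs else 0.0


# Two toy tracks: one near 440 Hz, one at 880 Hz.
example = bdfv([[440.0, 441.0, 440.0], [880.0, 880.0]])
```

In this toy case the birth-death frequencies are 440, 440, 880 and 880 Hz, so the variance is 220 squared.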
Features/Classifier     Total   Vocal Music   Non-Vocal Music   Speech
Total length (sec)      915     300           315               300
HZCRR+LSTER/SVM         12.13   10.66         15.87             9.66
HZCRR+BDFV/SVM          5.46    0.66          2.53              13.33
LSTER+BDFV/SVM          4.91    0.33          2.22              12.33
HZCRR+LSTER+BDFV/SVM    4.69    0             2.22              12
HZCRR+LSTER+BDFV/GMM    3.17    0.66          1.58              9.66
SIGMAP 2008 - International Conference on Signal Processing and Multimedia Applications