SPEECH/MUSIC DISCRIMINATION BASED ON WAVELETS FOR
BROADCAST PROGRAMS
E. Didiot, I. Illina, O. Mella, D. Fohr, J.-P. Haton
LORIA-CNRS & INRIA Lorraine
BP 239, 54506 Vandoeuvre-les-Nancy, France
Keywords:
Speech/music discrimination, wavelets, static and dynamic parameters, long-term parameters, classifier fusion.
Abstract:
The problem of speech/music discrimination is a challenging research problem which significantly impacts
Automatic Speech Recognition (ASR) performance. This paper proposes new features for the speech/music
discrimination task. We propose to use a decomposition of the audio signal based on wavelets, which allows
a good analysis of non-stationary signals like speech or music. We compute different types of energy in each
frequency band obtained from the wavelet decomposition. Two class/non-class classifiers are used: one for
speech/non-speech, one for music/non-music. On the broadcast test corpus, the proposed wavelet approach
gives better results than the MFCC one. For instance, we obtain a significant relative improvement of 39% in
error rate for the speech/music discrimination task.
1 INTRODUCTION
Discrimination between speech and music consists in
segmenting an audio stream into acoustically homogeneous
segments such as speech, music and speech
on music. This segmentation task plays an important
role in various multimedia applications. Let us
mention several examples. For automatic transcription
of broadcast news or programs, non-speech segments
must be discarded to avoid a high recognition error
rate. Audio indexing of multimedia documents
requires that music segments be labelled.
Speech/music discrimination can speed up subtitling
by allowing non-speech segments to be skipped.
Automatic real-time captioning of live TV
transmissions of events also needs speech/non-speech
detection.
Speech/music discrimination requires two steps: parameterization
and classification of the audio signal.
The parameterization step consists in extracting discriminative
features from the audio signal. This article
presents a new approach for speech/music discrimination
based on the wavelet decomposition of
the signal. To our knowledge, such an approach has
never been used for this task. Our motivation for applying
wavelets to speech/music discrimination is their
ability to extract time-frequency features and to deal
with non-stationary signals. Moreover, the multi-band
decomposition made by the dyadic wavelet
transform is close to the one made by the human
ear (Daubechies and Maes, 1996). Therefore, we study several
features based on wavelet decomposition and test
them on some broadcast programs. We also compare
their performance with Mel Frequency Cepstral
Coefficients (MFCC), because the latter have shown
good results in speech/music discrimination (Carey
et al., 1999; Logan, 2000), in music modeling (Logan,
2000) and in musical genre classification (Tzanetakis
and Cook, 2002). Besides, MFCC features are widely
used in speech recognition.
The classification step consists in classifying the audio
signal into different categories: speech, music,
speech on music. For that, two approaches can be
considered: either a “class/non-class” approach that
builds a classifier for each category, or a “competing”
approach allowing several categories to compete
within a single classifier.
Moreover, both approaches can use different classification
methods: k-Nearest Neighbours (kNN), Gaussian
Mixture Models (GMM), Hidden Markov Models
(HMM), Neural Networks, etc.
We decide to use the class/non-class approach with
the intent of obtaining the best parameterization for each
category. The classification method is based on a Viterbi
algorithm using HMM models, because this simul-
taneously performs classification and segmentation.
Besides, in order to decrease the error rate, a classifier
fusion is evaluated.
This paper is organized as follows. Section 2 introduces
the new features. Section 3 describes
our speech/music classification system. Section 4
presents the training and test corpora. Experiments
are detailed in Section 5: the speech/non-speech and
music/non-music discriminations and then the classifier
fusion. Finally, Section 6 gives some conclusions.
2 WAVELET-BASED
PARAMETERS
Wavelet-based signal processing has been successfully
used for various problems: for example, in denoising
tasks or, more recently, in automatic speech recognition
(Sarikaya and Hansen, 2000; Deviren, 2004).
The Discrete Wavelet Transform (DWT) analyses the signal
in different frequency bands with various resolutions,
which allows a simultaneous analysis
in the time and frequency domains. S. Mallat (Mallat,
1998) has shown that such a decomposition can
be obtained by successive low-pass (G) and high-pass
(H) filterings of the time domain signal, down-sampling
the signal by 2 after each filtering. This
process is repeated on the output of the low-pass filtering
until the required number of frequency bands is obtained.
Figure 1 shows a two-level decomposition, where the symbol ↓2 denotes down-sampling by 2.

[Figure 1: Discrete Wavelet Transform. The input signal is passed through a high-pass filter (H) and a low-pass filter (G), each followed by down-sampling by 2; the low-pass branch is decomposed again, yielding detail (wavelet) coefficients at each scale and approximation coefficients at the lowest scale.]

The signal is decomposed into approximation coefficients and detail coefficients. Approximation coefficients
correspond to local averages of the signal.
Detail coefficients, named “wavelet coefficients”, depict
the differences between two successive local averages,
i.e. between two successive approximations of
the signal.
For the speech/music discrimination task, we propose
to use only the wavelet coefficients to parameterize the
acoustic signal. The wavelet coefficients capture
the sudden modifications of the signal: indeed,
they take high values during such events. In our study,
we compute a dyadic wavelet transform, corresponding
to an octave-band filter bank.
The dyadic wavelet transform performs a non-uniform
bandwidth decomposition of the signal, so the
frequency resolution decreases as frequency increases.
This wavelet decomposition thus gives a multi-resolution
analysis of the signal: a fine time resolution and a coarse
frequency resolution at high frequencies, and inversely
at low frequencies.
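As an illustration, the following sketch (not the authors' code) shows how such a dyadic decomposition of one analysis frame can be obtained while keeping only the detail coefficients of each band; it assumes the PyWavelets package, whose 'db4' and 'coif1' names correspond to the wavelet families used later in this paper.

```python
# Hypothetical sketch of the dyadic decomposition of one analysis frame.
# Assumption: the PyWavelets (pywt) package is available.
import pywt

def wavelet_bands(frame, wavelet="coif1", levels=5):
    """Return the detail ("wavelet") coefficients of each band, coarsest band first."""
    # wavedec returns [approximation_L, detail_L, detail_(L-1), ..., detail_1]
    coeffs = pywt.wavedec(frame, wavelet, level=levels)
    return coeffs[1:]  # drop the approximation; keep the L detail bands
```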
Several features based on energy are computed on
the wavelet coefficients in each frequency band. In the
following, $w_k^j$ denotes the wavelet coefficient at position
$k$ in band $j$, $N_j$ denotes the number of coefficients
in band $j$, and $f_j$ the feature for band $j$.
We compute:

Logarithm of energy (E). The instantaneous energy:
$$f_j = \log_{10}\left(\frac{1}{N_j}\sum_{k=1}^{N_j}\left(w_k^j\right)^2\right) \qquad (1)$$

Logarithm of Teager energy (TE). The discrete Teager Energy Operator (TEO) introduced by Kaiser (Kaiser, 1990) is used:
$$f_j = \log_{10}\left(\frac{1}{N_j}\sum_{k=1}^{N_j-1}\left|\left(w_k^j\right)^2 - w_{k-1}^j\, w_{k+1}^j\right|\right) \qquad (2)$$
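A minimal sketch of these two band-wise features, written as we read Eqs. (1) and (2); the small epsilon guard against log(0) and the helper names are our own additions.

```python
import numpy as np

def log_energy(w, eps=1e-12):
    """Eq. (1): log10 of the average squared wavelet coefficient of one band."""
    return np.log10(np.mean(w ** 2) + eps)

def log_teager_energy(w, eps=1e-12):
    """Eq. (2): log10 of the averaged absolute Teager energy of one band."""
    teager = np.abs(w[1:-1] ** 2 - w[:-2] * w[2:])   # |w_k^2 - w_{k-1} w_{k+1}|
    return np.log10(np.sum(teager) / len(w) + eps)

def band_features(detail_bands, energy=log_energy):
    """One value per band, i.e. a 5- or 7-dimensional vector per analysis frame."""
    return np.array([energy(np.asarray(b)) for b in detail_bands])
```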
3 SPEECH/MUSIC
DISCRIMINATION SYSTEM
3.1 Parameterization
The signal is sampled at 16kHz. After pre-emphasis,
we use a 32ms Hamming window with a 10ms shift.
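For illustration, a possible framing front-end is sketched below (our own helper, not taken from the paper; in particular the 0.97 pre-emphasis coefficient is an assumption, since the paper does not state it).

```python
import numpy as np

def analysis_frames(signal, rate=16000, win_ms=32, shift_ms=10, preemph=0.97):
    """Yield pre-emphasized, Hamming-windowed frames (512 samples, 160-sample hop at 16 kHz)."""
    emphasized = np.append(signal[0], signal[1:] - preemph * signal[:-1])
    win = int(rate * win_ms / 1000)    # 32 ms -> 512 samples
    hop = int(rate * shift_ms / 1000)  # 10 ms -> 160 samples
    window = np.hamming(win)
    for start in range(0, len(emphasized) - win + 1, hop):
        yield emphasized[start:start + win] * window
```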
Our parameters are:
Baseline MFCC features: 12 MFCC coefficients
with their first and second derivatives, giving a
36-coefficient vector.
Wavelet-based features: the energy features described above
are calculated on the wavelet coefficients
obtained with two wavelet families: Daubechies
wavelets and coiflets. Multiresolution parameters are
computed for two decomposition levels, i.e. for two
numbers of bands (5 and 7).
Our static features are computed on a very short
time duration (32 ms), and the question which may
be asked is: can a human ear reliably identify
such a short segment as speech or as music? We
thus decide to also study some long-term parameters.
Firstly, we test the first and second derivatives
of the energy parameters. Secondly, Scheirer and
Slaney have shown that the use of the variance computed
on a one-second window improves the results
in speech/music discrimination (Scheirer and
Slaney, 1997). Therefore, the study of this long-term
parameter seems interesting.
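The sketch below illustrates these two kinds of long-term information on top of the static band energies. It is a simplified formulation of ours: derivatives are taken as frame-to-frame differences rather than the regression formula commonly used in ASR, and the 1 s variance uses 100 frames at the 10 ms shift.

```python
import numpy as np

def add_derivatives(static, order=1):
    """static: (n_frames, n_bands). Append first (and optionally second) differences."""
    out, current = [static], static
    for _ in range(order):
        current = np.diff(current, axis=0, prepend=current[:1])
        out.append(current)
    return np.hstack(out)   # e.g. 7 bands + delta -> 14 coefficients per frame

def one_second_variance(static, window=100):
    """Variance of each static feature over a sliding 1 s window (100 frames x 10 ms)."""
    return np.array([static[max(0, i - window + 1):i + 1].var(axis=0)
                     for i in range(len(static))])
```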
3.2 System Description
Our classification approach is a “class/non-class”
one (Pinquier, 2002). In other words, class detection
is performed by comparing a class model and
a non-class model estimated on the same representation
space. Two subsystems are implemented:
speech/non-speech and music/non-music.
The decisions of both classifiers are merged and the
audio signal is classified into three categories: speech
(S), music (M), and speech on music (SM). Each class
is modelled by an HMM with between 8 and 64
Gaussians per state. The Viterbi algorithm is used to
provide the best sequence of models describing the
audio signal. A frame-by-frame decision would lead
to unrealistic 10 ms-long segments. To avoid this, a
0.5 s minimal duration is imposed for each recognized
segment.
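The rule for merging the two class/non-class decisions into the three final categories can be sketched at the frame level as follows. This is our own illustration, not the authors' code: in the actual system the segments come from the HMM/Viterbi decoding with the 0.5 s minimum duration, and the behaviour for frames rejected by both classifiers is an assumption.

```python
def merge_decisions(is_speech, is_music):
    """Combine the speech/non-speech and music/non-music outputs into S, M or SM."""
    if is_speech and is_music:
        return "SM"   # speech on music
    if is_speech:
        return "S"
    if is_music:
        return "M"
    return "M"        # frame rejected by both classifiers: an assumption, not from the paper
```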
4 CORPORA
4.1 Training Corpus
The HMM models were trained on two databases:
“Audio CDs” and “Broadcast programs”. The “Audio
CDs” corpus (120 min) is made up of several tracks
of instrumental music and songs extracted from CDs.
The “Broadcast programs” corpus (976 min) contains
programs from French radio stations: broadcast news as well
as interviews and musical programs.
4.2 Test Corpus
We carried out experiments on a broadcast corpus
composed of three 20-minute shows (interviews and
musical programs). This corpus is considered quite
difficult. Indeed, there are a lot of superimposed segments,
speech with music or songs with a
“fade in-fade out” effect. Moreover, it contains an alternation
of broad-band speech and telephone speech,
and some interviews are very noisy. It is made up of 52%
speech frames, 18% speech-on-music frames and
30% music frames. Thus, this corpus allows us
to evaluate the proposed parameterization on difficult
broadcast programs. The confidence interval is ±1% at
the 0.05 level of significance.
5 EXPERIMENTAL RESULTS
5.1 Error Rate Calculation
To evaluate our different features, three error rates are
computed as follows:
Global classification error rate:
$$100 \times \left(1 - \left(n_{SM}^{SM} + n_{M}^{M} + n_{S}^{S}\right)/T\right) \qquad (3)$$

Music/Non-Music classification error rate:
$$100 \times \left(1 - \left(n_{M}^{SM} + n_{SM}^{M} + n_{M}^{M} + n_{SM}^{SM} + n_{S}^{S}\right)/T\right) \qquad (4)$$

Speech/Non-Speech classification error rate:
$$100 \times \left(1 - \left(n_{S}^{SM} + n_{SM}^{S} + n_{M}^{M} + n_{SM}^{SM} + n_{S}^{S}\right)/T\right) \qquad (5)$$

with $n_{y}^{z}$ the number of frames recognized as $z$ when
labelled $y$, and $T$ the total number of frames.
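For instance, the three rates can be computed from frame-level confusion counts as in the sketch below; the dictionary layout n[labelled][recognized] is ours.

```python
def error_rates(n, total):
    """n[y][z]: frames labelled y and recognized as z; total: number of frames T."""
    correct_global = n["SM"]["SM"] + n["M"]["M"] + n["S"]["S"]
    # M <-> SM confusions do not count as errors for music/non-music,
    # S <-> SM confusions do not count as errors for speech/non-speech.
    correct_music = correct_global + n["M"]["SM"] + n["SM"]["M"]
    correct_speech = correct_global + n["S"]["SM"] + n["SM"]["S"]
    return {"global":            100 * (1 - correct_global / total),
            "music/non-music":   100 * (1 - correct_music / total),
            "speech/non-speech": 100 * (1 - correct_speech / total)}
```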
Moreover, we consider the 12 MFCC coefficients
with their first and second derivatives as the baseline
features because they give the best global discrim-
ination error rate compared to other MFCC-based
features also evaluated on our test corpus.
Table 1 presents the distribution of recognized
frames into the speech, music and speech-on-music categories
for the global discrimination task with the MFCC
parameterization. This table shows the difficulty of
the speech/music discrimination, especially for superimposed
segments of speech and music.
Table 1: Frame distribution (%) for the global discrimination
task using 12 MFCC coefficients with their first and second
derivatives.

labelled \ recognized      S      SM      M
S                         60.9   30.8    8.3
SM                        10.1   74.9   15.0
M                          2.9    2.5   93.8
5.2 Speech/non-speech
Discrimination
After preliminary experiments, we chose two families
of wavelets: the Daubechies wavelet with 4 vanishing
moments (db-4) and the coiflet with 2 vanishing moments
(coif-1). We used two decomposition levels,
5 and 7, and computed two energy features on the
wavelet coefficients: instantaneous (E) and Teager
(TE) energies.
Speech/non-speech discrimination results are summarized
in Table 2. We can notice that the energy features
computed on the “coif-1” wavelet parameterization
with 5 bands give slightly better results than the
MFCC parameters.
Table 2: Error rates (%) for the speech/non-speech discrimination
task using wavelets db-4 and coif-1, 5 and 7 bands.

Wlt      Nb   Param.                   Error rate
              MFCC+∆+∆∆                   5.8
Static parameters
db-4      5   E                           5.3
db-4      5   TE                          5.4
db-4      7   E                           6.2
db-4      7   TE                          5.4
coif-1    5   E                           4.2
coif-1    5   TE                          4.2
coif-1    7   E                           6.8
coif-1    7   TE                          6.1
Dynamic parameters
coif-1   14   E+∆                         3.4
coif-1   14   TE+∆                        2.7
coif-1   21   E+∆+∆∆                      3.1
coif-1   21   TE+∆+∆∆                     2.7
Long-term parameters
              MFCC+∆+∆∆ (Var. on 1s)      4.2
coif-1    7   E Var 1s                    3.5
coif-1    7   TE Var 1s                   3.2
The addition of dynamic parameters,
more precisely the first derivatives, gives significantly
better performance than the MFCC parameters or the
static wavelet features. Besides, with the same number
of parameters (7), the long-term wavelet parameters
based on variance computation provide an improvement
compared to the static ones.
5.3 Music/non-music Discrimination
For the music/non-music discrimination task, the results
are presented in Table 3. Whatever the wavelet type,
number of bands or energy type, the static wavelet parameters
achieve a dramatic decrease of the error rate
compared to the MFCC parameterization. On the other
hand, adding derivative components or using long-term
wavelet features is not helpful.
5.4 Global Discrimination
We then conducted some experiments to test different
features computed on the “coif-1” wavelet parameterization
with 7 bands for the global discrimination
task. The results presented in Table 4 confirm
the previously obtained conclusions: the static wavelet features
significantly decrease the error rate compared to the
MFCC ones. The addition of dynamic coefficients reduces
the error rate a little further. Finally, the variance-based
long-term parameters are not very helpful.
Table 3: Error rates (%) for the music/non-music discrimination
task using wavelets db-4 and coif-1, 5 and 7 bands.

Wlt      Nb   Param.                   Error rate
              MFCC+∆+∆∆                  23.1
Static parameters
db-4      5   E                          15.3
db-4      5   TE                         15.1
db-4      7   E                          16.1
db-4      7   TE                         16.5
coif-1    5   E                          16.5
coif-1    5   TE                         17.0
coif-1    7   E                          14.5
coif-1    7   TE                         14.6
Dynamic parameters
coif-1   14   E+∆                        15.2
coif-1   14   TE+∆                       15.0
coif-1   21   E+∆+∆∆                     17.4
coif-1   21   TE+∆+∆∆                    17.4
Long-term parameters
              MFCC+∆+∆∆ (Var. on 1s)     23.3
coif-1    7   E Var 1s                   16.3
coif-1    7   TE Var 1s                  16.4
Table 4: Error rates (%) for the global discrimination task using
wavelet coif-1 and 7 bands.

Param.         Nb   Error rate
MFCC+∆+∆∆      36      26.2
Static parameters
E               7      21.6
TE              7      18.4
Dynamic parameters
E+∆            14      17.4
TE+∆           14      17.6
Long-term parameters
E Var 1s        7      18.7
TE Var 1s       7      18.6
5.5 Fusion of Different Classifiers
In order to improve performance of all the discrim-
ination tasks, we combine the outputs of several
class/non-class classifiers. The classifiers differ by
the parameterization and features they use. Two types
of classifier output fusion were tested.
In the first one, called “fusion A”, to outperform the
results of the global discrimination task, we combine
the outputs of the best speech/non-speech classifier
and of the best music/non-music one. For both
classifiers, best results are obtained with the 7-band
“coif-1” wavelet parameterization. Regarding the
energy features computed on this decomposition, the
best speech/non-speech discrimination is achieved
with Teager energy and its first derivative and the
best music/non-music one with instantaneous energy.
In the second one, called “fusion B”, we choose
three parameterizations for each discrimination task
(speech/non-speech and music/non-music). Then,
the outputs of these classifiers are merged using a
majority voting strategy, sketched at the frame level
after the lists below.
We assume that these parameterizations are well-performing
methods, bring diversity and produce
different kinds of mistakes. The combination of such
experts should reduce the overall classification error and,
as a consequence, emphasize correct outputs.
For every discrimination task, the three parameterizations
are chosen as follows: we select the best
static feature, the best “dynamic” feature (static
components plus derivatives) and the best long-term
one. According to our experiments, we obtain:
For the speech/non-speech task:
- coif-1 instantaneous energy with 5 bands,
- coif-1 Teager energy with 7 bands with first derivatives,
- variance over 1 second computed on coif-1 Teager energy with 7 bands.
For the music/non-music task:
- coif-1 instantaneous energy with 7 bands,
- coif-1 Teager energy with 7 bands with first derivatives,
- variance over 1 second computed on coif-1 instantaneous energy with 7 bands.
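The majority vote of fusion B can be sketched at the frame level as follows; this is our own illustration, assuming each classifier outputs a boolean class decision per frame.

```python
from collections import Counter

def majority_vote(decisions):
    """decisions: one boolean per classifier (e.g. the 3 speech/non-speech outputs)."""
    votes = Counter(decisions)
    return votes[True] > votes[False]

# Example: two of the three speech/non-speech classifiers detect speech on a frame.
assert majority_vote([True, True, False]) is True
```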
Table 5 shows the results of the three discrimination
tasks using both fusion approaches. Besides, Table 5
mentions the error rate obtained by the best single classifier
for the global discrimination task (first line). For fusion
A, only the global discrimination error rate must
be considered: we can notice a non-significant improvement.
On the other hand, fusion B slightly improves
the speech/non-speech and music/non-music
discriminations. Moreover, it provides a significant
decrease of the global classification error rate.
To conclude the experimental part, Table 6 shows
the classification results using the best fusion of classifiers.
Compared to Table 1 (MFCC parameters), a
significant reduction of misclassified segments is observed.
6 CONCLUSION
In this paper, we propose new features based
on a wavelet decomposition of the audio signal for
speech/music discrimination. These features are obtained
by computing different energies on the wavelet coefficients.
Table 5: Error rates (%) for the three discrimination tasks using
the fusion of classifiers.

Param.                                          M/NM   S/NS    GR
Best feature for GR:
  coif-1, 7 bands, E+∆                          15.0    3.4   17.4
Fusion A:
  best S/NS feature (coif-1, 7 bands, TE+∆)             2.7
  best M/NM feature (coif-1, 7 bands, E)        14.5          17.0
Fusion B:
  majority vote with 3 S/NS classifiers                  2.5
  majority vote with 3 M/NM classifiers         14.0          16.1
Table 6: Frame distribution (%) for the global discrimination
task using the best fusion of classifiers.

labelled \ recognized      S      SM      M
S                         76.9   22.5    0.5
SM                         8.9   86.3    4.6
M                          0.2    4.1   94.3
Compared to the
MFCC parameterization, the wavelet decomposition
gives a non-uniform time resolution for the different
frequency bands. Moreover, this parameterization is
more robust to signal non-stationarity and yields a
more compact representation of the signal.
We have tested these new features on a difficult real-world
corpus composed of broadcast programs with
superimposed segments, speech with music or songs
with a “fade in-fade out” effect.
The new parameterization gives better results than
the MFCC-based one for speech/music discrimination.
The best improvements are obtained for the music/non-music
discrimination task, with a relative gain of 40%
in error rate. Moreover, the Teager energy feature based
on the coif-1 wavelet seems to be a robust feature for
discriminating between speech, music and speech on
music.
Another interesting point is that the proposed parameterizations
use a reduced number of coefficients to
represent the signal compared to the MFCC one.
Finally, the fusion of the classifiers using the
three best speech/non-speech and music/non-music parameterizations
improves the speech/music discrimination
results. At last, for the speech/music/speech-on-music
discrimination task, a relative gain of 39% in
error rate is obtained, compared to the MFCC parameters.
REFERENCES
Carey, M., Parris, E., and Lloyd-Thomas, H. (1999). A
Comparison of Features for Speech, Music Discrim-
ination. In ICASSP-99.
Deviren, M. (2004). Revisiting speech recognition systems:
dynamic Bayesian networks and new computational
paradigms. PhD thesis, Université Henri Poincaré,
Nancy, France.
Daubechies, I. and Maes, S. (1996). A Nonlinear Squeezing of
the Continuous Wavelet Transform based on Auditory
Nerve Models. In Wavelets in Medicine and Biology.
Kaiser, J. (1990). On a Simple Algorithm to Calculate the
'Energy' of a Signal. In ICASSP-90.
Logan, B. (2000). Mel Frequency Cepstral Coefficients for
Music Modeling. In International Symposium on Mu-
sic Information Retrieval (ISMIR).
Mallat, S. (1998). A Wavelet Tour of Signal Processing.
Academic Press.
Pinquier, J. (2002). Speech and music classification in audio
documents. In ICASSP-02.
Sarikaya, R. and Hansen, J. (2000). High Resolution
Speech Feature Parameterization for Monophone-
based Stressed Speech Recognition. IEEE Signal
Processing Letters, 7(7):182–185.
Scheirer, E. and Slaney, M. (1997). Construction and Evalu-
ation of a Robust Multifeature Speech/Music Discrim-
inator. In ICASSP-97.
Tzanetakis, G. and Cook, P. (2002). Musical Genre Classification
of Audio Signals. IEEE Transactions on Speech
and Audio Processing, 10(5):293–302.