MFCC-BASED REMOTE PATHOLOGY DETECTION ON SPEECH

TRANSMITTED THROUGH THE TELEPHONE CHANNEL

Impact of Linear Distortions: Band Limitation, Frequency Response and Noise

Rubén Fraile, Nicolás Sáenz-Lechón, Juan Ignacio Godino-Llorente, Víctor Osma-Ruiz

Department of Circuits & Systems Engineering, Universidad Politécnica de Madrid

Carretera de Valencia Km 7, 28031 Madrid, Spain

Corinne Fredouille

Laboratoire Informatique d’Avignon, Université d’Avignon et des Pays de Vaucluse

339, chemin des Meinajaries, 84911 Avignon Cedex 9, France

Keywords:

Speech analysis, Pattern classification, Biomedical signal analysis, Communication channels.

Abstract:

Advances in speech signal analysis during the last decade have allowed the development of automatic algorithms for the non-invasive detection of laryngeal pathologies. Performance assessment of such techniques reveals that classification success rates over 90% are achievable. Bearing in mind the extension of these automatic methods to remote diagnosis scenarios, this paper analyses the performance of a pathology detector based on Mel Frequency Cepstral Coefficients when the speech signal has undergone the distortion of an analogue communications channel, namely the telephone channel. This channel is modeled as a concatenation of linear effects. It is shown that, while the overall performance of the system is degraded, success rates in the range of 80% can still be achieved. This study also shows that the performance degradation is mainly due to band limitation and noise addition.

1 INTRODUCTION

The social and economic evolution of developed countries in recent years has led to an increased number of professionals whose working activity greatly depends on the use of their voice. It has been reported that this number has reached one third of the total labor force and, in parallel, that approximately 30% of the population suffers from some kind of voice disorder during their lives (Södersten and Lindhe, 2007). In this context, methods for objective assessment of vocal function are of considerable interest (Umapathy et al., 2005) and, among them, speech analysis has the additional advantages of being non-invasive and allowing easy data collection (Baken and Orlikoff, 2000).

Speech assessment for the detection of pathologies has traditionally been carried out through the analysis of global distortion and noise measurements taken from records of sustained vowels (Umapathy et al., 2005; Baken and Orlikoff, 2000). Classification performances over 90% in terms of success rates have been reported for automatic pathology detection systems based on such parameters (e.g. Boyanov and Hadjitodorov, 1997). Recently, alternative approaches based on Mel-frequency Cepstral Coefficients (MFCC) with similar performance (Godino-Llorente and Gomez-Vilda, 2004) have also been proposed. These approaches have the advantage of relying on robust parameters whose calculation does not require prior pitch estimation (Fraile et al., 2008a). Moreover, analysis in the cepstral domain for this application is further justified by the presence in the cepstrum of information about the level of noise (Murphy and Akande, 2005). Additional reasons that support the specific processing involved in MFCC calculation can be found in (Fraile et al., 2008a), (Godino-Llorente et al., 2006) and (Fraile et al., 2008b).

From another point of view, remote diagnosis is one of the foreseen applications of telemedicine (TM Alliance Team, 2004). A non-invasive diagnosis technique such as speech analysis is well suited to that application. Moreover, since the analogue wired telephone network is one of the most mature and widely extended communications infrastructures, it seems reasonable to expect that it will become one of the supporting technologies for that medical service. However, the feasibility of such an application will heavily depend on the ability of voice analysis to extract significant information from speech signals even after the distortion caused by the communications channel.

Fraile R., Sáenz-Lechón N., Godino-Llorente J., Osma-Ruiz V. and Fredouille C. (2009). MFCC-BASED REMOTE PATHOLOGY DETECTION ON SPEECH TRANSMITTED THROUGH THE TELEPHONE CHANNEL - Impact of Linear Distortions: Band Limitation, Frequency Response and Noise. In Proceedings of the International Conference on Bio-inspired Systems and Signal Processing, pages 41-48. DOI: 10.5220/0001534200410048. SciTePress

Some preliminary work on this issue has already been carried out and published. In the first place, pathology detection on voice transmitted over the phone has been shown to suffer a performance degradation of around 15% when detection is based on traditional acoustic parameters (Moran et al., 2006). Secondly, the impact of several speech coders on voice quality has been studied, but without regard to the additional degradation introduced by communications channels (Jamieson et al., 2002). Last, the effect of the analogue telephone channel on an MFCC-based system for pathology detection has also been analysed (Fraile et al., 2007), but without differentiating among the different distortions introduced by the channel and without accounting for noise distortion.

Considering all the above-mentioned aspects, that is, the suitability of MFCC for automatic pathology detection and the interest of analyzing the impact of the analogue telephone channel on speech quality, this paper offers a detailed report on the effect of the distortions introduced by the telephone channel on the performance of automatic pathology detection based on MFCC. More specifically, it provides a more complete study than that of (Fraile et al., 2007), in which the effects of band limitation, frequency response of the channel and additive noise are analysed separately. In this way, the results of the study are useful not only for remote diagnosis applications such as the one described before, but also for setting minimum conditions, in terms of bandwidth and noise levels, for speech recording in clinical applications.

The rest of the paper is organised as follows: sec-

tion 2 contains the speciﬁc formulation of MFCC and

the values for related parameters used in the study,

section 3 describes the model of telephone channel

that has been considered, in section 4 the database,

classiﬁer and procedure used for the experiment are

detailed, results are reported in section 5 and, last,

section 6 is dedicated to the conclusions.

2 MFCC FORMULATION

As argued in (Fraile et al., 2008a), the variability of the speech signal is especially relevant in the presence of pathologies, thus justifying the use of short-term signal processing. A framework for such short-term processing in the case of speech is provided in (Deller et al., 1993). Within this framework, the short-time MFCC definition given in (Fraile et al., 2008b) is used; it differs slightly from the original proposal in (Davis and Mermelstein, 1980) but has an easier interpretation:

$$c_p[q] = \frac{1}{M+1} \sum_{k=1}^{M} \log\left(\tilde{S}_p(k)\right) \cdot \cos\left(\frac{\pi k}{M+1} \cdot q\right) \tag{1}$$

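where $p$ is the frame index, $q$ is the index of the MFCC that ranges from 0 to $M$, $M$ is the number of Mel-band filters used for spectrum smoothing and $\tilde{S}_p(k)$ is the estimate of the spectral energy of the speech signal in the $k$-th Mel band. Specifically:

$$\tilde{S}_p(k) = \sum_{f^i_{mel} \in I_k} \left(1 - \frac{\left|f^i_{mel} - F_{mel} \cdot \frac{k}{M+1}\right|}{\Delta f_{mel}}\right) \cdot \left|S_p(i)\right| \tag{2}$$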

where $S_p(i)$ is the $i$-th element of the short-time discrete Fourier transform of the $p$-th speech frame, $f^i_{mel}$ its associated Mel frequency,

$$I_k = \left[F_{mel} \cdot \frac{k-1}{M+1},\; F_{mel} \cdot \frac{k+1}{M+1}\right] \tag{3}$$

is the $k$-th band in Mel-frequency scale, $\Delta f_{mel}/2$ is the width of these Mel bands and $F_{mel}$ is the maximum frequency in Mel domain, which corresponds to half the sampling frequency of the speech signal. The frequency transformation that allows passing from linear to Mel scale is:

$$f_{mel} = 2595 \cdot \log_{10}\left(1 + \frac{f}{700}\right) \tag{4}$$

For the application reported herein, the speech frame duration has been chosen to be 20 ms, which allows capturing the spectral envelope of speech for fundamental frequencies above 50 Hz, thus covering the cases of both male and female voices (Baken and Orlikoff, 2000). Overlap between consecutive frames was 50%. The number of Mel band filters M has been set to 31, since that value has been shown to give good performance (Fraile et al., 2008b), and vectors of 21 MFCC, that is q ∈ [0, 20], have been used as feature vectors for each speech frame.
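As a rough illustration of the processing chain described in this section, the following Python sketch frames a signal, computes the Mel-band estimates of (2) and derives the coefficients of (1). It is not the authors' implementation: the function names, the direct O(N²) DFT, the small floor applied before the logarithm, the triangular weights that vanish at the band edges (half-width F_mel/(M+1)) and the 1/(M+1) scaling are all assumptions made here to keep the example self-contained.

```python
import cmath
import math


def hz_to_mel(f_hz):
    """Linear-to-Mel frequency mapping of equation (4)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)


def split_frames(signal, fs, frame_ms=20.0, overlap=0.5):
    """Cut a signal into 20 ms frames with 50% overlap (values from the text)."""
    size = int(round(fs * frame_ms / 1000.0))
    hop = int(round(size * (1.0 - overlap)))
    return [signal[s:s + size] for s in range(0, len(signal) - size + 1, hop)]


def dft_modulus(frame):
    """Modulus of the DFT for bins 0..N/2 (direct sum, fine for short frames)."""
    n = len(frame)
    return [abs(sum(x * cmath.exp(-2j * math.pi * i * t / n)
                    for t, x in enumerate(frame)))
            for i in range(n // 2 + 1)]


def mel_band_energies(frame, fs, n_bands):
    """Triangularly weighted sums of |S_p(i)| over the Mel bands I_k."""
    mags = dft_modulus(frame)
    f_max_mel = hz_to_mel(fs / 2.0)
    half_width = f_max_mel / (n_bands + 1)      # assumed triangle half-width
    bin_mel = [hz_to_mel(i * fs / len(frame)) for i in range(len(mags))]
    energies = []
    for k in range(1, n_bands + 1):
        centre = f_max_mel * k / (n_bands + 1)
        total = sum((1.0 - abs(m - centre) / half_width) * mag
                    for m, mag in zip(bin_mel, mags)
                    if abs(m - centre) < half_width)
        energies.append(max(total, 1e-12))      # assumed floor, avoids log(0)
    return energies


def mfcc(frame, fs, n_bands=31, n_coeffs=21):
    """Cosine transform of the log Mel-band energies, q = 0..n_coeffs-1."""
    log_e = [math.log(e) for e in mel_band_energies(frame, fs, n_bands)]
    return [sum(le * math.cos(math.pi * k * q / (n_bands + 1))
                for k, le in enumerate(log_e, start=1)) / (n_bands + 1)
            for q in range(n_coeffs)]
```

With a 25 kHz sampling rate, `split_frames` yields 500-sample frames with a hop of 250 samples, and `mfcc(frame, 25000)` returns the 21-element feature vector for one frame. Under the assumed scaling, doubling the signal amplitude only shifts the q = 0 coefficient by log(2)·M/(M+1).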

3 TELEPHONE CHANNEL

MODEL

BIOSIGNALS 2009 - International Conference on Bio-inspired Systems and Signal Processing

Figure 1: Block diagram of the analogue telephone channel model.

The task of assessing the impact of the analogue telephone channel on the performance of an MFCC-based pathology detector was carried out following the same modeling methodology as in (Fraile et al., 2007). This methodology comprises the main aspects of the model proposed in (Dimolitsas and Gunn, 1988). Namely, the linear effects of the channel have been assumed to be the dominant ones: amplitude, phase and noise distortions. Normative restrictions on amplitude and phase distortion imposed by (ITU, 1998) have also been taken into account. The block diagram of the overall channel model is drawn in figure 1 and it consists of the following elements:

• Amplitude Distortion. Its limits are normalised in

(ITU, 1998) for the 300-3400 Hz band and no re-

strictions are imposed outside that band.

• Phase Distortion. Its limits for the 300-3400 Hz

band are also speciﬁed in (ITU, 1998) and they are

mainly referred to the phase effects at the edges of

that band.

• Noise Distortion. This distortion can be split in

noise at the transmitter side, which undergoes the

same amplitude and phase distortion as the speech

signal, and noise at the receiver side that does not

suffer that distortion.

• Bandwidth Limitation. This has to be carried out

as the ﬁrst stage of the detector due to the uncer-

tainty about the distortion out of the 300-3400 Hz

band. Another reason for this limitation is that the

telephone network adds some signalling in the 0-

300 Hz band (ITU, 1998).

3.1 Amplitude Distortion

The analogue telephone channel acts as a band-pass

ﬁlter. Attenuation of high frequencies comes from

the low-pass behaviour of the transmission line while

attenuation of low frequencies (below 300 Hz) al-

lows the use of out-of-band signalling. Limits recom-

mended by (ITU, 1998) for the amplitude response

of the channel are represented as continuous lines in

ﬁgure 2.

The amplitude and phase distortions of the channel have been simulated separately, as proposed in (Dimolitsas and Gunn, 1988) and illustrated in figure 1. Within such a setup, the amplitude distortion has been modeled as a band-pass linear-phase system, hence achieving null phase distortion in this stage, implemented by means of a symmetric FIR filter. Bearing in mind the restrictions in (ITU, 1998), a 176-order filter has been designed that has the frequency response plotted in figure 2 (dashed line).
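The idea of a symmetric, linear-phase band-pass FIR filter can be sketched as follows. This is only a stand-in: the paper does not state how the 176-order filter was designed, so a Hamming-windowed ideal band-pass response with cut-offs at 300 and 3400 Hz is assumed here, and it is not claimed to reproduce the response of figure 2 or to meet the G.120 mask. The function names and the 25 kHz sampling rate used in the call are likewise assumptions.

```python
import cmath
import math


def bandpass_fir(order, f_lo, f_hi, fs):
    """Symmetric (linear-phase) band-pass FIR: Hamming-windowed ideal response."""
    taps = []
    mid = order / 2.0
    for n in range(order + 1):
        t = n - mid
        if t == 0:
            ideal = 2.0 * (f_hi - f_lo) / fs
        else:
            ideal = (math.sin(2 * math.pi * f_hi * t / fs)
                     - math.sin(2 * math.pi * f_lo * t / fs)) / (math.pi * t)
        window = 0.54 - 0.46 * math.cos(2 * math.pi * n / order)
        taps.append(ideal * window)
    return taps


def gain_db(taps, f_hz, fs):
    """Magnitude response of the filter at one frequency, in dB."""
    omega = 2 * math.pi * f_hz / fs
    h = sum(b * cmath.exp(-1j * omega * n) for n, b in enumerate(taps))
    return 20 * math.log10(max(abs(h), 1e-12))


# Order 176 gives 177 taps; the symmetry of the taps is what makes the phase linear.
taps = bandpass_fir(order=176, f_lo=300.0, f_hi=3400.0, fs=25000.0)
```

Because the taps are symmetric about the centre, the filter delays every frequency by the same 88 samples, so all of the phase distortion can be left to the separate all-pass stage.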

Figure 2: Amplitude response of the channel: restrictions (continuous line) and model (dashed line). Axes: frequency (Hz) against gain (dB); legend: upper G.120 bound, lower G.120 bound, simulated filter.

3.2 Phase Distortion

Regarding phase distortion, (ITU, 1998) imposes limits on group delay variations within the pass band. Namely, different limits are specified for the low and high parts of the band, as represented by the thick lines in figure 3. A simple procedure to obtain an all-pass filter that achieves phase distortion around certain frequencies is to design an IIR filter having zeros and poles at the frequencies at which phase distortion has to be greatest. For the filter to be all-pass, zero and pole moduli must be symmetric with respect to the unit circle of the z-plane, i.e. each must be the reciprocal of the other. Specifically, the implemented filter corresponds to the following transfer function:

$$H(z) = H(z; f_{low}) \cdot H(z; f_{high}) \tag{5}$$

$$H(z; f) = \frac{1 - r z^{-1} e^{j 2\pi f / f_s}}{1 - \frac{1}{r} z^{-1} e^{j 2\pi f / f_s}} \cdot \frac{1 - r z^{-1} e^{-j 2\pi f / f_s}}{1 - \frac{1}{r} z^{-1} e^{-j 2\pi f / f_s}} \tag{6}$$

where $r = 1.01$, $f_{low} = 250$ Hz, $f_{high} = 3450$ Hz and $f_s$ is the sampling frequency of the speech record. The obtained frequency-dependent group delay is depicted in figure 3. It can be noticed that the maximum phase distortion happens at the limits of the pass band of the FIR filter, as specified by (ITU, 1998).

Figure 3: Phase response of the channel: restrictions (continuous line) and model (dashed line). Axes: frequency (Hz) against group delay (s); legend: all-pass filter, lower band limit (ITU G.120), upper band limit (ITU G.120).
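The all-pass behaviour of the filter in (5) and (6) can be checked numerically. The sketch below evaluates the transfer function on the unit circle, with zeros at radius r and poles at radius 1/r, and estimates the group delay by a finite difference of the phase. The function names, the 25 kHz sampling rate and the 0.5 Hz differentiation step are assumptions of this example.

```python
import cmath
import math


def section(omega, theta, r):
    """One factor H(z; f) of equation (6) on the unit circle, theta = 2*pi*f/fs."""
    z_inv = cmath.exp(-1j * omega)
    num = ((1 - r * z_inv * cmath.exp(1j * theta))
           * (1 - r * z_inv * cmath.exp(-1j * theta)))
    den = ((1 - z_inv * cmath.exp(1j * theta) / r)
           * (1 - z_inv * cmath.exp(-1j * theta) / r))
    return num / den


def phase_filter(f_hz, fs=25000.0, f_low=250.0, f_high=3450.0, r=1.01):
    """Cascade of equation (5), evaluated at the analogue frequency f_hz."""
    omega = 2 * math.pi * f_hz / fs
    return (section(omega, 2 * math.pi * f_low / fs, r)
            * section(omega, 2 * math.pi * f_high / fs, r))


def group_delay_s(f_hz, fs=25000.0, df=0.5):
    """Group delay in seconds from a central finite difference of the phase."""
    ratio = phase_filter(f_hz + df, fs) / phase_filter(f_hz - df, fs)
    d_phase = cmath.phase(ratio)                 # phase increment over 2*df
    return -d_phase / (2 * math.pi * 2 * df)     # -dphi/dOmega, in seconds
```

With these values the magnitude is the same at every frequency (a constant gain of r⁴, about 1.04), while the group delay peaks sharply near 250 Hz and 3450 Hz. At a 25 kHz sampling rate the peak near 250 Hz is roughly 8 ms.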

3.3 Band Limitation

The above-mentioned specifications for the frequency response of the telephone channel only cover the band between 300 and 3400 Hz, thus leaving uncertainty about the distortion that the speech signal undergoes outside that band. In addition, as specified by (ITU, 1998), out-of-band signalling is allowed in the 0-300 Hz band. This adds the possibility of narrow-band noise distortion to the lack of standardisation of the response of the channel within that band. These facts make it logical to perform a band limitation of the speech signal prior to its analysis, as indicated in figure 1. In this way, only the 300-3400 Hz band of the signal is further processed. This band limitation procedure is commonly used in other speech processing applications (Reynolds et al., 1995).

The band limitation has a direct effect on the computation of MFCC. Specifically, the $\Delta f_{mel}$ parameter in (2) depends on both the bandwidth of the signal and the number of Mel-band filters used for MFCC calculation. When limiting the frequency band of the signal, two strategies may be followed in the subsequent analysis: either maintaining the number of Mel bands, hence reducing $\Delta f_{mel}$, or keeping $\Delta f_{mel}$ approximately equal by reducing the number of bands. The performance of these two options will be analysed in section 5.

3.4 Additive Noise

The fourth modeled distortion of the telephone channel is noise. Although more complex models exist for telephone noise (Dimolitsas and Gunn, 1988), a simpler approach, similar to (Reynolds et al., 1995), has been chosen here. Namely, noise has been considered to be additive white Gaussian noise (AWGN). Yet, a differentiation has been made between noise that suffers the same channel effects as the speech signal, accounting for the transmitter side, and noise that does not pass through the channel, accounting for the receiver side. In both cases, the signal-to-noise ratio (SNR) has been controlled by tuning the power of the noise to the specific power of each processed signal.
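Setting the noise power from the power of each individual signal can be sketched as below. The function names, the sinusoidal test signal and the fixed random seed are assumptions made for the example; only the principle (noise variance equal to signal power divided by the linear SNR) follows the description above.

```python
import math
import random


def signal_power(x):
    """Mean squared value of a sequence of samples."""
    return sum(v * v for v in x) / len(x)


def add_awgn(signal, snr_db, rng):
    """Add white Gaussian noise whose power is set from the signal's own power."""
    noise_power = signal_power(signal) / (10.0 ** (snr_db / 10.0))
    sigma = math.sqrt(noise_power)
    return [v + rng.gauss(0.0, sigma) for v in signal]


rng = random.Random(0)                      # fixed seed, for repeatability only
clean = [math.sin(2 * math.pi * 150 * t / 25000.0) for t in range(5000)]
noisy = add_awgn(clean, snr_db=24.0, rng=rng)

# Recover the SNR actually obtained for this finite realisation of the noise.
noise = [n - c for n, c in zip(noisy, clean)]
measured_snr_db = 10 * math.log10(signal_power(clean) / signal_power(noise))
```

For transmitter-side noise the noisy signal would then be passed through the channel filters, whereas receiver-side noise would be added to the filter output.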

4 SIMULATION PROCEDURE

4.1 Database

All the results reported herein have been obtained using a well-known database distributed by Kay Elemetrics (MEE, 1994). More specifically, the speech records used correspond to sustained phonations of the vowel /ah/ (1-3 s long) from patients with normal voices and a wide variety of organic, neurological, traumatic, and psychogenic voice disorders in different stages (from early to mature). The subset taken corresponds to that reported in (Parsa and Jamieson, 2000) and comprises 53 records from healthy speakers (normal set) and 173 records from ill patients (pathological set).

The speech samples were collected in a controlled environment and sampled at either 50 or 25 kHz with 16 bits of resolution. Records sampled at 50 kHz were down-sampled, after half-band filtering, in order to adjust every utterance to the sampling rate of 25 kHz.

4.2 Classiﬁer

The chosen classifier consists of a three-layer Multilayer Perceptron (MLP) neural network (Haykin, 1994) with 40 hidden nodes having logistic activation functions (as in (Godino-Llorente and Gomez-Vilda, 2004)) and two outputs with linear activations. The use of two linear outputs allows obtaining two values for each speech frame, characterised by its MFCC vector $\mathbf{c}_p$. In the training phase of the MLP, one output is trained to produce a value of “0” for pathological voice frames and “1” for normal voice frames, while the other output is trained to produce a “0” for normal data and a “1” for pathological data. In the testing phase, each output value is an estimation of the likelihood of that frame being either normal, $L_{nor}(\mathbf{c}_p)$ (first output), or pathological, $L_{pat}(\mathbf{c}_p)$ (second output).

These likelihoods, whilst not probabilities, give an idea of how likely it is that any particular frame corresponds to each class or set. Their precise values depend on the value of the feature vector components and on the learned parameters of the MLP. Since the orders of magnitude of both likelihoods may differ significantly, it is more usual to compute log-likelihoods; the classification decision for the $p$-th frame is then based on the difference between log-likelihoods, as described in (Bimbot et al., 2004):

$$\log\left[L_{nor}(\mathbf{c}_p)\right] - \log\left[L_{pat}(\mathbf{c}_p)\right] > \theta \tag{7}$$

If the previous condition is met, then the speech

frame is classiﬁed as normal, if not, it is considered

pathological. In ideal conditions, that is, if the like-

lihoods could be perfectly estimated by the classiﬁer,

then the value for the threshold θ should be θ =0. In

practice, however, this is not the case and the choice

of θ helps to make the decision system more or less

conservative. Nevertheless, since decisions in this

case should not be taken at the frame level, but at the

record level, a mean log-likelihood difference is com-

puted and this is the value actually compared to the

threshold:

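$$\frac{1}{N_{frames}} \sum_{p=1}^{N_{frames}} \left(\log\left[L_{nor}(\mathbf{c}_p)\right] - \log\left[L_{pat}(\mathbf{c}_p)\right]\right) > \theta \tag{8}$$

where $N_{frames}$ is the number of frames of the speech record.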
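The record-level rule of (8) amounts to averaging the per-frame log-likelihood differences and comparing the mean with the threshold. A minimal sketch follows, assuming that the per-frame likelihood estimates are available as two lists of positive numbers; the function names and the sample values are hypothetical.

```python
import math


def mean_llr(l_nor, l_pat):
    """Mean over frames of log L_nor - log L_pat, the left-hand side of (8)."""
    diffs = [math.log(a) - math.log(b) for a, b in zip(l_nor, l_pat)]
    return sum(diffs) / len(diffs)


def classify_record(l_nor, l_pat, theta=0.0):
    """Label a whole record by comparing the mean difference with theta."""
    return "normal" if mean_llr(l_nor, l_pat) > theta else "pathological"


# Three frames of a hypothetical record: two favour "normal", one does not.
label = classify_record([0.9, 0.8, 0.4], [0.1, 0.2, 0.6])
```

Raising `theta` makes the detector more reluctant to declare a record normal, which trades false alarms for misses.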

4.3 Testing Protocol

The testing of each detection scheme consists of an iterative process. Within each iteration, 70% of the available speech records have been randomly chosen for training the classifier, that is, to estimate the likelihood functions mentioned above. Among the remaining 30% of records, one third (10% of the total) have been used for cross-validation during training in order to get an objective criterion for finishing the training phase (Haykin, 1994). The rest (20%) have been used for testing. For each testing record, a decision according to the previously described framework has been taken. Last, with the decisions corresponding to all the testing records, misclassification rates for different values of θ have been computed for the corresponding iteration. Twenty iterations with independently chosen training, validation and testing sets have been run.
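The repeated random partition can be sketched as follows. The function name, the rounding of the set sizes and the seed are assumptions; in particular, the text does not say whether the split was stratified by class, so a plain shuffle over the 226 record indices is used here.

```python
import random


def split_records(record_ids, rng, train_frac=0.7, val_frac=0.1):
    """Shuffle the records and cut them into training, validation and test sets."""
    ids = list(record_ids)
    rng.shuffle(ids)
    n_train = round(train_frac * len(ids))
    n_val = round(val_frac * len(ids))
    return ids[:n_train], ids[n_train:n_train + n_val], ids[n_train + n_val:]


rng = random.Random(0)                       # fixed seed, for repeatability only
# 53 normal + 173 pathological records, re-partitioned for each of 20 iterations.
splits = [split_records(range(226), rng) for _ in range(20)]
```

Each iteration would then train on the first set, stop training according to the second and compute error rates on the third.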

5 RESULTS

There are several performance indicators for the eval-

uation of detection systems. A summary of the most

typically used for speech applications can be found

in (Bimbot et al., 2004). Among these indicators, the

DET plot (Martin et al., 1997) and the Equal Error

Rate (EER) have been chosen for this study as graphic

and quantitative indicators, respectively. For the DET

plot, false alarm has been deﬁned as the event of de-

tecting a normal voice as pathological, while miss

means the event of detecting a pathological voice as

normal. In this context, the DET curve represents the

relationship between miss and false alarm rates as the

threshold θ in (7) and (8) changes and the EER is the

point at which the DET curve crosses the diagonal of

the graph, i.e. the value of miss and false alarm rates

when θ is tuned so that they coincide. In all experi-

ments, the results have been computed both at frame

and record levels, corresponding to (7) and (8).
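The EER can be estimated empirically by sweeping the threshold over the observed scores and locating the point where the two error rates are closest. The sketch below assumes that each record (or frame) is summarised by its log-likelihood difference, with larger values favouring the normal class; the function names and the averaging of the two rates at the closest point are choices made for this example.

```python
def error_rates(normal_scores, pathological_scores, theta):
    """False alarm and miss rates for the decision rule 'normal if score > theta'."""
    false_alarm = sum(s <= theta for s in normal_scores) / len(normal_scores)
    miss = sum(s > theta for s in pathological_scores) / len(pathological_scores)
    return false_alarm, miss


def equal_error_rate(normal_scores, pathological_scores):
    """Empirical EER: error rate at the threshold where both rates are closest."""
    candidates = sorted(set(normal_scores) | set(pathological_scores))
    candidates.insert(0, candidates[0] - 1.0)    # a threshold below every score

    def gap(theta):
        fa, miss = error_rates(normal_scores, pathological_scores, theta)
        return abs(fa - miss)

    best = min(candidates, key=gap)
    fa, miss = error_rates(normal_scores, pathological_scores, best)
    return (fa + miss) / 2.0


# One of four normal scores and one of four pathological scores fall on the wrong side.
eer = equal_error_rate([3, 2, 1, -1], [-3, -2, -1.5, 1.5])
```

Plotting the pairs returned by `error_rates` for all thresholds, on normal-deviate axes, gives the DET curve.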

5.1 Effect of Band Limitation

As indicated in figure 1, the first step in the speech analysis after transmission through the telephone channel is band limitation. This involves taking only the spectral energy between 300 Hz and 3400 Hz for spectrum smoothing using the Mel filter bank. Such bandwidth reduction can be achieved in two different ways. The first consists of maintaining the number of filters (M=31), thus reducing their individual widths. The second option, instead, involves maintaining the filter width by reducing the number of filters. It can be checked that if the band is split into 16 Mel bands (M=16), very similar Mel-filter widths are achieved. However, this means reducing the number of MFCC from 21 (q ∈ [0,20]) to 16 (q ∈ [0,15]), since q < M due to the periodic nature of the discrete-time Fourier transform.
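The claim about similar filter widths can be checked with a few lines of arithmetic. The sketch assumes that the band-limited filter bank is obtained by dividing the Mel interval between 300 Hz and 3400 Hz into M+1 equal steps, by analogy with (3), and that the full band extends to 12.5 kHz (half of the 25 kHz sampling rate); the function names are illustrative.

```python
import math


def hz_to_mel(f_hz):
    """Linear-to-Mel frequency mapping of equation (4)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)


def mel_spacing(f_lo_hz, f_hi_hz, n_bands):
    """Spacing between band centres when the Mel range is cut into M+1 steps."""
    return (hz_to_mel(f_hi_hz) - hz_to_mel(f_lo_hz)) / (n_bands + 1)


full_band = mel_spacing(0.0, 12500.0, 31)      # original records, M = 31
limited_31 = mel_spacing(300.0, 3400.0, 31)    # band-limited, M = 31
limited_16 = mel_spacing(300.0, 3400.0, 16)    # band-limited, M = 16
```

Under these assumptions the spacings come out at about 103, 50 and 94 mel respectively, so M=16 keeps the band-limited filters close to their original width while M=31 roughly halves it.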

In figure 4, the performances of both alternatives are represented by means of the averaged empirical EER and their 95% confidence intervals. The results indicate, on the one hand, that a significant increase in EER is produced by the band limitation inherent to the telephone channel. This observation is complementary to results reported in (Pouchoulin et al., 2007), where it was shown that the most relevant band for dysphonia detection was between 0 and 3000 Hz. The results reported herein indicate that there is significant information within the lower part of that band, that is, below 300 Hz. On the other hand, the plot in figure 4 also indicates that maintaining the size of the Mel bands (M=16) gives similar results to keeping the number of bands (M=31), but with the advantage of lower dimensionality. Consequently, M=16 will be the preferred option for the next experiments.

Figure 4: Average EER (central line of each box) and their 95% confidence interval (top and bottom of each box) at frame level (up) and record level (down). Case (1) corresponds to the original records and 31 Mel-band filters, (2) to band-limited signals with 31 Mel bands and (3) to band-limited signals with 16 Mel bands. Vertical axes: EER (%).

5.2 Effect of Amplitude Distortion

In (Fraile et al., 2007), it was shown that the am-

plitude distortion of the speech signal has the ef-

fect of performing a quasi-linear transformation in the

MFCC values. Taking this into account and recalling

(1), the transformed MFCC can be written as:

$$\tilde{c}_p[q] = A + c_p[q] + \frac{1}{M+1} \sum_{k=1}^{M} \log\left|\xi(k)\right| \cdot \cos\left(\frac{\pi k}{M+1} \cdot q\right) \tag{9}$$

where $A$ is a constant that depends on the amplitude response of the filter and $\xi(k)$ is a variable term that depends on the relation between the spectrum of the speech signal and the response of the filter within each Mel-frequency band.

Figure 5 shows the average EER with the associated confidence intervals when the training stage of the classifier is done with the original speech records, with band limitation and M=16, and the testing is done with the outputs of filtering those records with the filter corresponding to figure 2 (case 3). To ease comparison, results corresponding to the original records without band limitation (case 1) and the band-limited analysis with no distortion (case 2) are plotted in the same graph. It can be noticed that the limited distortion allowed within the 300-3400 Hz band by ITU specifications (ITU, 1998) does not greatly affect the performance of the system.

Figure 5: Average EER and 95% confidence intervals at frame level (up) and record level (down). Case (1) corresponds to the original records and 31 Mel-band filters, (2) to band-limited signals with 16 Mel bands and (3) to amplitude-distorted band-limited signals with 16 Mel bands. Vertical axes: EER (%).

5.3 Effect of Phase Distortion

As shown in (Fraile et al., 2007), the computation of MFCC involves calculation of the modulus of the discrete Fourier transform of the signal, as indicated in (1) and (2). Consequently, MFCC are insensitive to phase distortions and there is no need to analyse this effect of the channel.

5.4 Effect of Noise Distortion

The last effect of the channel to be analysed is noise distortion. This has been modelled as AWGN with different power levels. The effect of noise was analysed both independently and in conjunction with the band-limiting scheme explained before. As for the independent analysis, the obtained distributions of EER for different levels of signal-to-noise ratio (SNR) are plotted in figure 6. In all cases, the training was done with the clean records and the testing with the noisy ones. The plot indicates that for SNR values around 30 dB the overall performance does not degrade greatly. However, if SNR falls below 24 dB, the error rate at record level tends to grow above 15%. While the effect of noise in the case of the telephone channel is not isolated from other distortions, these results are also useful for determining the minimum required quality of speech recordings for pathology assessment. Under the AWGN assumption, SNR values below 24 dB seem not to be acceptable for this application.

Figure 6: Average EER and 95% confidence intervals at frame level (up) and record level (down). Case (1) corresponds to the original records and cases (2) to (5) to SNR values of 30 dB, 24 dB, 18 dB and 12 dB, respectively. Vertical axes: EER (%).

The figure of 20 dB has been taken as a reference for the combined analysis of band limitation and amplitude and noise distortions. It has been found that, consistently with the results reported above, there is no significant difference between adding the noise before the amplitude distortion (transmitter side) or after it (receiver side). For the subsequent experiment, noise addition has been split in two parts: half of the power prior to amplitude distortion and half of the power after. Figure 7 shows the average EER for the original speech records and those obtained after the three distortions (band limitation, amplitude distortion and noise addition). On the whole, the average EER suffers a degradation of below 10%, yielding a classification success rate over 80% at the record level. A DET plot of the same results is depicted in figure 8.

Figure 7: Average EER and 95% confidence intervals at frame level (up) and record level (down). Case (1) corresponds to the original records and case (2) to records undergoing the full modeled channel distortion. Vertical axes: EER (%).

Figure 8: DET plot of the pathology detection system for the original speech records (gray) and those with simulated telephone channel distortion (black). Axes: false alarm probability (in %) against miss probability (in %); curves: frame level and record level, original and distorted.

6 CONCLUSIONS

In this paper, the performance of a speech pathology detector based on Mel Frequency Cepstral Coefficients has been analysed for the case in which the speech signal has undergone the distortion of an analogue communications channel. Namely, the telephone channel has been modeled as a concatenation of linear effects: band limitation, amplitude distortion, phase distortion and noise addition. It has been shown that, while the overall performance of the system is degraded, success rates over 80% can still be achieved. This study also reveals that the performance degradation is mainly due to band limitation and noise addition. Amplitude distortion, if complying with the standard (ITU, 1998), has little impact and phase distortion has no impact at all.

As for the most relevant sources of distortion, it has been shown that the loss of information in the 0-300 Hz band decreases performance significantly. Additionally, the effect of noise degradation becomes very relevant for values of SNR below 24 dB. For SNR equal to 20 dB, and considering bandwidth limitation and amplitude distortion too, the classification success rate can reach 80%. This figure is better than the results reported in (Moran et al., 2006).

The whole set of reported results allows us to conclude, in the first place, that remote pathology detection on speech transmitted through the analogue telephone channel seems feasible and, in the second place, that MFCC parameterization can provide a robust method for assessing the quality of degraded speech signals.


ACKNOWLEDGEMENTS

This research was carried out within projects funded by the Ministry of Science and Technology of Spain (TEC2006-12887-C02) and the Universidad Politécnica de Madrid (AL06-EX-PID-033). The work has also received support from European COST action 2103.

REFERENCES

MEE (1994). Voice disorders database v.1. CD-ROM. Massachusetts Eye and Ear Infirmary.

ITU (1998). Transmission characteristics of national networks. Series G: Transmission Systems and Media, Digital Systems and Networks, Rec. G.120 (12/98), ITU-T.

Baken, R. J. and Orlikoff, R. F. (2000). Clinical Measure-

ment of Speech and Voice. Singular Publishers, San

Diego (USA).

Bimbot, F., Bonastre, J. F., Fredouille, C., Gravier, G.,

Magrin-Chagnolleau, I., Meignier, S., Merlin, T.,

Ortega-Garcia, J., Petrovska, D., and Reynolds, D. A.

(2004). A tutorial on text-independent speaker veriﬁ-

cation. EURASIP Journal on Applied Signal Process-

ing, 2004(4):430–451.

Boyanov, B. and Hadjitodorov, S. (1997). Acoustic analysis

of pathological voices. A voice analysis system for the

screening of laryngeal diseases. IEEE Engineering in

Medicine and Biology, 16(4):74–82.

Davis, S. B. and Mermelstein, P. (1980). Comparison

of parametric representations for monosyllabic word

recognition in continuously spoken sentences. IEEE

Transactions on Acoustics, Speech and Signal Pro-

cessing, ASSP-28(4):357–366.

Deller, J. R., Proakis, J. G., and Hansen, J. H. L. (1993).

Discrete-time processing of speech signals. Macmil-

lan Publishing Company, New York (USA).

Dimolitsas, S. and Gunn, J. E. (1988). Modular, off

line, full duplex telephone channel simulator for high

speed data transceiver evaluation. IEE Proceedings,

135(2):155–160.

Fraile, R., Godino-Llorente, J. I., Sáenz-Lechón, N., Osma-Ruiz, V., and Gomez-Vilda, P. (2007). Analysis of the impact of analogue telephone channel on MFCC parameters for voice pathology detection. In Proceedings of the 8th INTERSPEECH Conference (INTERSPEECH 2007), pages 1218–1221.

Fraile, R., Godino-Llorente, J. I., Sáenz-Lechón, N., Osma-Ruiz, V., and Gómez-Vilda, P. (2008a). Use of cepstrum-based parameters for automatic pathology detection on speech. Analysis of performance and theoretical justification. In Proceedings of Biosignals 2008, volume 1, pages 85–91.

Fraile, R., Saenz-Lechon, N., Godino-Llorente, J. I., Osma-Ruiz, V., and Gomez-Vilda, P. (2008b). Use of mel-frequency cepstral coefficients for automatic pathology detection on sustained vowel phonations: Mathematical and statistical justification. In Proceedings of the International Symposium on Image/Video Communications over fixed and mobile networks, volume Accepted.

Godino-Llorente, J. I. and Gomez-Vilda, P. (2004). Au-

tomatic detection of voice impairments by means of

short-term cepstral parameters and neural network

based detectors. IEEE Transactions on Biomedical

Engineering, 51(2):380–384.

Godino-Llorente, J. I., Gomez-Vilda, P., and Blanco-

Velasco, M. (2006). Dimensionality reduction of a

pathological voice quality assessment system based

on gaussian mixture models and short-term cepstral

parameters. IEEE Transactions on Biomedical Engi-

neering, 53(10):1943–1953.

Haykin, S. (1994). Neural networks: A comprehensive

foundation. Macmillan, New York.

Jamieson, D. G., Parsa, V., Price, M. C., and Till, J. (2002). Interaction of speech coders and atypical speech II: Effects on speech quality. Journal of Speech, Language and Hearing Research, 45:689–699.

Martin, A. F., Doddington, G. R., Kamm, T., Ordowski, M.,

and Przybocki, M. A. (1997). The DET curve in as-

sessment of detection task performance. In Proceed-

ings of Eurospeech ’97, volume IV, pages 1895–1898,

Rhodes, Crete.

Moran, R. J., Reilly, R. B., de Chazal, P., and Lacy, P. D.

(2006). Telephony-based voice pathology assessment

using automated speech analysis. IEEE Transactions

on Biomedical Engineering, 53(3):468–477.

Murphy, P. J. and Akande, O. O. (2005). Quantiﬁcation

of glottal and voiced speech harmonics-to-noise ratios

using cepstral-based estimation. In Proceedings of the

International Conference on Non-Linear Speech

Processing (NOLISP’05), pages 224–232.

Parsa, V. and Jamieson, D. G. (2000). Identiﬁcation

of pathological voices using glottal noise measures.

Journal of Speech, Language and Hearing Research,

43(2):469–485.

Pouchoulin, G., Fredouille, C., Bonastre, J. F., Ghio, A., and

Giovanni, A. (2007). Frequency study for the charac-

terization of the dysphonic voices. In Proceedings of

the 8th INTERSPEECH Conference (INTERSPEECH

2007), pages 1198–1201.

Reynolds, D. A., Zissman, M. A., Quatieri, T. F., O’Leary,

G. C., and Carlson, B. A. (1995). The effects of tele-

phone transmission degradations on speaker recogni-

tion performance. In Proceedings of ICASSP ’95, vol-

ume 1, pages 329–332, Detroit, MI, USA.

Södersten, M. and Lindhe, C. (2007). Voice ergonomics - an overview of recent research. In Berlin, C. and Bligard, L. O., editors, Proceedings of the 39th Nordic Ergonomics Society Conference.

TM Alliance Team (2004). Telemedicine 2010: Visions for

a personal medical network. Technical Report BR-29,

ESA Publications Division.

Umapathy, K., Krishnan, S., Parsa, V., and Jamieson, D. G.

(2005). Discrimination of pathological voices using

a time-frequency approach. IEEE Transactions on

Biomedical Engineering, 52(3):421–430.
