Automatic Classification of Parkinson’s Disease
Through the Fusion of Sustained Vowel Descriptors
Sahar Hafsi¹,², Linda Marrakchi-Kacem¹, Farouk Mhamdi¹,³ and Sonia Djaziri-Larbi¹
¹Signals & Smart Systems Lab (L3S), National School of Engineering of Tunis, University of Tunis El Manar, Tunisia
²Laboratory of Informatics Paris Descartes (LIPADE), University Paris Cité, France
³Higher School of Statistics and Information Analysis, University of Carthage, Tunisia
Keywords: Parkinson’s Disease, Speech as a Biomarker, Acoustic Descriptors, Feature Fusion, Machine Learning.
Abstract:
Voice disorders are early symptoms of Parkinson’s disease (PD) and have motivated the use of speech as a
biomarker for PD. In particular, dysfunctional phonation of sustained vowels has gained increasing interest in
the automatic classification of PD. However, most studies typically focus on a single vowel to extract disease
descriptors, which may limit the detection of subtle vocal alterations present in PD patients. The main objective
of this study is to investigate the contribution of analyzing two vowels for the automatic classification of PD,
as opposed to relying on a single vowel. In this paper, we propose a novel automatic approach to identify
dysphonia in PD by combining speech descriptors extracted from two sustained vowels, /a:/ and /i:/. This
fusion enables the detection of a broader range of vocal alterations, thereby increasing the robustness of the
predictive models. Preprocessing of the speech signals was performed, followed by feature selection using
the ReliefF algorithm. Then, a robust nested cross-validation was applied to evaluate the models. The results
clearly indicate higher classification performance when combining the descriptors of /a:/ and /i:/.
1 INTRODUCTION
Parkinson’s Disease (PD) is the second most common neurodegenerative disease in the world, affecting the central nervous system (Bhat et al., 2018). The
diagnosis of PD is usually based on medical observa-
tion of specific clinical signs including a range of mo-
tor symptoms. However, some non-motor symptoms of PD manifest at an early stage and in a subtle way,
making their clinical observation and interpretation
difficult. Studies (Ho et al., 1998) report that approx-
imately 90% of patients with PD have some form of
voice impairment which is one of the earliest indica-
tors of this disease (Harel et al., 2004). To assist clinicians in making an early diagnosis of PD, machine learning
(ML) approaches have been widely applied to differ-
ent physiological signals (Jeancolas et al., 2016; Mei
et al., 2021), including voice.
The last decade’s growing interest in voice as a PD biomarker is motivated by its simple and non-
invasive measurement, in addition to the availability
of powerful speech processing tools to extract valu-
able information from voice signals. Various studies
have been proposed in the literature, where several
experimental and methodological aspects have been
tested in order to analyze their impact on the perfor-
mance of classification models for PD. Recent stud-
ies have been carried out by combining different ML
approaches on several types of speech data, using dif-
ferent descriptors.
As speech impairment in PD may be related to al-
tered production of vocal sounds (dysphonia) or/and
to speech articulation problems (dysarthria) (Arm-
strong and Okun, 2020), some of these studies fo-
cus on dysarthria and use speech signals (Galaz et al.,
2016; Jeancolas et al., 2016), others on dysphonia and
use sustained vowels (Hlavnicka et al., 2019; Little
et al., 2009), or even both (Tsanas et al., 2012). Even
though speech (numbers, words, predefined or spon-
taneous sentences) corresponds better to our natural
everyday use, sustained vowels are more convenient
for sharing open access databases (DB) since they are
almost language and culture independent, while car-
rying the necessary information about dysphonia (Lit-
tle et al., 2009; Sakar et al., 2013).
Various previous studies have examined the ef-
fectiveness of sustained vowels in PD classification,
using different DB. The authors of (Villa-Cañas et al., 2014) reported a detection accuracy of 71.6% using the phonation /i:/, and an accuracy rate of 70.67% for
the vowel /a:/, on a group of 50 PD and 50 healthy
controls (HC). This illustrates the significant variation in classification efficiency across different phonemes, highlighting the importance of taking into
account the acoustic particularities of each vowel for
PD analysis.
Nevertheless, there are PD classification studies
that report particularly high accuracy scores. For ex-
ample, in (Tsanas et al., 2012) and (Hariharan et al.,
2014), the authors achieved 98.6% and 100% de-
tection accuracy, respectively, using the vowel /a:/.
These results were obtained by analyzing a database
including 263 samples from 43 subjects (33 PD and
10 HC). However, these data were collected using six
or seven repetitions of the same vowel performed by
the same subject, which indicates a potential overfit-
ting problem when training the ML model. Datasets
typically contain multiple speech recordings per sub-
ject (i.e. multiple voice tasks with two or more repe-
titions of each one).
According to the reported works on voice-based
PD classification, the two most commonly used
datasets are described in (Little et al., 2009) and
in (Sakar et al., 2013), and are available in University
of California Irvine (UCI) ML repository as already
extracted and organized features. Given that sample
sizes of the datasets are rather small (31 and 40, re-
spectively), some studies have used repeated recordings of the same subjects as independent training and validation data to build the models. This method may lead
to biased results (Naranjo et al., 2016), as models may
overfit the training data, leading to overly optimistic
performance.
To avoid this problem while exploiting all avail-
able data samples, (Sakar and Kursun, 2010) pro-
posed a suitable cross-validation method, which consists in keeping all the observations of the same subject together in the test phase. This approach was then improved
by a more recent one which gives better classifica-
tion results (Sakar et al., 2013). It was based on cen-
tral tendency and feature dispersion metrics extracted
from different voice recordings of the same subject.
Although summarizing multiple voice samples from
each individual into a single sample has improved the
reliability of PD classification results (Sakar et al.,
2013), taking central tendency and dispersion met-
rics could reduce the information provided by differ-
ent voice tasks.
Therefore, it seems interesting to study the con-
tribution of each vocal task and then combine the de-
scriptors of the different tasks. We began this process
by individually analyzing the vocal descriptors of two
different sustained vowels to assess their respective
contributions. First, we used a single recording of the
same subject to avoid overfitting. Then, we combined
the vocal descriptors extracted from each vowel in a single feature vector. The aim is to provide a ro-
bust and unbiased assessment, highlighting the con-
tribution of feature fusion, while still looking for high
classification scores.
To our knowledge, no previous study has simi-
larly delved into the importance of sustained vowel descriptor fusion for PD classification, with the ex-
ception of a single reference (Pah et al., 2022). This
study, although it explored the combination of vowel
descriptors, did not provide precise details on the fu-
sion methodology. The results of the study (Pah et al.,
2022) demonstrate that features linked to the vocal
tract length (VTL) are the best adapted to differentiate PD voices from healthy ones. It should be noted that the authors of (Pah et al., 2022) did not mention whether they used one or more vowel repetitions of the same subject, which may suggest a risk of bias in the classification results. The other crucial difference with respect to (Pah et al., 2022) is the
diversity of descriptors used as inputs for the classi-
fiers. In our study, we opted for five different classifi-
cation models. This allows us to explore multiple per-
spectives on the data and determine which model per-
forms best in detecting vocal alterations in Parkinso-
nian voice. In contrast, the study in (Pah et al., 2022)
is limited to a single classification model, namely the
support vector machine (SVM), and uses only four
distinct groups of inputs: intensity, pitch, formants,
and VTL.
In this study, we focus on the fusion of vocal de-
scriptors, extracted from sustained vowels /a:/ and /i:/.
The main objective is to demonstrate that combining the descriptors of these two vowels improves classification performance compared to using descriptors of a single vowel.
The remainder of the paper is organized as fol-
lows: Section 2 details the adopted approach, the database used, signal preprocessing and the extracted speech descriptors. Section 3 gives the details of fea-
ture selection (FS), ML techniques as well as clas-
sification models. Section 4 includes the results of
each experiment, classification performance and dis-
cussion. Finally, we outline the major findings in the
conclusion.
2 SPEECH DATA
The proposed PD classification framework is illus-
trated in Fig. 1. The use of vowels in speech-based
PD classification is justified by their ability to cap-
ture subtle vocal alterations and their phonetic stabil-
ity, which enables more accurate acoustic analysis.
Figure 1: Block diagram of the proposed approach: sustained vowel descriptors’ fusion.
2.1 Dataset
The PC-GITA DB (Orozco-Arroyave et al., 2014) used in this study consists of 100 Colombian Spanish native speakers, equally distributed between 50
PD patients and 50 HC, matched by age and gen-
der. This corpus includes a variety of voice record-
ings, such as text readings, words and vowels phona-
tion. It contains three repetitions of five vowels, pro-
nounced in two ways: sustained manner and with a
tone change from low to high. Speech recordings
were sampled at a frequency of 44.1 kHz with 16-bit
resolution (Orozco-Arroyave et al., 2014).
We restricted our selection to recordings of the
sustained vowels /a:/ and /i:/, with only one record-
ing per vowel and per individual (cf. subsection 2.2),
resulting in a DB of 100 samples for each vowel. We
limited our analysis to two vowels from the PC-GITA
DB to balance vocal descriptors with participant num-
bers, avoiding overfitting and reducing computation
time. Also, literature highlights sustained vowels /a:/
and /i:/ as most effective for PD classification (Mei
et al., 2021), (Islam et al., 2023) and (Bhattacharjee
et al., 2023). Moreover, we chose /a:/ and /i:/ because of their phonetic similarity across languages, which aids cross-linguistic comparisons and minimizes pronunciation effects on speech measures.
2.2 Impact of Data Dependence on
Classification Accuracy
In this study, the decision to use only one vowel
recording per individual is driven by the need to en-
sure data independence. By limiting each subject to
a single recording, we ensured a dataset where each
sustained vowel is represented in a balanced manner
across individuals. Using multiple samples from the
same person in speech classification introduces sev-
eral types of errors. The primary concern is overfit-
ting, where the model may learn individual-specific
vocal details rather than the general phonation char-
acteristics of healthy controls or PD patients.
According to the study in (Naranjo et al., 2016),
classification methods that assume data independence
should not be used when multiple voice recordings
from the same subjects are present. The authors ex-
plain that this creates an artificial increase in sam-
ple size, resulting in high performance on the train-
ing data but poor generalization to new data. Yet, treating dependent data as independent remains a common practice in speech analysis-based classifica-
tion. Several studies such as (Das, 2010), (Tsanas
et al., 2012), (Jafari, 2013) and (Orozco-Arroyave
et al., 2013), have used multiple recordings from the
same subject as independent data. This practice leads
to biased classification results, as the model’s perfor-
mance is overestimated by artificially treating similar
recordings as independent data.
2.3 Speech Signal Preprocessing
Preprocessing is a crucial step to prepare the record-
ings for the extraction of speech descriptors. Various
techniques such as re-sampling, normalization, segmentation, and filtering yield speech signals that are better suited for subsequent analyses.
We first down-sampled the signals to 8 kHz, since
the 0-4 kHz frequency band contains the main phona-
tion information, such as formants, pitch, and har-
monics. We also applied a pre-emphasis filter, which
accentuates high-frequency components (Gore et al.,
2020), associated with rapid transitions in the speech
signal. This filter can make some indicators often associated with PD, such as irregularities in pitch modulation or in the rapid articulation of phonemes, more detectable. The effect of the pre-emphasis filter is highlighted in Fig. 2.
2.4 Voice Descriptors
This study focuses on the analysis of sustained vowels
to detect speech disorders specific to PD, which affect
patients’ ability to maintain steady vocal folds vibra-
tions (pitch and harmonics) and a stable position of
the vocal tract articulators during vowel production.
Figure 2: Comparison of spectrograms before and after pre-emphasis filter: the sustained vowel /a:/ of a HC. On the left, the
original spectrogram: low-frequency components are the most present. On the right, the pre-emphasized spectrogram shows
a relative increase in higher frequency components.
The sustained vowel speech task involves producing
the vowel steadily and without interruption for as long
as possible.
To evaluate phonation changes in PD, several
acoustic voice parameters have been studied in the
literature. In this work, we selected three groups
of features that characterize speech production dis-
orders related to the vibratory function of the vocal
folds and the resonances of the vocal tract. The first
group of descriptors consists of time-domain descrip-
tors, namely pitch, jitter, shimmer and their variants.
This group captures abnormal pitch variations during
sustained phonation. While healthy voices show natu-
ral pitch variation (low vibrato and tremor), impaired
control of steady voice pitch during sustained vowel
production is a common symptom in PD. We also in-
cluded the Pitch Period Entropy (PPE) and the Recur-
rence Period Density Entropy (RPDE) descriptors in
the first group. PPE was introduced in (Little et al.,
2009) as an effective measure for discriminating nat-
ural pitch variations from those caused by PD, while
RPDE (Little et al., 2007) assesses the degree of peri-
odicity based on phase calculations and is considered
as a good indicator of pitch irregularity.
The second group of descriptors consists of
Harmonics-to-Noise Ratio (HNR) and Detrended
Fluctuation Analysis (DFA) (Little et al., 2007),
which assess the noise produced by turbulent air-
flow through the vocal system due to incomplete clo-
sure of the vocal cords, thereby characterizing voice
harshness. The third group includes spectral do-
main features: Perceptual Linear Predictive Coef-
ficients (PLPC) and Mel Frequency Cepstral Coef-
ficients (MFCC) along with their first and second
derivatives. Both are computed on short-term win-
dows of the signal, typically 20 to 30 ms. They char-
acterize the spectral envelope of speech on a non-
linear scale (the Mel scale for MFCC and the Bark
scale for PLPC), which mimics the human hearing
scale. In particular, MFCC and PLPC capture the
frequency of the vocal tract resonances (or formants)
and are good indicators for vocal tract misplacement
during sustained vowel phonation (Bouagina et al.,
2023).
2.5 Descriptor Extraction
The descriptors extracted with Parselmouth (pitch, jitter, shimmer and their variants) are obtained by a global analysis: they are calculated over the entire voice recording, so the resulting measures represent average values for its whole duration.
MFCC, PLPC, HNR and SFM are computed frame by frame to capture the characteristics of the audio signal. We extracted 13 MFCCs, 13 δMFCCs, 13 δ²MFCCs and 13 PLPCs per analysis window (Dave, 2013). After dividing the signal into overlapping frames and extracting these coefficients from each frame, the final descriptors are obtained by averaging each coefficient over all frames. The RPDE
and DFA calculations are carried out on the entire au-
dio signal, resulting in global measurements for each
recording rather than specific values for individual
segments or frames. As for PPE, it consists of extracting the pitch, analyzing its variations over the entire recording, then quantifying the diversity of fluctuations by calculating the entropy of the probability distribution of these variations. In total, we used 70 speech descriptors per vowel. The vocal feature sets were extracted from the sustained vowels /a:/ and /i:/ and computed using preconfigured Python libraries such as Parselmouth (Jadoul et al., 2018) and Librosa (McFee et al., 2015). By combin-
ing these descriptors, which capture vocal cord and
vocal tract tremors, we have a set of vocal descriptors
to automatically classify PD dysphonia.
3 CLASSIFICATION
This section highlights the strengths of our classifi-
cation approach, focusing on the nested k-fold cross-
validation and FS method. The nested k-fold ensures
reliable classification performance by addressing data
size constraints, while FS reduces dimensionality by
selecting the most relevant vocal descriptors, enhanc-
ing model performance.
We conducted several experiments to classify PD
dysphonia based on vocal descriptors. First, we clas-
sified features extracted from the sustained vowel /a:/,
then features from the sustained vowel /i:/. Following
this, we combined the descriptors from both vowels
(/a:/ + /i:/) for a fusion analysis. In each experiment,
we evaluated five supervised ML models: k-Nearest
Neighbor (KNN), Decision Tree (DT), Support Vec-
tor Machine (SVM), Random Forest (RF), and Gradi-
ent Boosting (GB).
3.1 Nested k-Fold Cross-Validation
When working with small datasets, it can be chal-
lenging to balance providing enough data for the
model to train effectively and reserving sufficient
test data to evaluate the model on unseen subjects.
Cross-validation has long been a reliable solution
for this. However, using traditional cross-validation,
combined with data normalization, FS and hyperparameter tuning, can lead to a data leakage problem,
where information from the test set unintentionally in-
fluences the training process.
The cross-validation approach we implemented
addresses this issue by maximizing the use of avail-
able data while preventing data leakage through a
two-step nested cross-validation process (Fig. 3). As
the first step, the outer k₁-fold cross-validation splits the dataset into an initial training set and a testing set, with k₁ = 5 folds. In each outer iteration, 4 folds are used for model training, while the 5th fold is reserved for testing. In the next step, an inner k₂-fold cross-validation further splits the previously obtained training set (the 4 folds) into a new training set and a validation set, where k₂ = 4. In each iteration of the inner cross-validation, 3 folds are used for model training, while the 4th fold is used for hyperparameter tuning, using GridSearchCV. Once the optimal hyperparameters are identified, we evaluate the final performance of our models on the initially reserved test set.
Figure 3: Classification approach: application of double, nested k-fold cross-validation for model evaluation.
3.2 Feature Preprocessing
For each experiment involving one or two sustained vowels, feature preprocessing was applied to the training set in each iteration of the outer k₁-fold cross-validation; it included normalization and FS.
3.2.1 Normalization
Normalization was applied in the outer cross-validation using MinMaxScaler. The minimum and maximum values of the descriptors were computed only from the training set, ensuring that the test set does not influence the training process. These parameters were then used to normalize both the training and test sets.
3.2.2 Feature Selection
The number of features plays a crucial role in binary
classification, especially when working with small
datasets. An excess of features relative to the num-
ber of observations can lead to overfitting (Guyon and
Elisseeff, 2003) (Bol
´
on-Canedo et al., 2013). There-
fore, a judicious selection of variables is essential to
maximize the model’s accuracy. Given that we have
70 features for each vowel and we are combining two
vowels for a dataset of 100 subjects, we must con-
sider the direct impact on data dimensionality. The
fusion of vowel features increases the total number
of features to 140, which is a high-dimensional vec-
tor relative to the number of samples. Thus, a FS
method was performed before the classification step.
As with data normalization, and to avoid data leakage, FS was performed at each iteration of the outer cross-validation using the ReliefF algorithm. We re-
tain only the 10 most significant speech descriptors
for training the classifiers in each configuration (/a:/,
/i:/, and /a:/+/i:/).
The ReliefF method is a widely used FS technique
for identifying the most discriminatory variables in
a dataset, based on the differences observed between
close samples (Rosario and Thangadurai, 2015). The
principle of ReliefF relies on the iterative updating of
feature weights according to their ability to differen-
tiate samples belonging to different classes. ReliefF
algorithm identifies, for each sample in the dataset,
its close neighbors, both from the same class (positive) and from the opposite class (negative). By comparing the feature values of this sample with those of its neighbors, the algorithm increases the score of a feature when its values are similar to those of the positive neighbors and different from those of the negative neighbors. At the end of this iterative process, each feature receives a score that reflects its relevance.
The choice of ReliefF as a FS technique is mo-
tivated by several reasons. It is effective in han-
dling datasets with redundant and correlated features,
which is the case here where several speech descrip-
tors may contain similar information (Urbanowicz
et al., 2018). Also, this algorithm is well-suited for
handling datasets with a high dimensionality relative
to their size.
3.3 Performance Evaluation
Performance evaluation is conducted at each iteration
of the outer cross-validation by calculating several
key indicators such as accuracy, precision, sensitiv-
ity (recall), and F1 score. Each iteration provides a
measure of these parameters. At the end of the cross-
validation process, an overall analysis is performed by
calculating the mean and standard deviation of each
of these performance indicators. This yields the model’s average performance and provides a more robust assessment of the results.
4 RESULTS AND DISCUSSION
4.1 Feature Analysis
By analyzing the results of the double nested cross-
validation, we observed that some features systemat-
ically reappeared as the most relevant for the clas-
sification. Our preliminary analyses revealed that
10 features are sufficient to capture the essential in-
formation needed for the classification of PD, while
maintaining a balance between accuracy and effi-
ciency. The ten features most frequently identified as relevant by ReliefF across the cross-validation iterations were used as classifier inputs. This process was carried out for
the three configurations: single vowel /i:/, single
vowel /a:/, and descriptor fusion of both vowels /i:/
and /a:/. In the latter case, among all 140 de-
scriptors we retained the following: ’PLP-Coeff-2-a’, ’rapJitter-a’, ’mfcc-1-i’, ’localJitter-a’, ’mfcc-2-i’, ’mfcc-delta-3-a’, ’ppq5Jitter-a’, ’localabsoluteJitter-a’, ’mfcc-delta2-1-i’, and ’mfcc-3-a’. Note that features selected from the descriptor fusion of the two vowels include descriptors identified during the separate analysis of each vowel. This shows that some important information is found in each analysis, confirming the relevance of both vowels.
Figure 4: Boxplot of SVM and RF classification accuracy.
4.2 Classification Results
In this section, we aim to evaluate the performance
of the classification algorithms on our data, using the
extracted vocal features, and to present the results ob-
tained from the fusion of the descriptors of the sus-
tained vowels /a:/ and /i:/.
4.2.1 Results without FS
Classification based on the fusion of the two vow-
els’ descriptors, without using the FS method showed
promising results (cf. Fig. 4). Models’ performance is
detailed in Table 1. The DT yields suboptimal results, with an accuracy of 63.50% ± 13.24 and an F1 score of 61.39%. However, the SVM classifier stands out as the best model with an accuracy of 73.50%, indicating its ability to classify the PD voice well even in the presence of redundant descriptors.
The results with the separate vowels are lower than those obtained with the fusion. However, the performance boost brought by the fusion was attenuated by the redundancy between descriptors and by the increased size of the feature vector after fusion.
4.2.2 Results with ReliefF
Table 1: Classification performance with vowels’ descriptors fusion and without FS: means and standard deviations.

Model   Acc. (%)        Prec. (%)       Sens. (%)       F1-sc. (%)
KNN     66.50 ± 9.23    64.42 ± 9.12    76.00 ± 16.55   68.85 ± 10.33
DT      63.50 ± 13.24   64.95 ± 15.63   59.00 ± 16.09   61.39 ± 15.07
SVM     73.50 ± 11.52   76.86 ± 13.64   70.50 ± 20.12   71.57 ± 14.31
RF      70.25 ± 9.55    72.85 ± 11.49   66.50 ± 15.26   68.52 ± 11.47
GB      68.50 ± 10.62   68.55 ± 10.68   70.00 ± 16.12   68.45 ± 11.56
Table 2: Classification performance with vowels’ descriptors fusion and with FS: means and standard deviations.

Model   Acc. (%)        Prec. (%)       Sens. (%)       F1-sc. (%)
KNN     79.60 ± 12.04   80.14 ± 12.16   79.30 ± 16.03   79.32 ± 13.11
DT      76.40 ± 11.04   78.88 ± 12.94   72.45 ± 14.15   74.80 ± 11.95
SVM     84.70 ± 8.02    89.08 ± 9.59    77.35 ± 14.71   82.40 ± 9.50
RF      82.60 ± 10.00   83.96 ± 9.60    79.30 ± 15.07   81.10 ± 12.14
GB      80.50 ± 9.89    79.20 ± 9.64    83.20 ± 13.63   81.14 ± 10.71
We observe from the boxplots in Fig. 4 a clear improvement in classification performance after applying the ReliefF method for all three configurations, with and without fusion. The improvement in classification rates in the boxplots, with larger values and higher medians, confirms the crucial role of FS techniques in improving classification results. For clarity, we have chosen to illustrate only the performance of the two best models, SVM and RF, in Fig. 4.
According to Fig. 4 and Table 2, fusion results
demonstrate a clear improvement in classification per-
formance compared to those obtained from each sin-
gle vowel. The SVM model stands out as the best
one, showing the highest accuracy and precision rates.
It also presents the best F1-score, illustrating a good balance between precision and sensitivity. RF stands out as the second best classifier, while GB achieves the highest sensitivity rate, showing its ability to correctly identify positive cases. DT shows the weakest performance for all
metrics.
When comparing Tables 1 and 2, it is evident that
FS has significantly enhanced the performance of the
models in the case of vowel fusion. By reducing the
size of the feature vector, the models are no longer
burdened by the complexity linked to the large num-
ber of descriptors. Consequently, the combination of
vowel fusion and FS has improved the algorithms’
performance, yielding notable score gains across all
five models, with improvements of up to 13%.
5 CONCLUSIONS
The aim of this study is to demonstrate the rel-
evance of combining descriptors extracted from two
vowels, instead of a single one, as is most commonly
done, while developing a robust model. To get the
most out of this approach, it was relevant to combine
it with a FS method. The obtained results, with fusion
accuracy scores around 85%, demonstrate the effec-
tiveness of this approach. Our work explored the performance of five different ML models using the fusion
of vocal features extracted from two sustained vow-
els, /a:/ and /i:/, to detect PD vocal alterations. Clas-
sification results show that the fusion of descriptors
significantly outperforms single-vowel-based analy-
ses. Among the tested models, SVM was found to
be the best performer.
Although this study provides promising results
regarding descriptor fusion, it has some limitations.
Using a larger database and additional vowels could strengthen the models’ robustness and improve the ac-
curacy. Further exploration of other vocal features,
such as prosody (text reading, continuous speech),
could allow a more complete and precise analysis.
In future research, we will explore the five vowels of
the PC-GITA database and test other feature selection
techniques to identify the most relevant vocal descrip-
tors. We also aim to increase the number of vocal fea-
tures selected during the fusion process to 15 or even
20. The goal is to analyze the contribution of each
vowel to the classification of PD dysphonia and to val-
idate the proposed approach on additional datasets to
evaluate its generalizability.
ACKNOWLEDGEMENTS
We express our gratitude to Prof. Orozco-Arroyave
for providing the database used in this study.
This research was supported by the Hubert Curien Partnership ’Utique’, funded by the French Ministry of Europe and Foreign Affairs and the Tunisian Ministry of Higher Education and Scientific Research.
REFERENCES
Armstrong, M. J. and Okun, M. S. (2020). Diagnosis and
treatment of Parkinson’s disease: a review. JAMA.
Bhat, S., Acharya, U. R., Hagiwara, Y., Dadmehr, N., and
Adeli, H. (2018). Parkinson’s disease: Cause factors,
measurable indicators, and early diagnosis. Comput-
ers in Biology and Medicine.
Bhattacharjee, T., Belur, Y., Nalini, A., Yadav, R., and
Ghosh, P. (2023). Exploring the Role of Fricatives in
Classifying Healthy Subjects and Patients with Amy-
otrophic Lateral Sclerosis and Parkinson’s Disease. In
ICASSP.
Bolón-Canedo, V., Sánchez-Maroño, N., and Alonso-Betanzos, A. (2013). A review of feature selection methods on synthetic data. Knowledge and Information Systems.
Bouagina, S., Naouara, M., Hafsi, S., and Djaziri-Larbi, S.
(2023). MFCC-Based Analysis of Vibratory Anoma-
lies in Parkinson’s Disease Detection using Sustained
Vowels. In AMCAI.
Das, R. (2010). A comparison of multiple classification
methods for diagnosis of Parkinson disease. Expert
Systems with Applications.
Dave, N. (2013). Feature Extraction Methods LPC, PLP
and MFCC In Speech Recognition.
Galaz, Z., Mekyska, J., Mzourek, Z., Smekal, Z., Rek-
torova, I., Eliasova, I., Kostalova, M., Mrackova, M.,
and Berankova, D. (2016). Prosodic analysis of neu-
tral, stress-modified and rhymed speech in patients
with Parkinson’s disease. Computer methods and pro-
grams in biomedicine.
Gore, S. M., Salunke, M. M., Patil, S. A., and Kemalkar, A.
(2020). Disease detection using voice analysis. IR-
JET.
Guyon, I. and Elisseeff, A. (2003). An introduction to vari-
able and feature selection. JMLR.
Harel, B., Cannizzaro, M., and Snyder, P. J. (2004). Vari-
ability in fundamental frequency during speech in pro-
dromal and incipient Parkinson’s disease: a longitudi-
nal case study. Brain and Cognition.
Hariharan, M., Polat, K., and Sindhu, R. (2014). A new
hybrid intelligent system for accurate detection of
Parkinson’s disease. Computer Methods and Pro-
grams in Biomedicine.
Hlavnicka, J., Cmejla, R., Klempir, J., Ruzicka, E., and Rusz, J. (2019). Acoustic tracking of pitch, modal, and subharmonic vibrations of vocal folds in Parkinson’s disease and Parkinsonism. IEEE Access.
Ho, A. K., Iansek, R., Marigliani, C., Bradshaw, J. L., and
Gates, S. (1998). Speech impairment in a large sam-
ple of patients with Parkinson’s disease. Behavioural
Neurology.
Islam, R., Abdel-Raheem, E., and Tarique, M. (2023).
Parkinson’s Disease Detection Using Voice Features
and Machine Learning Algorithms.
Jadoul, Y., Thompson, B., and de Boer, B. (2018). Introduc-
ing Parselmouth: A Python interface to Praat. Journal
of Phonetics.
Jafari, A. (2013). Classification of Parkinson’s disease patients using nonlinear phonetic features and mel-frequency cepstral analysis. BME.
Jeancolas, L., Petrovska-Delacrétaz, D., Lehéricy, S., Benali, H., and Benkelfat, B. (2016). Voice analysis as a tool for early diagnosis of Parkinson’s disease: state of the art. In CORESA.
Little, M., Mcsharry, P., Roberts, S., Costello, D., and Mo-
roz, I. (2007). Exploiting nonlinear recurrence and
fractal scaling properties for voice disorder detection.
Biomedical engineering online.
Little, M. A., McSharry, P. E., Hunter, E. J., Spielman, J.,
and Ramig, L. O. (2009). Suitability of dysphonia
measurements for telemonitoring of Parkinson’s dis-
ease. IEEE TBME.
McFee, B., Raffel, C., Liang, D., Ellis, D., McVicar, M.,
Battenberg, E., and Nieto, O. (2015). librosa: Audio
and Music Signal Analysis in Python.
Mei, J., Desrosiers, C., and Frasnelli, J. (2021). Machine
learning for the diagnosis of Parkinson’s disease: a
review of literature. Frontiers in aging neuroscience.
Naranjo, L., Perez, C. J., Campos-Roca, Y., and Martin, J.
(2016). Addressing voice recording replications for
Parkinson’s disease detection. Expert Systems with
Applications.
Orozco-Arroyave, J. R., Arias-Londoño, J. D., Vargas-Bonilla, J. F., González-Rátiva, M. C., and Nöth, E. (2014). New Spanish speech corpus database for the analysis of people suffering from Parkinson’s disease. In LREC.
Orozco-Arroyave, J. R., Arias-Londoño, J. D., Vargas-Bonilla, J. F., and Nöth, E. (2013). Analysis of Speech from People with Parkinson’s Disease through Nonlinear Dynamics. In Advances in Nonlinear Speech Processing.
Pah, N. D., Motin, M. A., and Kumar, D. K. (2022).
Phonemes based detection of Parkinson’s disease for
telehealth applications. Scientific Reports.
Rosario, S. F. and Thangadurai, K. (2015). RELIEF: Fea-
ture Selection Approach. IJIRD.
Sakar, B. E., Isenkul, M. E., Sakar, C. O., Sertbas, A., Gur-
gen, F., Delil, S., Apaydin, H., and Kursun, O. (2013).
Collection and analysis of a Parkinson speech dataset
with multiple types of sound recordings. IEEE JBHI.
Sakar, C. O. and Kursun, O. (2010). Telediagnosis of
Parkinson’s disease using measurements of dyspho-
nia. Journal of Medical Systems.
Tsanas, A., Little, M., McSharry, P., Spielman, J., and
Ramig, L. (2012). Novel speech signal processing
algorithms for high-accuracy classification of Parkin-
son’s disease. IEEE TBME.
Urbanowicz, R. J., Meeker, M., La Cava, W., Olson, R. S.,
and Moore, J. H. (2018). Relief-based feature selec-
tion: Introduction and review. JBI.
Villa-Cañas, T., Orozco-Arroyave, J. R., Vargas-Bonilla, J. F., and Arias-Londoño, J. D. (2014). Modulation spectra for automatic detection of Parkinson’s disease. In STSIVA.