Speaker Verification Enhancement via Speaking Rate Dynamics in Persian Speechprints

Nina Hosseini-Kivanani¹ (https://orcid.org/0000-0002-0821-9125),
Homa Asadi² (https://orcid.org/0000-0003-1655-1336)
and Christoph Schommer¹ (https://orcid.org/0000-0002-0308-7637)
¹ Department of Computer Science, University of Luxembourg, Esch-sur-Alzette, Luxembourg
² Faculty of Foreign Languages, University of Isfahan, Isfahan, Iran
{nina.hosseinikivanani, christoph.schommer}@uni.lu, h.asadi@fgn.ui.ac.ir

Keywords:
Speaker Verification, Mel-Frequency Cepstral Coefficients (MFCCs), Vowel Formants, Deep Learning,
Persian Language.
Abstract:
This paper investigates the impact of speaking rate variation on speaker verification using a hybrid feature
approach that combines Mel-Frequency Cepstral Coefficients (MFCCs), their dynamic derivatives (delta and
delta-delta), and vowel formants. To enhance system robustness, we also applied data augmentation tech-
niques such as time-stretching, pitch-shifting, and noise addition. The dataset comprises recordings of Persian
speakers at three distinct speaking rates: slow, normal, and fast. Our results show that the combined model
integrating MFCCs, delta-delta features, and formant frequencies significantly outperforms individual fea-
ture sets, achieving an accuracy of 75% with augmentation, compared to 70% without augmentation. This
highlights the benefit of leveraging both spectral and temporal features for speaker verification under varying
speaking conditions. Furthermore, data augmentation improved the generalization of all models, particularly
for the combined feature set, where precision, recall, and F1-score metrics showed substantial gains. These
findings underscore the importance of feature fusion and augmentation in developing robust speaker veri-
fication systems. Our study contributes to advancing speaker identification methodologies, particularly in
real-world applications where variability in speaking rate and environmental conditions presents a challenge.
1 INTRODUCTION
Speech production is a highly complex phenomenon
in which dynamic articulatory gestures drive the
movements of speech organs to achieve specific tar-
gets within the vocal tract geometry (Tilsen, 2014).
These articulatory movements shape the acoustic fea-
tures of speech, which carry rich information, en-
abling listeners to comprehend both the linguistic
content (what is said) and the speaker-specific de-
tails (who said it). Identifying speakers based on the
characteristics of their voices is a fundamental goal in
forensic phonetics and automatic speaker recognition
systems (Rose, 2002; Nolan, 1987).
One of the key challenges in speaker identifica-
tion is the high variability in acoustic characteristics
across speakers, compared to the relatively low vari-
ability within a single speaker (Gold et al., 2013;
McDougall, 2006). In forensic speaker comparison
(FSC), addressing this variability is crucial when de-
termining whether a known voice sample matches
an unknown (disputed) sample (Rose, 2002; Nolan,
1987).
Identifying robust speaker-specific parameters re-
mains challenging due to intertwined factors, such
as linguistic effects, prosody, and channel condi-
tions, which influence system accuracy (Rose, 2002).
Among these factors, speaking rate variability plays a
particularly important role, as speakers naturally ad-
just their rate based on communicative context, emo-
tional state, or physiological conditions (Gay et al.,
1974; Imaizumi and Kiritani, 1989). Such variations
affect both articulatory movements and acoustic prop-
erties, posing challenges to speaker recognition sys-
tems (Shahrebabaki et al., 2018; Zeng et al., 2015).
From a computational standpoint, automatic
speaker identification (ASI) systems must account for
changes in speaking rate (Reynolds et al., 2000). Tra-
ditional ASI systems use MFCCs as their primary fea-
ture set due to their effectiveness in capturing speaker-
specific spectral information (Davis and Mermelstein,
1980). However, these systems often degrade when
faced with speaking rate variations, as spectral char-
acteristics shift with changes in articulatory dynam-
ics (Zeng et al., 2015). Recent advances in deep learn-
ing (DL) have addressed this by developing speaker-
invariant representations that are more robust to such
variations (Hinton et al., 2012; Xie et al., 2019).
Despite these advancements, vowel formants re-
main valuable in forensic applications due to their in-
terpretability and close relationship to the physiolog-
ical aspects of speech production (Gold et al., 2013).
Given that both vowel formants and MFCCs are dif-
ferentially affected by speaking rate, assessing their
relative robustness is critical for improving speaker
identification systems.
In this study, we evaluate the impact of speaking
rate variability on speaker identification performance
by comparing the robustness of vowel formants and
MFCCs. Specifically, we aim to:
1. examine how speaking rate influences vowel for-
mant frequencies and MFCC features;
2. assess the effectiveness of these features in
speaker identification under varying speaking
rates;
3. and determine whether combining formant and
MFCC features improves accuracy and resilience
in speaker verification systems.
By addressing these objectives, this study pro-
vides insights into selecting acoustic features most
effective for speaker identification under conditions
of variable speaking rates. The remainder of this pa-
per is organized as follows: Section 2 reviews related
work from both phonetic and computational perspec-
tives. Section 3 outlines our experimental methodol-
ogy. Section 4 presents our findings, while Section 5
discusses their implications. Finally, Section 6 con-
cludes the paper and suggests directions for future re-
search.
2 RELATED WORK
Forensic speaker comparison (FSC) and automatic
speaker recognition systems rely on acoustic features,
such as vowel formants and Mel-Frequency Cep-
stral Coefficients (MFCCs), to differentiate speak-
ers based on their unique vocal characteristics (Rose,
2002; Nolan, 1987). These features effectively cap-
ture speaker-specific and linguistically relevant as-
pects of speech, making them central to forensic and
automatic speaker verification tasks.
Speaking rate is a critical variable that affects the
articulation and acoustic properties of speech. Re-
search has shown that varying speaking rates influ-
ence the kinematics of speech production. For in-
stance, Gay (Gay et al., 1974) demonstrated that in-
creased speaking rate is associated with heightened
muscle activity, such as more pronounced lip clo-
sure and greater bilabial consonant openings. Sim-
ilarly, Tuller and Kelso (Tuller et al., 1982) found
that faster-speaking rates result in shorter muscle ac-
tivity durations, while slower rates lead to longer ar-
ticulatory movement times. These findings under-
score how speaking rate directly impacts speech ar-
ticulation, making it a vital factor in speaker verifica-
tion tasks. These articulatory changes, influenced
by speaking rate, are further reflected in the vari-
ability of specific speech gestures. For instance,
Shaiman et al. (Shaiman et al., 1997) observed that
lip gestures exhibit variability across different speak-
ing rates, complicating the consistency of speaker-
specific features as articulation velocity changes.
From an acoustic perspective, speaking rate sig-
nificantly alters the spectral characteristics and for-
mant trajectories of speech. Imaizumi and Kiri-
tani (Imaizumi and Kiritani, 1989) observed that rapid
speech can lead to vowel reduction, especially in the
second formant frequency (F2). Additionally, Weis-
mer and Berry (Weismer and Berry, 2003) showed
that speakers modify formant movement trajectories
based on their speaking rate, with F2 being particu-
larly affected by these changes. This suggests that
variations in speaking rate not only affect articulatory
gestures but also modify key acoustic features critical
for speaker recognition.
Furthermore, research by Mefferd and
Green (Mefferd and Green, 2010) demonstrated
that formant transitions become sharper at higher
speaking rates. Their studies also noted that vowel
formant distances exhibit greater specificity in slow
speech compared to fast speech. Agwuele et al. (Ag-
wuele et al., 2009) found a reduction in vowel space
with faster speaking rates, indicating that articulation
and vowel acoustics are closely tied to the speed
of speech. Together, these studies highlight the
variability in formant frequency patterns and their
dependency on speaking rate, a critical challenge for
speaker identification systems.
Recent advancements in speaker verification have
explored the use of learnable MFCCs to improve ro-
bustness to variable speaking rates. For instance, Liu
et al. (Liu et al., 2021) introduced adaptive MFCC
front-end architectures that adjust to data, making
these features more resilient to changing speech con-
ditions, including speaking rate variability. This adap-
tive approach has shown significant improvements
in speaker verification performance, particularly in
large-scale datasets like VoxCeleb1 and SITW. How-
ever, while learnable MFCCs offer improvements,
they may not fully capture the dynamic properties of
speech under all conditions.
2.1 Speaker Recognition and
Computational Approaches
In traditional automatic speaker recognition, MFCCs
are widely used for their robustness in capturing
speaker-specific spectral properties (Davis and Mer-
melstein, 1980; Reynolds et al., 2000). However, they
are susceptible to degradation under varying speak-
ing rates, as spectral characteristics shift due to ar-
ticulatory dynamics (Zeng et al., 2015). Zeng and
Sheng (Zeng et al., 2015) demonstrated that these
changes directly affect the reliability of MFCC-based
systems. To address these limitations, recent ap-
proaches have incorporated machine learning (ML)
and deep learning (DL) techniques, such as convo-
lutional neural networks (CNNs) and utterance-level
aggregation methods (Xie et al., 2019), which learn
hierarchical representations that improve the robust-
ness of speaker verification systems under speaking
rate variability (Hinton et al., 2012).
Recent work has also explored hybrid models
that integrate the features of MFCCs with DL ar-
chitectures. For instance, combining MFCCs with
CNNs captures both local and global speech pat-
terns, enhancing robustness to noise and rate variabil-
ity. Furthermore, bi-directional long short-term mem-
ory (Bi-LSTM) networks improve speaker verifica-
tion by modeling long-range temporal dependencies,
which are essential for capturing dynamic speech
variations (Anupama et al., 2022).
Hybrid optimization strategies have fur-
ther advanced these models. Chakravarty &
Dua (Chakravarty and Dua, 2023) combined MFCCs
and Gammatone Cepstral Coefficients (GTCCs)
with data augmentation methods, such as Synthetic
Minority Over-Sampling Technique (SMOTE), to
improve performance. By leveraging a hybrid LSTM
backend, their approach enhanced model accuracy
and resilience under noisy conditions, demonstrating
the value of combining feature sets and optimization
algorithms in varying conditions.
In forensic applications, interpretability remains
critical, making vowel formants valuable despite their
susceptibility to speaking rate variability. Formants
offer insights into the physiological aspects of speech
production, which are useful for distinguishing speak-
ers (Asadi et al., 2018; McDougall, 2006). How-
ever, their limitations as standalone features neces-
sitate combining formants with robust features like
MFCCs. Jahangir et al. (Jahangir et al., 2020)
demonstrated that integrating traditional acoustic fea-
tures with DL-derived representations significantly
enhances speaker identification under variable speak-
ing rates.
Some studies have also shown that the fusion
of acoustic features improves performance. Bahari
et al. (Bahari and Van Hamme, 2011) demonstrated
that combining formant frequencies and MFCCs cap-
tures both articulatory and spectral information, lead-
ing to better recognition rates under challenging con-
ditions. Advanced modeling techniques, such as i-
vectors and x-vectors, have further demonstrated ro-
bustness across varying speaking conditions (Dehak
et al., 2010). These findings emphasize the benefits
of combining diverse feature sets for robust speaker
verification systems.
2.2 Research Gap
While progress has been made in understanding how
speaking rate affects speech production and acous-
tic features, few studies comprehensively compare the
robustness of formants and MFCCs in speaker verifi-
cation under varying rates. Further research is needed
to evaluate whether combining these features with
deep learning techniques can improve resilience to
speaking rate variability, especially in forensic and
real-world applications. In this study, we address
this gap by systematically examining the impact of
speaking rate on formant frequencies and MFCCs in
speaker verification tasks. Additionally, we explore
the benefits of integrating these features into a unified
framework to improve accuracy and robustness under
variable speaking conditions.
3 EXPERIMENTAL SET-UP
3.1 Participants and Task
Eighteen male Persian speakers (Tehrani variety; age
range: 25–36 years; M = 31.3, SD = 3.7) were
recorded. None of the speakers reported any hearing
or speech impairments. All participants were students
pursuing a master’s or PhD degree in various research
areas. This corpus was collected following the pro-
cedure used in the collection of the BonnTempo cor-
pus in German. Speakers were instructed to read The
North Wind and the Sun in Persian at three different
speaking rates (slow, normal, and fast). Before each
recording session, participants were asked to read the
text several times to familiarize themselves with the
passage. First, speakers were instructed to read the
passage at their normal pace. The speakers were then
asked to slow their pace as much as they could and
then to read the text as fast as possible. This resulted
in strong syllable rate variability across the three dif-
ferent reading passages. All recording sessions took
place in a soundproof booth with a sampling rate of
44.1 kHz and 16-bit quantization.
3.2 Feature Extraction
Speech recordings were labeled and segmented
based on the onset and offset information using
Praat (Boersma and Weenink, 2021) version 6.2.22.
A free plugin for Praat with automated scripts for
voice processing, Praat Vocal Toolkit (Corretge,
2022), was used to extract and concatenate all vow-
els from each recording per speaker. Formant values
were extracted at 5-ms intervals using the LPC-based
Burg algorithm in Praat. A long-term analysis method
was adopted, as it has proven effective in represent-
ing speaker individuality (Asadi et al., 2018; Gold
et al., 2013). This approach calculates the average for-
mant values over a long stretch of a speaker’s speech
recording (Gold et al., 2013; Rose, 2002; Nolan,
1987).
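For illustration, the same long-term measurement can be approximated programmatically through parselmouth, a Python interface to Praat. The sketch below follows the description above (Burg formant tracking sampled every 5 ms, averaged over a concatenated-vowel recording); the function name, the default Burg settings, and the NaN handling are our own assumptions rather than the exact Vocal Toolkit workflow.

```python
import numpy as np
import parselmouth  # Python interface to Praat (assumed equivalent to the Praat workflow above)

def long_term_formants(vowels_wav, n_formants=4, step=0.005):
    """Average F1..F4 over a concatenated-vowel recording, sampled every 5 ms (Burg LPC)."""
    snd = parselmouth.Sound(vowels_wav)
    formant = snd.to_formant_burg(time_step=step)
    times = np.arange(step, snd.duration, step)
    means = []
    for i in range(1, n_formants + 1):
        values = np.array([formant.get_value_at_time(i, t) for t in times])
        means.append(np.nanmean(values))  # ignore frames where no formant was tracked
    return means  # long-term F1..F4 averages in Hz
```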
The MFCC features were extracted in Python (Version 3.11.5) with the librosa li-
brary (McFee et al., 2015). Thirteen main coefficients
were calculated and averaged per audio file to sim-
plify the representation while preserving key sound
characteristics.
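A minimal sketch of this extraction step with librosa is given below; the 13 coefficients and the per-file averaging follow the description above, while the helper name and the decision to keep the native sampling rate are illustrative assumptions.

```python
import librosa
import numpy as np

def mean_mfcc(path, n_mfcc=13):
    """Return the 13 MFCCs of one recording, averaged over all frames."""
    y, sr = librosa.load(path, sr=None)                      # keep the original 44.1 kHz rate
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)   # shape (13, n_frames)
    return mfcc.mean(axis=1)                                  # one 13-dimensional vector per file
```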
We developed a multi-class speech analysis model
that extensively uses MFCCs, delta-MFCCs, and
delta-delta-MFCCs to capture a broad spectrum of
acoustic features from speech. This strategy enriches
the model with spectral and temporal information,
improving its ability to distinguish between varied
speaking speeds and speaker characteristics. The
methodology centers on extracting a rich set of fea-
tures from audio recordings: the base MFCCs pro-
vide spectral information, delta-MFCCs capture the
rate of change in these spectral features, and delta-
delta-MFCCs further detail the acceleration of these
changes. This layered approach to feature extraction
ensures a deep representation of the audio’s charac-
teristics. Each feature dimension is normalized to en-
sure consistency in scale across the dataset, facilitat-
ing more effective model training.
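The layered feature stack and the per-dimension normalization can be sketched as follows; librosa's default delta filter and a standard z-score scaler are assumed, since the exact settings are not specified.

```python
import librosa
import numpy as np
from sklearn.preprocessing import StandardScaler

def mfcc_delta_stack(path, n_mfcc=13):
    """Frame-level MFCCs with first (delta) and second (delta-delta) derivatives."""
    y, sr = librosa.load(path, sr=None)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
    delta = librosa.feature.delta(mfcc)             # rate of change of the spectral envelope
    delta2 = librosa.feature.delta(mfcc, order=2)   # acceleration of those changes
    return np.vstack([mfcc, delta, delta2]).T       # shape (n_frames, 39)

# Per-dimension normalization fitted on the training portion (illustrative):
# scaler = StandardScaler().fit(np.vstack([mfcc_delta_stack(p) for p in train_paths]))
```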
Adding attention mechanisms, such as self-
attention or Transformer layers, can help the model
focus on the most relevant parts of the speech signal
for speaker verification, making it more effective in
handling variations in speaking speed. We then transformed the speaker labels into a one-hot encoded format
suitable for classification tasks. The dataset was split
into training and testing sets, with 20% reserved for
testing to evaluate the model’s performance. Our neu-
ral network model architecture included Long Short-
Term Memory (LSTM) layers to process the sequen-
tial nature of the audio data, followed by a custom At-
tentionLayer designed to weigh the importance of dif-
ferent parts of the audio signal, enhancing the model’s
focus on relevant features for speaker verification.
The model also incorporated dropout layers to prevent
overfitting and used a softmax activation function in
the output layer for classification. The model was
compiled using the Adam optimizer and categorical
cross-entropy loss. Early stopping was implemented
to terminate training when validation loss ceased to
improve, thereby preventing overfitting.
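A minimal Keras sketch of such an architecture is shown below. It mirrors the components described above (LSTM processing, a custom attention layer, dropout, a softmax output, the Adam optimizer, categorical cross-entropy, and early stopping), but the layer sizes, dropout rates, and the exact formulation of the AttentionLayer are illustrative assumptions rather than the configuration used in the experiments.

```python
import tensorflow as tf
from tensorflow.keras import layers

class AttentionLayer(layers.Layer):
    """Additive attention pooling over LSTM time steps (illustrative formulation)."""
    def build(self, input_shape):
        self.w = self.add_weight(name="att_w", shape=(input_shape[-1], 1),
                                 initializer="glorot_uniform", trainable=True)
    def call(self, x):
        # score each time step, softmax over time, return the weighted sum of frames
        scores = tf.nn.softmax(tf.squeeze(tf.matmul(x, self.w), axis=-1), axis=1)
        return tf.reduce_sum(x * tf.expand_dims(scores, axis=-1), axis=1)

def build_model(timesteps, n_features, n_speakers):
    inputs = tf.keras.Input(shape=(timesteps, n_features))
    x = layers.LSTM(128, return_sequences=True)(inputs)   # sequential modelling of frames
    x = layers.Dropout(0.3)(x)                             # regularization against overfitting
    x = AttentionLayer()(x)                                # weight the most relevant frames
    x = layers.Dropout(0.3)(x)
    outputs = layers.Dense(n_speakers, activation="softmax")(x)
    model = tf.keras.Model(inputs, outputs)
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4),
                  loss="categorical_crossentropy", metrics=["accuracy"])
    return model

early_stop = tf.keras.callbacks.EarlyStopping(monitor="val_loss", patience=10,
                                              restore_best_weights=True)
```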
We used Group K-Fold Cross-Validation, a variant
of K-Fold, to divide the dataset into five folds while
ensuring that all recordings from a single speaker
were placed either in the training or testing set, not
both. This approach prevents data leakage and en-
sures a fair evaluation by maintaining class propor-
tions across folds, allowing the model to generalize
effectively to unseen speakers.
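A sketch of this speaker-disjoint split with scikit-learn's GroupKFold, where the group label is the speaker identity of each recording (variable names are illustrative):

```python
import numpy as np
from sklearn.model_selection import GroupKFold

# X: array of feature sequences, y: one-hot labels,
# speaker_ids: one speaker identifier per recording
gkf = GroupKFold(n_splits=5)
for train_idx, test_idx in gkf.split(X, y, groups=speaker_ids):
    # every recording of a given speaker falls on exactly one side of the split
    X_train, X_test = X[train_idx], X[test_idx]
    y_train, y_test = y[train_idx], y[test_idx]
    # ... train on (X_train, y_train), evaluate on (X_test, y_test)
```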
3.3 Speech Data Augmentation
To enhance dataset diversity and improve model gen-
eralization, five audio augmentation techniques were
applied using the librosa library. Augmentation in-
troduces variability in training data, helping models
better handle real-world challenges such as varying
speaking rates, background noise, and recording con-
ditions (Lounnas et al., 2022).
Time-Stretching: We adjusted the speed of the au-
dio using factors of 0.9 (slower) and 1.1 (faster). This
manipulation simulates natural variations in speaking
rate, allowing the model to adapt to speakers with dif-
ferent articulation speeds (Ko et al., 2015).
Pitch Shifting: The pitch of the audio was shifted
by ±2 semitones to mimic variations in vocal pitch,
which may occur due to speaker differences or emo-
tional states. This augmentation captures variations
in speaker intonation while maintaining the original
speech content (Alex et al., 2023).
Noise Addition: Gaussian noise was added to the au-
dio with an amplitude factor of 0.005 to simulate en-
vironmental noise. This method increases robustness
by enabling the model to process noisy input data,
reflecting real-world recording conditions (Nugroho
et al., 2021).
Volume Adjustment: The amplitude of the audio
was scaled to 80% and 120% of its original level to
simulate variations in the recording conditions and
speaker distance from the microphone. These adjust-
ments ensure that the model can handle varying input
signal strengths (Zhou et al., 2017).
Audio Shifting: A random circular shift of up to
20% of audio length was applied to simulate mis-
alignments or varying speech onset times, enhancing
the model’s ability to handle such variances (Lounnas
et al., 2022).
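A minimal sketch of these five transformations with librosa and NumPy, using the factor values quoted above (the helper name and the random ranges are illustrative assumptions):

```python
import librosa
import numpy as np

def augment_all(y, sr):
    """Yield augmented copies of one recording y sampled at rate sr."""
    yield librosa.effects.time_stretch(y, rate=0.9)             # time-stretching: slower
    yield librosa.effects.time_stretch(y, rate=1.1)             # time-stretching: faster
    yield librosa.effects.pitch_shift(y, sr=sr, n_steps=2)      # pitch shift: +2 semitones
    yield librosa.effects.pitch_shift(y, sr=sr, n_steps=-2)     # pitch shift: -2 semitones
    yield y + 0.005 * np.random.randn(len(y))                   # Gaussian noise, amplitude 0.005
    yield 0.8 * y                                                # volume scaled to 80%
    yield 1.2 * y                                                # volume scaled to 120%
    yield np.roll(y, np.random.randint(1, int(0.2 * len(y))))   # circular shift up to 20% of length
```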
Each augmented audio file was saved as a sepa-
rate sample, effectively increasing the dataset size and
introducing acoustic and temporal variability. This
augmentation strategy ensured that the model became
more resilient to natural variations and real-world
challenges. The augmented dataset was subsequently
used for model training, and the impact of each aug-
mentation type was evaluated for its contribution to
system performance.
3.4 Model Training and Evaluation
To assess the performance of our models, we com-
puted classification accuracy, precision, recall, and
F1-score. These metrics are standard in classifica-
tion tasks and provide insights into the model’s over-
all correctness (accuracy), the proportion of correctly
identified positive cases (precision), the ability to
identify all relevant positive cases (recall), and a bal-
anced measure of precision and recall (F1-score). De-
tailed definitions of these metrics are available in the
standard machine learning literature.
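These metrics can be computed directly from the fold predictions, for example with scikit-learn; macro averaging over speakers is our assumption, as the paper does not state the averaging mode.

```python
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

# y_true, y_pred: integer speaker labels for the held-out recordings of one fold
accuracy = accuracy_score(y_true, y_pred)
precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="macro", zero_division=0)
```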
Figure 1: t-SNE visualization of speaker embeddings de-
rived from MFCC, formant, and pitch features across three
classes of speaking rates: L (Low), N (Normal), and S
(Speedy).
Figure 1 presents a t-SNE projection of the
speaker embeddings derived from multiple acoustic
features, including MFCCs, formant frequencies, and
pitch, across three distinct speaking rates. The visual
separation of the clusters demonstrates the discrim-
inative power of the combined feature set, particu-
larly in capturing the temporal variations in speech
signals. Notably, the dense overlap between some
points highlights the inherent challenges of speaker
verification under varying speaking rates. This result
reinforces the importance of integrating complemen-
tary features such as MFCCs and formants to enhance
the robustness of speaker verification systems, espe-
cially in real-world applications where speaking rate
variability is prevalent.
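A projection of this kind can be reproduced with scikit-learn's t-SNE; the perplexity, marker size, and variable names below are illustrative assumptions.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

# features: (n_samples, n_dims) matrix of MFCC, formant, and pitch descriptors
# rate_labels: np.array of "L", "N", or "S", one entry per sample
embedding = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(features)
for rate in ("L", "N", "S"):
    mask = rate_labels == rate
    plt.scatter(embedding[mask, 0], embedding[mask, 1], label=rate, s=12)
plt.legend(title="Speaking rate")
plt.show()
```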
4 RESULTS AND DISCUSSION
We used 5-fold stratified cross-validation for model
training, running each fold for 50 epochs with a learn-
ing rate of 0.0001 and a batch size of 32. Early stop-
ping with a patience of 10 epochs was applied to pre-
vent overfitting and to retain optimal model weights.
The models were optimized using the Adam opti-
mizer and categorical cross-entropy loss for multi-
class classification tasks.
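Putting the pieces together, one fold of this training procedure looks roughly like the following; build_model and early_stop refer to the illustrative definitions sketched in Section 3.2, and the hyperparameter values are those stated above.

```python
# Assumed fold iterator (gkf, speaker_ids) and model builder from Section 3.2.
for fold, (train_idx, val_idx) in enumerate(gkf.split(X, y, groups=speaker_ids)):
    model = build_model(timesteps=X.shape[1], n_features=X.shape[2],
                        n_speakers=y.shape[1])
    model.fit(X[train_idx], y[train_idx],
              validation_data=(X[val_idx], y[val_idx]),
              epochs=50, batch_size=32,      # 50 epochs, batch size 32
              callbacks=[early_stop])        # patience of 10 on validation loss
```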
Table 1 presents the performance met-
rics—accuracy, precision, recall, and F1-score—of
different acoustic feature sets: MFCC, MFCC-delta-
delta, and formant frequencies (F0, F1–F4), evaluated
under augmented and non-augmented conditions.
Without Augmentation: The combined model
(MFCC, formant frequencies, and MFCC derivatives)
achieved the highest performance across all metrics,
with an accuracy of 70%, precision of 69%, recall of
67%, and F1-score of 68%. This result highlights the
advantage of combining diverse acoustic features to
capture speaker-specific information. In contrast, in-
dividual feature sets performed lower: MFCC-delta-
delta achieved 62% accuracy, MFCC alone reached
60%, and formant frequencies performed the weak-
est at 59%, underscoring their limited discriminatory
power when used in isolation.
With Augmentation: All models showed clear improvements, confirming the role of augmentation in enhancing model
robustness. The combined model again outperformed
individual models, achieving 75% accuracy, with pre-
cision, recall, and F1-scores of 74%, 74%, and 73%,
respectively. Augmentation also notably improved in-
dividual feature sets. Formant frequencies exhibited a
significant increase in accuracy, rising to 68%, match-
ing the MFCC-delta-delta model, which improved to
67% accuracy. Even MFCC alone benefited, achiev-
ing 65% accuracy compared to 60% without augmen-
tation.
These results underscore the effectiveness of com-
bining spectral (MFCC), temporal (delta and delta-
delta), and source-related (formants) features to pro-
vide a comprehensive representation of speaker char-
acteristics. The combined model consistently out-
Table 1: Model performance of MFCC, F0, and Formant Frequencies, and their combinations with and without augmentation.

Augmentation   Model              Accuracy  Precision  Recall  F1-score
Without Aug.   MFCC                  60%       61%       60%     59%
               MFCC-delta-delta      62%       60%       63%     62%
               F0, F1-F4             59%       56%       55%     55%
               Combined Model        70%       69%       67%     68%
With Aug.      MFCC                  65%       62%       62%     61%
               MFCC-delta-delta      67%       64%       63%     64%
               F0, F1-F4             68%       68%       67%     67%
               Combined Model        75%       74%       74%     73%
performed individual feature sets in both conditions,
while data augmentation further enhanced model per-
formance, introducing variability that simulates real-
world speaking conditions. The superior results
achieved by the augmented combined model demon-
strate the robustness and generalizability of this multi-
feature approach, particularly in handling variations
in speaking styles and conditions, making it a strong
foundation for speaker verification systems.
5 DISCUSSION
The findings from this study highlight the impact of
combining multiple acoustic feature sets and apply-
ing data augmentation on the robustness and accu-
racy of speaker identification systems. Specifically,
the combined model (MFCC, MFCC-delta, MFCC-
delta-delta, and formant frequencies) outperformed
individual feature models in both augmented and
non-augmented conditions, achieving a 75% accuracy
with augmentation, compared to 70% without aug-
mentation. This confirms previous work suggesting
that hybrid approaches, which leverage both spectral
and temporal information, provide a more compre-
hensive representation of speaker-specific traits (De-
hak et al., 2010).
One reason for the success of MFCC-based fea-
tures in this context is their ability to capture speaker-
specific spectral envelopes, which have long been
considered the foundation for speaker recognition
tasks (Davis and Mermelstein, 1980). The delta and
delta-delta coefficients add temporal dynamics, al-
lowing the system to capture how the spectral proper-
ties change over time, which is particularly important
for handling variations in speaking rates and articula-
tion patterns. This finding aligns with previous find-
ings (Snyder et al., 2018; Choi et al., 2015), where in-
corporating temporal features significantly improved
speaker recognition performance under varied speak-
ing conditions. In this study, the inclusion of MFCC-
delta-delta improved accuracy from 60% to 62% in
non-augmented data, further demonstrating the im-
portance of modeling temporal fluctuations.
Formant frequencies, which represent the resonant
frequencies of the vocal tract, provided additional in-
formation that enhanced speaker identification when
combined with MFCCs, even though they underper-
formed as a standalone feature (59% accuracy with-
out augmentation). While formants capture impor-
tant physiological information about a speaker’s vocal
tract, they tend to be less reliable on their own, par-
ticularly when speech is affected by external factors
such as noise or varying speaking rates (Hansen and
Hasan, 2015). However, when paired with MFCCs,
formants contribute valuable vocal tract information,
improving the overall robustness of speaker recog-
nition systems. This echoes findings from Nath
and Kalita (Nath and Kalita, 2015), who demon-
strated that combining formants with other features
like MFCCs significantly enhanced speaker recog-
nition accuracy, with results nearing 100% in some
tasks. Similarly, Messaoud and Hamida (Messaoud
and Hamida, 2011) found that integrating formant fre-
quencies with MFCCs in a phone recognition system
reduced the phone error rate by 3%, further validating
the complementary nature of these features. These
studies highlight how formants and MFCCs work to-
gether by covering different aspects of the speech sig-
nal, making the combined approach highly effective
for speaker recognition.
Data augmentation played a pivotal role in en-
hancing system performance across all models. The
most significant improvements were observed in the
combined model, where accuracy increased from
70% to 75%, with similar gains in precision, recall,
and F1-score. This finding is consistent with prior re-
search, which demonstrated that augmentation meth-
ods such as time-stretching, pitch-shifting, and noise
addition increase the diversity of the training data,
enabling models to generalize better to unseen con-
ditions (Nugroho et al., 2021; Ko et al., 2015). In
our case, augmentation helped the model better han-
dle variations in speaking rate and background noise,
which are common in real-world applications. By ex-
posing the model to these variations during training,
we effectively reduced overfitting, thus improving its
ability to generalize to new data.
While MFCC-delta-delta and formant features
saw improvements with augmentation, MFCCs alone
also benefited, achieving a 65% accuracy compared
to 60% without augmentation. This confirms that
data augmentation is essential even when using ro-
bust features such as MFCCs, as it simulates real-
world variability, making the model more resilient
to changes in speech conditions (Koo et al., 2020).
The increased accuracy of formant frequencies (68%
with augmentation) suggests that, although formants
alone may struggle with speaker discrimination in
clean conditions, they become more useful when aug-
mented, as they help capture subtle articulatory varia-
tions that may emerge under different speaking envi-
ronments (Trottier et al., 2015).
5.1 Limitations and Future Work
A primary limitation of this study is the restricted
dataset size and its language specificity. The dataset,
consisting of only 18 male Persian speakers, limits the
generalizability of the findings. Speaker verification
systems often perform differently across languages
due to phonetic and prosodic variations, raising un-
certainty about the generalizability of these results
to other linguistic contexts. Additionally, the small
dataset may have constrained the model’s ability to
capture broader speaker variability. A more extensive
and diverse dataset would be necessary to assess the
system’s robustness on a larger scale.
This study underscores the value of a hybrid
acoustic feature approach for speaker identification,
particularly when combined with data augmentation.
The integration of MFCCs, delta features, and for-
mant frequencies provided a multi-dimensional rep-
resentation of vocal traits, enhancing performance un-
der varied conditions. Future research could investi-
gate additional feature combinations, such as prosodic
features (e.g., intonation, rhythm) and voice qual-
ity measures (e.g., jitter, shimmer), to improve ro-
bustness, particularly for emotional speech or non-
standard speaking styles.
While this study focused on traditional feature
extraction methods, the use of deep learning-based
embeddings, such as x-vectors (Snyder et al., 2018)
or Transformer-based models (Vaswani, 2017), holds
significant potential. These models can learn speaker-
specific characteristics directly from raw audio, re-
ducing the reliance on manual feature engineering.
Combining such approaches with advanced augmen-
tation techniques could further enhance performance,
enabling speaker identification systems to handle
challenging real-world conditions, such as noisy envi-
ronments, emotional variability, and diverse speaking
styles.
6 CONCLUSION
In this study, we investigated the resilience of vari-
ous acoustic features in speaker identification across
different speaking rates. The findings reveal a hierar-
chy of effectiveness among the examined parameters.
The results indicate that vowel formant frequencies
demonstrate a degree of resilience against changes in
speaking rate, achieving an accuracy of 80% in speaker
identification tasks. This suggests that while for-
mant frequencies capture relevant speaker-specific in-
formation, their ability to distinguish between speak-
ers may be somewhat compromised when faced with
variations in speaking rate. Given the strong corre-
lation between formant frequencies and the invariant
physiological dimensions of the vocal tract, it is un-
surprising that these frequencies remain relatively sta-
ble across varying speech rates. Despite the relatively
stable vocal tract, dynamic articulatory adjustments
required for different speaking rates could potentially
introduce variability into formant measurements.
REFERENCES
Agwuele, A., Sussman, H. M., and Lindblom, B. (2009).
The effect of speaking rate on consonant vowel coar-
ticulation. Phonetica, 65(4):194–209.
Alex, A., Wang, L., Gastaldo, P., and Cavallaro, A. (2023).
Data augmentation for speech separation. Speech
Commun., 152:102949.
Anupama, V., Amrutha, C., Varshini, G. A., Nandan, G.
S. G., and Vivek, G. S. S. (2022). A mfcc-cnn based
voice authentication security. Int. J. Eng. Technol.
Manag. Sci., 4(6).
Asadi, H., Nourbakhsh, M., Sasani, F., and Dellwo, V.
(2018). Examining long-term formant frequency as a
forensic cue for speaker identification: An experiment
on persian. In Proc. 1st Int. Conf. Lab. Phon. Phonol.,
pages 21–28.
Bahari, M. H. and Van Hamme, H. (2011). Speaker age
estimation and gender detection based on supervised
non-negative matrix factorization. In BIOMS, pages
1–6. IEEE.
Boersma, P. and Weenink, D. (2021). Praat: Doing phonet-
ics by computer [computer program] (version 6.2.22).
Retrieved from www.praat.org.
Chakravarty, N. and Dua, M. (2023). Data augmentation
and hybrid feature amalgamation to detect audio deep
fake attacks. Physica Scripta, 98.
Choi, Y. H., Ban, S. M., Kim, K.-W., and Kim, H. S.
(2015). Evaluation of frequency warping based fea-
tures and spectro-temporal features for speaker recog-
nition. Phonetics and Speech Sciences, 7(1):3–10.
Corretge, R. (2022). Praat Vocal Toolkit. Available at: http://www.praatvocaltoolkit.com.
Davis, S. and Mermelstein, P. (1980). Comparison of para-
metric representations for monosyllabic word recog-
nition in continuously spoken sentences. IEEE Trans.
Acoust. Speech Signal Process., 28(4):357–366.
Dehak, N., Kenny, P. J., Dehak, R., Dumouchel, P., and
Ouellet, P. (2010). Front-end factor analysis for
speaker verification. IEEE Trans. Audio Speech Lang.
Process., 19(4):788–798.
Gay, T., Ushijima, T., Hiroset, H., and Cooper, F. S. (1974).
Effect of speaking rate on labial consonant-vowel ar-
ticulation. Journal of Phonetics, 2(1):47–63.
Gold, E., French, P., and Harrison, P. (2013). Examining
long-term formant distributions as a discriminant in
forensic speaker comparisons under a likelihood ratio
framework. In Proc. Meet. Acoust., volume 19. AIP
Publishing.
Hansen, J. H. and Hasan, T. (2015). Speaker recognition by
machines and humans: A tutorial review. IEEE Signal
Process. Mag., 32(6):74–99.
Hinton, G., Deng, L., Yu, D., Dahl, G. E., Mohamed, A.-
r., Jaitly, N., Senior, A., Vanhoucke, V., Nguyen, P.,
Sainath, T. N., et al. (2012). Deep neural networks for
acoustic modeling in speech recognition: The shared
views of four research groups. IEEE Signal Process.
Mag., 29(6):82–97.
Imaizumi, S. and Kiritani, S. (1989). Effect of speaking
rate on formant trajectories and inter-speaker varia-
tions. Ann. Bull. RILP, 23:27–37.
Jahangir, R., Teh, Y. W., Memon, N. A., Mujtaba, G., Za-
reei, M., Ishtiaq, U., Akhtar, M. Z., and Ali, I. (2020).
Text-independent speaker identification through fea-
ture fusion and deep neural network. IEEE Access,
8:32187–32202.
Ko, T., Peddinti, V., Povey, D., and Khudanpur, S. (2015).
Audio augmentation for speech recognition. In Inter-
speech, volume 2015, page 3586.
Koo, H., Jeong, S., Yoon, S., and Kim, W. (2020). Develop-
ment of speech emotion recognition algorithm using
mfcc and prosody. ICEIC, pages 1–4.
Liu, X., Sahidullah, M., and Kinnunen, T. (2021). Learn-
able mfccs for speaker verification. In ISCAS, pages
1–5. IEEE.
Lounnas, K., Lichouri, M., and Abbas, M. (2022). Analysis
of the effect of audio data augmentation techniques
on phone digit recognition for algerian arabic dialect.
ICAASE, pages 1–5.
McDougall, K. (2006). Dynamic features of speech and the
characterization of speakers: Toward a new approach
using formant frequencies. Int. J. Speech Lang. Law,
13(1):89–126.
McFee, B., Raffel, C., Liang, D., Ellis, D. P., McVicar, M.,
Battenberg, E., and Nieto, O. (2015). librosa: Audio
and music signal analysis in python. In SciPy, pages
18–24.
Mefferd, A. S. and Green, J. R. (2010). Articulatory-to-
acoustic relations in response to speaking rate and
loudness manipulations. J. Speech Lang. Hear. Res.,
53:1206–1219.
Messaoud, Z. B. and Hamida, A. (2011). Combining for-
mant frequency based on variable order lpc coding
with acoustic features for timit phone recognition. Int.
J. Speech Technol., 14:393.
Nath, D. and Kalita, S. (2015). Composite feature selection
method based on spoken word and speaker recogni-
tion. Int. J. Comput. Appl., 121:18–23.
Nolan, F. (1987). The phonetic bases of speaker recognition. Cambridge Studies in Speech Science and Communication. Cambridge University Press, Cambridge, 1983, 221 pp. ISBN 0-521-24486-2.
Nugroho, K., Noersasongko, E., Purwanto, Muljono, and
Setiadi, D. (2021). Enhanced indonesian ethnic
speaker recognition using data augmentation deep
neural network. J. King Saud Univ. Comput. Inf. Sci.,
34:4375–4384.
Reynolds, D. A., Quatieri, T. F., and Dunn, R. B. (2000).
Speaker verification using adapted gaussian mixture
models. Digital signal processing, 10(1-3):19–41.
Rose, P. (2002). Forensic Speaker Identification. Interna-
tional Forensic Science and Investigation. Taylor &
Francis.
Shahrebabaki, A. S., Imran, A. S., Olfati, N., and Svendsen,
T. (2018). Acoustic feature comparison for different
speaking rates. In Proc. Human-Computer Interaction
(HCI), pages 176–189. Springer.
Shaiman, S., Adams, S. G., and Kimelman, M. D. (1997).
Velocity profiles of lip protrusion across changes in
speaking rate. J. Speech Lang. Hear. Res., 40(1):144–
158.
Snyder, D., Garcia-Romero, D., Sell, G., Povey, D., and
Khudanpur, S. (2018). X-vectors: Robust dnn embed-
dings for speaker recognition. ICASSP 2018, pages
5329–5333.
Tilsen, S. (2014). Selection and coordination of articulatory
gestures in temporally constrained production. Jour-
nal of Phonetics, 44:26–46.
Trottier, L., Chaib-draa, B., and Giguere, P. (2015). Tempo-
ral feature selection for noisy speech recognition. In
Proc. Can. Conf. Artif. Intell., pages 155–166.
Tuller, B., Harris, K. S., and Kelso, J. S. (1982). Stress and
rate: Differential transformations of articulation. J.
Acoust. Soc. Am., 71(6):1534–1543.
Vaswani, A. (2017). Attention is all you need. Advances in
Neural Information Processing Systems.
Weismer, G. and Berry, J. (2003). Effects of speaking rate
on second formant trajectories of selected vocalic nu-
clei. J. Acoust. Soc. Am., 113(6):3362–3378.
Xie, W., Nagrani, A., Chung, J. S., and Zisserman,
A. (2019). Utterance-level aggregation for speaker
recognition in the wild. In ICASSP 2019, pages 5791–
5795. IEEE.
Zeng, X., Yin, S., and Wang, D. (2015). Learning
speech rate in speech recognition. arXiv preprint
arXiv:1506.00799.
Zhou, Y., Xiong, C., and Socher, R. (2017). Improved reg-
ularization techniques for end-to-end speech recogni-
tion. ArXiv, abs/1712.07108.