Detecting Speech Disfluencies Using Open-Source Tools in Automatic
Feedback Systems for Oral Presentation Training
Willy Mateo 2,a, Leonardo Eras 1,b, Giancarlo Carvajal 2,c and Federico Domínguez 1,2,d
1 Information Technology Center, Escuela Superior Politécnica del Litoral, ESPOL, Guayaquil, Ecuador
2 Faculty of Electrical and Computer Engineering, Escuela Superior Politécnica del Litoral, ESPOL, Guayaquil, Ecuador
{wjmateo, leras, gianflor, fexadomi}@espol.edu.ec
Keywords:
Oral Presentation Skills, Filler Words Detection, Filled Pauses Detection, Automatic Presentation Feedback.
Abstract:
In the realm of verbal communication, the most common non-clinical speech disfluencies are filler words and filled
pauses, which pose challenges for effective oral presentations. Yet their detection is no easy task. This article
presents the use of OpenAI's Whisper to detect filled pauses and filler words in Spanish oral presentations,
including on-the-wild usage with undergraduate students. Preliminary results indicate that Whisper shows promise
as a valuable tool for identifying a substantial number of filler words and filled pauses. Despite areas for
improvement, Whisper serves as a diagnostic tool for assessing disfluencies in oral communication.
1 INTRODUCTION
Speech disfluencies are interruptions in the regular
flow of speech (Scott Fraundorf, 2014). These inter-
ruptions may manifest in several forms during speech
ranging from mild interruptions such as filler words,
where the disfluent sound is a complete word (e.g. ok,
like), and filled pauses, where a natural pause dur-
ing speech is vocalized with prolonged vowels (e.g.
uhmm, ehmmm), to speech disorders such as stuttering (Gósy, 2023; Das et al., 2019). While some mod-
erate use of filled pauses and filler words may in fact
increase listener comprehension and add naturalness
to speech (Scott Fraundorf, 2014), it is commonly un-
derstood that increased usage of these disfluencies denotes hesitation in the speaker and delays comprehension in the listener (Lo, 2020).
Practice, self-awareness, and feedback are fre-
quently cited as strategies to reduce the occurrence
of speech disfluencies in oral presentations (De Grez
et al., 2009; Alwi and Sidhu, 2013; Das et al., 2019).
Almost all coaching systems for oral presentations
attempt to detect speech disfluencies either in real
time or offline from recordings to provide automatic
feedback to learners. However, obtaining reasonable
a https://orcid.org/0009-0003-0616-7717
b https://orcid.org/0000-0002-3594-9289
c https://orcid.org/0009-0003-0664-7622
d https://orcid.org/0000-0002-3655-2179
levels of detection accuracy is no easy task. Some
systems use phonetic analysis with software such as
Praat (Boersma and Weenink, 2023), a well-tested,
production-ready open-licensed software for speech
analysis and phonetics, to extract filled pauses from
speech, while others use Automatic Speech Recogni-
tion (ASR) to identify filler words or repetitions from
generated transcripts (Zhu et al., 2022). Phonetic
analysis excels at detecting filled pauses but struggles with repetitions, while ASR technologies, normally
trained to ignore disfluencies, typically require costly
annotated data for retraining (Zhu et al., 2022).
In this work, we focus on the detection of the
most common non-clinical speech disfluencies: filler
words and filled pauses (Lo, 2020), using ASR tech-
nologies with open-source tools. Specifically, we tackle the problem of providing accurate detection of speech disfluencies in Spanish-speaking students in the context of an oral presentation.
Since 2018 we have been using the Automatic
Feedback Presentation system (RAP for its Span-
ish acronym) as an experimental tool that facilitates
learning of oral presentation skills and since 2019 as
a learning tool embedded in the academic activities
of communication courses and selected engineering
courses (Domínguez et al., 2021). The RAP system
records a student’s oral presentation in a specialized
room and extracts five presentation features: posture,
gaze to the audience, use of filled pauses, voice vol-
ume, and slides legibility (Ochoa et al., 2018). The
system uses Praat to measure voice volume and de-
tect filled pauses from the audio recording. It uses
these extracted features to generate an interactive and customized feedback report at the end of the presentation. In 2020, we demonstrated that this feedback facilitates modest, but statistically significant, learning gains for its users (Ochoa and Dominguez, 2020).
Aiming to improve the capacity of the RAP sys-
tem to detect speech disfluencies, we implemented
two detection algorithms, based on OpenAI’s Whis-
per (Radford et al., 2022), for filler words and filled
pauses. Section 2 summarizes the state of the art
on the detection of non-clinical speech disfluencies,
Section 3 describes the experimental methodology for evaluating both algorithms with annotated data and in a small on-the-wild setup within the RAP system, Section 4 describes the obtained results, and Sections 5 and 6 summarize our findings.
2 RELATED WORK
The domain of identifying and classifying filler words in audio recordings is an active, evolving research area. A relevant study, conducted in 2022 (Zhu et al.,
2022), focused on the detection and classification of
filler words in English audio recordings. To do this,
audio podcasts collected from the online music plat-
form SoundCloud were used, with a total of 145
hours of recordings. This research focused on building a pipeline using Voice Activity Detection (VAD) and ASR techniques, which allowed a more accurate and efficient identification of filler words present in recordings compared to a keyword-based approach.
More recent work, conducted in June 2023 (Zhu
et al., 2023), addresses the challenge of detecting non-
linguistic filler words. It is recognized that universal
accessibility to ASR systems may be limited by fac-
tors such as budget constraints, the diversity of target
languages, and the computational resources required.
Through the employment of Structured State Space
Sequence (S4) models and semi-Markov neural con-
ditional random fields (semi-CRF), an absolute per-
formance improvement of 6.4% at the segment level
and 3.1% at the event level is achieved in the Podcast-
Fillers dataset.
As for commercial tools, Microsoft’s ability
to evaluate oral presentations through the Speaker
Coach (Microsoft, 2023) tool within the PowerPoint
application stands out. This tool allows simulations
of oral presentations, providing detailed results on the
quality of the presentation. It evaluates various pa-
rameters such as speech rate, pronunciation, tone of
voice, repetitive language, inclusive language, origi-
nality, and use of filler words. However, this functionality is restricted to the Microsoft PowerPoint environment: it lacks an API to integrate it into external projects and does not offer the possibility of downloading a detailed report of the evaluation results.
In addition, various web-based commercial op-
tions offer similar functionalities, each with specific
approaches and payment plans:
- Kapwing (Eric and Julia, 2023) introduces the Smart Crop tool, which detects filler words, allowing users to remove the corresponding audio or video segments.
- Podcastle (Team, 2022) offers the Filler Word Detection tool, which provides the audio transcription, detects filler words, and allows users to delete the specific segment.
- Cleanvoice AI (Adrian, 2023) specializes in removing unnecessary sounds, such as filler sounds, stuttering, and mouth noises.
3 METHODOLOGY
The tools described in the previous section are ei-
ther experimental or commercial. Aiming to integrate
our findings in the RAP system, we focused on open-
source solutions.
3.1 Filler Words
For detecting filler words from speech transcriptions
using Speech to Text (STT) technology, two open-
source solutions stand out: Whisper and Coqui STT.
Figure 1: Process of filler word detection.
Whisper (Radford et al., 2022) is a model spe-
cialized in speech processing including multilingual
speech recognition, speech translation, spoken lan-
guage identification, and voice activity detection. It has a Transformer sequence-to-sequence architecture and was trained on an extensive dataset comprising 680,000 hours of audio in 97 languages,
including Spanish. This training resulted in the creation of five model variants, listed in ascending order of capacity: Tiny, Base, Small, Medium, and Large.
The Medium model was used to identify filler
words. There was no need to apply conversion or
cleanup techniques to the audio as Whisper internally
splits the audio conveniently to deliver optimal re-
sults.
3.1.1 Detection and Counting of Filler Words
Previously, communication course teachers from Escuela Superior Politécnica del Litoral (ESPOL) had shared with us a list of the filler words most frequently used by students in oral presentations. With this information as a foundation, a dictionary was built with common filler words as keys and, as values, a list with the variations of each filler word.
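A minimal sketch of this data structure is shown below; the specific filler words and variations are illustrative only, since the actual list provided by the ESPOL teachers is not reproduced in this paper.

```python
# Illustrative excerpt: keys are canonical filler words, values list their variations.
# The real dictionary was built from the list provided by ESPOL communication teachers.
FILLER_DICTIONARY = {
    "o sea": ["o sea", "osea"],
    "este": ["este", "estee", "esteee"],
    "cómo es que es": ["cómo es que es", "como es que es"],
}
```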
We noticed that when using Coqui STT it was difficult to detect certain filler words, so a boost was added to those words. In addition, Coqui STT was unable to detect compound filler words, for example: "cómo es que es", limiting its applicability.
At this point, we proceeded to either generate the
transcription of all the audio chunks for Coqui STT, or
to process the entire audio and obtain the transcription
of each segment if Whisper was used.
The number of filler words detected in an audio file is stored in JSON files, an open standard file format used to store and transmit data objects consisting of attribute-value pairs and arrays. A JSON file is created with filler words as keys and all values set to 0, representing the number of times each filler word was detected. Additionally, another JSON file is initialized with the same keys, assigning empty arrays as values that will store the specific moments at which each filler word was uttered.
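A minimal sketch of this initialization, assuming hypothetical file names and the illustrative filler word list from above, could look as follows.

```python
import json

FILLER_WORDS = ["o sea", "este", "cómo es que es"]  # illustrative keys only

counts = {fw: 0 for fw in FILLER_WORDS}        # occurrences per filler word
timestamps = {fw: [] for fw in FILLER_WORDS}   # [start, end] seconds of each detection

# Hypothetical output paths; the RAP system's actual file layout is not specified here.
with open("filler_counts.json", "w", encoding="utf-8") as f:
    json.dump(counts, f, ensure_ascii=False, indent=2)
with open("filler_timestamps.json", "w", encoding="utf-8") as f:
    json.dump(timestamps, f, ensure_ascii=False, indent=2)
```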
Next, the dictionary of common filler words is iterated in descending order by the number of words in each filler word. In each iteration, the nltk Python library (https://github.com/nltk/nltk), focused on Natural Language Processing (NLP), is used to tokenize the transcription and build a list of n-grams whose size equals that number of words; the frequency distribution of these n-grams is then used to count the occurrences of the filler word and its variations in the transcription (Bird et al., 2009).
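The counting step could be sketched as follows with nltk; the tokenizer settings are assumptions for illustration, not the exact configuration used in the RAP system.

```python
import nltk
from nltk import FreqDist
from nltk.util import ngrams

# Requires the punkt tokenizer models: nltk.download("punkt")

def count_occurrences(transcript: str, phrase: str) -> int:
    """Count how many times a (possibly multi-word) filler word appears in a transcript."""
    tokens = nltk.word_tokenize(transcript.lower(), language="spanish")
    n = len(phrase.split())
    distribution = FreqDist(ngrams(tokens, n))
    return distribution[tuple(phrase.lower().split())]

# Example: count_occurrences("o sea el sistema o sea funciona", "o sea") -> 2
```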
The JSON is then updated with the number of
filler words detected. In the case of Whisper, the
start and end seconds when each filler word was de-
tected, obtained from the corresponding metadata, are
incorporated. Finally, the filler word and its variations
are removed from the transcription. This procedure
avoids consideration of these words in future itera-
tions, preventing them from being intertwined with
terms from different filler words.
3.1.2 Tool Comparison and Performance
Analysis
To measure the performance of each STT technology in detecting filler words, it was necessary to produce a manual transcription of each test audio.
We tailored a custom metric called the Filler Word Error Rate (FWER), based on the Word Error Rate (WER) formula, which is a commonly used measure for evaluating speech recognition and machine translation systems. FWER considers both filler-word insertions and deletions in the output of the STT technology, relative to the total number of filler words in the manual transcription. We then calculated the FWER for each test audio using both Whisper and Coqui STT.
FWER = (insertions + deletions) / (filler words in the manual transcript)
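A direct translation of the formula into code, as a hypothetical helper rather than part of the RAP codebase:

```python
def fwer(insertions: int, deletions: int, reference_fillers: int) -> float:
    """Filler Word Error Rate: filler-word insertions plus deletions in the STT output,
    relative to the number of filler words in the manual (reference) transcript."""
    if reference_fillers == 0:
        raise ValueError("reference transcript contains no filler words")
    return (insertions + deletions) / reference_fillers

# Example: 5 missed and 2 spurious filler words against 20 reference fillers -> 0.35
print(fwer(insertions=2, deletions=5, reference_fillers=20))
```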
3.2 Filled Pauses
To correctly capture filled pauses, we rely on the definition presented by (María J. Machuca, 2015), which
states that the /e/ and /m/ phonemes are the only ones
that can constitute true filled pauses in Spanish oral
speeches. We then define a filled pause as a word that
doesn’t belong to the speech and fits one of the /e/,
/a/ or /m/ phonemes. We complete this definition by
adding prolongations, which are the prolonged sounds
of a vowel (and sometimes a consonant) at the end of
a word. If these prolongations are long enough, they
effectively function as filled pauses. From now on, the
term filled pause will refer to this specific combined
definition.
3.2.1 Whisper Setup
Our filled pause detection algorithm relies on the Whisper Small model, rather than the Medium one. While we initially used both models, we found that the Small model consistently produced more accurate results in several comparisons. As a result, we decided to use the Small model exclusively going forward. We used the transcribe function, with the temperature value set to 0 to make the transcription process deterministic, and a string in the initial_prompt parameter.
3.2.2 Detection and Counting of Filled Pauses
By default, Whisper models may leave out filled pauses, so we used prompting (https://platform.openai.com/docs/guides/speech-to-text/prompting) to avoid this. The presence of ellipses (...) plays a very important role in prompting. The following string is passed in the initial_prompt parameter because it showed the best results:

Eh... Rust es... un lenguaje eh... de programación eh... que ah... compite eh... directamente con... ah... C
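A minimal sketch of this setup with the openai-whisper package is shown below; the clip path is a placeholder, and enabling word_timestamps is our assumption here for obtaining the per-word start and end times used later.

```python
import whisper

# Prompt shown above: the ellipses nudge the model to keep filled pauses in the transcript.
PROMPT = ("Eh... Rust es... un lenguaje eh... de programación eh... "
          "que ah... compite eh... directamente con... ah... C")

model = whisper.load_model("small")

result = model.transcribe(
    "clip_10s.wav",          # hypothetical 10-second clip
    language="es",
    temperature=0,           # deterministic decoding
    initial_prompt=PROMPT,
    word_timestamps=True,    # assumed here to obtain per-word start/end times
)

for segment in result["segments"]:
    for word in segment.get("words", []):
        print(word["word"], word["start"], word["end"])
```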
We split the original audio into 10-second clips. For each clip, we call the transcribe function and iterate through every word of the transcript, using the following regular expression to capture filled pauses:

.{1,2}[hm]{1,2}\.\.\.

It fits the first part of the provided filled pause definition and effectively matches strings such as "eh...", "ah...", and "mmm...", which represent common Spanish filled pauses.
For the second part of the definition, we use this
regular expression, as the ellipses represent prolonga-
tions themselves:
\.\.\.
Because intonation influences whether a prolongation is considered a filled pause, and because we do not analyze the audio itself but the transcript provided by Whisper, we need to minimize false positives when capturing prolongations. We achieve this by requiring a minimum duration of 0.95 seconds, which ensures that the captured prolongations are true filled pauses most of the time.
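A minimal sketch of these checks, assuming per-word timestamps are available from the transcript, could look like this.

```python
import re

FILLED_PAUSE_RE = re.compile(r".{1,2}[hm]{1,2}\.\.\.")  # e.g. eh..., ah..., mmm...
PROLONGATION_RE = re.compile(r"\.\.\.")                  # trailing ellipsis = prolongation
MIN_PROLONGATION_SECONDS = 0.95                          # threshold against false positives

def is_filled_pause(word: str, start: float, end: float) -> bool:
    """Decide whether a transcribed word counts as a filled pause."""
    token = word.strip().lower()
    if FILLED_PAUSE_RE.search(token):
        return True
    # Prolongations only count as filled pauses when they are long enough
    return bool(PROLONGATION_RE.search(token)) and (end - start) >= MIN_PROLONGATION_SECONDS
```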
When a filled pause (fp) is detected, two discard conditions are checked, because the models may enter a repetition loop and transcribe a word more than once. If the detected filled pause meets either of them, we ignore it:
- The detected fp has the same end timestamp as the previous one.
- The detected fp is identical to the previous one.
Finally, we increase the fp counter and use the word timestamps to export an audio clip containing the filled pause with pydub, as sketched below.
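The clip-export step could be sketched as follows with pydub; the file paths are placeholders.

```python
from pydub import AudioSegment

def export_filled_pause(audio_path: str, start: float, end: float, out_path: str) -> None:
    """Cut the detected filled pause out of the recording (pydub slices in milliseconds)."""
    audio = AudioSegment.from_file(audio_path)
    clip = audio[int(start * 1000):int(end * 1000)]
    clip.export(out_path, format="wav")

# Example: export_filled_pause("presentation.wav", 12.4, 13.6, "fp_001.wav")
```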
3.2.3 Performance Analysis
Since we rely only on accuracy and recall, we use the F-score and the average number of false positives as the performance metrics for this algorithm. We also calculate these metrics for the existing Praat-based filled pause detector for comparison.
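As an illustration of how these metrics combine, the following sketch computes a precision-style accuracy, recall, and the F-score from raw detection counts; it assumes the TP/(TP+FP) definition, whereas the accuracy reported later in the paper is a weighted per-recording mean and therefore differs slightly from this aggregate value.

```python
def detection_metrics(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Return (precision-style accuracy, recall, F-score) from detection counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f_score = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f_score

# Whisper counts reported later in Table 4 (TP=603, FN=91, FP=71)
# give recall ≈ 0.869 and an F-score ≈ 0.88.
print(detection_metrics(tp=603, fp=71, fn=91))
```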
3.3 Processing Pipeline of Speech
Disfluencies Detection in the RAP
System
The RAP system is designed to efficiently handle me-
dia recordings and presentation slides. Once a pre-
sentation is completed, the system bundles the media
recordings, presentation slides, and metadata into a
compressed zip file. This file is then transmitted over
the network to a backend server repository using a re-
liable file transfer protocol. The backend server uses
a broker that implements a publish-subscribe mes-
sage protocol to facilitate communication and orches-
tration between all the components of the RAP sys-
tem. Upon arrival at the backend, the broker notifies
a file-consumer service, which decompresses the zip
file and notifies other media-specific services that the
media files are ready for processing. This entire pro-
cess is managed through the broker, ensuring seam-
less coordination between all the components of the
RAP system.
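Purely as an illustration of the publish-subscribe orchestration described above, the following in-process sketch mimics the broker's role; the actual broker technology, topic names, and payload format used in the RAP system are not specified in this paper.

```python
import zipfile
from collections import defaultdict
from typing import Callable

class Broker:
    """Toy stand-in for a publish-subscribe message broker."""
    def __init__(self) -> None:
        self._subscribers: dict[str, list[Callable[[dict], None]]] = defaultdict(list)

    def subscribe(self, topic: str, handler: Callable[[dict], None]) -> None:
        self._subscribers[topic].append(handler)

    def publish(self, topic: str, message: dict) -> None:
        for handler in self._subscribers[topic]:
            handler(message)

broker = Broker()

def file_consumer(message: dict) -> None:
    """Decompress the presentation bundle and notify media-specific services."""
    with zipfile.ZipFile(message["zip_path"]) as bundle:
        bundle.extractall(message["output_dir"])
    broker.publish("audio.ready", {"dir": message["output_dir"]})

broker.subscribe("presentation.received", file_consumer)
# Triggered when a new presentation bundle arrives at the backend:
# broker.publish("presentation.received", {"zip_path": "p.zip", "output_dir": "p/"})
```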
Two services process the audio file sequentially:
the first one to detect and count filler words and gen-
erate a bar chart; the second one to detect and count
filled pauses.
Finally, the generated bar chart, along with the audio snippets, is sent to a web service, where it becomes part of a report for both the student and the professor. Figure 2 illustrates the overall process previously described. Figure 3 illustrates the audio file processing previously described.
Figure 2: Complete process of the audio file from the
recording room to the analyzers and finally to become part
of the presentation report.
3.4 On-the-Wild Evaluation Setup
As described in the previous section, we added the
detection of filler words as an experimental feature
in the RAP system to evaluate this feature “on-the-
wild”. This implied that after a RAP presentation,
an additional bar chart, together with a small expla-
nation about filler words, was added to the student’s
RAP report. Figure 4 depicts an example of a bar
chart, presented in a student’s RAP report, with the
most common filler words used by the student during
their presentation. The chart was only informative; no evaluation of the use of filler words was presented to the student.
All students were informed during class sessions
about the usage of a new experimental feature and
presented with an online informed consent form ex-
plaining the nature of the experiment and how their data would be handled. Consent was completely optional, with no effect on their academic activities, and could be withdrawn by the students at any time.
Consent implied using the student’s data (academic
metadata and RAP results) for this research in an
anonymized aggregated form. We contrasted the stu-
dents’ RAP results against three variables: gender,
study program, and academic performance. Gender is self-reported by the student; study program and academic performance (grade average) are part of the student's academic metadata.
In total, 290 students from three different courses gave their informed consent (90% of those enrolled in these courses) to participate with their data in this research: 211 from the course Communication, 43 from the course Embedded Systems, and 49 from the course Computation and Society. Communication students used the RAP system twice in the semester; the rest used the system only once. Some Communication students (13) were also enrolled in Computation and Society and consequently used the RAP system three times. Overall, 501 presentation recordings were obtained from participants.
Additionally, we performed a post hoc detection
of filled pauses on the 501 recordings using the Whis-
per methodology with the objective of evaluating its
effectiveness on real RAP presentations. These re-
sults were not presented in the student’s RAP report
because they were obtained several weeks after their
presentations.
4 RESULTS
To measure the behavior of both STT technologies (Whisper and Coqui STT), an analysis of 64 audio recordings was carried out, all recorded in the RAP classroom, with an average duration of 5 minutes.
These standardized conditions ensured consistency in
data collection and provided a controlled scenario for
our assessments.
Table 1: FWER percentage by STT technology.
STT technology FWER [%]
Whisper 35.85
Coqui STT 90.80
Table 1 summarizes the error rate of filler word detection. There is a significant disparity in the accuracy of both technologies. According to Errattahi, the WER is a good metric for determining which ASR technology is better than another (Errattahi et al., 2018); taking into account that the FWER metric is based on the WER and that the filler word error rates of the Whisper and Coqui STT technologies are 35.85% and 90.80% respectively, we can state that Whisper is better than Coqui STT for detecting filler words in oral presentations. This may be because most of the words in the dictionary of common filler words had to be added as hot words for Coqui STT.
Table 2: FWER percentage by Gender using Whisper.
Gender Num. Audios FWER [%]
Women 21 27.47
Men 43 26.87
Table 2 shows interesting patterns in the detection
of filler words according to the gender of the speaker
using Whisper. The FWER for women’s presenta-
tions is 27.47%, while for men’s presentations it is
26.87%. These results suggest that the effectiveness of Whisper in the detection of filler words is consistent regardless of the gender of the speaker and indicate that there is no gender bias in the detection of filler
words. As there might be gender differences in the
use of speech disfluencies, it is important to rule out
gender bias in the detection technology.
Regarding filled pauses, 55 of these same 64 audio recordings were used, containing 694 real filled pauses. We listened to each detected filled pause clip, for both the Whisper and Praat implementations, to count true and false positives, and used the manual transcripts for the false negatives.
Table 3: Performance metrics of filled pause detection.
Implementation F-score FPs Avg.
Whisper 0.8816 1.29
Praat 0.3541 17.09
Figure 3: Phases of audio evaluation to obtain both filler words and filled pauses.
Figure 4: Example of a report of frequency of usage of filler
words in a five-minute presentation.
Table 3 presents the obtained F-scores and false
positives averages for the two implementations of the
filled pause detection algorithm.
Table 4: Expanded performance metrics of filled pause de-
tection.
Impl. TP FN FP Accuracy Recall
Whisper 603 91 71 0.9120 0.8689
Praat 316 775 378 0.3247 0.4553
Table 4 collects the individual calculated metrics.
The accuracy value is the weighted mean of the accu-
racy of each recording analysis.
4.1 On-the-Wild Evaluation
Some recordings, due to unexpected background noises and technical errors, were not processed (488 out of 501 were processed for filler words and 481 out of 501 for filled pauses). Use of the RAP system was divided into two sessions during the semester; Table 5 details the use of the RAP system by subject.
Table 5: RAP sessions by subject used in the evaluation.
Subject Session 1 Session 2
Communication YES YES
Embedded Systems YES NO
Computation and Society NO YES
Figure 5 shows the filler word results by gender (women = 109, men = 177) and Figure 6 by study program. It can clearly be seen that, on average, as reported by both the mean and the median, men (mean = 12.9) use more filler words than women (mean
= 8.9) in a five-minute oral presentation. As the data are non-normally distributed, an unpaired two-sample Wilcoxon rank-sum test was performed on the results of the first and second sessions. In both cases, highly significant differences were observed. As for study pro-
grams, statistically significant differences (using the
same test) were observed between engineering pro-
grams and non-engineering programs. For exam-
ple, on average, Electronics and Computing engineer-
ing students reported higher (mean = 13.8) use of
filler words than Design and Communication students
(mean = 8.6). Statistically significant differences remained, to a lesser degree due to the diminished sample size, when controlling for gender and voice volume.
No statistically significant differences were observed
by academic performance. No statistically significant
differences were observed for Whisper filled pauses
on any variable.
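As an illustration of the statistical test used, such a comparison can be reproduced with SciPy; the per-student counts below are hypothetical placeholders, not the study data.

```python
from scipy import stats

# Hypothetical filler-word counts per five-minute presentation (placeholder data)
women = [8, 5, 12, 7, 9, 6, 11, 10]
men = [14, 10, 17, 12, 15, 9, 13, 18]

# Unpaired two-sample Wilcoxon rank-sum (Mann-Whitney U) test for non-normal data
statistic, p_value = stats.mannwhitneyu(women, men, alternative="two-sided")
print(f"U = {statistic}, p = {p_value:.4f}")
```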
Figure 7 shows the Pearson correlation statistics
between RAP results and Whisper filled pauses and
filler words. While all correlations were either non-
existent or weak, a statistically significant positive correlation (r ≠ 0) was observed between filler words and filled pauses (as extracted by Praat), and a negative correlation between filler words and gaze.
Figure 5: Use of filler words by gender in RAP presenta-
tions in the first and second sessions.
Figure 6: Use of filler words by study program in RAP pre-
sentations, both sessions.
As expected, a positive, but rather weak, correlation is observed between the two filled pause (Whisper and Praat) detection algorithms.
Figure 7: Correlation between RAP scores and Whisper ex-
tracted filler words and filled pauses.
5 DISCUSSION
Whisper, with its Medium model and its ability to transcribe Spanish audio with a 3.6% WER, offers an opportunity for identifying filler words in oral presentations. The evaluation of the Whisper technology in the context of the RAP system shows an FWER of 35.85%, indicating a moderate identification of filler words that can improve feedback to students in an academic context. This efficacy is especially significant considering the complexity of identifying filler words in oral presentations.
The results of the on-the-wild evaluation corrob-
orate a pattern that has been observed before in lan-
guage research: there is a gender disparity in the use
of non-clinical speech disfluencies (Bortfeld et al.,
2001; Lo, 2020). On average, men tend to use 40%
more filler words than women in a five-minute RAP
presentation. It is important to point out that the
recordings of male presenters tend to exhibit slightly
higher voice volume levels, which may facilitate the
detection of disfluencies. However, after controlling
for voice volume, the gender effect remains. After controlling for study program (there are more men in engineering programs), the gender effect still remains, albeit with weaker statistical significance due to the reduced sample size.
As for the weak correlation between the Praat and Whisper filled pause detection algorithms, it may be that the latter needs further improvement or that the large number of false positives in the former interferes. Nevertheless, Whisper's results are quite decent, clearly outperforming the existing Praat algorithm, as shown in the results tables. The Whisper Small model should certainly be the first choice for detecting filled pauses, before trying to fine-tune a model or create a new detector from scratch.
In previous work, we found that engineering students tend to do poorly with non-verbal oral presentation skills, as measured by posture and gaze to the audience (Domínguez et al., 2023). It is interesting to note that this pattern is also observed in verbal skills as measured by filler words. Also, the inverse correlation (r = -0.29, highly significant) observed between gaze to the audience and filler words suggests a possible correlation between verbal and non-verbal oral presentation skills.
6 CONCLUSIONS
This study presents the evaluation of different tools to
detect both filler words and filled pauses. For filler
words, Whisper outperforms Coqui STT, while providing similar performance across genders. For filled pauses, the Small Whisper model provides a better balance between accuracy and recall than the Medium model.
While the Whisper algorithms have room for im-
provement, the tool performs well on-the-wild and
may allow the exploration of possible correlations be-
tween verbal and non-verbal oral presentation skills.
ACKNOWLEDGEMENTS
We express our sincere gratitude to Maria Gonzalez,
Nicole Asqui and Hayleen Carrillo for their invalu-
able collaboration in the execution of this project. We
also extend our appreciation to all the contributors to
the open source libraries and tools used in its devel-
opment.
REFERENCES
Adrian, S. (2023). Filler words remover. Accessed: 2023-
12-29.
Alwi, N. F. B. and Sidhu, G. K. (2013). Oral Presenta-
tion: Self-perceived Competence and Actual Perfor-
mance among UiTM Business Faculty Students. Pro-
cedia - Social and Behavioral Sciences, 90(InCULT
2012):98–106.
Bird, S., Klein, E., and Loper, E. (2009). Natural language
processing with Python: analyzing text with the natu-
ral language toolkit. O'Reilly Media, Inc.
Boersma, P. and Weenink, D. (2023). Praat: doing pho-
netics by computer. http://www.praat.org/. Accessed:
2023-12-19.
Bortfeld, H., Leon, S., Bloom, J., Schober, M., and Bren-
nan, S. (2001). Disfluency Rates in Conversation: Ef-
fects of Age, Relationship, Topic, Role, and Gender.
Language and Speech, 44(2):123–147.
Das, S., Gandhi, N., Naik, T., and Shilkrot, R. (2019).
Increase apparent public speaking fluency by speech
augmentation. In 2019 IEEE International Confer-
ence on Acoustics, Speech and Signal Processing
(ICASSP), Brighton, UK. IEEE.
De Grez, L., Valcke, M., and Roozen, I. (2009). The im-
pact of goal orientation, self-reflection and personal
characteristics on the acquisition of oral presentation
skills. European Journal of Psychology of Education,
XXIV:293–306.
Domínguez, F., Eras, L., Tomalá, J., and Collaguazo, A.
(2023). Estimating the Distribution of Oral Presen-
tation Skills in an Educational Institution: A Novel
Methodology. In International Conference on Com-
puter Supported Education, CSEDU - Proceedings,
volume 2, pages 39–46. SCITEPRESS.
Domínguez, F., Ochoa, X., Zambrano, D., Camacho,
K., and Castells, J. (2021). Scaling and Adopt-
ing a Multimodal Learning Analytics Application in
an Institution-Wide Setting. IEEE Transactions on
Learning Technologies, 14(3):400–414.
Eric, L. and Julia, E. (2023). Remove filler words. Ac-
cessed: 2023-12-29.
Errattahi, R., El Hannani, A., and Ouahmane, H. (2018).
Automatic speech recognition errors detection and
correction: A review. Procedia Computer Science,
128:32–37. 1st International Conference on Natural
Language and Speech Processing.
Gósy, M. (2023). Occurrences and Durations of Filled
Pauses in Relation to Words and Silent Pauses in
Spontaneous Speech. Languages, 8(1).
Lo, J. J. (2020). Between Äh(m) and Euh(m): The Distribu-
tion and Realization of Filled Pauses in the Speech of
German-French Simultaneous Bilinguals. Language
and Speech, 63(4):746–768.
María J. Machuca, Joaquim Llisterri, A. R. (2015). Las pausas sonoras y los alargamientos en español: Un estudio preliminar. Revista Normas, 5:81–96.
Microsoft, T. (2023). Rehearse your slide show with
speaker coach. Accessed: 2023-12-29.
Ochoa, X. and Dominguez, F. (2020). Controlled evalua-
tion of a multimodal system to improve oral presenta-
tion skills in a real learning setting. British Journal of
Educational Technology, 51(5):1615–1630.
Ochoa, X., Domínguez, F., Guamán, B., Maya, R., Fal-
cones, G., and Castells, J. (2018). The RAP System
: Automatic Feedback of Oral Presentation Skills Us-
ing Multimodal Analysis and Low-Cost Sensors. In
LAK’18: International Conference on Learning Ana-
lytics and Knowledge, pages 360–364, Sydney, Aus-
tralia. ACM.
Radford, A., Kim, J. W., Xu, T., Brockman, G., McLeavey,
C., and Sutskever, I. (2022). Robust speech recogni-
tion via large-scale weak supervision.
Scott Fraundorf, Jennifer Arnold, V. L. (2014). Disfluency. Oxford Bibliographies in Linguistics.
Team, P. (2022). Improve the way you sound! remove filler
words from text in seconds! Accessed: 2023-12-29.
Zhu, G., Caceres, J.-P., and Salamon, J. (2022). Filler word
detection and classification: A dataset and benchmark.
arXiv preprint arXiv:2203.15135.
Zhu, G., Yan, Y., Caceres, J.-P., and Duan, Z. (2023). Tran-
scription free filler word detection with neural semi-
crfs. In ICASSP 2023 - 2023 IEEE International Con-
ference on Acoustics, Speech and Signal Processing
(ICASSP), pages 1–5.