Tunisian Dialect Speech Corpus: Construction and Emotion Annotation
Latifa Iben Nasr (https://orcid.org/0009-0008-0677-7458), Abir Masmoudi (https://orcid.org/0000-0002-5987-8876) and Lamia Hadrich Belguith (https://orcid.org/0000-0002-4868-657X)
MIRACL Laboratory, University of Sfax, Tunisia
Keywords: Emotion, Spontaneous Speech, Tunisian Dialect, Multi-Domain.
Abstract:
Speech Emotion Recognition (SER) using Natural Language Processing (NLP) for underrepresented dialects faces significant challenges due to the lack of annotated corpora. This research addresses this issue by constructing and annotating SERTUS (Speech Emotion Recognition in TUnisian Spontaneous speech), a novel corpus of spontaneous speech in the Tunisian Dialect (TD), collected from various domains such as sports, politics, and culture. SERTUS includes both registers of TD: the popular (familiar) register and the intellectual register, capturing a diverse range of emotions in spontaneous settings and natural interactions across different regions of Tunisia. Our methodology uses a categorical approach to emotion annotation and employs inter-annotator agreement measures to ensure the reliability and consistency of the annotations. The results demonstrate a high level of agreement among annotators, indicating the robustness of the annotation process. The study's core contribution lies in its comprehensive and rigorous approach to the development of a dataset of spontaneous emotional speech in this dialect. The constructed corpus has significant potential applications in various fields, such as human-computer interaction, mental health monitoring, call center analytics, and social robotics. It also facilitates the development of more accurate and culturally nuanced SER systems. This work contributes to existing research by providing a high-quality annotated corpus while emphasizing the importance of including underrepresented dialects in NLP research.
1 INTRODUCTION
Human emotions arise in response to objects or
events in their surroundings, influencing various as-
pects of human life, including attention, memory re-
tention, goal achievement, recognition of priorities,
knowledge-based motivation, interpersonal commu-
nication, cognitive development, emotional regula-
tion, and effort motivation (Gannouni et al., 2020).
Speech Emotion Recognition (SER) has emerged
as a crucial area of research with diverse applica-
tions, such as mental health assessment, customer ser-
vice optimization, and human-computer interaction
(Goncalves et al., 2024).
As for languages, most speech emotion datasets
are available in German, English, and Spanish.
Several SER studies have also been conducted in
Dutch, Danish, Mandarin, and other European and
Asian languages (Aljuhani et al., 2021). Arabic SER
is an emerging research field (Besdouri et al., 2024).
The developmental lag in Arabic SER is due to the
lack of available resources compared to other lan-
guages (Alamri et al., 2023).
Expanding upon the linguistic complexities, the
Arabic language, with its rich diversity of dialects
and linguistic nuances, presents a formidable hurdle
in SER tasks. Among these dialects, Tunisian Dialect
(TD) stands out due to its unique characteristics and
complexities. Spoken by approximately 12 million
people across various regions (Zribi et al., 2014), TD
encompasses numerous regional varieties, including
the Tunis dialect (Capital), Sahil dialect, Sfax dialect,
Northwestern TD, Southwestern TD, and Southeast-
ern TD (Gibson, 1999).
Moreover, TD exhibits a distinctive linguistic fu-
sion, incorporating vocabulary from multiple lan-
guages, such as French, Turkish, and Berber (Mas-
moudi et al., 2018). For instance, Tunisians often
borrow expressions from French, incorporating them
into everyday conversations with phrases like "ça va",
"désolé", "rendez-vous", and "merci". Indeed, the lin-
guistic situation in Tunisia is described as ’polyglos-
sic,’ where multiple languages and language varieties
coexist. This linguistic diversity is a testament to
Tunisia’s rich historical heritage, influenced by var-
ious powers, including French colonization, the Ot-
toman Empire, and earlier influences.
Therefore, analyzing emotional expressions in TD
requires a nuanced understanding of the linguistic in-
tricacies of the dialect, including the dual registers
of intellectualized and popular dialects (Boukadida,
2008). The intellectualized dialect, used by schol-
ars and in media broadcasts, features a formal lex-
icon and borrowings while retaining dialectal struc-
tures. In contrast, the popular (familiar) dialect is
used for everyday communication. For example,
the sentence "Disappointing situation" is rendered as
[mVaynaT AlHAlaT] in the familiar register and
[Alwa.diyaT xAybT] in the intellectualized register.
Furthermore, emotional expressions in Tunisian
Arabic are heavily influenced by cultural nuances.
For instance, a speaker from Tunis might express joy
by saying [fraHit bar$A] ("I'm very happy"), using
[bar$A] to intensify the emotion, while a southern
speaker might say [hika niHis rowHy tVowl fiy AljanaT]
("I feel like I'm in paradise") to convey a deeper sense
of joy through metaphor. Similarly, frustration might
be expressed as [tibit mn Al.hAlaT] ("I'm tired of the
situation") in the north, whereas a southern speaker
might use [.dAV biyA AlHAl] ("I'm overwhelmed by
the situation") to indicate a higher level of emotional
intensity.
These variations illustrate how cultural and re-
gional nuances shape emotional expressions in
Tunisian Arabic. Despite this linguistic diversity,
the field of SER in TD remains limited, with only a
few studies, such as those conducted by (Nasr et al.,
2023), (Meddeb et al., 2016), and (Messaoudi et al.,
2022), whose corpora are annotated with distinct
emotion classes but remain limited in size.
The primary objective of this research is to ad-
dress this gap by providing emotional speech anno-
tations in Tunisian Arabic based on a categorical ap-
proach, which will be made publicly available to the
scientific community. To achieve this, we aim to de-
velop a newly collected corpus of spontaneous TD
across different domains and capture multiple regis-
ters of TD, some of which have already been utilized
in previous studies (Nasr et al., 2023), to ensure the
richness of our dataset. Figure 1 illustrates the
construction process of our SERTUS corpus.
The major contributions of this research are sum-
marized as follows:
- Collecting spontaneous data from various domains in TD.
- Annotating the corpus using a categorical approach.
- Assessing the effectiveness of this annotation using an agreement measure.
The remainder of this paper is structured as fol-
lows: In Section 2, we review the related works pre-
sented in the literature. In Section 3, we outline the
construction of SERTUS (Speech Emotion Recogni-
tion in TUnisian Spontaneous Speech). In Section 4,
we establish a discussion. Finally, in Section 5, we
draw conclusions and discuss future research direc-
tions.
2 RELATED WORK
The critical assessment of SER systems depends on
the quality of the databases used and the performance
metrics achieved. To select an appropriate dataset,
several criteria must be taken into account, includ-
ing the naturalness of emotions (natural, acted, or in-
duced), the size of the database, the diversity of avail-
able emotions, and the annotation approach (dimen-
sional or categorical).
Arabic SER is an emerging field. However, there
is a notable scarcity of available Arabic speech emo-
tion datasets. Between 2000 and 2021, emotional
databases in Arabic speech accounted for only 4.76%
of the total number of databases across all languages
(Iben Nasr et al., 2024). Additionally, acted and
elicited emotion databases are primarily used for Ara-
bic SER (Alamri et al., 2023). For example, KSUE-
motions is an acted database for Standard Arabic
that comprises 5 hours and 5 minutes of recorded
data. Another example is EYASE, which represents
a semi-natural Egyptian Arabic dataset consisting of
579 utterances ranging in length from 1 to 6 seconds
(El Seknedy and Fawzi, 2022). To the best of our
knowledge, only two datasets are available in TD: the
REGIM TES dataset (Meddeb et al., 2016), which
includes 720 acted emotional speech samples with
lengths of up to 5 seconds, and TuniSER (Messaoudi
et al., 2022), which contains 2,771 induced speech
utterances with durations varying from 0.41 to 15.31
seconds, averaging around 1 second.
In comparing Arabic with other languages such
as English and French, we notice a marked gap in both
the quality and the quantity of available datasets for
SER. For instance, English language datasets, such as
the IEMOCAP dataset (Busso et al., 2008), include
natural and spontaneous conversations between ac-
Figure 1: Process of SERTUS Construction.
tors, totaling approximately 12 hours of audio record-
ings. Similarly, the SAVEE dataset (Jackson and Haq,
2014) comprises natural emotional speech recorded
from four English male speakers, with a total duration
of 4 hours. Furthermore, French language datasets,
like the RECOLA dataset (Ringeval et al., 2013), con-
tain natural emotional speech recordings in various
contexts, with a total of around 8 hours of audio data.
However, the Arabic language suffers from a
dearth of comparable resources in terms of both qual-
ity and quantity compared to these well-established
English and French datasets. As previously men-
tioned, the available Arabic datasets, predominantly
consisting of acted or elicited speech, lack the natu-
ralness and spontaneity of real-world emotional ex-
pressions. Moreover, compared to other datasets, the
sizes of the existing Arabic datasets are significantly
smaller, exacerbating the challenges faced in develop-
ing robust Arabic SER systems. This disparity under-
scores the urgent need for additional efforts to address
the shortcomings in Arabic SER, particularly regard-
ing the quality and size of the datasets. Hence, it is
vital to build SER corpora in diverse linguistic contexts
and to identify the gaps that remain in Arabic SER resources.
3 SERTUS CONSTRUCTION
3.1 Data Collection
In Tunisia, it is possible to make fair use of copy-
righted materials according to Article 10 of Law No.
94-36 of February 24, 1994, related to literary and
artistic property, as amended by Law No. 2009-33 of
June 23, 2009. Our data collection, conducted within
this research framework, covers the two oral linguistic
registers mentioned above (Boukadida, 2008). For the first
register, we drew inspiration from Tunisian television
programs, particularly talk shows on YouTube. We
aimed to analyze the range of emotions expressed by
guests who shared their views on various topics re-
lated to social, political, cultural, and sporting mat-
ters. As for the second register, we extracted data
from the ’Tunisian Reality’ YouTube series, which
consists of interviews conducted on the streets us-
ing sidewalk microphones in Tunis, the capital city of
Tunisia. These series capture spontaneous emotional
responses and comments from individuals in public
spaces, addressing a wide range of topics such as pol-
itics and the economy. We collected data from differ-
ent regions, including southern and northern Tunisia,
to capture dialectal differences.
Based on a variety of talk shows and ’Tunisian
Reality’ series, with an emphasis on the various do-
mains and the two registers of TD, and by including
diverse public regions, we were able to build a com-
prehensive representation of spontaneous emotional
reactions within the Tunisian population, ultimately
leading to a better understanding of the SER process.
These resources offer a wide range of emotional ex-
pressions and regional linguistic variations to better
capture the main TD characteristics, including multi-
lingual aspects, various word types, as well as mor-
phological, syntactical, and lexical differences (Mas-
moudi et al., 2018) (Nasr et al., 2023).
Our SERTUS database contains over 23.85 hours
of spontaneous TD speech by 1,259 speakers. We
used the ’WavePad’ software, thanks to its user-
friendly interface and advanced features, to extract
voiced segments from the audio files. We extracted
the speakers’ responses and eliminated not only the
accompanying questions raised by TV show presen-
ters or journalists after conducting street interviews
but also extraneous noise sources. Each speaker’s re-
sponse was treated as a distinct audio segment to en-
sure clarity and coherence. Our dataset includes 3,793
recordings with durations ranging from 3.15 seconds
to 10.97 minutes, providing a wide range of speech
scenarios. On average, a segment spans approximately
20 seconds, capturing a broad spectrum of linguistic
expressions and emotional nuances. A uniform sam-
pling rate of 44.1 kHz was applied to enhance consis-
tency, and the corpus was gathered over a five-month
period.
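The sampling-rate standardization is easy to reproduce programmatically. The sketch below is a minimal illustration, assuming librosa and soundfile are installed; the folder names are hypothetical, and the corpus itself was segmented and exported manually in WavePad.

import librosa
import soundfile as sf
from pathlib import Path

TARGET_SR = 44100  # uniform sampling rate used for SERTUS

def resample_segment(in_path: Path, out_path: Path) -> None:
    # librosa resamples on load when an explicit sr is requested
    signal, _ = librosa.load(in_path, sr=TARGET_SR, mono=True)
    sf.write(out_path, signal, TARGET_SR)

out_dir = Path("sertus_segments")  # hypothetical output folder
out_dir.mkdir(exist_ok=True)
for wav in Path("raw_segments").glob("*.wav"):  # hypothetical input folder
    resample_segment(wav, out_dir / wav.name)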
Furthermore, to provide new insights into the dis-
tribution and composition of our dataset, Table 1 cat-
egorizes the collected data into different domains and
provides a detailed breakdown of the size and compo-
sition of our dataset in each domain.
Table 1: Statistics on the number of minutes collected per domain.

Domain        Program                    Minutes
Culture       Kahwa Arbi, Labess Show    180
Politics      Wahech Checha              60
Sport         Attesia Sport              60
Society       Safi Kalbek, Maa Ala       180
Multi-domain  Tunisian Reality           951
3.2 Pre-Processing Steps
To ensure accurate annotations, we have carried out
some pre-processing steps, namely reducing periods
of silence and anonymizing personal information.
3.2.1 Silence Reduction
Our study has primarily focused on selected speeches
delivered by speakers or guests, rather than presenta-
tions made by TV program presenters. Therefore, we
have excluded presenters’ statements from our cor-
pus. It is worth noting that our corpus is made up of
the voices of TV speakers or guests without any over-
lapping speech. Given the spontaneous nature of our
corpus, periods of silence used for reflection before
giving a specific response are common. To ensure
consistency, any silence lasting more than two sec-
onds has been standardized to a duration of exactly
two seconds.
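Although this step was carried out in WavePad, the rule itself is simple to script. The sketch below is a minimal illustration, assuming pydub and a silence threshold of -40 dBFS (both assumptions rather than details of our pipeline): it trims every silent stretch longer than two seconds down to exactly two seconds.

from pydub import AudioSegment
from pydub.silence import detect_silence

MAX_SILENCE_MS = 2000  # pauses longer than 2 s are capped at 2 s

def cap_silences(in_path: str, out_path: str, thresh_dbfs: int = -40) -> None:
    audio = AudioSegment.from_wav(in_path)
    # non-overlapping [start_ms, end_ms] spans of silence >= 2 s
    long_spans = detect_silence(audio, min_silence_len=MAX_SILENCE_MS,
                                silence_thresh=thresh_dbfs)
    pieces, cursor = [], 0
    for start, end in long_spans:
        # keep the speech plus the first 2 s of the pause, drop the rest
        pieces.append(audio[cursor:start + MAX_SILENCE_MS])
        cursor = end
    pieces.append(audio[cursor:])
    sum(pieces, AudioSegment.empty()).export(out_path, format="wav")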
3.2.2 Anonymization
Our speech data, collected from authentic conver-
sations, may include personal information such as
names, professions, locations, and other details. To
ensure confidentiality and compliance with data pro-
tection regulations, it is crucial to closely follow and
oversee the legal framework surrounding the man-
agement of personal data. By adhering to the Good
Practice Guide Principles (2006), we prioritize the
anonymization of our data to prevent the identifica-
tion of individuals.
Our approach involves identifying sensitive ele-
ments in recordings that may reveal personal infor-
mation. To ensure anonymity, it is essential to iden-
tify and replace cues that may disclose this informa-
tion with periods of silence, thereby reducing the fre-
quency of personal speech information. We acknowl-
edge that specific combinations of cues may still re-
veal individuals' identities. By contrast, a speaker who
mentions only their first name is not considered to have
disclosed identifiable personal information. Also, pro-
fessional roles (e.g., journalists) are not generally
considered identifiers unless specific details are dis-
closed.
Our analysis of the corpus showed that uncommon
instances, such as specific job titles and locations,
require careful handling to avoid the direct identifica-
tion of personal information. Anonymization is a manual pro-
cess that involves thoroughly examining all audio seg-
ments to protect privacy and adhere to legal standards.
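Once the sensitive spans have been located by hand, the replacement itself can be scripted. A minimal sketch, again assuming pydub; the file name and time stamps are purely illustrative.

from pydub import AudioSegment

def silence_spans(in_path: str, out_path: str, spans_ms) -> None:
    # spans_ms: manually identified (start_ms, end_ms) sensitive intervals
    audio = AudioSegment.from_wav(in_path)
    for start, end in sorted(spans_ms):
        gap = AudioSegment.silent(duration=end - start,
                                  frame_rate=audio.frame_rate)
        audio = audio[:start] + gap + audio[end:]
    audio.export(out_path, format="wav")

# e.g. mute a spoken surname between 12.3 s and 13.1 s (hypothetical)
silence_spans("seg_0042.wav", "seg_0042_anon.wav", [(12300, 13100)])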
3.3 Guidelines
The annotators were provided with clear and detailed
guidelines to ensure a thorough understanding of the
study’s objectives and the precise ways their contribu-
tions would be utilized. These guidelines emphasized
the importance of objectivity throughout the annota-
tion process to minimize any potential biases stem-
ming from individual perceptions, emotions, or sub-
jective interpretations.
To achieve objectivity, annotators were instructed
to approach the emotional labeling task with a neu-
tral perspective, focusing solely on the linguistic and
acoustic cues present in the audio data rather than
their personal emotional responses. This objectivity
was critical, given the diverse range of spontaneous
emotions expressed in the TD and the need for con-
sistent and reliable annotations across the dataset.
To facilitate the annotation process, a structured
and user-friendly method was employed. Annota-
tors were provided with a pre-designed table that they
were to complete using Microsoft Excel. After lis-
tening to each audio segment, they were required to
associate the identified emotions with the correspond-
ing audio file name. This structured approach ensured
a systematic and organized method of capturing emo-
tional labels, reducing the likelihood of errors or in-
consistencies.
The recruitment of the annotation team was han-
dled with care and precision. A public call for ten-
der was launched to develop a robust and transparent
selection process, ensuring that only the most qual-
ified individuals were chosen for the task. The se-
lection criteria were stringent, focusing on expertise
in sentiment annotation, fluency in TD, and a strong
background in linguistic analysis. We specifically tar-
geted individuals with academic qualifications from
the Faculty of Arts and Humanities, particularly those
with advanced degrees in linguistics and a solid un-
derstanding of the emotional nuances in the TD.
The final annotation team consisted of three na-
tive speakers of TD: two females and one male, all
of whom were experts in the field of sentiment anal-
ysis. Inspired by previous works such as (Messaoudi
et al., 2022) and (Macary et al., 2020), which also uti-
lized three annotators, each annotator independently
reviewed and labeled the audio files to ensure a di-
verse yet consistent perspective on the emotional con-
tent of the corpus. This independent annotation pro-
cess, combined with their linguistic expertise, pro-
vided a high degree of reliability and validity in the
emotional labeling of our dataset. Additionally, the
diversity in the team, both in terms of gender and
analytical perspectives, further strengthened the ro-
bustness of the annotations, contributing to the overall
quality of the study’s outcomes.
3.4 Annotation
At this stage, annotators must take into account the
definitions of emotions identified in each audio seg-
ment of our dataset. Each segment receives a specific
emotion. Manual annotation is extremely important
as it helps annotators build a dedicated learning cor-
pus for emotion analysis. The final designation for
each clip was determined by a majority vote among the
three annotators (see the sketch after the list below).
Our categorical annotation includes six different emotions,
illustrated by the following transliterated TD examples:
- Neutrality: [Alfylm Alywm y‘r.dwh] / The movie is showing today.
- Joy: [.hAjT tfr.h ‘lxr] / It makes me so happy.
- Disgust: [Alw.d‘ mhwA^s jmlA] / The situation is not going well at all.
- Satisfaction: [b.srA.hT ’nA rA.dyT w tW b^swT b^swT t^t.hsn] / Honestly, I'm satisfied with this situation, and little by little, it will get better.
- Sadness: [rAniy t‘bt ml Alfqar w AlA.htyAjyT] / I suffer from poverty and low living standards.
- Anger: [srAaq kymA hkA myxfw.S rby] / Thieves like these people do not fear God.
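As announced above, each clip's final label was fixed by a majority vote over the three annotators' Excel sheets. The sketch below shows one way such a vote could be scripted with pandas; the file and column names are assumptions, not the sheets' actual layout.

from collections import Counter
import pandas as pd

# one sheet per annotator, each mapping audio file name -> emotion
sheets = [pd.read_excel(f"annotator_{i}.xlsx") for i in (1, 2, 3)]
merged = sheets[0][["file"]].copy()
for i, df in enumerate(sheets, start=1):
    merged[f"emotion_{i}"] = df["emotion"]

def majority(row) -> str:
    votes = Counter([row["emotion_1"], row["emotion_2"], row["emotion_3"]])
    label, count = votes.most_common(1)[0]
    # a 1-1-1 split has no majority and is flagged for adjudication
    return label if count >= 2 else "disagreement"

merged["final"] = merged.apply(majority, axis=1)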
The selection of emotions in our corpus—neutral,
joy, sadness, disgust, anger, and satisfaction—was
carefully determined through discussions with anno-
tators and based on the specific characteristics of the
data. This choice was influenced by several consider-
ations, including the inclusion of ”neutral” to capture
moments devoid of intense emotions and "satisfaction,"
which reflects a blend of joy and calm (Plutchik,
1980). The presence of these categories in our cor-
pus was essential to accurately represent the observed
emotional diversity. While this study does not di-
rectly compare these emotions with those in datasets
from other languages, the selection aligns with estab-
lished models such as (Plutchik, 1980) and (Ekman,
1992), while being tailored to the unique features of
our dataset.
During the data collection phase, we encountered
a significant challenge: the overwhelming prevalence
of negative emotional discourse among Tunisian
speakers. This pattern appeared to reflect the broader
socio-economic situation in Tunisia, where ongoing
challenges related to political instability, economic
difficulties, and social unrest have shaped public sen-
timent. Despite our concerted efforts to balance
the dataset by actively seeking out episodes that fo-
cused on more positive topics—such as love, personal
achievements, and national celebrations—our corpus
remained heavily skewed toward negative emotional
expressions.
The emotional landscape captured in our data
leaned heavily towards emotions like anger and sad-
ness, while more positive emotions such as joy and
contentment were notably underrepresented. As
shown in Figure 2, the "joy" class accounted for a
mere 4.69% and the "neutral" class accounted for
only 3.02% of the emotion distribution, whereas the
"anger" class represented 28.76%, indicating a stark
contrast in the emotional categories present in the cor-
pus.
Figure 2: Class distribution in the corpus.
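The percentages in Figure 2 are plain relative frequencies of the final labels. Continuing the hypothetical merged table from the majority-vote sketch above:

# relative frequency (%) of each final label, as plotted in Figure 2
share = merged["final"].value_counts(normalize=True).mul(100).round(2)
print(share)  # e.g. anger 28.76, ..., joy 4.69, neutral 3.02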
The domains represented in the data col-
lected from Tunisian talk shows and street inter-
views have a direct impact on the types of emo-
tions expressed by speakers. For example, po-
litical discussions, such as those from the pro-
gram ”Wahech Checha”, often evoke negative emo-
tions like anger. In one instance, a speaker ex-
pressed their anger over the political instability,
saying, [ynhbw fy mxAzn aldwl b.tryqT .gyr mbA^sr srAq
/ They are looting the state's warehouses indirectly,
thieves], reflecting the socio-political chal-
lenges faced by the country. Conversely, cul-
tural topics, featured in programs like ”Kahwa
Arbi" and "Labess Show", tend to elicit more neu-
tral or positive emotions. For example, a guest
on ”Labess Show” expressed satisfaction, stating,
[fr.hAnT ’ny alywm fy brnAmjk / I am happy to be in
your program today].
Sports-related discussions, such as those from ”At-
tesia Sport", also display a range of emotions, from
joy during team victories to disappointment during
losses. For instance, a fan celebrating a win said,
[trjy dymA mfr.htnA / Attaraji always bring us
joy]. The societal topics addressed in
the programs ”Safi Kalbek” and ”Maa Ala”, as well
as some episodes in the ”Tunisian Reality” YouTube
series, evoke emotions of sadness and disgust that
reflect the economic and social situation in Tunisia;
one speaker stated, [.hAlnA ylm bh kAn rby / Only God
knows our situation]. By analyzing
these varied domains, we can observe distinct emo-
tional patterns associated with each field, enriching
our understanding of how emotions are tied to the
topics being discussed. This domain-emotion cor-
relation is vital for a nuanced analysis of emotional
discourse in TD, especially when considering the re-
gional and contextual factors that influence emotional
expression.
3.5 Calculation of the Inter-Annotator
Agreement
Evaluating agreement between annotators in corpus
annotation tasks is a critical step in assessing the con-
sistency of the annotations. Inter-annotator reliabil-
ity ensures that the annotated data reflect consistent
judgments, minimizing individual biases. One of the
most widely used measures to assess this agreement
is Fleiss’ kappa coefficient (Gwet, 2021). It measures
the degree of concordance between multiple annota-
tors, regardless of the number of categories, and ac-
counts for chance agreement, making it a robust tool
for such analyses. Python libraries such as statsmodels
and NLTK provide convenient implementations with which
researchers can compute this coefficient and validate
the reliability of their annotations.
The formula for Fleiss' kappa is as follows:

    K = (P_o - P_e) / (1 - P_e)    (1)

where P_o is the proportion of observed agreement
between annotators and P_e is the proportion of
agreement expected by chance.
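As a concrete illustration, statsmodels ships a Fleiss' kappa implementation; the toy label matrix below is an assumption for demonstration only (SERTUS has 3,793 clips, not three).

import numpy as np
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

# one row per clip, one column per annotator; values are emotion codes
labels = np.array([
    [0, 0, 0],   # all three annotators chose "anger"
    [4, 4, 2],   # two chose "sadness", one chose "disgust"
    [1, 1, 1],   # all three chose "joy"
])
table, _ = aggregate_raters(labels)  # clips x categories count table
print(f"Fleiss' kappa = {fleiss_kappa(table, method='fleiss'):.2f}")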
In our case, the annotators demonstrated a high
level of agreement, with K = 0.89, indicating
strong consistency in annotating the emotions ex-
pressed in spontaneous speech. This result demon-
strates the reliability of the annotations; however,
some disagreements persisted due to the inherent
challenges posed by the nature of the data.
The first challenge is the spontaneous nature of
our data, which includes background noise and dis-
fluency phenomena (Boughariou et al., 2021). This
issue affects the performance of the annotation pro-
cess. Annotating emotions in a spontaneous corpus
presents additional challenges due to the complexity
and overlap of emotional states. Emotions are not
always expressed unequivocally and can shift within
the same discourse, making it difficult to achieve per-
fect agreement between annotators. For example, dis-
agreements can arise based on how each annotator
perceives subtle nuances within the same sentence.
Consider the following sentence in TD:
[yAxy h*A klw .sArly w mAzelt ntsemAe.h / After every-
thing that happened to me, I continue to hurt myself].
This sentence could be interpreted as expressing
sadness, due to the sense of resignation it conveys.
However, it could also be perceived as reflecting
disgust towards oneself or the situation, illustrating
the difficulty of assigning a single emotional label to
nuanced expressions.
Another example that illustrates emotional over-
lap is: [konit ma.gmwm w AldenyA .dAqt biyA, amA
Al.hmd lllh t.hasnt w fra.het bar$A / I was sad and
felt overwhelmed, but thank God, things improved,
and I am very happy now.] Here,
we see a clear transition from sadness to joy, demon-
strating an emotional shift marked by the improve-
ment in the situation. This kind of emotional com-
plexity makes it difficult to standardize annotations,
as each annotator may focus on different aspects of
the discourse.
Thus, while the high kappa coefficient reflects
strong overall agreement between the annotators,
these examples show that some divergence can per-
sist, especially in situations where emotions are am-
biguous or overlapping.
Table 2: Comparative table of Arabic SER corpora.

Work                          Name         Language          Nature        Size                             Emotions
(Meddeb et al., 2016)         REGIM TES    TD                Acted         720 samples of up to 5 s each    5
(Messaoudi et al., 2022)      TuniSER      TD                Semi-natural  2,771 utterances, 0.41-15.31 s   4
(Meftah et al., 2020)         KSUEmotions  MSA               Acted         5 hours and 5 minutes of data    5
(El Seknedy and Fawzi, 2022)  EYASE        Egyptian dialect  Semi-natural  579 utterances, 1-6 s            4
Our work                      SERTUS       TD                Spontaneous   23.85 h                          6
4 DISCUSSION
In the realm of Arabic SER, our study provides a
comprehensive analysis of the SERTUS dataset. As
shown in Table 2, our work stands out across sev-
eral key dimensions, with the SERTUS dataset serv-
ing as the cornerstone of our research. Notably, our
dataset features a substantial size, consisting of ap-
proximately 23.85 hours of audio recordings, with a
total of 3,793 samples. Furthermore, the data collected
for SERTUS comes from a variety of programs across
multiple domains, enhancing its applicability to a
broad range of research areas and real-world appli-
cations. One of the main strengths of our dataset lies
in the spontaneous nature of the speech data. Unlike
datasets composed of acted or semi-natural speech,
the natural interactions captured in SERTUS offer a
more authentic representation of human emotions in
everyday environments. This characteristic makes
our dataset especially valuable for developing emo-
tion recognition models that are better suited to real-
life applications, surpassing the limitations posed by
artificially constructed datasets. Additionally, SER-
TUS covers six distinct emotion categories, exceeding
the four categories commonly found in comparable
datasets, such as TuniSER and EYASE. This broader
emotional range allows for the capture of more nu-
anced and varied emotional expressions, thereby con-
tributing to the development of more accurate and ro-
bust emotion recognition models. The diversity and
scale of spontaneous recordings in SERTUS provide
fertile ground for advancing research in this area, fa-
cilitating improvements in systems designed for SER
across various applications, such as human-computer
interaction, clinical emotion detection, and behavioral
analysis in complex environments.
The TD holds potential for generalization to other
Arabic dialects, such as Algerian and Libyan, due to
their shared linguistic roots and phonological, lexi-
cal, and syntactic similarities. This facilitates transfer
learning between these dialects. Our Tunisian emo-
tional corpus, rich in annotated audio data, provides
a solid foundation for developing models that can be
adapted to neighboring dialects. By leveraging the
linguistic and emotional nuances in this corpus, it
is possible to refine emotion recognition algorithms,
thereby enhancing the understanding of emotions in
Algerian and Libyan dialects. This approach enables
significant advancements in computational linguistics
and in the emotion recognition community, benefiting
not only the TD but also other Arabic dialects.
5 CONCLUSION AND FUTURE
PERSPECTIVES
In the current research work, we introduced our spon-
taneous emotional speech corpus in TD, named SER-
TUS, with categorical annotation. We labeled our
corpus with six emotions: neutral, disgust, anger,
joy, satisfaction, and sadness, following established
guidelines. This corpus includes more than 23.85
hours of recordings in different domains, such as
sports and politics, and from various sources to cap-
ture regional differences in TD. The ultimate objec-
tive of this work was to ensure the consistency of
this new corpus, and we achieved good agreement be-
tween annotators, measured by the kappa coefficient.
In future directions, we will address the limita-
tions of our research, particularly the imbalance in
our corpus, by applying data augmentation techniques
and measuring their performance.
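As a first step in that direction, standard signal-level augmentations could oversample rare classes such as joy. The sketch below, assuming librosa and using illustrative parameter values and file names, is one possible starting point rather than a settled design.

import librosa
import soundfile as sf

def augment(in_path: str, sr: int = 44100):
    # two common augmentations; step and rate values are illustrative
    y, _ = librosa.load(in_path, sr=sr)
    yield "pitch", librosa.effects.pitch_shift(y, sr=sr, n_steps=2)
    yield "stretch", librosa.effects.time_stretch(y, rate=0.9)

for tag, y_aug in augment("joy_0007.wav"):  # hypothetical clip name
    sf.write(f"joy_0007_{tag}.wav", y_aug, 44100)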
Additionally, our focus will pivot towards refining
the annotation process to better capture the evolving
nature of human emotions, especially within sponta-
neous speech. This will involve exploring alternative
annotation methods, such as dimensional annotation,
which can provide a more nuanced understanding of
emotional subtleties. Specifically, we aim to adopt
finer granularity in our annotation to detect overlap-
ping emotions, such as disgust and anger. We also
plan to extend the applicability of our dataset beyond
its current scope. By collaborating with experts from
various fields, we aim to explore how our annotated
corpus can be utilized to advance research in areas
such as NLP, affective computing, and cultural stud-
ies.
Furthermore, we will continue to investigate po-
tential real-world applications for our dataset, includ-
ing its use in developing emotion recognition sys-
tems, improving human-computer interaction inter-
faces, and facilitating cross-cultural communication.
REFERENCES
Alamri, H. et al. (2023). Emotion recognition in arabic
speech from saudi dialect corpus using machine learn-
ing and deep learning algorithms.
Aljuhani, R. H., Alshutayri, A., and Alahdal, S. (2021).
Arabic speech emotion recognition from saudi dialect
corpus. IEEE Access, 9:127081–127085.
Besdouri, F. Z., Zribi, I., and Belguith, L. H. (2024). Chal-
lenges and progress in developing speech recognition
systems for dialectal arabic. Speech Communication,
page 103110.
Boughariou, E., Bahou, Y., and Belguith, L. H. (2021).
Classification based method for disfluencies detection
in spontaneous spoken tunisian dialect. In Intelligent
Systems and Applications: Proceedings of the 2020
Intelligent Systems Conference (IntelliSys) Volume 2,
pages 182–195. Springer.
Boukadida, N. (2008). Connaissances phonologiques et
morphologiques dérivationnelles et apprentissage de
la lecture en arabe (Étude longitudinale). PhD thesis,
Université Rennes 2; Université de Tunis.
Busso, C., Bulut, M., Lee, C.-C., Kazemzadeh, A., Mower,
E., Kim, S., Chang, J. N., Lee, S., and Narayanan,
S. S. (2008). Iemocap: Interactive emotional dyadic
motion capture database. Language resources and
evaluation, 42:335–359.
Ekman, P. (1992). Are there basic emotions? Psychological Review, 99(3):550–553.
El Seknedy, M. and Fawzi, S. A. (2022). Emotion recog-
nition system for arabic speech: Case study egyptian
accent. In International conference on model and data
engineering, pages 102–115. Springer.
Gannouni, S., Aledaily, A., Belwafi, K., and Aboalsamh, H.
(2020). Adaptive emotion detection using the valence-
arousal-dominance model and eeg brain rhythmic ac-
tivity changes in relevant brain lobes. IEEE Access,
8:67444–67455.
Gibson, M. L. (1999). Dialect contact in Tunisian Ara-
bic: sociolinguistic and structural aspects. PhD the-
sis, University of Reading.
Goncalves, L., Salman, A. N., Naini, A. R., Velazquez,
L. M., Thebaud, T., Garcia, L. P., Dehak, N., Sisman,
B., and Busso, C. (2024). Odyssey 2024-speech emo-
tion recognition challenge: Dataset, baseline frame-
work, and results. Development, 10(9,290):4–54.
Gwet, K. L. (2021). Large-sample variance of fleiss gen-
eralized kappa. Educational and Psychological Mea-
surement, 81(4):781–790.
Iben Nasr, L., Masmoudi, A., and Hadrich Belguith, L.
(2024). Survey on arabic speech emotion recogni-
tion. International Journal of Speech Technology,
27(1):53–68.
Jackson, P. and Haq, S. (2014). Surrey audio-visual ex-
pressed emotion (savee) database. University of Sur-
rey: Guildford, UK.
Macary, M., Tahon, M., Estève, Y., and Rousseau, A.
(2020). Allosat: A new call center french corpus for
satisfaction and frustration analysis. In Language Re-
sources and Evaluation Conference, LREC 2020.
Masmoudi, A., Bougares, F., Ellouze, M., Estève, Y., and
Belguith, L. (2018). Automatic speech recognition
system for tunisian dialect. Language Resources and
Evaluation, 52:249–267.
Meddeb, M., Karray, H., and Alimi, A. M. (2016). Au-
tomated extraction of features from arabic emotional
speech corpus. International Journal of Computer In-
formation Systems and Industrial Management Appli-
cations, 8:11–11.
Meftah, A., Qamhan, M., Alotaibi, Y. A., and Zakariah,
M. (2020). Arabic speech emotion recognition using
knn and ksuemotions corpus. International Journal of
Simulation–Systems, Science & Technology, 21(2):1–
5.
Messaoudi, A., Haddad, H., Hmida, M. B., and Graiet, M.
(2022). Tuniser: Toward a tunisian speech emotion
recognition system. In Proceedings of the 5th Inter-
national Conference on Natural Language and Speech
Processing (ICNLSP 2022), pages 234–241.
Nasr, L. I., Masmoudi, A., and Belguith, L. H. (2023). Nat-
ural tunisian speech preprocessing for features extrac-
tion. In 2023 IEEE/ACIS 23rd International Confer-
ence on Computer and Information Science (ICIS),
pages 73–78. IEEE.
Plutchik, R. (1980). A general psychoevolutionary theory of
emotion. Emotion: Theory, research, and experience,
1.
Ringeval, F., Sonderegger, A., Sauer, J., and Lalanne, D.
(2013). Introducing the recola multimodal corpus of
remote collaborative and affective interactions. In
2013 10th IEEE international conference and work-
shops on automatic face and gesture recognition (FG),
pages 1–8. IEEE.
Zribi, I., Boujelbane, R., Masmoudi, A., Ellouze, M., Bel-
guith, L. H., and Habash, N. (2014). A conven-
tional orthography for tunisian arabic. In LREC, pages
2355–2361.