Masry: A Text-to-Speech System for the Egyptian Arabic

Ahmed Hammad Azab 1,a, Ahmed B. Zaky 2,4,b, Tetsuji Ogawa 3,c and Walid Gomaa 1,5,d

1 Computer Science and Engineering, Egypt-Japan University of Science and Technology, Alexandria, Egypt
2 Computer Science and Information Technology Programs (CSIT), Egypt-Japan University of Science and Technology, Egypt
3 Department of Communications and Computer Engineering, Waseda University, Tokyo, Japan
4 Shoubra Faculty of Engineering, Benha University, Benha, Egypt
5 Faculty of Engineering, Alexandria University, Alexandria, Egypt

a https://orcid.org/0009-0007-2461-1040
b https://orcid.org/0000-0002-3107-5043
c https://orcid.org/0000-0002-7316-2073
d https://orcid.org/0000-0002-8518-8908
Keywords:
Natural Language Processing, Text-To-Speech, Egyptian Arabic.
Abstract:
This paper presents the development and evaluation of Masry, an end-to-end system designed to synthesize Egyptian Arabic speech. The proposed approach leverages the Tacotron speech synthesis models, namely Tacotron1 and Tacotron2, integrated with the Griffin-Lim vocoder for Tacotron1 and the HiFi-GAN vocoder for Tacotron2. By synthesizing waveforms from mel-spectrograms, Masry offers a comprehensive solution for generating natural and expressive Egyptian Arabic speech. To train and validate the system, we constructed a dataset of a male speaker narrating general written passages and news content in Egyptian Arabic, recorded at a sampling rate of 44,100 Hz to preserve fidelity and richness in the synthesized output. The performance of the framework was assessed with several metrics, with a particular focus on the Mean Opinion Score (MOS). The experimental results demonstrate that Tacotron2 outperforms Tacotron1, yielding a MOS of 4.48 compared to 3.64, which underscores the system's ability to capture and reproduce the nuances of Egyptian Arabic speech more effectively. In addition, the evaluation includes word and character error rates (WER and CER), which provide a quantitative assessment of the accuracy of the synthesized speech.
1 INTRODUCTION
Text-to-Speech (TTS) technology has become a crit-
ical field of research and development. It aims at
transforming written text into spoken words, enabling
applications like voice assistants, audiobooks, acces-
sibility tools, and language learning platforms with
the potential to revolutionize human-computer interactions (Young et al., 2018).
Arabic stands as one of the most extensively spoken languages, serving as the mother tongue for over 200 million individuals (Versteegh, 2014), and the largest Semitic language. It is composed of two main dialects: Standard Arabic and Dialectal Arabic. While Modern Standard Arabic (MSA) is the formal linguistic standard, Dialectal Arabic represents
the daily spoken variation, exhibiting significant differences within and across countries (Habash, 2022).
However, the literature has been dominated by TTS systems for English, resulting in a gap in developing TTS systems for less commonly spoken low-resource languages and dialects, including Egyptian Arabic (Fahmy et al., 2020). As a widely spoken dialect with unique regional variations and informal characteristics (Abdel-Massih, 2011), Egyptian Arabic requires dedicated attention to its linguistic nuances and cultural relevance. Hence, developing an efficient and accurate TTS system for Egyptian Arabic is paramount to enhancing accessibility and communication for its users. The current work aims to address this imperative by presenting a novel approach for building a high-quality TTS system specifically tailored to the unique characteristics of Egyptian Arabic, drawing insights from existing related work in both Arabic and English TTS systems (Habash, 2022).
Moreover, diacritic and gemination signs, representing short vowels and consonant doubling, respectively, play a crucial role in correctly pronouncing Arabic and its various dialects. However, these signs are often omitted in written texts, as most Arab readers are accustomed to inferring them from the context (Habash, 2022). This absence poses significant difficulties for TTS systems aiming to accurately represent the diverse pronunciations in Arabic.
The lack of Egyptian datasets poses a significant challenge for training Text-to-Speech (TTS) models. As TTS technology strives for natural and accurate speech synthesis, it relies heavily on large and diverse datasets in various languages, including Egyptian Arabic. Unfortunately, the scarcity of high-quality, annotated data in this specific dialect hinders the development of accurate TTS systems that can effectively mimic the unique characteristics of Egyptian Arabic speech (Baali et al., 2023). Without sufficient training data, TTS models may struggle to capture the nuances of pronunciation, intonation, and linguistic variations specific to the Egyptian dialect, leading to less authentic and less intelligible speech synthesis. Addressing this issue requires collaborative efforts to collect and curate more Egyptian datasets, fostering the advancement of TTS technology to cater to a broader linguistic landscape. Our contributions are as follows:
1. We present a dataset of a male speaker (named Ashraf) narrating general writing and news in Egyptian Arabic.
2. We present Masry, an end-to-end text-to-speech system for Egyptian Arabic.
The paper is structured as follows: In Section 1, we provide an introduction to the study's objectives and scope. Section 2 reviews related literature and prior works in the field. Section 3 elaborates on the system architecture, delving into each phase within this framework. Section 4 describes the conducted experiments involving the models. In Section 5, we present the outcomes of these experiments and discuss them. Finally, Section 6 concludes the paper, summarizing our findings and outlining potential avenues for future research.
2 RELATED WORK
Arabic text-to-speech synthesis is one of the NLP challenges in AI. Many attempts have been made to build systems that overcome the artificial quality of synthesized voices and produce more natural, human-like speech. Previous work exists in this area, but most models target the English language, and the models that have been applied to Arabic generally do not focus on specific dialects (Habash, 2022).
2.1 English TTS Models
Many TTS models have been developed for English. The most famous is Tacotron, an end-to-end generative text-to-speech model that can directly synthesize speech from text with a simple waveform synthesis module. Its highest reported MOS (Mean Opinion Score) is 3.82 (Wang et al., 2017).
The authors in (Ren et al., 2019) proposed a novel method, FastSpeech, using a feed-forward network based on the Transformer that generates mel-spectrograms in parallel for TTS. They extract attention alignments from an encoder-decoder-based teacher model for phoneme duration prediction; a length regulator then uses these durations to expand the source phoneme sequence to match the length of the target mel-spectrogram sequence, enabling parallel mel-spectrogram generation. Experiments on the LJSpeech dataset (Ito and Johnson, 2017) show that their parallel model achieves speech quality comparable to autoregressive models.
2.2 Arabic TTS Models
Some TTS models have been developed for the Arabic language. Abdel-Hamid et al. (Abdel-Hamid et al., 2006) employed an HMM-based (hidden Markov model) approach to enhance synthesized Arabic speech. Their methodology used a statistical model to generate Arabic speech parameters such as the spectrum, fundamental frequency (F0), and phoneme durations. They also incorporated a multi-band excitation model and used samples from the spectral envelope as spectral parameters.
Zangar et al. (Imene et al., 2018) focused on utilizing Deep Neural Networks (DNNs) for duration modeling in Arabic speech synthesis. They compared HMM-based and DNN-based duration modeling with various architectures to minimize the root mean square prediction error (RMSE). The study concluded that their DNN-based duration modeling outperformed both the HMM-based modeling from the HTS toolkit and the DNN-based modeling from the MERLIN toolkit.
In (Fahmy et al., 2020), the authors proposed a transfer-learning end-to-end Modern Standard Arabic (MSA) TTS deep architecture. Their work presents
how high-quality, natural, human-like Arabic speech is generated using an end-to-end neural network architecture. The approach is built upon a limited corpus of text and audio pairs, comprising a relatively small collection of recorded audio amounting to 2.41 hours. Notably, it demonstrates the successful use of English character embeddings, even when diacritized Arabic characters are used as input. The study further describes the preprocessing techniques applied to these audio samples and the strategies used to optimize outcomes.
The authors in (Abdelali et al., 2022) proposed an end-to-end TTS system for Arabic, called NatiQ. Their speech synthesizer uses an encoder-decoder architecture with attention. They used the Tacotron models (Tacotron1 and Tacotron2) and a Transformer model to generate mel-spectrograms from characters, and used the WaveRNN vocoder with Tacotron1, the WaveGlow vocoder with Tacotron2, and an ESPnet Transformer with Parallel WaveGAN to synthesize waveforms from the spectrograms. Two voices, male and female, were used. The authors achieved a MOS of 4.21 for the female voice and 4.40 for the male voice.
3 SYSTEM ARCHITECTURE
The system, Masry, is structured into three key components, depicted in Fig. 1. The initial stage involves preprocessing to refine the input text. Subsequently, a text-to-mel-spectrogram model generates the mel-spectrum output. Finally, the mel-spectrum is processed by a vocoder to generate the corresponding audio. Additionally, we describe the dataset collected and used for training.
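The end-to-end flow can be summarized by the following minimal sketch; the function names (diacritizer, phonetizer, text_to_mel, vocoder) are hypothetical placeholders for the components described in the rest of this section, not the actual implementation.

# Minimal sketch of the Masry pipeline (hypothetical placeholder components).
def synthesize(text, diacritizer, phonetizer, text_to_mel, vocoder):
    """Convert raw Egyptian Arabic text into a speech waveform."""
    diacritized = diacritizer(text)      # restore short vowels and gemination
    phonemes = phonetizer(diacritized)   # grapheme-to-phoneme conversion
    mel = text_to_mel(phonemes)          # Tacotron-style mel-spectrogram prediction
    return vocoder(mel)                  # Griffin-Lim or HiFi-GAN waveform generation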
3.1 Dataset
In this section, we will discuss the dataset we col-
lected and its characteristics.
EGYARA-23 Dataset
We collected a dataset and called it EGYARA-23. The
EGYARA-23 dataset comprises recordin gs of a male
speaker named Ashraf, who narrates general co nver-
sations and news in Egyptian Arabic. The dataset
spans 20 .5 hour s and con sists of 105,329 words and
32,716 segments. On average, each segment has a
duration of 8 seconds, and the recordings maintain a
quality of 44.1 KHz.
We intended to create a comprehensive dataset encompassing a wide range of Egyptian Arabic words, considering that this dialect contains numerous words not present in Modern Standard Arabic (MSA). To ensure authenticity, we transcribed the dataset in Egyptian Arabic as used in everyday language, reflecting the actual dialect. Consequently, each segment is accompanied by its respective transcript. Table 1 displays a selection of MSA words and their equivalents in Egyptian Arabic.
Table 1: Differences between words in MSA and Egyptian Arabic.

Egyptian Arabic    MSA       English
استنى              انتظر      Wait
روح                اذهب       Go
عشان               لأن        Because
عاوز               أريد       Want
معلش               آسف        Sorry
هات                أعطني      Give me
3.2 Preprocessing
In this section, we will delve into the preprocessing steps, which include diacritization, segmentation, and phonetization.
3.2.1 Diacritization
Egyptian Arabic diacritization encompasses two types of vowels: long vowels, explicitly indicated in the text, and short vowels (diacritics), which are often omitted in modern writing (Habash, 2022), relying on readers' contextual understanding (Abdelali et al., 2022). Accurately restoring these diacritics is paramount both for human comprehension and for machine-based pronunciation of Arabic words. To address this, we employed CAMeL Tools (Obeid et al., 2020) as a diacritization tool for Egyptian Arabic, attaining an accuracy exceeding 90%. However, to ensure correctness, an expert in Arabic linguistics reviewed the automatically diacritized data, particularly in intricate cases such as named entities and foreign words, which can present challenges even to native speakers. Diacritization holds significant importance in accurately pronouncing words, thus rendering it a crucial step in the preprocessing phase. Table 2 shows some Egyptian Arabic words before and after diacritization.
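As an illustration of this step, the minimal sketch below uses the MLE disambiguator from CAMeL Tools to recover a diacritized form for each word; the use of the default pretrained model is an assumption, and the exact configuration we used for Egyptian Arabic may differ.

# Diacritization sketch with CAMeL Tools (Obeid et al., 2020); model choice is an assumption.
from camel_tools.disambig.mle import MLEDisambiguator
from camel_tools.tokenizers.word import simple_word_tokenize

mle = MLEDisambiguator.pretrained()            # default pretrained model (assumption)
tokens = simple_word_tokenize('عاوز اروح')      # example Egyptian Arabic sentence

disambiguated = mle.disambiguate(tokens)
# Take the top-scoring analysis for each word and read its diacritized form.
diacritized = [d.analyses[0].analysis['diac'] if d.analyses else d.word
               for d in disambiguated]
print(' '.join(diacritized))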
3.2.2 Segmentation
Given the constraints of neural architectures in processing lengthy audio samples (Shen et al., 2018), we developed a semi-automated procedure for data collection, where the dataset is partitioned into approximately 8-second frames.
Figure 1: Masry architecture. Input text is preprocessed (diacritization, segmentation, phonetization); the processed text (phonemes) is converted to a mel-spectrum by the text-to-mel-spectrogram model, and the vocoder turns the mel-spectrum into the output audio.
Table 2: Samples in Egyptian Arabic before and after diacritization.

Before    After       English
استنى     اِسْتَنِّي       Wait
روح       رُوح         Go
عشان      عَشَانْ        Because
عاوز      عَاوِزْ        Want
معلش      مَعْلِشْ        Sorry
هات       هَاتْ         Give me
This segmentation process involves meticulous attention to maintaining sentence coherence while preserving the overall contextual flow and prosody. Typically, extended pauses between segments are used as reliable indicators for segmentation, with exceptions being made when extended pauses are followed by relevant context or supplementary content that remains part of the sentence (Abdelali et al., 2022). This underscores the importance of a careful approach to segmentation to preserve the precision of the audio data representation and its coherent continuity. Employing our semi-automated process, we address this challenge during data collection, resulting in a dataset where each sentence is inherently segmented with established start and end times. Consequently, the audio trimming process is automated to align with each sentence's corresponding segments. We capture the entirety of each sentence during data collection, ensuring that no pertinent context or supplementary content is left unresolved within the sentence.
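A pause-based segmentation of this kind can be sketched as follows; it uses librosa's silence detection to find non-silent regions and groups them into chunks of at most eight seconds. The silence threshold and file name are assumptions, and the actual semi-automated procedure additionally relies on manual verification of sentence boundaries.

# Sketch of pause-based segmentation into segments of at most 8 seconds (assumed parameters).
import librosa

MAX_LEN_S = 8.0   # target maximum segment length, as in EGYARA-23
TOP_DB = 40       # silence threshold in dB below peak (assumption)

y, sr = librosa.load('ashraf_recording.wav', sr=44100)   # hypothetical file name
intervals = librosa.effects.split(y, top_db=TOP_DB)      # non-silent (start, end) sample indices

segments, cur_start, cur_end = [], None, None
for start, end in intervals:
    if cur_start is None:
        cur_start, cur_end = start, end
    elif (end - cur_start) / sr <= MAX_LEN_S:
        cur_end = end                        # extend the current segment across the pause
    else:
        segments.append((cur_start, cur_end))
        cur_start, cur_end = start, end
if cur_start is not None:
    segments.append((cur_start, cur_end))
# Each (start, end) pair is then trimmed and paired with its transcript.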
3.2.3 Phonetization
Phonetization converts textual input into its corresponding phonetic representation, mapping each grapheme to its phoneme. In TTS systems, phonetization is crucial to accurately synthesize speech by ensuring correct pronunciation, intonation, and rhythm. For languages like Arabic, with intricate vowel patterns and diacritics, phonetization is particularly important to capture pronunciation nuances. It enables TTS systems to generate natural and contextually appropriate speech output, creating a more expressive and high-quality auditory experience (El-Imam, 2004). In this phase, our focus shifts towards converting the transcribed text, which has already undergone diacritization and segmentation, into a phoneme representation. This transformation occurs through a two-step procedure, as shown in Fig. 2. Initially, we transcribe the text into the Buckwalter transliteration format. Subsequently, the output of the transliteration step is processed further to obtain phoneme characters. To accomplish this task, we leverage the Arabic phonetiser tool developed by Nawar Halabi (Halabi, 2016). These sequential steps are essential in preparing the data for model input. Examples before and after phonetization are shown in Table 3.
Figure 2: Phonetization process: Arabic text is converted to Buckwalter transliteration and then passed to the phonetizer to obtain phonemes.
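A minimal sketch of the first step, Buckwalter transliteration, is given below; the character map is abbreviated and the helper is illustrative rather than the exact converter used in our pipeline.

# Abbreviated Buckwalter transliteration sketch (illustrative, not the exact converter used).
BUCKWALTER = {
    'ا': 'A', 'ب': 'b', 'ت': 't', 'ث': 'v', 'ج': 'j', 'ح': 'H', 'خ': 'x',
    'د': 'd', 'ذ': '*', 'ر': 'r', 'ز': 'z', 'س': 's', 'ش': '$', 'ص': 'S',
    'ض': 'D', 'ط': 'T', 'ظ': 'Z', 'ع': 'E', 'غ': 'g', 'ف': 'f', 'ق': 'q',
    'ك': 'k', 'ل': 'l', 'م': 'm', 'ن': 'n', 'ه': 'h', 'و': 'w', 'ي': 'y',
    'ء': "'", 'أ': '>', 'إ': '<', 'آ': '|', 'ؤ': '&', 'ئ': '}', 'ة': 'p', 'ى': 'Y',
    # Diacritics: fatha, damma, kasra, sukun, shadda, and the three tanween marks.
    'َ': 'a', 'ُ': 'u', 'ِ': 'i', 'ْ': 'o', 'ّ': '~', 'ً': 'F', 'ٌ': 'N', 'ٍ': 'K',
}

def to_buckwalter(text: str) -> str:
    """Map each Arabic character to its Buckwalter symbol; pass other characters through."""
    return ''.join(BUCKWALTER.get(ch, ch) for ch in text)

print(to_buckwalter('عَشَانْ'))   # -> Ea$aAno, matching Table 3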
Table 3: Examples of Arabic text, Buckwalter transliteration, and phoneme characters.

Phonemes             Buckwalter    Arabic Text
E a $ aa n           Ea$aAno       عشان
E aa w i1 z          EaAwizo       عاوز
m a E l i1 $         maEoli$o      معلش
h a a t              haAto         هات
< i0 s t a nn ii0    Aisotaniy     استني
3.3 Speech Synthesis Model
This section presents an overview of the synthesis model's architecture. The synthesizer comprises an encoder-decoder model and a vocoder, which is pivotal in generating the desired waveforms. The encoder-decoder module is responsible for converting the preprocessed text into a mel-spectrum representation, and the vocoder transforms the mel-spectrum representation into the corresponding waveform. To comprehensively explore the effectiveness of the proposed approach, we experimented with two distinct models and vocoders.
Tacotron1 (Wang et al., 2017) adopts an RNN sequence-to-sequence architecture comprising three main components, shown in Fig. 3: an encoder, an attention-based decoder, and a post-processing module. The model takes text input in the form of characters and transforms it into a mel-spectrogram representation; the post-processing module then uses this mel-spectrogram to generate the corresponding waveform. The encoder in Tacotron1 uses a CBHG-based approach, which involves a bank of 1-D convolutional filters, followed by highway networks and a bidirectional gated recurrent unit (GRU).
Tacotron1 employs the Griffin-Lim algorithm on top of the generated mel-spectrograms to complete the speech synthesis process. The Griffin-Lim algorithm is a well-known vocoder used in speech synthesis, including Tacotron1. It converts a mel-spectrogram back into a time-domain waveform by working iteratively, refining a random waveform estimate to match the target mel-spectrogram; it recovers phase information for better waveform reconstruction by alternating between the time and frequency domains. This computationally efficient approach strikes a balance between quality and efficiency, making it widely used in real-time speech synthesis.
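For illustration, a minimal sketch of Griffin-Lim reconstruction from a mel-spectrogram using librosa is shown below; the analysis parameters and file names are assumptions and do not necessarily match the configuration used in our experiments.

# Griffin-Lim reconstruction sketch from a mel-spectrogram (assumed parameters and file names).
import librosa
import soundfile as sf

SR, N_FFT, HOP, N_MELS = 44100, 2048, 512, 80    # assumed analysis parameters

y, _ = librosa.load('sample.wav', sr=SR)          # hypothetical reference audio
mel = librosa.feature.melspectrogram(y=y, sr=SR, n_fft=N_FFT,
                                     hop_length=HOP, n_mels=N_MELS)

# Invert the mel filterbank to an approximate linear-magnitude spectrogram, then run
# iterative Griffin-Lim to estimate the phase and recover a time-domain waveform.
stft_mag = librosa.feature.inverse.mel_to_stft(mel, sr=SR, n_fft=N_FFT)
wav = librosa.griffinlim(stft_mag, n_iter=60, hop_length=HOP)

sf.write('reconstructed.wav', wav, SR)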
Figure 3: Tacotron 1 architecture (Wang et al., 2017): character embeddings are encoded with a CBHG module, a pre-net and attention-based decoder RNNs predict seq2seq targets (r = 3 frames per step), a post-processing CBHG produces a linear-scale spectrogram, and Griffin-Lim reconstruction yields the waveform.
Tacotron2 (Shen et al., 2018) is an advanced text-to-speech (TTS) model, shown in Fig. 4, consisting of a text encoder and a spectrogram generator. The text encoder processes the input text, capturing its linguistic features and context, and produces a fixed-size representation. The spectrogram generator, equipped with attention mechanisms and typically based on recurrent neural networks (RNNs), takes this representation as input and generates mel-spectrograms, which represent the spectral content of audio over time. During training, Tacotron2 learns to align the input text with the corresponding mel-spectrograms using attention, enabling it to generate accurate and expressive mel-spectrograms. HiFi-GAN (Kong et al., 2020), a high-fidelity generative adversarial network vocoder, is designed to convert mel-spectrograms into high-quality audio waveforms. Using a GAN architecture with a generator and a discriminator, HiFi-GAN is trained to synthesize realistic and natural-sounding speech from mel-spectrograms: the generator produces the audio waveforms while the discriminator tries to distinguish between real and generated audio, and this adversarial training enhances the quality of the generated speech. By combining Tacotron2 and HiFi-GAN, the TTS system can generate human-like speech by first producing mel-spectrograms that retain the essential characteristics of the audio and then using the high-fidelity vocoder to transform these spectrograms into realistic, high-quality speech waveforms.
Figure 4: Tacotron 2 architecture (Shen et al., 2018), modified to use the HiFi-GAN vocoder.
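The two-stage inference can be sketched in PyTorch-style code as follows; the tacotron2 and hifigan objects are hypothetical placeholders for trained models with the stated input/output shapes, and text_to_phoneme_ids stands in for the preprocessing pipeline of Section 3.2.

# Two-stage inference sketch (hypothetical trained models and preprocessing function).
import torch

def tts_infer(text, text_to_phoneme_ids, tacotron2, hifigan):
    """Text -> phoneme IDs -> mel-spectrogram (Tacotron2) -> waveform (HiFi-GAN)."""
    ids = torch.LongTensor(text_to_phoneme_ids(text)).unsqueeze(0)   # shape (1, T_in)
    with torch.no_grad():
        mel = tacotron2(ids)       # assumed to return a (1, n_mels, T_mel) mel-spectrogram
        audio = hifigan(mel)       # HiFi-GAN generator forward, (1, 1, T_audio)
    return audio.squeeze().cpu().numpy()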
4 EXPERIMENTS
To evaluate the performance of each model, we per-
formed computational experiments with automatic
and manual (subjective) evaluations.
4.1 Training Phase
In the training procedure, we adopt a two-step approach. First, we train the feature prediction network independently to predict mel-spectrogram features from phoneme input. Subsequently, we train a HiFi-GAN separately, using the outputs generated by the feature prediction network as its input. This two-step process allows us to effectively leverage the predictions made by the first network and enhance the overall performance of the HiFi-GAN. The training process used a batch size of 32 and ran for 1000 epochs. Initially, we preprocessed the input text by converting it into phonemes. During training, we took checkpoints every 1000 steps; these checkpoints were used to test and monitor the training progress to ensure effectiveness and efficiency. The computing environment used an Intel Xeon 6230R @ 2.1 GHz (52 CPUs) with an NVIDIA Quadro RTX 5000.
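For concreteness, the reported hyperparameters can be summarized in a configuration sketch such as the one below; the keys and any values not stated in the text (for example, the learning rate) are illustrative assumptions.

# Training configuration sketch; values not stated in the paper are assumptions.
TRAIN_CONFIG = {
    'sampling_rate': 44100,          # EGYARA-23 recordings (44.1 kHz)
    'batch_size': 32,                # as reported
    'epochs': 1000,                  # as reported
    'checkpoint_every_steps': 1000,  # checkpoints taken every 1000 steps
    'learning_rate': 1e-3,           # assumption, not stated in the paper
    'stages': [
        'train_feature_prediction_network',   # phonemes -> mel-spectrogram
        'train_hifigan_on_predicted_mels',    # mel-spectrogram -> waveform
    ],
}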
4.2 Testing Phase
To test the models, we built a testing dataset called EGYARA-TEST that shares the same characteristics as EGYARA-23. EGYARA-TEST covers a duration of 1 hour and comprises 1003 segments. This dataset was explicitly designed to thoroughly test the performance of the models on a more extensive and diverse set of audio samples. By using it, we aimed to evaluate how well the models perform and to assess their robustness and generalization capabilities across a broader range of speech segments.

For model evaluation, we employ two types of assessment: automatic evaluation and manual (subjective) evaluation.
4.2.1 Automatic Evaluation
In this study, we employed an Egyptian Arabic Automatic Speech Recognition (ASR) system developed by (Alyafeai, 2022) to decode the audio files generated by our Text-to-Speech (TTS) models. To compare the generated transcripts with the input sentences used for TTS, we adapted the original reference text to match the unvowelized output of the ASR system, ensuring a fair comparison. Standard evaluation metrics, including Word Error Rate (WER) and Character Error Rate (CER), were used to assess the TTS models' performance. WER is a widely used metric for automatic speech recognition performance; it handles variable sequence lengths by computing the Levenshtein distance at the word level. WER facilitates system comparisons and the assessment of improvements, though it does not indicate the type of errors. To address this, dynamic string alignment aligns the recognized and reference word sequences. The power law theory examines the correlation of perplexity with WER, shedding light on the impact of language model complexity on error rates (Morris et al., 2004). WER is calculated as:
can subsequen tly be calculated as:
W ER = (S + D + I)/N = (S + D + I)/(S + D + C)
(1)
where S is the number of substitutions, D is the num-
ber of deletions, I is the number of insertions, C is the
number of correct words, N is the number of words in
the reference (N=S+D+C)
Character Error Rate (CER) is a prevalent performance metric for automatic speech recognition systems. Similar to WER, CER operates at the character level rather than the word level; see (Morris et al., 2004) for detailed insights. CER is computed as:

CER = (S + D + I) / N = (S + D + I) / (S + D + C)    (2)

where S is the number of substitutions, D is the number of deletions, I is the number of insertions, C is the number of correct characters, and N is the number of characters in the reference (N = S + D + C).
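As a minimal sketch, both metrics can be computed with a word- or character-level Levenshtein alignment, as below; stripping diacritics from the reference (to match the unvowelized ASR output, as described above) is shown with a simple regular expression rather than the exact normalization we applied.

# Sketch of WER/CER computation via Levenshtein distance (illustrative normalization).
import re

ARABIC_DIACRITICS = re.compile(r'[\u064B-\u0652]')   # tanween, short vowels, shadda, sukun

def dediacritize(text):
    """Strip Arabic diacritics so the reference matches unvowelized ASR output."""
    return ARABIC_DIACRITICS.sub('', text)

def edit_distance(ref, hyp):
    """Levenshtein distance (S + D + I) between two token sequences."""
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)]

def wer(reference, hypothesis):
    ref, hyp = dediacritize(reference).split(), dediacritize(hypothesis).split()
    return edit_distance(ref, hyp) / max(len(ref), 1)

def cer(reference, hypothesis):
    ref, hyp = list(dediacritize(reference)), list(dediacritize(hypothesis))
    return edit_distance(ref, hyp) / max(len(ref), 1)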
4.2.2 Manual Evaluation
For manual evaluation, we used the Mean Opinion Score (MOS) (Guski, 1997), a subjective evaluation metric that measures the perceived quality of generated speech or audio. Human participants rate the samples on a scale, and the average score indicates the system's quality. MOS evaluations help identify strengths and weaknesses in TTS and ASR systems and guide improvements for better user satisfaction. They are one aspect of a comprehensive evaluation approach that also considers other metrics and user feedback. We conducted an anonymous survey to evaluate the output of our models: 108 participants were asked to rate the audio samples on a scale from 1 to 5, where higher scores indicate better quality. The survey included 10 audio samples, and we calculated the Mean Opinion Score (MOS) from the participants' ratings. This approach ensured the confidentiality of the participants' responses and provided valuable insights into the perceived quality of the models' output.
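For reference, the MOS is simply the mean of all collected ratings; the minimal sketch below computes it over a participants-by-samples rating matrix, with the ratings themselves being illustrative random data.

# MOS = mean of all 1-5 ratings across participants and samples (illustrative data).
import numpy as np

rng = np.random.default_rng(0)
ratings = rng.integers(1, 6, size=(108, 10))   # 108 participants x 10 audio samples

mos = ratings.mean()                   # overall Mean Opinion Score
per_sample_mos = ratings.mean(axis=0)  # MOS of each audio sample
print(f'MOS = {mos:.2f}')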
5 RESULTS
The evaluation metrics, Character Error Rate (CER) and Word Error Rate (WER) computed according to Equations (1) and (2), together with the Mean Opinion Score (MOS), are given in Table 4 and Table 5.
Figure 5: Comparison of predicted mel-spectrograms, (a) ground truth, (b) Tacotron2, (c) Tacotron1, for an Egyptian Arabic input sentence meaning "He did not like the way that was".

Figure 6: Comparison of predicted mel-spectrograms, (a) ground truth, (b) Tacotron2, (c) Tacotron1, for an Egyptian Arabic input sentence meaning "This leaves us to ask a very logical question".
Our best-performing speech synthesizer is Tacotron2 with HiFi-GAN for Egyptian Arabic.
The combination of Tacotron2's mel-spectrogram prediction model and HiFi-GAN has demonstrated superior performance, with low CER and WER indicating accurate and precise speech generation. Additionally, the high MOS obtained from the subjective evaluation reflects the perceived quality and naturalness of the synthesized speech. Together, these metrics support the conclusion that Tacotron2 with HiFi-GAN is the most effective and preferred choice among the tested speech synthesis models for Egyptian Arabic. Furthermore, comparing this model's MOS against those of Modern Standard Arabic TTS systems provides additional validation: our MOS is close to, and in fact surpasses, the scores reported for the NatiQ system (Abdelali et al., 2022) and for the transfer learning end-to-end Arabic Text-To-Speech (TTS) deep architecture (Fahmy et al., 2020). Additionally, in terms of WER and CER, our model surpasses the performance of the NatiQ system (Abdelali et al., 2022).
Fig. 5 and Fig. 6 present a comparison of mel-spectrograms for two different sentences, each with three audio sources: (a) the ground truth, (b) Tacotron2, and (c) Tacotron1. We can observe a higher degree of similarity between the ground-truth and Tacotron2 mel-spectrograms than between the ground truth and Tacotron1. This finding suggests that Tacotron2 is better at generating mel-spectrograms that closely resemble the ground truth, indicating higher accuracy and fidelity in the speech synthesis process. Tacotron1's mel-spectrograms, on the other hand, show more noticeable differences from the ground truth, suggesting that it may not capture certain acoustic characteristics as effectively as Tacotron2.
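A comparison like the one shown in Fig. 5 and Fig. 6 can be reproduced along the following lines; the file names are placeholders for the ground-truth and synthesized audio, and the display parameters are assumptions.

# Sketch for plotting mel-spectrogram comparisons (placeholder file names).
import librosa
import librosa.display
import matplotlib.pyplot as plt
import numpy as np

files = {'Ground-truth': 'gt.wav', 'Tacotron2': 'taco2.wav', 'Tacotron1': 'taco1.wav'}

fig, axes = plt.subplots(1, 3, figsize=(12, 3), sharey=True)
for ax, (title, path) in zip(axes, files.items()):
    y, sr = librosa.load(path, sr=44100)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=80)
    mel_db = librosa.power_to_db(mel, ref=np.max)
    librosa.display.specshow(mel_db, sr=sr, x_axis='time', y_axis='mel', ax=ax)
    ax.set_title(title)
plt.tight_layout()
plt.show()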
Table 4: CER and WER evaluation results.

                                                       CER    WER
NatiQ TTS system for MSA (Abdelali et al., 2022)       8.01   24.87
Tacotron1 (Griffin-Lim) for Egyptian Arabic            20.4   66.6
Tacotron2 (HiFi-GAN) for Egyptian Arabic (proposed)    7.3    22.3
Table 5: MOS evaluation results (maximum score is 5).

                                                       MOS
TL TTS system for MSA (Fahmy et al., 2020)             4.21
NatiQ TTS system for MSA (Abdelali et al., 2022)       4.40
Tacotron1 (Griffin-Lim) for Egyptian Arabic            3.64
Tacotron2 (HiFi-GAN) for Egyptian Arabic (proposed)    4.48
6 CONCLUSION
This paper presented Masry, an end-to-end text-to-speech system tailored for Egyptian Arabic, combining Tacotron2 with the HiFi-GAN vocoder. Additionally, a novel dataset and its transcriptions in Egyptian Arabic were introduced. The system's performance was assessed through automatic evaluation metrics, namely Character Error Rate (CER) and Word Error Rate (WER), resulting in scores of 7.3 and 22.3, respectively. Furthermore, a manual evaluation using the Mean Opinion Score (MOS) yielded a score of 4.48. Our findings indicate that the system's performance is close to that of English and Modern Standard Arabic systems. Future work entails incorporating additional features, such as emotion and multi-speaker capabilities, to further enhance the system.
REFERENCES
Abdel-Hamid, O., Abdou, S. M., and Rashwan, M. (2006). Improving Arabic HMM based speech synthesis quality. In Ninth International Conference on Spoken Language Processing.
Abdel-Massih, E. T. (2011). An Introduction to Egyptian Arabic. MPublishing, University of Michigan Library.
Abdelali, A., Durrani, N., Demiroglu, C., Dalvi, F., Mubarak, H., and Darwish, K. (2022). NatiQ: An end-to-end text-to-speech system for Arabic. arXiv preprint arXiv:2206.07373.
Alyafeai, Z. (2022). Klaam ASR. https://github.com/ARBML/klaam.
Baali, M., Hayashi, T., Mubarak, H., Maiti, S., Watanabe, S., El-Hajj, W., and Ali, A. (2023). Unsupervised data selection for TTS: Using Arabic broadcast news as a case study. CoRR, abs/2301.09099.
El-Imam, Y. A. (2004). Phonetization of Arabic: Rules and algorithms. Computer Speech & Language, 18(4):339–373.
Fahmy, F. K., Khalil, M. I., and Abbas, H. M. (2020). A transfer learning end-to-end Arabic text-to-speech (TTS) deep architecture. In IAPR Workshop on Artificial Neural Networks in Pattern Recognition, pages 266–277. Springer.
Guski, R. (1997). Psychological methods for evaluating sound quality and assessing acoustic information. Acta Acustica united with Acustica, 83:765–774.
Habash, N. Y. (2022). Introduction to Arabic Natural Language Processing. Springer Nature.
Halabi, N. (2016). Modern Standard Arabic Phonetics for Speech Synthesis. PhD thesis, University of Southampton.
Imene, Z., Mnasri, Z., Vincent, C., Denis, J., Amal, H., et al. (2018). Duration modeling using DNN for Arabic speech synthesis. In Proceedings of the 9th International Conference on Speech Prosody, pages 597–601.
Ito, K. and Johnson, L. (2017). The LJ Speech dataset. https://keithito.com/LJ-Speech-Dataset/.
Kong, J., Kim, J., and Bae, J. (2020). HiFi-GAN: Generative adversarial networks for efficient and high fidelity speech synthesis. Advances in Neural Information Processing Systems, 33:17022–17033.
Morris, A., Maier, V., and Green, P. (2004). From WER and RIL to MER and WIL: Improved evaluation measures for connected speech recognition. In INTERSPEECH 2004 - ICSLP, 8th International Conference on Spoken Language Processing, Jeju Island, Korea.
Obeid, O., Zalmout, N., Khalifa, S., Taji, D., Oudah, M., Alhafni, B., Inoue, G., Eryani, F., Erdmann, A., and Habash, N. (2020). CAMeL Tools: An open source Python toolkit for Arabic natural language processing. In Proceedings of the Twelfth Language Resources and Evaluation Conference, pages 7022–7032, Marseille, France. European Language Resources Association.
Ren, Y., Ruan, Y., Tan, X., Qin, T., Zhao, S., Zhao, Z., and Liu, T.-Y. (2019). FastSpeech: Fast, robust and controllable text to speech. Advances in Neural Information Processing Systems, 32.
Shen, J., Pang, R., Weiss, R. J., Schuster, M., Jaitly, N., Yang, Z., Chen, Z., Zhang, Y., Wang, Y., Skerry-Ryan, R., et al. (2018). Natural TTS synthesis by conditioning WaveNet on mel spectrogram predictions. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 4779–4783. IEEE.
Versteegh, K. (2014). The Arabic Language. Edinburgh University Press.
Wang, Y., Skerry-Ryan, R. J., Stanton, D., Wu, Y., Weiss, R. J., Jaitly, N., Yang, Z., Xiao, Y., Chen, Z., Bengio, S., Le, Q. V., Agiomyrgiannakis, Y., Clark, R. A. J., and Saurous, R. A. (2017). Tacotron: Towards end-to-end speech synthesis. In Interspeech.
Young, M., Courtad, C. A., Douglas, K., and Chung, Y.-C. (2018). The effects of text-to-speech on reading outcomes for secondary students with learning disabilities. Journal of Special Education Technology, 34:016264341878604.