Cross-Lingual Low-Resources Speech Emotion Recognition with Domain
Adaptive Transfer Learning
Imen Baklouti, Olfa Ben Ahmed and Christine Fernandez-Maloigne
XLIM Research Institute, UMR CNRS 7252, University of Poitiers, France
ORCID: 0000-0001-5413-8660 (I. Baklouti), 0000-0002-6942-2493 (O. Ben Ahmed), 0000-0003-4818-9327 (C. Fernandez-Maloigne)
Keywords:
Domain Adaptation, Transfer Learning, Cross-Lingual Speech Emotion Recognition.
Abstract:
Speech Emotion Recognition (SER) plays an important role in several human-computer interaction-based applications. During the last decade, SER systems for a single language have achieved great progress through Deep Learning (DL) approaches. However, SER remains a challenge in real-world applications, especially for low-resource languages. Indeed, SER suffers from the limited availability of labeled training data in speech corpora, which makes it difficult to train an efficient prediction model from scratch. Moreover, due to the domain shift between source and target data distributions, traditional transfer learning methods often fail to transfer emotional knowledge from one language (source) to another (target). In this paper, we propose a simple yet effective approach for cross-lingual speech emotion recognition using supervised domain adaptation. The proposed method first trains a source model on 2D Mel-spectrogram images extracted from the source data. Then, a transfer learning method with domain adaptation is proposed in order to reduce the domain shift between source and target data in the latent space during model fine-tuning. We conduct experiments on different transfer learning tasks, namely low-resource scenarios, using the IEMOCAP, RAVDESS and EmoDB datasets. The obtained results demonstrate that the proposed method achieves competitive classification performance in comparison with the classical transfer learning method and with recent state-of-the-art domain adaptation works for SER.
1 INTRODUCTION
The ability to understand and interpret human emo-
tions is a fundamental aspect of effective communica-
tion and social interaction. In recent years, the field of
affective computing has gained significant attention
as researchers and technologists seek to develop intel-
ligent systems capable of perceiving, analyzing, and
responding to human emotions. Among the various
modalities explored in affective computing, speech has garnered significant attention due to its rich emotional content and accessibility. Our voice carries a
wealth of information, including vocal pitch, inten-
sity, rhythm, and spectral properties, all of which are
influenced by our emotional states.
Speech Emotion Recognition (SER) is an important field in affective computing that involves the automatic detection and analysis of emotions conveyed through speech signals. SER systems are used in several daily applications such as marketing, gaming, psychology, healthcare, and cognitive sciences (El Ayadi et al., 2011; Lugović et al., 2016). However,
several challenges exist in the field of SER because of
the subjective and context-dependent nature of emo-
tion in addition to the individual differences and lin-
guistic variations (Lech et al., 2020). While signif-
icant progress has been made in SER, the majority
of existing systems focus on single-language settings,
limiting their applicability in multilingual and multi-
cultural contexts (Agarla et al., 2022; Cai et al., 2021;
Sharma, 2022; Zhang et al., 2021b). Indeed, emotions influence what we say and what we do, and drive the decisions and choices we make. However, emotions are expressed differently across languages and cultures, and it is therefore important to take these differences into account (Barth, 2020). Hence, developing an efficient and generalizable SER system remains challenging.
Indeed, different languages exhibit unique phonetic
and acoustic characteristics posing a challenge to de-
veloping language-independent deep learning models
for SER. Additionally, the availability of labeled emo-
tion datasets for training across multiple languages
presents a significant obstacle. Insufficient data for less-resourced languages hinders the development of accurate multilingual SER systems.
In this paper, we propose a new domain adaptation-
based transfer learning method for multi-lingual
SER. The proposed method is based on 2D Mel-
Spectrogram data extracted from voice signals. A
source model is learned in the first step and adapted
in the second step to reduce the domain shift between
the source and the target datasets. The goal is to leverage the knowledge learned from the source task to improve the performance of the model on a related target SER task in a different language. Notably,
our proposed method goes beyond conventional ap-
proaches by incorporating the extraction of domain-
invariant features directly into the loss function in a
supervised manner. This supervised adaptation mech-
anism guides the neural network to acquire features
that are less influenced by domain-specific variations
while maintaining the critical emotional information
necessary for precise SER. The proposed method has been evaluated on different tasks using three datasets in different languages, namely the Interactive Emotional Dyadic Motion Capture (IEMOCAP) dataset, the Berlin Database of Emotional Speech (EmoDB), and the Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS).
The rest of the paper is organized as follows. Section 2 reviews the literature on deep learning and domain adaptation for SER. Section 3 presents the proposed SER method. Section 4 reports the experiments and results. Finally, Section 5 concludes the work and gives some perspectives.
2 RELATED WORK
SER has witnessed significant advancements in re-
cent years, fueled by the growing applications of af-
fective computing in human-computer interaction and
artificial intelligence. In this section, we first review the use of deep learning for SER in recent years, and then present and discuss some recent and relevant works on domain adaptation for SER.
2.1 Deep Learning for SER
Deep Learning (DL) has been widely used for SER
(Khalil et al., 2019) by extracting high-level fea-
tures from raw speech data. Several works in SER
have used Convolutional Neural Networks (CNN) to
capture local acoustic patterns and learn discriminative features from spectrograms for emotion recognition
(Huang et al., 2014; Badshah et al., 2019). Recur-
rent Neural Networks (RNNs) have also been investi-
gated to capture both temporal and spectral informa-
tion (Zhao et al., 2018; Ma et al., 2019). However,
DL models trained on one language may not gener-
alize well to other languages due to the differences
between languages (Hu et al., 2020). Ensuring cross-
lingual generalization and transferability of models
are challenges in multilingual SER. In this context,
cross-lingual Transfer Learning (TL), where models
are pre-trained on large-scale speech datasets, is used
for SER in small data. Indeed, pre-trained models can
leverage the learned representations from general au-
dio tasks and transfer them to SER, enhancing per-
formance, especially in low-resource settings. For instance, the authors in (Padi et al., 2021) proposed transfer learning of a pre-trained ResNet model with spectrogram data augmentation. The authors in (Liu et al., 2021) proposed to transform time-interval segments of speech signals into a spectrogram and a discrete waveform, fed separately to the FaceNet model for training; the network was pre-trained on the Chinese Academy of Sciences' Institute of Automation (CASIA) dataset and fine-tuned on the IEMOCAP dataset. The authors in (Gerczuk et al., 2021) presented a combination of residual adapters and a ResNet model transferred from visual recognition to the multi-corpus EmoSet. The authors in (Goel and Beigi, 2020) proposed a Multi-Task Learning framework with arousal, naturalness, and gender tasks for TL across SER datasets in five languages, in single- and cross-corpus settings. The authors in (Latif et al., 2018) proposed a Deep Belief Network (DBN) model for TL, evaluated on five corpora in three different languages. The authors in (Deng et al., 2013) proposed a sparse AutoEncoder (AE) technique for feature TL in SER, evaluated on six databases. More recently, the authors in (Scheidwasser-Clow et al., 2022) proposed a SER adaptation benchmark with nine datasets in six languages for evaluating both handcrafted features and Deep Neural Network (DNN) representations.
TL has gained a lot of attention for multi-language
deep-learning-based SER systems. However, it still
suffers from some limitations. In fact, TL assumes that the source and the target domains are similar, which is not the case in real-world speech emotion recognition tasks. This domain shift can degrade the performance of the model. In addition, pre-trained models may not generalize well to new emotional expressions or speakers not represented in the training data. Moreover, the limited availability of labeled data hampers the ability to effectively fine-tune the models, leading
to overfitting and poor generalization.
2.2 Domain Adaptation for SER
In order to reduce the gap between source and tar-
get domains, Domain Adaptation (DA) methods have
been recently proposed for cross-lingual SER. Most
of the existing works focus on adversarial training
for DA. For instance, the authors in (Latif et al., 2019) proposed an unsupervised DA method for cross-lingual SER using a Generative Adversarial Network (GAN) model. The authors in (Abdelwahab and Busso, 2018)
proposed a novel Domain Adversarial Neural Net-
work that exploits target unlabeled data to predict
emotional descriptors for dominance, valence, and
arousal. The authors in (Gideon et al., 2019) proposed the Adversarial Discriminative Domain Generalization (ADDoG) method and its multiclass extension (MADDoG) for cross-corpus SER, evaluated on the MSP-Improv, IEMOCAP and PRIORI datasets. More recently, authors in (Latif
et al., 2022) proposed Adversarial Dual Discriminator
(ADDi) and self-supervised (sADDi) methods to min-
imize the distance between source and target datasets.
However, the presence of two discriminators in the
ADDi architecture leads to training instability and an
increase in both computational complexity and over-
fitting risk. A DA method based on the Constant-Q Transform (CQT) time-frequency representation has been proposed for SER in (Singh et al., 2021). The authors reported no performance gain on EmoDB, noting that the time-frequency representation requires careful handling for SER.
Another group of works maximizes the domain
confusion to learn a common feature space by min-
imizing some measures of domain shift namely the
Maximum Mean Discrepancy (MMD) distance (Yang
et al., 2020; Liu et al., 2020; Zhang et al., 2021a;
Kexin and Yunxiang, 2023; Ye et al., 2023). In-
deed, the MMD has been used for DA in the com-
puter vision field and proved its effectiveness in fea-
ture alignments compared to L1 and L2 distances, es-
pecially in high-dimensional spaces, which is often
the case with complex data like images or spectro-
grams (Rezaeianjouybari and Shang, 2020). How-
ever, achieving DA for speech emotion presents a
heightened level of complexity. This complexity
arises from the necessity to simultaneously preserve
emotional information while mitigating domain shift.
For example, the authors in (Yang et al., 2020) pro-
posed DA based on MMD layers with dynamic trans-
fer coefficients using VGG19 and AlexNet models.
The method has been evaluated on an office-environment dataset, which raises questions regarding its applicability in real-world SER scenarios. The authors in (Liu et al., 2020) proposed DA using the MMD technique with a CNN model based on the LeNet and AlexNet architectures, to validate transfers between the eNTERFACE and CASIA datasets. More recently,
the authors in (Zhang et al., 2021a) proposed an unsu-
pervised novel Joint Distribution Adaptation Regres-
sion (JDAR) method for DA-based regression. The
authors in (ZHAO et al., 2023) extracted 3D-mel-
spectrograms using CNN then a Long Short-Term
Memory (LSTM) model was trained on these fea-
tures for SER. The authors in (Kexin and Yunxiang,
2023) combined Linear Discriminant Analysis, MMD
and graph embedding to constrain the differences be-
tween target and source data while the authors in
(Ye et al., 2023) proposed a novel dual-level emo-
tion alignment module based on unsupervised DA.
The authors in (Fu et al., 2023) evaluated the impact of gender in SER using subdomain adaptation based on Local MMD (LMMD) and multi-task learning models. However, integrating multiple tasks or subdomains into a single model often increases the model's complexity. In (Agarla et al., 2024), the authors proposed a semi-supervised learning method for cross-lingual SER, based on a transformer, to classify unlabeled utterances. The authors mention some limitations: first, pseudo-labeling can deteriorate performance if the model makes incorrect predictions on unlabeled data; second, hard pseudo-labels can cause overfitting, and the proposed method assumes that the target and source sets have the same number of emotions. The authors in (Huijuan et al., 2023) proposed a local domain adaptation using the local MMD method to measure the distance between the source and target data, and suggested multi-source DA using LMMD as future work to make the approach more suitable for cross-corpus SER.
Among the MMD-based domain adaptation approaches, preserving emotional information while aligning the feature distributions has been largely overlooked. Moreover, most existing domain
adaptation methods are designed for unsupervised or
semi-supervised SER tasks due to the lack of anno-
tated data in the target domain (low-resource domain).
However, both unsupervised and semi-supervised DA
methods lack direct supervision from labeled target
domain data. This limits their ability to explicitly
guide the adaptation process towards the target do-
main’s specific characteristics. Without explicit su-
pervision, the models may struggle to capture fine-
grained details or subtle differences in emotional ex-
pressions across languages. In addition, most of the works cited above evaluate DA methods for cross-corpus SER using corpora in the same language, and few works show the effectiveness of their methods across different languages in cross-corpus SER (Ahn et al., 2021). Moreover, few works have tackled the problem of DA from a source to a low-resource target domain.
Hence, in this work, we propose a supervised DA-
based transfer learning approach for SER for low-
resource target domains. Combining classification
loss, which encourages the model to correctly clas-
sify data, with MMD, which emphasizes the align-
ment of data distributions, can result in a more power-
ful discriminative model. This means the model can
both classify well and adapt effectively to different
domains.
3 PROPOSED APPROACH
Figure 1 illustrates the proposed framework of DA-
based transfer learning for SER which is composed of
two steps. The first one consists of learning the source
model from source data, while the second one imple-
ments the domain adaptation-based transfer learning
approach for final prediction from the target data.
3.1 Step 1: Learning the Source Model:
A Mel-Spectrogram-Based Transfer
In order to get a source model for later transfer learn-
ing to the target data, we perform low-level feature-
based transfer learning using an image-based pre-
trained network. First, 2D Mel-Spectrograms are ex-
tracted from signals using the Librosa library (McFee
et al., 2015). Indeed, a spectrogram is a visual repre-
sentation of the spectrum of frequencies of the speech
signal as it varies with time. Hence, we use spec-
trogram data to represent speech signals as images.
Each point in the spectrogram represents the energy,
or magnitude, of a specific frequency component at
a particular time. Indeed, the process of fine-tuning
the CNN using the spectrogram data involves taking
advantage of the pre-trained CNN’s ability to detect
low-level image features. By initializing the CNN
with these learned features and then adapting it to
the spectrogram domain through fine-tuning, the net-
work gains the capability to recognize low-level fea-
tures and spatial details such as edges, and textures in
spectrogram images that are informative for speech
analysis. This process ultimately helps in transfer-
ring the knowledge of low-level image features to the
speech domain, enhancing the network's performance on the SER task.
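As an illustration of this first step, the following sketch shows how a 2D log Mel-spectrogram image could be produced with Librosa; the sampling rate, number of Mel bands and hop length are illustrative choices, since the paper does not report these parameters.

```python
import numpy as np
import librosa
import librosa.display
import matplotlib.pyplot as plt

def audio_to_log_mel_image(wav_path, out_path, sr=16000, n_mels=128, hop_length=256):
    """Convert a speech file into a 2D log-Mel-spectrogram image.

    Assumption: sr, n_mels and hop_length are illustrative values; the paper
    only states that 2D Mel-spectrograms are extracted with Librosa.
    """
    y, sr = librosa.load(wav_path, sr=sr)                 # load and resample the waveform
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels,
                                         hop_length=hop_length)
    log_mel = librosa.power_to_db(mel, ref=np.max)        # log-Mel (dB) spectrogram

    # Save the spectrogram as an image so it can be fed to an image-based CNN.
    fig, ax = plt.subplots(figsize=(3, 3))
    librosa.display.specshow(log_mel, sr=sr, hop_length=hop_length, ax=ax)
    ax.set_axis_off()
    fig.savefig(out_path, bbox_inches="tight", pad_inches=0)
    plt.close(fig)
    return log_mel
```

The saved images can then be used as inputs for fine-tuning the pre-trained image CNN described above.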
3.2 Step 2: Domain Adaptation Based Transfer Learning
In this section, we describe our proposed method for
transfer learning with domain adaptation in the case
of cross-lingual SER. We propose a hybrid loss function for source model fine-tuning. Based on the Maximum Mean Discrepancy (MMD), the DA is designed to minimize the discrepancy between the source and target datasets for cross-lingual SER. The training strategy is to minimize both the CNN classification loss and the MMD loss; hence, the model learns to classify samples with the classification loss while minimizing the domain distribution difference with the MMD loss during transfer learning.
For a source dataset $D_s = \{(x_s^1, y_s^1), \dots, (x_s^{N}, y_s^{N})\}$ with input features $X_s = [x_s^1, \dots, x_s^{N}]$ and output labels $Y_s = [y_s^1, \dots, y_s^{N}]$, and a target dataset $D_t = \{(x_t^1, y_t^1), \dots, (x_t^{M}, y_t^{M})\}$ with input features $X_t = [x_t^1, \dots, x_t^{M}]$ and output labels $Y_t = [y_t^1, \dots, y_t^{M}]$, the two domains follow the marginal distributions $P(D_s)$ and $Q(D_t)$. MMD is often chosen for feature alignment in domain adaptation due to its ability to capture differences between distributions in a high-dimensional space, compared to L1 and L2 distances (Attabi and Dumouchel, 2013). The total loss function is presented in Eq. 1:
$$Loss_{Total} = C_e(Y_s, y_s) + \lambda \, MMD_{FC}(D_s, D_t) \qquad (1)$$
where $C_e$ is the cross-entropy loss computed on the labeled source set, $MMD_{FC}$ denotes the MMD loss (the distance between the target data $D_t$ and the source data $D_s$), $y_s$ denotes the source labels, and $\lambda$ is the weight of $MMD_{FC}$ that controls the transfer of source knowledge to the target domain.
The MMD estimate is given in Eq. 2, where the kernel function $k$ (Gretton et al., 2006) is a Gaussian kernel, and $P$ and $Q$ are the two different marginal distributions of the sample sets ($P(D_s) \neq Q(D_t)$). Such explicit alignment in the Reproducing Kernel Hilbert Space (RKHS) can support structure learning for precise conditional alignment.
$$MMD_{FC}[P,Q] = \frac{1}{N^2}\sum_{i=1}^{N}\sum_{j=1}^{N} k\!\left(D_s^i, D_s^j\right) - \frac{2}{NM}\sum_{i=1}^{N}\sum_{j=1}^{M} k\!\left(D_s^i, D_t^j\right) + \frac{1}{M^2}\sum_{i=1}^{M}\sum_{j=1}^{M} k\!\left(D_t^i, D_t^j\right) \qquad (2)$$
Two output batches, from the source $D_s$ and target $D_t$ datasets, are computed at the first Fully Connected (FC) layer. Eq. 2 computes $MMD_{FC}$ on these outputs as a loss function used to optimize the training process. The objective is that the distance between the source and target data (the MMD loss) for cross-lingual SER is narrowed and the total loss function is decreased.
Figure 1: Domain adaptation based transfer learning proposed method for cross-lingual SER.
By jointly optimizing the cross-entropy loss and
the MMD loss, we encourage the model to learn a fea-
ture representation that minimizes the distribution dis-
crepancy between the source and the target domains
while maintaining good classification performance on
the source domain. The weighting factor λ determines
the trade-off between the two loss terms.
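To make the joint objective concrete, the following is a minimal sketch of Eq. 1 and Eq. 2, written here in PyTorch; the paper does not state which framework is used, so the framework choice, the Gaussian-kernel bandwidth and the value of λ are assumptions.

```python
import torch
import torch.nn.functional as F

def gaussian_mmd(feat_s, feat_t, sigma=1.0):
    """Biased estimate of MMD^2 (Eq. 2) between source features feat_s (N, d)
    and target features feat_t (M, d) with a Gaussian kernel.
    The bandwidth sigma is an illustrative choice."""
    def k(a, b):
        d2 = torch.cdist(a, b, p=2) ** 2            # pairwise squared Euclidean distances
        return torch.exp(-d2 / (2.0 * sigma ** 2))  # Gaussian (RBF) kernel values
    n, m = feat_s.size(0), feat_t.size(0)
    return (k(feat_s, feat_s).sum() / n**2
            - 2.0 * k(feat_s, feat_t).sum() / (n * m)
            + k(feat_t, feat_t).sum() / m**2)

def total_loss(logits_s, labels_s, feat_s, feat_t, lam=0.5):
    """Hybrid loss of Eq. 1: cross-entropy on the labeled source batch plus
    lambda times the MMD between FC-layer features of the two domains.
    lam is an illustrative value."""
    ce = F.cross_entropy(logits_s, labels_s)   # classification loss (source only)
    mmd = gaussian_mmd(feat_s, feat_t)         # domain-alignment loss
    return ce + lam * mmd
```

During fine-tuning, feat_s and feat_t would be the first FC-layer activations of a source batch and a target batch, and an optimizer such as Adam would minimize this combined objective.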
4 EXPERIMENTS AND RESULTS
4.1 SER Datasets
In this work, we use three different SER datasets to evaluate the proposed method. Table 1 summarizes the characteristics of these corpora, namely IEMOCAP (English, audio) (Busso et al., 2008), EmoDB (German, audio) (Burkhardt et al., 2005) and RAVDESS (North American English, audio-visual) (Livingstone and Russo, 2018). A total of 535, 9952 and 1248 audio files with seven emotions have been used for the EmoDB, IEMOCAP and RAVDESS datasets, respectively.
Table 1: Characteristics of the emotional speech databases.
Database | Samples | Emotions
IEMOCAP (English) | 9952 | 1: Angry, 2: Happy, 3: Excited, 4: Sad, 5: Frustrated, 6: Fear, 7: Neutral
EmoDB (German) | 535 | 1: Angry, 2: Bored, 3: Disgust, 4: Fear, 5: Happy, 6: Sad, 7: Neutral
RAVDESS (North American) | 1248 | 1: Neutral, 2: Calm, 3: Happy, 4: Sad, 5: Angry, 6: Fear, 7: Disgust
4.2 Training Setting and Model
Parameters
The model is fine-tuned for 400 epochs, with a learning rate of $10^{-4}$ and a batch size of 16 for all experiments. The categorical cross-entropy loss and the Adam optimizer are used for model optimization. In this way, the model is adapted to the properties of the source data. The source and target datasets are randomly divided into training (80%) and testing (20%) sets. Two evaluation metrics are used:
Unweighted Accuracy: $Accuracy = (t_p + t_n)/(p + n)$, where $t_p$ and $t_n$ are the numbers of true positive and true negative predictions, respectively, and $n + p = t_p + t_n + f_n + f_p$, where $f_n$ and $f_p$ are the numbers of false negative and false positive predictions, respectively.
Unweighted Average Recall (UAR): $UAR = 0.5\,(t_p/p + t_n/n)$.
4.3 Classification Results
In this section, we present and discuss the obtained
classification results namely on the source model and
the target model with DA.
4.3.1 Classification Results for the Source Model
Learning (Only Source Domain)
In order to select the best model for source data prediction, we computed predictions on the test set for each of the three source datasets. Using transfer learning for feature extraction, we generated log Mel-spectrograms from the input audio. We per-
formed several experiments with different parameters
and configurations for fine-tuning the VGG model on
the source data. Table 2 presents the best obtained
classification accuracy for the source model. We re-
port classification accuracies of 82.24%, 78.5% and
83.17% for the RAVDESS, IEMOCAP and EmoDB
datasets, respectively.
Table 2: Classification results for the source model.
Work | Model | Dataset | Accuracy (%)
(Senthilkumar et al., 2022) | Spec VGG16 | RAVDESS | 77
(Aggarwal et al., 2022) | Spec VGG16 | RAVDESS | 81.94
Ours | TL | RAVDESS | 82.24
(Senthilkumar et al., 2022) | Spec VGG16 | IEMOCAP | 63
Ours | TL | IEMOCAP | 78.5
(Senthilkumar et al., 2022) | Spec VGG16 | EmoDB | 75
Ours | TL | EmoDB | 83.17
We also compare the obtained results with re-
cent works on SER in Table 2. In (Aggarwal et al.,
2022), the authors proposed a two-way approach for
SER. The first one is based on Principal Component
Analysis for feature extraction and a DNN for train-
ing. In the second method, Mel-spectrogram images
are extracted from audio files and given as input to
a pre-trained VGG-16 model. Senthilkumar et al. (Senthilkumar et al., 2022) proposed a Bi-directional LSTM architecture, which has some disadvantages such as increased computational complexity, overfitting risk and increased memory requirements. Table 2 shows that our proposed transfer learning strategy achieves better accuracy than (Senthilkumar et al., 2022) and (Aggarwal et al., 2022). For instance, from Table 2 we can observe that our transfer learning method yields an accuracy increase of at least 0.3% on RAVDESS, 15.5% on IEMOCAP and 8.17% on EmoDB compared to the results reported in recent works.
In order to evaluate the model performance per
class, we present confusion matrices on the test data
of the source domain (Figure 2). The goal here is
to see how well the model is performing for each
class, including any imbalances or weakly repre-
sented classes. The test data were randomly sampled.
Figures 2(a),2(b),2(c) show that the proposed transfer
learning gives high accuracy with low misclassifica-
tions for all datasets. Table 3 presents further classifi-
cation results per category.
For the RAVDESS, IEMOCAP and EmoDB datasets, we can see that the precision of fear, the recall of fear, and the precision of disgust and sad together with the recall of neutral reach 1, respectively.
4.3.2 Classification Results on Target Data (New
Domain)
In order to evaluate the proposed method, we perform 4 transfer learning tasks (source → target) with DA using different datasets:
Table 3: Classification results per class on the source model.
RAVDESS | Neutral | Calm | Happy | Sad | Anger | Fear | Disgust
Precision | 0.84 | 0.86 | 0.86 | 0.71 | 0.64 | 1 | 0.81
Recall | 0.84 | 0.92 | 0.60 | 0.92 | 0.58 | 0.89 | 0.87
F1-score | 0.84 | 0.89 | 0.71 | 0.80 | 0.61 | 0.94 | 0.84
IEMOCAP | Anger | Happy | Excited | Sad | Frustrated | Fear | Neutral
Precision | 0.79 | 0.86 | 0.71 | 0.57 | 0.64 | 0.87 | 0.80
Recall | 0.85 | 0.83 | 0.50 | 0.73 | 0.58 | 1 | 0.75
F1-score | 0.82 | 0.84 | 0.59 | 0.64 | 0.61 | 0.93 | 0.77
EmoDB | Anger | Bored | Disgust | Fear | Happy | Sad | Neutral
Precision | 0.76 | 0.83 | 1 | 0.94 | 0.54 | 1 | 0.89
Recall | 0.92 | 0.91 | 0.46 | 0.94 | 0.64 | 0.79 | 1
F1-score | 0.83 | 0.87 | 0.63 | 0.94 | 0.58 | 0.88 | 0.94
Task 1: IEMOCAP (9952) → EmoDB (535) (English → German)
Task 2: EmoDB (535) → IEMOCAP (9952) (German → English)
Task 3: RAVDESS (1248) → EmoDB (535) (North American English → German)
Task 4: EmoDB (535) → RAVDESS (1248) (German → North American English)
We used the output of the first FC layer of the VGG-16 model to calculate the MMD loss between source
and target domains (Yang et al., 2020). The model
has been trained for 50 epochs. Table 4 presents the
classification results using the proposed DA-based
transfer learning approach on the 4 transfer tasks.
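As a sketch of how the first FC-layer activations mentioned above could be exposed for the MMD computation, the snippet below uses a torchvision VGG-16 with a forward hook; the use of torchvision, of ImageNet weights and of a hook is an assumption, as the paper does not detail its implementation.

```python
import torch.nn as nn
from torchvision.models import vgg16, VGG16_Weights

model = vgg16(weights=VGG16_Weights.IMAGENET1K_V1)   # pre-trained VGG-16 (assumed weights)
model.classifier[6] = nn.Linear(4096, 7)              # new head for the seven emotion classes

_cache = {}
def _save_fc1(module, inputs, output):
    _cache["fc1"] = output                             # keep the first FC layer's activations

# classifier[0] is the first fully connected (4096-d) layer of torchvision's VGG-16.
model.classifier[0].register_forward_hook(_save_fc1)

def forward_with_features(x):
    """Return (logits, first-FC features) for a batch of spectrogram images x."""
    logits = model(x)
    return logits, _cache["fc1"]
```

The features returned for a source batch and a target batch would then be passed to the MMD term of the hybrid loss.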
In order to ensure a fair comparison with the works cited in Table 4, we report the mean and standard deviation of both the accuracy and UAR metrics for task 1 and task 2. Each value is thus given with a margin of error: the true value is expected to fall within the reported range centered around the given value. We repeated each experiment ten times and computed the mean and standard deviation. We compare the performance
of the DA-based TL method with sADDi (Latif
et al., 2022), GAN (Latif et al., 2019), AE (Deng
et al., 2013) and CNN-LSTM (Parry et al., 2019)
methods. Compared to these existing techniques, our proposed method achieves better results. Moreover, the DA-based TL technique demonstrates UAR improvements for tasks 1 and 2 of 1.1% and 4.1%, respectively, compared to the sADDi method (Latif et al., 2022), of 4.9% and 8.5% compared to the GAN method (Latif et al., 2019), and of 7% and 9.5% (including 9.5% in terms of accuracy for task 1) compared to the CNN-LSTM method (Parry et al., 2019). For tasks 3 and 4, we have
only one state-of-the-art work that has performed cross-corpus SER (Singh et al., 2021), and there are no specific studies directly comparing these two datasets in a cross-lingual context. However, both datasets are widely used for emotion recognition tasks in speech processing.
Figure 2: Confusion matrices of the source model on the (a) RAVDESS, (b) IEMOCAP and (c) EmoDB datasets.
We can see in Table 4 that, when using our proposed DA-based TL method for tasks 3 and 4, we achieve better results than the CQT (Singh et al., 2021) and MFSC (Singh et al., 2021) methods. Accuracy and UAR metric values are
further boosted by using a DA-based TL method. We
achieved a 57% accuracy rate against CQT 48% and
MFSC 45% (Singh et al., 2021) in task 3 and a 50%
accuracy rate against 44% and 41% (Singh et al.,
2021) in task 4, respectively. Indeed for UAR values,
we obtained 58% against CQT 48% and MFSC
42% (Singh et al., 2021) in task 3 and 50% against
CQT 44% and MFSC 41% (Singh et al., 2021)
in task 4, respectively. In summary, the proposed
method proves that the network is able to produce
better-generalized features for cross-language SER
by increasing the final UAR and accuracy of the four
tasks.
Table 4: Comparative results of DA-based TL cross-lingual SER, UAR (%) / Accuracy (%), for the 4 tasks.
Work | Model | Task 1 (UAR/Accuracy) | Task 2 (UAR/Accuracy)
(Parry et al., 2019) | CNN-LSTM | 42.1±1.8 / 41.99 | 38.9±2.1 / NA
(Latif et al., 2018) | DBN | 42.5±2.1 / NA | 39.5±2.4 / NA
(Deng et al., 2013) | AE | 43.2±2.3 / NA | 40.1±1.8 / NA
(Latif et al., 2019) | GAN | 44.3±1.7 / NA | 40.3±1.7 / NA
(Latif et al., 2022) | ADDi | 46.1±1.6 / NA | 41.2±1.8 / NA
(Latif et al., 2022) | sADDi | 48.3±1.5 / NA | 44.8±1.6 / NA
Ours | DA-based TL | 50.1±0.8 / 50.74±0.3 | 50±0.5 / 50.02±0.4
Work | Model | Task 3 (UAR/Accuracy) | Task 4 (UAR/Accuracy)
(Singh et al., 2021) | MFSC | 42 / 45 | 44 / 41
(Singh et al., 2021) | CQT | 48 / 48 | 46 / 44
Ours | DA-based TL | 58 / 57 | 50 / 50
Table 5 presents the classification results of the TL method compared to the DA-based TL method for task 1 and task 3 cross-lingual SER. It shows that the classification performance for the seven emotions, in terms of Precision, Recall and F1-score, is improved with DA-based TL compared to the values obtained with TL. With the proposed DA-based TL method, RAVDESS→EmoDB (task 3) achieves high F1-score, Precision and Recall values (> 70%) for the Happy emotion, whereas with the TL method Happy is the category with the lowest values (< 30%). As shown in Table 5, IEMOCAP→EmoDB (task 1) with the TL method achieves a classification performance for the Boredom emotion (< 35%) lower than IEMOCAP→EmoDB with the DA-based TL method (> 50%).
Table 5: Classification results of IEMOCAP→EmoDB (task 1) and RAVDESS→EmoDB (task 3) cross-lingual SER using the TL and DA-based TL methods.
Task 1 | Angry | Bored | Disgust | Fear | Happy | Sad | Neutral
DA-based TL: Precision | 0.32 | 0.55 | 0.32 | 0.33 | 0.58 | 0.60 | 0.45
DA-based TL: Recall | 0.47 | 0.55 | 0.29 | 0.39 | 0.61 | 0.39 | 0.49
DA-based TL: F1-score | 0.38 | 0.55 | 0.31 | 0.36 | 0.59 | 0.47 | 0.47
TL: Precision | 0.35 | 0.34 | 0.47 | 0.29 | 0.58 | 0.46 | 0.47
TL: Recall | 0.39 | 0.29 | 0.45 | 0.44 | 0.56 | 0.44 | 0.38
TL: F1-score | 0.37 | 0.31 | 0.46 | 0.35 | 0.57 | 0.45 | 0.42
Task 3 | Angry | Bored | Disgust | Fear | Happy | Sad | Neutral
DA-based TL: Precision | 0.45 | 0.72 | 0.61 | 0.50 | 0.70 | 0.53 | 0.61
DA-based TL: Recall | 0.69 | 0.82 | 0.41 | 0.32 | 0.73 | 0.65 | 0.74
DA-based TL: F1-score | 0.55 | 0.77 | 0.49 | 0.39 | 0.71 | 0.58 | 0.67
TL: Precision | 0.74 | 1 | 0.5 | 0.37 | 0.27 | 0.28 | 0.27
TL: Recall | 0.57 | 0.42 | 0.5 | 0.44 | 0.27 | 0.83 | 0.42
TL: F1-score | 0.64 | 0.59 | 0.5 | 0.40 | 0.27 | 0.42 | 0.52
4.3.3 Effect of DA on Transfer Learning
Classification Results: In this section, we present results for the most representative tasks, namely task 1 and task 3, which illustrate an adaptation from English data to a smaller German (low-resource) dataset. Table 6 presents a comparative study of the accuracy and UAR metrics for IEMOCAP→EmoDB (task 1) and RAVDESS→EmoDB (task 3) cross-corpus SER using the DA-based TL and TL methods.
Table 6: Classification results on the target domain.
Method | Task | Accuracy (%) | UAR (%)
Transfer Learning | Task 1 | 45.8 | 44.73
DA-based TL (ours) | Task 1 | 50.74 | 50.1
Transfer Learning | Task 3 | 50.47 | 50.65
DA-based TL (ours) | Task 3 | 53.47 | 58.01
As shown in Table 6, the results obtained with the DA-based TL technique are better than those obtained by the TL method. Indeed, with DA-based TL for IEMOCAP→EmoDB (task 1), the recognition accuracy and UAR are 5.24% and 6% higher, respectively, than those obtained by the TL method for IEMOCAP→EmoDB (task 1). For RAVDESS→EmoDB (task 3), the accuracy and UAR obtained by DA-based TL are 3% and 8.6% higher, respectively, than those obtained by the TL method for RAVDESS→EmoDB (task 3).
Figure 3: Confusion matrices for the IEMOCAP→EmoDB ((a) TL, (b) DA-based TL) and RAVDESS→EmoDB ((c) TL, (d) DA-based TL) tasks, with 1: Angry, 2: Boredom, 3: Disgust, 4: Fear, 5: Happy, 6: Sad, 7: Neutral emotions.
We plot confusion matrices on the test data of the target domain (Figure 3) with the same goal as for the evaluation of the source model.
Figure 4: Visualization of features for tasks 1 and 3 (IEMOCAP→EmoDB and RAVDESS→EmoDB) using transfer learning (panels (a) and (c)) and the proposed approach (panels (b) and (d)).
Indeed, some emo-
tions might naturally occur less frequently in real-
world scenarios, making them weakly represented
in the data. Hence, in our experiments, the test
data are randomly sampled to contain some under-
represented emotions that may illustrate real-world
and low-resource scenarios. From the confusion matrices of IEMOCAP→EmoDB (Figure 3(b)) and RAVDESS→EmoDB (Figure 3(d)), we can see that the proposed DA-based TL approach yields a higher classification performance than the TL technique, shown in the confusion matrices of IEMOCAP→EmoDB (Figure 3(a)) and RAVDESS→EmoDB (Figure 3(c)). This reflects the better emotion detection ability of DA-based TL over TL. Comparing Figures 3(b) and 3(d), we observe a general increase in accuracy across emotion classes for the DA-based TL method, usually at the expense of the "disgust" class. Furthermore, we note that the "Happy", "Sad" and "Neutral" classes appear to have the largest gain in accuracy.
Features Visualization: In order to validate the effectiveness of the proposed domain adaptation method for speech emotion recognition, we plot the t-SNE (Van der Maaten and Hinton, 2008) projections of the features extracted from the model obtained by simple transfer learning and of the features generated by the proposed method. We used the FC-4 layer to extract the features, and the points are colored according to the emotion category.
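A minimal sketch of such a visualization with scikit-learn and Matplotlib is given below; the perplexity and colormap are illustrative choices.

```python
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def plot_tsne(features, labels, title):
    """Project FC-layer features to 2D with t-SNE and color points by emotion.
    features: (n_samples, n_dims) array; labels: integer emotion labels."""
    emb = TSNE(n_components=2, perplexity=30, init="pca",
               random_state=0).fit_transform(features)
    scatter = plt.scatter(emb[:, 0], emb[:, 1], c=labels, cmap="tab10", s=8)
    plt.legend(*scatter.legend_elements(), title="Emotion", loc="best")
    plt.title(title)
    plt.show()
```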
From Figure 4, we can see that for the IEMOCAP→EmoDB (task 1) and RAVDESS→EmoDB (task 3) tasks, our proposed approach (Figures 4(b) and 4(d)) separates emotional features from different classes more effectively, and the clusters are better structured, than with the basic transfer learning method (Figures 4(a) and 4(c)). As can be observed in Figures 4(b) and 4(d), the inter-class variance is increased and the features within each emotional class are highly correlated, making the obtained emotional clusters more compact. The proposed adaptive transfer learning method thus helps in learning discriminative multi-language features and hence improves the classification performance of emotions from speech.
5 CONCLUSION
In this paper, we propose a domain-adaptive transfer learning approach for the cross-lingual Speech Emotion Recognition problem. The proposed method jointly aligns domain-invariant features and improves feature quality for better classification in a supervised manner. The proposed method has been evaluated on different transfer learning tasks, namely low-resource scenarios, using the IEMOCAP, RAVDESS and EmoDB datasets. The obtained results reveal that our proposed approach outperforms existing SER methods in terms of both UAR and accuracy. The improvement remains modest, however, and the promising results obtained from this study open avenues for future research in the realm of cross-lingual low-resource speech emotion recognition.
ACKNOWLEDGEMENTS
Support for this research was provided by a grant from La Région Nouvelle-Aquitaine (CPER FEDER P-2019-2022), in partnership with the European Union (FEDER/ERDF, European Regional Development Fund).
REFERENCES
Abdelwahab, M. and Busso, C. (2018). Domain adversarial
for acoustic emotion recognition. IEEE/ACM Trans-
actions on Audio, Speech, and Language Processing,
26(12):2423–2435.
Agarla, M., Bianco, S., Celona, L., Napoletano, P., Petro-
vsky, A., Piccoli, F., Schettini, R., and Shanin, I.
(2022). Semi-supervised cross-lingual speech emo-
tion recognition. arXiv preprint arXiv:2207.06767.
Agarla, M., Bianco, S., Celona, L., Napoletano, P., Petro-
vsky, A., Piccoli, F., Schettini, R., and Shanin, I.
(2024). Semi-supervised cross-lingual speech emo-
tion recognition. Expert Systems with Applications,
237:121368.
Aggarwal, A., Srivastava, A., Agarwal, A., Chahal, N.,
Singh, D., Alnuaim, A. A., Alhadlaq, A., and Lee,
H.-N. (2022). Two-way feature extraction for speech
emotion recognition using deep learning. Sensors,
22(6):2378.
Ahn, Y., Lee, S. J., and Shin, J. W. (2021). Cross-corpus
speech emotion recognition based on few-shot learn-
ing and domain adaptation. IEEE Signal Processing
Letters, 28:1190–1194.
Attabi, Y. and Dumouchel, P. (2013). Anchor models for
emotion recognition from speech. IEEE Transactions
on Affective Computing, 4(3):280–290.
Badshah, A. M., Rahim, N., Ullah, N., Ahmad, J., Muham-
mad, K., Lee, M. Y., Kwon, S., and Baik, S. W.
(2019). Deep features-based speech emotion recog-
nition for smart affective services. Multimedia Tools
and Applications, 78:5571–5589.
Barth, I. (2020). When we talk about emotions and plurilin-
gualism.
Burkhardt, F., Paeschke, A., Rolfes, M., Sendlmeier, W. F.,
Weiss, B., et al. (2005). A database of german emo-
tional speech. In Interspeech, volume 5, pages 1517–
1520.
Busso, C., Bulut, M., Lee, C.-C., Kazemzadeh, A., Mower,
E., Kim, S., Chang, J. N., Lee, S., and Narayanan,
S. S. (2008). Iemocap: Interactive emotional dyadic
motion capture database. Language resources and
evaluation, 42:335–359.
Cai, X., Wu, Z., Zhong, K., Su, B., Dai, D., and Meng,
H. (2021). Unsupervised cross-lingual speech emo-
tion recognition using domain adversarial neural net-
work. In 2021 12th International Symposium on Chi-
nese Spoken Language Processing (ISCSLP), pages
1–5. IEEE.
Deng, J., Zhang, Z., Marchi, E., and Schuller, B. (2013).
Sparse autoencoder-based feature transfer learning for
speech emotion recognition. In 2013 humaine associ-
ation conference on affective computing and intelli-
gent interaction, pages 511–516. IEEE.
El Ayadi, M., Kamel, M. S., and Karray, F. (2011). Sur-
vey on speech emotion recognition: Features, classi-
fication schemes, and databases. Pattern recognition,
44(3):572–587.
Fu, H., Zhuang, Z., Wang, Y., Huang, C., and Duan, W.
(2023). Cross-corpus speech emotion recognition
based on multi-task learning and subdomain adapta-
tion. Entropy, 25(1):124.
Gerczuk, M., Amiriparian, S., Ottl, S., and Schuller, B. W.
(2021). Emonet: A transfer learning framework
for multi-corpus speech emotion recognition. IEEE
Transactions on Affective Computing.
Gideon, J., McInnis, M. G., and Provost, E. M. (2019).
Improving cross-corpus speech emotion recognition
with adversarial discriminative domain generalization
(addog). IEEE Transactions on Affective Computing,
12(4):1055–1068.
Goel, S. and Beigi, H. (2020). Cross lingual cross cor-
pus speech emotion recognition. arXiv preprint
arXiv:2003.07996.
Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., and Smola, A. (2006). A kernel method for the two-sample-problem. Advances in neural information processing systems, 19.
Hu, J., Ruder, S., Siddhant, A., Neubig, G., Firat, O.,
and Johnson, M. (2020). Xtreme: A massively mul-
tilingual multi-task benchmark for evaluating cross-
lingual generalisation. In International Conference on
Machine Learning, pages 4411–4421. PMLR.
Huang, Z., Dong, M., Mao, Q., and Zhan, Y. (2014). Speech
emotion recognition using cnn. In Proceedings of the
22nd ACM international conference on Multimedia,
pages 801–804.
Huijuan, Z., Ning, Y., and Ruchuan, W. (2023). Improved
cross-corpus speech emotion recognition using deep
local domain adaptation. Chinese Journal of Electron-
ics, 32(3):1–7.
Kexin, Z. and Yunxiang, L. (2023). Speech emotion recog-
nition based on transfer emotion-discriminative fea-
tures subspace learning. IEEE Access.
Khalil, R. A., Jones, E., Babar, M. I., Jan, T., Zafar, M. H.,
and Alhussain, T. (2019). Speech emotion recogni-
tion using deep learning techniques: A review. IEEE
Access, 7:117327–117345.
Latif, S., Qadir, J., and Bilal, M. (2019). Unsupervised ad-
versarial domain adaptation for cross-lingual speech
emotion recognition. In 2019 8th international con-
ference on affective computing and intelligent inter-
action (ACII), pages 732–737. IEEE.
Latif, S., Rana, R., Khalifa, S., Jurdak, R., and Schuller,
B. W. (2022). Self supervised adversarial do-
main adaptation for cross-corpus and cross-language
speech emotion recognition. IEEE Transactions on
Affective Computing.
Latif, S., Rana, R., Younis, S., Qadir, J., and Epps, J. (2018).
Transfer learning for improving speech emotion clas-
sification accuracy. arXiv preprint arXiv:1801.06353.
Lech, M., Stolar, M., Best, C., and Bolia, R. (2020). Real-
time speech emotion recognition using a pre-trained
image classification network: Effects of bandwidth re-
duction and companding. Frontiers in Computer Sci-
ence, 2:14.
Liu, J., Zheng, W., Zong, Y., Lu, C., and Tang, C. (2020).
Cross-corpus speech emotion recognition based on
deep domain-adaptive convolutional neural network.
IEICE TRANSACTIONS on Information and Systems,
103(2):459–463.
Liu, S., Zhang, M., Fang, M., Zhao, J., Hou, K., and
Hung, C.-C. (2021). Speech emotion recognition
based on transfer learning from the facenet frame-
work. The Journal of the Acoustical Society of Amer-
ica, 149(2):1338–1345.
Livingstone, S. R. and Russo, F. A. (2018). The ryerson
audio-visual database of emotional speech and song
(ravdess): A dynamic, multimodal set of facial and
vocal expressions in north american english. PloS one,
13(5):e0196391.
Lugović, S., Dunder, I., and Horvat, M. (2016). Techniques and applications of emotion recognition in speech. In 2016 39th international convention on information and communication technology, electronics and microelectronics (MIPRO), pages 1278–1283. IEEE.
Ma, A., Filippi, A. M., Wang, Z., and Yin, Z. (2019).
Hyperspectral image classification using similarity
measurements-based deep recurrent neural networks.
Remote Sensing, 11(2):194.
McFee, B., Raffel, C., Liang, D., Ellis, D. P., McVicar, M.,
Battenberg, E., and Nieto, O. (2015). librosa: Audio
and music signal analysis in python. In Proceedings
of the 14th python in science conference, volume 8,
pages 18–25.
Padi, S., Sadjadi, S. O., Sriram, R. D., and Manocha, D.
(2021). Improved speech emotion recognition using
transfer learning and spectrogram augmentation. In
Proceedings of the 2021 international conference on
multimodal interaction, pages 645–652.
Parry, J., Palaz, D., Clarke, G., Lecomte, P., Mead, R.,
Berger, M., and Hofer, G. (2019). Analysis of deep
learning architectures for cross-corpus speech emo-
tion recognition. In Interspeech, pages 1656–1660.
Rezaeianjouybari, B. and Shang, Y. (2020). Deep learn-
ing for prognostics and health management: State of
the art, challenges, and opportunities. Measurement,
163:107929.
Scheidwasser-Clow, N., Kegler, M., Beckmann, P., and Cer-
nak, M. (2022). Serab: A multi-lingual benchmark for
speech emotion recognition. In ICASSP 2022-2022
IEEE International Conference on Acoustics, Speech
and Signal Processing (ICASSP), pages 7697–7701.
IEEE.
Senthilkumar, N., Karpakam, S., Devi, M. G., Balakumare-
san, R., and Dhilipkumar, P. (2022). Speech emotion
recognition based on bi-directional lstm architecture
and deep belief networks. Materials Today: Proceed-
ings, 57:2180–2184.
Sharma, M. (2022). Multi-lingual multi-task speech emo-
tion recognition using wav2vec 2.0. In ICASSP 2022-
2022 IEEE International Conference on Acoustics,
Speech and Signal Processing (ICASSP), pages 6907–
6911. IEEE.
Singh, P., Saha, G., and Sahidullah, M. (2021). Non-
linear frequency warping using constant-q transforma-
tion for speech emotion recognition. In 2021 Interna-
tional Conference on Computer Communication and
Informatics (ICCCI), pages 1–6. IEEE.
Van der Maaten, L. and Hinton, G. (2008). Visualizing data
using t-sne. Journal of machine learning research,
9(11).
Yang, Z., Liu, G., Xie, X., and Cai, Q. (2020). Efficient
dynamic domain adaptation on deep cnn. Multimedia
Tools and Applications, 79(45):33853–33873.
Ye, J., Wei, Y., Wen, X.-C., Ma, C., Huang, Z., Liu, K., and
Shan, H. (2023). Emo-dna: Emotion decoupling and
alignment learning for cross-corpus speech emotion
recognition. arXiv preprint arXiv:2308.02190.
Zhang, J., Jiang, L., Zong, Y., Zheng, W., and Zhao, L.
(2021a). Cross-corpus speech emotion recognition us-
ing joint distribution adaptive regression. In ICASSP
2021-2021 IEEE International Conference on Acous-
tics, Speech and Signal Processing (ICASSP), pages
3790–3794. IEEE.
Zhang, S., Liu, R., Tao, X., and Zhao, X. (2021b). Deep
cross-corpus speech emotion recognition: Recent ad-
vances and perspectives. Frontiers in Neurorobotics,
15.
ZHAO, H., YE, N., and WANG, R. (2023). Improved cross-
corpus speech emotion recognition using deep local
domain adaptation. Chinese Journal of Electronics,
32(3):1–7.
Zhao, Z., Zhao, Y., Bao, Z., Wang, H., Zhang, Z., and Li,
C. (2018). Deep spectrum feature representations for
speech emotion recognition. In Proceedings of the
Joint Workshop of the 4th Workshop on Affective So-
cial Multimedia Computing and first Multi-Modal Af-
fective Computing of Large-Scale Multimedia Data,
pages 27–33.