A Hierarchical Approach for Multilingual Speech Emotion Recognition
Marco Nicolini
and Stavros Ntalampiras
LIM – Music Informatics Laboratory, Department of Computer Science, University of Milan, Italy
Keywords:
Audio Pattern Recognition, Machine Learning, Transfer Learning, Convolutional Neural Network, YAMNet,
Multilingual Speech Emotion Recognition.
Abstract:
This article approaches the Speech Emotion Recognition (SER) problem with the focus placed on multilingual
settings. The proposed solution consists of a hierarchical scheme whose first level identifies the speaker's
gender and whose second level predicts the speaker's emotional state. We experiment with three classifiers of
increasing complexity, i.e. k-NN, transfer learning based on YAMNet, and Bidirectional Long Short-Term
Memory neural networks. Importantly, model learning, validation, and testing consider the full range of the big-
six emotions, while the dataset has been assembled from well-known SER datasets representing six different
languages. For all classifiers, the obtained results reveal differences between classifying the entire dataset and
classifying only female or only male data. Interestingly, a priori gender recognition can boost the overall
classification performance.
1 INTRODUCTION
Speech is fundamental in human-machine communi-
cation because, among others, it is one of the pri-
mary channels for expressing emotions. In this context,
speech emotion recognition (SER) aims at automati-
cally identifying the emotional state of a speaker us-
ing her/his voice and, as such, comprises an important
branch of Affective Computing, which studies and de-
velops systems sensing the emotional state of a user
(Chen et al., 2023).
Emotion plays a vital role in the way we think,
react, and behave: it is a central part of decision mak-
ing, problem-solving, communicating, or even nego-
tiating. Among various applications, emotion recog-
nition plays a crucial part in human health and the
related medical procedure to detect, analyze, and de-
termine the medical conditions of a person. For exam-
ple, SER can be applied to design a medical robot that
provides better health-care services for patients by
continuously monitoring the patients’ emotional state
(Park et al., 2009). Other applications of SER tech-
nologies could be deploying emotionally-aware Hu-
man Computer Interaction solutions (Pavlovic et al.,
1997).
The majority of SER solutions focus on a
single language (Ntalampiras, 2021) and only a few
language-agnostic methods are present in the litera-
ture, for instance the work of Saitta (Saitta and Nta-
lampiras, 2021) or the work of Sharma (Sharma,
2022). A big part of SER research has focused on
finding speech features that are indicative of different
emotions (Tahon and Devillers, 2015), and a variety
of both short-term and long-term features have been
proposed. The emotional space is usually organized
in six emotions: angry, disgust, happy, sad, neutral
and fear (Miller Jr, 2016).
Speech emotions tend to have overlapping fea-
tures, making it difficult to find the correct classi-
fication boundaries. Given the latter, deep learning
methods comprise an interesting solution since they
can automatically discover the multiple levels of rep-
resentations in speech signals (Sang et al., 2018); as
such, there is constantly growing research interest in
applying deep learning-based methods to automatically
learn useful features from emotional speech data. For
instance, Mirsamadi et al. (Mirsamadi et al., 2017) apply
recurrent neural networks to automatically discover
emotionally relevant features from speech and to classify
emotions; Han et al. (Han et al., 2014) apply Deep Neural
Networks to SER; Scheidwasser-Clow et al. establish a
framework for evaluating the performance and generalization
capacity of different SER approaches, but their method is
language-dependent, training the different models (mainly
deep learning methods) on one of the six chosen benchmark
datasets at a time (Scheidwasser-Clow et al., 2022).
The approach in this work focuses on a language-
Figure 1: The block diagram of the classification hierarchy adopted in this work.
agnostic methodology for SER: it tries to generalize
patterns in data in order to distinguish emotions in
every language existing in the employed dataset. A
relevant component of the proposed solution is auto-
matic gender differentiation and how it can improve
the SER performance of the classifiers. Such a direction
is analyzed in the work of Vogt and André (Vogt and André,
2006), where a framework for improving emotion recog-
nition from monolingual speech by making use of au-
tomatic gender detection is presented. The work of
Dair (Dair et al., 2021) also analyzes differences with
and without gender differentiation on three datasets.
Unlike the previous works existing in the related
literature, we design a system considering the full
range of the big-six emotions as expressed in a multi-
lingual setting with six languages.
The block diagram of the proposed approach is
illustrated in Fig. 1, where the following three
classifiers can be observed: a) gender, b) female emotional
speech, and c) male emotional speech. The gender-independent
path is shown as well since it has been employed for
comparison purposes.
The following classifiers have been employed:
a) a k-Nearest Neighbor (k-NN) classifier which, de-
spite its simplicity, is a suitable approach for multi-
class problems (Hota and Pathak, 2018), b) a trans-
fer learning-based classifier built on YAMNet
(https://github.com/tensorflow/models/tree/master/research/audioset/yamnet), and
c) a Bidirectional Long Short-Term Memory (BiL-
STM) neural network classifier. The last two clas-
sifiers belong to the deep learning domain; the first
one relies on a large-scale convolutional network,
while the second is able to encode temporal depen-
dencies existing in the available emotional manifesta-
tions. Last but not least, every model was trained on
appropriate features extracted from time and/or fre-
quency domains.
The rest of this work is organized as follows:
Section 2 explains the construction of a multilingual
dataset facilitating SER purposes. Section 3 briefly
describes the employed features and classifiers, while
Section 4 presents the experimental protocol and ob-
tained results. Finally, in Section 5 we draw our con-
clusions and outline future research directions.
2 CONSTRUCTING THE
MULTILINGUAL SER DATASET
SER literature includes various monolingual datasets,
thus a corpus combining ten different datasets was
formed. More specifically, the following ones have
been employed: 1. SAVEE (Vlasenko et al., 2007),
2. CREMA-D (Cao et al., 2014), 3. RAVDESS (Liv-
ingstone and Russo, 2018), 4. TESS (Pichora-Fuller
and Dupuis, 2020), 5. EMOVO (Costantini et al.,
2014), 6. EmoDB (Burkhardt et al., 2005), 7. ShEMO
(Nezami et al., 2019), 8. URDU (Latif et al., 2018),
9. JLcorpus (James et al., 2018), and 10. AESDD
(Vryzas et al., 2018a; Vryzas et al., 2018b). These
datasets include the so-called ’big six’ (Miller Jr,
2016) emotions which are typically considered in
the related literature. The following six languages
are considered in the final dataset: 1. English (New
Zealand, British, American, and from different ethnic
backgrounds), 2. German, 3. Italian, 4. Urdu, 5. Per-
sian, and 6. Greek.
Table 1 tabulates the duration in seconds for the
various classes included in the dataset. It should
be mentioned that all datasets include acted speech.
We observe only a slight imbalance as regards the
gender aspect, with male emotional speech amounting to
25460 s and female to 25081 s. Moreover, data associated
with the English language comprises the largest part of
the dataset, followed by Persian, while the speech data
representing the remaining languages (German, Italian,
Greek, and Urdu) lies within the range [998 s, 2481 s].
Table 1: Duration (in seconds) of the diverse classes considered in the present work. All values are truncated.

Data part (total)    Angry   Neutral   Sad    Happy   Fear   Disgust
All Data (50541)     10594   11200     9325   7116    5773   6530
Female (25081)       5023    4706      4948   3807    3018   3576
Male (25460)         5570    6493      4377   3309    2754   2953
English (32154)      5475    5464      5772   5174    4765   5502
German (1261)        335     186       251    180     154    154
Italian (1711)       268     261       313    266     283    317
Urdu (998)           248     250       250    250     0      0
Persian (11932)      3832    5037      2175   766     120    0
Greek (2481)         433     0         562    478     449    556
Table 2: Confusion matrix (in %) for gender classification obtained using the YAMNet-based and k-NN approaches. The presentation format is the following: YAMNet/k-NN. The highest accuracy is emboldened.

Presented \ Predicted    Female        Male
Female                   94.3/91.1     5.7/8.9
Male                     4.8/3.6       95.2/96.4
Aiming at a uniform representation, all data has
been resampled to 16 kHz, monophonic wave format.
When building a SER system, this specific dataset
presents various challenges, i.e. a) different languages
exhibiting important cultural gaps, b) imbalances at the
gender, language, and emotional state levels, c) diverse
recording conditions, and d) different recording equipment.
The first two obstacles were addressed by appropri-
ately dividing the data during train, validation, and
test phases so that the obtained models are not biased
to one or more subpopulations existing within the en-
tire corpus. As regards the last two, the proposed
approach aims at creating a standardized representa-
tion of the audio signal so that the effect of recording
conditions and equipment is minimized.
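For illustration, a minimal preprocessing sketch along these lines is given below; it assumes the librosa and soundfile Python packages and hypothetical corpus paths, and is not necessarily the exact pipeline used in our experiments.

import glob
import os
import librosa
import soundfile as sf

def standardize(in_path, out_path, target_sr=16000):
    # librosa.load downmixes to mono and resamples in a single step
    audio, _ = librosa.load(in_path, sr=target_sr, mono=True)
    os.makedirs(os.path.dirname(out_path), exist_ok=True)
    sf.write(out_path, audio, target_sr, subtype="PCM_16")

# "corpus/raw" and "corpus/16k_mono" are placeholder directory names
for in_path in glob.glob("corpus/raw/**/*.wav", recursive=True):
    standardize(in_path, in_path.replace("/raw/", "/16k_mono/"))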
To the best of our knowledge, this is the first time
in the SER literature that the full range of the big-six
emotional states expressed in six different languages
is considered.
3 THE CONSIDERED
CLASSIFICATION MODELS
This section describes briefly the considered classi-
fication models along with their suitably-chosen fea-
ture sets characterizing the available audio data. Im-
portantly, each classification model has been sepa-
rately trained and tested on the following settings:
a) the entire dataset, b) female data, and c) male data.
The division of the data into train, validation, and test
sets has been kept constant across all classifiers so as to
obtain a reliable comparison among the considered
models. At the same time, k-NN and YAMNet-based
models have been trained to distinguish between male
and female utterances.
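The two-level inference of Fig. 1 can be summarized by the following sketch; the model names are placeholders for any of the trained classifiers described next, and each branch is assumed to extract its own feature representation beforehand.

def hierarchical_predict(features, gender_model, female_emotion_model, male_emotion_model):
    # First level: route the utterance according to the predicted gender.
    gender = gender_model.predict([features])[0]      # "female" or "male"
    # Second level: gender-specific emotion classification.
    if gender == "female":
        return female_emotion_model.predict([features])[0]
    return male_emotion_model.predict([features])[0]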
3.1 k-NN
The standard version of the k-NN classifier has been
used with the Euclidean distance as similarity metric.
Despite its simplicity, k-NN has been able to offer
satisfactory performance in SER (Venkata Subbarao
et al., 2022), thus we assessed its performance on the
present challenging multilingual setting.
Feature Extraction. The short-term features feed-
ing the k-NN model are the following: a) zero cross-
ing rate, b) energy, c) energy’s entropy, d) spectral
centroid and spread, e) spectral entropy, f) spectral
flux, g) spectral rolloff, h) MFCCs, i) harmonic ra-
tio, j) fundamental frequency, and k) chroma vectors.
Table 3: Average classification accuracy and balanced accuracy results (in %) using 10-fold evaluation. The presentation
format is the following: accuracy/balanced accuracy. The highest accuracy and balanced accuracy are emboldened.
Data subpopulation k-NN YAMNet BiLSTM
all data 59/59 74.9/46.9 62.0/59.1
female data 65.2/65.1 79.9/52 68.6/67
male data 51.6/51.6 71.7/47.1 56.7/51.3
Table 4: Confusion matrix (in %) for SER obtained using the BiLSTM, k-NN, and YAMNet models with all data. The presentation format is the following: BiLSTM/k-NN/YAMNet. The highest accuracy is emboldened.

Presented \ Predicted    angry    disgust    fear    happy    neutral    sad
angry 83.1/68.7/60.9 3.8/8.9/12.1 0.9/3.2/23.8 5.8/13.4/3.2 5.5/4.8/- 0.9/0.9/-
disgust 11.6/5.0/1.4 45.0/64.3/76.6 4.3/7.4/19.3 4.6/8.0/2.8 16.8/7.3/- 17.7/8.0/-
fear 10.5/6.0/0.6 6.8/17.6/5.1 41.2/47.4/94.3 8.6/10.7/- 7.6/5.2/- 25.3/13.1/-
happy 22.0/13.3/14.8 4.9/14.9/34.4 8.8/8.8/26.2 43.2/50.2/24.6 16.1/9.2/- 5.0/3.5/-
neutral 2.8/2.7/- 5.3/12.3/40 1.0/4.6/31.4 3.2/4.6/17.1 75.6/67.3/8.6 12.1/8.6/2.9
sad 1.5/3.5/- 3.0/11.5/72 4.0/11.5/12 1.8/3.7/16 24.1/13.6/- 65.7/56.2/-
We opted for a mid-term feature extraction pro-
cess based on the short-term features, meaning that mean
and standard deviation statistics of these short-term
features are calculated over mid-term segments. More
information on the adopted feature extraction method
can be found in (Giannakopoulos and Pikrakis, 2014).
Parameterization. Short- and mid-term window
and hop sizes have been determined after a series of
early experiments on the various datasets. The
configuration offering the highest recognition accu-
racy is the following: 0.2 s and 0.1 s for the short-term
window and hop size, and 3.0 s and 1.5 s for the mid-
term window and hop size, respectively. Overall,
both feature extraction levels employ a 50% overlap
between subsequent windows.
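A simplified sketch of this two-level feature computation is given below; it relies on librosa, covers only a subset of the listed short-term features (zero crossing rate, spectral centroid, and MFCCs), and uses the window/hop configuration reported above.

import numpy as np
import librosa

def mid_term_features(path, sr=16000):
    y, _ = librosa.load(path, sr=sr, mono=True)
    n_fft, hop = int(0.2 * sr), int(0.1 * sr)         # 0.2 s window, 0.1 s hop
    zcr = librosa.feature.zero_crossing_rate(y, frame_length=n_fft, hop_length=hop)
    cen = librosa.feature.spectral_centroid(y=y, sr=sr, n_fft=n_fft, hop_length=hop)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13, n_fft=n_fft, hop_length=hop)
    st = np.vstack([zcr, cen, mfcc])                  # (n_short_term_features, n_frames)
    seg, step = 30, 15                                # 3.0 s / 1.5 s expressed in short frames
    stats = []
    for start in range(0, max(st.shape[1] - seg + 1, 1), step):
        chunk = st[:, start:start + seg]
        # mid-term statistics: mean and standard deviation of each short-term feature
        stats.append(np.hstack([chunk.mean(axis=1), chunk.std(axis=1)]))
    return np.vstack(stats)                           # one feature vector per mid-term segment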
Moreover, parameter k has been chosen based on the
ten-fold cross-validation scheme; depending on the
considered data population, the obtained optimal
values range in [5, 21] (see Section 4 for more
information).
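The selection of k can be sketched with scikit-learn as follows, where X denotes the matrix of mid-term feature vectors and y the corresponding labels; the candidate grid is illustrative.

from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

def select_k(X, y, candidates=(5, 7, 9, 11, 13, 15, 17, 19, 21)):
    cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
    # average 10-fold accuracy for every candidate k
    scores = {k: cross_val_score(KNeighborsClassifier(n_neighbors=k, metric="euclidean"),
                                 X, y, cv=cv).mean()
              for k in candidates}
    return max(scores, key=scores.get), scores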
3.2 Transfer Learning Based on
YAMNet
YAMNet is a deep neural network model developed
by Google and trained on 521 classes of generalized
audio events belonging to the AudioSet ontology
(https://research.google.com/audioset/index.html).
As such, the learnt representation may be useful in
diverse audio classification tasks including SER. To
this end, we exploited the embeddings layer of the
model and employed it as a feature set which com-
prises the input to a dense layer with as many neu-
rons as the classes to be predicted (2 genders or 6
emotions). The final prediction is based on a
softmax layer, while a dictionary of class weights can
also be employed to compensate for imbalanced data
classes.
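A minimal sketch of this transfer-learning head is shown below; it assumes the public TensorFlow Hub release of YAMNet, averages the 1024-dimensional frame embeddings into a clip-level vector, and uses an illustrative (not the exact) training configuration.

import tensorflow as tf
import tensorflow_hub as hub

yamnet = hub.load("https://tfhub.dev/google/yamnet/1")

def clip_embedding(waveform_16k):
    # YAMNet returns (scores, embeddings, log-mel spectrogram) for a mono 16 kHz waveform
    _, embeddings, _ = yamnet(waveform_16k)
    return tf.reduce_mean(embeddings, axis=0)         # one 1024-dim vector per clip

n_classes = 6                                         # 2 for the gender model
head = tf.keras.Sequential([
    tf.keras.layers.Dense(n_classes, activation="softmax", input_shape=(1024,)),
])
head.compile(optimizer="adam",
             loss="sparse_categorical_crossentropy",
             metrics=["accuracy"])
# head.fit(embeddings, labels, class_weight=class_weights)  # class weights compensate imbalance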
3.3 Bidirectional LSTM
The long short-term memory (LSTM) network is a specific type
of recurrent neural network, which is particularly ef-
fective in capturing long-term temporal dependencies.
Given that audio signals are characterized by
their evolution in time, such a property may be sig-
nificant, thus such a classifier was included in the ex-
perimental set-up. More specifically, we considered
a bidirectional LSTM (BiLSTM) layer learning bidi-
rectional long-term dependencies in sequential data.
BiLSTMs are an extension of traditional LSTMs that
can improve model performance on sequence classifi-
cation problems (Sajjad et al., 2020). BiLSTMs train
two LSTMs instead of one on the input sequence: the
first on the input sequence as-is and the second on a
reversed copy of it.
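A minimal Keras sketch of such a classifier is given below; the layer sizes are illustrative rather than the exact configuration used in our experiments.

import tensorflow as tf

def build_bilstm(n_features, n_classes=6):
    return tf.keras.Sequential([
        # variable-length sequences of per-frame feature vectors (zero-padded)
        tf.keras.layers.Masking(mask_value=0.0, input_shape=(None, n_features)),
        # forward and backward LSTMs over the input sequence
        tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64)),
        tf.keras.layers.Dense(n_classes, activation="softmax"),
    ])

model = build_bilstm(n_features=45)                   # placeholder feature dimensionality
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])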
Feature Extraction. In this case, the considered
feature sets, able to preserve the temporal evolu-
tion of the available emotional manifestations, were
the following: a) Gammatone cepstral coefficients
(GTCC), b) delta GTCC, c) delta-delta MFCC, d) Mel
spectrum, and e) spectral crest. The window length
has been chosen after extensive experiments per-
formed on the different data subpopulations, e.g.
gender, emotions, etc., while there is no overlap
between subsequent windows.
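For illustration, part of this sequential representation can be computed with librosa as sketched below; GTCC, delta GTCC, and spectral crest are omitted since they require additional toolboxes, and the 25 ms window is a placeholder for the experimentally selected length.

import numpy as np
import librosa

def sequence_features(path, sr=16000, win=0.025):
    y, _ = librosa.load(path, sr=sr, mono=True)
    n_fft = hop = int(win * sr)                       # hop = window length, i.e. no overlap
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=n_fft, hop_length=hop, n_mels=32)
    log_mel = librosa.power_to_db(mel)
    mfcc = librosa.feature.mfcc(S=log_mel, n_mfcc=13)
    dd_mfcc = librosa.feature.delta(mfcc, order=2)    # delta-delta MFCC
    return np.vstack([log_mel, dd_mfcc]).T            # (time steps, features) for the BiLSTM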
Table 5: Confusion matrix (in %) for SER obtained using the BiLSTM, k-NN, and YAMNet models with female data. The presentation format is the following: BiLSTM/k-NN/YAMNet. The highest accuracy is emboldened.

Presented \ Predicted    angry    disgust    fear    happy    neutral    sad
angry 85.4/75.5/63.2 2.4/6.6/16.1 0.9/2.7/19.5 6.6/9.9/1.1 3.6/4/- 1.1/1.3/-
disgust 9.1/7.8/0.8 56.7/67.4/88.4 2.0/4.1/10.1 4.7/5.3/0.8 13.5/7/- 13.9/7/-
fear 7.8/6.7/0.4 5.3/9.4/3.1 51.6/58.7/96.5 8.2/7.6/- 5.3/11.3/- 21.9/11.3/-
happy 16.2/15.5/14.8 4.5/9.7/59.3 6.5/8.7/25.9 57.8/54.8/- 10.1/8/- 4.8/3.3/-
neutral 2.7/3.4/- 8.0/12.1/40 0.5/2.4/50 2.9/2.5/- 74.8/71.7/10 11.1/7.8/-
sad 1.8/4.5/- 3.1/7.4/71.4 3.6/8.6/28.6 1.3/2.7/- 14.6/13.9/- 75.5/63/-
Table 6: Confusion matrix (in %) for SER obtained using the BiLSTM, k-NN, and YAMNet models with male data. The presentation format is the following: BiLSTM/k-NN/YAMNet. The highest accuracy is emboldened.

Presented \ Predicted    angry    disgust    fear    happy    neutral    sad
angry 80.6/64/58.4 3.8/10.5/11.2 0.7/4.6/25.5 6.9/13.6/5 7.2/6.2/- 0.9/1.1/-
disgust 13.8/4.9/1.9 33.0/55.7/66.5 6.5/10.5/27.3 7.2/10.9/4.3 17.4/6.9/- 22.0/11.1/-
fear 14.7/5.7/0.3 6.9/23.8/4.4 30.4/34.6/95.3 11.1/11.5/- 9.1/6.5/- 27.7/18/-
happy 26.1/13/11.8 6.1/18.5/23.5 11.5/10.8/23.5 31.4/41.2/38.2 19.9/12.3/- 5.0/4.3/2.9
neutral 2.7/2.4/- 5.1/11.6/56 1.9/5.2/28 3.5/5.5/16 75.8/65.5/- 11.0/9.6/-
sad 2.2/1.8/- 3.6/14.6/61.1 5.3/13/- 2.5/4.6/33.3 29.6/17.1/- 56.7/48.8/5.6
4 EXPERIMENTAL PROTOCOL
AND RESULTS
We followed the 10-fold cross-validation experimen-
tal protocol, while care was taken so that every classi-
fier operated on identical training, validation, and test-
ing folds. The average accuracies achieved by every
classifier are summarized in Table 3. Furthermore,
Tables 4-6 present the confusion matrices for
gender-independent and gender-dependent SER. Overall, the
following observations can be made.
First, regarding gender discrimination, both YAM-
Net and k-NN perform well, with YAMNet offering
the highest rate, as seen in the respective confusion
matrix (Table 2). Interestingly, such results are com-
parable to the state of the art in gender discrimination
(Chachadi and Nirmala, 2021).
Second, regarding SER, the highest unbalanced ac-
curacy is obtained using the YAMNet model; how-
ever, in the confusion matrices of Tables 4-6, we observe
that YAMNet-based classification performs well for
specific emotional states, e.g. fear, and poorly for oth-
ers, e.g. happy.
Third, we focus on the BiLSTM models: they
manage to outperform the rest of the considered clas-
sifiers, i.e. BiLSTM reaches 62% average accuracy
on all the data, with per-class accuracy measures that
do not fall below 56.1% (Table 4). This may be due
to their ability to capture the temporal dependencies ex-
isting in emotional speech, which are important for
speech processing in general (Latif et al., 2022). At
the same time, the BiLSTM models trained on male or
female data provide satisfactory performance (Tables
5 and 6), with a balanced accuracy of 62.7%.
Fourth, the k-NN model results are not very far from
the BiLSTM ones, while the associated confusion ma-
trices confirm the interesting capacity of the classifier
to distinguish between the various classes. The best k
parameters obtained for the all-data, female, male,
and gender models are k=11, k=21, k=13, and k=5,
respectively. These results are in line with and confirm
the finding that distributed modeling types may be
effective in multilingual settings (Ntalampiras, 2020).
Finally, regarding gender-dependent classifica-
tion, the results show a common pattern found
in the literature (for example, in the work of Vogt and
André (Vogt and André, 2006)), i.e. performance im-
proves when female emotional speech is considered
(by more than 6% for BiLSTM) but degrades for male
speech (e.g. by 6% for BiLSTM). Since gender
discrimination achieved almost perfect accuracy (more
than 94%), the hierarchical classifier combining gender
and emotion recognition can improve the overall
recognition rate over a gender-independent SER system.
In order to enable reliable comparison with other
solutions and full reproducibility, the implementation
of the experiments presented in this paper is publicly
available at https://github.com/NicoRota-0/SER.
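For reference, the reported metrics can be computed per fold along the following scikit-learn-based lines (accuracy, balanced accuracy, and the row-normalized confusion matrix in percent); the repository linked above contains the actual evaluation code.

from sklearn.metrics import accuracy_score, balanced_accuracy_score, confusion_matrix

def fold_metrics(y_true, y_pred, labels):
    acc = accuracy_score(y_true, y_pred)
    bal = balanced_accuracy_score(y_true, y_pred)     # average of per-class recalls
    cm = confusion_matrix(y_true, y_pred, labels=labels, normalize="true") * 100
    return acc, bal, cm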
5 CONCLUSION AND FUTURE
DEVELOPMENTS
In this work, gender-based multilingual speech emo-
tion classification has been analyzed. Importantly, we
proposed a SER algorithm offering state-of-the-art results
while considering the full range of the big-six emo-
tional states as expressed in six languages. Interest-
ingly, it has been demonstrated that a gender-based
emotion classifier can outperform a general emotion
classifier.
Future work could assess the performance reached
by such modeling architectures on each language sep-
arately. Moreover, these models could be part of a
more complex system for recognizing human emotions,
one that also uses biosensors measuring physiological
parameters, e.g. heart rate, given the accelerated spread
of IoT devices, as stated in the work of Pal et al. (Pal et
al., 2021). Additional work could investigate a
one-vs-all emotion classification scheme using the
present models; an example is the work of Saitta et
al. (Saitta and Ntalampiras, 2021). An alternative ap-
proach would be to add a language classifier before
emotion detection (with or without gender detection)
to assess whether it can achieve better results.
REFERENCES
Burkhardt, F., Paeschke, A., Rolfes, M., Sendlmeier, W. F.,
Weiss, B., et al. (2005). A database of german emo-
tional speech. In Interspeech, volume 5, pages 1517–
1520.
Cao, H., Cooper, D. G., Keutmann, M. K., Gur, R. C.,
Nenkova, A., and Verma, R. (2014). Crema-d: Crowd-
sourced emotional multimodal actors dataset. IEEE
transactions on affective computing, 5(4):377–390.
Chachadi, K. and Nirmala, S. R. (2021). Voice-based gen-
der recognition using neural network. In Informa-
tion and Communication Technology for Competitive
Strategies (ICTCS 2020), pages 741–749. Springer
Singapore.
Chen, L., Wang, K., Li, M., Wu, M., Pedrycz, W., and
Hirota, K. (2023). K-means clustering-based kernel
canonical correlation analysis for multimodal emotion
recognition in human–robot interaction. IEEE Trans-
actions on Industrial Electronics, 70(1):1016–1024.
Costantini, G., Iaderola, I., Paoloni, A., and Todisco,
M. (2014). Emovo corpus: an italian emotional
speech database. In International Conference on Lan-
guage Resources and Evaluation (LREC 2014), pages
3501–3504. European Language Resources Associa-
tion (ELRA).
Dair, Z., Donovan, R., and O’Reilly, R. (2021). Lin-
guistic and gender variation in speech emotion
recognition using spectral features. arXiv preprint
arXiv:2112.09596.
Giannakopoulos, T. and Pikrakis, A. (2014). Introduction
to Audio Analysis: A MATLAB Approach. Academic
Press, Inc., USA, 1st edition.
Han, K., Yu, D., and Tashev, I. (2014). Speech emotion
recognition using deep neural network and extreme
learning machine. In Interspeech 2014.
Hota, S. and Pathak, S. (2018). KNN classifier based ap-
proach for multi-class sentiment analysis of twitter
data. International Journal of Engineering and Tech-
nology, 7(3):1372.
James, J., Tian, L., and Watson, C. I. (2018). An open
source emotional speech corpus for human robot inter-
action applications. In INTERSPEECH, pages 2768–
2772.
Latif, S., Qayyum, A., Usman, M., and Qadir, J. (2018).
Cross lingual speech emotion recognition: Urdu vs.
western languages. In 2018 International Conference
on Frontiers of Information Technology (FIT), pages
88–93. IEEE.
Latif, S., Rana, R., Khalifa, S., Jurdak, R., and Schuller,
B. W. (2022). Self supervised adversarial do-
main adaptation for cross-corpus and cross-language
speech emotion recognition. IEEE Transactions on
Affective Computing, pages 1–1.
Livingstone, S. R. and Russo, F. A. (2018). The ryerson
audio-visual database of emotional speech and song
(ravdess): A dynamic, multimodal set of facial and
vocal expressions in north american english. PloS one,
13(5):e0196391.
Miller Jr, H. L. (2016). The Sage encyclopedia of theory in
psychology. SAGE Publications.
Mirsamadi, S., Barsoum, E., and Zhang, C. (2017). Auto-
matic speech emotion recognition using recurrent neu-
ral networks with local attention. In 2017 IEEE Inter-
national Conference on Acoustics, Speech and Signal
Processing (ICASSP), pages 2227–2231. IEEE.
Nezami, O. M., Lou, P. J., and Karami, M. (2019). Shemo: a
large-scale validated database for persian speech emo-
tion detection. Language Resources and Evaluation,
53(1):1–16.
Ntalampiras, S. (2020). Toward language-agnostic speech
emotion recognition. Journal of the Audio Engineer-
ing Society, 68(1/2):7–13.
Ntalampiras, S. (2021). Speech emotion recognition
via learning analogies. Pattern Recognition Letters,
144:21–26.
Pal, S., Mukhopadhyay, S., and Suryadevara, N. (2021).
Development and progress in sensors and tech-
nologies for human emotion recognition. Sensors,
21(16):5554.
Park, J.-S., Kim, J.-H., and Oh, Y.-H. (2009). Feature vec-
tor classification based speech emotion recognition for
service robots. IEEE Transactions on Consumer Elec-
tronics, 55(3):1590–1596.
Pavlovic, V., Sharma, R., and Huang, T. (1997). Visual
interpretation of hand gestures for human-computer
interaction: a review. IEEE Transactions on Pattern
Analysis and Machine Intelligence, 19(7):677–695.
Pichora-Fuller, M. K. and Dupuis, K. (2020). Toronto emo-
tional speech set (TESS). Scholars Portal Dataverse.
Saitta, A. and Ntalampiras, S. (2021). Language-agnostic
speech anger identification. In 2021 44th Interna-
tional Conference on Telecommunications and Signal
Processing (TSP), pages 249–253. IEEE.
Sajjad, M., Kwon, S., et al. (2020). Clustering-based speech
emotion recognition by incorporating learned features
and deep bilstm. IEEE Access, 8:79861–79875.
Sang, D. V., Cuong, L. T. B., and Ha, P. T. (2018). Discrim-
inative deep feature learning for facial emotion recog-
nition. In 2018 1st International Conference on Mul-
timedia Analysis and Pattern Recognition (MAPR),
pages 1–6.
Scheidwasser-Clow, N., Kegler, M., Beckmann, P., and Cer-
nak, M. (2022). Serab: A multi-lingual benchmark for
speech emotion recognition. In ICASSP 2022-2022
IEEE International Conference on Acoustics, Speech
and Signal Processing (ICASSP), pages 7697–7701.
IEEE.
Sharma, M. (2022). Multi-lingual multi-task speech emo-
tion recognition using wav2vec 2.0. In ICASSP 2022-
2022 IEEE International Conference on Acoustics,
Speech and Signal Processing (ICASSP), pages 6907–
6911. IEEE.
Tahon, M. and Devillers, L. (2015). Towards a small set of
robust acoustic features for emotion recognition: chal-
lenges. IEEE/ACM transactions on audio, speech, and
language processing, 24(1):16–28.
Venkata Subbarao, M., Terlapu, S. K., Geethika, N., and
Harika, K. D. (2022). Speech emotion recognition us-
ing k-nearest neighbor classifiers. In Shetty D., P. and
Shetty, S., editors, Recent Advances in Artificial Intel-
ligence and Data Engineering, pages 123–131, Singa-
pore. Springer Singapore.
Vlasenko, B., Schuller, B., Wendemuth, A., and Rigoll,
G. (2007). Combining frame and turn-level informa-
tion for robust recognition of emotions within speech.
pages 2249–2252.
Vogt, T. and André, E. (2006). Improving automatic emo-
tion recognition from speech via gender differenti-
ation. In LREC, pages 1123–1126.
Vryzas, N., Kotsakis, R., Liatsou, A., Dimoulas, C. A., and
Kalliris, G. (2018a). Speech emotion recognition for
performance interaction. Journal of the Audio Engi-
neering Society, 66(6):457–467.
Vryzas, N., Matsiola, M., Kotsakis, R., Dimoulas, C., and
Kalliris, G. (2018b). Subjective evaluation of a speech
emotion recognition interaction framework. In Pro-
ceedings of the Audio Mostly 2018 on Sound in Im-
mersion and Emotion, pages 1–7.