Emotion Recognition from Speech: A Survey
Georgios Drakopoulos, George Pikramenos, Evaggelos Spyrou and Stavros J. Perantonis
NCSR “Demokritos”, Athens, Greece
Keywords:
Emotion Recognition, Affective Computing, Signal Processing, Cepstrum, MFCC Coefficients, Deep Neural
Networks, Time Series Analysis, Hidden Markov Chains, Higher Order Data.
Abstract:
Emotion recognition from speech signals is an important field in its own right as well as a mainstay of many multimodal sentiment analysis systems. The latter may well include a broad spectrum of modalities which are strongly associated with consciously or subconsciously communicating human emotional state, such as visual cues, gestures, body postures, gait, or facial expressions. Typically, emotion discovery from speech signals not only requires considerably lower computational complexity than other modalities, but in the overwhelming majority of studies the inclusion of the speech modality also increases the accuracy of the overall emotion estimation process. The principal algorithmic cornerstones of emotion estimation from speech signals are Hidden Markov Models, time series modeling, cepstrum processing, and deep learning methodologies, the latter two being prime examples of higher order data processing. Additionally, the best known datasets which serve as emotion recognition benchmarks are described.
1 INTRODUCTION
Emotion discovery from speech offers a unique tool
for estimating with considerable accuracy human affective state at remarkably low computational complexity, especially when compared with the task of human activity discovery based on video. Affective states reveal an individual's subjective understanding of an external stimulus or condition and, moreover, may well serve as intention indicators, since
emotions are the true driving force behind many hu-
man actions or reactions. Additionally, affective state
estimation plays an important role in cognitive sci-
ences (Cowie et al., 2001) and affective computing
(Picard, 2003)(Tao and Tan, 2005).
The primary objective of this conference paper is to identify the main research pillars in the field of emotion discovery from speech, to illustrate the differences between them, and to present some of the main scientific works underlying each such pillar. Moreover, as a secondary objective, some
of the most popular online multimodal datasets which
include speech are presented.
The remainder of this work is structured as fol-
lows. The recent scientific literature is briefly re-
viewed in section 2. The primary methodological
frameworks for affective state estimation are pre-
sented in section 3. In section 4 the major public
datasets concerning emotion recognition from speech
are described, whereas in section 5 future research di-
rections are explored. Finally, in Table 1 the notation
of this conference paper is summarized.
Table 1: Paper Notation.
Symbol  Meaning
$\triangleq$  Equality by definition
$\{ s_1, \ldots, s_n \}$  Set with elements $s_1, \ldots, s_n$
card(S)  Set cardinality
$| X(e^{j\omega}) |$  Magnitude of $X(e^{j\omega})$
$\mathcal{F}[x(t)]$  Fourier transform of $x(t)$
$\mathcal{F}^{-1}[X(e^{j\omega})]$  Inverse FT of $X(e^{j\omega})$
$\langle x_1(t) \mid x_2(t) \rangle$  Inner product of $x_1(t)$ and $x_2(t)$
2 PREVIOUS WORK
Emotion recognition has taken many forms in the scientific literature, both as part of the broader humanistic data mining field and in its own right (Kwon et al., 2003). Many engineering approaches to emotion discovery follow a black box approach and rely on observing emotional traits, such as facial expressions, as in (Goldman and Sripada, 2005)(Haxby
et al., 2002). Moreover, multifactor facial analysis
with tensor classification (Vasilescu and Terzopoulos,
2002)(Tian et al., 2012) or tensor subspace (Cai et al.,
2005)(Cichocki et al., 2015) algorithms have been ex-
plored. Inclusion of more traits has led to bimodal
(De Silva and Ng, 2000) and multimodal (Busso et al.,
2004)(Kim et al., 2004) emotion estimators based on
various physiological signs.
Alternatively, a plethora of white box approaches
have also been proposed. Patterns in ECG (Agrafioti
et al., 2012) or EEG waveforms (Mohammadi et al.,
2017)(Murugappan et al., 2010) have been used to de-
duce affective states. This can be also achieved indi-
rectly with the BCI proposed in (Mathe and Spyrou,
2016) as part of an IoT ecosystem.
Among human activities and creations, art and especially music have close ties with the underlying affective states (Li and Ogihara, 2004). Emotion discovery in music is the focus of (Busso et al., 2009). The affective results of music are explored in (Kim and André, 2008), (Lin et al., 2010), and (Yang et al., 2008), the last of which proposes a regression approach over the arousal and valence affective dimensions. The ef-
fects of acoustic features such as jitter and shimmer
are evaluated in (Li et al., 2007) and (Jin et al., 2015).
Finally, (Elfenbein and Ambady, 2002) is a meta-
analysis of the connection between affective states
and music based on cultural factors.
Affective states can also play a central role in on-
line social network analysis and specifically in in-
formation diffusion and digital influence. Applying
a modified version of the methodology proposed in
(Drakopoulos et al., 2017b) for an extension of term-
document matrix to a term-keyword-document third
order tensor, (Drakopoulos et al., 2017a) formulates
higher order influence metrics which can be easily
extended to include affective information. The same
methodology can be applied to assess the compact-
ness of spatio-linguistic online communities as dis-
covered in (Drakopoulos et al., 2019) or to communi-
ties defined by multiple interaction paths (Drakopou-
los, 2016).
The primary emotions according to the ground-
breaking theory developed by Plutchik, as stated
among others in (Plutchik et al., 1979), (Plutchik,
1980), and (Plutchik, 2001), are joy, trust, expecta-
tion, fear, sadness, disgust, anger, and surprise (Wall-
bott and Scherer, 1986). Based on this theory, sec-
ondary and tertiary emotions can be derived by su-
perimposing the above primary emotions in various
scales (Lane et al., 1996). For instance, under this
model love is derived as the composition of joy and
trust. Moreover, it should also be noted that, even
though it is not an emotion per se, the neutral state
is a valid emotional state. Note that in other contexts
other emotions may well form the basis for research
Figure 1: Emotion wheel (from (Plutchik, 2001)).
(Kohler et al., 2000). One such prominent case is ed-
ucation, where pride, remorse, boredom, or guilt are
sought to be invoked or detected during teaching ac-
tivity (Jerritta et al., 2011)(Spyrou et al., 2018).
3 EMOTION RECOGNITION
FROM SPEECH
3.1 Hidden Markov Models
Hidden Markov Models (HMMs) are almost synony-
mous with speech processing and they come in two
flavors, depending on the amount of observable in-
formation known to the researcher (El Ayadi et al.,
2011). Let S denote the state set and n = card(S) its cardinality:

$S \triangleq \{ s_1, \ldots, s_n \}$ (1)

Additionally, for each state $s_i \in S$ let $S_i$ be the set of all possible outbound transitions from $s_i$ in one step:

$S_i \triangleq \{ s_i \rightarrow s_j \mid s_j \in S \}, \quad 1 \le i \le n$ (2)

Finally, let P contain the individual transition probabilities:

$P \triangleq \{ \operatorname{prob}(s_i \rightarrow s_j) \mid s_i, s_j \in S \}$ (3)

The two variants of HMMs are:
• S and $S_i$ are known and the elements of P must be estimated, usually by statistical methods including classical or Bayesian estimation (a minimal sketch of this case follows the list).
• S, and by extension $S_i$, are unknown. The cardinality card(S) may have to be estimated as well, depending on the problem. In this case the sets S, $S_i$, and P must be estimated only from the output symbols. Typically this is achieved with mutual information or other divergence metrics (Bahl et al., 1986).
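To make the first variant concrete, the following minimal Python sketch estimates the transition probabilities in P by classical maximum likelihood, i.e. by counting observed transitions in fully labeled state sequences and normalizing per originating state. The state names and the example sequences are hypothetical placeholders and not part of any of the surveyed systems.

from collections import defaultdict

def estimate_transition_probs(sequences):
    # Classical (maximum likelihood) estimate of prob(s_i -> s_j)
    # from fully observed state sequences, i.e. the first HMM variant.
    counts = defaultdict(lambda: defaultdict(int))
    for seq in sequences:
        for current, nxt in zip(seq, seq[1:]):
            counts[current][nxt] += 1
    return {
        state: {nxt: c / sum(outgoing.values()) for nxt, c in outgoing.items()}
        for state, outgoing in counts.items()
    }

# Hypothetical labeled sequences over three emotional states.
sequences = [
    ["neutral", "joy", "joy", "neutral"],
    ["neutral", "anger", "anger", "neutral", "joy"],
]
print(estimate_transition_probs(sequences))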
In (Schuller et al., 2003) two methodologies for
estimating the parameters of an HMM corresponding
to six basic emotional states are presented. The first
is a mixture of Gaussians model whose local maxima
are functions of said states weighted by the proba-
bilities of a number of features including pitch-related
statistics such as its location and its maximum abso-
lute deviation in the voice sample. The second way is
based on embedding local emotional distributions and
computing their maximum using the same set of fea-
tures. Gaussian mixture models are also considered
in (Li et al., 2013a), but in conjunction with restricted
Boltzmann models. For a versatile and persistent data
structure which can represent HMMs with a variable
number of states see (Kontopoulos and Drakopoulos,
2014).
HMMs act as classifiers among the archetypal
emotions of anger, disgust, fear, joy, sadness, and sur-
prise in (Nwe et al., 2003), where each speech signal is decomposed into its short-time log frequency power co-
efficients. The selection of these particular features
leads to improved accuracy compared to representa-
tions based on fundamental frequency, the ratio be-
tween silence and speech, and energy contour.
3.2 Deep Learning
Neural networks of different configurations acting as
affective classifiers have been proposed. For instance,
(Kobayashi and Hara, 1992) considers feedforward
neural networks, (Sprengelmeyer et al., 1998) and
(Adolphs, 2002) explore the human neural substrates involved in emotion recognition based on findings from neurophysiological disorders,
and (Nicholson et al., 2000) considers a neural net-
work trained by phoneme balanced words to distin-
guish between eight emotional states. Estimating
emotion from speech is also the goal of the neural
network architecture proposed in (Bhatti et al., 2004).
This is extended to deep neural networks with mul-
tiple hidden layers and a combination of activation
functions in (Stuhlsatz et al., 2011). Multimodal ar-
chitectures based on facial expressions and speech are
the focus of (Ioannou et al., 2005) and (Kahou et al.,
2013). Moreover, (Lin and Wei, 2005) suggests us-
ing a Support Vector Machine (SVM) over HMM in
order to increase the accuracy of recognizing basic
emotional states from speech signals. SVMs are also
considered in the context of a smart home ecosystem
where distributed IoT processing can assess the affec-
tive state of its inhabitants and adapt accordingly (Pan
et al., 2012).
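While (Lin and Wei, 2005) and (Pan et al., 2012) do not prescribe a single implementation, a typical SVM pipeline of this kind can be sketched as follows with scikit-learn, assuming utterance-level feature vectors (for instance MFCC statistics) have already been extracted; the feature matrix X and label vector y below are synthetic placeholders.

import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic placeholder data: 200 utterances described by 39-dimensional
# feature vectors (e.g. per-utterance MFCC means and variances), each
# labeled with one of four emotional states.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 39))
y = rng.integers(0, 4, size=200)

# Standardize the features and classify with an RBF-kernel SVM.
clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0, gamma="scale"))
print("cross-validated accuracy:", cross_val_score(clf, X, y, cv=5).mean())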
Extreme learning machines (ELMs) proposed
among others in (Han et al., 2014) have a simple architecture comprising a single wide hidden layer of neurons with non-linear activation functions. This
allows for closed form expressions connecting ELM
input and output to be constructed (Huang et al.,
2006), which in turn leads to an easy and controllable
training process. Specifically:
• The k-th input neuron forwards the k-th component $x_k$ of the current training vector.
• The i-th hidden neuron receives $w^{0}_{k,i} x_k$, sums each such stimulus, subtracts its threshold $\beta_i$, and drives the result to its non-linear activation function $\psi(\cdot)$.
• The j-th output neuron $y_j$ repeats this process with its own synaptic weights $w^{1}_{i,j}$, non-linear activation function $\varphi(\cdot)$, and threshold $\beta_j$ to generate:

$y_j \triangleq \varphi\left( \sum_{i} w^{1}_{i,j}\, \psi\left( \sum_{k} w^{0}_{k,i}\, x_k - \beta_i \right) - \beta_j \right)$ (4)
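A minimal numpy sketch of an ELM along the lines of equation (4) follows: the input weights $w^{0}$ and the hidden thresholds are drawn at random and never trained, while the output weights $w^{1}$ are obtained in closed form through a least squares fit (Huang et al., 2006). For simplicity the output activation $\varphi(\cdot)$ is taken to be the identity and the output thresholds are omitted; the layer sizes and the synthetic data are placeholders.

import numpy as np

rng = np.random.default_rng(42)

# Placeholder training data: n samples, d input features, c output classes.
n, d, c, hidden = 500, 20, 4, 200
X = rng.normal(size=(n, d))
T = np.eye(c)[rng.integers(0, c, size=n)]   # one-hot targets

# Random, untrained input weights w0 and hidden thresholds beta_i.
W0 = rng.normal(size=(d, hidden))
beta = rng.normal(size=hidden)

# Hidden layer activations psi(X W0 - beta), here with a sigmoid psi.
H = 1.0 / (1.0 + np.exp(-(X @ W0 - beta)))

# Closed-form output weights w1 via the Moore-Penrose pseudoinverse.
W1 = np.linalg.pinv(H) @ T

# Predictions on the training data (identity output activation phi).
predictions = np.argmax(H @ W1, axis=1)
print("training accuracy:", (predictions == T.argmax(axis=1)).mean())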
3.3 Signal Processing
Signal processing of a speech signal, either on its original time domain waveform x(t) or on any of various transformations thereof, can yield important information regarding the speaker's emotional state. In the time domain, a statistical method for distinguishing be-
tween joy, anger, sadness, fear, and neutral emotional
state based on speech signal characteristics such as
maximum pitch and maximum absolute pitch diver-
gence is presented in (Petrushin, 2000).
In (Wu et al., 2011) an elaborate filter bank based
on modulations and Hilbert envelope is proposed in
order to extract from speech auditory-inspired long-
term spectro-temporal features which contain vital
temporal and spectral acoustic information, without
resorting to short-term features. The study of cep-
strum ˜x (t), and in particular of its MFCC coefficients,
of the speech signal x (t) appears in many works. Re-
call that:
$\tilde{x}(t) \triangleq \left( \mathcal{F}^{-1}\left[ \ln \left| \mathcal{F}[x(t)] \right|^{2} \right] \right)^{2}$ (5)
The amplitudes of the coefficients of the discrete cosine transform of $\tilde{x}(t)$ are the cepstral or MFCC coefficients of $x(t)$.
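As an illustration of equation (5), the following minimal numpy/scipy sketch computes the power cepstrum of a single speech frame and then takes its DCT to obtain cepstral coefficients; note that a full MFCC front end would additionally apply a mel-scaled filter bank, which is omitted here for brevity. The synthetic frame and the number of retained coefficients are placeholders.

import numpy as np
from scipy.fft import fft, ifft, dct

# Placeholder speech frame: 25 ms at 16 kHz (400 samples) of synthetic data.
rng = np.random.default_rng(0)
frame = rng.normal(size=400) * np.hanning(400)

# Power cepstrum per equation (5): square of the inverse FT
# of the log power spectrum of the frame.
log_power_spectrum = np.log(np.abs(fft(frame)) ** 2 + 1e-12)
cepstrum = np.real(ifft(log_power_spectrum)) ** 2

# Cepstral coefficients: DCT of the cepstrum, keeping the first 13.
coefficients = dct(cepstrum, type=2, norm="ortho")[:13]
print(coefficients)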
Power cepstrum coefficients are used in (Sato and
Obuchi, 2007) instead of prosodic features in order to
express the speech signal in scales which are closer to those of human audio perception. Specifically, phonetic features, expressed in the form of cepstral coefficients, increase classification accuracy over
an utterance. Along a similar line of reasoning, in
(Dumouchel et al., 2009) the cepstral coefficients of
Gaussian mixture models are used in order to discern
between basic emotions.
Another approach is the wavelet transform, which reveals information about x(t) at multiple time scales, potentially discovering more patterns than those of the classical Fourier transform (Daubechies, 1990)(Antonini et al., 1992). The central idea of the wavelet transform is to use a family of basis functions indexed by location $\mu_0$ and scale $\sigma_0$ parameters. Although there is a plethora of wavelet bases such as the Morlet, the log-Gabor, or the Haar families, perhaps the most common example is the Gaussian kernel family:
$g(t; \mu_0, \sigma_0) \triangleq \frac{1}{\sigma_0 \sqrt{2\pi}} \exp\left( -\frac{(t - \mu_0)^2}{2 \sigma_0^2} \right)$ (6)
The projection of x(t) onto various members of the wavelet basis family yields the wavelet coefficients:

$w_{\mu_0, \sigma_0} \triangleq \langle x(t) \mid g(t; \mu_0, \sigma_0) \rangle = \int x(t)\, g(t; \mu_0, \sigma_0)\, dt$ (7)
Wavelet transform is combined with cepstral co-
efficients and the subband based cepstral parameter in
(Kishore and Satish, 2013) for affective state estima-
tion. Furthermore, multimodal emotion recognition
from video and speech through wavelets is described
in (Go et al., 2003), with the note that the addition of
speech to video increases classification accuracy.
4 DATASETS
4.1 Audio Datasets
Perhaps the most well known English dataset is
the Toronto Emotional Speech Set (Dupuis and
Pichora-Fuller, 2010) which contains audio only
data collected at Northwestern University accord-
ing to the Auditory Test Protocol no. 6 of the same
university. Two professional actresses, aged 26 and 64 years, born and raised in the Toronto area, uttered 2800 words corresponding to
seven basic emotional states.
RAVDESS (Livingstone et al., 2012) is also a
popular dataset consisting of 7356 files, each of
which has been evaluated 247 times by an equal
number of North American volunteers in terms
of sentimental validity and intensity. Moreover,
72 additional participants evaluated the same data
based on a test-retest methodology. Both the be-
havior of each individual evaluator and the evalu-
ations themselves were remarkably consistent.
The Emo-Soundscapes collection (Fan et al.,
2017) contains two benchmark protocols as well
as 613 sound clips coming from a combination of
600 music files from www.freesound.org. Along
with the original files there are 1213 music ex-
cerpts under the Creative Commons license. Each
such excerpt lasts six seconds and its induced
emotional intensity was evaluated through crowd-
sourcing by 1182 people from 74 countries around
the globe.
The Speech Under Simulated and Actual Stress
(SUSAS) (Hansen and Bou-Ghazale, 1997) was
created by University of Colorado-Boulder and
the US Air Force Research Laboratory. SUSAS
contains more than 16000 emotionally charged
sentences spoken under various stress levels from
32 speakers (13 men and 19 women) of ages be-
tween 22 and 76. Moreover, SUSAS includes
large files with communications from four Apache
helicopter pilots with the control tower along
with their transcripts from the Linguistic Data
Consortium under the name SUSAS Transcripts
(LDC99T33).
Emotionally charged speech datasets are available
for a number of other languages as well, most of
which belong to the Indo-European language family.
Such datasets allow researchers to exclude linguistic
or cultural factors from the discovery process, focus-
ing only on the emotional content.
The FAU Aibo emotion corpus (Batliner et al., 2008) has been created from the use of Aibo, a Sony pet robot, by 51 German children aged between 10 and 13 years. This corpus contains spontaneous and emotionally charged commands which have been collected and broken down into elementary parts based on syntax and prosody, each of which was manually assigned one out of 11 possible emotional labels by five evaluators.
A second dataset is BAUM-1 (Zhalehpour et al.,
2016) made up of short video clips, each approx-
imately four minutes long. In each such clip Ger-
man male professional actors utter sentences cor-
responding to a plethora of emotional states in-
cluding joy, anger, sadness, disgust, fear, surprise,
confusion, disdain, and annoyance as well as the
neutral state.
A smaller German dataset is the Berlin database
of emotional speech (Burkhardt et al., 2005)
which contains 500 samples of speech uttered by
ten professional actors. Each such sample represents one of six emotional states.
The RML emotion dataset (Wang and Guan,
2008) from Ryerson Multimedia Lab consists of
720 audio-visual clips, each lasting from 3 to 6
seconds and with a single emotional charge out of
six possible basic emotional states. There are at least ten video clips each for anger, disgust, fear, joy, sadness, and surprise. In order for the eight volunteers to utter sentences which genuinely contain a given emotion, they were asked to recall an incident from their own lives which caused that emotion. Moreover, they are native speakers of either
English, Mandarin Chinese, Farsi, Italian, Urdu,
or Panjabi.
4.2 Multimodal Datasets
In contrast to audio-only datasets, multimodal ones
allow the separate examination of how isolated
modalities like speech, text, facial expressions, gait,
or body posture or a combination thereof are involved
in emotion discovery.
CREMA-D (Cao et al., 2014) relies on the hu-
man trait of communicating emotional state thr-
ough voice and facial expression. To this end, the
dataset consists of video clips with speech and fa-
cial expressions covering the spectrum of basic
emotions, namely joy, sadness, anger, fear, dis-
gust, and neutral state from 91 actors of various
nationalities. Each such clip is independently la-
beled regarding the emotion and its intensity thr-
ough crowdsourcing by 2443 evaluators based on
either one of the modalities or both. According to
these evaluations, the neutral state was the easiest
to identify, followed by joy, anger, disgust, fear,
and sadness.
The British Surrey Audio-Visual Expressed Emo-
tion (SAVEE) dataset (Jackson and Haq, 2014)
contains 480 video clips of four British male ac-
tors in seven emotional states. The sentences cor-
responding to these states were selected from the
TIMIT corpus so that each state is equally repre-
sented. Ten human evaluators estimated the state
to create the ground truth for classification algo-
rithms.
The Interactive Emotional Dyadic Motion Cap-
ture (IEMOCAP) (Busso et al., 2008) is a mul-
timodal dataset maintained by USC SAIL Lab
based on approximately twelve hours in total of
audio-visual clips enriched with text from multi-
ple authors. Each such clip records a session between two actors, who either act based on a script or improvise in order to elicit specific emotional reactions. The IEMOCAP clips
have also been annotated by multiple reviewers
in terms of a quadruple containing an emotional
state, valence, activation, and dominance.
The Oulu-CASIA NIR and VIS facial expression
database (Li et al., 2013b) is made up of high reso-
lution images of expressions corresponding to six
emotional states from 80 actors. Said images were
obtained from two imaging systems, one operat-
ing in the visible light spectrum (VIS) and one
in the near infrared spectrum (NIR). Three typ-
ical lighting settings were used, namely normal
office lighting, weak lighting coming only from
computer monitors, and no lighting.
eNTERFACE ‘05 (Martin et al., 2006) is an
audio-visual dataset, where emotional content can
be found in the audio modality, the visual modal-
ity, or both, depending on the case. This allows
the benchmarking of machine learning algorithms
which rely on either one or both modalities. Ad-
ditionally, eNTERFACE documents the assumptions underlying its design, the challenges during its implementation, and how these were eventually addressed.
Finally, the MSP-Improv corpus (Busso et al.,
2017) contains audio-visual recordings of two-
person improvisations out of a pool of twelve pro-
fessional actors. These were specifically designed to cause genuine emotional reactions, on the condition that speech and facial expressions convey different emotions. In this way the corpus allows the study of the factors involved in emotion recognition, a cognitive function known as recombination which relies on all senses and is central to the creation of the final stimulus.
5 CONCLUSIONS
The focus of this conference paper is twofold. First,
the basic methodological schemes for emotion dis-
covery are enumerated. Second, the most popular
audio-only and multimodal datasets which contain
emotional states or estimations thereof are presented.
The latter serve as benchmarks for evaluating the al-
gorithmic performance of emotion estimation tech-
niques, primarily in terms of accuracy and scalability.
Regarding emotion discovery, the advent of deep
learning algorithms is promising in the sense that
not only can such algorithms efficiently handle 6V datasets, but they can also extract non-trivial knowledge from them, with the latter being considerably more concise and structured compared to the original
datasets. Moreover, cross-domain knowledge transfer
methodologies can be used to augment the knowledge
body of a given domain with external elements. Fi-
nally, ontologies or knowledge graphs allow the gen-
eration of formal theorems from ground truth with
reasoners such as Owlready for Python or the seman-
tic engine of Neo4j.
ACKNOWLEDGMENT
This research has been co-financed by the European
Union and Greek national funds through the Oper-
ational Program Competitiveness, Entrepreneurship
and Innovation, under the call RESEARCH CRE-
ATE – INNOVATE (project code: T1EDK-02070).
REFERENCES
Adolphs, R. (2002). Neural systems for recognizing emo-
tion. Current opinion in neurobiology, 12(2):169–
177.
Agrafioti, F., Hatzinakos, D., and Anderson, A. K. (2012).
ECG pattern analysis for emotion detection. IEEE
Transactions on Affective Computing, 3(1):102–115.
Antonini, M., Barlaud, M., Mathieu, P., and Daubechies, I.
(1992). Image coding using wavelet transform. IEEE
Transactions on image processing, 1(2):205–220.
Bahl, L. R., Brown, P. F., De Souza, P. V., and Mercer, R. L.
(1986). Maximum mutual information estimation of
hidden Markov model parameters for speech recogni-
tion. In ICASSP, volume 86, pages 49–52.
Batliner, A., Steidl, S., and Nöth, E. (2008). Releasing
a thoroughly annotated and processed spontaneous
emotional database: The FAU Aibo Emotion Corpus.
In Satellite Workshop of LREC, volume 2008, page 28.
Bhatti, M. W., Wang, Y., and Guan, L. (2004). A neural
network approach for human emotion recognition in
speech. In International symposium on circuits and
systems, volume 2, pages II–181. IEEE.
Burkhardt, F., Paeschke, A., Rolfes, M., Sendlmeier, W. F.,
and Weiss, B. (2005). A database of German emo-
tional speech. In Ninth European Conference on
Speech Communication and Technology.
Busso, C. et al. (2004). Analysis of emotion recognition
using facial expressions, speech and multimodal in-
formation. In International conference on multimodal
interfaces, pages 205–211. ACM.
Busso, C. et al. (2008). IEMOCAP: Interactive emotional
dyadic motion capture database. Language resources
and evaluation, 42(4):335.
Busso, C., Lee, S., and Narayanan, S. (2009). Analysis of
emotionally salient aspects of fundamental frequency
for emotion detection. IEEE transactions on audio,
speech, and language processing, 17(4):582–596.
Busso, C., Parthasarathy, S., Burmania, A., AbdelWahab,
M., Sadoughi, N., and Mower Provost, E. (2017).
MSP-IMPROV: An acted corpus of dyadic interac-
tions to study emotion perception. IEEE Transactions
on Affective Computing, 8(1):67–80.
Cai, D., He, X., and Han, J. (2005). Subspace learning
based on tensor analysis. Technical report, UIUC.
Cao, H., Cooper, D. G., Keutmann, M. K., Gur, R. C.,
Nenkova, A., and Verma, R. (2014). CREMA-D:
Crowd-sourced emotional multimodal actors dataset.
IEEE transactions on affective computing, 5(4):377–
390.
Cichocki, A. et al. (2015). Tensor decompositions for signal
processing applications: From two-way to multiway
component analysis. IEEE Signal Processing Maga-
zine, 32(2):145–163.
Cowie, R., Douglas-Cowie, E., Tsapatsoulis, N., Votsis,
G., Kollias, S., Fellenz, W., and Taylor, J. G. (2001).
Emotion recognition in human-computer interaction.
IEEE Signal processing magazine, 18(1):32–80.
Daubechies, I. (1990). The wavelet transform, time-
frequency localization and signal analysis. IEEE
transactions on information theory, 36(5):961–1005.
De Silva, L. C. and Ng, P. C. (2000). Bimodal emotion
recognition. In International conference on automatic
face and gesture recognition, pages 332–335. IEEE.
Drakopoulos, G. (2016). Tensor fusion of social structural
and functional analytics over Neo4j. In IISA, pages
1–6. IEEE.
Drakopoulos, G. et al. (2017a). Defining and evaluating
Twitter influence metrics: A higher-order approach in
Neo4j. SNAM, 7(1):52:1–52:14.
Drakopoulos, G. et al. (2017b). Tensor-based semantically-
aware topic clustering of biomedical documents.
Computation, 5(3):34.
Drakopoulos, G. et al. (2019). A genetic algorithm for spa-
tiosocial tensor clustering. EVOS, pages 1–11.
Dumouchel, P., Dehak, N., Attabi, Y., Dehak, R., and Bo-
ufaden, N. (2009). Cepstral and long-term features
for emotion recognition. In Annual conference of the
International Speech Communication Association.
Dupuis, K. and Pichora-Fuller, M. K. (2010). Toronto Emo-
tional Speech Set (TESS). University of Toronto, Psy-
chology Department.
El Ayadi, M., Kamel, M. S., and Karray, F. (2011). Sur-
vey on speech emotion recognition: Features, classi-
fication schemes, and databases. Pattern Recognition,
44(3):572–587.
Elfenbein, H. A. and Ambady, N. (2002). On the universal-
ity and cultural specificity of emotion recognition: A
meta-analysis. Psychological Bulletin, 128(2):203.
Fan, J., Thorogood, M., and Pasquier, P. (2017). Emo-
soundscapes: A dataset for soundscape emotion
recognition. In ACII, pages 196–201. IEEE.
Go, H.-J., Kwak, K.-C., Lee, D.-J., and Chun, M.-G.
(2003). Emotion recognition from the facial image
and speech signal. In SICE, volume 3, pages 2890–
2895. IEEE.
Goldman, A. I. and Sripada, C. S. (2005). Simulationist
models of face-based emotion recognition. Cognition,
94(3):193–213.
Han, K., Yu, D., and Tashev, I. (2014). Speech emotion
recognition using deep neural network and extreme
learning machine. In Fifteenth annual conference of
the international speech communication association.
Hansen, J. H. and Bou-Ghazale, S. E. (1997). Getting
started with SUSAS: A speech under simulated and
actual stress database. In Fifth European Conference
on Speech Communication and Technology.
Haxby, J. V., Hoffman, E. A., and Gobbini, M. I. (2002).
Human neural systems for face recognition and social
communication. Biological psychiatry, 51(1):59–67.
Huang, G.-B., Zhu, Q.-Y., and Siew, C.-K. (2006). Extreme
learning machine: Theory and applications. Neuro-
computing, 70(1-3):489–501.
Ioannou, S. V. et al. (2005). Emotion recognition through
facial expression analysis based on a neurofuzzy net-
work. Neural Networks, 18(4):423–435.
Jackson, P. and Haq, S. (2014). Surrey audio-visual ex-
pressed emotion SAVEE database. University of Sur-
rey: Guildford, UK.
Jerritta, S., Murugappan, M., Nagarajan, R., and Wan, K.
(2011). Physiological signals based human emotion
recognition: A review. In International colloquium
on signal processing and its applications, pages 410–
415. IEEE.
Jin, Q., Li, C., Chen, S., and Wu, H. (2015). Speech emo-
tion recognition with acoustic and lexical features. In
ICASSP, pages 4749–4753. IEEE.
Kahou, S. E. et al. (2013). Combining modality spe-
cific deep neural networks for emotion recognition in
video. In ICMI, pages 543–550. ACM.
Kim, J. and André, E. (2008). Emotion recognition based
on physiological changes in music listening. TPAMI,
30(12):2067–2083.
Kim, K. H., Bang, S. W., and Kim, S. R. (2004). Emo-
tion recognition system using short-term monitoring
of physiological signals. Medical and biological en-
gineering and computing, 42(3):419–427.
Kishore, K. K. and Satish, P. K. (2013). Emotion recogni-
tion in speech using MFCC and wavelet features. In
IACC, pages 842–847. IEEE.
Kobayashi, H. and Hara, F. (1992). Recognition of six basic
facial expression and their strength by neural network.
In International workshop on robot and human com-
munication, pages 381–386. IEEE.
Kohler, C. G. et al. (2000). Emotion recognition deficit in
schizophrenia: Association with symptomatology and
cognition. Biological psychiatry, 48(2):127–136.
Kontopoulos, S. and Drakopoulos, G. (2014). A space ef-
ficient scheme for persistent graph representation. In
ICTAI, pages 299–303. IEEE.
Kwon, O.-W., Chan, K., Hao, J., and Lee, T.-W. (2003).
Emotion recognition by speech signals. In Eighth
European conference on speech communication and
technology.
Lane, R. D. et al. (1996). Impaired verbal and nonverbal
emotion recognition in alexithymia. Psychosomatic
medicine, 58(3):203–210.
Li, L., Zhao, Y., Jiang, D., Zhang, Y., Wang, F., Gonzalez,
I., Valentin, E., and Sahli, H. (2013a). Hybrid deep
neural network–Hidden Markov model (DNN-HMM)
based speech emotion recognition. In Humaine Asso-
ciation Conference on Affective Computing and Intel-
ligent Interaction, pages 312–317. IEEE.
Li, S., Yi, D., Lei, Z., and Liao, S. (2013b). The CASIA
NIR-VIS 2.0 face database. In CVPR, pages 348–353.
Li, T. and Ogihara, M. (2004). Content-based music simi-
larity search and emotion detection. In ICASSP, vol-
ume 5, pages V–705. IEEE.
Li, X. et al. (2007). Stress and emotion classification using
jitter and shimmer features. In ICASSP, volume 4,
pages IV–1081. IEEE.
Lin, Y.-L. and Wei, G. (2005). Speech emotion recognition
based on HMM and SVM. In International conference
on machine learning and cybernetics, volume 8, pages
4898–4901. IEEE.
Lin, Y.-P. et al. (2010). EEG-based emotion recognition
in music listening. Transactions on biomedical engi-
neering, 57(7):1798–1806.
Livingstone, S. R., Peck, K., and Russo, F. A. (2012).
RAVDESS: The Ryerson audio-visual database of
emotional speech and song. In Annual meeting of the
Canadian society for brain, behaviour, and cognitive
science, pages 205–211.
Martin, O., Kotsia, I., Macq, B., and Pitas, I. (2006). The
eNTERFACE’05 audio-visual emotion database. In
ICDE, pages 8–8. IEEE.
Mathe, E. and Spyrou, E. (2016). Connecting a con-
sumer brain-computer interface to an internet-of-
things ecosystem. In PETRA, pages 90–95. ACM.
Mohammadi, Z., Frounchi, J., and Amiri, M. (2017).
Wavelet-based emotion recognition system using
EEG signal. Neural Computing and Applications,
28(8):1985–1990.
Murugappan, M., Ramachandran, N., and Sazali, Y. (2010).
Classification of human emotion from EEG using dis-
crete wavelet transform. Journal of biomedical sci-
ence and engineering, 3(04):390.
Nicholson, J., Takahashi, K., and Nakatsu, R. (2000).
Emotion recognition in speech using neural networks.
Neural computing and applications, 9(4):290–296.
Nwe, T. L., Foo, S. W., and De Silva, L. C. (2003). Speech
emotion recognition using hidden Markov models.
Speech Communication, 41(4):603–623.
Pan, Y., Shen, P., and Shen, L. (2012). Speech emotion
recognition using support vector machine. Interna-
tional Journal of Smart Home, 6(2):101–108.
Petrushin, V. A. (2000). Emotion recognition in speech sig-
nal: Experimental study, development, and applica-
tion. In Sixth international conference on spoken lan-
guage processing.
Picard, R. W. (2003). Affective computing: Challenges.
International Journal of Human-Computer Studies,
59(1-2):55–64.
Plutchik, R. (1980). A general psychoevolutionary theory of
emotion. In Theories of emotion, pages 3–33. Elsevier.
Plutchik, R. (2001). The nature of emotions: Human emo-
tions have deep evolutionary roots, a fact that may ex-
plain their complexity and provide tools for clinical
practice. American scientist, 89(4):344–350.
Plutchik, R., Kellerman, H., and Conte, H. R. (1979). A
structural theory of ego defenses and emotions. In
Emotions in personality and psychopathology, pages
227–257. Springer.
Sato, N. and Obuchi, Y. (2007). Emotion recognition using
mel-frequency cepstral coefficients. Information and
Media Technologies, 2(3):835–848.
Schuller, B., Rigoll, G., and Lang, M. (2003). Hidden
Markov model-based speech emotion recognition. In
ICASSP, volume 2, pages II–1. IEEE.
Sprengelmeyer, R., Rausch, M., Eysel, U. T., and Przuntek,
H. (1998). Neural structures associated with recogni-
tion of facial expressions of basic emotions. Proceed-
ings of the Royal Society of London. Series B: Biolog-
ical Sciences, 265(1409):1927–1931.
Spyrou, E. et al. (2018). A non-linguistic approach for hu-
man emotion recognition from speech. In IISA, pages
1–5. IEEE.
Stuhlsatz, A. et al. (2011). Deep neural networks for acous-
tic emotion recognition: Raising the benchmarks. In
ICASSP, pages 5688–5691. IEEE.
Tao, J. and Tan, T. (2005). Affective computing: A review.
In International Conference on Affective computing
and intelligent interaction, pages 981–995. Springer.
Tian, C., Fan, G., Gao, X., and Tian, Q. (2012). Multiview
face recognition: From Tensorface to v-Tensorface
and k-Tensorface. Transactions on Systems, Man, and
Cybernetics, Part B (Cybernetics), 42(2):320–333.
Vasilescu, M. A. O. and Terzopoulos, D. (2002). Multilin-
ear analysis of image ensembles: Tensorfaces. In Eu-
ropean Conference on Computer Vision, pages 447–
460. Springer.
Wallbott, H. G. and Scherer, K. R. (1986). Cues and chan-
nels in emotion recognition. Journal of personality
and social psychology, 51(4):690.
Wang, Y. and Guan, L. (2008). Recognizing human emo-
tional state from audiovisual signals. IEEE transac-
tions on multimedia, 10(5):936–946.
Wu, S., Falk, T. H., and Chan, W.-Y. (2011). Automatic
speech emotion recognition using modulation spectral
features. Speech communication, 53(5):768–785.
Yang, Y.-H., Lin, Y.-C., Su, Y.-F., and Chen, H. H. (2008).
A regression approach to music emotion recognition.
Transactions on audio, speech, and language process-
ing, 16(2):448–457.
Zhalehpour, S., Onder, O., Akhtar, Z., and Erdem, C. E.
(2016). BAUM-1: A spontaneous audio-visual face
database of affective and mental states. IEEE Trans-
actions on Affective Computing, 8(3):300–313.