Our proposed model is based on the Connectionist Temporal Classification (CTC) loss. It takes a long sequence of data as input, processes it through a Convolutional Neural Network (CNN) followed by a Recurrent Neural Network (RNN) to extract pertinent features, and feeds the result to the CTC layer, which in turn determines the sequence of emotion categories present in the input speech. To evaluate our model's performance, we introduce two new evaluation metrics: the ECER (Emotion Change Error Rate) and the ECD (Emotion Change Detection).
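A minimal PyTorch sketch of the CNN-RNN-CTC pipeline described above is given below; the layer sizes, feature dimensionality, and number of emotion classes are illustrative assumptions, not our exact configuration.

```python
import torch
import torch.nn as nn

class CTCEmotionModel(nn.Module):
    """Illustrative CNN -> RNN -> CTC pipeline (all sizes are assumptions)."""
    def __init__(self, n_features=40, n_emotions=4):
        super().__init__()
        # CNN block: extracts local patterns from the input feature sequence
        self.cnn = nn.Sequential(
            nn.Conv1d(n_features, 64, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv1d(64, 128, kernel_size=5, padding=2), nn.ReLU(),
        )
        # RNN block: models temporal context over the CNN features
        self.rnn = nn.LSTM(128, 128, bidirectional=True, batch_first=True)
        # Projection to the emotion classes plus the CTC "blank" symbol
        self.fc = nn.Linear(2 * 128, n_emotions + 1)

    def forward(self, x):                  # x: (batch, time, n_features)
        h = self.cnn(x.transpose(1, 2))    # -> (batch, 128, time)
        h, _ = self.rnn(h.transpose(1, 2)) # -> (batch, time, 256)
        return self.fc(h).log_softmax(-1)  # log-probs for the CTC loss

# Training step with the CTC loss
model = CTCEmotionModel()
ctc = nn.CTCLoss(blank=4)                 # assume blank is the last class index
x = torch.randn(2, 300, 40)               # two utterances, 300 frames each
targets = torch.tensor([1, 3, 1, 2])      # concatenated emotion sequences
log_probs = model(x).transpose(0, 1)      # CTC expects (time, batch, classes)
loss = ctc(log_probs, targets,
           input_lengths=torch.tensor([300, 300]),
           target_lengths=torch.tensor([2, 2]))
```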
In the remainder of this paper, Section 2 reviews a few significant related works. Section 3 details the proposed model. Section 4 provides information on the datasets used to assess our model's performance. Section 5 reports detailed results along with the new evaluation metrics. Section 6 presents an in-depth analysis and discussion. Finally, Section 7 summarizes this article and discusses our future work.
2 THE STATE OF THE ART
Several recent studies have focused on emotion recognition from speech. The work of Mustaqeem and Kwon (2020a) focuses mainly on the pre-processing phase: an adaptive threshold-based algorithm removes silence, noise, and irrelevant information, after which a spectrogram is generated and fed to a CNN.
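The adaptive thresholding idea can be illustrated with a simple energy-based sketch; the frame size and the threshold rule below are assumptions for illustration, not the authors' exact algorithm.

```python
import numpy as np

def remove_silence(signal, frame_len=400, hop=160, k=0.5):
    """Drop frames whose energy falls below an adaptive threshold.

    The threshold is set relative to the mean frame energy (an
    assumption; the original paper's rule may differ)."""
    frames = [signal[i:i + frame_len]
              for i in range(0, len(signal) - frame_len, hop)]
    energies = np.array([np.sum(f ** 2) for f in frames])
    threshold = k * energies.mean()   # adaptive: depends on the signal itself
    voiced = [f for f, e in zip(frames, energies) if e >= threshold]
    return np.concatenate(voiced) if voiced else signal
```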
In the work of Aouani and Ayed (2020), a vector of 42 features was extracted from each signal.
They then deployed an Auto-Encoder (AE) to reduce the dimensionality of the representation and select pertinent features; the output of the AE is passed to an SVM that classifies the speech and determines the emotion.
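A minimal sketch of this two-stage scheme is shown below, with a small PyTorch auto-encoder feeding a scikit-learn SVM; the bottleneck size, kernel, and training loop are assumptions, not the authors' configuration.

```python
import torch
import torch.nn as nn
from sklearn.svm import SVC

class AE(nn.Module):
    """Toy auto-encoder: 42-dim features -> small bottleneck code."""
    def __init__(self, dim_in=42, dim_code=8):  # bottleneck size is an assumption
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(dim_in, 16), nn.ReLU(),
                                 nn.Linear(16, dim_code))
        self.dec = nn.Sequential(nn.Linear(dim_code, 16), nn.ReLU(),
                                 nn.Linear(16, dim_in))

    def forward(self, x):
        return self.dec(self.enc(x))

def train_ae(ae, X, epochs=100):
    opt = torch.optim.Adam(ae.parameters(), lr=1e-3)
    for _ in range(epochs):
        opt.zero_grad()
        loss = nn.functional.mse_loss(ae(X), X)  # reconstruction objective
        loss.backward()
        opt.step()

# X: (n_samples, 42) feature vectors, y: emotion labels (random placeholders)
X, y = torch.randn(200, 42), torch.randint(0, 4, (200,))
ae = AE()
train_ae(ae, X)
codes = ae.enc(X).detach().numpy()       # compressed representation
svm = SVC(kernel="rbf").fit(codes, y.numpy())
```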
Slimi et al. (2020) used log-mel spectrograms as input to a shallow neural network (SNN) to show that neural networks can work with small datasets. Once the spectrograms were generated, they were resized to fit the first layer of the network.
In the work of Mustaqeem and Kwon (2020b), several blocks compose the SER framework: a ConvLSTM (a combination of CNN and LSTM) serves as the local feature learning block (LFLB), gated recurrent units (GRUs) form the global feature learning block (GFLB), and the center loss is used alongside the softmax loss for multi-class classification.
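The center loss pulls each embedding toward the centroid of its class while the softmax (cross-entropy) loss keeps classes separable. A minimal sketch of the joint objective follows; the embedding size and the weighting factor lambda are assumptions.

```python
import torch
import torch.nn as nn

class CenterLoss(nn.Module):
    """Center loss: mean squared distance between embeddings and
    their (learnable) class centers."""
    def __init__(self, n_classes, dim):
        super().__init__()
        self.centers = nn.Parameter(torch.randn(n_classes, dim))

    def forward(self, feats, labels):
        return ((feats - self.centers[labels]) ** 2).sum(dim=1).mean() / 2

# Joint objective: cross-entropy (softmax) + lambda * center loss
n_classes, dim, lam = 4, 64, 0.1        # lambda weight is an assumption
center_loss = CenterLoss(n_classes, dim)
ce = nn.CrossEntropyLoss()
clf = nn.Linear(dim, n_classes)
feats = torch.randn(8, dim)             # stand-in for GFLB embeddings
labels = torch.randint(0, n_classes, (8,))
loss = ce(clf(feats), labels) + lam * center_loss(feats, labels)
```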
In the work of Issa et al. (2020), five different feature sets were extracted and tested with a CNN: Mel-Frequency Cepstral Coefficients (MFCCs), the Mel-scaled spectrogram, the chromagram, the spectral contrast feature, and the Tonnetz representation.
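All five feature types can be computed with standard audio tooling, for instance the librosa library (the use of librosa, the filename, and the mean-pooling step below are our assumptions for illustration):

```python
import librosa
import numpy as np

y, sr = librosa.load("utterance.wav", sr=None)  # hypothetical audio file
stft = np.abs(librosa.stft(y))

mfcc     = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=40)
mel      = librosa.feature.melspectrogram(y=y, sr=sr)
chroma   = librosa.feature.chroma_stft(S=stft, sr=sr)
contrast = librosa.feature.spectral_contrast(S=stft, sr=sr)
tonnetz  = librosa.feature.tonnetz(y=librosa.effects.harmonic(y), sr=sr)

# One global vector per utterance: mean over time of each feature set
features = np.hstack([f.mean(axis=1) for f in
                      (mfcc, mel, chroma, contrast, tonnetz)])
```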
However, despite the variety of feature extraction algorithms and classification techniques, all of these works share one common point: they recognize emotions from pre-segmented data carrying one global label.
Fewer papers have been published on emotion change detection. Huang et al. (2015) worked on detecting the instant of emotion change and the transition points from one emotion to another. Gaussian Mixture Model (GMM)-based methods, with and without prior knowledge of emotion, were used to detect emotion change among only four different emotions; their main focus, however, was on arousal and valence. Their method uses a double sliding window consisting of a previous and a current fixed-length window. Within these two windows, which span multiple frames, frame-based features are extracted and used to calculate probabilities. Scores, formed as a linear combination of log-likelihoods, are computed and compared to a threshold during the detection phase in order to make a decision: if a score exceeds the threshold within the tolerance range of the actual change point, a change is declared. To evaluate their model, they used the Detection Error Trade-off (DET) curve and the Equal Error Rate (EER).
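A minimal sketch of this double-window, likelihood-based scoring is given below; the window length, the GMM settings, the exact score definition, and the threshold are illustrative assumptions, not the authors' configuration.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def change_scores(frames, win=50, threshold=0.0):
    """Slide a previous/current window pair over frame-level features
    (frames: array of shape (n_frames, dim)) and score how differently
    the current window is modelled relative to the previous one
    (a simplification of the original method)."""
    scores, change_points = [], []
    for t in range(win, len(frames) - win):
        prev, curr = frames[t - win:t], frames[t:t + win]
        gmm_prev = GaussianMixture(n_components=2, covariance_type="diag").fit(prev)
        gmm_curr = GaussianMixture(n_components=2, covariance_type="diag").fit(curr)
        # Linear combination of average log-likelihoods on the current window
        score = gmm_curr.score(curr) - gmm_prev.score(curr)
        scores.append(score)
        if score > threshold:            # decision rule against the threshold
            change_points.append(t)
    return np.array(scores), change_points
```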
In the paper of Huang and Epps (2016), the authors explored the problem of identifying points of emotional change over time by testing exchangeability within a martingale framework; a martingale is a stochastic process in which the conditional expectation of the next value, given all previous observations, equals the current value. In this framework, data points (frame-based features of speech) are observed one by one, and each time a new data point arrives, a hypothesis test is performed to determine whether a concept change has occurred in the data stream. Their goal was to identify changes between emotional categories (neutral and emotional) as well as within dimensions (positive and negative in arousal and valence). They used two sets of frame-level acoustic features: MFCCs and the extended Geneva Minimalistic Acoustic Parameter Set (eGeMAPS).
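The exchangeability-martingale idea can be sketched as follows: each new frame receives a strangeness value and a p-value, a power martingale grows as small p-values (evidence against exchangeability) accumulate, and a change is declared when the martingale exceeds a threshold. The strangeness function, epsilon, and threshold below are illustrative assumptions.

```python
import numpy as np

def martingale_change_detection(frames, eps=0.92, lam=10.0):
    """Power-martingale test for exchangeability over a stream of
    frame-level features (strangeness and threshold are assumptions)."""
    M, history, changes = 1.0, [], []
    for t, x in enumerate(frames):
        history.append(x)
        center = np.mean(history, axis=0)
        # Strangeness: distance of each observed point to the running mean
        s = np.array([np.linalg.norm(h - center) for h in history])
        s_new = s[-1]
        # Randomized p-value: rank of the new point's strangeness
        theta = np.random.uniform()
        p = (np.sum(s > s_new) + theta * np.sum(s == s_new)) / len(s)
        M *= eps * p ** (eps - 1.0)      # power martingale update
        if M > lam:                      # exchangeability rejected: change
            changes.append(t)
            M, history = 1.0, []         # restart after a detection
    return changes
```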
The model of Huang and Epps (2018) consists of detecting the emotion change points in time as well as assessing each change by estimating its magnitude and type. They used 88-dimensional eGeMAPS features and three different regression models: Support Vector Regression (SVR), the Relevance Vector Machine (RVM), and the Output-Associative RVM (OA-RVM).
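Of these three, SVR is the most widely available; a minimal sketch of regressing emotion-change magnitude from eGeMAPS-style features follows (the placeholder data, scaling step, and kernel settings are assumptions, not the authors' setup):

```python
import numpy as np
from sklearn.svm import SVR
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

# X: (n_segments, 88) eGeMAPS feature vectors around candidate change points
# y: annotated magnitude of the emotion change at each point
X = np.random.randn(100, 88)   # placeholder for real features
y = np.random.randn(100)       # placeholder for real annotations

model = make_pipeline(StandardScaler(), SVR(kernel="rbf", C=1.0))
model.fit(X, y)
predicted_magnitude = model.predict(X[:5])
```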