which includes feature extraction, dimensionality
reduction, and classification steps. Section 4 describes the audio-visual database used in this work, while Section 5 deals with the experimental setup and the results achieved.
Concluding remarks and plans for future work can be
found in Section 6.
2 SIGNIFICANT RELATED WORK
The paper by Rashid et al. (Rashid et al., 2012)
explores the problem of human emotion recognition and proposes combining audio and visual features as a solution. First, the audio stream is separated
from the video stream. Feature detection and 3D
patch extraction are applied to video streams and the
dimensionality of video features is reduced by
applying PCA. Prosodic features and mel-frequency cepstral coefficients (MFCC) are extracted from the audio streams. After feature extraction, the authors
construct separate codebooks for audio and video
modalities by applying the K-means algorithm in
Euclidean space. Finally, multiclass support vector
machine (SVM) classifiers are applied to audio and
video data, and decision-level data fusion is
performed by applying the Bayes sum rule. A classifier built on audio features alone achieved an average accuracy of 67.39%, video features gave an accuracy of 74.15%, while combining audio and visual features at the decision level improved the accuracy to 80.27%.
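As an illustration of this style of decision-level fusion, the following minimal sketch combines the per-class posteriors of two unimodal classifiers with a sum rule; the posterior values and the equal-prior assumption are illustrative and are not taken from Rashid et al.

import numpy as np

def sum_rule_fusion(p_audio, p_video):
    # Bayes sum rule under an equal-prior assumption: each class score is
    # the sum of its unimodal posteriors; the class with the largest
    # summed score is selected.
    scores = np.asarray(p_audio) + np.asarray(p_video)
    return int(np.argmax(scores)), scores / scores.sum()

# Hypothetical posteriors over three emotion classes produced by the
# audio and video multiclass SVMs.
label, fused = sum_rule_fusion([0.2, 0.5, 0.3], [0.1, 0.3, 0.6])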
Kahou et al. (Kahou et al., 2013) described the
approach they used for submission to the 2013
Emotion Recognition in the Wild Challenge. The
approach combined multiple deep neural networks
including deep convolutional neural networks
(CNNs) for analyzing facial expressions in video
frames, a deep belief net (DBN) to capture audio information, a deep autoencoder to model the spatio-temporal information produced by human actions, and a shallow network architecture focused on features extracted from the mouth of the primary human
subject in the scene. The authors used the Toronto
Face Dataset, containing 4,178 images labelled with
basic emotions and containing only fully frontal-facing poses, and a dataset harvested from Google image search, which consisted of 35,887 images with seven
expression classes. All images were converted to grayscale and resized to 48x48. Several decision-level data
integration techniques were used: averaged
predictions, SVM and multi-layer perceptron (MLP)
aggregation techniques, and random search for
weighting models. The best accuracy they achieved
on the competition testing set was 41.03%.
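One plausible reading of the random-search weighting step is sketched below: random convex weight vectors are drawn, and the vector whose weighted average of the models' class probabilities performs best on a validation set is kept. The array shapes and the Dirichlet sampling are assumptions made for illustration rather than details reported by Kahou et al.

import numpy as np

def random_search_weights(model_probs, labels, n_trials=1000, seed=0):
    # model_probs: array of shape (n_models, n_samples, n_classes) holding
    # each model's class probabilities on a validation set; labels holds
    # the true class indices of the validation samples.
    rng = np.random.default_rng(seed)
    best_w, best_acc = None, -1.0
    for _ in range(n_trials):
        w = rng.dirichlet(np.ones(model_probs.shape[0]))  # random convex weights
        fused = np.tensordot(w, model_probs, axes=1)      # (n_samples, n_classes)
        acc = np.mean(fused.argmax(axis=1) == labels)
        if acc > best_acc:
            best_w, best_acc = w, acc
    return best_w, best_acc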
In the work by Cruz et al. (Cruz et al., 2012) the
concept of modelling the change in features is used,
rather than their simple combination. First, the faces
are extracted from the original images, and Local
Phase Quantization (LPQ) histograms are extracted in
each n × n local region. The histograms are
concatenated to form a feature vector. The derivative
of features is computed by two methods: convolution
with the difference of Gaussians (DoG) filter and the
difference of feature histograms. A linear SVM is
trained to output posterior probabilities, and the
changes are modelled with a hidden Markov model.
The proposed method was tested on the Audio/Visual
Emotion Challenge 2011 dataset, which consists of
63 videos of 13 different individuals; frontal-face videos are recorded during an interview in which the subject is engaged in a conversation. The authors
claim that they increased the classification rate on the
data by 13%.
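The two derivative computations can be pictured roughly as follows, treating the concatenated LPQ histograms of a video as a (frames x dimensions) matrix; the temporal interpretation and the Gaussian widths are illustrative assumptions rather than the exact settings of Cruz et al.

import numpy as np
from scipy.ndimage import gaussian_filter1d

def dog_derivative(features, sigma_fast=1.0, sigma_slow=2.0):
    # Approximate the derivative of each feature dimension over time as the
    # difference of two Gaussian smoothings, i.e. convolution with a
    # difference-of-Gaussians (DoG) filter.
    fast = gaussian_filter1d(features, sigma_fast, axis=0)
    slow = gaussian_filter1d(features, sigma_slow, axis=0)
    return fast - slow

def histogram_difference(features):
    # Alternative: frame-to-frame difference of the feature histograms.
    return np.diff(features, axis=0)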
In (Soleymani et al., 2012) the authors exploit the
idea of using electroencephalogram (EEG) signals, pupillary response, and gaze distance to classify the arousal of a subject as either calm, medium aroused, or activated, and valence as either unpleasant, neutral, or pleasant.
The data consists of 20 video clips with emotional
content from movies. The valence classification
accuracy achieved is 68.5%, and the arousal classification accuracy is 76.4%.
Busso et al. (Busso et al., 2004) researched the
idea of acoustic and facial expression information
fusion. They used a database recorded from an actress
reading 258 sentences while expressing emotions. Separate
classifiers based on acoustic data and facial
expressions were built, with classification accuracies
of 70.9% and 85%, respectively. Facial expression features include five areas: forehead, eyebrow, low eye, and the right and left cheeks. The authors examined two data fusion approaches: decision-level and feature-level
integration. On the feature level, audio and facial
expression features were combined to build one
classifier, giving 90% accuracy. On the decision
level, several criteria were used to combine posterior
probabilities of the unimodal systems: maximum –
the emotion with the greatest posterior probability in
both modalities is selected; average – the posterior
probability of each modality is equally weighted and
the maximum is selected; product – posterior probabilities are multiplied and the maximum is selected; weight – different weights are applied to the different unimodal systems. A minimal sketch of these fusion rules is given below.
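The sketch assumes two posterior vectors defined over the same set of emotion classes; the weights used by the weighted rule are placeholders.

import numpy as np

def fuse_posteriors(p_audio, p_face, rule="product", weights=(0.5, 0.5)):
    # Combine two unimodal posterior vectors with one of the four
    # decision-level criteria and return the index of the winning class.
    p_a, p_f = np.asarray(p_audio), np.asarray(p_face)
    if rule == "maximum":    # largest single posterior across modalities
        scores = np.maximum(p_a, p_f)
    elif rule == "average":  # equally weighted posteriors
        scores = (p_a + p_f) / 2.0
    elif rule == "product":  # multiplied posteriors
        scores = p_a * p_f
    elif rule == "weight":   # modality-specific weights
        scores = weights[0] * p_a + weights[1] * p_f
    else:
        raise ValueError("unknown rule: " + rule)
    return int(np.argmax(scores))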
The accuracies of the decision-level integration bimodal classifiers range