
which includes feature extraction, dimensionality 
reduction and classification steps. Section 4 describes the audio-visual database used in this work, while Section 5 presents the experimental setup and the results achieved.
Concluding remarks and plans for future work can be 
found in Section 6. 
2 SIGNIFICANT RELATED WORK
The paper by Rashid et al. (Rashid et al., 2012) explores the problem of human emotion recognition and proposes combining audio and visual features. First, the audio stream is separated from the video stream. Feature detection and 3D patch extraction are applied to the video streams, and the dimensionality of the video features is reduced by applying PCA. From the audio streams, prosodic features and mel-frequency cepstral coefficients (MFCCs) are
extracted. After feature extraction the authors 
construct separate codebooks for audio and video 
modalities by applying the K-means algorithm in 
Euclidean space. Finally, multiclass support vector 
machine (SVM) classifiers are applied to audio and 
video data, and decision-level data fusion is 
performed by applying the Bayes sum rule. The classifier built on audio features achieved an average accuracy of 67.39%, the classifier built on video features achieved 74.15%, and combining audio and visual features at the decision level improved the accuracy to 80.27%.
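The codebook and classification steps of this pipeline can be illustrated with a short sketch. The following is a minimal, hypothetical Python example: synthetic descriptors stand in for the real 3D-patch and MFCC features, and the cluster count and SVM settings are illustrative rather than taken from the paper. K-means centres act as codewords, each clip is encoded as a histogram of codeword assignments, and a multiclass SVM is trained on those histograms.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Hypothetical local descriptors (stand-ins for 3D video patches or MFCC
# frames): one descriptor matrix per clip, plus an emotion label per clip.
clips = [rng.normal(size=(rng.integers(50, 80), 24)) for _ in range(60)]
labels = rng.integers(0, 6, size=60)

# Build a codebook with K-means in Euclidean space.
codebook = KMeans(n_clusters=32, n_init=10, random_state=0)
codebook.fit(np.vstack(clips))

def encode(descriptors, codebook):
    """Quantize descriptors against the codebook and return a normalized
    histogram of codeword occurrences (a bag-of-features encoding)."""
    words = codebook.predict(descriptors)
    hist = np.bincount(words, minlength=codebook.n_clusters).astype(float)
    return hist / hist.sum()

X = np.array([encode(c, codebook) for c in clips])

# Multiclass SVM on the histogram features (one such classifier per modality).
clf = SVC(kernel="linear", probability=True).fit(X, labels)
print("training accuracy:", clf.score(X, labels))
```

In a two-modality setup, one such classifier per modality would produce class posteriors that can then be combined at the decision level.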
Kahou et al. (Kahou et al., 2013) described the 
approach they used for submission to the 2013 
Emotion Recognition in the Wild Challenge. The 
approach combined multiple deep neural networks 
including deep convolutional neural networks 
(CNNs) for analyzing facial expressions in video 
frames, a deep belief network (DBN) to capture audio information, a deep autoencoder to model the spatio-temporal information produced by human actions, and a shallow network architecture focused on features extracted from the mouth region of the primary human subject in the scene. The authors used the Toronto
Face Dataset, containing 4,178 images labelled with 
basic emotions and with only fully frontal facing 
poses, and a dataset harvested from Google image search, which consisted of 35,887 images covering seven expression classes. All images were converted to 48x48 grayscale. Several decision-level data
integration techniques were used: averaged 
predictions, SVM and multi-layer perceptron (MLP) 
aggregation techniques, and a random search over model combination weights. The best accuracy they achieved
on the competition testing set was 41.03%. 
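The weighting technique amounts to searching for per-model weights that maximize the validation accuracy of the weighted-average prediction. A minimal sketch under that reading, with randomly generated stand-ins for the individual networks' class-probability outputs (the model count, class count, and search budget are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical validation outputs of the individual models (CNN, DBN,
# autoencoder, shallow mouth network): each is (n_samples, n_classes).
n_samples, n_classes, n_models = 300, 7, 4
model_probs = [rng.dirichlet(np.ones(n_classes), size=n_samples)
               for _ in range(n_models)]
y_val = rng.integers(0, n_classes, size=n_samples)

def weighted_average(probs, weights):
    """Weighted average of the per-model class-probability matrices."""
    stacked = np.stack(probs)                      # (n_models, n, n_classes)
    return np.tensordot(weights, stacked, axes=1)  # (n, n_classes)

best_w, best_acc = None, -1.0
for _ in range(2000):                              # random search over weights
    w = rng.dirichlet(np.ones(n_models))           # weights sum to one
    acc = np.mean(np.argmax(weighted_average(model_probs, w), axis=1) == y_val)
    if acc > best_acc:
        best_w, best_acc = w, acc

print("best weights:", np.round(best_w, 3), "validation accuracy:", best_acc)
```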
In the work by Cruz et al. (Cruz et al., 2012) the 
concept of modelling the change in features is used, 
rather than their simple combination. First, the faces 
are extracted from the original images, and Local 
Phase Quantization (LPQ) histograms are extracted in 
each n x n local region. The histograms are 
concatenated to form a feature vector. The derivative 
of features is computed by two methods: convolution 
with the difference of Gaussians (DoG) filter and the 
difference of feature histograms. A linear SVM is 
trained to output posterior probabilities, and the
changes are modelled with a hidden Markov model. 
The proposed method was tested on the Audio/Visual 
Emotion Challenge 2011 dataset, which consists of 
63 videos of 13 different individuals, in which frontal face videos are recorded during an interview while the subject is engaged in a conversation. The authors
claim that they increased the classification rate on the 
data by 13%. 
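Both derivative estimates operate on the per-frame feature vectors: one differences consecutive histograms, the other convolves each feature dimension with a difference-of-Gaussians (DoG) kernel along the temporal axis. A minimal sketch of this interpretation, with a random matrix standing in for the concatenated LPQ histograms and illustrative Gaussian scales:

```python
import numpy as np
from scipy.ndimage import gaussian_filter1d

rng = np.random.default_rng(2)

# Hypothetical per-frame feature matrix: each row is the concatenation of
# LPQ histograms from the n x n local regions of one face image
# (e.g. a 4 x 4 grid of regions with a 256-bin histogram each).
features = rng.random(size=(120, 16 * 256))        # 120 frames

# Derivative 1: simple difference of consecutive feature histograms.
hist_diff = np.diff(features, axis=0)

# Derivative 2: convolution with a DoG filter along the temporal axis,
# computed here as the difference of two Gaussian smoothings.
dog = (gaussian_filter1d(features, sigma=1.0, axis=0)
       - gaussian_filter1d(features, sigma=2.0, axis=0))

print(hist_diff.shape, dog.shape)
```

Either derivative sequence can then be fed to the SVM/HMM stage that models the temporal changes.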
In (Soleymani et al., 2012) the authors exploit the 
idea of using electroencephalogram (EEG) signals, pupillary response and gaze distance to classify the arousal of a subject as calm, medium aroused, or activated, and the valence as unpleasant, neutral, or pleasant.
The data consists of 20 video clips with emotional 
content from movies. The valence classification 
accuracy achieved is 68.5%, and the arousal classification accuracy is 76.4%.
Busso et al. (Busso et al., 2004) researched the 
idea of acoustic and facial expression information 
fusion. They used a database recorded from an actress 
reading 258 sentences expressing emotions. Separate 
classifiers based on acoustic data and facial 
expressions were built, with classification accuracies of 70.9% and 85%, respectively. The facial expression features are extracted from five facial areas: forehead, eyebrow, low eye, and the right and left cheeks. The authors covered two data
fusion approaches: decision level and feature level 
integration. On the feature level, audio and facial 
expression features were combined to build one 
classifier, giving 90% accuracy. On the decision 
level, several criteria were used to combine posterior 
probabilities of the unimodal systems: maximum – 
the emotion with the greatest posterior probability in 
both modalities is selected; average – the posterior 
probability of each modality is equally weighted and 
the maximum is selected; product - posterior 
probabilities are multiplied and the maximum is 
selected; weight - different weights are applied to the 
different unimodal systems. The accuracies of 
decision-level integration bimodal classifiers range 