Authors: Matthias Wimmer (1); Björn Schuller (2); Dejan Arsic (2); Gerhard Rigoll (2) and Bernd Radig (3)
Affiliations:
(1) Perceptual Computing Lab, Faculty of Science and Engineering, Waseda University, Japan
(2) Institute for Human-Machine Communication, Technische Universität München, Germany
(3) Technische Universität München, Germany
Keyword(s): Emotion Recognition, Audio-visual Processing, Multi-modal Fusion.
Related Ontology Subjects/Areas/Topics: Applications; Artificial Intelligence; Biomedical Engineering; Biomedical Signal Processing; Computer Vision, Visualization and Computer Graphics; Data Manipulation; Early Vision and Image Representation; Feature Extraction; Features Extraction; Health Engineering and Technology Applications; Human-Computer Interaction; Image and Video Analysis; Informatics in Control, Automation and Robotics; Methodologies and Methods; Motion, Tracking and Stereo Vision; Neurocomputing; Neurotechnology, Electronics and Informatics; Pattern Recognition; Physiological Computing Systems; Real-Time Vision; Segmentation and Grouping; Sensor Networks; Signal Processing, Sensors, Systems Modeling and Control; Soft Computing; Software Engineering; Statistical Approach; Time-Frequency Analysis; Video Analysis
Abstract:
Bimodal emotion recognition through audiovisual feature fusion has been shown to outperform each individual modality in the past. Still, synchronization of the two streams is a challenge, as many vision approaches work on a per-frame basis, whereas audio is typically processed per turn or chunk. Therefore, late fusion schemes such as simple logic or voting strategies are commonly used for the overall estimation of the underlying affect. However, early fusion is known to be more effective in many other multimodal recognition tasks. We therefore suggest a combined analysis by descriptive statistics of audio and video Low-Level Descriptors for subsequent static SVM classification. This strategy also allows for a combined feature-space optimization, which is discussed herein. The high effectiveness of this approach is shown on a database of 11.5 hours containing six emotional situations in an airplane scenario.
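The early-fusion strategy described in the abstract, i.e. mapping variable-length streams of frame-wise Low-Level Descriptors (LLDs) to a fixed-length turn-level vector via descriptive statistics and then concatenating the modalities, can be sketched as follows. This is a minimal illustration, not the paper's implementation: the specific functionals (mean, standard deviation, min, max), the LLD dimensionalities, and the frame counts are all assumptions for the example, and the fused vector would subsequently feed a static classifier such as an SVM.

```python
import numpy as np

def functionals(lld_frames):
    """Map a variable-length sequence of frame-wise LLDs
    (n_frames x n_descriptors) to a fixed-length vector of
    descriptive statistics per descriptor. The chosen set
    (mean, std, min, max) is illustrative only."""
    return np.concatenate([
        lld_frames.mean(axis=0),
        lld_frames.std(axis=0),
        lld_frames.min(axis=0),
        lld_frames.max(axis=0),
    ])

def early_fusion(audio_llds, video_llds):
    """Concatenate audio and video functionals into a single
    turn-level feature vector, so both modalities share one
    feature space despite different frame rates."""
    return np.concatenate([functionals(audio_llds),
                           functionals(video_llds)])

# Hypothetical turn: 300 audio frames x 5 LLDs at a high frame
# rate, 75 video frames x 8 LLDs at video rate. The mismatch in
# frame counts is absorbed by the statistics.
rng = np.random.default_rng(0)
audio = rng.standard_normal((300, 5))
video = rng.standard_normal((75, 8))
x = early_fusion(audio, video)
print(x.shape)  # (52,) = 4 functionals x (5 + 8) descriptors
```

Because both modalities end up as statistics over their own time base, no frame-level synchronization is needed; only turn boundaries must agree. A combined feature-space optimization can then select from the fused vector directly.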