Authors: Maxim Sidorov 1; Evgenii Sopov 2; Ilia Ivanov 2 and Wolfgang Minker 1
Affiliations: 1 Ulm University, Germany; 2 Siberian State Aerospace University, Russian Federation
Keyword(s):
Emotion Recognition, Speech, Vision, PCA, Neural Network, Human-Computer Interaction (HCI), Feature Level Fusion, Decision Level Fusion.
Related Ontology Subjects/Areas/Topics: Human-Machine Interfaces; Hybrid Learning Systems; Image Processing; Informatics in Control, Automation and Robotics; Intelligent Control Systems and Optimization; Neural Networks Based Control Systems; Robotics and Automation; Vision, Recognition and Reconstruction
Abstract:
The speech-based emotion recognition problem has already been investigated by many authors, and reasonable results have been achieved. This article focuses on applying an audio-visual data fusion approach to emotion recognition. Two state-of-the-art classification algorithms were applied to one audio and three visual feature datasets. Feature-level data fusion was used to build a multimodal emotion classification system, which increased emotion classification accuracy by 4% over the best accuracy achieved by the unimodal systems. The per-class precisions obtained by applying the algorithms to the unimodal and multimodal datasets revealed that different data-classifier combinations are good at recognizing certain emotions. These data-classifier combinations were therefore fused at the decision level using several approaches, which increased the accuracy by a further 3% over the best accuracy achieved by feature-level fusion.
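To make the two fusion strategies mentioned in the abstract concrete, the following is a minimal sketch of feature-level fusion (concatenating audio and visual features, reducing them with PCA, and classifying with a neural network) and decision-level fusion (combining the outputs of per-modality classifiers). The feature dimensions, PCA size, network settings, and equal fusion weights are illustrative assumptions, not the configuration or results reported in the paper.

```python
# Sketch of feature-level vs. decision-level fusion for multimodal emotion
# recognition. Assumes per-modality features are already extracted into
# NumPy arrays; all sizes below are toy values.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 200                                    # number of utterances (toy size)
X_audio = rng.normal(size=(n, 40))         # acoustic descriptors per utterance
X_visual = rng.normal(size=(n, 60))        # facial/visual descriptors per utterance
y = rng.integers(0, 4, size=n)             # emotion class labels

# --- Feature-level fusion: concatenate modalities, reduce with PCA, classify ---
X_fused = np.hstack([X_audio, X_visual])
fused_clf = make_pipeline(
    StandardScaler(),
    PCA(n_components=20),
    MLPClassifier(hidden_layer_sizes=(32,), max_iter=500, random_state=0),
)
fused_clf.fit(X_fused, y)

# --- Decision-level fusion: one classifier per modality, combine their outputs ---
audio_clf = make_pipeline(StandardScaler(),
                          MLPClassifier(hidden_layer_sizes=(32,), max_iter=500, random_state=0))
visual_clf = make_pipeline(StandardScaler(),
                           MLPClassifier(hidden_layer_sizes=(32,), max_iter=500, random_state=0))
audio_clf.fit(X_audio, y)
visual_clf.fit(X_visual, y)

# Combine per-modality class probabilities; equal weights are used here, though
# weights could instead reflect each data-classifier combination's per-class precision.
proba = 0.5 * audio_clf.predict_proba(X_audio) + 0.5 * visual_clf.predict_proba(X_visual)
y_pred = proba.argmax(axis=1)
```

In this sketch, decision-level fusion averages class probabilities; other combination rules (e.g. majority voting or precision-weighted selection of the classifier best suited to a given emotion) fit the same structure by replacing the final combination step.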