LOW-LEVEL FUSION OF AUDIO AND VIDEO FEATURE FOR MULTI-MODAL EMOTION RECOGNITION

Matthias Wimmer; Björn Schuller; Dejan Arsic; Gerhard Rigoll; Bernd Radig

doi:10.5220/0001082801450151

2008

Abstract

Bimodal emotion recognition through audiovisual feature fusion has been shown to be superior to either modality alone. Still, synchronizing the two streams is a challenge, because many vision approaches operate on a frame basis, whereas audio is analyzed on a turn or chunk basis. Therefore, late fusion schemes such as simple logic or voting strategies are commonly used for the overall estimation of the underlying affect. However, early fusion is known to be more effective in many other multimodal recognition tasks. We therefore suggest a combined analysis by descriptive statistics of audio and video low-level descriptors for subsequent static SVM classification. This strategy also allows for a combined feature-space optimization, which is discussed herein. The high effectiveness of this approach is shown on a database of 11.5 h containing six emotional situations in an airplane scenario.
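
The early-fusion idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the descriptor names (pitch, energy, mouth_opening), the frame rates, and the choice of four statistics are assumptions made for illustration only. The sketch shows how descriptive statistics turn frame-level contours of different lengths into one fixed-length vector per turn, which could then be passed to a static classifier such as an SVM.

```python
from statistics import mean, pstdev


def functionals(series):
    """Map a variable-length descriptor contour to four fixed statistics."""
    return [mean(series), pstdev(series), min(series), max(series)]


def fuse_turn(audio_llds, video_llds):
    """Early fusion: build one feature vector per turn from both modalities.

    Both arguments map a descriptor name to its list of per-frame values.
    The two modalities may have different frame counts.
    """
    vector = []
    for name in sorted(audio_llds):   # fixed order keeps dimensions stable
        vector.extend(functionals(audio_llds[name]))
    for name in sorted(video_llds):
        vector.extend(functionals(video_llds[name]))
    return vector


# Hypothetical one-second turn: audio at 100 frames/s, video at 25 frames/s.
audio = {"pitch": [200.0 + i for i in range(100)], "energy": [0.5] * 100}
video = {"mouth_opening": [0.1 * (i % 5) for i in range(25)]}

fused = fuse_turn(audio, video)
print(len(fused))  # 3 descriptors x 4 statistics
```

Because the statistics are computed per turn, the differing frame rates of the two streams no longer need to be aligned frame by frame, and a joint feature selection can operate on the concatenated vector.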

Paper Citation

in Harvard Style

Wimmer M., Schuller B., Arsic D., Rigoll G. and Radig B. (2008). LOW-LEVEL FUSION OF AUDIO AND VIDEO FEATURE FOR MULTI-MODAL EMOTION RECOGNITION. In Proceedings of the Third International Conference on Computer Vision Theory and Applications - Volume 1: VISAPP, (VISIGRAPP 2008) ISBN 978-989-8111-21-0, pages 145-151. DOI: 10.5220/0001082801450151

in Bibtex Style

@conference{visapp08,
author={Matthias Wimmer and Björn Schuller and Dejan Arsic and Gerhard Rigoll and Bernd Radig},
title={LOW-LEVEL FUSION OF AUDIO AND VIDEO FEATURE FOR MULTI-MODAL EMOTION RECOGNITION},
booktitle={Proceedings of the Third International Conference on Computer Vision Theory and Applications - Volume 1: VISAPP, (VISIGRAPP 2008)},
year={2008},
pages={145-151},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0001082801450151},
isbn={978-989-8111-21-0},
}

in EndNote Style

TY - CONF
JO - Proceedings of the Third International Conference on Computer Vision Theory and Applications - Volume 1: VISAPP, (VISIGRAPP 2008)
TI - LOW-LEVEL FUSION OF AUDIO AND VIDEO FEATURE FOR MULTI-MODAL EMOTION RECOGNITION
SN - 978-989-8111-21-0
AU - Wimmer M.
AU - Schuller B.
AU - Arsic D.
AU - Rigoll G.
AU - Radig B.
PY - 2008
SP - 145
EP - 151
DO - 10.5220/0001082801450151