Fusion of Audio-visual Features using Hierarchical Classifier Systems for the Recognition of Affective States and the State of Depression

Markus Kächele, Michael Glodek, Dimitrij Zharkov, Sascha Meudt, Friedhelm Schwenker


Reliable prediction of affective states in real world scenarios is very challenging, and a significant amount of ongoing research targets the improvement of existing systems. Major problems include the unreliability of labels, variations of the same affective states among different persons and in different modalities, as well as the presence of sensor noise in the signals. This work presents a framework for adaptive fusion of input modalities incorporating variable degrees of certainty on different levels. Using a strategy that starts with ensembles of weak learners, the discriminative power of the system is improved gradually, level by level, by adaptively weighting favorable decisions while concurrently dismissing unfavorable ones. For the final decision fusion the proposed system leverages a trained Kalman filter. Besides its ability to deal with missing and uncertain values, the Kalman filter is by nature a time series predictor and thus a suitable choice for matching input signals to a reference time series in the form of ground truth labels. In the case of affect recognition, the proposed system exhibits superior performance in comparison to competing systems on the analysed dataset.
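The final-stage fusion described in the abstract, a Kalman filter treating individual classifier decisions as noisy and possibly missing measurements of a latent affective state, can be illustrated with a minimal sketch. The function name `kalman_fuse`, the per-classifier measurement variances, and the scalar process-noise parameter `q` are illustrative assumptions and do not reproduce the authors' trained filter:

```python
import numpy as np

def kalman_fuse(streams, uncertainties, q=0.01):
    """Fuse several classifier score streams into one smoothed estimate.

    streams:       (T, M) array of per-frame scores from M classifiers;
                   NaN marks a missing decision at that frame.
    uncertainties: length-M measurement variances, one per classifier
                   (larger variance means less trust in that classifier).
    q:             process noise, i.e. how fast the latent affective
                   state is allowed to drift between frames.
    """
    T, M = streams.shape
    x, p = 0.0, 1.0          # state estimate and its variance
    fused = np.empty(T)
    for t in range(T):
        p += q               # predict: the state drifts with variance q
        for m in range(M):   # update with every available measurement
            z = streams[t, m]
            if np.isnan(z):
                continue     # missing decision: skip, prediction stands
            k = p / (p + uncertainties[m])   # Kalman gain
            x += k * (z - x)                 # correct toward measurement
            p *= (1.0 - k)                   # shrink state uncertainty
        fused[t] = x
    return fused
```

Because missing decisions simply leave the prediction step in place, the filter degrades gracefully when one modality drops out, which is the property the abstract highlights.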


  1. Airas, M. and Alku, P. (2007). Comparison of multiple voice source parameters in different phonation types.
  3. Burkhardt, F., Paeschke, A., Rolfes, M., Sendlmeier, W., and Weiss, B. (2005). A database of German emotional speech. In Proceedings of Interspeech 2005, pages 1517-1520.
  4. Cohn, J., Kruez, T., Matthews, I., Yang, Y., Nguyen, M. H., Padilla, M., Zhou, F., and De la Torre, F. (2009). Detecting depression from facial actions and vocal prosody. In Affective Computing and Intelligent Interaction and Workshops (ACII 2009), pages 1-7.
  5. Drugman, T., Bozkurt, B., and Dutoit, T. (2011). Causal-anticausal decomposition of speech using complex cepstrum for glottal source estimation. Speech Communication, 53(6):855-866.
  6. Fragopanagos, N. and Taylor, J. (2005). Emotion recognition in human-computer interaction. Neural Networks, 18:389-405.
  7. Glodek, M., Reuter, S., Schels, M., Dietmayer, K., and Schwenker, F. (2013). Kalman filter based classifier fusion for affective state recognition. In Proceedings of the International Workshop on Multiple Classifier Systems (MCS), volume 7872 of LNCS, pages 85-94. Springer.
  8. Glodek, M., Schels, M., Palm, G., and Schwenker, F. (2012). Multi-modal fusion based on classification using rejection option and Markov fusion networks. In Proceedings of the International Conference on Pattern Recognition (ICPR), pages 1084-1087. IEEE.
  9. Hinton, G. E., Osindero, S., and Teh, Y.-W. (2006). A fast learning algorithm for deep belief nets. Neural Comput., 18(7):1527-1554.
  10. Jiang, B., Valstar, M. F., and Pantic, M. (2011). Action unit detection using sparse appearance descriptors in space-time video volumes. In Proceedings of IEEE International Conference on Automatic Face and Gesture Recognition, pages 314-321. IEEE.
  11. Kalman, R. E. (1960). A new approach to linear filtering and prediction problems. Transactions of the ASME - Journal of Basic Engineering, 82(Series D):35-45.
  12. Kanade, T., Cohn, J., and Tian, Y. (2000). Comprehensive database for facial expression analysis. In Automatic Face and Gesture Recognition, 2000, pages 46-53.
  13. Kane, J. and Gobl, C. (2013). Wavelet maxima dispersion for breathy to tense voice discrimination. Audio, Speech, and Language Processing, IEEE Transactions on, 21(6):1170-1179.
  14. Lee, C. M., Yildirim, S., Bulut, M., Kazemzadeh, A., Busso, C., Deng, Z., Lee, S., and Narayanan, S. S. (2004). Emotion recognition based on phoneme classes. In Proceedings of ICSLP 2004.
  15. Luengo, I., Navas, E., and Hernáez, I. (2010). Feature analysis and evaluation for automatic emotion identification in speech. Multimedia, IEEE Transactions on, 12(6):490-501.
  16. Lugger, M. and Yang, B. (2006). Classification of different speaking groups by means of voice quality parameters. ITG-Fachbericht-Sprachkommunikation 2006.
  17. Lugger, M. and Yang, B. (2007). The relevance of voice quality features in speaker independent emotion recognition. In Acoustics, Speech and Signal Processing, 2007. ICASSP 2007. IEEE International Conference on, volume 4, pages IV-17. IEEE.
  18. Meng, H., Huang, D., Wang, H., Yang, H., AI-Shuraifi, M., and Wang, Y. (2013). Depression recognition based on dynamic facial and vocal expression features using partial least square regression. In Proceedings of AVEC 2013, AVEC '13, pages 21-30. ACM.
  19. Meudt, S., Zharkov, D., Kächele, M., and Schwenker, F. (2013). Multi classifier systems and forward backward feature selection algorithms to classify emotional coloured speech. In Proceedings of the International Conference on Multimodal Interaction (ICMI 2013).
  20. Nwe, T. L., Foo, S. W., and De Silva, L. C. (2003). Speech emotion recognition using hidden Markov models. Speech Communication, 41(4):603-623.
  21. Ojala, T., Pietikäinen, M., and Harwood, D. (1996). A comparative study of texture measures with classification based on featured distributions. Pattern Recognition, 29(1):51-59.
  22. Ojansivu, V. and Heikkilä, J. (2008). Blur insensitive texture classification using local phase quantization. In Elmoataz, A., Lezoray, O., Nouboud, F., and Mammass, D., editors, Image and Signal Processing, volume 5099 of LNCS, pages 236-243. Springer Berlin Heidelberg.
  23. Palm, G. and Schwenker, F. (2009). Sensor-fusion in neural networks. In Shahbazian, E., Rogova, G., and DeWeert, M. J., editors, Harbour Protection Through Data Fusion Technologies, pages 299-306. Springer.
  24. Russell, J. A. and Mehrabian, A. (1977). Evidence for a three-factor theory of emotions. Journal of Research in Personality, 11(3):273-294.
  25. Sánchez-Lozano, E., Lopez-Otero, P., Docio-Fernandez, L., Argones-Rúa, E., and Alba-Castro, J. L. (2013). Audiovisual three-level fusion for continuous estimation of Russell's emotion circumplex. In Proceedings of AVEC 2013, AVEC '13, pages 31-40. ACM.
  26. Saragih, J. M., Lucey, S., and Cohn, J. F. (2011). Deformable model fitting by regularized landmark mean-shift. Int. J. Comput. Vision, 91(2):200-215.
  27. Scherer, K. R., Johnstone, T., and Klasmeyer, G. (2003). Handbook of Affective Sciences - Vocal expression of emotion, chapter 23, pages 433-456. Affective Science. Oxford University Press.
  28. Scherer, S., Kane, J., Gobl, C., and Schwenker, F. (2012). Investigating fuzzy-input fuzzy-output support vector machines for robust voice quality classification. Computer Speech and Language, 27(1):263-287.
  29. Scherer, S., Schwenker, F., and Palm, G. (2008). Emotion recognition from speech using multi-classifier systems and rbf-ensembles. In Speech, Audio, Image and Biomedical Signal Processing using Neural Networks, pages 49-70. Springer Berlin Heidelberg.
  30. Schwenker, F., Scherer, S., Schmidt, M., Schels, M., and Glodek, M. (2010). Multiple classifier systems for the recognition of human emotions. In Gayar, N. E., Kittler, J., and Roli, F., editors, Proceedings of the 9th International Workshop on Multiple Classifier Systems (MCS'10), LNCS 5997, pages 315-324. Springer.
  31. Senechal, T., Rapp, V., Salam, H., Seguier, R., Bailly, K., and Prevost, L. (2012). Facial action recognition combining heterogeneous features via multikernel learning. Systems, Man, and Cybernetics, Part B: Cybernetics, IEEE Transactions on, 42(4):993-1005.
  32. Valstar, M., Schuller, B., Smith, K., Eyben, F., Jiang, B., Bilakhia, S., Schnieder, S., Cowie, R., and Pantic, M. (2013). AVEC 2013: The continuous audio/visual emotion and depression recognition challenge. In Proceedings of AVEC 2013, AVEC '13, pages 3-10. ACM.
  33. Viola, P. and Jones, M. (2001). Rapid object detection using a boosted cascade of simple features. In Computer Vision and Pattern Recognition, 2001. CVPR 2001. Proceedings of the 2001 IEEE Computer Society Conference on, volume 1, pages I-511-I-518 vol.1.
  34. Wöllmer, M., Kaiser, M., Eyben, F., Schuller, B., and Rigoll, G. (2013). LSTM-modeling of continuous emotions in an audiovisual affect recognition framework. Image and Vision Computing, 31(2):153-163. Affect Analysis In Continuous Input.
  35. Yang, S. and Bhanu, B. (2011). Facial expression recognition using emotion avatar image. In Automatic Face Gesture Recognition and Workshops (FG 2011), 2011 IEEE International Conference on, pages 866-871.
  36. Zeng, Z., Pantic, M., Roisman, G. I., and Huang, T. S. (2009). A survey of affect recognition methods: Audio, visual, and spontaneous expressions. IEEE Transactions on Pattern Analysis and Machine Intelligence, 31(1):39-58.

Paper Citation

in Harvard Style

Kächele M., Glodek M., Zharkov D., Meudt S. and Schwenker F. (2014). Fusion of Audio-visual Features using Hierarchical Classifier Systems for the Recognition of Affective States and the State of Depression. In Proceedings of the 3rd International Conference on Pattern Recognition Applications and Methods - Volume 1: ICPRAM, ISBN 978-989-758-018-5, pages 671-678. DOI: 10.5220/0004828606710678

in BibTeX Style

@conference{kachele2014fusion,
author={Markus Kächele and Michael Glodek and Dimitrij Zharkov and Sascha Meudt and Friedhelm Schwenker},
title={Fusion of Audio-visual Features using Hierarchical Classifier Systems for the Recognition of Affective States and the State of Depression},
booktitle={Proceedings of the 3rd International Conference on Pattern Recognition Applications and Methods - Volume 1: ICPRAM},
year={2014},
pages={671-678},
doi={10.5220/0004828606710678},
isbn={978-989-758-018-5},
}

in EndNote Style

JO - Proceedings of the 3rd International Conference on Pattern Recognition Applications and Methods - Volume 1: ICPRAM
TI - Fusion of Audio-visual Features using Hierarchical Classifier Systems for the Recognition of Affective States and the State of Depression
SN - 978-989-758-018-5
AU - Kächele M.
AU - Glodek M.
AU - Zharkov D.
AU - Meudt S.
AU - Schwenker F.
PY - 2014
SP - 671
EP - 678
DO - 10.5220/0004828606710678