Stefan Glüge, Ronald Böck, Andreas Wendemuth


Emotion recognition from speech aims to determine the emotional state of a speaker from his or her voice. The most commonly used classifiers in this field are Hidden Markov Models (HMMs) and Support Vector Machines. Neither architecture is designed to model the full dynamic character of speech: HMMs capture the temporal characteristics of speech at the phoneme, word, or utterance level, but fail to learn the dynamics of the input signal on short time scales (e.g., at frame rate). The use of dynamical features (first and second derivatives of the speech features) attenuates this problem. We propose the use of Segmented-Memory Recurrent Neural Networks to learn the full spectrum of speech dynamics, which allows the dynamical features to be removed from the input data. The resulting neural network classifier is compared to HMMs trained on the reduced feature set as well as to HMMs trained on the full feature set. The networks perform comparably to the HMMs while using significantly fewer features.
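The dynamical features mentioned in the abstract are commonly computed as regression-based delta (first derivative) and delta-delta (second derivative) coefficients over a short window of frames. A minimal NumPy sketch of this standard computation follows; the window width of 2 and the 12-dimensional MFCC input are illustrative assumptions, not the paper's exact configuration:

```python
import numpy as np

def delta(features, width=2):
    """Regression-based delta coefficients over a window of +/- width frames.

    features: array of shape (n_frames, n_coeffs), e.g. MFCCs per frame.
    """
    n = features.shape[0]
    # Repeat the edge frames so every frame has a full regression window.
    padded = np.pad(features, ((width, width), (0, 0)), mode="edge")
    denom = 2 * sum(t * t for t in range(1, width + 1))
    out = np.zeros_like(features, dtype=float)
    for t in range(1, width + 1):
        # Weighted difference of frames t steps ahead and t steps behind.
        out += t * (padded[width + t:width + t + n] - padded[width - t:width - t + n])
    return out / denom

# Static features plus first (delta) and second (delta-delta) derivatives:
static = np.random.randn(100, 12)       # e.g. 12 MFCCs for 100 frames
d1 = delta(static)
d2 = delta(d1)
full = np.hstack([static, d1, d2])      # 36-dimensional "full" feature set
```

Removing d1 and d2 from the input, as proposed for the recurrent network, cuts the feature dimensionality to a third; the recurrence is then expected to recover the short-time dynamics instead.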


  1. Albornoz, E. M., Milone, D. H., and Rufiner, H. L. (2011). Spoken emotion recognition using hierarchical classifiers. Computer Speech and Language, 25(3):556-570.
  2. Baum, L., Petrie, T., Soules, G., and Weiss, N. (1970). A maximization technique occurring in the statistical analysis of probabilistic functions of Markov chains. Ann. Math. Stat., 41:164-171.
  3. Bengio, Y., Simard, P., and Frasconi, P. (1994). Learning long-term dependencies with gradient descent is difficult. IEEE Transactions on Neural Networks, 5(2):157-166.
  4. Böck, R., Hübner, D., and Wendemuth, A. (2010). Determining optimal signal features and parameters for HMM-based emotion classification. In MELECON 2010 - 15th IEEE Mediterranean Electrotechnical Conference, pages 1586-1590.
  5. Boreczky, J. S. and Wilcox, L. D. (1998). Hidden Markov model framework for video segmentation using audio and image features. In ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing, volume 6, pages 3741-3744.
  6. Burkhardt, F., Paeschke, A., Rolfes, M., Sendlmeier, W., and Weiss, B. (2005). A database of German emotional speech. In Proceedings of the 9th European Conference on Speech Communication and Technology, Lisbon, pages 1517-1520.
  7. Chen, J. and Chaudhari, N. (2009). Segmented-memory recurrent neural networks. IEEE Transactions on Neural Networks, 20(8):1267-1280.
  8. Ekman, P. (1992). Are there basic emotions? Psychological Review, 99:550-553.
  9. El Ayadi, M., Kamel, M., and Karray, F. (2007). Speech emotion recognition using Gaussian mixture vector autoregressive models. In Acoustics, Speech and Signal Processing, 2007. ICASSP 2007. IEEE International Conference on, volume 4, pages 957-960. IEEE.
  10. El Ayadi, M., Kamel, M. S., and Karray, F. (2011). Survey on speech emotion recognition: Features, classification schemes, and databases. Pattern Recognition, 44(3):572-587.
  11. Elman, J. L. (1990). Finding structure in time. Cognitive Science, 14(2):179-211.
  12. Fant, G. (1960). Acoustic theory of speech production. Mounton, The Hague.
  13. Ganchev, T., Fakotakis, N., and Kokkinakis, G. (2005). Comparative evaluation of various MFCC implementations on the speaker verification task. In Proc. of the SPECOM, pages 191-194.
  14. Glüge, S., Böck, R., and Wendemuth, A. (2010a). Implicit sequence learning - a case study with a 4-2-4 encoder simple recurrent network. In Proceedings of the International Conference on Fuzzy Computation and 2nd International Conference on Neural Computation, pages 279-288.
  15. Glüge, S., Hamid, O. H., and Wendemuth, A. (2010b). A simple recurrent network for implicit learning of temporal sequences. Cognitive Computation, 2(4):265-271.
  16. Grimm, M., Kroschel, K., Mower, E., and Narayanan, S. (2007). Primitives-based evaluation and estimation of emotions in speech. Speech Communication, 49(10-11):787-800.
  17. Hermansky, H. (1990). Perceptual linear predictive (PLP) analysis of speech. Journal of the Acoustical Society of America, 87(4):1738-1752.
  18. Hitch, G. J., Burgess, N., Towse, J. N., and Culpin, V. (1996). Temporal grouping effects in immediate recall: A working memory analysis. Quarterly Journal of Experimental Psychology Section A: Human Experimental Psychology, 49(1):116-139.
  19. Hochreiter, S. (1998). The vanishing gradient problem during learning recurrent neural nets and problem solutions. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, 6(2):107-116.
  20. Hübner, D., Vlasenko, B., Grosser, T., and Wendemuth, A. (2010). Determining optimal features for emotion recognition from speech by applying an evolutionary algorithm. In INTERSPEECH 2010, pages 2358-2361.
  21. Inoue, T., Nakagawa, R., Kondou, M., Koga, T., and Shinohara, K. (2011). Discrimination between mothers' infant- and adult-directed speech using hidden Markov models. Neuroscience Research, 70(1):62-70.
  22. Kim, W. and Hansen, J. (2010). Angry emotion detection from real-life conversational speech by leveraging content structure. In Acoustics Speech and Signal Processing (ICASSP), 2010 IEEE International Conference on, pages 5166-5169.
  23. Mehrabian, A. (1996). Pleasure-Arousal-Dominance: A General Framework for Describing and Measuring Individual Differences in Temperament. Current Psychology, 14(4):261-292.
  24. Müller, M. (2007). Information Retrieval for Music and Motion. Springer Verlag.
  25. Nicholson, J., Takahashi, K., and Nakatsu, R. (1999). Emotion recognition in speech using neural networks. In Neural Information Processing, 1999. Proceedings. ICONIP '99. 6th International Conference on, volume 2, pages 495-501.
  26. Nwe, T. L., Foo, S. W., and Silva, L. C. D. (2003). Speech emotion recognition using hidden Markov models. Speech Communication, 41(4):603-623.
  27. Petrushin, V. A. (2000). Emotion recognition in speech signal: experimental study, development, and application. In Proceedings of the ICSLP 2000, volume 2, pages 222-225.
  28. Pierre-Yves, O. (2003). The production and recognition of emotions in speech: features and algorithms. International Journal of Human-Computer Studies, 59(1-2):157-183. Applications of Affective Computing in Human-Computer Interaction.
  29. Scherer, S., Oubbati, M., Schwenker, F., and Palm, G. (2008). Real-time emotion recognition using echo state networks. In André, E., Dybkjr, L., Minker, W., Neumann, H., Pieraccini, R., and Weber, M., editors, Perception in Multimodal Dialogue Systems, volume 5078 of Lecture Notes in Computer Science, pages 200-204. Springer Berlin / Heidelberg.
  30. Schmidt, M., Schels, M., and Schwenker, F. (2010). A hidden Markov model based approach for facial expression recognition in image sequences. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), volume 5998 LNAI, pages 149-160. Springer.
  31. Schuller, B., Batliner, A., Steidl, S., and Seppi, D. (2011). Recognising realistic emotions and affect in speech: State of the art and lessons learnt from the first challenge. Speech Communication. Article in Press.
  32. Schuller, B., Rigoll, G., and Lang, M. (2004). Speech emotion recognition combining acoustic features and linguistic information in a hybrid support vector machine-belief network architecture. In Acoustics, Speech, and Signal Processing, 2004. Proceedings. (ICASSP '04). IEEE International Conference on, volume 1, pages I-577-I-580.
  33. Schuller, B., Steidl, S., and Batliner, A. (2009a). The interspeech 2009 emotion challenge. In Tenth Annual Conference of the International Speech Communication Association, pages 312-315.
  34. Schuller, B., Vlasenko, B., Eyben, F., Rigoll, G., and Wendemuth, A. (2009b). Acoustic emotion recognition: A benchmark comparison of performances. In Automatic Speech Recognition & Understanding, 2009. ASRU 2009. IEEE Workshop on, pages 552-557. IEEE.
  35. Severin, F. T. and Rigby, M. K. (1963). Influence of digit grouping on memory for telephone numbers. Journal of Applied Psychology, 47(2):117-119.
  36. Song, M., You, M., Li, N., and Chen, C. (2008). A robust multimodal approach for emotion recognition. Neurocomputing, 71(10-12):1913-1920.
  37. Trentin, E., Scherer, S., and Schwenker, F. (2010). Maximum echo-state-likelihood networks for emotion recognition. In Schwenker, F. and El Gayar, N., editors, Artificial Neural Networks in Pattern Recognition, volume 5998 of Lecture Notes in Computer Science, pages 60-71. Springer Berlin / Heidelberg.
  38. Tuzlukov, V. P. (2000). Birkhäuser, Boston.
  39. Čerňanský, M. and Beňušková, Ľ. (2003). Simple recurrent network trained by RTRL and extended Kalman filter algorithms. Neural Network World, 13(3):223-234.
  40. Ververidis, D. and Kotropoulos, C. (2006). Emotional speech recognition: Resources, features, and methods. Speech Communication, 48(9):1162-1181.
  41. Viterbi, A. J. (1967). Error bounds for convolutional codes and an asymptotically optimal decoding algorithm. IEEE Transactions on Information Theory, 13:260-269.
  42. Vlasenko, B., Schuller, B., Wendemuth, A., and Rigoll, G. (2008). On the influence of phonetic content variation for acoustic emotion recognition. In Perception in Multimodal Dialogue Systems, volume 5078 of Lecture Notes in Computer Science, pages 217-220. Springer Berlin / Heidelberg.
  43. Vlasenko, B. and Wendemuth, A. (2009a). Heading toward to the natural way of human-machine interaction: the NIMITEK project. In Multimedia and Expo, 2009. ICME 2009. IEEE International Conference on, pages 950-953.
  44. Vlasenko, B. and Wendemuth, A. (2009b). Processing affected speech within human machine interaction. In INTERSPEECH-2009, volume 3, pages 2039-2042, Brighton.
  45. Williams, R. J. and Zipser, D. (1995). Gradient-based learning algorithms for recurrent networks and their computational complexity, pages 433-486. L. Erlbaum Associates Inc., Hillsdale, NJ, USA.
  46. Wöllmer, M., Eyben, F., Reiter, S., Schuller, B., Cox, C., Douglas-Cowie, E., and Cowie, R. (2008). Abandoning emotion classes - towards continuous emotion recognition with modelling of long-range dependencies. In INTERSPEECH-2008, pages 597-600.
  47. Yingthawornsuk, T. and Shiavi, R. (2008). Distinguishing depression and suicidal risk in men using GMM based frequency contents of affective vocal tract response. In Control, Automation and Systems, 2008. ICCAS 2008. International Conference on, pages 901-904.
  48. Young, S. J., Evermann, G., Gales, M. J. F., Hain, T., Kershaw, D., Moore, G., Odell, J., Ollason, D., Povey, D., Valtchev, V., and Woodland, P. C. (2006). The HTK Book, version 3.4. Cambridge University Engineering Department, Cambridge, UK.

Paper Citation

in Harvard Style

Glüge S., Böck R. and Wendemuth A. (2011). SEGMENTED-MEMORY RECURRENT NEURAL NETWORKS VERSUS HIDDEN MARKOV MODELS IN EMOTION RECOGNITION FROM SPEECH. In Proceedings of the International Conference on Neural Computation Theory and Applications - Volume 1: NCTA, (IJCCI 2011) ISBN 978-989-8425-84-3, pages 308-315. DOI: 10.5220/0003644003080315

in Bibtex Style

@conference{gluge2011segmented,
author={Stefan Glüge and Ronald Böck and Andreas Wendemuth},
title={Segmented-Memory Recurrent Neural Networks versus Hidden Markov Models in Emotion Recognition from Speech},
booktitle={Proceedings of the International Conference on Neural Computation Theory and Applications - Volume 1: NCTA, (IJCCI 2011)},
year={2011},
pages={308-315},
doi={10.5220/0003644003080315},
isbn={978-989-8425-84-3},
}

in EndNote Style

TY - CONF
TI - Segmented-Memory Recurrent Neural Networks versus Hidden Markov Models in Emotion Recognition from Speech
JO - Proceedings of the International Conference on Neural Computation Theory and Applications - Volume 1: NCTA, (IJCCI 2011)
SN - 978-989-8425-84-3
AU - Glüge S.
AU - Böck R.
AU - Wendemuth A.
PY - 2011
SP - 308
EP - 315
DO - 10.5220/0003644003080315