Speaker State Recognition with Neural Network-based Classification and Self-adaptive Heuristic Feature Selection

Maxim Sidorov, Christina Brester, Eugene Semenkin, Wolfgang Minker

2014

Abstract

While existing feature sets and methods for automatic speaker state analysis already achieve reasonable results, there is still considerable room for improvement. In this research, we carried out speech analysis with a self-adaptive multi-objective genetic algorithm as the feature selection technique and a neural network as the classifier. The proposed approach was evaluated on a number of multi-language speech databases (English, German and Japanese). According to the obtained results, the developed technique increases emotion recognition performance by up to a 6.2% relative improvement in average F-measure, speaker identification performance by up to 112.0%, and speech-based gender recognition performance by up to 6.4%, while using approximately half as many features.
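The abstract outlines the core pipeline: a genetic algorithm evolves binary feature masks, and a neural network trained on each candidate subset supplies the fitness signal. The sketch below is a minimal, hypothetical illustration of that idea in Python. It simplifies the paper's self-adaptive multi-objective algorithm to a plain generational GA and scalarizes the two objectives (classification F-measure and feature-subset size) into a single penalized score; all names and parameters (population size, mutation rate, MLP topology) are illustrative assumptions, not the authors' settings.

# Hypothetical sketch: GA-driven feature selection around an MLP classifier.
# Simplification of the paper's self-adaptive multi-objective GA: a plain
# generational GA with a scalarized fitness (macro F-measure minus a
# feature-count penalty). All parameters here are illustrative.
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score

rng = np.random.default_rng(0)

def fitness(mask, X_tr, X_te, y_tr, y_te):
    """Macro F-measure of an MLP trained on the selected features,
    penalized by the fraction of features kept."""
    if not mask.any():
        return 0.0
    clf = MLPClassifier(hidden_layer_sizes=(50,), max_iter=300, random_state=0)
    clf.fit(X_tr[:, mask], y_tr)
    f1 = f1_score(y_te, clf.predict(X_te[:, mask]), average="macro")
    return f1 - 0.1 * mask.mean()  # trade accuracy off against subset size

def select_features(X, y, pop_size=20, generations=15, p_mut=0.02):
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
    n = X.shape[1]
    pop = rng.random((pop_size, n)) < 0.5          # random binary masks
    for _ in range(generations):
        scores = np.array([fitness(ind, X_tr, X_te, y_tr, y_te) for ind in pop])
        # binary tournament selection of parents
        idx = rng.integers(0, pop_size, (pop_size, 2))
        parents = pop[np.where(scores[idx[:, 0]] > scores[idx[:, 1]],
                               idx[:, 0], idx[:, 1])]
        # uniform crossover between consecutive parents, then bit-flip mutation
        cross = rng.random((pop_size, n)) < 0.5
        children = np.where(cross, parents, np.roll(parents, 1, axis=0))
        children ^= rng.random((pop_size, n)) < p_mut
        # elitism: carry the best individual of this generation forward
        children[0] = pop[scores.argmax()]
        pop = children
    scores = np.array([fitness(ind, X_tr, X_te, y_tr, y_te) for ind in pop])
    return pop[scores.argmax()]                    # best feature mask found

Given an acoustic feature matrix X and class labels y, select_features(X, y) returns a boolean mask over the feature columns. In the paper's actual setting the two objectives would be kept separate and a Pareto front maintained rather than collapsed into one score, and the GA's own parameters would be adapted during the run.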



Paper Citation


in Harvard Style

Sidorov M., Brester C., Semenkin E. and Minker W. (2014). Speaker State Recognition with Neural Network-based Classification and Self-adaptive Heuristic Feature Selection. In Proceedings of the 11th International Conference on Informatics in Control, Automation and Robotics - Volume 1: ICINCO, ISBN 978-989-758-039-0, pages 699-703. DOI: 10.5220/0005049706990703


in Bibtex Style

@conference{icinco14,
author={Maxim Sidorov and Christina Brester and Eugene Semenkin and Wolfgang Minker},
title={Speaker State Recognition with Neural Network-based Classification and Self-adaptive Heuristic Feature Selection},
booktitle={Proceedings of the 11th International Conference on Informatics in Control, Automation and Robotics - Volume 1: ICINCO},
year={2014},
pages={699-703},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0005049706990703},
isbn={978-989-758-039-0},
}


in EndNote Style

TY - CONF
JO - Proceedings of the 11th International Conference on Informatics in Control, Automation and Robotics - Volume 1: ICINCO
TI - Speaker State Recognition with Neural Network-based Classification and Self-adaptive Heuristic Feature Selection
SN - 978-989-758-039-0
AU - Sidorov M.
AU - Brester C.
AU - Semenkin E.
AU - Minker W.
PY - 2014
SP - 699
EP - 703
DO - 10.5220/0005049706990703