Silent Speech for Human-Computer Interaction

João Freitas, António Teixeira, Miguel Sales Dias

2014

Abstract

A Silent Speech Interface (SSI) performs Automatic Speech Recognition (ASR) in the absence of an intelligible acoustic signal and can be used as a human-computer interface modality in high-background-noise environments such as living rooms, or to aid speech-impaired individuals such as elderly persons. By acquiring data from elements of the human speech production process (glottal and articulator activity, their neural pathways, or the central nervous system), an SSI produces an alternative digital representation of speech, which can be recognized and interpreted as data, synthesized directly or routed into a communications network. Conventional ASR systems rely only on acoustic information, which makes them susceptible to environmental noise, raises privacy and information-disclosure concerns, and excludes users with speech impairments. To tackle these problems in the context of ASR for human-computer interaction, we propose a novel multimodal SSI for European Portuguese (EP), a language for which no SSI has yet been developed. After a state-of-the-art assessment, we selected less invasive modalities (Vision, Surface Electromyography and Ultrasound) in order to obtain a more complete representation of the human speech production model. Our aim is now to develop a multimodal SSI prototype adapted to EP and evaluate its usability in real-world scenarios.
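The abstract describes a generic SSI pipeline: signals from the speech production process are converted into feature vectors and then recognized as words or commands. As an illustration only, and not the authors' system, the following minimal Python sketch shows feature-level fusion of two hypothetical modality streams (e.g. surface EMG and video-derived lip features) feeding a single classifier; the feature dimensions, the fusion by concatenation and the scikit-learn SVM are all assumptions made for demonstration.

```python
# Illustrative sketch of feature-level fusion for a multimodal silent
# speech interface. Dimensions, data and classifier choice are
# assumptions for demonstration, not the authors' design.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC


def fuse_features(emg_feats: np.ndarray, video_feats: np.ndarray) -> np.ndarray:
    """Concatenate per-utterance feature vectors from two modalities."""
    return np.concatenate([emg_feats, video_feats], axis=1)


# Hypothetical training data: 100 utterances, 32 EMG features and
# 24 video (lip/face) features per utterance, 10 word classes.
rng = np.random.default_rng(0)
emg_train = rng.normal(size=(100, 32))
video_train = rng.normal(size=(100, 24))
labels = rng.integers(0, 10, size=100)

# Train a single classifier on the fused feature vectors.
clf = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
clf.fit(fuse_features(emg_train, video_train), labels)

# Recognition: classify a new silent utterance from its fused features.
emg_test = rng.normal(size=(1, 32))
video_test = rng.normal(size=(1, 24))
print(clf.predict(fuse_features(emg_test, video_test)))
```

In a real system each modality would have its own feature extraction front end (and possibly its own model combined at the decision level), but this sketch conveys the basic idea of combining complementary speech-production measurements into one recognizer.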



Paper Citation


in Harvard Style

Freitas, J., Teixeira, A. and Dias, M. (2014). Silent Speech for Human-Computer Interaction. In Doctoral Consortium - DCBIOSTEC (BIOSTEC 2014), ISBN Not Available, pages 18-27.


in Bibtex Style

@conference{dcbiostec14,
author={João Freitas and António Teixeira and Miguel Sales Dias},
title={Silent Speech for Human-Computer Interaction},
booktitle={Doctoral Consortium - DCBIOSTEC, (BIOSTEC 2014)},
year={2014},
pages={18-27},
publisher={SciTePress},
organization={INSTICC},
doi={},
isbn={Not Available},
}


in EndNote Style

TY - CONF
JO - Doctoral Consortium - DCBIOSTEC, (BIOSTEC 2014)
TI - Silent Speech for Human-Computer Interaction
SN - Not Available
AU - Freitas J.
AU - Teixeira A.
AU - Dias M.
PY - 2014
SP - 18
EP - 27
DO -