Unconstrained Speech Segmentation using Deep Neural Networks

Van Zyl van Vuuren, Louis ten Bosch, Thomas Niesler

2015

Abstract

We propose a method for improving the unconstrained segmentation of speech into phoneme-like units using deep neural networks. The proposed approach does not depend on acoustic models or forced alignment, but operates on the acoustic features directly. Previous solutions of this type tended to hypothesise additional, incorrect phoneme boundaries near the phoneme transitions. We show that deep neural networks reduce this over-segmentation substantially and achieve improved segmentation accuracy. Furthermore, we find that generative pre-training offers an additional benefit.

Paper Citation


in Harvard Style

van Vuuren V. Z., ten Bosch L. and Niesler T. (2015). Unconstrained Speech Segmentation using Deep Neural Networks. In Proceedings of the International Conference on Pattern Recognition Applications and Methods - Volume 1: ICPRAM, ISBN 978-989-758-076-5, pages 248-254. DOI: 10.5220/0005201802480254


in Bibtex Style

@conference{icpram15,
author={Van Zyl van Vuuren and Louis ten Bosch and Thomas Niesler},
title={Unconstrained Speech Segmentation using Deep Neural Networks},
booktitle={Proceedings of the International Conference on Pattern Recognition Applications and Methods - Volume 1: ICPRAM},
year={2015},
pages={248-254},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0005201802480254},
isbn={978-989-758-076-5},
}


in EndNote Style

TY - CONF
JO - Proceedings of the International Conference on Pattern Recognition Applications and Methods - Volume 1: ICPRAM
TI - Unconstrained Speech Segmentation using Deep Neural Networks
SN - 978-989-758-076-5
AU - van Vuuren V. Z.
AU - ten Bosch L.
AU - Niesler T.
PY - 2015
SP - 248
EP - 254
DO - 10.5220/0005201802480254