network is an effective way to train an acoustic
model.
4 CONCLUSIONS
We have presented a novel application of CNNs to phoneme recognition. We have shown how the TIMIT speech corpus can be used to generate labelled spectrogram patches for CNN-AM training. The results, whilst not surpassing the current state of the art, are encouraging, and the usability and transparency of the output processing show that CNNs are a viable approach to speech recognition. We have also carried out initial experiments with NTIMIT, which contains noise from various telephone networks and, being telephone speech, has a narrower frequency range [0, 3.3 kHz]. NTIMIT results are typically around 10% lower than those for TIMIT; however, in our preliminary tests we are within 1% of the TIMIT network's performance, which suggests that the CNN approach is considerably more noise robust. In the near future, we plan to develop strategies to acquire large volumes of phonetic transcriptions for training more robust CNN-AMs. We are also in the process of training a sequence-to-sequence language model to transform the phonetic output into text.
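For illustration, the following is a minimal sketch (not the pipeline used in this work) of how labelled spectrogram patches could be cut from a TIMIT utterance using its phone-level .PHN alignments. The window settings, patch width and helper name are assumptions chosen for the example, not values taken from the paper.

```python
# Sketch: turn one TIMIT utterance into labelled spectrogram patches for
# CNN acoustic-model training. Assumes 16 kHz TIMIT audio and .PHN files
# with "start_sample end_sample label" lines; patch width is illustrative.
import numpy as np
from scipy.io import wavfile
from scipy.signal import spectrogram

def timit_patches(wav_path, phn_path, patch_frames=23):
    rate, samples = wavfile.read(wav_path)            # TIMIT is 16 kHz, 16-bit PCM
    # 25 ms analysis window with a 10 ms hop (a common acoustic-modelling choice)
    freqs, times, spec = spectrogram(samples.astype(np.float32), fs=rate,
                                     nperseg=int(0.025 * rate),
                                     noverlap=int(0.015 * rate))
    log_spec = np.log(spec + 1e-10)                   # log-magnitude spectrogram

    patches, labels = [], []
    half = patch_frames // 2
    with open(phn_path) as f:
        for line in f:
            if not line.strip():
                continue
            start, end, phone = line.split()
            centre_s = (int(start) + int(end)) / (2 * rate)        # phone centre (s)
            centre_f = int(np.argmin(np.abs(times - centre_s)))    # nearest frame
            if half <= centre_f < log_spec.shape[1] - half:
                patches.append(log_spec[:, centre_f - half:centre_f + half + 1])
                labels.append(phone)
    return np.stack(patches), labels
```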
REFERENCES
Chen, D., Zhang, W., Xu, X., & Xing, X., 2016. Deep
networks with stochastic depth for acoustic modelling.
In Signal and Information Processing Association
Annual Summit and Conference (APSIPA), pp. 1-4.
Ciresan, D.C., Meier, U., Masci, J., Gambardella, L.,
Schmidhuber, J., 2011. Flexible, high performance
convolutional neural networks for image classification.
In Int Joint Conf Artificial Intelligence (IJCAI), vol. 22,
no. 1, pp. 1237-1242.
Fukushima, K., 1980. Neocognitron: A self-organizing
neural network model for a mechanism of pattern
recognition unaffected by shift in position. In Biol
Cybern, vol. 36, no. 4, pp. 193-202.
Garofolo, J., Lamel, L., Fisher, W., Fiscus, J., Pallett, D.,
Dahlgren, N., Zue, V., 1993. TIMIT Acoustic-Phonetic
Continuous Speech Corpus LDC93S1. Web Download,
Philadelphia: Linguistic Data Consortium.
Graves, A., Mohamed, A., Hinton, G., 2013. Speech
recognition with deep recurrent neural networks. In
IEEE Int Conf Acoust Speech Signal Process (ICASSP),
pp. 6645-6649.
Hannun, A., Case, C., Casper, J., Catanzaro, B., Diamos,
G., Elsen, E., Prenger, R. et al., 2014. Deep speech:
Scaling up end-to-end speech recognition. In arXiv
preprint arXiv:1412.5567.
Hubel, D.H., Wiesel, T.N., 1962. Receptive fields,
binocular interaction and functional architecture in cat's
visual cortex. In J Physiol (London), vol. 160, pp. 106-
154.
ImageNet Large Scale Visual Recognition Challenge
(ILSVRC), 2011, http://image-net.org/challenges/
LSVRC/2011/index.
Krizhevsky, A., Sutskever, I., Hinton, G.E., 2012. ImageNet
classification with deep convolutional neural networks.
In Adv Neural Inf Process Syst (NIPS), pp. 1097-1105.
LeCun, Y., Boser, B., Denker, J.S., Henderson, D., Howard,
R.E., Hubbard, W., Jackel, L.D., 1990. Handwritten
digit recognition with a back-propagation network. In
Adv Neural Inf Process Syst (NIPS), pp. 396-404.
Lopes, C., Perdigao, F., 2011. Phone recognition on the
TIMIT database. In Speech Technologies/Book 1, pp.
285-302.
NVIDIA DIGITS Interactive Deep Learning GPU Training
System, https://developer.nvidia.com/digits.
Paulin, M.G., 1998. A method for analysing neural
computation using receptive fields in state space. In
Neural Networks, vol. 11, no. 7, pp. 1219-1228.
Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S.,
Anguelov, D., Erhan, D., Vanhoucke, V., Rabinovich,
A., 2015. Going deeper with convolutions. In IEEE
Conf Computer Vision Pattern Recognition (CVPR),
pp. 1-9.
Shamma, S., 2001. On the role of space and time in auditory
processing. In Trends in Cognitive Sciences, vol. 5, no.
8, pp. 340–348.
Tóth, L., 2015. Phone recognition with hierarchical
convolutional deep maxout networks. In EURASIP
Journal on Audio, Speech, and Music Processing, vol.
1, pp. 1-13.
Xiong, W., Droppo, J., Huang, X., Seide, F., Seltzer, M.,
Stolcke, A., Yu, D. and Zweig, G., 2017. The Microsoft
2016 conversational speech recognition system. In
IEEE Int. Conf. on Acoustics, Speech and Signal
Processing (ICASSP), pp. 5255-5259.
Zhang, Z., Sun, Z., Liu, J., Chen, J., Huo, Z., Zhang, X.,
2016. Deep recurrent convolutional neural network:
Improving performance for speech recognition. In arXiv
preprint arXiv:1611.07174.