uninfluenced by the effects of the normalization.
The DNN models appear insensitive to the spectral subtraction artifacts contained in the "Denoised" dataset; their performance is close to the level achieved on undistorted data. The best GMM:BNC system falls behind by 5.7%, owing to the reduced effectiveness of the CMLLR adaptation: the test data consist of very short sentences (3–10 s long), which provide too little material for a reliable estimation of the CMLLR transform.
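As a rough, illustrative calculation (ours, using typical values rather than the paper's exact setup): a full CMLLR transform for d-dimensional features has d(d+1) free parameters, which quickly exceeds what a few hundred frames can support.

    # Back-of-the-envelope data-sufficiency check for CMLLR estimation.
    # Illustrative sketch; the feature dimension and frame rate are
    # typical values (39-dim features, 10 ms frame shift), not the
    # paper's exact configuration.
    feat_dim = 39
    n_params = feat_dim * (feat_dim + 1)  # d x d transform plus d-dim bias

    frame_rate = 100  # frames per second at a 10 ms shift
    for seconds in (3, 10):
        frames = seconds * frame_rate
        print(f"{seconds:2d} s utterance: {frames} frames vs. "
              f"{n_params} parameters ({frames / n_params:.2f} frames/parameter)")

Even the longest test sentences supply well under one frame per transform parameter, which is consistent with the observed drop in adaptation effectiveness.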
The nonlinear analog amplification (and potential clipping) in the "Lecture" dataset is very harmful to both types of models and to all feature configurations. Additional robust recognition techniques need to be employed for this type of distorted data; a partial solution could be offered, e.g., by the clipping removal proposed in (Eaton and Naylor, 2013).
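For a concrete (if simplistic) picture of the problem, the sketch below flags signals whose samples pile up at the amplitude rails; this is a naive threshold check of our own devising, not the detection method of (Eaton and Naylor, 2013).

    import numpy as np

    def clipped_fraction(signal, rail=0.99):
        """Fraction of samples whose magnitude reaches a given fraction
        of the observed peak; a crude indicator of hard clipping."""
        peak = np.max(np.abs(signal))
        if peak == 0.0:
            return 0.0
        return float(np.mean(np.abs(signal) >= rail * peak))

    # Usage sketch: simulate a saturating amplifier and compare.
    rng = np.random.default_rng(0)
    clean = 0.5 * rng.standard_normal(16000)  # 1 s of noise-like audio
    clipped = np.clip(clean, -0.4, 0.4)       # hard clipping at the rails
    print(clipped_fraction(clean))            # close to 0
    print(clipped_fraction(clipped))          # a large fraction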
6 CONCLUSIONS
We investigated the robustness of bottleneck-based systems with feature adaptation against nonlinear distortions of speech. We showed that the bottleneck features are more robust than conventional MFCCs. On most of the considered datasets, the bottleneck-based GMM models, adapted to the given distortions, achieve performance comparable to the DNN models. However, the BNC-based systems are much more demanding computationally, which limits their practical use.
The most robust acoustic model in our experiments was the DNN model using FBC input features. This is in accord with the results reported for clean speech in the literature: low-level frequency-domain features are a more suitable input for DNN systems than conventional MFCCs.
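For clarity, FBC features are essentially log mel filter-bank energies, i.e., the classic MFCC pipeline of Davis and Mermelstein (1980) with the final DCT omitted. The sketch below uses a standard construction with typical parameter values, not the paper's exact configuration.

    import numpy as np

    def log_mel_fbc(frame, sr=16000, n_fft=512, n_mels=40):
        """Log mel filter-bank coefficients (FBC): power spectrum pooled
        by triangular mel filters and log-compressed; no final DCT."""
        hz2mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
        mel2hz = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)

        # Windowed power spectrum of one short frame (e.g., 25 ms).
        spec = np.abs(np.fft.rfft(frame * np.hanning(len(frame)), n_fft)) ** 2

        # Triangular filters spaced uniformly on the mel scale.
        mel_pts = np.linspace(hz2mel(0.0), hz2mel(sr / 2.0), n_mels + 2)
        bins = np.floor((n_fft + 1) * mel2hz(mel_pts) / sr).astype(int)
        fbank = np.zeros((n_mels, n_fft // 2 + 1))
        for i in range(n_mels):
            l, c, r = bins[i], bins[i + 1], bins[i + 2]
            fbank[i, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
            fbank[i, c:r] = (r - np.arange(c, r)) / max(r - c, 1)

        return np.log(fbank @ spec + 1e-10)

    # Usage sketch: one 25 ms frame of a synthetic 440 Hz tone.
    frame = np.sin(2 * np.pi * 440.0 * np.arange(400) / 16000.0)
    print(log_mel_fbc(frame).shape)  # (40,)

Skipping the decorrelating DCT preserves the local spectral structure, which DNNs can exploit directly.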
ACKNOWLEDGEMENTS
This work was supported by the Technology Agency
of the Czech Republic (Project No. TA04010199) and
partly by the Student Grant Scheme 2016 of the Technical University of Liberec.
REFERENCES
Bergstra, J., Breuleux, O., Bastien, F., Lamblin, P., Pas-
canu, R., Desjardins, G., Turian, J., Warde-Farley, D.,
and Bengio, Y. (2010). Theano: A CPU and GPU math expression compiler. In Proceedings of the Python for Scientific Computing Conference (SciPy), volume 4, page 3. Austin, TX.
Dahl, G. E., Sainath, T. N., and Hinton, G. E. (2013). Improving deep neural networks for LVCSR using rectified linear units and dropout. In Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on, pages 8609–8613. IEEE.
Dahl, G. E., Yu, D., Deng, L., and Acero, A. (2012).
Context-dependent pre-trained deep neural networks
for large-vocabulary speech recognition. Audio,
Speech, and Language Processing, IEEE Transactions
on, 20(1):30–42.
Davis, S. B. and Mermelstein, P. (1980). Comparison
of parametric representations for monosyllabic word
recognition in continuously spoken sentences. Acous-
tics, Speech and Signal Processing, IEEE Transac-
tions on, 28(4):357–366.
Delcroix, M., Kubo, Y., Nakatani, T., and Nakamura, A.
(2013). Is speech enhancement pre-processing still
relevant when using deep neural networks for acoustic
modeling? In INTERSPEECH, pages 2992–2996.
Deng, L., Li, J., Huang, J.-T., Yao, K., Yu, D., Seide, F., Seltzer, M., Zweig, G., He, X., Williams, J., et al. (2013). Recent advances in deep learning for speech research at Microsoft. In Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on, pages 8604–8608. IEEE.
Deng, L., Seltzer, M. L., Yu, D., Acero, A., Mohamed,
A.-R., and Hinton, G. E. (2010). Binary coding of
speech spectrograms using a deep auto-encoder. In INTERSPEECH, pages 1692–1695.
Eaton, J. and Naylor, P. A. (2013). Detection of clipping
in coded speech signals. In Signal Processing Confer-
ence (EUSIPCO), 2013 Proceedings of the 21st Euro-
pean, pages 1–5. IEEE.
Gales, M. J. (1998). Maximum likelihood linear transformations for HMM-based speech recognition. Computer Speech & Language, 12(2):75–98.
Grézl, F. and Fousek, P. (2008). Optimizing bottle-neck features for LVCSR. In Acoustics, Speech and Signal Processing, 2008. ICASSP 2008. IEEE International Conference on, pages 4729–4732. IEEE.
Grézl, F., Karafiát, M., Kontár, S., and Černocký, J. (2007). Probabilistic and bottle-neck features for LVCSR of meetings. In Acoustics, Speech and Signal Processing, 2007. ICASSP 2007. IEEE International Conference on, volume 4, pages IV-757. IEEE.
Heigold, G., Vanhoucke, V., Senior, A., Nguyen, P., Ran-
zato, M., Devin, M., and Dean, J. (2013). Multi-
lingual acoustic models using distributed deep neural
networks. In Acoustics, Speech and Signal Process-
ing (ICASSP), 2013 IEEE International Conference
on, pages 8619–8623. IEEE.
Jolliffe, I. (2002). Principal component analysis. Wiley
Online Library.
Kneser, R. and Ney, H. (1995). Improved backing-off for
m-gram language modeling. In Acoustics, Speech,
and Signal Processing, 1995. ICASSP-95., 1995 In-
ternational Conference on, volume 1, pages 181–184.
IEEE.