containing three or more words, as the lexicon con-
tains 4k multi-word collocations. The unseen bigrams
are backed off using Kneser-Ney smoothing (Kneser
and Ney, 1995).
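The idea behind Kneser-Ney back-off can be illustrated with a small interpolated bigram model. The function and toy corpus below are a hypothetical sketch of the technique only, not the language model actually used in the recognizer:

```python
from collections import Counter

def kneser_ney_bigram(tokens, d=0.75):
    """Interpolated Kneser-Ney bigram model with absolute discount d.
    Returns a function p(w, v) estimating P(w | v). Illustrative sketch."""
    bigrams = Counter(zip(tokens, tokens[1:]))
    history = Counter(tokens[:-1])                 # c(v): history counts
    followers = Counter(v for (v, _w) in bigrams)  # |{w : c(v,w) > 0}|
    contexts = Counter(w for (_v, w) in bigrams)   # |{v : c(v,w) > 0}|
    n_types = len(bigrams)                         # distinct bigram types

    def p(w, v):
        # continuation probability: in how many contexts does w appear?
        p_cont = contexts[w] / n_types
        if history[v] == 0:                        # unseen history: back off fully
            return p_cont
        discounted = max(bigrams[(v, w)] - d, 0) / history[v]
        lam = d * followers[v] / history[v]        # back-off weight for history v
        return discounted + lam * p_cont

    return p
```

For a fixed seen history, the discounted mass plus the back-off term sums to one over the vocabulary, which is the property that makes the model a proper distribution.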
4.1 Experimental Results
In the experiment, the test data were transcribed
a) with and b) without the use of the SAD module.
The obtained results in terms of WER and RTF are
presented in Table 6.
They show that the SAD module benefits both the
accuracy and the speed of transcription: WER was
slightly reduced, by 0.22% absolute, and RTF increased
to almost twice the baseline value, since most of the
non-speech parts were excluded from recognition. The
RTF of the SAD module itself is around 85, which makes
its computational demands almost negligible. Note that
the presented RTF values were measured on an Intel
Core i7-3770K processor @ 3.50 GHz.
Table 6: Evaluation of the resulting SAD module in a
speech transcription system.

SAD module used   WER [%]   RTF
No                12.67     1.29
Yes               12.45     2.44
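The RTF values in Table 6 appear to follow the convention of audio duration divided by processing time, so that higher means faster than real time. Assuming that convention (an inference from the numbers, not stated explicitly here), the implied speed-up can be checked with a small helper:

```python
def rtf(audio_seconds, processing_seconds):
    """Real-time factor as apparently used above: audio duration over
    processing time (higher = faster). Assumed convention."""
    return audio_seconds / processing_seconds

# Under this reading of Table 6, one hour of audio takes
# 3600 / 2.44 s ~ 24.6 min with SAD versus 3600 / 1.29 s ~ 46.5 min
# without it, i.e. roughly a 1.9x speed-up.
```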
5 CONCLUSIONS
Various DNN-based SAD approaches were evaluated in
this paper. Our goal was to find a method suitable
for a system for the transcription of broadcast data.
The findings obtained from the evaluation process can
be summarized as follows:
• Smoothing the output of the DNN is essential, as
it removes residual misclassified frames.
• The use of training data mixed according to SNR
leads to a significant increase in detection accuracy.
• The context frame window of 25-1-25 performed
best while keeping the processing time low.
• The DNN with 128 neurons per layer proved to be a
good compromise between detection accuracy and
computational demands.
• The RTF of the final SAD module is around 80,
which makes its computational demands almost
negligible.
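Two of the findings above, output smoothing and the 25-1-25 context window, can be sketched in a few lines. The median-filter smoothing and both function names below are illustrative assumptions for exposition, not the paper's exact scheme:

```python
import numpy as np

def smooth_decisions(posteriors, threshold=0.5, win=21):
    """Threshold per-frame speech posteriors, then median-filter the
    binary labels to remove isolated misclassified frames
    (hypothetical smoothing, standing in for the paper's scheme)."""
    labels = (np.asarray(posteriors) > threshold).astype(int)
    half = win // 2
    padded = np.pad(labels, half, mode="edge")
    return np.array([int(np.median(padded[i:i + win]))
                     for i in range(len(labels))])

def stack_context(features, left=25, right=25):
    """Stack each frame with its neighbours (the 25-1-25 window),
    edge-padded at utterance boundaries, as DNN input."""
    n_frames, _dim = features.shape
    padded = np.pad(features, ((left, right), (0, 0)), mode="edge")
    return np.stack([padded[t:t + left + 1 + right].ravel()
                     for t in range(n_frames)])
```

With the default 25-1-25 window, each stacked input vector is 51 times the per-frame feature dimension, which is why the window size trades accuracy against processing time.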
The advantages of using the resulting SAD ap-
proach (based on DNNs, smoothing and the use of
artificial training data) in a speech transcription sys-
tem can be summarized as follows:
• The resulting speech recognition accuracy is com-
parable or even slightly better.
• The data are transcribed almost two times faster.
Given that the computational demands of the SAD
module itself are almost negligible, the time
savings for transcription are significant.
In our future work, we plan to investigate context-
dependent transduction models, which could better
represent the transitions between speech and non-
speech segments. Other neural network architectures,
e.g., convolutional neural networks, recurrent neural
networks, or even residual networks, could also be
employed.
ACKNOWLEDGEMENTS
This work was supported by the Technology Agency
of the Czech Republic (Project No. TA04010199) and
partly by the Student Grant Scheme 2016 of the Tech-
nical University in Liberec.
REFERENCES
Dahl, G., Yu, D., Deng, L., and Acero, A. (2012).
Context-dependent pre-trained deep neural networks
for large-vocabulary speech recognition. IEEE Trans-
actions on Audio, Speech, and Language Processing,
20(1):30–42.
Graciarena, M., Alwan, A., Ellis, D., Franco, H., Ferrer, L.,
Hansen, J. H. L., Janin, A., Lee, B. S., Lei, Y., Mi-
tra, V., Morgan, N., Sadjadi, S. O., Tsai, T. J., Schef-
fer, N., Tan, L. N., and Williams, B. (2013). All for
one: feature combination for highly channel-degraded
speech activity detection. In Bimbot, F., Cerisara, C.,
Fougeron, C., Gravier, G., Lamel, L., Pellegrino, F.,
and Perrier, P., editors, INTERSPEECH, pages 709–
713. ISCA.
Hughes, T. and Mierle, K. (2013). Recurrent neural net-
works for voice activity detection. In ICASSP, pages
7378–7382. IEEE.
Kneser, R. and Ney, H. (1995). Improved backing-off for
m-gram language modeling. In Proceedings of the
IEEE International Conference on Acoustics, Speech
and Signal Processing, volume I, pages 181–184, De-
troit, Michigan.
Ma, J. (2014). Improving the speech activity detection for
the DARPA RATS phase-3 evaluation. In Li, H., Meng,
H. M., Ma, B., Chng, E., and Xie, L., editors, INTER-
SPEECH, pages 1558–1562. ISCA.
Mateju, L., Cerva, P., and Zdansky, J. (2015). Investiga-
tion into the use of deep neural networks for LVCSR
of Czech. In Electronics, Control, Measurement, Sig-
nals and their Application to Mechatronics (ECMSM),
2015 IEEE International Workshop of, pages 1–4.
SIGMAP 2016 - International Conference on Signal Processing and Multimedia Applications