A APPENDIX
A.1 Specific Details on LSTM-based Models
For BiLSTM cells, everything stated in section Integer-only LSTM network still applies, except that we enforce the forward LSTM hidden state $\overrightarrow{h}_t$ and the backward LSTM hidden state $\overleftarrow{h}_t$ to share the same quantization parameters so that they can be concatenated into a single vector. If the model has embedding layers, they are quantized to 8-bit, as we found them insensitive to quantization. If the model has residual connections (e.g. between LSTM cells), they are quantized to 8-bit integers. In encoder-decoder models, the attention layers are quantized following section Integer-only attention. The weights of the model's last fully-connected layer are quantized to 8-bit to allow for 8-bit matrix multiplication. However, we do not quantize its outputs and leave them as 32-bit integers, since this is typically the point where the model's work is considered done and some postprocessing takes over (e.g. beam search).
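To make these choices concrete, the following is a minimal sketch (not the paper's implementation) of two of the details above: sharing one set of quantization parameters between the forward and backward hidden states before concatenation, and quantizing the last fully-connected layer's weights to 8-bit while leaving its integer outputs unquantized. The helper names (`qparams`, `quantize`) and all tensor shapes are hypothetical; the scheme is a standard asymmetric uniform (affine) 8-bit quantization.

```python
# Minimal sketch (hypothetical helpers, not the paper's code) of two appendix details:
# shared quantization parameters for the BiLSTM hidden states, and an 8-bit-weight
# final layer whose integer outputs are left unquantized.
import torch

def qparams(x_min, x_max, n_bits=8):
    """Asymmetric uniform (affine) quantization parameters covering [x_min, x_max]."""
    qmax = 2 ** n_bits - 1
    x_min, x_max = min(x_min, 0.0), max(x_max, 0.0)  # keep zero exactly representable
    scale = max((x_max - x_min) / qmax, 1e-8)
    zero_point = int(round(-x_min / scale))
    return scale, zero_point

def quantize(x, scale, zero_point, n_bits=8):
    q = torch.clamp(torch.round(x / scale) + zero_point, 0, 2 ** n_bits - 1)
    return q.to(torch.int64)  # int64 here for portability of the integer matmul below

# (1) One (scale, zero_point) pair shared by both directions, so the two quantized
#     hidden states live on the same integer grid and can be concatenated directly.
h_fwd = torch.randn(4, 16)   # forward hidden state  \overrightarrow{h}_t (hypothetical size)
h_bwd = torch.randn(4, 16)   # backward hidden state \overleftarrow{h}_t
lo = float(torch.min(h_fwd.min(), h_bwd.min()))
hi = float(torch.max(h_fwd.max(), h_bwd.max()))
scale_h, zp_h = qparams(lo, hi)
h_cat_q = torch.cat([quantize(h_fwd, scale_h, zp_h),
                     quantize(h_bwd, scale_h, zp_h)], dim=-1)

# (2) 8-bit weights for the last fully-connected layer. The integer matrix multiply
#     accumulates into wide integers, and those outputs are handed to postprocessing
#     (e.g. beam search) without being requantized.
w = torch.randn(100, 32)     # output-projection weights (hypothetical vocabulary of 100)
scale_w, zp_w = qparams(float(w.min()), float(w.max()))
w_q = quantize(w, scale_w, zp_w)
logits_int = (h_cat_q - zp_h) @ (w_q - zp_w).t()   # integer accumulators, not requantized
logits = logits_int.float() * (scale_h * scale_w)  # optional rescale for inspection only
```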