Authors:
Eyyüb Sari, Vanessa Courville and Vahid Partovi Nia
Affiliation:
Huawei Noah’s Ark Lab, Canada
Keyword(s):
Recurrent Neural Network, LSTM, Model Compression, Quantization, NLP, ASR.
Abstract:
Recurrent neural networks (RNN) are used in many real-world text and speech applications. They include
complex modules such as recurrence, exponential-based activation, gate interaction, unfoldable normalization,
bi-directional dependence, and attention. The interaction between these elements prevents running them
on integer-only operations without a significant performance drop. Deploying RNNs that include layer
normalization and attention on integer-only arithmetic is still an open problem. We present a quantization-aware
training method for obtaining a highly accurate integer-only recurrent neural network (iRNN). Our approach
supports layer normalization, attention, and an adaptive piecewise linear approximation of activations (PWL),
to serve a wide range of RNNs on various applications. The proposed method is proven to work on RNN-based language models and challenging automatic speech recognition, enabling AI applications on the edge. Our iRNN maintains performance similar to that of its full-precision counterpart; deploying it on smartphones improves runtime by 2× and reduces the model size by 4×.
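As context for the piecewise linear (PWL) activation approximation named in the abstract, the following is a minimal sketch of fitting and evaluating a PWL approximation of the sigmoid activation. It assumes uniformly spaced breakpoints and float arithmetic for clarity; the paper's adaptive method selects breakpoints differently and ultimately maps the slopes and intercepts to integer-only arithmetic, so this is an illustration, not the authors' exact algorithm.

```python
# Sketch: PWL approximation of sigmoid with uniform breakpoints (an assumption;
# the adaptive PWL in the paper places breakpoints differently and quantizes
# slopes/intercepts for integer-only inference).
import numpy as np

def fit_pwl(fn, lo=-8.0, hi=8.0, num_pieces=16):
    """Fit per-segment slopes and intercepts of a PWL approximation of `fn`."""
    knots = np.linspace(lo, hi, num_pieces + 1)
    ys = fn(knots)
    slopes = (ys[1:] - ys[:-1]) / (knots[1:] - knots[:-1])
    intercepts = ys[:-1] - slopes * knots[:-1]
    return knots, slopes, intercepts

def pwl_eval(x, knots, slopes, intercepts):
    """Evaluate the PWL function: pick the segment for each x, then a*x + b."""
    idx = np.clip(np.searchsorted(knots, x, side="right") - 1, 0, len(slopes) - 1)
    return slopes[idx] * x + intercepts[idx]

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
knots, a, b = fit_pwl(sigmoid)
x = np.linspace(-8.0, 8.0, 1001)
print("max abs error:", np.abs(pwl_eval(x, knots, a, b) - sigmoid(x)).max())
```

Replacing the exponential-based activations with such segment lookups is what allows the recurrent cell to run on integer-only operations once the slopes, intercepts, and breakpoints are themselves quantized.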