A.1 Specific Details on LSTM-based
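As a rough illustration of the BiLSTM constraint described in this subsection, the sketch below quantizes the forward and backward hidden states with one shared scale, so that their concatenation is a single consistently scaled integer vector. It then applies an 8-bit matrix multiplication whose outputs are left as wide integer accumulators. This is a minimal sketch and not the implementation used here: symmetric max-abs scaling and the helper names shared_scale, quantize, concat_bilstm_states and int8_linear are assumptions made for illustration.

```python
from typing import List, Tuple


def shared_scale(forward: List[float], backward: List[float],
                 num_bits: int = 8) -> float:
    """One scale computed over both directions together.

    Assumption: symmetric max-abs quantization; the text only requires
    that both hidden states share the same quantization parameters.
    """
    qmax = 2 ** (num_bits - 1) - 1  # 127 for 8 bits
    max_abs = max(abs(v) for v in forward + backward)
    return max_abs / qmax if max_abs > 0 else 1.0


def quantize(values: List[float], scale: float,
             num_bits: int = 8) -> List[int]:
    """Round to the nearest integer and clamp to the signed range."""
    qmax = 2 ** (num_bits - 1) - 1
    qmin = -qmax - 1
    return [max(qmin, min(qmax, round(v / scale))) for v in values]


def concat_bilstm_states(forward: List[float], backward: List[float],
                         num_bits: int = 8) -> Tuple[List[int], float]:
    """Quantize both directions with the shared scale, then concatenate."""
    scale = shared_scale(forward, backward, num_bits)
    q = quantize(forward, scale, num_bits) + quantize(backward, scale, num_bits)
    return q, scale


def int8_linear(q_x: List[int], q_w: List[List[int]]) -> List[int]:
    """Integer matrix-vector product with unquantized outputs.

    Python integers are unbounded; they stand in here for the 32-bit
    accumulators that the outputs are left in.
    """
    return [sum(w * x for w, x in zip(row, q_x)) for row in q_w]
```

With forward state [0.635, -0.5] and backward state [0.25, 0.1], the shared scale is approximately 0.005 and the concatenated integer vector is [127, -100, 50, 20]. Because both halves use the same scale, a single dequantization factor applies to the whole vector.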
For BiLSTM cells, nothing stated in section
Integer-only LSTM network is changed, except that we
constrain the forward LSTM hidden state and the
backward LSTM hidden state to share the same
quantization parameters so that they can be
concatenated into a single vector. If the model has
embedding layers, they are
quantized to 8 bits, since we found that they are not
sensitive to quantization. If the model has residual
connections (e.g. between LSTM cells), they are
quantized to 8-bit integers. In encoder-decoder models,
the attention layers are quantized following section
Integer-only attention. The model’s last fully-connected layer’s
weights are quantized to 8 bits to allow for 8-bit
matrix multiplication. However, we do not quantize the
outputs and leave them as 32-bit integers, since this is
often the point at which the model is considered to have
done its job and some postprocessing is performed (e.g. beam
