Dai, Z., Yang, Z., Yang, Y., Carbonell, J., Le, Q. V., and Salakhutdinov, R. (2019). Transformer-XL: Attentive language models beyond a fixed-length context.
Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. (2019). BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.
Ho, J., Kalchbrenner, N., Weissenborn, D., and Salimans, T. (2019). Axial attention in multidimensional transformers.
Katharopoulos, A., Vyas, A., Pappas, N., and Fleuret, F. (2020). Transformers are RNNs: Fast autoregressive transformers with linear attention.
Kitaev, N., Kaiser, Ł., and Levskaya, A. (2020). Reformer: The efficient transformer.
Lee, J., Lee, Y., Kim, J., Kosiorek, A. R., Choi, S., and Teh, Y. W. (2019). Set transformer: A framework for attention-based permutation-invariant neural networks.
Lepikhin, D., Lee, H., Xu, Y., Chen, D., Firat, O., Huang, Y., Krikun, M., Shazeer, N., and Chen, Z. (2020). GShard: Scaling giant models with conditional computation and automatic sharding.
Lin, J., Sun, X., Ren, X., Li, M., and Su, Q. (2018a). Learning when to concentrate or divert attention: Self-adaptive attention temperature for neural machine translation.
Lin, J., Sun, X., Ren, X., Ma, S., Su, J., and Su, Q. (2018b). Deconvolution-based global decoding for neural machine translation.
Liu, P. J., Saleh, M., Pot, E., Goodrich, B., Sepassi, R., Kaiser, L., and Shazeer, N. (2018). Generating Wikipedia by summarizing long sequences.
Luong, M.-T. and Manning, C. D. (2015). Stanford neural machine translation systems for spoken language domains. In Proceedings of the International Workshop on Spoken Language Translation (IWSLT 2015), Da Nang, Vietnam.
Micikevicius, P., Narang, S., Alben, J., Diamos, G., Elsen, E., Garcia, D., Ginsburg, B., Houston, M., Kuchaiev, O., Venkatesh, G., and Wu, H. (2017). Mixed precision training.
Nguyen, T. Q. and Salazar, J. (2019). Transformers without tears: Improving the normalization of self-attention. CoRR.
Ott, M., Edunov, S., Grangier, D., and Auli, M. (2018). Scaling neural machine translation.
Papineni, K., Roukos, S., Ward, T., and Zhu, W.-J. (2002). Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318, Philadelphia, Pennsylvania, USA. Association for Computational Linguistics.
Parmar, N., Vaswani, A., Uszkoreit, J., Kaiser, Ł., Shazeer, N., Ku, A., and Tran, D. (2018). Image transformer.
Pennington, J., Socher, R., and Manning, C. (2014). GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543.
Phan-Vu, H.-H., Tran, V.-T., Nguyen, V.-N., Dang, H.-V., and Do, P.-T. (2018). Machine translation between Vietnamese and English: an empirical study.
Provilkov, I., Emelianenko, D., and Voita, E. (2020). BPE-dropout: Simple and effective subword regularization. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 1882–1892, Online. Association for Computational Linguistics.
Rae, J. W., Potapenko, A., Jayakumar, S. M., Hillier, C., and Lillicrap, T. P. (2020). Compressive transformers for long-range sequence modelling. In International Conference on Learning Representations.
Roy, A., Saffar, M., Vaswani, A., and Grangier, D. (2020). Efficient content-based sparse attention with routing transformers.
Sennrich, R., Haddow, B., and Birch, A. (2016). Neural machine translation of rare words with subword units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1715–1725, Berlin, Germany. Association for Computational Linguistics.
Tay, Y., Bahri, D., Metzler, D., Juan, D.-C., Zhao, Z., and Zheng, C. (2020a). Synthesizer: Rethinking self-attention in transformer models.
Tay, Y., Bahri, D., Yang, L., Metzler, D., and Juan, D.-C. (2020b). Sparse Sinkhorn attention.
Tay, Y., Dehghani, M., Bahri, D., and Metzler, D. (2020c). Efficient transformers: A survey.
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., and Polosukhin, I. (2017). Attention is all you need. In Advances in Neural Information Processing Systems.
Wang, S., Li, B. Z., Khabsa, M., Fang, H., and Ma, H. (2020). Linformer: Self-attention with linear complexity.
Wiseman, S. and Rush, A. M. (2016). Sequence-to-sequence learning as beam-search optimization.
Xu, J., Sun, X., Zhang, Z., Zhao, G., and Lin, J. (2019). Understanding and improving layer normalization.
Zaheer, M., Guruganesh, G., Dubey, A., Ainslie, J., Alberti, C., Ontanon, S., Pham, P., Ravula, A., Wang, Q., Yang, L., and Ahmed, A. (2020). Big Bird: Transformers for longer sequences.
Zhang, M., Li, Z., Fu, G., and Zhang, M. (2019). Syntax-enhanced neural machine translation with syntax-aware word representations.
APPENDIX
Code, dataset, and results are available for download via the following link: (Informer Project).