in computer vision: A survey. Computational Visual
Media.
Han, K., Wang, Y., Chen, H., Chen, X., Guo, J., Liu, Z.,
Tang, Y., Xiao, A., Xu, C., Xu, Y., et al. (2022). A
survey on vision transformer. IEEE Transactions on Pattern Analysis and Machine Intelligence.
He, K., Fan, H., Wu, Y., Xie, S., and Girshick, R. (2020).
Momentum contrast for unsupervised visual represen-
tation learning. In Conf. on computer vision and pat-
tern recognition.
He, K., Zhang, X., Ren, S., and Sun, J. (2016). Deep resid-
ual learning for image recognition. In Conf. on com-
puter vision and pattern recognition.
Hendrycks, D. and Gimpel, K. (2016). Gaussian error linear
units (GELUs). arXiv:1606.08415.
Heusel, M., Ramsauer, H., Unterthiner, T., Nessler, B., and
Hochreiter, S. (2017). GANs trained by a two time-scale update rule converge to a local Nash equilibrium.
Advances in neural information processing systems.
Huang, G., Liu, Z., Van Der Maaten, L., and Weinberger,
K. Q. (2017). Densely connected convolutional net-
works. In Conf. on computer vision and pattern recog-
nition.
Ioffe, S. and Szegedy, C. (2015). Batch normalization: Ac-
celerating deep network training by reducing internal
covariate shift. In Int. Conf. on machine learning.
Izmailov, P., Podoprikhin, D., Garipov, T., Vetrov, D.,
and Wilson, A. G. (2018). Averaging weights
leads to wider optima and better generalization.
arXiv:1803.05407.
Karras, T., Laine, S., Aittala, M., Hellsten, J., Lehtinen, J.,
and Aila, T. (2020a). Analyzing and improving the
image quality of StyleGAN. In Conf. on computer vision
and pattern recognition.
Karras, T., Laine, S., Aittala, M., Hellsten, J., Lehtinen, J.,
and Aila, T. (2020b). Analyzing and improving the
image quality of StyleGAN. In Conf. on computer vision
and pattern recognition.
Khan, A., Sohail, A., Zahoora, U., and Qureshi, A. S.
(2020). A survey of the recent architectures of deep
convolutional neural networks. Artificial Intelligence Review.
Khan, S., Naseer, M., Hayat, M., Zamir, S. W., Khan, F. S.,
and Shah, M. (2022). Transformers in vision: A sur-
vey. ACM Computing Surveys (CSUR).
Khosla, P., Teterwak, P., Wang, C., Sarna, A., Tian, Y.,
Isola, P., Maschinot, A., Liu, C., and Krishnan, D.
(2020). Supervised contrastive learning. Advances
in neural information processing systems.
Kiefer, J. and Wolfowitz, J. (1952). Stochastic estimation
of the maximum of a regression function. The Annals
of Mathematical Statistics.
Kingma, D. P. and Ba, J. (2014). Adam: A method for
stochastic optimization. arXiv:1412.6980.
Kipf, T. N. and Welling, M. (2016). Semi-supervised
classification with graph convolutional networks.
arXiv:1609.02907.
Lewis, M., Liu, Y., Goyal, N., Ghazvininejad, M., Mo-
hamed, A., Levy, O., Stoyanov, V., and Zettlemoyer,
L. (2020). BART: Denoising Sequence-to-Sequence
Pre-training for Natural Language Generation, Trans-
lation, and Comprehension. In Proceedings of the
58th Annual Meeting of the Association for Compu-
tational Linguistics, pages 7871–7880.
Lin, T.-Y., Goyal, P., Girshick, R., He, K., and Dollár, P. (2017). Focal loss for dense object detection. In Int. Conf. on computer vision.
Liu, L., Jiang, H., He, P., Chen, W., Liu, X., Gao, J., and
Han, J. (2019a). On the variance of the adaptive learn-
ing rate and beyond. arXiv:1908.03265.
Liu, P., Yuan, W., Fu, J., Jiang, Z., Hayashi, H., and Neubig,
G. (2023). Pre-train, prompt, and predict: A system-
atic survey of prompting methods in natural language
processing. ACM Computing Surveys.
Liu, P. J., Saleh, M., Pot, E., Goodrich, B., Sepassi, R., Kaiser, L., and Shazeer, N. (2018). Generating Wikipedia by summarizing long sequences. arXiv:1801.10198.
Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D.,
Levy, O., Lewis, M., Zettlemoyer, L., and Stoyanov,
V. (2019b). RoBERTa: A robustly optimized BERT pretraining approach. arXiv:1907.11692.
Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin,
S., and Guo, B. (2021). Swin transformer: Hierarchi-
cal vision transformer using shifted windows. In Int.
Conf. on computer vision.
Loshchilov, I. and Hutter, F. (2017). Decoupled weight de-
cay regularization. arXiv:1711.05101.
Mescheder, L., Geiger, A., and Nowozin, S. (2018). Which
training methods for GANs do actually converge? In
Int. Conf. on machine learning.
Min, B., Ross, H., Sulem, E., Veyseh, A. P. B., Nguyen,
T. H., Sainz, O., Agirre, E., Heinz, I., and Roth, D.
(2021). Recent advances in natural language process-
ing via large pre-trained language models: A survey.
arXiv:2111.01243.
Misra, D. (2019). Mish: A self regularized non-monotonic
activation function. arXiv:1908.08681.
Mnih, V., Badia, A. P., Mirza, M., Graves, A., Lillicrap, T.,
Harley, T., Silver, D., and Kavukcuoglu, K. (2016).
Asynchronous methods for deep reinforcement learn-
ing. In Int. Conf. on machine learning.
Moradi, R., Berangi, R., and Minaei, B. (2020). A survey
of regularization strategies for deep models. Artificial
Intelligence Review.
Nair, V. and Hinton, G. E. (2010). Rectified linear units improve restricted Boltzmann machines. In Int. Conf. on machine learning.
OpenAI (2022). ChatGPT: Optimizing language models for dialogue. https://openai.com/blog/chatgpt/.
OpenAI (2023). GPT-4 technical report.
Ouyang, L., Wu, J., Jiang, X., Almeida, D., Wain-
wright, C. L., Mishkin, P., Zhang, C., Agarwal, S.,
Slama, K., Ray, A., et al. (2022). Training language
models to follow instructions with human feedback.
arXiv:2203.02155.
Radford, A., Narasimhan, K., Salimans, T., Sutskever, I.,
et al. (2018). Improving language understanding by
generative pre-training.