ACKNOWLEDGEMENTS
This work was performed using the compute resources of the Academic Leiden Interdisciplinary Cluster Environment (ALICE) provided by Leiden University. We thank Andrius Bernatavicius, Shima Javanmardi, and the participants of the Advances in Deep Learning 2022 class at LIACS for the valuable discussions and feedback.