Table 2: Main results for several methods on ImageNet-C. Values are corruption error; lower is better. IN denotes models
pre-trained on ImageNet; IN-21k denotes models pre-trained on ImageNet-21k and fine-tuned on ImageNet.
Within each group, columns are — Blur: defocus, glass, motion, zoom; Digital: contrast, elastic, jpeg, pixelate; Extra: gaussian, saturate, spatter, speckle; Noise: gaussian, impulse, shot; Weather: brightness, fog, frost, snow.

PT      Type  Method       |  Blur                   |  Digital                |  Extra                  |  Noise            |  Weather
IN      SL    ResNet-152   |  61.9  79.3  59.7  66.3 |  42.2  70.3  56.8  54.0 |  60.8  40.3  54.2  47.9 |  50.8  51.9  52.4 |  44.7  45.5  58.6  56.0
              ConvNeXt-L   |  56.2  71.2  48.5  55.5 |  34.2  61.9  48.9  47.7 |  55.9  34.9  39.7  37.5 |  37.1  36.0  38.7 |  38.8  41.5  42.0  43.1
              Pool-48M     |  65.5  82.2  59.3  66.4 |  42.6  74.0  59.1  60.9 |  64.4  40.2  49.9  47.9 |  48.0  47.6  50.7 |  43.6  51.0  49.5  56.2
              MLPMixer-L   |  92.4  97.6  87.3  94.6 |  63.1  96.2  96.1  82.6 |  91.4  73.7  83.9  78.4 |  81.3  84.0  82.8 |  69.8  70.9  77.2  88.3
              CvT-21       | 110.8 107.1 103.8 104.6 |  98.9 107.2 109.9 106.2 | 110.8  93.0  86.0  92.0 |  95.1  94.9  94.7 |  99.8  90.9  82.5  83.9
              Swin-B       |  62.8  75.8  55.1  63.8 |  41.1  68.0  56.7  55.7 |  62.5  40.1  43.4  47.8 |  47.8  49.1  51.2 |  43.3  39.1  48.6  52.7
              DeiT-B       |  60.0  68.8  57.6  65.4 |  50.0  61.6  57.8  51.8 |  59.1  45.1  47.2  44.3 |  47.0  46.9  48.5 |  45.5  46.5  46.7  53.9
              DeiT-L       |  59.2  63.5  57.9  64.3 |  41.1  58.5  57.7  53.0 |  58.6  43.9  49.7  48.1 |  51.6  50.8  53.4 |  47.6  46.3  47.3  54.9
              DeiT III-B   |  51.1  65.3  48.4  60.0 |  32.9  59.6  48.5  41.8 |  50.0  35.0  36.8  36.5 |  36.1  35.8  38.1 |  38.4  32.4  39.4  40.3
              DeiT III-L   |  47.4  59.6  42.2  51.4 |  30.1  54.2  44.1  36.8 |  46.4  32.2  33.4  32.1 |  32.0  31.4  33.6 |  35.9  29.2  35.8  36.6
IN      SSL   DINO-B       |  59.5  75.9  65.2  71.7 |  51.0  65.3  60.5  52.9 |  58.1  46.2  54.8  56.6 |  62.1  62.6  63.1 |  48.1  54.5  61.3  61.5
              MAE-B        |  61.3  75.8  56.9  66.8 |  42.7  71.4  58.1  52.7 |  60.3  40.9  43.1  42.6 |  45.4  44.5  46.4 |  43.5  44.7  45.6  46.1
              MAE-L        |  50.9  66.8  44.1  53.0 |  33.4  59.2  46.2  42.5 |  50.7  32.9  32.7  33.6 |  36.1  34.8  36.7 |  36.2  35.5  35.9  35.6
IN-21k  SL    ConvNeXt-L   |  45.4  62.4  41.7  47.3 |  33.5  54.5  40.8  33.8 |  46.7  32.4  35.1  33.8 |  35.2  33.3  36.2 |  35.5  34.7  42.3  39.8
              Swin-B       |  49.1  65.0  44.9  53.4 |  34.5  57.0  44.9  37.3 |  49.3  34.5  34.4  35.2 |  37.0  36.7  37.8 |  35.7  32.7  39.5  40.2
              Swin-L       |  45.9  62.6  42.6  49.6 |  32.6  55.1  41.7  34.6 |  46.5  33.2  32.8  34.0 |  35.9  33.9  36.6 |  36.4  31.7  39.2  36.7
              ViT-L        |  45.6  53.5  41.5  50.3 |  31.4  52.5  42.8  33.5 |  44.2  34.4  36.9  33.4 |  36.5  36.5  37.4 |  35.3  35.0  44.9  40.4
              DeiT III-L   |  41.3  58.6  43.3  49.0 |  34.2  55.2  38.9  32.1 |  41.6  31.7  35.2  37.8 |  42.5  39.9  43.3 |  34.7  35.8  44.8  37.9
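Assuming the corruption-error protocol of Hendrycks and Dietterich (2019), which this table's metric presumably follows, the error for one corruption type is the model's top-1 error summed over the five severity levels and normalized by a fixed baseline model's error on the same corruption (AlexNet in the original benchmark). A minimal sketch — the numeric values below are illustrative, not taken from the table:

```python
def corruption_error(model_errors, baseline_errors):
    """Corruption Error (CE) for a single corruption type.

    model_errors / baseline_errors: top-1 error rates (%) at the five
    severity levels of that corruption. Normalizing by a fixed baseline
    makes corruptions of different intrinsic difficulty comparable;
    lower is better, and values above 100 mean worse than the baseline.
    """
    return 100.0 * sum(model_errors) / sum(baseline_errors)


# Hypothetical severity-level errors for one corruption type:
model = [20.0, 30.0, 40.0, 50.0, 60.0]
baseline = [40.0, 60.0, 80.0, 90.0, 95.0]
ce = corruption_error(model, baseline)  # ~54.8

# The mean Corruption Error (mCE) then averages CE over all corruption types.
```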
REFERENCES
Arnab, A., Dehghani, M., Heigold, G., Sun, C., Lučić, M.,
and Schmid, C. (2021). ViViT: A video vision transformer.
In International Conference on Computer Vision.
Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J.,
Bojanowski, P., and Joulin, A. (2021). Emerging properties
in self-supervised vision transformers. In International
Conference on Computer Vision.
Hendrycks, D. and Dietterich, T. (2019). Benchmarking neural
network robustness to common corruptions and perturbations.
In International Conference on Learning Representations.
Zhou, D., Yu, Z., Xie, E., Xiao, C., Anandkumar, A., Feng, J.,
and Alvarez, J. M. (2022). Understanding the robustness in
vision transformers. In International Conference on Machine
Learning.
Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., and
Fei-Fei, L. (2009). ImageNet: A large-scale hierarchical
image database. In Computer Vision and Pattern Recognition,
pages 248–255.
Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn,
D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer,
M., Heigold, G., Gelly, S., Uszkoreit, J., and Houlsby,
N. (2021). An image is worth 16x16 words: Trans-
formers for image recognition at scale. In Interna-
tional Conference on Learning Representations.
Bao, H., Dong, L., Piao, S., and Wei, F. (2022). BEiT:
BERT pre-training of image transformers. In International
Conference on Learning Representations.
He, K., Zhang, X., Ren, S., and Sun, J. (2016). Deep resid-
ual learning for image recognition. In Computer Vi-
sion and Pattern Recognition.
Yin, H., Vahdat, A., Alvarez, J. M., Mallya, A., Kautz, J.,
and Molchanov, P. (2022). A-ViT: Adaptive tokens for
efficient vision transformer. In Computer Vision and Pattern
Recognition.
Touvron, H., Cord, M., and Jégou, H. (2022). DeiT III:
Revenge of the ViT. arXiv preprint.
He, K., Chen, X., Xie, S., Li, Y., Dollár, P., and
Girshick, R. (2022). Masked autoencoders are scalable
vision learners. In Computer Vision and Pattern Recognition.
Mahmood, K., Mahmood, R., and van Dijk, M. (2021). On the
robustness of vision transformers to adversarial examples.
In International Conference on Computer Vision.
Lee, K., Chang, H., Jiang, L., Zhang, H., Tu, Z., and
Liu, C. (2021). ViTGAN: Training GANs with vision
transformers. In International Conference on Learning
Representations.
Gatys, L. A., Ecker, A. S., and Bethge, M. (2016). Image
style transfer using convolutional neural networks. In
Computer Vision and Pattern Recognition, pages 2414–2423.
Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S.,
and Guo, B. (2021). Swin transformer: Hierarchical
vision transformer using shifted windows. In Interna-
tional Conference on Computer Vision.
Liu, Z., Mao, H., Wu, C.-Y., Feichtenhofer, C., Darrell, T.,
and Xie, S. (2022a). A convnet for the 2020s. In
Computer Vision and Pattern Recognition.
Liu, Z., Ning, J., Cao, Y., Wei, Y., Zhang, Z., Lin, S., and
Hu, H. (2022b). Video swin transformer. In Computer
Vision and Pattern Recognition.
Naseer, M., Ranasinghe, K., Khan, S., Hayat, M., Khan, F. S.,
and Yang, M.-H. (2022). Intriguing properties of vision
transformers. In Advances in Neural Information Processing
Systems.
Park, N. and Kim, S. (2022). How do vision transformers
work? In International Conference on Learning
Representations.
Wang, P., Zheng, W., Chen, T., and Wang, Z. (2022).
Anti-oversmoothing in deep vision transformers via the
Fourier domain analysis: From theory to practice. In
International Conference on Learning Representations.
Geirhos, R., Rubisch, P., Michaelis, C., Bethge, M.,
Wichmann, F. A., and Brendel, W. (2019). ImageNet-trained
CNNs are biased towards texture; increasing shape bias
improves accuracy and robustness. In International
Conference on Learning Representations.
Understanding of Feature Representation in Convolutional Neural Networks and Vision Transformer