Self-Attention layer of the Vision Transformer with shuffle mixing, which aggregates features effectively at low cost. Results on the ImageNet-1K and ADE20K datasets show that the proposed model outperforms conventional Vision Transformers. Compared with conventional methods, the improvements in this paper are limited to the token-mixing and patch-embedding layers; further gains in accuracy can therefore be expected by also tuning the FFN, the normalization layers, the optimization method, and other hyperparameters.
ACKNOWLEDGEMENTS
This work was supported by JSPS KAKENHI Grant Number 21K11971.