Figure 9: Trainable Word Patches for the CIFAR100 dataset.
in the previous section, it seems to capture more detailed features than the Word Patches obtained by clustering. This may have resulted in the improved accuracy.
5 CONCLUSIONS
We proposed a method for using Word Patches in the Vision Transformer. Experimental results on the Food101 and CIFAR100 datasets showed that the accuracy of the Vision Transformer was improved. In addition, the classification accuracy was improved further by using trainable Word Patches, in which fine patterns are generated automatically. The improvement of the Vision Transformer through Word Patches may lead to advances in other recent research using the Transformer.
Although our method improved the accuracy, we cannot be sure that it is the best way to create Word Patches. In future work, we would like to find a new method for creating Word Patches that adapt to the input image.
ACKNOWLEDGEMENTS
This work was supported by JSPS KAKENHI Grant Number 21K11971.