
tection with transformers. In European conference on
computer vision, pages 213–229. Springer.
Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., and Fei-
Fei, L. (2009). Imagenet: A large-scale hierarchical
image database. In 2009 IEEE conference on com-
puter vision and pattern recognition, pages 248–255.
IEEE.
Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn,
D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer,
M., Heigold, G., Gelly, S., et al. (2020). An image is
worth 16x16 words: Transformers for image recogni-
tion at scale. arXiv preprint arXiv:2010.11929.
Haroun, K., Martinet, J., Ben Chehida, K., and Allenet, T.
(2024). Leveraging local similarity for token merging
in Vision Transformers. In ICONIP 2024 - 31st
International Conference on Neural Information Processing,
Auckland, New Zealand.
Haurum, J. B., Madadi, M., Escalera, S., and Moeslund,
T. B. (2022). Multi-scale hybrid vision transformer
and sinkhorn tokenizer for sewer defect classification.
Automation in Construction, 144:104614.
Liu, Y., Gehrig, M., Messikommer, N., Cannici, M., and
Scaramuzza, D. (2024). Revisiting token pruning
for object detection and instance segmentation. In
2024 IEEE/CVF Winter Conference on Applications
of Computer Vision (WACV), pages 2646–2656, Los
Alamitos, CA, USA. IEEE Computer Society.
Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin,
S., and Guo, B. (2021). Swin transformer: Hierar-
chical vision transformer using shifted windows. In
Proceedings of the IEEE/CVF international confer-
ence on computer vision, pages 10012–10022.
Marin, D., Chang, J.-H. R., Ranjan, A., Prabhu, A., Raste-
gari, M., and Tuzel, O. (2023). Token pooling in vi-
sion transformers for image classification. In Proceed-
ings of the IEEE/CVF Winter Conference on Applica-
tions of Computer Vision, pages 12–21.
Renggli, C., Pinto, A. S., Houlsby, N., Mustafa, B.,
Puigcerver, J., and Riquelme, C. (2022). Learning to
merge tokens in vision transformers. arXiv preprint
arXiv:2202.12015.
Strudel, R., Garcia, R., Laptev, I., and Schmid, C. (2021).
Segmenter: Transformer for semantic segmentation.
In Proceedings of the IEEE/CVF international con-
ference on computer vision, pages 7262–7272.
Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles,
A., and Jégou, H. (2021). Training data-efficient image
transformers & distillation through attention. In
International conference on machine learning, pages
10347–10357. PMLR.
Ward Jr, J. H. (1963). Hierarchical grouping to optimize an
objective function. Journal of the American statistical
association, 58(301):236–244.
Zeng, W., Jin, S., Liu, W., Qian, C., Luo, P., Ouyang,
W., and Wang, X. (2022). Not all tokens are equal:
Human-centric visual analysis via token clustering
transformer. In Proceedings of the IEEE/CVF Con-
ference on Computer Vision and Pattern Recognition,
pages 11101–11111.
Zhang, B., Tian, Z., Tang, Q., Chu, X., Wei, X., Shen, C.,
and Liu, Y. (2022). Segvit: Semantic segmentation
with plain vision transformers. NeurIPS.
Zong, Z., Li, K., Song, G., Wang, Y., Qiao, Y., Leng, B.,
and Liu, Y. (2022). Self-slimmed vision transformer.
In European Conference on Computer Vision, pages
432–448. Springer.
VISAPP 2025 - 20th International Conference on Computer Vision Theory and Applications