but faster. In International Conference on Learning
Representations.
Bolya, D. and Hoffman, J. (2023). Token merging for fast stable diffusion. CVPR Workshop on Efficient Deep Learning for Computer Vision.
Bonnaerens, M. and Dambre, J. (2023). Learned thresh-
olds token merging and pruning for vision transform-
ers. Transactions on Machine Learning Research.
Caesar, H., Uijlings, J., and Ferrari, V. (2018). Coco-stuff: Thing and stuff classes in context. In 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 1209–1218, Los Alamitos, CA, USA. IEEE Computer Society.
Carion, N., Massa, F., Synnaeve, G., Usunier, N., Kirillov, A., and Zagoruyko, S. (2020). End-to-end object detection with transformers. In European Conference on Computer Vision, pages 213–229. Springer.
Chen, M., Shao, W., Xu, P., Lin, M., Zhang, K., Chao, F., Ji, R., Qiao, Y., and Luo, P. (2023). Diffrate: Differentiable compression rate for efficient vision transformers. arXiv preprint arXiv:2305.17997.
Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn,
D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer,
M., Heigold, G., Gelly, S., Uszkoreit, J., and Houlsby,
N. (2021). An image is worth 16x16 words: Trans-
formers for image recognition at scale. In Interna-
tional Conference on Learning Representations.
Fayyaz, M., Abbasi Kouhpayegani, S., Rezaei Jafari, F.,
Sommerlade, E., Vaezi Joze, H. R., Pirsiavash, H., and
Gall, J. (2022). Adaptive token sampling for efficient
vision transformers. European Conference on Com-
puter Vision (ECCV).
Yu, H. and Wu, J. (2023). A unified pruning framework for vision transformers. Science China Information Sciences, 66:1869–1919.
He, K., Zhang, X., Ren, S., and Sun, J. (2016). Deep resid-
ual learning for image recognition. In Proceedings of
the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
Huang, H., Zhou, X., Cao, J., He, R., and Tan, T. (2022).
Vision transformer with super token sampling. arXiv preprint arXiv:2211.11167.
Kim, M., Gao, S., Hsu, Y.-C., Shen, Y., and Jin, H. (2024).
Token fusion: Bridging the gap between token pruning and token merging. In 2024 IEEE/CVF Winter Conference on Applications of Computer Vision
(WACV), pages 1372–1381.
Kong, Z., Dong, P., Ma, X., Meng, X., Niu, W., Sun, M.,
Shen, X., Yuan, G., Ren, B., Tang, H., et al. (2022). Spvit: Enabling faster vision transformers via latency-aware soft token pruning. In Computer Vision–ECCV
2022: 17th European Conference, Tel Aviv, Israel, Oc-
tober 23–27, 2022, Proceedings, Part XI, pages 620–
640. Springer.
Li, Z. and Gu, Q. (2023). I-vit: Integer-only quantization for
efficient vision transformer inference. In Proceedings
of the IEEE/CVF International Conference on Com-
puter Vision, pages 17065–17075.
Liang, Y., Ge, C., Tong, Z., Song, Y., Wang, J., and Xie, P.
(2022). Not all patches are what you need: Expediting
vision transformers via token reorganizations. In In-
ternational Conference on Learning Representations.
Lin, Y., Zhang, T., Sun, P., Li, Z., and Zhou, S. (2022). Fq-
vit: Post-training quantization for fully quantized vi-
sion transformer. In Proceedings of the Thirty-First
International Joint Conference on Artificial Intelli-
gence, IJCAI-22, pages 1173–1179.
Liu, Y., Gehrig, M., Messikommer, N., Cannici, M., and
Scaramuzza, D. (2024). Revisiting token pruning
for object detection and instance segmentation. In
2024 IEEE/CVF Winter Conference on Applications
of Computer Vision (WACV), pages 2646–2656, Los
Alamitos, CA, USA. IEEE Computer Society.
Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin,
S., and Guo, B. (2021). Swin transformer: Hierarchi-
cal vision transformer using shifted windows. In Pro-
ceedings of the IEEE/CVF International Conference
on Computer Vision (ICCV), pages 10012–10022.
Lu, C., de Geus, D., and Dubbelman, G. (2023). Content-aware token sharing for efficient semantic segmentation with vision transformers. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
Marin, D., Chang, J.-H. R., Ranjan, A., Prabhu, A., Raste-
gari, M., and Tuzel, O. (2023). Token pooling
in vision transformers for image classification. In
2023 IEEE/CVF Winter Conference on Applications
of Computer Vision (WACV), pages 12–21.
Meng, L., Li, H., Chen, B.-C., Lan, S., Wu, Z., Jiang, Y.-
G., and Lim, S.-N. (2022). Adavit: Adaptive vision
transformers for efficient image recognition. In 2022
IEEE/CVF Conference on Computer Vision and Pat-
tern Recognition (CVPR), pages 12309–12318.
Michel, P., Levy, O., and Neubig, G. (2019). Are sixteen heads really better than one? In Wallach, H., Larochelle, H., Beygelzimer, A., d'Alché-Buc, F.,
Fox, E., and Garnett, R., editors, Advances in Neural
Information Processing Systems, volume 32. Curran
Associates, Inc.
MMSegmentation Contributors (2020). MMSegmenta-
tion: OpenMMLab semantic segmentation toolbox
and benchmark. https://github.com/open-mmlab/
mmsegmentation.
Rao, Y., Liu, Z., Zhao, W., Zhou, J., and Lu, J. (2023). Dy-
namic spatial sparsification for efficient vision trans-
formers and convolutional neural networks. IEEE
Transactions on Pattern Analysis and Machine Intel-
ligence, 45(9):10883–10897.
Rao, Y., Zhao, W., Liu, B., Lu, J., Zhou, J., and Hsieh,
C.-J. (2021). Dynamicvit: Efficient vision transform-
ers with dynamic token sparsification. In Advances in
Neural Information Processing Systems (NeurIPS).
Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., Berg, A. C., and Fei-Fei, L. (2015). ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV), 115(3):211–252.
Ryoo, M. S., Piergiovanni, A. J., Arnab, A., Dehghani, M.,
and Angelova, A. (2021). Tokenlearner: What can