
lem that it fails to highlight objects of the predicted class, whereas the proposed method alleviates this issue. Quantitative and qualitative evaluations demonstrated the effectiveness of the proposed method.
ACKNOWLEDGEMENTS
This research is partially supported by JSPS KAKENHI Grant Numbers 22H04735 and 21K11971.