
Accuracy Improvement of Neuron Concept Discovery Using CLIP with
Grad-CAM-Based Attention Regions
Takahiro Sannomiya (https://orcid.org/0009-0005-3644-381X) and Kazuhiro Hotta (https://orcid.org/0000-0002-5675-8713)
Meijo University, Nagoya, Japan
Keywords:
Explainable AI, CLIP, Concept of Neurons.
Abstract:
WWW is a method that computes the similarity between image and text features using CLIP and assigns a
concept to each neuron of the target model whose behavior we want to understand. However, because this method
calculates similarity using a center crop of each image, it may include features that are unrelated to the original
class of the image and thus fail to correctly reflect the similarity between the image and the text. Additionally, WWW
uses cosine similarity to compare image and text features. Cosine similarity can sometimes
produce a broad similarity distribution, which may not accurately capture the similarity between vectors. To
address these issues, we propose a method that leverages Grad-CAM to crop the model's attention region, filtering
out features unrelated to the original characteristics of the image. By further using t-vMF similarity to measure the similarity
between the image and text, we achieve more accurate discovery of neuron concepts.
1 INTRODUCTION
In recent years, image recognition models have been
used in a wide variety of fields, but it is often unclear how these models make their decisions. To ad-
dress this issue, visualization methods such as Class
Activation Maps (CAM) (Wang et al., 2020; Zhou
et al., 2016) have been proposed, but they only show
the regions the model attends to and cannot explain what con-
cepts and features the model has learned. A method
was therefore proposed to identify the concepts of the model's
neurons using CLIP (Radford et al., 2021), which can
measure the similarity between images and text. This
method allows us to explain, in concrete terms that hu-
mans can understand, what concepts the model bases
its decisions on, and to deepen our understanding
of the model's decision-making process and internal
behavior.
WWW (Ahn et al., 2024) is a method for identi-
fying neuron concepts. Since this method calculates
similarity using a center crop of each image, it may include
features that are unrelated to the original class of
the image and may not correctly reflect the similar-
ity between the image and the text. WWW also uses
cosine similarity to calculate the similarity between
images and text. Cosine similarity may not accurately
reflect the similarity between vectors because it tends to produce a broad
similarity distribution. To address these issues, we
propose a method that discovers neuron concepts more accurately
by measuring the similarity between images
and text with t-vMF similarity, while using Grad-CAM to crop the regions
of interest in the model and thereby eliminate features that
are unrelated to the original features of the image.
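The following minimal Python sketch illustrates these two ingredients. It assumes a Grad-CAM heatmap normalized to [0, 1] and CLIP image and text features are already available, and treats the concentration parameter kappa of t-vMF similarity as a hyperparameter; it is an illustrative sketch under these assumptions, not our full implementation.

import torch
import torch.nn.functional as F

def crop_to_attention(image, cam, thresh=0.5):
    # image: (C, H, W) tensor; cam: (H, W) Grad-CAM map normalized to [0, 1].
    # Crop the image to the bounding box of pixels whose attention exceeds thresh.
    mask = cam >= thresh
    if not mask.any():
        return image  # fall back to the full image if no pixel is attended
    ys, xs = torch.where(mask)
    y0, y1 = ys.min().item(), ys.max().item() + 1
    x0, x1 = xs.min().item(), xs.max().item() + 1
    return image[:, y0:y1, x0:x1]

def t_vmf_similarity(img_feat, txt_feat, kappa=16.0):
    # img_feat: (N_img, D), txt_feat: (N_txt, D) CLIP features.
    # t-vMF similarity: (1 + cos) / (1 + kappa * (1 - cos)) - 1.
    # It equals cosine similarity when kappa = 0 and sharpens the
    # distribution around 1 as kappa grows. kappa = 16.0 is only an
    # illustrative default, not a value fixed by the method.
    img_feat = F.normalize(img_feat, dim=-1)
    txt_feat = F.normalize(txt_feat, dim=-1)
    cos = img_feat @ txt_feat.T
    return (1.0 + cos) / (1.0 + kappa * (1.0 - cos)) - 1.0

In practice, the cropped attention region would be resized to CLIP's input resolution before being encoded and compared with the text features.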
Experiments were conducted on the ImageNet valida-
tion dataset, which consists of 1,000 classes such as ani-
mals and vehicles, and on text datasets such as Broden
and WordNet. The results showed that our method exceeded
the conventional method on several evaluation metrics, such as CLIP cos, mpnet cos,
and F1-score.
The paper is organized as follows. Section 2 de-
scribes related works. Section 3 details the proposed
method. Section 4 presents experimental results. Sec-
tion 5 discusses the ablation study. Finally, Section
6 concludes our paper.
2 RELATED WORKS
We explain WWW, a method for identifying neuron
concepts. Figure 1 illustrates WWW. Let the i-th neu-
ron in layer l of the target model be denoted as (l, i).
We denote the number of text samples as j. First,
we crop a center region in each image in the prob-
ing dataset (evaluation data) D_probe, and feed them
into the target model. We then select images where