Accuracy Improvement of Neuron Concept Discovery Using CLIP with
Grad-CAM-Based Attention Regions
Takahiro Sannomiya (https://orcid.org/0009-0005-3644-381X) and Kazuhiro Hotta (https://orcid.org/0000-0002-5675-8713)
Meijo University, Nagoya, Japan
Keywords:
Explainable AI, CLIP, Concept of Neurons.
Abstract:
WWW is a method that assigns a concept to each neuron of a target model whose behavior we wish to understand, using CLIP to compute the similarity between image and text features. However, because WWW computes this similarity from a center crop of each image, the crop may include features unrelated to the original class of the image, so the computed similarity may not correctly reflect the relationship between the image and the text. In addition, WWW uses cosine similarity to compare images and text; cosine similarity can produce a broad similarity distribution and therefore may not capture the similarity between vectors precisely. To address these issues, we propose a method that uses Grad-CAM to crop the model's attention region, filtering out features unrelated to the original characteristics of the image, and that measures the similarity between image and text with t-vMF similarity. With these changes, we achieve more accurate discovery of neuron concepts.
1 INTRODUCTION
In recent years, image recognition models have been used in a wide range of fields, but it is often unclear how these models make their decisions. To address this issue, visualization methods such as Class Activation Maps (CAM) (Wang et al., 2020; Zhou et al., 2016) have been proposed, but they only show the regions the model attends to and cannot explain what concepts or features the model has learned. A method was therefore proposed that identifies the concepts of a model's neurons using CLIP (Radford et al., 2021), which can measure the similarity between images and text. Such a method can explain, in concrete terms that humans can understand, what concepts the model bases its decisions on, deepening our understanding of the model's decision-making process and internal behavior.
WWW (Ahn et al., 2024) is a method for identifying neuron concepts. Since this method calculates similarity using a center crop of each image, the crop may include features unrelated to the original class of the image, so the computed similarity may not correctly reflect the relationship between the image and the text. WWW also uses cosine similarity when comparing images and text. Cosine similarity may not accurately
reflect the similarity between vectors because of its wide similarity distribution. To address these issues, we propose a method that discovers neuron concepts more accurately by measuring the similarity between images and text with t-vMF similarity, while using Grad-CAM to crop the regions the model attends to and thereby eliminate features unrelated to the original features of the image.
Experiments were conducted on the ImageNet validation dataset, which consists of 1,000 classes such as animals and vehicles, together with text datasets such as Broaden and WordNet. The results show that our method exceeds the conventional method on several evaluation metrics, including CLIP cos, mpnet cos, and F1-score.
The paper is organized as follows. Section 2 describes related works. Section 3 details the proposed method. Section 4 presents experimental results. Section 5 discusses the ablation study. Finally, Section 6 concludes our paper.
2 RELATED WORKS
We explain WWW, a method for identifying neuron concepts. Figure 1 illustrates WWW. Let the i-th neuron in layer l of the target model be denoted as (l, i). We denote the number of text samples as j. First, we crop a center region from each image in the probing dataset (evaluation data) D_probe and feed the crops into the target model.
Figure 1: Overview of WWW.
We then select the images for which the neuron exhibits a strong response (high activation) and denote this set as D_represent. By feeding both D_represent and the text dataset D_concept into CLIP, we compute the concept score S_j^(l,i) using the following equation:
S_j^{(l,i)} = \frac{1}{n} \sum_{o=1}^{n} \left( \cos\left(v_o^{(l,i)}, t_j\right) - \cos\left(v_o^{(l,i)}, t_{tem}\right) \right), \qquad (1)
where v_o^(l,i) is the CLIP image feature vector, t_j is the CLIP text feature vector, and t_tem is the CLIP text feature vector of a base template phrase, such as "a photo of a", which has some similarity with any image. In Equation (1), subtracting the cosine similarity between v_o^(l,i) and t_tem from the cosine similarity between v_o^(l,i) and t_j removes the influence of the base template, allowing us to focus solely on the similarity between the image and the text itself. After calculating the concept score S_j^(l,i), the texts corresponding to scores above the threshold δ^(l,i) are considered as concepts for the neuron (l, i). δ^(l,i) is defined by the following equation:
\delta^{(l,i)} = \alpha \times \max\left(S^{(l,i)}\right), \qquad (2)

where α is a hyperparameter representing concept sensitivity.
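To make the scoring concrete, below is a minimal sketch of Equations (1) and (2), assuming the CLIP image features of D_represent, the CLIP text features of D_concept, and the template feature have already been extracted and L2-normalized; the function and variable names are ours for illustration, not taken from the WWW implementation.

```python
import numpy as np

def www_concept_scores(image_feats, text_feats, template_feat):
    """Concept scores per Eq. (1): mean over images of
    cos(v_o, t_j) - cos(v_o, t_tem).

    image_feats:   (n, d) L2-normalized CLIP features of D_represent
    text_feats:    (m, d) L2-normalized CLIP features of D_concept
    template_feat: (d,)   L2-normalized CLIP feature of the base template
    """
    # With unit-norm vectors, cosine similarity reduces to a dot product.
    sim_text = image_feats @ text_feats.T          # (n, m)
    sim_template = image_feats @ template_feat     # (n,)
    return (sim_text - sim_template[:, None]).mean(axis=0)   # (m,)

def select_concepts(scores, concept_texts, alpha=0.95):
    """Texts whose score exceeds delta = alpha * max(S), per Eq. (2)."""
    delta = alpha * scores.max()
    return [t for t, s in zip(concept_texts, scores) if s > delta]
```

With α = 0.95, only texts whose score comes within 5% of the highest-scoring text are kept as the neuron's concepts (assuming a positive maximum score).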
This method reduces the influence of the base template t_tem, enabling the measurement of similarity between the image and the text itself. However, the images in D_probe are center-cropped and then passed through the model to select high-activation images, which are used as D_represent when calculating similarity with the text. As a result, features unrelated to the original class of the image, such as background details, may be included. This can prevent an accurate reflection of the similarity between the image and the text, potentially affecting the identification of neuron concepts. Additionally, although WWW uses cosine similarity for its similarity calculations, this measure has a wide distribution, which may be insufficient for measuring similarity precisely.
3 PROPOSED METHOD
To solve these issues, we propose a method that computes Grad-CAM on D_probe, crops the attention region, and calculates the similarity using images that retain class-relevant features. Additionally, to measure the similarity more accurately, we replace cosine similarity with t-vMF similarity (Kobayashi, 2021), which narrows the similarity distribution and enables more precise measurement. As shown in Figure 2, the proposed method feeds D_probe into the target model and computes Grad-CAM. Each image is then cropped to the area centered on the highest Grad-CAM value. The cropped images are fed into the target model, and the images for which the i-th neuron in layer l (i.e., (l, i)) shows high activation are selected as D_represent. The concept score S_j^(l,i) is then computed using the following equation:
S_j^{(l,i)} = \frac{1}{n} \sum_{o=1}^{n} \left( \text{t-vMF}\left(v_o^{(l,i)}, t_j\right) - \text{t-vMF}\left(v_o^{(l,i)}, t_{tem}\right) \right), \qquad (3)
where v_o^(l,i) is the image feature vector obtained when high-activation images, determined using Grad-CAM, are fed into CLIP. By calculating S_j^(l,i) and selecting the texts corresponding to scores above the threshold, as done in WWW, we can define these texts as the concepts for the neuron (l, i). This approach allows us to retain only the essential features of the image, enabling a more accurate calculation of similarity between features related to the image class and the text.
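As a concrete illustration of Equation (3), the sketch below swaps cosine similarity for t-vMF similarity, written here in the commonly used form (1 + cos θ) / (1 + κ(1 − cos θ)) − 1 from Kobayashi (2021); the concentration parameter κ and all names are illustrative assumptions on our part, since the paper does not report the value used.

```python
import numpy as np

def t_vmf(a, b, kappa=16.0):
    """t-vMF similarity: (1 + cos) / (1 + kappa * (1 - cos)) - 1.
    Larger kappa concentrates similarities near the top of the range;
    kappa here is an illustrative choice."""
    cos = float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)
    return (1.0 + cos) / (1.0 + kappa * (1.0 - cos)) - 1.0

def concept_scores_tvmf(image_feats, text_feats, template_feat, kappa=16.0):
    """Concept scores per Eq. (3): mean over the Grad-CAM-cropped images of
    t-vMF(v_o, t_j) - t-vMF(v_o, t_tem)."""
    scores = []
    for t in text_feats:                                  # texts in D_concept
        diffs = [t_vmf(v, t, kappa) - t_vmf(v, template_feat, kappa)
                 for v in image_feats]                    # images in D_represent
        scores.append(np.mean(diffs))
    return np.asarray(scores)
```

Thresholding these scores with Equation (2) then proceeds exactly as in WWW.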
Figure 2: Overview of the proposed method.
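For the attention-region cropping step shown in Figure 2, here is a self-contained sketch that computes a Grad-CAM map for the target ResNet-50 with PyTorch hooks and crops a window around its peak. The choice of model.layer4[-1] as the CAM layer, the use of the predicted class as the backward target, and the 224-pixel crop size are our own assumptions rather than settings reported in the paper.

```python
import numpy as np
import torch
import torch.nn.functional as F
from torchvision.models import resnet50, ResNet50_Weights

model = resnet50(weights=ResNet50_Weights.IMAGENET1K_V1).eval()

# Cache activations and gradients of the last residual block via hooks.
feats, grads = {}, {}
cam_layer = model.layer4[-1]
cam_layer.register_forward_hook(lambda m, i, o: feats.update(a=o))
cam_layer.register_full_backward_hook(lambda m, gi, go: grads.update(a=go[0]))

def grad_cam(x):
    """x: (1, 3, H, W) normalized image tensor. Returns an (H, W) map in [0, 1]."""
    logits = model(x)
    model.zero_grad()
    logits[0, logits[0].argmax()].backward()        # target: the predicted class
    w = grads["a"].mean(dim=(2, 3), keepdim=True)   # channel weights (GAP of grads)
    cam = F.relu((w * feats["a"]).sum(dim=1, keepdim=True))
    cam = F.interpolate(cam, size=x.shape[-2:], mode="bilinear", align_corners=False)
    cam = cam[0, 0].detach()
    return ((cam - cam.min()) / (cam.max() - cam.min() + 1e-8)).numpy()

def crop_around_cam_peak(image, cam, crop_size=224):
    """Crop a square window centered on the highest Grad-CAM value.
    image: (H, W, C) array; cam: (H, W) map aligned with the image."""
    h, w = cam.shape
    cy, cx = np.unravel_index(np.argmax(cam), cam.shape)
    top = int(np.clip(cy - crop_size // 2, 0, max(h - crop_size, 0)))
    left = int(np.clip(cx - crop_size // 2, 0, max(w - crop_size, 0)))
    return image[top:top + crop_size, left:left + crop_size]
```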
4 EXPERIMENTS
In this section, we describe the experimental setup and results. Section 4.1 describes the experimental setup. Section 4.2 presents the results of the qualitative evaluation experiments. Section 4.3 shows the results of the quantitative evaluation experiments.
4.1 Experimental Settings
Following the previous research, we use a ResNet-50 model pre-trained on ImageNet-1k (Russakovsky et al., 2015) as the target model. For D_probe, we utilize the validation images of the ImageNet-1k dataset, while D_concept includes ImageNet-1k, WordNet (Fellbaum, 2005), and Broaden (Bau et al., 2017). We selected 40 images from D_probe that exhibited high activation in the neurons to form D_represent. Evaluation metrics include CLIP cos, mpnet cos, F1-score, and Hit Rate, computed using CLIP and mpnet (Song et al., 2020).
CLIP cos and mpnet cos measure the cosine similarity between class labels and the selected concepts after feeding them into CLIP and mpnet, respectively. The F1-score measures the accuracy and flexibility of the discovered concepts, while the Hit Rate is the proportion of selected concepts that match the class labels. Higher values indicate better performance for all of these metrics. The concept sensitivity α is set to 0.95 for both methods.
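As an example of how such a text-side metric can be computed, the snippet below scores a class label against a discovered concept with a sentence-embedding model; the specific mpnet checkpoint is an assumption on our part, since the paper does not name one.

```python
from sentence_transformers import SentenceTransformer, util

# Illustrative mpnet-cos-style score: cosine similarity between a class label
# and a discovered concept in a sentence-embedding space. The checkpoint name
# below is an assumed choice, not one specified in the paper.
model = SentenceTransformer("all-mpnet-base-v2")

def mpnet_cos(class_label: str, concept: str) -> float:
    emb = model.encode([class_label, concept], convert_to_tensor=True,
                       normalize_embeddings=True)
    return util.cos_sim(emb[0], emb[1]).item()

print(mpnet_cos("tiger shark", "shark"))   # larger values indicate closer concepts
```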
4.2 Qualitative Results
The results of the qualitative evaluation are shown in Figure 3. WordNet is used for D_concept, and we compare images that highly activate each neuron in the final layer of ResNet-50. Below each image, the concepts identified by the proposed method and by WWW are shown. Figure 3 confirms the superiority of the proposed method.
For example, for Neuron 0, Neuron 10, Neuron 446, and Neuron 479, the concept identified by the proposed method matches the ground truth of the image, while the concept identified by WWW is similar to the ground truth but different. For Neuron 296, Neuron 460, Neuron 671, Neuron 742, and Neuron 850, the proposed method identifies concepts that are consistent with the ground truth, while WWW identifies significantly different concepts. This may be because WWW uses a center crop, which includes unnecessary features in addition to the original image features, making it easier to identify unrelated or merely similar concepts. This can be seen in Figure 4: when a center crop is simply used, the left side of the image contains many objects unrelated to the ground truth, which may result in incorrect similarity calculations. On the other hand, if the image is cropped based on Grad-CAM, the areas unrelated to the ground truth can be removed while many of the original features of the image are kept.
Figure 3: Qualitative evaluation of each method.
Figure 4: Comparison of the proposed method and WWW crop images.
Table 1: Comparison results with ResNet-50 as the target model.

Method   D_probe        D_concept        CLIP cos   mpnet cos   F1-score   Hit Rate
WWW      ImageNet val   ImageNet (1k)    93.31      83.47       77.53      95.5
WWW      ImageNet val   Broaden (1.2k)   77.79      44.47       6.6        9.3
WWW      ImageNet val   WordNet (80k)    88.76      70.05       42.62      66.2
Ours     ImageNet val   ImageNet (1k)    93.46      83.90       76.98      95.5
Ours     ImageNet val   Broaden (1.2k)   78.31      45.45       6.73       9.3
Ours     ImageNet val   WordNet (80k)    89.49      72.58       45.77      69.8
We believe that the proposed method measures the relationship between image features and text features more accurately by using t-vMF similarity while excluding, with Grad-CAM, the areas that are irrelevant to the original image features.
4.3 Quantitative Results
From Table 1, it can be observed that accuracy improves on almost all evaluation metrics. The maximum improvements are 0.73% for CLIP cos, 2.53% for mpnet cos, 3.15% for F1-score, and 3.6% for Hit Rate. This improvement is believed to result from using Grad-CAM to remove non-essential features of the images while accurately calculating the similarity between image features and text features with t-vMF similarity. Additionally, for Broaden and WordNet, both methods show an increase in accuracy as the size of D_concept increases. This is considered to arise from the enhanced expressiveness of the concepts that can be assigned to the neurons as D_concept grows. The proposed method reflects the similarity between image features and text features more accurately, which likely explains why its evaluation metrics improve more than those of the existing method when D_concept is changed from Broaden to WordNet.
Table 2: Results of the ablation study.

Grad-CAM   t-vMF   CLIP cos   mpnet cos   F1-score
-          -       88.76      70.05       42.62
-          ✓       88.89      70.45       42.42
✓          -       89.43      72.42       45.83
✓          ✓       89.49      72.58       45.77
5 ABLATION STUDY
In this section, we discuss the contributions of Grad-
CAM and t-vMF to the improvements in accuracy.
The experimental results are presented in Table 2.
From Table 2, we can see that WWW, which uses center-cropped images containing a significant amount of redundant features, achieves relatively low accuracy. When t-vMF alone is used, a slight improvement is observed in CLIP cos and mpnet cos. When Grad-CAM is used, improvements are confirmed across all evaluation metrics. Finally, when Grad-CAM and t-vMF are used together, the best accuracy is achieved on two evaluation metrics, CLIP cos and mpnet cos. We attribute this to the elimination of redundant features by Grad-CAM combined with accurate similarity calculation through t-vMF.
6 CONCLUSION
In this paper, we proposed a method that utilizes Grad-CAM and t-vMF similarity to measure the similarity between the intrinsic features of images and text more accurately, thereby improving the accuracy of neuron concept discovery. As a result, we achieved more accurate identification of neuron concepts across various datasets.
REFERENCES
Ahn, Y. H. et al. (2024). WWW: A unified framework for explaining what, where and why of neural networks by interpretation of neuron concepts. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10968–10977.
Bau, D., Zhou, B., Khosla, A., Oliva, A., and Torralba, A. (2017). Network dissection: Quantifying interpretability of deep visual representations. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6541–6549.
Fellbaum, C. (2005). WordNet and wordnets. In Encyclopedia of Language and Linguistics.
Kobayashi, T. (2021). t-vMF similarity for regularizing intra-class feature distribution. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6616–6625.
Radford, A. et al. (2021). Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pages 8748–8763.
Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., et al. (2015). ImageNet large scale visual recognition challenge. International Journal of Computer Vision, 115:211–252.
Song, K., Tan, X., Qin, T., Lu, J., and Liu, T.-Y. (2020). MPNet: Masked and permuted pre-training for language understanding. Advances in Neural Information Processing Systems, 33:16857–16867.
Wang, H., Wang, Z., Du, M., Yang, F., Zhang, Z., Ding, S., Mardziel, P., and Hu, X. (2020). Score-CAM: Score-weighted visual explanations for convolutional neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, pages 24–25.
Zhou, B., Khosla, A., Lapedriza, A., Oliva, A., and Torralba, A. (2016). Learning deep features for discriminative localization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2921–2929.