Beyond Labels: Self-Attention-Driven Semantic Separation Using
Principal Component Clustering in Latent Diffusion Models
Felix Stillger¹,² ᵃ, Frederik Hasecke² ᵇ, Lukas Hahn² ᶜ and Tobias Meisen¹ ᵈ
¹University of Wuppertal, Gaußstraße 20, Wuppertal, Germany
²APTIV, Am Technologiepark 1, Wuppertal, Germany
{felix.stillger, meisen}@uni-wuppertal.de, {frederik.hasecke, lukas.hahn}@aptiv.com
Keywords:
Diffusion Model, Self-Attention, Segmentation.
Abstract:
High-quality annotated datasets are crucial for training semantic segmentation models, yet their manual cre-
ation and annotation are labor-intensive and costly. In this paper, we introduce a novel method for generating
class-agnostic semantic segmentation masks by leveraging the self-attention maps of latent diffusion models,
such as Stable Diffusion. Our approach is entirely learning-free and explores the potential of self-attention
maps to produce semantically meaningful segmentation masks. Central to our method is the reduction of
individual self-attention information to condense the essential features required for semantic distinction. We
employ multiple instances of unsupervised k-means clustering to generate clusters, with increasing cluster
counts leading to more specialized semantic abstraction. We evaluate our approach using state-of-the-art mod-
els such as Segment Anything (SAM) and Mask2Former, which are trained on extensive datasets of manually
annotated masks. Our results, demonstrated on both synthetic and real-world images, show that our method
generates high-resolution masks with adjustable granularity, relying solely on the intrinsic scene understand-
ing of the latent diffusion model - without requiring any training or fine-tuning.
1 INTRODUCTION
Semantic segmentation is a fundamental task in com-
puter vision, with applications ranging from au-
tonomous driving to medical image analysis. How-
ever, the process of creating large, annotated datasets
to train segmentation models is both time-consuming
and costly. This has prompted increasing interest
in methods that leverage existing data, models, or
mechanisms to bypass the need for data creation
and manual annotation. Generative models, particu-
larly diffusion-based models like Stable Diffusion 2.1
(Rombach et al., 2022a), have shown remarkable ca-
pabilities in generating detailed and coherent images,
yet their potential to assist in generating segmenta-
tion masks remains underexplored. In this work, we
investigate the intrinsic ability of diffusion models to
produce class-agnostic semantic segmentation masks
without any modification to the models themselves
or reliance on additional pre-trained networks (see
Figure 1). Specifically, we exploit the self-attention
ᵃ https://orcid.org/0009-0006-9771-6233
ᵇ https://orcid.org/0000-0002-6724-5649
ᶜ https://orcid.org/0000-0003-0290-0371
ᵈ https://orcid.org/0000-0002-1969-559X
Figure 1: Our method generates class-agnostic yet seman-
tically meaningful segmentation masks. The highlighted
pixel (marked by a star in the upper left image) can be as-
sociated with various semantic categories, such as left eye,
eyes, face, cat, and foreground. These segmentation masks
are produced solely through the self-attention mechanism
of Stable Diffusion, without relying on any external image
features.
mechanisms embedded in latent diffusion models,
which are designed to enhance image generation qual-
ity by capturing relationships between different parts
of the image (Hong et al., 2023). While self-attention
has been used in previous efforts to create segmenta-
tion masks, it has not been fully explored at the gran-
ularity of individual attention heads. We hypothesize
that the self-attention heads within these models en-
code sufficient information about image structure and
content, enabling the segmentation of distinct regions
with semantically meaningful boundaries - without
the need for external supervision.
Previous methods, such as (Nguyen et al., 2023)
and (Tian et al., 2024), typically aggregate self-
attention maps by averaging or summing over atten-
tion heads and/or features to manage the large ten-
sor sizes involved. In contrast, our approach lever-
ages the individual multi-head self-attention maps in-
dependently, preserving their distinct objectives and
enabling the derivation of more fine-grained semantic
masks.
Our main contributions are as follows:
Head-Wise Self-Attention Analysis. We con-
duct a detailed analysis of the individual self-
attention maps from each head in Stable Diffu-
sion, demonstrating how they contribute to se-
mantic separation within an image.
Class-Agnostic Mask Generation. We propose
a novel method for generating semantic segmen-
tation masks across multiple levels of granular-
ity—ranging from coarse to fine—directly from
the self-attention features of the diffusion model.
Zero-Shot Segmentation. We validate our ap-
proach in the context of zero-shot segmentation,
showcasing the ability to interpret and semanti-
cally segment real-world images without any prior
training or fine-tuning.
2 RELATED WORK
Numerous text-to-image diffusion models have been
developed to generate images from textual prompts,
with notable examples including DALL-E 3 (Betker
et al., 2023), Imagen (Saharia et al., 2022), Muse
(Chang et al., 2023), and Stable Diffusion (Rombach
et al., 2022b). Among these, Stable Diffusion stands
out as an open-source model capable of synthesizing
high-resolution images containing multiple objects in
one scene. This is achieved by encoding the input
text into a latent space, where a diffusion process is
applied using a denoising network. The final image is
then reconstructed through a decoder.
Previous works have explored the role of self-
attention in generative models, particularly diffusion-
based models. For instance, (Vaswani et al., 2023)
examined the self-attention mechanism in Stable Dif-
fusion and concluded that it encapsulates valuable
layout and shape information. SegDiff (Amit et al.,
2022) introduces a segmentation approach for diffu-
sion models but relies on ground truth data for ac-
curate segmentation. DiffuMask (Wu et al., 2023),
building on AffinityNet (Ahn and Kwak, 2018), uses
cross-attentions to generate foreground masks, yet,
like SegDiff, it produces only one foreground mask
per image. Dataset Diffusion (Nguyen et al., 2023)
was the first to enable the generation of multiple ob-
ject masks per image by combining both self- and
cross-attentions. However, its reliance on cross-
attentions results in coarse segmentation masks, as
cross-attentions provide only broad region informa-
tion and still require the finer details derived from
self-attention maps.
SliME (Khani et al., 2024) refines self-attentions
to improve cross-attention segmentation but still re-
quires ground truth data to specify the segmentation
style. Ref-Diff (Ni et al., 2023) demonstrates how
generative models can leverage the connection be-
tween visual elements and text descriptions, and in-
troduces a diffusional segmentor for zero-shot seg-
mentation. DiffSeg (Tian et al., 2024) treats atten-
tion resolutions differently via an iterative merging
approach but averages multi-head attentions, assum-
ing similar objectives using Kullback-Leibler diver-
gence. A network-free approach to obtain semantic segmentation masks is proposed in (Feng et al., 2023), where
semantic segmentation masks are generated by ob-
serving pixel connectivity through synthetic image
variants, utilizing Generative Adversarial Networks
(Goodfellow et al., 2014) rather than diffusion mod-
els. Meanwhile, iSeg (Sun et al., 2024) proposes an
iterative refinement framework to reduce the entropy
of self-attention maps, applying their module to unsu-
pervised segmentation tasks and mask generation for
synthetic datasets.
DAAM (Tang et al., 2022) introduces a method
to visualize the cross-attention between words and
pixels, producing pixel-level attribution maps us-
ing word-pixel scores from the denoising network.
OVAM (Marcos-Manchón et al., 2024) extends this
idea to generate cross-attention maps for open vo-
cabulary tasks, enabling the segmentation of semantic
meanings that may not be explicitly represented in the
textual prompt without altering the image generation
process.
To the best of our knowledge, no existing method
leverages individual self-attention heads in a learning-
free manner to generate multiple high-resolution
masks for semantic separation within attention-based
generative models.
3 INTERPRETING SELF-ATTENTIONS
The attention mechanism in diffusion models is used
to focus on particular objectives and enhance recon-
struction abilities, thereby improving the final sam-
ple quality (Hong et al., 2023). The basic scaled dot-
product attention is computed as follows:
\[ \text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^{T}}{\sqrt{d_k}}\right) V \quad (1) \]
where \(Q \in \mathbb{R}^{n \times d_k}\) is the query matrix, \(K \in \mathbb{R}^{n \times d_k}\) is the key matrix, \(V \in \mathbb{R}^{n \times d_v}\) is the value matrix, \(d_k\) is the dimensionality of the key/query vectors, \(d_v\) the dimensionality of the value vectors, and \(n\) the number of tokens, i.e. spatial positions of the attention map (Vaswani et al., 2023).
In multi-head attention, the final attention map is
derived from the outputs of several individual atten-
tion heads. Each head processes the input by linearly
projecting the query, key, and value matrices into dis-
tinct subspaces, allowing each head to focus on differ-
ent parts of the input. These projections are then com-
bined by concatenating the outputs from all heads, re-
sulting in a richer representation that captures more
nuanced relationships between the elements in the in-
put.
\[ \text{MHead}(Q, K, V) = \text{Concat}(\text{head}_1, \dots, \text{head}_h)\, W^{O} \quad \text{with} \quad \text{head}_i = \text{Attention}(QW_i^{Q},\, KW_i^{K},\, VW_i^{V}) \quad (2) \]
where \(W_i^{Q} \in \mathbb{R}^{d_{\text{model}} \times d_k}\), \(W_i^{K} \in \mathbb{R}^{d_{\text{model}} \times d_k}\), \(W_i^{V} \in \mathbb{R}^{d_{\text{model}} \times d_v}\), and \(W^{O} \in \mathbb{R}^{h d_v \times d_{\text{model}}}\) (Vaswani et al., 2023).
This allows the model to jointly attend to information from different representation subspaces (Vaswani et al., 2023).
For multi-head self-attention, this means that each head in the transformer block is capable of focusing on distinct objectives and capturing different aspects of the features.
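As an illustration, the following minimal PyTorch sketch computes the softmax attention map of every head separately instead of concatenating the head outputs; the function name and the standalone tensor shapes are illustrative assumptions, not the Stable Diffusion implementation:

```python
import torch
import torch.nn.functional as F

def per_head_attention_maps(x, w_q, w_k, num_heads):
    """Return the softmax attention map of every head separately.

    x:   (n, d_model) token features of one self-attention layer
    w_q: (d_model, num_heads * d_k) query projection weights
    w_k: (d_model, num_heads * d_k) key projection weights
    """
    n, _ = x.shape
    d_k = w_q.shape[1] // num_heads
    # project and split into heads: (num_heads, n, d_k)
    q = (x @ w_q).reshape(n, num_heads, d_k).transpose(0, 1)
    k = (x @ w_k).reshape(n, num_heads, d_k).transpose(0, 1)
    # scaled dot-product attention per head: (num_heads, n, n)
    scores = q @ k.transpose(-2, -1) / d_k ** 0.5
    return F.softmax(scores, dim=-1)

# e.g. 4096 spatial tokens (64x64) with 5 heads of dimension 64, as in Table 1
x = torch.randn(4096, 320)
maps = per_head_attention_maps(x, torch.randn(320, 5 * 64),
                               torch.randn(320, 5 * 64), num_heads=5)
```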
Stable Diffusion is comprised of three main com-
ponents. To encode and decode images, a pre-
trained VAE (Kingma and Welling, 2022) is em-
ployed. The second component is a text encoder, such
as CLIP (Radford et al., 2021), which encodes a tex-
tual prompt into latent space. The core of the model
is the denoising U-Net, which reconstructs images
under conditioning from noise by learning a denois-
ing strategy that incorporates self- and cross-attention
mechanisms (Rombach et al., 2022b).
In Stable Diffusion, multi-head self-attention layers
are positioned at various stages within the denoising
Table 1: Self-attention positions with layer name, count of
individual heads, resolution and feature size of Stable Dif-
fusion 2.1.
Name Heads Resolution Feature Size
Down-1 5 4096 (64x64) 64
Down-2 10 1024 (32x32) 64
Down-3 20 256 (16x16) 64
Mid 20 64 (8x8) 64
Up-2 20 256 (16x16) 64
Up-3 10 1024 (32x32) 64
Up-4 5 4096 (64x64) 64
U-Net architecture. Our method utilizes Stable Dif-
fusion 2.1, as this version features a larger number of
attention heads compared to earlier versions, such as
Stable Diffusion 1.5, which incorporates only eight
parallel heads per layer (Rombach et al., 2022a). The
specific positions and sizes of the self-attention layers
in Stable Diffusion 2.1 are summarized in Table 1 and
further detailed in Figure 12.
Previous methods for leveraging attention maps,
such as the aforementioned Dataset Diffusion
(Nguyen et al., 2023), aggregate information by sum-
ming and averaging the head-wise output over all time
steps without using additional pre-trained models for
practical purposes. The averaging and summing pro-
cedures lead to a loss of the individual separation ca-
pabilities inherent to each attention head, as they fo-
cus solely on aggregated areas of interest. To par-
tially recover this lost precision, methods like Dataset
Diffusion (Nguyen et al., 2023) average across multi-
ple iteration steps. In contrast, our approach seeks to
preserve and condense the rich information embed-
ded within each self-attention map. We hypothesize
that these maps inherently encode sufficient details to
separate objects in the generated image based on their
semantic meaning. Averaging across attention heads,
however, can dilute the semantic information, limit-
ing the ability to capture fine-grained distinctions.
Our methodology operates under the assumption
that head-wise data can be effectively reduced to a
more manageable form while preserving the unique
subspace representation of each individual attention
head. Furthermore, the specific objectives of each
head can be visualized, allowing for direct inspection
of their role in the segmentation process.
Each layer within the self-attention mechanism cap-
tures distinct features and serves different objectives.
In Stable Diffusion 2.1, the number of attention heads
varies based on their position within the architecture.
The downstream flow contains two consecutive trans-
former layers for self-attention, while the upstream
flow incorporates three consecutive transformer lay-
ers, and the middle block features a single transformer
layer. Although the resolutions change in accordance
with the U-Net layers, the feature dimensions remain
consistent across all layers.
Selecting the layers and features with the high-
est probabilities for object separation based on se-
mantic meaning is a significant challenge due to the
high-dimensional nature of the multi-head outputs.
To facilitate visual interpretation of these objectives,
we employ Principal Component Analysis (PCA) as
a standard dimensionality reduction technique. A
beneficial side effect of dimensionality reduction is
its ability to highlight the most distinguishable ele-
ments of a feature map while simultaneously discard-
ing noise. Visualizing the PCA of an individual atten-
tion head enables us to interpret and understand the
semantics encoded within that head.
A straightforward method involves computing the
top three principal components of a self-attention fea-
ture map for an individual head and mapping these
components to RGB-values. In this mapping, simi-
lar colors indicate alignment across all three princi-
pal components, whereas visual differences in color
highlight discrepancies in at least one component.
This color-based separation allows for the easy iden-
tification of clusters and their corresponding objec-
tives. Additionally, we interpret the resulting PCA
values not only as a visual representation of the self-
attention outputs but also as an encoding of objec-
tives, with the distances between these representa-
tions reflecting similarities in image construction ob-
jectives. The resulting clusters/classes need not be
human-understandable; they correspond to the intrin-
sic clusters that the model perceives as classes and
their subclasses. Still, the principal components are interpreted as important objectives that are able to represent a human-made class definition. For example, a face contains subclasses like eyes, which are identified without any prompting because the model must understand that eyes belong in a particular area of the face in order to generate it. The PCA is conducted to understand the under-
lying objectives of each head and further to improve
and accelerate downstream algorithms by extracting
the most semantically significant data.
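As an illustrative sketch, assuming a head's output is available as an array of shape (positions, feature size) and with helper names chosen here for exposition only, the RGB visualization can be produced as follows:

```python
import numpy as np
from sklearn.decomposition import PCA

def head_to_rgb(head_features, side):
    """Map one head's flattened feature map (side*side, feature_size)
    to an RGB image via its first three principal components."""
    components = PCA(n_components=3).fit_transform(head_features)
    # normalize each component to [0, 1] so it can be used as a color channel
    components -= components.min(axis=0)
    components /= components.max(axis=0) + 1e-8
    return components.reshape(side, side, 3)

# e.g. a 64x64 self-attention head with 64 features per spatial position
rgb = head_to_rgb(np.random.rand(64 * 64, 64), side=64)
```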
To illustrate our findings, we provide an example using the prompt "a baby in a yellow toy car" together with the classes depicted in the image, ["person", "car"]. This follows the prompt style of (Nguyen et al., 2023), which appends the class names to the actual prompt, separated by ";". This prompt is employed to generate an image using Stable Diffusion, and the resulting output is shown in Figure 2.
Figure 3 depicts the visualization of the first head
of every multi-head attention (see Appendix Figure
12 for more information). The visualization starts
with the downstream flow, transitioning from a 64x64
Figure 2: Output image by Stable Diffusion 2.1 based on
the prompt: ”a baby in a yellow toy car; person, car”.
self-attention resolution to 16x16. As the resolution
decreases, a greater number of distinct color clus-
ters become apparent, while higher resolutions reveal
finer clusters and more detailed features. The con-
tours of objects are marked by abrupt color changes,
effectively approximating the original image without
relying on image space features. This demonstrates
that the principal components successfully condense
the objectives, visualize the objectives in a human-
interpretable way, support our hypothesis of the in-
trinsic semantic information, and can be further uti-
lized to segment an image into its semantic compo-
nents.
Figure 3: Principal component visualization of the first
multi-head attention heads in U-Net order, progressing from
left to right and top to bottom. The resolution begins at
64x64 and decreases to 16x16 in the down layers, with the
mid block at 8x8 resolution. The upstream flow starts at
16x16 and increases back to 64x64 in the up layers. The first three principal components are mapped to RGB values.
The first self-attention in Figure 3, located at the
upper left, demonstrates objective separation capabil-
ity. The eyes, mouth, and hair are represented with
distinct colors, indicating significant differences in at
least one principal component, in comparison to the
rest of the face and skin. It is noteworthy that the sec-
ond (right) visualization of the first downstream layer
shows a distinct color cluster for the wheels (green-
ish) of the car, while the left head’s visualization has
one color cluster for the black parts (including shadow
of the car and wheels) of the car. Averaging over
these head maps would result in the loss of impor-
tant details, as some of these distinct clusters would
be merged or diminished. This underscores the ra-
tionale for our approach of leveraging head-specific
features, as individual heads may capture unique ob-
jectives that are otherwise lost through averaging.
However, it is important to note that the car and
the child’s face may appear closely related in this rep-
resentation, leading to similar colors. This observa-
tion highlights the necessity of incorporating diverse
heads and their respective features when applying
clustering algorithms to ensure more effective sepa-
ration of the image’s distinct semantic components.
The subsequent principal component visualization
of the second down layer with a 64x64 resolution suc-
cessfully separates the car from the child. However,
as the resolution of the self-attention principal com-
ponents decreases, the clusters lose finer details. Additionally, we observed subtle differences in seman-
tic separation between the upstream and downstream
self-attention layers, which we further analyze in Ta-
ble 2.
Figure 3 already illustrates the advantage of vi-
sualizing individual heads. Notably, both the down-
stream and upstream layers reveal distinct color clus-
ters that correspond well to the objects in the output
image shown in Figure 2.
In addition to the upstream and downstream lay-
ers, Stable Diffusion 2.1 includes a single self-
attention layer in the middle block. To explore its
relevance, we conducted a PCA with three compo-
nents and visualized the head outputs. The results re-
vealed no noticeable color clusters corresponding to
the semantic meaning of the output image, as shown
in Figure 14, which includes all 20 heads of the mid-
dle block from the previous example.
Our experiments indicate that this layer does not
contribute meaningful semantic information. We hy-
pothesize that this lack of semantic separability is due
to the block’s low resolution. This conclusion is fur-
ther supported by the quantitative results presented in
Table 8, where we repeated the analysis from Table 2
but included the middle self-attention features. The
overall performance did not improve, and in some
cases, it even worsened when the middle block fea-
tures were incorporated. Consequently, we exclude
the middle block from further investigation.
4 METHODOLOGY FOR MASK GENERATION
Figure 4: Main methodology of our proposed method.
Our method is built on top of the OVAM (Marcos-
Manchón et al., 2024) source code to extract the raw
self-attentions from the denoising U-Net. To demon-
strate the effectiveness of our approach, we utilize
publicly available text prompts from Dataset Diffu-
sion (Nguyen et al., 2023), which are derived from
images in the Pascal VOC training dataset (Evering-
ham et al., 2012). The image captions for these
prompts are generated using BLIP (Li et al., 2022).
4.1 Process Self-Attentions
Our main methodology, as illustrated in Figure 4, in-
volves extracting self-attention maps from up to 16
different layers. The specific positions of these layers
are detailed in Figure 12, with corresponding sizes
presented in Table 1. To process the self-attention
maps, we apply bilinear upsampling to the head-wise
tensors, which are reduced via principal component
analysis, ensuring a uniform shape across all layers.
This approach yields PCA feature maps from up to
195 individual heads, all at the same resolution. De-
pending on their position within the network, these
maps may capture different intrinsic semantic infor-
mation.
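A minimal sketch of this reduction step is given below; it assumes each head's features are available as a (positions, feature size) array, uses torch.nn.functional.interpolate for the bilinear upsampling, and the function names are chosen here for illustration only:

```python
import numpy as np
import torch
import torch.nn.functional as F
from sklearn.decomposition import PCA

def reduce_and_upsample(head_features, target=64, n_components=3):
    """Reduce one head's (h*w, feature_size) map with PCA, then bilinearly
    upsample the components to a common (target x target) resolution."""
    h = w = int(np.sqrt(head_features.shape[0]))
    reduced = PCA(n_components=n_components).fit_transform(head_features)
    grid = torch.from_numpy(reduced.T.reshape(1, n_components, h, w)).float()
    up = F.interpolate(grid, size=(target, target), mode="bilinear",
                       align_corners=False)
    # return as (target*target, n_components) for later clustering
    return up[0].reshape(n_components, -1).T.numpy()

def stack_heads(heads, target=64, n_components=3):
    """Stack the reduced maps of all selected heads into one feature matrix.

    heads: list of (h*w, feature_size) arrays from the chosen attention layers
    """
    return np.concatenate(
        [reduce_and_upsample(h, target, n_components) for h in heads], axis=1)
```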
To obtain masks from the reduced self-attentions,
we apply k-means clustering to the stacked princi-
pal component features of the sub-selected heads. K-
means is a simple clustering algorithm that does not
need any additional parameters besides the cluster
count k. This is an advantage, as it allows fast convergence without the need to fine-tune parameters on our examples. We use a fixed seed for
initialization to ensure that our method remains re-
producible over all experiments. To address potential
issues with fixed initialization, we experiment with
various cluster counts and cluster the same features
multiple times over an increasing cluster count. This
allows us to produce multiple masks with different
semantic meaning per image. As a straightforward
mask reduction step, we merge masks with a high in-
tersection over union into a single unified mask. Ex-
amples of this approach are shown in Figure 5, where
the generated images are displayed in the top row,
and the corresponding clusters, with cluster counts
[2, 5, 6, 10, 20], are shown in the rows below. A low cluster count [2] largely corresponds to a background-foreground segmentation, whereas clusters with a cluster count of [5, 6] correspond to high-level classes like child, car, or cat. Clusters from a high cluster count [10, 20] correspond to low-level semantic segmentation masks like face, eyes, and ears.
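The clustering and merging steps can be sketched as follows; this is a simplified illustration in which the IoU threshold of 0.9 and the helper names are assumptions rather than values reported here:

```python
import numpy as np
from sklearn.cluster import KMeans

def masks_for_cluster_counts(features, counts=(2, 5, 6, 10, 20), side=64, seed=0):
    """Cluster the stacked PCA features several times with increasing k and
    return one boolean mask per cluster."""
    masks = []
    for k in counts:
        labels = KMeans(n_clusters=k, random_state=seed,
                        n_init=10).fit_predict(features)
        labels = labels.reshape(side, side)
        masks += [labels == c for c in range(k)]
    return masks

def merge_similar_masks(masks, iou_threshold=0.9):
    """Greedily drop masks that almost duplicate an already kept mask."""
    kept = []
    for m in masks:
        duplicate = any(
            np.logical_and(m, k).sum() / max(np.logical_or(m, k).sum(), 1)
            > iou_threshold
            for k in kept)
        if not duplicate:
            kept.append(m)
    return kept
```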
The resolution, and therefore the abstraction level, of the self-attentions is crucial: low-resolution self-
attentions provide connectivity and distinguishable
features for low-level objectives and therefore coarse
segmentation but suffer from imprecise contours,
whereas high-resolution self-attentions capture finer
details, such as facial features, but may miss some
obvious connections among masks (see Figure 11 for
more details).
Depending on the target objective and computing
time, parameters such as the number of k-means clus-
ters, upsampling resolution, number of principal com-
ponents, and position selection of the self-attention
layers can be varied. We utilize scikit-learn (Pe-
dregosa et al., 2011) to conduct our upsampling and
clustering. The choice of the optimal setup depends
on the desired outcome. For instance, if the goal is to
segment only a person, a coarse segmentation, such as
background versus foreground, might suffice. How-
ever, if the masks are intended for an image-to-image inpainting approach and a detailed instance definition, e.g. "eyes", is of interest, a much finer segmentation is required, and therefore a higher cluster count has to be set.
As demonstrated qualitatively in Figure 5 and Fig-
ure 11, our approach is capable of segmenting fine-
grained classes and multiple objects within a single
scene. For instance, in the last column, a monitor dis-
playing content is placed on a table in a room. Our
method effectively separates the monitor, its content,
and the table it is placed on. With an increasing clus-
ter count, hierarchical masks can be defined, such as
identifying a head as part of a person, which is part of
the content on the monitor, which in turn belongs to
the monitor. Additionally, Figure 11 in the Appendix provides a comparison of the generated clusters derived solely from the 16x16 self-attention maps.
This qualitative analysis shows that clusters de-
rived from low-resolution self-attentions tend to cap-
ture more semantic meaning, closely aligning with
Figure 5: K-means clustering over multiple cluster counts and five examples (rows, top to bottom: image, 2, 5, 6, 10, and 20 clusters). Every color represents a single cluster for attention maps of sizes [16x16, 32x32, 64x64]. See Figure 11 (Appendix) for more examples, where clusters from only [16x16] attentions are also presented.
high-level classes such as those in Pascal VOC. In
contrast, higher-resolution self-attentions are better
suited for generating fine-grained, low-level segmen-
tations. However, due to the significantly lower res-
olution of the self-attentions, at minimum eight times lower than the output image, the cluster boundaries derived from the principal components lack well-defined edges, especially at lower resolutions.
As the number of clusters increases, some pseudo-
clusters may form due to transitions between truly
meaningful clusters, which is an inherent limitation of
the bilinear interpolation process. Additionally, ob-
ject contours remain uncertain because no final out-
put image features are incorporated into the clustering
process. Introducing extraneous information, such as
pixel positions or image features, could result in clus-
ters being overly influenced by these features, thereby
reducing the interpretability and effectiveness of the
clustering method.
To improve accuracy, a refinement method is
needed. We propose leveraging the upsampling error
as a potential means of refinement and incorporating
image features in a post-processing step. A brief anal-
ysis of the upsampling error is provided in the Appendix (see Equation 6).
4.2 Hyperparameter Studies and Evaluation
In order to also quantitatively evaluate our method
and determine suitable parameters, we use two ex-
ternal task-specific state-of-the-art models to gener-
ate pseudo-labels on images generated by the diffu-
sion model. First, we employ the Segment Anything
model (Kirillov et al., 2023) with a ViT-L backbone,
which is expected to generate instance masks without
domain knowledge. Second, we use Mask2Former
(Cheng et al., 2022) with a Swin-L backbone, which
is pre-trained on the ADE20K dataset (Zhou et al.,
2017) and is available as a pre-trained implemen-
tation via the mmseg toolbox (Contributors, 2020).
Unlike instance segmentation, this model provides
class-specific semantic segmentations, and its train-
ing classes closely align with the Pascal VOC classes,
which are the basis for the prompts used in our model.
Both of these methods are trained on labeled
segmentation ground truth, whereas our method has
never seen or trained on a segmentation mask. In-
stead, our method relies solely on the intrinsic seman-
tic knowledge derived from the self-attentions of Sta-
ble Diffusion and its derived features.
To assess the performance of our method, we
compare the semantic masks generated by our ap-
proach with those produced by the two evaluation
models. This comparison aims to highlight the effec-
tiveness of our method relative to an instance-focused
model and a class-based segmentation model. We
omit the classification component, as it would only be
feasible with cross-attention. For evaluation, we use
Intersection over Union (IoU) as our metric, defined
as follows:
\[ \text{IoU}(A, B) = \frac{\text{Area of Intersection}}{\text{Area of Union}} = \frac{|A \cap B|}{|A \cup B|} \quad (3) \]
To provide a comprehensive evaluation of the
most significant masks, we define the Top-n metric.
This metric selects the n-largest masks from the ex-
ternal models and computes the average IoU with the
corresponding classes from our method:
\[ \text{Top-}n = \frac{1}{N} \sum_{n=1}^{N} \max_{B \,\in\, \text{predicted masks}} \frac{|A_n \cap B|}{|A_n \cup B|} \quad (4) \]
where \(A_n\) represents the n-th largest mask from the external models, and \(B\) denotes the masks from our method.
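A minimal sketch of these metrics over boolean mask arrays (the function names are chosen here for illustration):

```python
import numpy as np

def iou(a, b):
    """Intersection over union of two boolean masks."""
    inter = np.logical_and(a, b).sum()
    union = np.logical_or(a, b).sum()
    return inter / union if union > 0 else 0.0

def top_n(pseudo_masks, predicted_masks, n=5):
    """Average best-match IoU between the n largest pseudo ground-truth
    masks (from SAM / Mask2Former) and our predicted masks."""
    largest = sorted(pseudo_masks, key=lambda m: m.sum(), reverse=True)[:n]
    return float(np.mean([max(iou(a, b) for b in predicted_masks)
                          for a in largest]))
```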
For the final evaluation, we average this score
over all samples. Figure 6 presents our evaluation for
the first sample using the Segment Anything Model
(SAM), and Figure 7 shows the performance of our
method compared to Mask2Former.
Figure 6: Top: SAM Top-5 masks colored in blue. Bottom:
Matched masks from our method. The IoUs between the
pseudo ground truth and our method’s masks are 0.97, 0.82,
0.94, 0.87, and 0.77 (from left to right).
The SAM evaluation provides an example of fine,
pixel-accurate masks. While it segments parts of
the image very effectively, our method identifies not
only the corresponding masks but also additional se-
mantically consistent variations that are not iden-
tified by SAM. We do not provide additional do-
main knowledge to SAM, such as masks, points or
boxes, to obtain the raw scene-interpreted masks from
SAM. On the other hand, the Mask2Former evalu-
ation serves as a sanity check, as it was trained on
the ADE20K dataset, which limits its ability to gen-
eralize beyond its training domain. In the example,
Mask2Former splits the background into an upper and a lower part, with only a single mask for the foreground. Our method closely matches these Mask2Former masks, as shown in Figure 7.
It is important to note that SAM contributes to
the Top-1 to Top-5 average score due to its ability
to generate a larger number of masks. In contrast,
Mask2Former only contributes to the Top-1 to Top-
3 average score, as it generates fewer class-specific
masks. We observed that SAM typically produces
more masks than Mask2Former, making it a more re-
liable source for the Top-n metric.
Figure 7: Top: Mask2Former Top-3 masks colored in blue.
Bottom: Matched masks from our method. The IoUs be-
tween the pseudo ground truth and our method’s masks are
0.90, 0.87, and 0.82 (from left to right).
Next, we present a study on layer positions and
reweighting of the features in Table 2. We hypothe-
size that the position of the layer affects the semantic
Table 2: Study on reweighting and layer positions, evaluated by the average top scores across multiple samples (measured in
average IoU). Self-attentions at resolutions of [16, 32, 64] are used with three principal components, bilinearly upsampled to
a resolution of 64. No middle block is incorporated.
ID Reweight Layer SAM Top-1 SAM Top-3 SAM Top-5 M2F Top-1 M2F Top-3 M2F Top-5
0 True down 0.80 0.79 0.76 0.76 0.73 0.64
1 True up 0.84 0.82 0.79 0.79 0.76 0.67
2 True both 0.84 0.81 0.78 0.79 0.76 0.66
6 False down 0.84 0.82 0.79 0.80 0.78 0.67
7 False up 0.85 0.83 0.80 0.81 0.78 0.68
8 False both 0.84 0.83 0.80 0.80 0.78 0.68
meaning of the generated masks. While the down-
flow is expected to focus on condensing information,
the up-flow is hypothesized to support image recon-
struction. The reweighting process addresses the im-
balance in the feature map distribution across differ-
ent layer positions, particularly due to the higher num-
ber of 16x16 attention layers compared to 64x64 lay-
ers. The reweighting is defined as follows:
\[ x_{\text{reweighted}}(r_i) = x_{\text{original}}(r_i) \cdot \frac{r_i}{\max(r_1, r_2, r_3)}, \quad r_i \in \{16, 32, 64\} \quad (5) \]
where \(x(r_i)\) denotes the stacked features of attention resolution \(r_i\), which are reweighted according to their resolution.
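A sketch of this reweighting, read directly from Equation 5; the dictionary layout is an assumption for exposition:

```python
def reweight(features_by_resolution):
    """Scale the stacked PCA features of each attention resolution by
    r / max(r), damping the more numerous low-resolution maps.

    features_by_resolution: dict mapping resolution r in {16, 32, 64}
    to the stacked feature matrix of that resolution's heads."""
    r_max = max(features_by_resolution)
    return {r: x * (r / r_max) for r, x in features_by_resolution.items()}
```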
The study presented in Table 2 demonstrates that self-attentions from the upstream layers are more valuable for reconstructing accurate masks than those from the downstream layers: every configuration using only "down" layers is inferior to its counterpart using only "up" layers. Including the downstream features does not harm performance, but using "both" sets of self-attentions yields no additional gain either. Reweighting also has no positive impact in this scenario; the non-reweighted clusters are slightly ahead, which further supports the importance of the low-resolution self-attention maps.
We also conducted a study on the impact of the number of principal components, as shown in Figure 8 and Table 7. These results demonstrate that applying PCA to the self-attention features enhances clustering compared to using the raw features. While there is no significant change in mask accuracy, the speedup achieved is given by speedup = 64 / (number of principal components). The study shows an improvement in the SAM Top-n metric when using approximately eight principal components, with the optimal number being around 32. This analysis was conducted with bilinear feature upsampling to a resolution of 256. The study also reveals that we can reduce the original feature parameters by more than a factor of 20 (by applying PCA to eight components and using only the "up" attention layers) without sacrificing the final average performance.
Finally, we examined how the position of feature
layers affects which resolution of self-attention pro-
Figure 8: Impact of varying the number of principal components on multiple SAM scores. Higher dimensionality does not necessarily lead to greater performance. Detailed metrics in Table 7.
vides the most accurate semantic information. This is motivated by our qualitative observation that the low-resolution self-attention features represent coarse semantics, while the higher-resolution self-attentions carry finer details (see Figure 11). The higher-resolution self-attentions should, in turn, better represent the contour of an object because of their higher feature resolution.
Based on this study, we concluded that the PCA com-
ponents should be taken from all reasonable attention
layers across all resolutions, specifically the [16x16,
32x32, 64x64] self-attention layers for best perfor-
mance, as shown in Table 6. The mix of coarse and
fine features provides the best performance. For effi-
ciency, we bilinearly upsampled to 64x64 to perform
this study in a reasonable time frame.
5 ZERO-SHOT SEGMENTATION
To demonstrate the capabilities of our findings, we combine our method with an image-to-image approach to enable zero-shot segmentation on arbitrary images. To validate this, we apply our method to a real-world image, which we lightly denoise with Sta-
Figure 9: Zero-shot segmentation of our method with ascending cluster counts (image, then 2, 5, 15, and 20 clusters) from top to bottom, going from high-level segmentation to low-level segmentation.
ble Diffusion to obtain the self-attention features. In
order not to change the original content of the image too much while still obtaining segmentation masks, we apply only a small denoising strength (strength = 0.05) and perform only 2 iteration steps. Furthermore,
we do not use an input prompt (we input an empty
string), but one can easily be generated with any im-
age descriptor, such as BLIP (Li et al., 2022). We
validate our method using the public datasets Pascal
VOC and Cityscapes, where we are able to compare
against a public ground truth and the performance of
the SOTA-segmentation algorithm SAM. Our work-
flow is the same as presented previously in section
4. Figure 9 illustrates the general capabilities of our
method on real-world images. Our method demon-
strates that self-attention features convey semantic in-
formation for classes and even subclasses, and that the
internal model’s interpretation of them can be visual-
ized. The zero-shot application shows that the self-attentions can be employed to segment the content of a given image, not only of images generated from a textual prompt, so the method can also be utilized to obtain segmentation masks in the wild. Remarkably, our method connects the same classes across pixel gaps and groups classes together, as visualized in the third image from the left in Figure 9, where people's faces or clothes are assigned to the same cluster.
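A minimal sketch of this image-to-image setup, assuming a diffusers-style pipeline and capturing the "attn1" (self-attention) module outputs with forward hooks; the module names depend on the diffusers version, per-head maps require a custom attention processor as in the OVAM code we build on, and `input_image` is a placeholder for a user-supplied PIL image:

```python
import torch
from diffusers import StableDiffusionImg2ImgPipeline

pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-1", torch_dtype=torch.float16).to("cuda")

captured = {}

def make_hook(name):
    def hook(_module, _inputs, output):
        # store the self-attention block output of this layer
        captured[name] = output.detach().float().cpu()
    return hook

for name, module in pipe.unet.named_modules():
    if name.endswith("attn1"):  # self-attention blocks of the denoising U-Net
        module.register_forward_hook(make_hook(name))

# strength = 0.05 keeps the input almost unchanged; with a 40-step schedule
# this corresponds to roughly 2 actual denoising iterations; empty prompt
result = pipe(prompt="", image=input_image, strength=0.05,
              num_inference_steps=40).images[0]
```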
Table 3: Performance comparison of our method and SAM
variants on Cityscapes and Pascal VOC validation datasets
in average class IoU.
Validation Set Our Method SAM ViT-H SAM ViT-B SAM ViT-L
Pascal VOC Semantic 0.66 0.69 0.39 0.65
Pascal VOC Instance 0.62 0.74 0.39 0.65
Cityscapes Semantic 0.27 0.44 0.26 0.43
Cityscapes Instance 0.21 0.26 0.15 0.25
As the number of clusters increases, these high-level clusters split, as discussed, into lower-level clusters. This behavior is possible because no additional pixel position data or other image information is incorporated.
In order to evaluate and compare our method with
SAM, we conduct an analysis using the Pascal VOC
and Cityscapes validation sets. We bypass the classi-
fication task and match the generated mask with the
highest IoU to the ground truth mask for both SAM
and our method. Subsequently, we determined the
class-wise IoU per image and averaged the results
over all samples. We chose an average class IoU per image instead of an mIoU so that more complex scenes with finer and smaller masks are not weighted less than simple scenes. We use both the semantic and instance labels
for comparison. The results are presented in Table 3,
which shows that the best performance was achieved
on the semantic segmentation validation set of the
Pascal VOC dataset. Our method outperforms the
strongest SAM with ViT-H backbone in some classes
(see Appendix Table 4 for more details).
However, the overall average class performance is slightly below that of the best-performing SAM model, but above the rest. The gap widens for instance segmentation, where the best-performing SAM model scores noticeably higher, although our method remains close to the SAM model with ViT-L backbone. The Cityscapes dataset contains many finer and more complex segmentations, and here our method falls further behind the best SAM model, outperforming only the weakest SAM model with the ViT-B backbone.
Nevertheless, our performance is remarkable con-
sidering that we use an untrained method for seman-
tic segmentation, while SAM is self-supervised and
trained on millions of images and more than a billion annotated masks. We have added some examples in the
Appendix in Figure 13 where one will notice that our
method still cannot match the pixel accuracy of SAM,
but is able to obtain masks where a class is well repre-
sented. Furthermore, our method is superior in some
edge cases where a class is occluded by other objects, and our method is able to associate the class with a mask where all SAM models fail to segment it (see the horse example in Figure 13). This is due to the underlying
scene understanding of the Stable Diffusion model
compared to a model optimized for mask generation, such as SAM.
6 CONCLUSION
We have presented a novel approach to generate
high-resolution semantic masks using only the self-
attention maps from diffusion models. We show that
our method extracts semantically meaningful masks,
without requiring additional learning or pre-trained
models. This approach can be employed to directly
obtain semantic masks for self-generated images us-
ing textual prompts as input only or for zero-shot
segmentation, where an input image is given. Our
method enables the utilization of Stable Diffusion’s
inherent scene understanding for semantic separation,
a task it has not been explicitly trained on. Validation
with SAM (Kirillov et al., 2023) shows that our ap-
proach produces high-quality semantic segmentation
on par with state-of-the-art methods, while allowing the flexibility to adjust segmentation granularity. We further show that the generated masks benefit from Stable Diffusion's scene understanding, providing clusters of consistent semantic meaning across occlusions and pixel gaps.
As a direction for future work, we suggest using
cross-attention maps to further obtain class labels for
the generated masks. Additionally, a post-processing
step using the upsampling error and image features
to refine the masks could further increase their accuracy.
REFERENCES
Ahn, J. and Kwak, S. (2018). Learning pixel-level semantic
affinity with image-level supervision for weakly su-
pervised semantic segmentation.
Amit, T., Shaharbany, T., Nachmani, E., and Wolf, L.
(2022). Segdiff: Image segmentation with diffusion
probabilistic models.
Betker, J., Goh, G., Jing, L., Brooks, T., Wang, J.,
Li, L., Ouyang, L., Zhuang, J., Lee, J., Guo, Y.,
et al. (2023). Improving image generation with better captions. Computer Science. https://cdn.openai.com/papers/dall-e-3.pdf, 2(3):8.
Chang, H., Zhang, H., Barber, J., Maschinot, A., Lezama,
J., Jiang, L., Yang, M.-H., Murphy, K., Freeman,
W. T., Rubinstein, M., Li, Y., and Krishnan, D. (2023).
Muse: Text-to-image generation via masked genera-
tive transformers.
Cheng, B., Misra, I., Schwing, A. G., Kirillov, A., and Gird-
har, R. (2022). Masked-attention mask transformer for
universal image segmentation.
Contributors, M. (2020). MMSegmentation: Openmmlab
semantic segmentation toolbox and benchmark. https://github.com/open-mmlab/mmsegmentation.
Everingham, M., Van Gool, L., Williams, C. K. I.,
Winn, J., and Zisserman, A. (2012). The PASCAL
Visual Object Classes Challenge 2012 (VOC2012)
Results. http://www.pascal-network.org/challenges/
VOC/voc2012/workshop/index.html.
Feng, Q., Gadde, R., Liao, W., Ramon, E., and Martinez,
A. (2023). Network-free, unsupervised semantic seg-
mentation with synthetic images. In Proceedings of
the IEEE/CVF Conference on Computer Vision and
Pattern Recognition, pages 23602–23610.
Goodfellow, I. J., Pouget-Abadie, J., Mirza, M., Xu, B.,
Warde-Farley, D., Ozair, S., Courville, A., and Ben-
gio, Y. (2014). Generative adversarial networks.
Hong, S., Lee, G., Jang, W., and Kim, S. (2023). Im-
proving sample quality of diffusion models using self-
attention guidance. In Proceedings of the IEEE/CVF
International Conference on Computer Vision, pages
7462–7471.
Khani, A., Taghanaki, S. A., Sanghi, A., Amiri, A. M., and
Hamarneh, G. (2024). Slime: Segment like me.
Kingma, D. P. and Welling, M. (2022). Auto-encoding vari-
ational bayes.
Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C.,
Gustafson, L., Xiao, T., Whitehead, S., Berg, A. C.,
Lo, W.-Y., Dollár, P., and Girshick, R. (2023). Seg-
ment anything.
Li, J., Li, D., Xiong, C., and Hoi, S. (2022). Blip:
Bootstrapping language-image pre-training for unified
vision-language understanding and generation.
Marcos-Manchón, P., Alcover-Couso, R., SanMiguel, J. C., and Martínez, J. M. (2024). Open-vocabulary atten-
tion maps with token optimization for semantic seg-
mentation in diffusion models.
Nguyen, Q., Vu, T., Tran, A., and Nguyen, K. (2023).
Dataset diffusion: Diffusion-based synthetic dataset
generation for pixel-level semantic segmentation.
Ni, M., Zhang, Y., Feng, K., Li, X., Guo, Y., and Zuo,
W. (2023). Ref-diff: Zero-shot referring image seg-
mentation with generative models. arXiv preprint
arXiv:2308.16777.
Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V.,
Thirion, B., Grisel, O., Blondel, M., Prettenhofer,
P., Weiss, R., Dubourg, V., Vanderplas, J., Passos,
A., Cournapeau, D., Brucher, M., Perrot, M., and
Duchesnay, E. (2011). Scikit-learn: Machine learning
in Python. Journal of Machine Learning Research,
12:2825–2830.
Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G.,
Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark,
J., Krueger, G., and Sutskever, I. (2021). Learning
transferable visual models from natural language su-
pervision.
Rombach, R., Blattmann, A., Lorenz, D., Esser, P., and
Ommer, B. (2022a). High-resolution image synthe-
sis with latent diffusion models. In Proceedings of the
IEEE/CVF Conference on Computer Vision and Pat-
tern Recognition (CVPR), pages 10684–10695.
Rombach, R., Blattmann, A., Lorenz, D., Esser, P., and
Ommer, B. (2022b). High-resolution image synthe-
sis with latent diffusion models.
Saharia, C., Chan, W., Saxena, S., Li, L., Whang, J., Den-
ton, E., Ghasemipour, S. K. S., Ayan, B. K., Mahdavi,
S. S., Lopes, R. G., Salimans, T., Ho, J., Fleet, D. J.,
and Norouzi, M. (2022). Photorealistic text-to-image
diffusion models with deep language understanding.
Sun, L., Cao, J., Xie, J., Khan, F. S., and Pang, Y. (2024).
iseg: An iterative refinement-based framework for
training-free segmentation.
Tang, R., Liu, L., Pandey, A., Jiang, Z., Yang, G., Kumar,
K., Stenetorp, P., Lin, J., and Ture, F. (2022). What
the daam: Interpreting stable diffusion using cross at-
tention.
Tian, J., Aggarwal, L., Colaco, A., Kira, Z., and Gonzalez-
Franco, M. (2024). Diffuse attend and segment: Un-
supervised zero-shot segmentation using stable diffu-
sion. In Proceedings of the IEEE/CVF Conference
on Computer Vision and Pattern Recognition, pages
3554–3563.
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones,
L., Gomez, A. N., Kaiser, L., and Polosukhin, I.
(2023). Attention is all you need.
Wu, W., Zhao, Y., Shou, M. Z., Zhou, H., and Shen, C.
(2023). Diffumask: Synthesizing images with pixel-
level annotations for semantic segmentation using dif-
fusion models. In Proceedings of the IEEE/CVF In-
ternational Conference on Computer Vision, pages
1206–1217.
Zhou, B., Zhao, H., Puig, X., Fidler, S., Barriuso, A., and
Torralba, A. (2017). Scene parsing through ade20k
dataset. In Proceedings of the IEEE conference on
computer vision and pattern recognition, pages 633–
641.
APPENDIX
Upsampling Error. The upsampling effect/error can
be computed as follows:
\[ E_{\text{Upsample}}(i, j) = \frac{1}{n} \sum_{k=1}^{n} \sqrt{\left(x_{\text{bilinear}}(i, j, k) - x_{\text{nearest}}(i, j, k)\right)^{2}} \quad (6) \]
\[ i \in \{1, 2, \dots, \text{width}\}, \quad j \in \{1, 2, \dots, \text{height}\}, \quad n = \text{count of principal components} \]
This calculation computes a pixel-wise mean of the root squared differences between the nearest-neighbor (\(x_{\text{nearest}}\)) and bilinearly upsampled (\(x_{\text{bilinear}}\)) principal components. This approach facilitates a visual esti-
mation of clusters, as illustrated in Figure 10. The
upsampling effect can be leveraged for contour refine-
ment of the final masks, filtering out potential pseudo-
clusters, or identifying objects in a generated image.
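A minimal sketch of Equation 6 over the principal-component maps of one resolution (the function name is chosen here for illustration):

```python
import numpy as np
import torch
import torch.nn.functional as F

def upsampling_error(components, target):
    """Pixel-wise mean root squared difference between bilinear and
    nearest-neighbor upsampling of the principal-component maps (Eq. 6).

    components: (n_components, h, w) array of one resolution's PCA maps."""
    x = torch.from_numpy(components[None]).float()
    bilinear = F.interpolate(x, size=(target, target), mode="bilinear",
                             align_corners=False)
    nearest = F.interpolate(x, size=(target, target), mode="nearest")
    # mean over the component dimension, keep the spatial map
    return torch.sqrt((bilinear - nearest) ** 2).mean(dim=1)[0].numpy()
```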
Table 4: Our Method vs. SAM on the Pascal VOC validation set for semantic segmentation. The metric is the per-image averaged class-wise IoU.
Class Our Method SAM ViT-H SAM ViT-L SAM ViT-B
airplane 0.65 0.73 0.70 0.34
bicycle 0.30 0.26 0.23 0.14
bird 0.70 0.79 0.74 0.47
bottle 0.52 0.74 0.72 0.52
bus 0.79 0.85 0.85 0.49
car 0.61 0.77 0.74 0.54
cat 0.83 0.86 0.84 0.46
chair 0.52 0.56 0.53 0.40
table 0.71 0.55 0.49 0.36
dog 0.79 0.84 0.83 0.39
horse 0.70 0.65 0.60 0.20
motorbike 0.67 0.66 0.50 0.18
person 0.61 0.64 0.63 0.39
potted plant 0.42 0.44 0.42 0.30
sheep 0.73 0.64 0.63 0.50
sofa 0.80 0.67 0.62 0.43
train 0.79 0.82 0.78 0.16
monitor 0.60 0.83 0.82 0.68
all classes 0.66 0.69 0.65 0.39
Figure 10: Pixel-mean upsampling effect of the individual self-attention resolutions (16x16, 32x32, 64x64) for the "baby" example.
Table 5: Our Method versus SAM on the Pascal VOC vali-
dation set for instance segmentation. The metric is the per
mask averaged class-wise IoU. The differences compared
to semantic segmentation for Pascal VOC are minimal be-
cause multiple instances of a single object are rare in this
dataset.
Class Our Method SAM ViT-H SAM ViT-L SAM ViT-B
airplane 0.62 0.73 0.70 0.34
bicycle 0.29 0.28 0.24 0.14
bird 0.69 0.84 0.74 0.47
bottle 0.48 0.81 0.72 0.52
bus 0.70 0.92 0.85 0.49
car 0.54 0.81 0.74 0.54
cat 0.81 0.89 0.84 0.46
chair 0.48 0.65 0.53 0.40
table 0.69 0.56 0.49 0.36
dog 0.77 0.88 0.83 0.39
horse 0.67 0.70 0.60 0.20
motorbike 0.63 0.68 0.50 0.18
person 0.54 0.75 0.63 0.39
potted plant 0.40 0.54 0.42 0.30
sheep 0.60 0.79 0.63 0.50
sofa 0.79 0.71 0.62 0.43
train 0.78 0.83 0.78 0.16
monitor 0.59 0.91 0.82 0.68
all classes 0.62 0.74 0.65 0.39
Figure 11: Examples comparing high-resolution masks from [16x16, 32x32, 64x64] attentions (left) with coarse masks from
[16x16] attentions (right). High-resolution masks capture finer details and sharper contours, while the [16x16] masks lack
precision in object boundaries, but focus more on high-level semantic meanings.
Figure 12: Positions of self-attentions in the denoising U-Net of Stable Diffusion 2.1. There are six self-attention layers in the down blocks, one in the middle block, and nine in the up blocks; down-4 and up-1 do not contain any transformer layers.
Table 6: Parameter study on attention layer position and principal component count, using only 64x64 bilinear feature upsampling.
ID SAM Top-1 SAM Top-3 SAM Top-5 SAM Top-10 SAM Top-15 M2F Top-1 M2F Top-2 M2F Top-3 M2F Top-5 M2F Top-10 Reweight Attention Resolution #Principal Components Upsample Resolution Sample Count
0 0.84 0.83 0.80 0.73 0.67 0.79 0.79 0.77 0.68 0.59 True [16, 32, 64] 64 64 50
1 0.84 0.82 0.80 0.72 0.67 0.79 0.78 0.76 0.67 0.59 True [16, 32, 64] 32 64 50
2 0.84 0.82 0.79 0.72 0.66 0.80 0.79 0.77 0.68 0.60 True [16, 32, 64] 16 64 50
3 0.84 0.82 0.80 0.72 0.67 0.79 0.78 0.76 0.68 0.60 True [16, 32, 64] 10 64 50
4 0.83 0.82 0.79 0.71 0.66 0.78 0.78 0.76 0.67 0.59 True [16, 32, 64] 8 64 50
5 0.84 0.82 0.79 0.70 0.65 0.79 0.78 0.76 0.67 0.59 True [16, 32, 64] 5 64 50
6 0.84 0.81 0.78 0.70 0.65 0.78 0.77 0.76 0.66 0.56 True [16, 32, 64] 3 64 50
7 0.82 0.80 0.76 0.67 0.62 0.77 0.76 0.74 0.65 0.54 True [16, 32, 64] 1 64 50
8 0.83 0.82 0.80 0.72 0.67 0.78 0.77 0.76 0.67 0.59 True [32, 64] 64 64 50
9 0.82 0.81 0.79 0.72 0.67 0.78 0.76 0.75 0.66 0.58 True [32, 64] 32 64 50
10 0.82 0.81 0.79 0.71 0.66 0.78 0.76 0.75 0.66 0.58 True [32, 64] 16 64 50
11 0.83 0.81 0.79 0.71 0.67 0.78 0.76 0.75 0.67 0.59 True [32, 64] 10 64 50
12 0.82 0.81 0.78 0.71 0.66 0.77 0.76 0.75 0.66 0.59 True [32, 64] 8 64 50
13 0.83 0.81 0.78 0.70 0.66 0.76 0.75 0.74 0.66 0.58 True [32, 64] 5 64 50
14 0.82 0.81 0.78 0.69 0.65 0.77 0.75 0.74 0.65 0.57 True [32, 64] 3 64 50
15 0.80 0.78 0.73 0.66 0.62 0.75 0.73 0.71 0.62 0.53 True [32, 64] 1 64 50
16 0.84 0.81 0.79 0.70 0.62 0.80 0.80 0.78 0.68 0.60 True [16, 32] 64 64 50
17 0.84 0.82 0.79 0.69 0.62 0.80 0.80 0.78 0.68 0.60 True [16, 32] 32 64 50
18 0.84 0.82 0.79 0.70 0.62 0.79 0.80 0.78 0.69 0.59 True [16, 32] 16 64 50
19 0.84 0.82 0.79 0.70 0.62 0.80 0.80 0.78 0.68 0.59 True [16, 32] 10 64 50
20 0.84 0.81 0.78 0.70 0.62 0.79 0.80 0.78 0.68 0.59 True [16, 32] 8 64 50
21 0.84 0.81 0.78 0.69 0.62 0.79 0.80 0.78 0.68 0.58 True [16, 32] 5 64 50
22 0.83 0.81 0.78 0.69 0.61 0.79 0.79 0.77 0.68 0.58 True [16, 32] 3 64 50
23 0.82 0.80 0.76 0.67 0.60 0.77 0.78 0.76 0.66 0.57 True [16, 32] 1 64 50
24 0.82 0.81 0.79 0.71 0.66 0.78 0.76 0.75 0.66 0.58 True [64] 64 64 50
25 0.82 0.80 0.79 0.71 0.66 0.77 0.76 0.74 0.65 0.58 True [64] 32 64 50
26 0.82 0.80 0.78 0.70 0.66 0.77 0.76 0.75 0.66 0.58 True [64] 16 64 50
27 0.82 0.80 0.78 0.70 0.66 0.77 0.76 0.74 0.66 0.58 True [64] 10 64 50
28 0.81 0.80 0.78 0.70 0.66 0.77 0.76 0.74 0.66 0.57 True [64] 8 64 50
29 0.81 0.79 0.77 0.69 0.65 0.76 0.75 0.74 0.65 0.56 True [64] 5 64 50
30 0.81 0.79 0.76 0.69 0.64 0.76 0.75 0.73 0.65 0.56 True [64] 3 64 50
31 0.79 0.78 0.74 0.66 0.61 0.74 0.73 0.71 0.63 0.54 True [64] 1 64 50
Table 7: Study on principal component count with 256x256 bilinear feature upsampling.
ID SAM Top-1 SAM Top-3 SAM Top-5 SAM Top-10 SAM Top-15 M2F Top-1 M2F Top-2 M2F Top-3 M2F Top-5 M2F Top-10 Reweight Attention Resolution #Principal Components Upsample Resolution Sample Count
0 0.85 0.83 0.82 0.74 0.69 0.80 0.79 0.78 0.69 0.61 True [16, 32, 64] 64 256 41
1 0.86 0.83 0.82 0.75 0.70 0.79 0.79 0.78 0.68 0.62 True [16, 32, 64] 32 256 41
2 0.84 0.83 0.82 0.74 0.69 0.79 0.78 0.77 0.69 0.63 True [16, 32, 64] 16 256 41
3 0.85 0.83 0.82 0.74 0.69 0.80 0.79 0.77 0.68 0.61 True [16, 32, 64] 10 256 41
4 0.85 0.83 0.82 0.74 0.68 0.79 0.79 0.78 0.68 0.61 True [16, 32, 64] 8 256 41
5 0.85 0.83 0.81 0.74 0.68 0.79 0.78 0.77 0.68 0.59 True [16, 32, 64] 5 256 41
6 0.84 0.82 0.81 0.73 0.68 0.78 0.77 0.76 0.67 0.60 True [16, 32, 64] 3 256 41
7 0.83 0.81 0.78 0.69 0.64 0.77 0.76 0.74 0.65 0.57 True [16, 32, 64] 1 256 41
8 0.85 0.83 0.82 0.75 0.69 0.78 0.78 0.76 0.68 0.60 True [32, 64] 64 256 41
9 0.84 0.83 0.82 0.75 0.70 0.78 0.78 0.77 0.68 0.62 True [32, 64] 32 256 41
10 0.85 0.83 0.81 0.75 0.70 0.79 0.78 0.77 0.68 0.61 True [32, 64] 16 256 41
11 0.84 0.83 0.81 0.74 0.70 0.79 0.78 0.76 0.68 0.61 True [32, 64] 10 256 41
12 0.83 0.83 0.81 0.74 0.70 0.77 0.77 0.76 0.67 0.60 True [32, 64] 8 256 41
13 0.83 0.82 0.81 0.73 0.69 0.77 0.76 0.76 0.67 0.61 True [32, 64] 5 256 41
14 0.83 0.81 0.80 0.73 0.68 0.76 0.76 0.75 0.66 0.60 True [32, 64] 3 256 41
15 0.80 0.79 0.76 0.68 0.64 0.74 0.73 0.72 0.63 0.56 True [32, 64] 1 256 41
Table 8: Study on reweighting and layer positions with the middle self-attention incorporated (analogous to Table 2), evaluated by the average top scores across multiple samples (in average IoU). Self-attentions at resolutions of [8, 16, 32, 64] are used with three principal components, bilinearly upsampled to a resolution of 64.
ID Reweight Layer SAM Top 1 SAM Top 3 SAM Top 5 M2F Top-1 M2F Top-3 M2F Top-5
0 True down+mid 0.80 0.78 0.75 0.75 0.73 0.64
1 True up+mid 0.84 0.82 0.79 0.79 0.77 0.67
2 True all+mid 0.84 0.82 0.79 0.79 0.76 0.67
3 False down+mid 0.83 0.81 0.78 0.79 0.77 0.67
4 False up+mid 0.85 0.82 0.79 0.80 0.78 0.68
5 False all+mid 0.84 0.83 0.80 0.80 0.78 0.68
Figure 13: Comparison of segmentation performance be-
tween SAM models and our method on the Pascal VOC val-
idation set.
Figure 14: Visualization of the self-attentions from the
mid-block for the ”baby”-example presented in Section 3.
No distinct clusters are noticeable when mapping the first three principal components to RGB values.