
scene understanding of the Stable Diffusion model compared to a model explicitly optimized for mask generation, such as SAM.
6 CONCLUSION
We have presented a novel approach for generating high-resolution semantic masks using only the self-attention maps of diffusion models. We show that our method extracts semantically meaningful masks without requiring additional training or pre-trained models. This approach can be employed to directly obtain semantic masks for self-generated images, using only textual prompts as input, or for zero-shot segmentation, where an input image is given. Our method enables the utilization of Stable Diffusion’s inherent scene understanding for semantic separation, a task it has not been explicitly trained on. Validation against SAM (Kirillov et al., 2023) shows that our approach produces high-quality semantic segmentation on par with state-of-the-art methods, while additionally offering the flexibility to adjust segmentation granularity. We further show that the generated masks benefit from Stable Diffusion’s scene understanding, providing clusters of consistent semantic meaning across occlusions and pixel gaps.
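For concreteness, the sketch below illustrates the kind of principal-component clustering of self-attention maps described above. It is a minimal illustration rather than our exact implementation: it uses scikit-learn (Pedregosa et al., 2011), a random tensor stands in for the self-attention maps collected from the diffusion U-Net, and the attention resolution and number of clusters are placeholder values.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

# Stand-in for self-attention maps gathered from the U-Net's self-attention
# layers during denoising (heads x tokens x tokens); values here are random.
heads, h, w = 8, 32, 32
tokens = h * w
attn = np.random.rand(heads, tokens, tokens)

# Describe each spatial token by its attention profile, averaged over heads.
features = attn.mean(axis=0)                                # (tokens, tokens)

# Project the per-token attention profiles onto a few principal components.
components = PCA(n_components=16).fit_transform(features)   # (tokens, 16)

# Cluster tokens in PCA space; the number of clusters sets mask granularity.
labels = KMeans(n_clusters=5, n_init=10).fit_predict(components)
mask = labels.reshape(h, w)                                  # one label per token

In this sketch, changing the number of clusters corresponds to adjusting the segmentation granularity discussed above; upsampling the low-resolution label map to the image resolution is omitted.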
As a direction for future work, we suggest using cross-attention maps to additionally obtain class labels for the generated masks. Furthermore, a post-processing step that uses the upsampling error and image features to refine the masks could further increase their accuracy.
REFERENCES
Ahn, J. and Kwak, S. (2018). Learning pixel-level semantic
affinity with image-level supervision for weakly su-
pervised semantic segmentation.
Amit, T., Shaharbany, T., Nachmani, E., and Wolf, L.
(2022). SegDiff: Image segmentation with diffusion
probabilistic models.
Betker, J., Goh, G., Jing, L., Brooks, T., Wang, J.,
Li, L., Ouyang, L., Zhuang, J., Lee, J., Guo, Y.,
et al. (2023). Improving image generation with better captions. Computer Science. https://cdn.openai.com/papers/dall-e-3.pdf, 2(3):8.
Chang, H., Zhang, H., Barber, J., Maschinot, A., Lezama,
J., Jiang, L., Yang, M.-H., Murphy, K., Freeman,
W. T., Rubinstein, M., Li, Y., and Krishnan, D. (2023).
Muse: Text-to-image generation via masked genera-
tive transformers.
Cheng, B., Misra, I., Schwing, A. G., Kirillov, A., and Gird-
har, R. (2022). Masked-attention mask transformer for
universal image segmentation.
Contributors, M. (2020). MMSegmentation: OpenMMLab semantic segmentation toolbox and benchmark. https://github.com/open-mmlab/mmsegmentation.
Everingham, M., Van Gool, L., Williams, C. K. I.,
Winn, J., and Zisserman, A. (2012). The PASCAL
Visual Object Classes Challenge 2012 (VOC2012)
Results. http://www.pascal-network.org/challenges/VOC/voc2012/workshop/index.html.
Feng, Q., Gadde, R., Liao, W., Ramon, E., and Martinez,
A. (2023). Network-free, unsupervised semantic seg-
mentation with synthetic images. In Proceedings of
the IEEE/CVF Conference on Computer Vision and
Pattern Recognition, pages 23602–23610.
Goodfellow, I. J., Pouget-Abadie, J., Mirza, M., Xu, B.,
Warde-Farley, D., Ozair, S., Courville, A., and Ben-
gio, Y. (2014). Generative adversarial networks.
Hong, S., Lee, G., Jang, W., and Kim, S. (2023). Im-
proving sample quality of diffusion models using self-
attention guidance. In Proceedings of the IEEE/CVF
International Conference on Computer Vision, pages
7462–7471.
Khani, A., Taghanaki, S. A., Sanghi, A., Amiri, A. M., and
Hamarneh, G. (2024). SLiMe: Segment like me.
Kingma, D. P. and Welling, M. (2022). Auto-encoding variational Bayes.
Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A. C., Lo, W.-Y., Dollár, P., and Girshick, R. (2023). Segment anything.
Li, J., Li, D., Xiong, C., and Hoi, S. (2022). BLIP:
Bootstrapping language-image pre-training for unified
vision-language understanding and generation.
Marcos-Manchón, P., Alcover-Couso, R., SanMiguel, J. C., and Martínez, J. M. (2024). Open-vocabulary attention maps with token optimization for semantic segmentation in diffusion models.
Nguyen, Q., Vu, T., Tran, A., and Nguyen, K. (2023).
Dataset diffusion: Diffusion-based synthetic dataset
generation for pixel-level semantic segmentation.
Ni, M., Zhang, Y., Feng, K., Li, X., Guo, Y., and Zuo,
W. (2023). Ref-Diff: Zero-shot referring image seg-
mentation with generative models. arXiv preprint
arXiv:2308.16777.
Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V.,
Thirion, B., Grisel, O., Blondel, M., Prettenhofer,
P., Weiss, R., Dubourg, V., Vanderplas, J., Passos,
A., Cournapeau, D., Brucher, M., Perrot, M., and
Duchesnay, E. (2011). Scikit-learn: Machine learning
in Python. Journal of Machine Learning Research,
12:2825–2830.
Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G.,
Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark,
J., Krueger, G., and Sutskever, I. (2021). Learning
transferable visual models from natural language su-
pervision.
Rombach, R., Blattmann, A., Lorenz, D., Esser, P., and
Ommer, B. (2022a). High-resolution image synthe-
sis with latent diffusion models. In Proceedings of the
IEEE/CVF Conference on Computer Vision and Pat-
tern Recognition (CVPR), pages 10684–10695.