
grouped prompts captured broader semantic groupings, offering flexibility for various application needs. We showed that task-specific and grouped prompts significantly enhance clustering performance compared to image-only baselines, highlighting the critical role of prompt design in structuring embedding spaces. Furthermore, our method effectively adapts to zero-shot and constrained classification tasks, emphasizing the versatility of multimodal models in unsupervised workflows.
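The zero-shot setting mentioned above can be illustrated with a minimal sketch. It assumes CLIP-style embeddings, where images and text prompts live in a shared space and classification reduces to cosine similarity against the encoded prompts; the function name and the synthetic 4-d "embeddings" are illustrative, not part of the original method.

```python
import numpy as np

def zero_shot_classify(image_embs, prompt_embs):
    """Assign each image to the prompt with the highest cosine similarity.

    image_embs: (n_images, d) array; prompt_embs: (n_classes, d) array.
    """
    # L2-normalise so the dot product equals cosine similarity,
    # as is standard with CLIP-style embeddings.
    img = image_embs / np.linalg.norm(image_embs, axis=1, keepdims=True)
    txt = prompt_embs / np.linalg.norm(prompt_embs, axis=1, keepdims=True)
    sims = img @ txt.T  # (n_images, n_classes) similarity matrix
    return sims.argmax(axis=1)

# Toy example with synthetic 4-d vectors standing in for real encoder
# outputs: two well-separated "classes" plus small Gaussian noise.
rng = np.random.default_rng(0)
prompts = np.eye(2, 4)  # stand-ins for encoded text prompts
images = np.vstack([prompts[0] + 0.05 * rng.normal(size=4),
                    prompts[1] + 0.05 * rng.normal(size=4)])
print(zero_shot_classify(images, prompts))  # → [0 1]
```

In a real pipeline the same routine would receive encoder outputs (e.g. from a CLIP image and text tower) rather than synthetic vectors; the constrained-classification variant simply restricts `prompt_embs` to the allowed label set.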
While the primary focus was on evaluating the influence of prompts on clustering and classification, our findings also underscore the potential for future work in prompt optimization, dynamic embedding structures, and applications to more complex datasets. This study contributes to a growing understanding of how natural language supervision can guide multimodal models, bridging the gap between zero-shot generalization and task-specific optimization.
ACKNOWLEDGEMENTS
This work was partially funded by FAPESP project 2022/15304-4 and MCTI (Law 8.248, PPI-Softex - TIC 13 - 01245.010222/2022-44).
Improving Image Classification Tasks Using Fused Embeddings and Multimodal Models