Authors:
Artur Oliveira; Mateus Espadoto; Roberto Hirata Jr. and Roberto M. Cesar Jr.
Affiliation:
Institute of Mathematics and Statistics, University of São Paulo, Brazil
Keyword(s):
Prompt Engineering, Guided Embeddings, Multimodal Learning, Clustering, t-SNE Visualization, Zero-Shot Learning.
Abstract:
In this paper, we address the challenge of flexible and scalable image classification by leveraging embeddings from CLIP, a pre-trained multimodal model. Our strategy uses tailored textual prompts (e.g., “This is digit 9”, “This is even/odd”) to generate and fuse embeddings from both images and prompts, followed by clustering for classification. We present a prompt-guided embedding strategy that dynamically aligns multimodal representations with task-specific or grouped semantics, enhancing the utility of models such as CLIP in clustering and constrained classification workflows. We further evaluate the resulting embedding structures through clustering, classification, and t-SNE visualization, demonstrating how prompts affect the separability and alignment of the embedding space. Our findings underscore CLIP’s potential for flexible and scalable image classification, supporting zero-shot scenarios without retraining.
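The pipeline described in the abstract, generating image and prompt embeddings, fusing them, and clustering the result, can be sketched as below. This is a minimal illustration only: the synthetic 512-dimensional vectors stand in for CLIP encoder outputs (no model is downloaded), and the convex-combination fusion rule and plain Lloyd's k-means are assumptions for demonstration, not the paper's exact operators.

```python
import numpy as np

def l2_normalize(x):
    """Normalize rows to unit length, as CLIP does with its embeddings."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def fuse(image_emb, prompt_emb, alpha=0.5):
    """Fuse image and prompt embeddings via a convex combination of the
    normalized vectors (an assumed fusion rule for illustration)."""
    fused = alpha * l2_normalize(image_emb) + (1 - alpha) * l2_normalize(prompt_emb)
    return l2_normalize(fused)

def kmeans(x, k, iters=20, seed=0):
    """Minimal Lloyd's k-means over the fused embeddings."""
    rng = np.random.default_rng(seed)
    centroids = x[rng.choice(len(x), size=k, replace=False)]
    for _ in range(iters):
        dists = np.linalg.norm(x[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = x[labels == j].mean(axis=0)
    return labels

# Synthetic stand-ins for CLIP outputs (512-d, like ViT-B/32):
rng = np.random.default_rng(1)
centers = rng.normal(size=(2, 512))
image_emb = np.vstack([c + 0.05 * rng.normal(size=(50, 512)) for c in centers])
# One prompt embedding per image, e.g. "This is even" / "This is odd":
prompt_emb = np.repeat(centers, 50, axis=0)

fused = fuse(image_emb, prompt_emb)
labels = kmeans(fused, k=2)
print(fused.shape, sorted(set(labels)))
```

In practice the two embedding matrices would come from CLIP's image and text encoders; here the point is only that prompt-guided fusion reshapes the space that the clustering step then partitions.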