Vision-Language In-Context Learning Driven Few-Shot Visual Inspection Model
Shiryu Ueno, Yoshikazu Hayashi, Shunsuke Nakatsuka, Yusei Yamada, Hiroaki Aizawa, Kunihito Kato
2025
Abstract
We propose a general visual inspection model that uses a Vision-Language Model (VLM) with few-shot images of non-defective or defective products, along with explanatory texts that serve as inspection criteria. Although existing VLMs exhibit high performance across various tasks, they are not trained on specific tasks such as visual inspection. We therefore construct a dataset of diverse images of non-defective and defective products collected from the web, paired with output text in a unified format, and use it to fine-tune the VLM. For new products, our method employs In-Context Learning: the model performs inspection given an example image of a non-defective or defective product together with the corresponding explanatory text and visual prompts. This approach eliminates the need to collect a large number of training samples and to re-train the model for each product. Experimental results show that our method achieves high performance, with a Matthews correlation coefficient (MCC) of 0.804 and an F1-score of 0.950 on MVTec AD in a one-shot setting. Our code is available at https://github.com/ia-gu/Vision-Language-In-Context-Learning-Driven-Few-Shot-Visual-Inspection-Model.
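To make the one-shot In-Context Learning setup concrete, here is a minimal sketch of how an interleaved image/text prompt might be assembled and how a fixed-format answer might be parsed into a label. Everything here is hypothetical: the message dictionary layout, the `InspectionExample`, `build_one_shot_prompt` and `parse_result` names, and the "Result: defective / non-defective" answer format are illustrative assumptions, not the authors' prompt or output format, and the visual prompts mentioned in the abstract are not modeled.

```python
import re
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class InspectionExample:
    """Hypothetical in-context example: one reference image plus its criterion text."""
    image_path: str
    is_defective: bool
    explanation: str  # explanatory text serving as the inspection criterion


def build_one_shot_prompt(example: InspectionExample, query_image_path: str) -> List[Dict[str, str]]:
    """Interleave the reference image, its explanation, and the query image.

    The dict layout is an assumed, generic chat format; adapt it to the VLM API in use.
    """
    label = "defective" if example.is_defective else "non-defective"
    return [
        {"type": "text", "text": "Reference image of this product:"},
        {"type": "image", "path": example.image_path},
        {"type": "text", "text": f"This product is {label}. {example.explanation}"},
        {"type": "text", "text": "Inspect the next image using the same criteria. "
                                 "Answer as 'Result: defective' or 'Result: non-defective', "
                                 "followed by a short reason."},
        {"type": "image", "path": query_image_path},
    ]


def parse_result(answer: str) -> Optional[bool]:
    """Return True for defective, False for non-defective, None if the format is not found."""
    # 'non-defective' is listed first so it is not mistaken for 'defective'.
    m = re.search(r"Result:\s*(non-defective|defective)", answer, re.IGNORECASE)
    if m is None:
        return None
    return m.group(1).lower() == "defective"
```

A fixed answer format like this is one way to turn free-form VLM output into a binary prediction that can then be scored with MCC and F1; unparseable answers are returned as `None` so they can be counted separately rather than silently mislabeled.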
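The two reported metrics are computed from the binary confusion matrix. MCC is (TP·TN − FP·FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN)) and lies in [−1, 1], while F1 is 2·TP / (2·TP + FP + FN). The sketch below computes both from label lists, assuming "defective" is encoded as the positive class `1`; which class the paper treats as positive is not stated here, and F1 (unlike MCC) changes if the positive class is swapped.

```python
import math
from typing import Sequence, Tuple


def confusion_counts(y_true: Sequence[int], y_pred: Sequence[int], positive: int = 1) -> Tuple[int, int, int, int]:
    """Return (TP, TN, FP, FN), treating `positive` (assumed: defective) as the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p != positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    return tp, tn, fp, fn


def mcc(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews correlation coefficient; defined as 0.0 when any marginal is empty."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom


def f1(tp: int, fp: int, fn: int) -> float:
    """F1-score for the positive class: harmonic mean of precision and recall."""
    denom = 2 * tp + fp + fn
    return 0.0 if denom == 0 else 2 * tp / denom


# Toy example (not data from the paper): 1 = defective, 0 = non-defective.
y_true = [1, 1, 1, 0, 0, 1]
y_pred = [1, 1, 0, 0, 1, 1]
tp, tn, fp, fn = confusion_counts(y_true, y_pred)  # (3, 1, 1, 1)
print(mcc(tp, tn, fp, fn))  # 0.25
print(f1(tp, fp, fn))       # 0.75
```

MCC uses all four cells of the confusion matrix, so it stays informative when one class dominates the test set, whereas a high F1 alone can be driven by the majority positive class.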
Paper Citation
in Harvard Style
Ueno S., Hayashi Y., Nakatsuka S., Yamada Y., Aizawa H. and Kato K. (2025). Vision-Language In-Context Learning Driven Few-Shot Visual Inspection Model. In Proceedings of the 20th International Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications - Volume 2: VISAPP; ISBN 978-989-758-728-3, SciTePress, pages 253-260. DOI: 10.5220/0013088100003912
in Bibtex Style
@conference{visapp25,
author={Shiryu Ueno and Yoshikazu Hayashi and Shunsuke Nakatsuka and Yusei Yamada and Hiroaki Aizawa and Kunihito Kato},
title={Vision-Language In-Context Learning Driven Few-Shot Visual Inspection Model},
booktitle={Proceedings of the 20th International Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications - Volume 2: VISAPP},
year={2025},
pages={253-260},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0013088100003912},
isbn={978-989-758-728-3},
}
in EndNote Style
TY - CONF
JO - Proceedings of the 20th International Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications - Volume 2: VISAPP
TI - Vision-Language In-Context Learning Driven Few-Shot Visual Inspection Model
SN - 978-989-758-728-3
AU - Ueno S.
AU - Hayashi Y.
AU - Nakatsuka S.
AU - Yamada Y.
AU - Aizawa H.
AU - Kato K.
PY - 2025
SP - 253
EP - 260
DO - 10.5220/0013088100003912
PB - SciTePress