Transformer-Based Fine-Tuning and Zero-Shot Learning for Image Classification

Haoyang Fei

2024

Abstract

Automated, efficient, and accurate image classification is essential across many domains. This research compares state-of-the-art image classification approaches, including a fine-tuned Vision Transformer (ViT), the base Contrastive Language-Image Pretraining (CLIP) model, and a fine-tuned CLIP model, on specialized image classification tasks. The comparison covers classification accuracy, zero-shot classification ability for unseen categories, and deployment costs. The findings indicate that while the fine-tuned ViT model excels in test accuracy, the base CLIP model demonstrates remarkable zero-shot learning capabilities, making it highly efficient for unseen categories. Fine-tuning the CLIP model, however, results in a significant loss of its zero-shot ability without a proportional increase in performance, and its fine-tuning cost far exceeds that of the ViT model. The author suggests that the fine-tuned ViT model is more suitable for tasks requiring high accuracy, while the base CLIP model is preferable for applications that value versatility and lower deployment costs. Fine-tuning the CLIP model is suitable only if the dataset is sufficiently large and deployment cost is not a concern. These insights clarify the trade-offs involved in selecting an image classification model for specialized tasks and emphasize the importance of considering both the nature of the task and the available resources.
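
For readers unfamiliar with zero-shot classification, the scoring step of a CLIP-style model can be sketched as follows. This is a minimal illustration and not the paper's code: the embeddings are made-up toy vectors standing in for the outputs of an image encoder and a text encoder, the helper names `normalize` and `zero_shot_probs` are hypothetical, and the temperature value is an assumption. An image is assigned to the class whose text prompt embedding has the highest cosine similarity to the image embedding, so a new category can be added by writing a new prompt, with no retraining.

```python
import math

def normalize(v):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def zero_shot_probs(image_emb, text_embs, temperature=0.01):
    """Softmax over temperature-scaled cosine similarities (assumed temperature)."""
    img = normalize(image_emb)
    logits = []
    for t in text_embs:
        txt = normalize(t)
        cosine = sum(a * b for a, b in zip(img, txt))
        logits.append(cosine / temperature)
    # Subtract the maximum logit for numerical stability before exponentiating.
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Toy, hand-written embeddings (hypothetical; real ones come from the encoders).
prompts = ["a photo of a cat", "a photo of a dog", "a photo of a bird"]
text_embs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
image_emb = [0.9, 0.1, 0.0]

probs = zero_shot_probs(image_emb, text_embs)
predicted = prompts[probs.index(max(probs))]
print(predicted)
```

Fine-tuning, by contrast, updates the model weights for a fixed label set, which is consistent with the loss of zero-shot ability reported above.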



Paper Citation


in Harvard Style

Fei H. (2024). Transformer-Based Fine-Tuning and Zero-Shot Learning for Image Classification. In Proceedings of the 1st International Conference on Engineering Management, Information Technology and Intelligence - Volume 1: EMITI; ISBN 978-989-758-713-9, SciTePress, pages 511-516. DOI: 10.5220/0012958000004508


in Bibtex Style

@conference{emiti24,
author={Haoyang Fei},
title={Transformer-Based Fine-Tuning and Zero-Shot Learning for Image Classification},
booktitle={Proceedings of the 1st International Conference on Engineering Management, Information Technology and Intelligence - Volume 1: EMITI},
year={2024},
pages={511-516},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0012958000004508},
isbn={978-989-758-713-9},
}


in EndNote Style

TY - CONF
JO - Proceedings of the 1st International Conference on Engineering Management, Information Technology and Intelligence - Volume 1: EMITI
TI - Transformer-Based Fine-Tuning and Zero-Shot Learning for Image Classification
SN - 978-989-758-713-9
AU - Fei H.
PY - 2024
SP - 511
EP - 516
DO - 10.5220/0012958000004508
PB - SciTePress