Figure 2: The pipeline of the research (Photo/Picture credit: Original).
generalization performance and reduce the training cost (Loshchilov, 2017). Figure 2 illustrates the comprehensive pipeline of the research.
2.2.1 Transformer
Both ViT and CLIP use the transformer architecture as their encoder. The transformer is a deep learning architecture that adopts a self-attention mechanism to process sequential data, such as text and images (Vaswani, 2017). A transformer generally consists of two parts, an encoder and a decoder, each composed of stacked layers of self-attention and feedforward neural networks. With the self-attention mechanism, transformer-based models are able to focus on the most relevant parts of the source sequence when making predictions. This enables the model to capture long-range dependencies and improves performance on sequential tasks.
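To make this concrete, the following is a minimal sketch of single-head scaled dot-product self-attention in PyTorch; the tensor shapes and the random projection matrices are illustrative and are not part of the models used in this study.

```python
import torch
import torch.nn.functional as F

def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention over a sequence.

    x:   (seq_len, d_model) input embeddings (text tokens or image patches)
    w_q, w_k, w_v: (d_model, d_k) learnable projection matrices
    """
    q, k, v = x @ w_q, x @ w_k, x @ w_v      # project into queries, keys, values
    scores = q @ k.T / (k.shape[-1] ** 0.5)  # pairwise similarity, scaled by sqrt(d_k)
    weights = F.softmax(scores, dim=-1)      # each position attends to every other one
    return weights @ v                       # weighted sum of values

# Example: 16 "tokens" with model dimension 64 projected to d_k = 32 (illustrative sizes)
x = torch.randn(16, 64)
w_q, w_k, w_v = (torch.randn(64, 32) for _ in range(3))
out = self_attention(x, w_q, w_k, w_v)       # shape: (16, 32)
```

Because every position attends to every other position in a single layer, the model can relate distant elements of the sequence directly, which is the source of the long-range dependency modeling noted above.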
2.2.2 Vision Transformer (ViT)
In this study, the author adopts the ViT model for image classification tasks. ViT was inspired by the success of the Transformer architecture in natural language processing. Unlike traditional approaches that rely on CNNs for computer vision tasks, ViT applies the Transformer architecture directly to sequences of image patches. Despite this departure from CNN-based methods, ViT demonstrates promising performance across a spectrum of image classification benchmarks, including ImageNet, CIFAR-100, and VTAB (Dosovitskiy, 2020).
The ViT model operates by first reshaping an image into a sequence of flattened 2D patches, which are then processed by the Transformer architecture. Each patch undergoes a trainable linear projection to produce a fixed-dimensional embedding. Similar to the [CLS] token in BERT, ViT prepends a learnable embedding to the patch sequence, and its state at the output of the encoder serves as the image representation. The encoder consists of alternating layers of multi-head self-attention and MLP blocks, with layer normalization and residual connections at each layer. Position embeddings are added to the patch embeddings to retain positional information. Notably, ViT exhibits less image-specific inductive bias than CNNs: only the MLP layers are local and translationally equivariant, whereas the self-attention layers are global in nature. Furthermore, ViT supports a hybrid architecture in which the input sequence is generated from CNN feature maps, offering flexibility in model design.
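To illustrate the patch-embedding step described above, below is a minimal PyTorch sketch using illustrative ViT-B/16 dimensions (16 by 16 patches, 768-dimensional embeddings); it mirrors the description rather than the library implementation used in this study.

```python
import torch
import torch.nn as nn

# Illustrative ViT-B/16 dimensions: 224x224 input, 16x16 patches, 768-d embeddings.
img_size, patch_size, d_model = 224, 16, 768
num_patches = (img_size // patch_size) ** 2           # 14 * 14 = 196 patches

image = torch.randn(1, 3, img_size, img_size)         # one RGB image

# Reshape the image into a sequence of flattened 2D patches: (1, 196, 3*16*16).
patches = image.unfold(2, patch_size, patch_size).unfold(3, patch_size, patch_size)
patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(1, num_patches, -1)

# Trainable linear projection to fixed-dimensional patch embeddings.
projection = nn.Linear(patch_size * patch_size * 3, d_model)
patch_embeddings = projection(patches)                # (1, 196, 768)

# Prepend the learnable class embedding and add position embeddings.
cls_token = nn.Parameter(torch.zeros(1, 1, d_model))
pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, d_model))
tokens = torch.cat([cls_token.expand(1, -1, -1), patch_embeddings], dim=1) + pos_embed
# `tokens` (shape (1, 197, 768)) is the sequence fed to the Transformer encoder.
```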
In this study, the author fine-tuned the vit_base_patch16_224 model on 91 categories of images from the Food-101 training dataset and evaluated its performance on the validation dataset to compare its effectiveness with the CLIP model. Detailed descriptions of the experimental setup, including training procedures and hyperparameter settings, are provided in subsequent sections.
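As a rough sketch of how such fine-tuning might be set up, the snippet below assumes the timm library, which provides a pretrained model under the name vit_base_patch16_224; the optimizer settings and the train_step helper are illustrative only, and the actual hyperparameters are reported in later sections.

```python
import timm
import torch

# Load a pretrained ViT-B/16 (224x224 input, 16x16 patches) and replace its
# classification head with 91 output classes for Food-101 fine-tuning.
model = timm.create_model("vit_base_patch16_224", pretrained=True, num_classes=91)

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)  # illustrative settings only
criterion = torch.nn.CrossEntropyLoss()

def train_step(images, labels):
    """One fine-tuning step: images are (B, 3, 224, 224), labels are (B,) class indices."""
    model.train()
    optimizer.zero_grad()
    logits = model(images)            # (B, 91) class scores from the class-token output
    loss = criterion(logits, labels)
    loss.backward()
    optimizer.step()
    return loss.item()
```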
2.2.3 Contrastive Language-Image
Pretraining (CLIP)
CLIP is a multimodal model that learns to associate images and text through a contrastive objective. CLIP is trained on a large-scale dataset of images paired with their associated text, such as image captions, to learn a joint embedding space in which semantically similar image-text pairs lie close to each other. This joint embedding space enables CLIP to perform a diverse array of vision-language tasks, including image classification, image segmentation, and object detection (Radford, 2021). CLIP can be viewed as a composition of two separate components: a vision encoder and a text encoder. Images are first resized to a fixed size (224 by 224 for CLIP ViT-B/32) and normalized to standard pixel values. The image encoder takes the normalized images as input and passes them through a vision backbone, such as a ResNet or a ViT, to extract image features. The extracted image features are then projected to a fixed-dimensional embedding using a learnable linear projection.
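The following sketch shows how the two encoders produce embeddings that can be compared with cosine similarity, assuming OpenAI's clip package, a hypothetical example image path, and a few illustrative Food-101 class names; it is not the exact fine-tuning code used in this research.

```python
import clip
import torch
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)  # vision encoder + text encoder

# Illustrative class names; the study uses 91 Food-101 categories.
class_names = ["apple pie", "baby back ribs", "caesar salad"]
text_tokens = clip.tokenize([f"a photo of {c}" for c in class_names]).to(device)

# preprocess resizes to 224x224 and normalizes pixel values; the path is hypothetical.
image = preprocess(Image.open("example.jpg")).unsqueeze(0).to(device)

with torch.no_grad():
    image_emb = model.encode_image(image)      # (1, 512) image embedding
    text_emb = model.encode_text(text_tokens)  # (3, 512) text embeddings

# Cosine similarity between the image and every candidate label.
image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)
text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
similarity = image_emb @ text_emb.T            # (1, 3) semantic similarity scores
predicted = class_names[similarity.argmax().item()]
```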
In this research, the author fine-tuned the CLIP model on 91 categories of images from the Food-101 training dataset, which consists of images and their corresponding text labels. The text labels are derived from the class names of the images and are then tokenized and processed by the text encoder to acquire text embeddings. These embeddings, together with the image embeddings produced by the image encoder, are then compared using cosine similarity to determine the semantic similarity between the image and the labels from the dataset. The