dominate the field, but the Vision Transformer (ViT) has emerged as a potential replacement (Dosovitskiy, 2020).
The main objective of this research is to assess the effectiveness of the transformer architecture for flower classification, particularly the ViT model and its characteristic self-attention mechanism (Vaswani, 2017). This mechanism enables the
establishment of global relationships among image
patches, facilitating the learning of intricate feature
correlations within flower image datasets. The
research focuses on examining the impact of varying the number of attention heads and the depth of encoder layers on the
prediction accuracy curves of the ViT model with
pre-training. Furthermore, analyzing the attention
map distribution across different heads and
transformer layers is vital for understanding the model's capability to establish relationships between
image patches and extract meaningful features from
complex flower images. The analysis also highlights
that while increasing model depth can lead to
performance improvements, the gains saturate beyond a certain depth. Through simulations of the transformer's
receptive field to measure attention distribution, this
study provides insights into optimal trade-offs.
Ultimately, it suggests that while augmenting the
number of heads and depth in ViT models generally
enhances performance, the highest values may not
always be optimal, especially in intricate tasks such as flower classification; careful consideration of these trade-offs is essential for achieving the best results.
2 METHODOLOGIES
2.1 Dataset Description and Preprocessing
The dataset used in this work is called tf_flowers,
sourced from TensorFlow Datasets (TFDS),
containing 3670 images of flowers (Luo, 2022). All
original images are sourced from Flickr. The images vary in size, number of flowers, shape, and the proportion of the frame the flowers occupy. The dataset contains five categories: daisy, dandelion, roses, sunflowers, and tulips. A sample is shown in Figure 1.
Figure 1: Images from tf_flowers dataset (Photo/Picture credit: Original).
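As a concrete illustration, the following is a minimal sketch of loading the dataset through TFDS. The slicing API shown is standard TFDS; note, however, that it produces a deterministic slice, whereas the 20% validation sample described below is drawn randomly.

```python
import tensorflow_datasets as tfds

# tf_flowers ships as a single 'train' split of 3670 images; an 80/20
# split can be carved out with TFDS slicing. This slice is deterministic,
# whereas the split described below is sampled randomly.
(train_ds, val_ds), info = tfds.load(
    "tf_flowers",
    split=["train[:80%]", "train[80%:]"],
    as_supervised=True,  # yields (image, label) pairs
    with_info=True,
)

print(info.splits["train"].num_examples)  # 3670
print(info.features["label"].names)       # the five flower categories
```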
Since the dataset provides no predefined splits, 20% of the images are randomly sampled for validation in this work, and the rest are used for training. The main preprocessing task is resizing the images to a consistent size of 224x224 pixels. For the training set, to enhance data diversity and complexity, images are randomly cropped to this size and horizontally flipped with a probability of 0.5. For the validation set, to maintain consistency and comparability in evaluation, each image's shorter side is resized to 256 pixels and a 224x224 pixel region is cropped from the centre. Finally, all images are converted into tensors and normalized.
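The following is a minimal sketch of this preprocessing pipeline, assuming torchvision-style transforms; the random-crop operator and the normalization statistics (ImageNet means and standard deviations, the usual choice with ViT pre-trained weights) are assumptions, as the paper does not state them.

```python
from torchvision import transforms

# ImageNet statistics; the exact constants used in the paper are an assumption.
normalize = transforms.Normalize(mean=[0.485, 0.456, 0.406],
                                 std=[0.229, 0.224, 0.225])

# Training: random crop to 224x224 plus horizontal flip (p=0.5)
# to increase data diversity, as described above.
train_tf = transforms.Compose([
    transforms.RandomResizedCrop(224),
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.ToTensor(),
    normalize,
])

# Validation: resize the shorter side to 256, then take a fixed
# 224x224 centre crop for consistent, comparable evaluation.
val_tf = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    normalize,
])
```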
2.2 Proposed Approach
This study primarily focuses on implementing the
classic ViT model for flower image classification
tasks, with specific emphasis on two key hyperparameters: the encoder depth and the number of attention heads. The architecture of
the ViT model comprises three main components: the
Embedding layer, the transformer encoder, and the
MLP head. The investigation employs various
parameter analysis methods, including accuracy
curves and visualization of attention maps (both self-attention and class-token attention), along with mean attention distance
dot diagrams. These methodologies are employed to
examine how variations in depth and head influence
the model's performance, thus offering valuable
insights into the effectiveness of the ViT model for
flower classification tasks. The pipeline is illustrated
in Figure 2, providing a visual representation of the
process.
Figure 2: The pipeline of the model and analysis method (Photo/Picture credit: Original).
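For concreteness, the sketch below shows one way the two hyperparameters under study could be varied when instantiating a ViT, assuming the timm library; the specific (depth, heads) pairs are illustrative, not the exact grid examined in this study.

```python
from timm.models.vision_transformer import VisionTransformer

# Illustrative (depth, heads) configurations. Pre-trained weights
# (e.g. via timm.create_model('vit_base_patch16_224', pretrained=True))
# would be loaded for the pre-training setting described above.
for depth, num_heads in [(6, 6), (12, 12)]:
    model = VisionTransformer(
        img_size=224,
        patch_size=16,
        embed_dim=768,
        depth=depth,          # number of transformer encoder layers
        num_heads=num_heads,  # attention heads per encoder layer
        num_classes=5,        # daisy, dandelion, roses, sunflowers, tulips
    )
    n_params = sum(p.numel() for p in model.parameters())
    print(f"depth={depth}, heads={num_heads}: {n_params / 1e6:.1f}M params")
```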
2.2.1 Embedding Layer
The model converts the image, represented as a three-dimensional matrix [H, W, C], into patches using a simple convolutional process. With a kernel size of 16x16, a stride of 16, and 768 filters, an input image of shape [224, 224, 3] is transformed into a [14, 14, 768] feature map, which is then flattened into a sequence of tokens with a shape of [196, 768].
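A minimal PyTorch sketch of this patch-embedding step follows (illustrative; PyTorch uses channels-first layout, so the [14, 14, 768] feature map appears as [768, 14, 14] before flattening):

```python
import torch
import torch.nn as nn

# A 16x16 convolution with stride 16 and 768 filters splits the image
# into non-overlapping patches and projects each patch to a 768-d token.
patch_embed = nn.Conv2d(in_channels=3, out_channels=768,
                        kernel_size=16, stride=16)

x = torch.randn(1, 3, 224, 224)           # one [224, 224, 3] image (channels-first)
feat = patch_embed(x)                      # -> [1, 768, 14, 14]
tokens = feat.flatten(2).transpose(1, 2)   # -> [1, 196, 768]: 196 patch tokens
print(feat.shape, tokens.shape)
```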
Furthermore, before these tokens proceed to the subsequent layers, a position embedding is added to preserve the sequential information among the patches. Besides, a [class]