dedicated to processing data with a grid structure. In 1998,
Yann LeCun et al. first applied convolutional neural
networks to image classification, achieving great
success on the handwritten digit recognition task.
Simonyan and Zisserman proposed the Visual
Geometry Group (VGG) network structure in 2014,
which remains one of the most popular convolutional
neural networks; it is favoured by researchers for its
simple structure and broad applicability. Christian
Szegedy et al. proposed GoogLeNet, which won the
ImageNet competition in 2014. Scientists have since
proposed a series of landmark models, such as VGG,
GoogLeNet, ResNet, and EfficientNet. Before 2020,
CNN technology was used in the vast majority of
image classification models, with relatively fixed
network mechanisms built from basic modules such
as convolution kernels, residual connections, pooling
units, and linear layers. Since 2017, more and more
high-performing models have appeared, but CNNs
still hold irreplaceable advantages in image
classification.
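As a minimal illustration of two of the basic modules mentioned above, the following NumPy sketch implements a "valid" convolution (strictly, the cross-correlation used in CNNs) and a non-overlapping 2x2 max pooling. The helper names and the example kernel are illustrative, not drawn from any particular model.

```python
import numpy as np

def conv2d(img, kernel):
    """'Valid' 2-D convolution (really cross-correlation, as used in CNNs)."""
    kh, kw = kernel.shape
    h, w = img.shape
    out = np.empty((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(img[i:i + kh, j:j + kw] * kernel)
    return out

def max_pool(fm, size=2):
    """Non-overlapping max pooling over size x size windows."""
    h, w = fm.shape
    fm = fm[:h - h % size, :w - w % size]  # trim to a multiple of the window
    return fm.reshape(h // size, size, w // size, size).max(axis=(1, 3))

img = np.arange(36, dtype=float).reshape(6, 6)
edge = np.array([[1.0, -1.0]])   # horizontal-difference kernel
fm = conv2d(img, edge)           # 6x5 feature map
pooled = max_pool(fm)            # 3x2 after 2x2 pooling
print(fm.shape, pooled.shape)    # → (6, 5) (3, 2)
```

In a real CNN these loops are replaced by heavily optimized library kernels, and the filter weights are learned by backpropagation rather than fixed by hand.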
The primary aim of this study is to offer a
comprehensive overview of the evolution and
techniques employed in image classification.
Initially, it delineates the developmental trajectory of
image classification, elucidating fundamental
concepts and principles. Subsequently, it delves into
an analysis of pivotal models in image classification,
dissecting their underlying mechanisms.
Furthermore, the study evaluates the performance of
these models, spotlighting key technological
advancements. Additionally, it conducts a critical
examination of the strengths, weaknesses, and future
potentials of these models. Finally, the paper
consolidates and synthesizes the entire image
classification system, providing insights into future
directions and prospects.
2 METHODOLOGIES
2.1 Dataset Description and Preprocessing
Currently, prominent datasets utilized for image
classification include MNIST, CIFAR-10, ImageNet,
COCO, Open Image, and YouTube-8M. MNIST,
introduced by LeCun et al. (LeCun, 2012) in 1998,
serves as a fundamental learning framework with
70,000 instances across ten digit classes: 60,000
training images and 10,000 testing images. CIFAR-
10, crafted by Alex Krizhevsky, features 10 classes
(Krizhevsky, 2009). ImageNet, a project led by
Professor Fei-Fei Li's team at Stanford University,
boasts over 14 million images (Deng, 2009), while
the Microsoft COCO dataset comprises more than
300,000 images and over 2 million instances, useful
for classification and recognition tasks (Lin, 2014).
Open Images, a vast dataset provided by Google,
offers about 9 million images annotated with labels
and bounding boxes, with its fourth version being the
largest to date. YouTube-8M, a massive video
dataset, includes over 7 million labelled videos
spanning 4,716 classes and 8 billion YouTube links,
featuring training, validation, and test sets. These
datasets serve diverse purposes, from training
classifiers to object detection and relationship
detection.
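The MNIST files described above are distributed in the simple IDX binary format: a big-endian header (magic number 2051 for image files, then the image count, row count, and column count as 32-bit integers) followed by raw uint8 pixels. As an illustrative sketch using only the Python standard library (run here on a synthetic buffer rather than the real files), the format can be parsed as follows:

```python
import struct

def parse_idx_images(buf):
    """Split an MNIST IDX image buffer into per-image byte strings."""
    magic, n, rows, cols = struct.unpack_from(">IIII", buf, 0)
    assert magic == 2051, "not an IDX image file"
    pixels = buf[16:]  # pixel data starts after the 16-byte header
    return [pixels[i * rows * cols:(i + 1) * rows * cols] for i in range(n)]

# Synthetic two-image buffer standing in for train-images-idx3-ubyte
header = struct.pack(">IIII", 2051, 2, 28, 28)
imgs = parse_idx_images(header + bytes(2 * 28 * 28))
print(len(imgs), len(imgs[0]))  # → 2 784
```

In practice, libraries such as torchvision download and decode these files automatically, but the format is simple enough to handle directly when no framework is available.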
2.2 Proposed Approach
This study first introduces the development of the
field of image classification, and then introduces
several datasets commonly used for image
classification: MNIST, CIFAR-10, ImageNet, COCO,
Open Images, and YouTube-8M, analysing the
structure and development of each dataset. It then
introduces several models commonly used in the
field of image classification: CNN, RNN, and VGG,
and analyses their performance in depth. The
pipeline of this study is shown in Figure 1.
Figure 1: The pipeline of this study (Photo/Picture
credit: Original).
2.2.1 CNN
Convolutional neural networks have driven a huge
leap forward in image classification through
supervised learning, and the CNN design is rooted in
the structure of the biological visual system. CNN is
short for convolutional neural network. It consists of
a stack of alternating convolutional and pooling
layers, and can be trained by backpropagating the
error signal. In the past few years, the neural network
(LeCun, 1990) has proven effective for simple
recognition tasks. The convolutional layer is the most
important component of a CNN: on each
convolutional layer, the input volume is convolved
with multiple learnable filters to generate multiple
feature maps. Currently,