Yolov9, as the latest version of the You Only Look
Once (YOLO) series, not only achieves higher
detection accuracy, but also a significant
improvement in inference speed, reaching a new
state of the art (SOTA) on the MS COCO dataset
and becoming one of the research hotspots in the field
of object detection. It addresses challenges such as
information loss in deep networks through the
introduction of Programmable Gradient Information
(PGI) and the Generalized Efficient Layer
Aggregation Network (GELAN), which improve
information retention and gradient flow, thus
preventing false associations between targets and
inputs (Wang, 2024). In terms of detection
performance, the object detection method based on
GELAN and PGI outperforms previous "train from
scratch" methods, and also surpasses RT-DETR (Lv,
2023), which uses large datasets for pre-training, and
YOLO-MS (Chen, 2023), which is based on a depth-
wise convolutional design, in terms of parameter
utilization (Wang, 2024). The applicability of PGI
spans from lightweight to large-scale models,
enabling models to be trained from scratch with very
good performance, without the need for large pre-
training datasets.
This study will focus on evaluating the
performance of the Yolov9 model through a series of
targeted investigations. Firstly, the model's sensitivity
to batch size, a critical factor in its performance, is
examined, and the interplay between batch size and
learning rate is explored to optimize the training
process. Secondly, the effectiveness of different
optimizers in model training is compared,
determining the best-fit optimizer and corresponding
learning rate for Yolov9. Additionally, the
performance of different scale variants of Yolov9,
namely Yolov9-c and Yolov9-e, is assessed by
comparing their training performance under varying
conditions. Furthermore, the impact of different
optimizers on the performance of these models is
investigated. This study aims to present a
straightforward yet efficient method for testing and
optimizing the training and performance of the
Yolov9 model on the COCO128 dataset, offering
valuable insights and solutions for object detection
tasks and serving as a practical reference for
research teams with limited computational resources.
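The experimental design above amounts to sweeping a grid of batch sizes, learning rates, optimizers, and model variants. The following is a minimal sketch of how such a grid could be enumerated; the specific candidate values and dictionary field names are illustrative assumptions, not the exact settings used in this study.

```python
# Hypothetical enumeration of the experiment grid described above.
# Candidate values are assumed for illustration only.
from itertools import product

batch_sizes = [8, 16, 32]               # candidate batch sizes (assumed)
learning_rates = [0.01, 0.001]          # candidate initial learning rates (assumed)
optimizers = ["SGD", "Adam", "AdamW"]   # optimizers to compare (assumed)
variants = ["yolov9-c", "yolov9-e"]     # model scales evaluated in the study

def experiment_grid():
    """Enumerate every (variant, optimizer, lr, batch) combination to run."""
    return [
        {"model": m, "optimizer": opt, "lr": lr, "batch": bs}
        for m, opt, lr, bs in product(variants, optimizers,
                                      learning_rates, batch_sizes)
    ]

runs = experiment_grid()
print(len(runs))  # 2 * 3 * 2 * 3 = 36 configurations
```

Each dictionary in `runs` would then drive one training run, so that the batch-size/learning-rate interplay and the optimizer comparison are covered by the same loop.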
2 METHODOLOGIES
2.1 Dataset Description and
Preprocessing
In this study, the COCO128 dataset was chosen as the
main data source. COCO128 consists of the first
128 images of the Microsoft Common Objects in
Context (MS COCO) dataset. MS COCO is a large-
scale multi-category object detection dataset
containing more than 80 categories and over one
million labeled object instances, covering a variety
of real scenes, such as indoor and outdoor
environments, and is therefore highly representative
and challenging. Each image in the COCO128 dataset
is labeled with the bounding box of one or more
objects and their category information, and also
contains segmentation masks and key points for the
objects. The dataset favors images in which the
target co-occurs with its scene, i.e., non-iconic
images, which reflect visual semantics and are
more in line with the requirements of image
understanding tasks. This detailed labeling
information makes it one of the ideal datasets for
tasks such as object detection, instance segmentation,
and pose estimation.
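In the YOLO-format distribution of COCO128, each bounding-box annotation is a text line of the form `<class_id> <x_center> <y_center> <width> <height>`, with coordinates normalized to [0, 1]. A minimal sketch of reading one such line (the helper names are our own, not part of any library):

```python
# Sketch of parsing one YOLO-format label line as used by COCO128.
# Fields: class id, then normalized center-x, center-y, width, height.
def parse_yolo_label(line: str) -> dict:
    parts = line.split()
    return {
        "class_id": int(parts[0]),
        "x_center": float(parts[1]),
        "y_center": float(parts[2]),
        "width": float(parts[3]),
        "height": float(parts[4]),
    }

def to_pixel_box(label: dict, img_w: int, img_h: int):
    """Convert a normalized label to pixel corners (x1, y1, x2, y2)."""
    cx, cy = label["x_center"] * img_w, label["y_center"] * img_h
    w, h = label["width"] * img_w, label["height"] * img_h
    return (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2)

label = parse_yolo_label("0 0.5 0.5 0.2 0.4")
print(to_pixel_box(label, 640, 640))  # (256.0, 192.0, 384.0, 448.0)
```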
2.2 Proposed Approach
This study focuses on optimizing the Yolov9 model
and investigating its performance under each
experimental condition. In general, the experiment
first reproduces the model and imports the dataset;
the training configuration and environment are then
selected, the experiments are run to test the model,
and finally figures and tables are drawn from the
results and analyzed. The pipeline is shown in Figure
1.
Figure 1: General flow of the experiment
(Picture credit: Original).
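The flow in Figure 1 can be outlined in code as below; the step functions are simplified placeholders with assumed names, standing in for the real model-reproduction, training, and evaluation code.

```python
# Illustrative outline of the experimental pipeline in Figure 1.
# All function names and return values are placeholders (assumptions).
def reproduce_model(name):
    return {"name": name}                    # step 1: reproduce the model

def load_dataset(name):
    return {"dataset": name, "images": 128}  # step 2: import the dataset

def train(model, data, cfg):
    return {**model, "config": cfg}          # step 3: train under this config

def evaluate(model, data):
    return {"model": model["name"],
            "images_tested": data["images"]} # step 4: test the model

def run_pipeline(configs):
    """Run every experimental configuration and collect its results."""
    data = load_dataset("coco128")
    results = []
    for cfg in configs:
        model = reproduce_model(cfg["model"])
        trained = train(model, data, cfg)
        results.append(evaluate(trained, data))
    return results                           # step 5: plot/tabulate afterwards

out = run_pipeline([{"model": "yolov9-c", "optimizer": "SGD",
                     "lr": 0.01, "batch": 16}])
print(out)
```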