through dual convolution operations that optimize
gradient flow and reduce computational complexity.
The structural design of the C2f module maps input features to an expanded feature space through an initial convolution layer, then segments and recombines the resulting features, while a sequence of bottleneck blocks deepens the processing of features at various scales.
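A minimal PyTorch sketch of this design follows, assuming the Ultralytics-style C2f structure; the "dual convolutions" are the cv1 expansion and cv2 fusion layers, and all module and parameter names are illustrative rather than taken from the study's code:

```python
import torch
import torch.nn as nn

class Conv(nn.Module):
    """Convolution + BatchNorm + SiLU activation."""
    def __init__(self, c1, c2, k=1, s=1):
        super().__init__()
        self.conv = nn.Conv2d(c1, c2, k, s, k // 2, bias=False)
        self.bn = nn.BatchNorm2d(c2)
        self.act = nn.SiLU()

    def forward(self, x):
        return self.act(self.bn(self.conv(x)))

class Bottleneck(nn.Module):
    """Two 3x3 convolutions with an optional residual connection."""
    def __init__(self, c, shortcut=True):
        super().__init__()
        self.cv1 = Conv(c, c, 3)
        self.cv2 = Conv(c, c, 3)
        self.add = shortcut

    def forward(self, x):
        y = self.cv2(self.cv1(x))
        return x + y if self.add else y

class C2f(nn.Module):
    """Expand with one conv, split, run n bottlenecks sequentially,
    then concatenate every intermediate branch and fuse with a second conv."""
    def __init__(self, c1, c2, n=1, shortcut=True):
        super().__init__()
        self.c = c2 // 2                          # hidden channels per branch
        self.cv1 = Conv(c1, 2 * self.c, 1)        # expansion, then split in two
        self.cv2 = Conv((2 + n) * self.c, c2, 1)  # fusion of all branches
        self.m = nn.ModuleList(Bottleneck(self.c, shortcut) for _ in range(n))

    def forward(self, x):
        y = list(self.cv1(x).chunk(2, dim=1))  # feature segmentation
        for m in self.m:
            y.append(m(y[-1]))                  # sequential bottleneck deepening
        return self.cv2(torch.cat(y, dim=1))    # recombination
```

Concatenating every intermediate branch gives each bottleneck a short path to the output, which is the source of the gradient-flow benefit noted above.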
In the YOLOv5 implementation of this study, the C2f
module is integrated into key positions within the
backbone and head, replacing existing convolutional
layers and adding new feature fusion points. Within
the backbone, starting with preliminary feature
extraction, C2f is first applied at the P2/4 level three
times, followed by six, nine, and three instances of
C2f processing at the P3/8, P4/16, and P5/32 levels,
respectively. In the head, C2f is combined with upsampling and Concat operations to optimize multi-scale feature fusion and refinement. Three C2f blocks are first introduced in the processing of the small P3/8 feature maps, enhancing the model's ability to recognize small targets; to further improve feature fusion, the C2f module is then also applied to the P4/16 and P5/32 feature maps, optimizing the performance of the detection head through precise feature reorganization.
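To make the backbone placement concrete, the following sketch reuses the Conv and C2f classes above; the channel widths are assumptions, and the stated counts are interpreted as the number of bottlenecks per C2f stage:

```python
# Illustrative backbone layout (channel widths are assumptions, not values
# from the study): each stride-2 Conv halves resolution, and the C2f depths
# follow the 3/6/9/3 pattern described in the text.
backbone = nn.Sequential(
    Conv(3, 64, 3, 2),        # P1/2
    Conv(64, 128, 3, 2),      # P2/4
    C2f(128, 128, n=3),       # three bottlenecks at P2/4
    Conv(128, 256, 3, 2),     # P3/8
    C2f(256, 256, n=6),       # six at P3/8
    Conv(256, 512, 3, 2),     # P4/16
    C2f(512, 512, n=9),       # nine at P4/16
    Conv(512, 1024, 3, 2),    # P5/32
    C2f(1024, 1024, n=3),     # three at P5/32
)
```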
2.2.2 Squeeze-and-Excitation (SE)
The SE module is incorporated into the model to
refine its feature recalibration capabilities, focusing
on enhancing the representational power of
convolutional layers. This module operates by
selectively highlighting valuable features while
diminishing the impact of less relevant ones, utilizing
an adaptive process to recalibrate channel-wise
feature responses. Initially, the SE module employs adaptive average pooling to condense global spatial information into a channel descriptor. This is followed by a compression phase, in which a convolution operation reduces the channel dimensionality by a specified ratio, learning a compact channel-wise descriptor. An excitation phase then uses another convolution layer to restore the channel dimensionality to its original size, allowing the model to learn a specific activation for each channel. This scaled output is multiplied with the original feature map to recalibrate the features adaptively, focusing the model's attention on the more informative channels. In
this YOLOv5 implementation, the SE module is
strategically placed at the end of the backbone
section, following the SPPF layer, to process the
highest-level feature map with a channel count of
1024, right before transitioning to the head of the
model. This placement ensures that the SE module
maximally exploits the hierarchical feature
representations learned by the network, enhancing the
overall detection performance by providing a more
focused feature set for the subsequent detection
layers.
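A minimal sketch of the SE block as described, using 1x1 convolutions for the compression and excitation phases; the reduction ratio of 16 is an assumed default rather than a value stated here:

```python
import torch.nn as nn

class SEBlock(nn.Module):
    """Squeeze-and-Excitation: condense spatial information into a channel
    descriptor, learn per-channel weights, and rescale the input with them."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.squeeze = nn.AdaptiveAvgPool2d(1)              # global average pool
        self.excite = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, 1),  # compression phase
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1),  # excitation phase
            nn.Sigmoid(),                                   # per-channel activation
        )

    def forward(self, x):
        w = self.excite(self.squeeze(x))  # channel weights, shape (B, C, 1, 1)
        return x * w                      # channel-wise recalibration
```

Under the placement described above, this would be instantiated as SEBlock(1024) and applied to the output of the SPPF layer.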
2.2.3 Loss Function
Choosing the correct loss function is crucial for the
training of deep learning models. For the YOLOv5
object detection task, the loss function is meticulously
designed to handle complex multi-task learning
problems, effectively integrating three key
components: bounding box regression loss,
objectness loss, and classification loss. Each part is
optimized for different requirements of the detection
process.
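For reference, these components are conventionally combined as a weighted sum; the weighting coefficients $\lambda$ are standard YOLOv5 hyperparameters and are an assumption here, not values reported by this study:

$L = \lambda_{box} L_{box} + \lambda_{obj} L_{obj} + \lambda_{cls} L_{cls}$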
$L_{box} = (pred_{box} - true_{box})^2$ (1)

The above formula represents the Box Loss, $L_{box}$, which measures the discrepancy between the model's predicted bounding boxes and the actual ground-truth boxes, ensuring accurate localization and sizing of detected objects. $pred_{box}$ represents the model's predicted bounding-box coordinates, typically including the center position along with the box's width and height. $true_{box}$ refers to the actual ground-truth bounding-box coordinates, including the center position and the box's width and height, as obtained from the labeled data.
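A direct translation of Eq. (1) into PyTorch, assuming (N, 4) tensors of (cx, cy, w, h) coordinates and a mean reduction over matched predictions:

```python
import torch

def box_loss(pred_box: torch.Tensor, true_box: torch.Tensor) -> torch.Tensor:
    """Squared-error box loss, Eq. (1): pred_box and true_box are (N, 4)
    tensors of (cx, cy, w, h); returns the mean squared difference."""
    return ((pred_box - true_box) ** 2).mean()
```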
$L_{obj} = -\big(true_{obj} \log(pred_{obj}) + (1 - true_{obj}) \log(1 - pred_{obj})\big)$ (2)
The above formula denotes the Objectness Loss, $L_{obj}$, a binary cross-entropy that evaluates the accuracy of the model's confidence in identifying an object's presence within a designated bounding box, aiming to differentiate between background and foreground. $true_{obj}$ represents the ground-truth objectness score for a given region or bounding box: a score of 1 indicates that the region contains an object, and 0 that it does not; this score is obtained from the labeled dataset during training. $pred_{obj}$ denotes the model's predicted objectness score for a given region or bounding box. Like the ground truth, this predicted score reflects the model's confidence in the presence of an object.
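A corresponding sketch of Eq. (2), assuming pred_obj already holds sigmoid probabilities; the epsilon clamp is a standard numerical guard, not part of the formula:

```python
import torch

def objectness_loss(pred_obj: torch.Tensor, true_obj: torch.Tensor) -> torch.Tensor:
    """Binary cross-entropy objectness loss, Eq. (2): true_obj holds 0/1
    labels and pred_obj holds predicted probabilities in (0, 1)."""
    eps = 1e-7
    p = pred_obj.clamp(eps, 1 - eps)  # avoid log(0)
    return -(true_obj * torch.log(p) + (1 - true_obj) * torch.log(1 - p)).mean()
```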