Vehicle Detection Algorithm Based on Fisheye Camera in Parking
Environment
Liao Wang and Qiu Fang
Xiamen University of Technology, Xiamen, China
Keywords: Deep Learning, YOLOv5, Fisheye Image, Distortion, Object Detection.
Abstract: To address missed and false detections of fisheye images by existing object detection algorithms, a targeted dataset is constructed and an improved model is proposed using YOLOv5 as a baseline. Firstly, to better adapt to the dimensions of the custom dataset, the K-means++ algorithm is used for anchor box clustering. Secondly, a deformable convolutional network is introduced to replace some convolution layers in the original network, so that the network can adaptively extract feature points from distorted images, and the integration of a coordinate attention mechanism enhances the expression of semantic and positional information of the feature points of interest. Furthermore, a Slim-Neck based on GSConv convolution is designed to replace the original Neck, reducing the number of model parameters and improving precision. Lastly, the detector's loss function is redesigned by incorporating EIoU Loss and Focal Loss. The results demonstrate that precision, recall and mean average precision are improved by 2.31%, 4.41% and 3.50% respectively compared with the YOLOv5 algorithm.
1 INTRODUCTION
Autonomous valet parking (AVP) plays a pivotal role in self-driving; its implementation relies on perception of the environment by sensors such as fisheye cameras and ultrasonic radar. Among these, fisheye cameras have unique advantages in detection due to their wide field of view and distortion characteristics (Lee M, 2019). Because of the large amount of aberration in the images they acquire, detecting vehicles quickly and accurately in fisheye images is a challenging problem. In recent years, many scholars have used neural networks to accomplish object detection. Deep learning-based object detection algorithms can be categorized, according to the presence or absence of explicit region proposals, into two-stage algorithms, exemplified by R-CNN (Girshick R, 2014) and Fast R-CNN (Yang, 2020), and single-stage algorithms, exemplified by YOLO (Redmon, 2018) and SSD (Chen X, 2019) (Jun Jiang, 2021). For object detection in fisheye images, the literature (Wei, 2022) concluded that the main problems affecting detection accuracy are object rotation and spatial distortion. The authors proposed a rotation-mask deformable convolutional network architecture that improves the convolutional kernel's capacity for learning rotated object features in fisheye images and its computational efficiency, but the model complexity is high. The literature (Fremont, 2016) divides the image into 7 regions and generates 4 sets of training samples for 4 different distortion levels and imaging models. The method performs well on the pedestrian detection task in fisheye images, but still suffers from missed and false detections. Hence, this paper presents a deep learning-based algorithm for object detection in fisheye images.
The primary contributions are as follows: (1) constructing a fisheye dataset based on the parking environment; (2) introducing deformable convolution into the feature extraction network, proposing the DCNC3 module as a replacement for the original C3 module, and incorporating the coordinate attention mechanism; (3) designing the GSCSP structure based on the new hybrid convolution GSConv to optimize model performance.
2 DATASET CONSTRUCTION
The only publicly available fisheye dataset for the autonomous driving domain is WoodScape from Valeo (Yogamani, 2019). Since the scenarios in that dataset are mostly foreign and parking scenarios are few, we constructed our own fisheye dataset based on the parking environment. The fisheye cameras were mounted above the air intake grille, on the front left and right door panels, and above the rear license plate (Figure 1). To ensure the diversity of the dataset, images of underground and open-air parking lots with varying parking-space occupancy were collected in Xiamen and Shanghai under different weather conditions and at different times of day. 8400 images were obtained and manually labeled with LabelImg, a data labeling tool, and the dataset was divided into training, testing, and validation sets with an 8:1:1 split. An illustration of the dataset is depicted in Figure 2.
Figure 1: Fisheye camera installation position.
Figure 2: Example of dataset.
This dataset contains many small targets as well as distorted shapes, with large variations in pixel size. Therefore, the k-means++ algorithm is used to cluster the boxes containing vehicle labels, so as to obtain the a priori (anchor) box sizes best suited to this dataset. The clustered sizes are shown in Table 1; a brief clustering sketch follows the table.
Table 1: K-means++ clustered a priori boxes.

Feature Map   Receptive Field   Anchors
13×13         Large             (113,67) (93,99) (138,109)
26×26         Medium            (41,38) (76,41) (66,70)
52×52         Small             (16,12) (27,20) (48,24)
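As a hedged illustration of this clustering step, the sketch below clusters (width, height) pairs with scikit-learn's KMeans, whose default initialization is k-means++; the data array is synthetic and the helper name is hypothetical, not the authors' code.

```python
# A minimal sketch of anchor clustering with k-means++, assuming the
# vehicle labels have been collected into an (N, 2) array of (w, h)
# box sizes in pixels; names here are illustrative.
import numpy as np
from sklearn.cluster import KMeans

def cluster_anchors(box_wh: np.ndarray, k: int = 9) -> np.ndarray:
    """Return k anchor (w, h) pairs sorted by area, as in Table 1."""
    km = KMeans(n_clusters=k, init="k-means++", n_init=10, random_state=0)
    km.fit(box_wh)
    anchors = km.cluster_centers_
    return anchors[np.argsort(anchors.prod(axis=1))]  # small -> large

# box_wh would be parsed from the LabelImg annotations (synthetic data here)
box_wh = np.abs(np.random.randn(1000, 2)) * 40 + 10
print(cluster_anchors(box_wh).round().astype(int))
```

With k = 9, the sorted centers map naturally onto the three detection scales of Table 1 (three anchors per feature map).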
3 IMPROVED NETWORK ALGORITHM BASED ON YOLOv5
YOLOv5 adopts an anchor-based, single-stage target detection method. Compared with previous versions, YOLOv5 offers enhanced speed and improved accuracy, and is one of the leading object detection algorithms in industry. The algorithm's core concept is to partition the image into grids, each responsible for predicting object type and location. It filters target boxes based on IoU (Intersection over Union) values between predicted and ground-truth bounding boxes, ultimately outputting class labels and positional coordinates for the predicted boxes. YOLOv5 is structured around four key elements: input, backbone network, neck feature fusion network, and head module. Aiming at the problems in fisheye image detection, this paper constructs a vehicle detection model for the fisheye camera based on the YOLOv5s model. Its architecture is illustrated in Figure 3, and a small IoU sketch follows the figure.
Figure 3: The network architecture of the model (backbone, neck, and head).
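As a small illustration of the IoU criterion mentioned above, the following sketch computes IoU for axis-aligned boxes; it is expository, not the YOLOv5 source.

```python
# A minimal IoU helper; boxes are (x1, y1, x2, y2) corner coordinates.
def iou(box_a, box_b):
    # intersection rectangle
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    # union = sum of areas minus intersection
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

print(iou((0, 0, 10, 10), (5, 5, 15, 15)))  # 25 / 175 ≈ 0.143
```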
3.1 DCBS Module Based on Deformable Convolution
A deformable convolution module (DCBS) is introduced into the feature extraction framework. As shown in Figure 4, this module is composed of a deformable convolution layer, a BN layer, and a SiLU activation layer. Some C3 modules in the feature extraction network are substituted with DCNC3 modules. The DCNC3 structure consists of
a conventional convolution module (CBS), a deformable convolution module (DCBS), and N D-Resunit residual units (in which only the 3×3 convolution layers are replaced by deformable convolution). Deformable convolution can alter the shape and size of the receptive field to match the distorted image.
Traditional convolution samples and weights the feature map with a fixed-size convolution kernel. The calculation formula is outlined below:
$$y(p_0)=\sum_{p_n\in\mathcal{R}} w(p_n)\cdot x(p_0+p_n) \tag{1}$$
Building on formula (1), an offset $\{\Delta p_n \mid n=1,2,\ldots,N\}$, with $N=|\mathcal{R}|$, is added for every point in the grid $\mathcal{R}$. The output feature map value $y(p_0)$ is then adjusted as follows:

$$y(p_0)=\sum_{p_n\in\mathcal{R}} w(p_n)\cdot x(p_0+p_n+\Delta p_n) \tag{2}$$
A modulation weight $\Delta m_n \in [0,1]$ is then added for each sampling point. The formula is as follows:

$$y(p_0)=\sum_{p_n\in\mathcal{R}} w(p_n)\cdot x(p_0+p_n+\Delta p_n)\cdot \Delta m_n \tag{3}$$
Finally, since the offset sampling positions are fractional, the pixel value at an offset position is obtained by bilinear interpolation:

$$x(p)=\sum_{q} G(q,p)\cdot x(q) \tag{4}$$
Figure 4: Structure of the DCBS module (input feature map → conv offset field → offsets → deformable convolution → BN → SiLU → output feature map).
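The following sketch shows one plausible DCBS implementation using torchvision's DeformConv2d (which supports the modulation mask of Eq. (3) in recent versions); the offset/mask prediction conv and channel sizes are assumptions for illustration, not the authors' released code.

```python
# A minimal sketch of the DCBS block (deformable conv + BN + SiLU),
# assuming a recent torchvision; names and channel sizes are illustrative.
import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d

class DCBS(nn.Module):
    def __init__(self, c_in, c_out, k=3, s=1):
        super().__init__()
        p = k // 2
        # A plain conv predicts per-location offsets (2 per sampling point)
        # and modulation masks (1 per sampling point), as in Eqs. (2)-(3).
        self.offset_conv = nn.Conv2d(c_in, 3 * k * k, k, s, p)
        self.dcn = DeformConv2d(c_in, c_out, k, s, p)
        self.bn = nn.BatchNorm2d(c_out)
        self.act = nn.SiLU()

    def forward(self, x):
        out = self.offset_conv(x)
        o1, o2, mask = torch.chunk(out, 3, dim=1)
        offset = torch.cat((o1, o2), dim=1)  # Δp_n: 2*k*k channels
        mask = torch.sigmoid(mask)           # Δm_n ∈ [0, 1]
        return self.act(self.bn(self.dcn(x, offset, mask)))

# quick shape check
y = DCBS(64, 128)(torch.randn(1, 64, 52, 52))
print(y.shape)  # torch.Size([1, 128, 52, 52])
```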
3.2 Integration of Coordinate Attention Mechanism
To help the network autonomously direct its attention towards distorted targets, the coordinate attention (CA) mechanism is incorporated in this paper (Hou, 2021). The module's structure is depicted in Figure 5. The CA module first performs average pooling along the vertical and horizontal orientations of the input feature map, then compresses the number of channels through a Concat and a 1×1 convolution. The spatial information is encoded by a BN layer and a non-linear layer, and normalized weighting is applied at the same time.
Figure 5: Architecture of the coordinate attention mechanism (X/Y average pooling → Concat + Conv2d → BatchNorm + non-linearity → per-direction Conv2d + Sigmoid → re-weighting, with a residual connection).
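A compact sketch of the CA module described above is given below, assuming a channel reduction ratio of 32 and the h-swish non-linearity of the original paper (Hou, 2021); variable names are illustrative.

```python
# A minimal sketch of coordinate attention; not the authors' exact code.
import torch
import torch.nn as nn

class CoordAtt(nn.Module):
    def __init__(self, channels, reduction=32):
        super().__init__()
        mid = max(8, channels // reduction)
        self.pool_h = nn.AdaptiveAvgPool2d((None, 1))  # pool over width  -> (N,C,H,1)
        self.pool_w = nn.AdaptiveAvgPool2d((1, None))  # pool over height -> (N,C,1,W)
        self.conv1 = nn.Conv2d(channels, mid, 1)       # channel compression
        self.bn = nn.BatchNorm2d(mid)
        self.act = nn.Hardswish()
        self.conv_h = nn.Conv2d(mid, channels, 1)
        self.conv_w = nn.Conv2d(mid, channels, 1)

    def forward(self, x):
        n, c, h, w = x.shape
        xh = self.pool_h(x)                        # (N, C, H, 1)
        xw = self.pool_w(x).permute(0, 1, 3, 2)    # (N, C, W, 1)
        # concat along the spatial dim, encode, then split back
        y = self.act(self.bn(self.conv1(torch.cat([xh, xw], dim=2))))
        yh, yw = torch.split(y, [h, w], dim=2)
        ah = torch.sigmoid(self.conv_h(yh))                      # (N, C, H, 1)
        aw = torch.sigmoid(self.conv_w(yw.permute(0, 1, 3, 2)))  # (N, C, 1, W)
        return x * ah * aw  # re-weight input with directional attention
```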
3.3 GSCSP Module
In this paper, the hybrid convolution method GSConv is introduced, combining standard convolution (SC), depthwise separable convolution (DSC) (Chollet, Xception, 2017) and a channel shuffle; its structure is shown in Figure 6. As shown in Figure 7, the GSbottleneck structure is built on GSConv and replaces the Bottleneck structure of the C3 module within the feature fusion network. The GSbottleneck structure replaces the 1×1 convolution layers with GSConv and adds a new skip connection. This reduces the number of network parameters, thereby improving the computational efficiency of the network and mitigating the risk of overfitting. In addition, because of the new skip connection the two branches do not share the same weights, and the information of each channel follows a unique propagation path, so the correlation and differentiation of channel information are significantly enhanced; this not only improves the efficiency of information transmission, but also accelerates training convergence. The C3 modules are removed from the original Neck and replaced with GSCSP modules. Finally, the GSConv, GSbottleneck and GSCSP modules are combined to form a Slim-Neck network.
Figure 6: Structure of the GSConv module (Conv → DWConv → Concat → channel shuffle).
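A minimal sketch of GSConv follows, assuming an even output channel count and a 5×5 depthwise kernel as in the public Slim-Neck reference implementation; it is illustrative rather than the authors' exact code.

```python
# GSConv sketch: a standard conv produces half the output channels, a
# depthwise conv produces the other half, and a channel shuffle mixes them.
import torch
import torch.nn as nn

class GSConv(nn.Module):
    def __init__(self, c_in, c_out, k=1, s=1):
        super().__init__()
        c_half = c_out // 2
        self.conv = nn.Sequential(
            nn.Conv2d(c_in, c_half, k, s, k // 2, bias=False),
            nn.BatchNorm2d(c_half), nn.SiLU())
        # depthwise conv on the dense half (groups == channels)
        self.dwconv = nn.Sequential(
            nn.Conv2d(c_half, c_half, 5, 1, 2, groups=c_half, bias=False),
            nn.BatchNorm2d(c_half), nn.SiLU())

    def forward(self, x):
        x1 = self.conv(x)
        x2 = torch.cat((x1, self.dwconv(x1)), dim=1)  # SC half + DSC half
        # channel shuffle: interleave the two halves
        n, c, h, w = x2.shape
        return x2.view(n, 2, c // 2, h, w).transpose(1, 2).reshape(n, c, h, w)
```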
Figure 7: Structure of the GSbottleneck and GSCSP modules.
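For completeness, here is a hedged sketch of how GSbottleneck and GSCSP could be assembled from the GSConv block above; the channel splits, the 1×1 skip path, and the CSP-style two-branch layout follow the textual description and common CSP practice rather than the authors' released code.

```python
# Assumes GSConv from the previous sketch is in scope.
import torch
import torch.nn as nn

class GSBottleneck(nn.Module):
    def __init__(self, c_in, c_out):
        super().__init__()
        c_mid = c_out // 2
        self.branch = nn.Sequential(GSConv(c_in, c_mid, 1), GSConv(c_mid, c_out, 3))
        # new skip connection: a separate 1x1 conv, so the two paths learn
        # independent weights, as described in Section 3.3
        self.skip = nn.Sequential(
            nn.Conv2d(c_in, c_out, 1, bias=False), nn.BatchNorm2d(c_out))

    def forward(self, x):
        return self.branch(x) + self.skip(x)

class GSCSP(nn.Module):
    def __init__(self, c_in, c_out, n=1):
        super().__init__()
        c_half = c_out // 2  # the 1/2c split shown in Figure 7
        self.cv1 = nn.Sequential(nn.Conv2d(c_in, c_half, 1, bias=False),
                                 nn.BatchNorm2d(c_half), nn.SiLU())
        self.cv2 = nn.Sequential(nn.Conv2d(c_in, c_half, 1, bias=False),
                                 nn.BatchNorm2d(c_half), nn.SiLU())
        self.m = nn.Sequential(*(GSBottleneck(c_half, c_half) for _ in range(n)))
        self.cv3 = nn.Sequential(nn.Conv2d(c_out, c_out, 1, bias=False),
                                 nn.BatchNorm2d(c_out), nn.SiLU())

    def forward(self, x):
        # CSP layout: one branch through the GSbottlenecks, one shortcut branch
        return self.cv3(torch.cat((self.m(self.cv1(x)), self.cv2(x)), dim=1))
```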
4 EXPERIMENTS AND ANALYSIS OF RESULTS
4.1 Experimental Setup and Assessment Metrics
The experiments in this paper were carried out on the Ubuntu 18.04 operating system. The hardware environment is an Intel(R) Core(TM) i7-9700 CPU @ 2.60 GHz with 32 GB of memory and a GeForce RTX 3090 GPU with 24 GB of video memory. The software environment is Python 3.8, and the deep learning framework is PyTorch 1.11.0. The input image resolution is 640×640, the training batch size is 32, and the network parameters are learned and updated by SGD. The initial learning rate is 0.01, the learning rate decay parameter is 0.0001, the momentum is 0.937, the weight decay coefficient is 0.0005, and the number of training epochs is 150. In the experiments, mean Average Precision (mAP), Precision, and Recall were used as evaluation metrics. They are calculated as follows:
$$\mathrm{Precision}=\frac{TP}{TP+FP} \tag{5}$$

$$\mathrm{Recall}=\frac{TP}{TP+FN} \tag{6}$$

$$AP=\int_{0}^{1}\mathrm{Precision}(t)\,dt \tag{7}$$

$$mAP=\frac{1}{N}\sum_{n=1}^{N}AP_{n} \tag{8}$$
In this context, TP denotes the number of positive samples correctly predicted, FP the number of negative samples mistakenly predicted as positive, and FN the number of positive samples erroneously classified as negative.
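As a quick worked instance of Eqs. (5) and (6), with purely illustrative counts:

```python
# A worked example of Eqs. (5)-(6); the counts are illustrative only.
tp, fp, fn = 90, 8, 11          # true positives, false positives, false negatives
precision = tp / (tp + fp)      # 90 / 98  ≈ 0.918
recall = tp / (tp + fn)         # 90 / 101 ≈ 0.891
print(f"Precision = {precision:.3f}, Recall = {recall:.3f}")
```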
4.2 Analysis of Experimental Results
1) Algorithm Performance Comparison Experiment
To validate the advancement of the proposed algorithm, YOLOv3, YOLOv4, YOLOv5, SSD, Faster-RCNN, RetinaNet and the improved algorithm are compared on the objective evaluation metrics over the dataset constructed in this paper. The comparison of their detection performance is shown in Table 2. Table 2 shows that the precision, recall and mean average precision of the proposed algorithm are greatly improved compared with the other algorithms. Compared with the original YOLOv5 algorithm, the precision, recall and mean average precision are increased by 2.31%, 4.41% and 3.50%, respectively. In conclusion, the improved algorithm in this paper has better detection performance.
Table 2: Detection performance comparison of different models.

Model         mAP@0.5   Precision   Recall
SSD           75.25     80.00       53.53
Faster-RCNN   74.79     78.63       64.61
RetinaNet     67.43     87.32       55.21
YOLOv3        88.35     89.24       79.93
YOLOv4        82.56     91.28       73.50
YOLOv5        91.40     89.39       84.99
Ours          94.90     91.70       89.40
2) Comparison of the Model Before and After Improvement
To assess the effectiveness of the aforementioned improvement strategies on the missed and false detection problem, ablation experiments were carried out. Table 3 shows the mean average precision, precision and recall of the different improvement strategies. As Table 3 illustrates, after integrating deformable convolutions into the reconstructed backbone network, the mean average precision, precision and recall increase by 2.36%, 1.12% and 1.36%, respectively. After the further introduction of Slim-Neck, although the mean average precision increases by 1.05% and the precision by only 0.28%, the recall increases by 2.22%. It is evident that this improvement effectively decreases the rates of missed and false detections. These experiments show that DCNC3, Slim-Neck and CA each boost detection accuracy to a certain degree.
Table 3: Experimental results of the improved model (baseline: YOLOv5; ✓ marks the applied improvement strategies).

DCNC3   CA   GSCSP   mAP@0.5   Precision   Recall
-       -    -       91.40     89.39       84.99
✓       -    -       93.76     90.51       86.35
✓       ✓    -       93.85     91.70       87.18
✓       ✓    ✓       94.90     91.98       89.40
3) Assessment of Model Performance on the WoodScape Dataset
To further demonstrate the advancement of the proposed algorithm, it is verified on WoodScape, Valeo's public autonomous-driving fisheye dataset. Both the improved algorithm and the original YOLOv5 algorithm are evaluated on this dataset; the experimental results are given in Table 4. Compared with the original YOLOv5 algorithm, the mean average precision, precision and recall of the proposed algorithm increase by 0.93%, 0.61% and 0.96%, respectively. Since the images in the dataset constructed in this paper are all domestic parking scenes, the types of detected vehicles are not as rich as those in the WoodScape dataset. Therefore, although the algorithm's detection precision on the WoodScape dataset is lower than on the self-made dataset, it still achieves good performance.
Table 4: Performance comparison of each algorithm on the WoodScape dataset.

Model     mAP@0.5   Precision   Recall
YOLOv5    85.20     83.10       76.60
Ours      86.03     83.51       77.56
4) Comparison of Detection Results
Examples of detection results on the dataset built in this paper and on the WoodScape dataset are shown in Figure 8. As shown in (a) and (c), in a poorly lit underground garage the proposed algorithm reduces the missed detection rate compared with the original YOLOv5 algorithm. In (d), at the strongly distorted edge of the fisheye image, the original YOLOv5 algorithm cannot detect the distorted vehicle, while the proposed algorithm detects it accurately. In (b), the original YOLOv5 produces a false detection that the proposed algorithm avoids. Therefore, the algorithm demonstrates good accuracy and robustness across various scenarios.
Figure 8: Comparison of detection results (panels a-d).
5 CONCLUSIONS
In this paper, a series of improvements is proposed from the aspects of feature learning, feature fusion, sample weight allocation and information transmission: the DCNC3 module to facilitate adaptive learning of distortion information, the addition of a coordinate attention mechanism, and a Slim-Neck redesign of the feature fusion network. The experimental results demonstrate that the algorithm improves all metrics on both the dataset constructed in this paper and the public dataset. It not only effectively boosts the detection precision of vehicles in fisheye images, but also reduces the rates of missed and false detections. However, the dataset constructed in this paper is based on the urban parking environment and its samples are not yet rich enough; it will be extended in future work.
ACKNOWLEDGMENTS
This work was financially supported by the Natural Science Foundation of Fujian Province (Grant No. 2022J011247).
REFERENCES
Lee M, Kim H, Paik J. Correction of barrel distortion in fisheye lens images using image-based estimation of distortion parameters[J]. IEEE Access, 2019, 7: 45723-45733.
Girshick R, Donahue J, Darrell T, et al. Rich feature hierarchies for accurate object detection and semantic segmentation[C]. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014: 580-587. https://doi.org/10.1109/cvpr.2014.81
Yang W, Li Z, Wang C, et al. A multi-task Faster R-CNN method for 3D vehicle detection based on a single image[J]. Applied Soft Computing, 2020, 95: 106533. https://doi.org/10.1016/j.asoc.2020.106533
Redmon J, Farhadi A. YOLOv3: An Incremental Improvement[J]. Computer Vision and Pattern Recognition, 2018, 87(8): 101-104. https://doi.org/10.48550/arXiv.1804.02767
Chen X, Yu J, Wu Z. Temporally Identity-Aware SSD With Attentional LSTM[J]. IEEE Transactions on Cybernetics, 2019, 50(06): 2674-2686.
Jun Jiang, Donghai Zhai. Single-stage object detection algorithm based on dilated convolution and feature enhancement[J]. Computer Engineering, 2021, 47(7): 232-238+248.
Wei X, Wei Y, Lu X. RMDC: Rotation-mask deformable convolution for object detection in top-view fisheye camera[J]. Neurocomputing, 2022, 504: 99-108. https://doi.org/10.1016/j.neucom.2022.06.116
Fremont V, Bui M T, Boukerroui D, et al. Vision-based people detection system for heavy machine applications[J]. Sensors, 2016, 16(1): 128. https://doi.org/10.3390/s16010128
Yogamani S, Hughes C, Horgan J, et al. WoodScape: A multi-task, multi-camera fisheye dataset for autonomous driving[C]. Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019: 9308-9318.
Hou Q, Zhou D, Feng J. Coordinate attention for efficient mobile network design[C]. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021: 13713-13722. https://doi.org/10.1109/cvpr46437.2021.01350
Chollet F. Xception: Deep learning with depthwise separable convolutions[C]. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017: 1251-1258. https://doi.org/10.1109/cvpr.2017.195