based imaging system, typically used to monitor a No Fly Zone or a restricted area. Real-time object detection is crucial for UAV monitoring. Moreover, these applications require early detection of the objects so that the detections can later serve as inputs for other tasks. Because of this early detection, the apparent size of the objects is generally very small. In general, tiny object detection aims to detect objects in an image that are tiny in size, meaning that the objects of interest either are physically large but occupy only a tiny area of the image, or are genuinely tiny in appearance. Improvements in object detection algorithms allow faster and more accurate results.
The most recent methods based on deep Convolutional Neural Networks (deep CNNs) usually involve several steps: first, candidate objects of interest are specified in the image; these candidates are then passed through the deep CNN for feature extraction and classified using supervised classification techniques; finally, the per-candidate results are merged to obtain the final bounding boxes. A minimal sketch of this pipeline is given below.
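As a rough illustration only, the following sketch mirrors these steps with a toy sliding-window proposer, a small placeholder CNN, and torchvision's non-maximum suppression to merge overlapping detections; the network, window sizes, and thresholds are illustrative assumptions, not any of the published detectors discussed next.

import torch
import torch.nn as nn
from torchvision.ops import nms

class TinyDetector(nn.Module):
    """Toy feature extractor + classifier standing in for a deep CNN."""
    def __init__(self, num_classes=2):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(4), nn.Flatten(),
        )
        self.classifier = nn.Linear(16 * 4 * 4, num_classes)

    def forward(self, crops):              # crops: (N, 3, H, W)
        return self.classifier(self.features(crops))

def propose_regions(h, w, size=64, stride=32):
    """Step 1: specify candidate regions (here: naive sliding windows)."""
    boxes = [[x, y, x + size, y + size]
             for y in range(0, h - size + 1, stride)
             for x in range(0, w - size + 1, stride)]
    return torch.tensor(boxes, dtype=torch.float)

def detect(image, model, score_thresh=0.5, iou_thresh=0.5):
    _, h, w = image.shape
    boxes = propose_regions(h, w)
    # Step 2: extract features and classify each candidate region.
    crops = torch.stack([image[:, int(y0):int(y1), int(x0):int(x1)]
                         for x0, y0, x1, y1 in boxes])
    scores = model(crops).softmax(dim=1)[:, 1]   # P(object) per candidate
    keep = scores > score_thresh
    boxes, scores = boxes[keep], scores[keep]
    # Step 3: merge overlapping candidates into final bounding boxes.
    return boxes[nms(boxes, scores, iou_thresh)]

image = torch.rand(3, 256, 256)                  # stand-in input image
print(detect(image, TinyDetector()).shape)       # (num_detections, 4)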
In deep CNN models, current state-of-the-art object detectors fall mainly into two categories: single-stage and two-stage detectors. On the one hand, single-stage detectors are represented by SSD (Single Shot MultiBox Detector) (Liu et al., 2016), which runs a convolutional network on the input image only once, computes a feature map, and predicts the detections, and by YOLO (You Only Look Once) (Redmon et al., 2016), which treats object detection as a simple regression problem, taking an input image and directly learning the class probabilities and bounding box coordinates. Both models (SSD and YOLO) were designed with accuracy and processing time in mind; the sketch after this paragraph illustrates the single-pass regression formulation.
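To make this "regression problem" framing concrete, the sketch below maps an image to a YOLOv1-style grid of predictions in a single forward pass, with S = 7 grid cells, B = 2 boxes per cell, and C = 20 classes as in the original paper; the backbone here is a placeholder assumption, not the published architecture.

import torch
import torch.nn as nn

S, B, C = 7, 2, 20      # grid size, boxes per cell, classes (YOLOv1 settings)

class SingleStageHead(nn.Module):
    """One pass: image -> S x S x (B*5 + C) tensor holding box coordinates,
    confidences, and class probabilities, with no separate proposal stage."""
    def __init__(self):
        super().__init__()
        self.backbone = nn.Sequential(           # placeholder backbone
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(S),
        )
        self.head = nn.Conv2d(32, B * 5 + C, 1)  # per-cell regression outputs

    def forward(self, x):
        out = self.head(self.backbone(x))        # (N, B*5+C, S, S)
        return out.permute(0, 2, 3, 1)           # (N, S, S, B*5+C)

pred = SingleStageHead()(torch.rand(1, 3, 448, 448))
print(pred.shape)   # torch.Size([1, 7, 7, 30]) -- the 7x7x30 YOLOv1 output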
On the other hand, two-stage detectors include Faster R-CNN (Region-based Convolutional Neural Network) (Ren et al., 2015), which uses a region proposal network to generate regions of interest in a first stage, and Mask R-CNN (He et al., 2017), which sends the region proposals down the pipeline for object classification and bounding box regression. Such models perform well in terms of accuracy, in particular Faster R-CNN with an accuracy of 73% mAP, but due to their very complex pipeline, these two-stage detectors perform poorly in terms of speed, at about 7 frames per second (FPS), which restricts their use for real-time object detection.
Since real-time performance is a challenge in optical early-warning UAV detection, in our work we propose a CNN architecture based on a detection method with a fast processing speed. In particular, YOLO performs well compared to earlier region-based algorithms in terms of speed, reaching 45 FPS while maintaining a good detection accuracy of more than 63% mAP (Rahim et al., 2021). Although its speed and accuracy were good, YOLOv1 (the first version of YOLO) (Redmon et al., 2016) made notable localization errors; in other words, the bounding boxes predicted by YOLOv1 were not accurate. To overcome these deficiencies, the creators of YOLO released YOLOv2 (the second version of YOLO) (Redmon and Farhadi, 2017), which focused mainly on improving the similarity of the predicted bounding boxes to the ground-truth boxes and the percentage of relevant objects correctly detected, without impairing classification accuracy. Moreover, YOLOv2, also called YOLO9000 (Redmon and Farhadi, 2017), reached a speed of 59 FPS and an mAP of 77.8% in experiments on the PASCAL VOC 2007 dataset (Everingham et al., 2010), (Everingham et al., 2014).
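The two quantities YOLOv2 targets are commonly measured as Intersection over Union (IoU), for the overlap between a predicted and a ground-truth box, and recall, for the fraction of relevant objects found. A minimal IoU computation, with boxes given as (x1, y1, x2, y2) corners, might look as follows.

def iou(box_a, box_b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

# A prediction shifted from its ground truth scores well below 1:
print(iou((10, 10, 60, 60), (20, 20, 70, 70)))  # ~0.47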
YOLOv3 (the third version of YOLO) (Redmon and Farhadi, 2018), whose main improvement is the addition of multi-scale prediction, brought further improvements in speed and accuracy. In experiments on the MS COCO dataset (Lin and Maire, 2014), (Kim, 2017), YOLOv3 obtained a 55% AP score and achieved a real-time speed of approximately 200 FPS on a Tesla V100.
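Multi-scale prediction means that YOLOv3 emits detections at three feature-map resolutions, corresponding to strides of 32, 16, and 8; for a 416x416 input this yields 13x13, 26x26, and 52x52 grids. The snippet below only computes those output shapes (for the 80 MS COCO classes) and is not the network itself.

input_size = 416
strides = (32, 16, 8)                    # the three YOLOv3 detection scales
anchors_per_scale, num_classes = 3, 80   # MS COCO has 80 classes

for s in strides:
    g = input_size // s
    channels = anchors_per_scale * (5 + num_classes)
    print(f"stride {s:2d}: {g}x{g} grid, output {g}x{g}x{channels}")
# stride 32: 13x13, stride 16: 26x26, stride 8: 52x52, each with 255 channels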
YOLOv4 (the fourth version of YOLO) was released on 23 April 2020 and YOLOv5 on 10 June 2020. YOLOv4 (Bochkovskiy et al., 2020), (Wang et al., 2021d) was released in the Darknet framework, whereas YOLOv5 (Wang et al., 2021d), (Ultralytics, 2021), (Ahmed and Kharel, 2021), (Wang et al., 2021b), (Yan et al., 2021), (Yang et al., 2020) was released in the Ultralytics PyTorch framework. Although YOLOv4 can reach 43.5% AP on MS COCO (COCO, 2021) at a speed of 65 FPS, the developers of YOLOv5 report that, in a YOLOv5 Colab notebook running on a Tesla P100, they measured inference times as low as 0.007 seconds per image, i.e., 140 frames per second (FPS) (Yan et al., 2021). In contrast, YOLOv4 achieved 50 FPS after being converted to the same Ultralytics PyTorch library (Ultralytics, 2021). In addition, YOLOv5 is smaller: its weights file is 27 megabytes, whereas the weights file of YOLOv4 (with the Darknet architecture) is 244 megabytes, making YOLOv5 about 88% smaller than YOLOv4 (Roboflow, 2021).
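Reproducing such a per-image timing is straightforward with the Ultralytics torch.hub interface; a rough sketch follows, where the sample image URL comes from the Ultralytics repository and the measured FPS obviously depends on the hardware, image size, and batching.

import time
import torch

# Load a pretrained small YOLOv5 model via the Ultralytics hub entry point.
model = torch.hub.load('ultralytics/yolov5', 'yolov5s', pretrained=True)

img = 'https://ultralytics.com/images/zidane.jpg'  # sample image from the repo
model(img)                        # warm-up run (weight transfer, caching)

t0 = time.time()
results = model(img)              # single-image inference
elapsed = time.time() - t0
print(f"{elapsed:.3f} s/image  ->  ~{1 / elapsed:.0f} FPS")
results.print()                   # summary of detected objects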
The development of new versions of YOLO is not finished. On 28 October 2021, Fang et al. (Fang et al., 2021) launched YOLOS (You Only Look at One Sequence), a series of object detection models based on the vanilla Vision Transformer with the fewest possible modifications, region priors, and inductive biases of the target task. However, although other variants of YOLO have been developed, such as YOLOX (Ge et al., 2021), YOLOv5 remains more practical for real-time detection.