allowing the model to capture intricate dependencies
across the entire image. This capability is particularly
beneficial for UAV imagery, where objects of interest
may appear at varying scales and under partial
occlusion, often against highly cluttered backgrounds.
This paper's contribution lies in the strategic
combination of these two powerful models. DETR and
ViT are deployed in parallel, with each model
processing the same input independently; this
leverages DETR's precise localization and ViT's
adeptness at handling scale variations and
occlusions. This dual-model approach mitigates the
limitations inherent in each model when used alone
and capitalizes on their complementary strengths.
A dynamic fusion algorithm orchestrates the
integration of outputs from both models. Rather than
merely aggregating confidence scores, the algorithm
adjusts the fusion ratio in real time based on
contextual cues and the specific characteristics of
each detected object, such as its apparent size. This
adaptive weighting allows the system to adjust
continuously to the complex and evolving scenes
encountered in UAV operation, thereby improving
detection accuracy and robustness across a wide range
of operational scenarios; a simplified sketch of this
weighting scheme is given below. The fusion of DETR
and ViT aims to set a new standard in UAV-based
surveillance and monitoring, promising substantial
improvements in the reliability and effectiveness of
such systems. The anticipated impact of this study
spans improvements in operational safety,
particularly in search and rescue missions,
enhancements in surveillance accuracy for security
applications, and greater data precision for
environmental monitoring. This approach represents a
significant step forward in computer vision for UAVs
and points toward new ways of deploying them in
complex and critical applications worldwide.
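To make the fusion rule concrete, the following is a minimal Python sketch of score-level fusion between the two detectors. The greedy IoU matching, the area threshold, and the weight values are illustrative assumptions, not the exact procedure evaluated in this paper; the intent is only to show how a per-object fusion ratio can shift weight toward the model better suited to each detection.

def iou(a, b):
    # Intersection-over-union of two [x1, y1, x2, y2] boxes.
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def fusion_ratio(box, small_area=32 * 32, base=0.5, shift=0.3):
    # Heuristic per-object ViT weight: favor ViT for small (likely
    # distant or occluded) objects, DETR for large ones. The threshold
    # and shift are illustrative values, not tuned constants.
    area = (box[2] - box[0]) * (box[3] - box[1])
    return base + shift if area < small_area else base - shift

def fuse(detr_dets, vit_dets, match_thresh=0.5):
    # Merge two detection lists of (box, score) tuples. Matched pairs
    # get a context-weighted score; unmatched detections are kept.
    fused, used = [], set()
    for box_d, score_d in detr_dets:
        best_j, best_iou = -1, match_thresh
        for j, (box_v, _) in enumerate(vit_dets):
            if j not in used and iou(box_d, box_v) >= best_iou:
                best_j, best_iou = j, iou(box_d, box_v)
        if best_j >= 0:
            used.add(best_j)
            w = fusion_ratio(box_d)  # weight assigned to ViT's score
            score_v = vit_dets[best_j][1]
            fused.append((box_d, (1 - w) * score_d + w * score_v))
        else:
            fused.append((box_d, score_d))
    fused += [d for j, d in enumerate(vit_dets) if j not in used]
    return fused

In this sketch, the DETR box is kept for matched pairs, reflecting its localization precision, while the confidence score is blended according to the per-object ratio.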
2 RELATED WORKS
A comprehensive benchmark of real-time object
detection models tailored for UAV applications was
presented by Du et al. (2019). The authors developed
new motion models to enhance detection accuracy in
high-speed aerial scenarios, addressing challenges
with rapidly moving objects. Their research
highlighted the importance of integrating dynamic
movement models into detection frameworks to
improve response times and accuracy in UAV-
captured imagery.
The Vision Transformer architecture was
extended by Wang and Tien (2023) to better suit
aerial image analysis by incorporating dynamic
position embeddings. This adaptation allows the
model to handle varying scales and orientations of
objects typically found in UAV datasets. Their
findings demonstrate significant improvements in
object detection performance on aerial images,
supporting the concept of transformers' adaptability
to specialized tasks.
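As an illustration of how position embeddings can be adapted to inputs of varying size, the sketch below bilinearly interpolates a pretrained ViT's fixed embedding table to a new patch grid. This is a common adaptation technique and is offered only as a rough analogue; the dynamic position embeddings of Wang and Tien (2023) may be implemented differently.

import torch
import torch.nn.functional as F

def resize_pos_embed(pos_embed, old_grid, new_grid):
    # pos_embed: (1, 1 + old_h * old_w, dim), with a leading CLS token.
    cls_tok, patch_pe = pos_embed[:, :1], pos_embed[:, 1:]
    dim = patch_pe.shape[-1]
    old_h, old_w = old_grid
    new_h, new_w = new_grid
    # Reshape to a 2-D grid, interpolate, then flatten back.
    patch_pe = patch_pe.reshape(1, old_h, old_w, dim).permute(0, 3, 1, 2)
    patch_pe = F.interpolate(patch_pe, size=(new_h, new_w),
                             mode="bilinear", align_corners=False)
    patch_pe = patch_pe.permute(0, 2, 3, 1).reshape(1, new_h * new_w, dim)
    return torch.cat([cls_tok, patch_pe], dim=1)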
Huang and Li (2024) introduced enhancements in
small object detection, focusing on information
augmentation and adaptive feature fusion to improve
detection accuracy and real-time performance. Their
results demonstrate superior performance over the
latest DETR model. This research is pertinent to our
work as it highlights the effectiveness of advanced
algorithms in refining object detection, echoing our
approach to optimizing UAV-based detection with
transformer architectures.
Ye et al. (2023) introduced RTD-Net, tailored for
UAV-based object detection. It addresses challenges
like small and occluded object detection and the need
for real-time performance. By implementing a
Feature Fusion Module (FFM) and a Convolutional
Multiheaded Self-Attention (CMHSA) mechanism,
the network achieved improvements in handling
complex detection scenarios, resulting in an 86.4%
mAP on their UAV dataset. Their approach,
emphasizing efficiency and effectiveness, aligns with
our methods of optimizing object detection through
advanced architecture fusion.
3 MATERIALS AND METHODS
3.1 Dataset Used
The VisDrone dataset, which contains diverse aerial
images from various urban and rural scenes across
Asia, was used in this study. Initially, the dataset
included annotations for many object classes, such as
cars, buildings, and trees. The following steps were
performed to tailor it to the needs of this research.
Data curation and labeling were performed using
custom Python scripts and LabelImg. The dataset was
filtered to retain only images containing people, and
the annotations were re-labeled to ensure uniformity,
combining the "person" and "people" labels into a
single "person" label.
A format conversion was performed during
preprocessing so that the annotations matched the
formats accepted by the ViT and DETR pipelines.
Originally in COCO format, the dataset was converted
to Pascal VOC format; this involved adapting the
annotations and restructuring the dataset files with
a custom Python script.
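A condensed sketch of such a conversion script is given below. It writes only the VOC fields a typical detection pipeline reads and assumes the standard COCO bounding-box convention of [x, y, width, height] with a top-left origin.

import json
import os
from xml.etree.ElementTree import Element, SubElement, ElementTree

def coco_to_voc(coco_json, out_dir):
    # Write one Pascal VOC XML file per image from a COCO annotation file.
    os.makedirs(out_dir, exist_ok=True)
    with open(coco_json) as f:
        coco = json.load(f)
    cats = {c["id"]: c["name"] for c in coco["categories"]}
    anns_by_img = {}
    for a in coco["annotations"]:
        anns_by_img.setdefault(a["image_id"], []).append(a)
    for im in coco["images"]:
        root = Element("annotation")
        SubElement(root, "filename").text = im["file_name"]
        size = SubElement(root, "size")
        SubElement(size, "width").text = str(im["width"])
        SubElement(size, "height").text = str(im["height"])
        for a in anns_by_img.get(im["id"], []):
            x, y, w, h = a["bbox"]  # COCO: top-left corner + width/height
            obj = SubElement(root, "object")
            SubElement(obj, "name").text = cats[a["category_id"]]
            box = SubElement(obj, "bndbox")
            SubElement(box, "xmin").text = str(int(x))
            SubElement(box, "ymin").text = str(int(y))
            SubElement(box, "xmax").text = str(int(x + w))
            SubElement(box, "ymax").text = str(int(y + h))
        stem = os.path.splitext(im["file_name"])[0]
        ElementTree(root).write(os.path.join(out_dir, stem + ".xml"))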