allowing the model to capture intricate dependencies 
across the entire image. This capability is particularly 
beneficial for UAV imagery, where objects of interest 
may appear at various scales and in partial occlusions, 
often against highly cluttered backgrounds. 
This paper's contribution lies in the strategic
combination of these two powerful models. With
DETR and ViT deployed in parallel, each model
processes the same input independently, which
leverages DETR's acute precision in localization
and ViT's adeptness at handling scale variations and
occlusions. This dual-model approach mitigates the
limitations inherent in each model when used alone
and capitalizes on their complementary strengths.
A  dynamic  fusion  algorithm  orchestrates  the 
integration  of  outputs  from  both  models.  This 
algorithm does not merely aggregate confidence
scores but also adjusts the fusion ratio in real time,
based on the context and the specific characteristics
of the detected objects. Such a
sophisticated approach ensures that the system adapts 
continuously to complex and evolving landscapes of 
UAV  operation,  thereby  enhancing  detection 
accuracy  and  robustness  across  a  wide  range  of 
operational scenarios. This fusion of DETR and ViT 
sets  new  standards  in  UAV-based  surveillance  and 
monitoring,  promising  substantial  improvements  in 
the reliability and effectiveness of such systems. The 
anticipated impact of this study spans improvements 
in operational safety, particularly in search and rescue 
missions, enhancements in surveillance accuracy for 
security  applications,  and  greater  data  precision  for 
environmental monitoring. This approach represents 
a  significant  technological  leap  in  computer  vision 
and  heralds  a  paradigm  shift  in  how  UAVs  can  be 
utilized  in  complex  and  critical  applications 
worldwide. 
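To make the idea of a context-dependent fusion ratio concrete, a minimal sketch is given below. It is not the fusion algorithm of this paper: the rule in `fusion_ratio` (giving ViT more weight on small boxes and DETR more weight on large ones), the names `Detection`, `fusion_ratio`, and `fuse`, and all thresholds are assumptions chosen purely for illustration.

```python
from dataclasses import dataclass


@dataclass
class Detection:
    box: tuple    # (xmin, ymin, xmax, ymax) in pixels
    score: float  # confidence in [0, 1]


def iou(a, b):
    """Intersection over union of two (xmin, ymin, xmax, ymax) boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def fusion_ratio(box, image_area, small_frac=0.01, w_min=0.25, w_max=0.75):
    """Hypothetical rule: the DETR weight grows with the relative box area."""
    rel_area = (box[2] - box[0]) * (box[3] - box[1]) / image_area
    t = min(1.0, rel_area / small_frac)
    return w_min + t * (w_max - w_min)


def fuse(detr_dets, vit_dets, image_area, iou_thr=0.5):
    """Match detections by IoU and blend matched pairs with a per-object ratio."""
    fused, used = [], set()
    for d in detr_dets:
        best_j, best_iou = None, iou_thr
        for j, v in enumerate(vit_dets):
            if j in used:
                continue
            overlap = iou(d.box, v.box)
            if overlap >= best_iou:
                best_j, best_iou = j, overlap
        if best_j is None:
            fused.append(d)  # DETR-only detection is kept unchanged
            continue
        used.add(best_j)
        v = vit_dets[best_j]
        w = fusion_ratio(d.box, image_area)  # weight assigned to DETR
        box = tuple(w * dc + (1 - w) * vc for dc, vc in zip(d.box, v.box))
        fused.append(Detection(box, w * d.score + (1 - w) * v.score))
    # ViT-only detections are kept unchanged as well
    fused.extend(v for j, v in enumerate(vit_dets) if j not in used)
    return fused
```

In this sketch the ratio is recomputed for every matched pair, so two objects in the same frame can be fused with different weights. Any other per-object cue (for example an occlusion estimate) could replace the box-area rule without changing the structure of `fuse`.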
2  RELATED WORKS  
A  comprehensive  benchmark  of  real-time  object 
detection models tailored for UAV applications was 
presented by (Du et al., 2019). The authors developed 
new motion models to enhance detection accuracy in 
high-speed  aerial  scenarios,  addressing  challenges 
with  rapidly  moving  objects.  Their  research 
highlighted  the  importance  of  integrating  dynamic 
movement  models  into  detection  frameworks  to 
improve  response  times  and  accuracy  in  UAV-
captured imagery.  
The  Vision  Transformer  architecture  was 
extended  by  (Wang  and  Tien,  2023)  to  better  suit 
aerial  image  analysis  by  incorporating  dynamic 
position  embeddings.  This  adaptation  allows  the 
model  to  handle  varying  scales  and  orientations  of 
objects  typically  found  in  UAV  datasets.  Their 
findings  demonstrate  significant  improvements  in 
object  detection  performance  on  aerial  images, 
supporting  the  concept  of  transformers'  adaptability 
to specialized tasks.  
(Huang and Li, 2024) introduced enhancements in
small  object  detection,  focusing  on  information 
augmentation and adaptive feature fusion to improve 
detection accuracy and real-time performance. Their 
results  demonstrate  superior  performance  over  the 
latest DETR model. This research is pertinent to our 
work  as  it  highlights  the  effectiveness  of  advanced 
algorithms in  refining object detection, echoing our 
approach  to  optimizing  UAV-based  detection  with 
transformer architectures.  
(Ye et al., 2023) introduced RTD-Net, tailored for 
UAV-based object detection. It addresses challenges 
like small and occluded object detection and the need 
for  real-time  performance.  By  implementing  a 
Feature  Fusion Module  (FFM)  and  a  Convolutional 
Multiheaded  Self-Attention  (CMHSA)  mechanism, 
the  network  achieved  improvements  in  handling 
complex  detection  scenarios,  resulting  in  an  86.4% 
mAP  on  their  UAV  dataset.  Their  approach, 
emphasizing efficiency and effectiveness, aligns with 
our  methods  of  optimizing  object  detection  through 
advanced architecture fusion. 
3  MATERIALS AND METHODS       
3.1  Dataset Used     
The VisDrone dataset, which contains diverse aerial 
images  from  various  urban  and  rural  scenes  across 
Asia, was used in this study. Initially, the dataset
included many objects, such as cars, buildings, and
trees. The following steps were performed to tailor it
to the needs of this research.
Data curation and labeling were performed using
custom Python scripts and LabelImg. The dataset was
filtered to retain only images containing people. The
annotations were re-labeled to ensure uniformity,
combining labels for "person" and "people" into a
single "person" label.
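The filtering and re-labeling step can be sketched as follows. This is an illustrative reconstruction rather than the script used in the study: the in-memory layout (a dictionary mapping image names to lists of labeled boxes) and the function name `curate` are hypothetical, and the sketch assumes that annotations of other object types are discarded along with images that contain no people.

```python
def curate(annotations, keep=("person", "people"), merged="person"):
    """Keep only person-like annotations and merge them into one label.

    annotations: dict mapping an image name to a list of
    {"label": str, "bbox": [...]} dictionaries (hypothetical layout).
    """
    curated = {}
    for image, objects in annotations.items():
        # Copy each kept object and overwrite its label with the merged one.
        persons = [{**obj, "label": merged} for obj in objects if obj["label"] in keep]
        if persons:  # images without any person are dropped
            curated[image] = persons
    return curated
```

The input dictionary is left unmodified, so the original annotations remain available for comparison after curation.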
A format conversion was performed during
preprocessing, after reviewing the annotation formats
accepted by ViT and DETR. Originally in COCO format,
the dataset was converted to Pascal VOC format. This
involved adapting the annotations and restructuring
the dataset files using a custom Python script.
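The core of such a conversion is the change in bounding-box convention: COCO stores a box as [x, y, width, height], whereas Pascal VOC stores the corner coordinates xmin, ymin, xmax, ymax in one XML file per image. A minimal sketch using only the Python standard library is shown below. The function names and the reduced set of XML fields are assumptions for illustration, and the sketch does not reproduce the script used in this study.

```python
import xml.etree.ElementTree as ET


def coco_bbox_to_voc(bbox):
    """Convert a COCO [x, y, width, height] box to (xmin, ymin, xmax, ymax)."""
    x, y, w, h = bbox
    return int(round(x)), int(round(y)), int(round(x + w)), int(round(y + h))


def voc_xml(filename, width, height, objects):
    """Build a reduced Pascal VOC annotation string for one image.

    objects: list of (label, coco_bbox) pairs (hypothetical layout).
    """
    root = ET.Element("annotation")
    ET.SubElement(root, "filename").text = filename
    size = ET.SubElement(root, "size")
    for tag, value in (("width", width), ("height", height), ("depth", 3)):
        ET.SubElement(size, tag).text = str(value)
    for label, bbox in objects:
        obj = ET.SubElement(root, "object")
        ET.SubElement(obj, "name").text = label
        box = ET.SubElement(obj, "bndbox")
        for tag, value in zip(("xmin", "ymin", "xmax", "ymax"), coco_bbox_to_voc(bbox)):
            ET.SubElement(box, tag).text = str(value)
    return ET.tostring(root, encoding="unicode")
```

A full converter would also group the COCO annotations by image and write one XML file per image. Those steps are omitted here because the source does not describe them in detail.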