
sive farm management solutions. By focusing on a
real-world problem, this work aims to bridge the gap
between advanced technical solutions and practical
agricultural challenges, setting the stage for more in-
tegrative, technology-driven farming practices in the
future. In summary, our contributions are:
1. We introduce a novel real-world application: fine-tuning a deep object detector to count coconut palm trees.
2. We show that performance can be substantially increased by considering not only coconut palm trees during training but also other plants. This allows the object detector to better differentiate between the plants it sees.
3. We show that it suffices to train on synthetically generated images, thereby eliminating the need to manually label the images.
2 BACKGROUND
This section delves into the foundational concepts and the current state of object detection. We start with convolutional neural networks. Subsequently, our focus shifts to object detection, a critical component for addressing our use case. In this context, we enumerate the four scenarios encountered during object detection, especially when positioning bounding boxes.
We also elucidate the concept of Intersection-over-Union. Building on this, we discuss key performance
metrics in object detection, namely precision, recall,
and mean average precision. Concluding this section,
we present the YOLO architecture, emphasizing the
advancements in YOLOv7—a state-of-the-art object
detector. The experiments we discuss in Section 4
predominantly leverage this technology.
2.1 Convolutional Neural Networks
Convolutional Neural Networks (CNNs) have
emerged as a pivotal technology in the domain of
computer vision. Originating from the larger family
of Deep Neural Networks (DNNs), CNNs are specif-
ically tailored for image data, making them adept at
tasks such as face recognition, image classification,
and object detection.
A standard CNN architecture comprises several layers. The first layer, the convolutional layer, is responsible for feature extraction. Using filter matrices, or kernels, this layer captures patterns such as colours and edges in an image. Because the same filters are applied across the entire image, the extracted features are largely invariant to translation, allowing objects to be recognized irrespective of their spatial position in the image.
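To make the convolution operation concrete, the following NumPy sketch (illustrative only; the Sobel kernel for vertical edges and the random 8×8 image are assumptions, not details from the paper) slides a single 3×3 kernel over a grayscale image with stride 1 and no padding. Because the same kernel is reused at every position, the response is identical wherever the pattern appears:

```python
import numpy as np

def conv2d(image, kernel):
    """Slide a kernel over a 2D image (valid padding, stride 1)."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            # Element-wise product of the kernel with the current window
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

# The classic Sobel kernel for vertical edges (for illustration).
kernel = np.array([[-1, 0, 1],
                   [-2, 0, 2],
                   [-1, 0, 1]])
image = np.random.rand(8, 8)          # stand-in for a grayscale image
feature_map = conv2d(image, kernel)   # responds wherever vertical edges occur
```

In a real CNN the kernel values are not hand-crafted but learned during training, and many kernels run in parallel to produce a stack of feature maps.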
Following the convolutional layer is the pooling layer, designed for dimensionality reduction. Reducing the dimensions not only decreases the computational burden but also helps extract dominant features. Two common pooling methods exist: max pooling and average pooling, which capture the maximum and average values, respectively, from a designated window of the input.
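As a minimal sketch of both pooling variants (the 2×2 window size and the toy 4×4 input are arbitrary choices for illustration, not details from the paper), the following NumPy function reduces each non-overlapping window to its maximum or its average:

```python
import numpy as np

def pool2d(x, size=2, mode="max"):
    """Non-overlapping pooling over size x size windows."""
    h, w = x.shape[0] // size, x.shape[1] // size
    # Reshape so each pooling window gets its own pair of axes
    windows = x[:h * size, :w * size].reshape(h, size, w, size)
    if mode == "max":
        return windows.max(axis=(1, 3))
    return windows.mean(axis=(1, 3))   # average pooling

x = np.arange(16, dtype=float).reshape(4, 4)
print(pool2d(x, mode="max"))   # [[ 5.  7.] [13. 15.]]
print(pool2d(x, mode="avg"))   # [[ 2.5  4.5] [10.5 12.5]]
```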
In the deeper sections of the network, fully-
connected layers serve the crucial role of integrating
features from previous layers and mapping them to
the desired output. These layers essentially form the
decision-making component of the CNN.
Training a CNN involves defining its architecture and then optimising its internal parameters over several passes through the training data, known as epochs, to enhance prediction accuracy. This training is supervised, requiring labelled datasets to guide the iterative minimisation of prediction errors. Within the realm of CNNs,
renowned architectures include AlexNet, GoogLeNet,
and VGGNet.
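For illustration only, a minimal supervised training loop might look as follows in PyTorch; the toy network, the learning rate, the ten epochs, and the dummy batch standing in for a labelled dataset are all assumptions, not details from the paper:

```python
import torch
import torch.nn as nn

# Minimal illustrative CNN; the layer sizes are arbitrary choices.
model = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(),
    nn.MaxPool2d(2),                       # pooling halves the spatial size
    nn.Flatten(),
    nn.Linear(16 * 16 * 16, 10),           # fully-connected decision layer
)
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

# Dummy labelled batch; in practice a torch.utils.data.DataLoader
# would yield (images, labels) pairs from a real dataset.
images = torch.randn(8, 3, 32, 32)
labels = torch.randint(0, 10, (8,))
loader = [(images, labels)]

for epoch in range(10):                    # one pass over the data per epoch
    for batch_images, batch_labels in loader:
        optimizer.zero_grad()
        loss = loss_fn(model(batch_images), batch_labels)  # prediction error
        loss.backward()                    # gradients via backpropagation
        optimizer.step()                   # refine the internal parameters
```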
Critical challenges in training CNNs encompass
phenomena like underfitting and overfitting. Address-
ing these challenges, often by tuning hyperparame-
ters, ensures that the trained model is both robust and
accurate in its predictions (Sevarac, 2021).
2.2 Object Detection
Computer vision, pivotal in numerous applications,
encompasses tasks such as image classification, seg-
mentation, and object detection. Object detection
marries object recognition — identifying and clas-
sifying objects within media — with object local-
ization, which encapsulates these identified objects
within bounding boxes (Khandelwal, 2020). A key
metric here is the Intersection-over-Union (IoU),
which quantifies the overlap between the predicted
bounding box and the ground truth — the actual an-
notated bounding box. It calculates the ratio of their
intersection to their union. IoU values range between
0 and 1: values close to 0 imply minimal overlap and
those nearing 1 signify accurate predictions (Anwer,
2022).
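To make this concrete, here is a minimal sketch of IoU for axis-aligned boxes; the (x1, y1, x2, y2) corner format is an assumption for illustration:

```python
def iou(box_a, box_b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    # Coordinates of the intersection rectangle
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)  # 0 if the boxes do not overlap
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)   # intersection over union

# Prediction vs. ground truth; each box overlaps half of the other.
print(iou((0, 0, 10, 10), (5, 0, 15, 10)))  # 50 / (100+100-50) ≈ 0.333
```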
2.3 YOLO: You Only Look Once
This section presents modern object detectors, focusing on YOLO (You Only Look Once). The era of deep-learning-based object detection began with the Region-based Convolutional Neural Network (R-CNN) in 2014. It proposed re-