Comparing Boosted Cascades to Deep Learning Architectures for Fast and Robust Coconut Tree Detection in Aerial Images

Steven Puttemans, Kristof Van Beeck and Toon Goedemé
KU Leuven, EAVISE Research Group, Jan Pieter De Nayerlaan 5, Sint-Katelijne-Waver, Belgium
Both authors equally contributed to this work.

Keywords: Object Detection, Boosted Cascades, Deep Learning.
Abstract: Object detection using a boosted cascade of weak classifiers is a principle that has been used in a variety of applications, ranging from pedestrian detection to fruit counting in orchards, often with a high average precision. In this work we prove that both the boosted cascade approach suggested by Viola & Jones and the adapted approach based on integral or aggregate channels by Dollár yield promising results on coconut tree detection in aerial images. However, with the rise of robust deep learning architectures for both detection and classification, and the significant drop in hardware costs, we wonder if it is feasible to apply deep learning to solve the task of fast and robust coconut tree detection and classification in aerial imagery. We examine both classification- and detection-based architectures for this task. By doing so we prove that deep learning is indeed a feasible alternative for robust coconut tree detection with a high average precision in aerial imagery, while keeping attention to known issues with the selected architectures.
1 INTRODUCTION
Obtaining a robust and accurate location of any object in a given input image, combined with a correct label, is a key part of solving many automation tasks. In most cases the localization is only a small part of the complete pipeline, thus requiring a very high accuracy in order to reduce the error propagated through the remaining pipeline as much as possible.
Large companies use human annotators in low-cost countries to manually analyse every single image, locating and labelling each instance of specific object classes by manually augmenting the given image with bounding boxes of the object instances. In our application of coconut tree detection in aerial images (see Figure 1), the human annotators are asked to click on the coconut tree centres, after which a circle with a predefined average diameter (which can be defined due to the fixed height at which the images are captured during aerial photography) is placed as an annotation on top of the coconut tree. An example of such a labelled image can be seen in Figure 2.
This manual annotation is a cumbersome, time-consuming and expensive process. Furthermore, to avoid bias introduced by a single annotator, we need to incorporate annotation redundancy by forcing multiple annotators to process the same image and then averaging the annotation locations. Additionally, manual annotation is very prone to mistakes when performing these repetitive tasks (e.g. missing coconut trees, selecting wrong locations, ...).
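As an illustration, such redundancy averaging could look like the following minimal Python sketch, assuming each annotator delivers a list of (x, y) centre clicks; the 50 pixel matching radius (half a tree diameter) is an assumed value, not a setting from our annotation pipeline:

```python
import numpy as np

# A minimal sketch of redundancy averaging over annotator clicks; the
# 50 pixel matching radius (half a tree diameter) is an assumed value.
def merge_annotations(click_sets, radius=50):
    """click_sets: one list of (x, y) centre clicks per annotator.
    Clicks from later annotators within `radius` pixels of an already
    seen tree are grouped with it and the positions are averaged."""
    trees = [[np.asarray(c, float)] for c in click_sets[0]]
    for clicks in click_sets[1:]:
        for click in (np.asarray(c, float) for c in clicks):
            dists = [np.linalg.norm(click - group[0]) for group in trees]
            if dists and min(dists) <= radius:
                trees[int(np.argmin(dists))].append(click)
            else:  # a tree the earlier annotators missed
                trees.append([click])
    return [np.mean(group, axis=0) for group in trees]
```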
Many of these tasks could be automated given the possibilities of state-of-the-art object detection algorithms. Given models pre-trained on labelled training data, these systems can robustly locate objects in a given input image with human-like accuracy using the power of machine learning. Compared to human annotators, such repetitive tasks are perfectly suited for automated systems, which can heavily parallelize them and look for multiple object instances at once.

Figure 1: Example aerial image captured through remote sensing equipment in which we need to detect coconut trees.

Figure 2: Example of an aerial image containing manually annotated coconut trees.
The challenging part lies in finding the correct algorithm for training these accurate machine-learned object detection models. This is where we position this research: finding the optimal solution for automated and robust coconut tree detection. We are convinced that several object detection algorithms based on the principle of a boosted cascade of weak classifiers (Viola and Jones, 2001; Dollár et al., 2009; Dollár et al., 2010) are a perfect fit for this task. On the other hand, deep learning algorithms have introduced a new wave of state-of-the-art object detectors, capable of achieving top-notch accuracy results. Combined with the fact that the required GPGPU hardware is becoming affordable and the fact that many pre-trained networks already exist (e.g. the Caffe model zoo (Jia et al., 2014)), deep learning seems a valid alternative to the boosted cascade classifier based approaches.
In this work we compare the well-known cascade classifier object detection algorithms to these new and powerful deep-learned object detection algorithms. We evaluate the trained detection models both in achieved accuracy and execution speed, keeping in mind that achieving real-time performance is in many cases a hard constraint for the actual application. Furthermore, we give recommendations on how to efficiently use deep learning algorithms in the context of object detection in aerial images and propose some general rules to keep in mind.
Initial reading on the topic of deep learned object detection shows that training models from scratch without large amounts of training data and expensive hardware is infeasible. We therefore examine the subject of transfer learning, where we adapt existing deep learned models trained on larger datasets (e.g. Pascal VOC (Everingham et al., 2010), Microsoft COCO (Lin et al., 2014)) to suit our specific object detection needs, by fine-tuning the weights of all the convolutional layers onto new object classes.
The remainder of this paper is organized as follows. Section 2 discusses the state-of-the-art in object detection and looks at similar research performing object detection in aerial imagery. This is followed by section 3, where the data collected for training and validating our object detection solutions is discussed. Sections 4 and 5 discuss the different approaches we suggest for coconut tree detection. The achieved results are subsequently discussed in section 6, followed by conclusions in section 7.
2 RELATED WORK
The principle of a boosted cascade of weak classifiers was introduced by (Viola and Jones, 2001), where it is used to obtain a robust face detection algorithm based on the very simple and weak Haar-wavelet-like features. By letting the boosting process decide which features are good at separating object from non-object patches, they obtain a robust classifier with a limited set of features. Furthermore, the concept of a cascade, which allows for the early rejection of non-object patches, drastically increases the execution speed of the algorithm.
(Dollár et al., 2009) argue that dropping colour information and sticking to a single invariant feature representation might limit the possibilities of boosted cascade classifiers. The paper suggests using multiple feature representations, called integral channels (ICF), ranging from Gabor filters to edge filters, colour filters, etc., all with the aim of improving the detection quality of boosted cascades. By doing so they significantly improve the accuracy of pedestrian detection. In (Dollár et al., 2010) they introduce a set of extensions and the concept of aggregated channel features (ACF), pushing the achieved accuracy on pedestrian detection even further.
Even though algorithms based on a boosted cascade of weak classifiers are already somewhat older, ongoing research shows that these techniques are still valid solutions for existing industrial problems (Puttemans et al., 2016a; Puttemans et al., 2016b; Zheng et al., 2016; Shaikh et al., 2016). These solutions offer high accuracies by smartly using scene- and application-specific constraints to improve the efficiency of the boosted cascade algorithms, while maintaining real-time processing speeds.
Since 2015, deep learning frameworks have been taking over the state-of-the-art in both object classification and object detection tasks. Due to the rise of enormous datasets and the drop in hardware cost, deep learning has become a valid alternative for any classic machine learning task. With the introduction of pre-trained networks like AlexNet (Krizhevsky et al., 2012), InceptionV3 (Szegedy et al., 2016), DenseNet (Iandola et al., 2014) and ResNet (Targ et al., 2016), classification results on challenging datasets like ImageNet (Deng et al., 2009) have risen to the top.
On the detection side, interesting approaches have been suggested that combine the above classification networks with a multi-scale sliding-window based approach (Szegedy et al., 2013). However, the downside of these approaches is the tremendous number of region proposals generated by the multi-scale sliding window, which results in computationally expensive algorithms. To tackle this issue, region proposal networks were proposed as a pre-filter for the CNN classification pipeline (Ren et al., 2015). Even though several parts of the region proposal network can be shared with the subsequent classification network, this approach still needs two separate architectures, which need to be tuned individually.
Finally, the current state-of-the-art in object detection is found in single-pass deep learned object detectors. These algorithms integrate the region proposal approach directly as a layer inside the classification pipeline, thus requiring only a single pass of the network to detect all objects in the given image. Examples of these state-of-the-art algorithms are the 'Single Shot Multibox' detector (Liu et al., 2016) and the 'You Only Look Once' detector (Redmon et al., 2016). These efficient networks reduce the number of region proposals so drastically that real-time performance is achievable, rising up to 120 FPS at VGA resolution.
3 DATASET AND FRAMEWORK
As training and validation data for our suggested approaches we make use of a 10,000 × 10,000 pixel aerial image covering part of a coconut plantation, provided in RGBA format. Inside the image all coconut trees are manually annotated using their centre position. The average size of a patch covering a whole tree is 100 × 100 pixels, so this size is used as the annotation patch around each centre position. The image contains 3798 coconut tree patches, while the remaining image parts are used as background information.
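A minimal sketch of how such a centre-click annotation translates into a fixed-size annotation patch; the function and its clipping behaviour are our own illustration, not the annotation tool actually used:

```python
import numpy as np

PATCH = 100  # average tree diameter in pixels at the fixed flight height

def centre_to_patch(cx, cy, image_shape, size=PATCH):
    """Return an (x, y, width, height) box around a centre click,
    clipped so it stays inside the image borders."""
    h, w = image_shape[:2]
    x0 = int(np.clip(cx - size // 2, 0, w - size))
    y0 = int(np.clip(cy - size // 2, 0, h - size))
    return x0, y0, size, size
```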
In all cases we split the provided data into parts, using one part for model learning and the other part for model validation, to ensure the detector is not validated on actual training data. The specific data splits are discussed with each separate technique.
To train our object detection models we use three publicly available frameworks. The first boosted cascade approach, based on the principle of (Viola and Jones, 2001) combined with LBP features, is trained using the implementation in the OpenCV 3.2 framework (Bradski and Kaehler, 2000). The second boosted cascade approach, based on the principle of (Dollár et al., 2010) and using the aggregate channel features approach, is trained using the author's own MATLAB toolbox (Dollár, 2005). For our deep learned models we start with an implementation of the InceptionV3 architecture in TensorFlow (Abadi et al., 2015), but quickly switch to the C and CUDA based framework Darknet (Redmon, 2013), which includes both the classification (Darknet19, Densenet201) and detection (YOLOv2) architectures we test further in this paper.
4 APPROACHES WITH BOOSTED
CASCADES
In this section we discuss the different boosted cascade based approaches we used for training our coconut tree detectors, combined with the specific number of training and validation samples used.
4.1 An Adaptive Boosted Cascade of
Weakly Trained Classifiers
Our first approach (Viola and Jones, 2001) is a boosted cascade of weak classifiers using the adaptive boosting algorithm (Margineantu and Dietterich, 1997) for learning the weak classifiers, based on the local binary pattern (LBP) feature representation (Ahonen et al., 2004). This invariant feature representation ignores colour information and works directly on a grayscale image, focusing on local differences in pixel intensities. We choose to work on a grayscale image since no obvious separation between coconut trees and background (e.g. grasslands, other vegetation types, ...) seems achievable using colour information. If, however, we want to detect another object class with distinct colour based features, we would first apply a colour-based transformation on our input data, as suggested by (Puttemans et al., 2016b).

Table 1: Training data for the Viola&Jones based detection models and the model complexity (expressed by the number of stages and weak stump classifiers).

          #pos   #neg    #stages   #weak
Model 1   1000   2500    16        126
Model 2   1000   5000    15        123
Model 3   1000   10000   15        142
Model 4   2000   8000    16        221
For training the detection model we split the source image into four equal parts. The annotations of the top left image part are used as positive training samples, while the remaining image parts are used for validation. As background training patches we randomly sample patches at the model size from the image, not containing actual coconut trees. We increase the number of negative samples with each model to obtain a more accurate detector with fewer false positive detections, achieving a higher average precision with each step, as discussed in section 6.
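A sketch of this negative sampling step, assuming annotations are given as (x, y, width, height) boxes; the rejection-sampling loop and helper are illustrative, not the exact tooling we used:

```python
import random

def overlaps(a, b):
    """Axis-aligned overlap test for two (x0, y0, x1, y1) boxes."""
    return not (a[2] <= b[0] or b[2] <= a[0] or a[3] <= b[1] or b[3] <= a[1])

def sample_negatives(image, annotations, n, size=100, max_tries=100000):
    """Randomly crop n background patches that miss every annotated tree."""
    h, w = image.shape[:2]
    boxes = [(x, y, x + bw, y + bh) for (x, y, bw, bh) in annotations]
    negatives = []
    for _ in range(max_tries):
        if len(negatives) == n:
            break
        x0 = random.randint(0, w - size)
        y0 = random.randint(0, h - size)
        candidate = (x0, y0, x0 + size, y0 + size)
        if not any(overlaps(candidate, b) for b in boxes):
            negatives.append(image[y0:y0 + size, x0:x0 + size])
    return negatives
```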
On top of the gathered training samples we apply data augmentation for our final model, by randomly flipping the training patches around their vertical or horizontal axis. The number of training samples used for each model can be seen in Table 1, together with the number of stages of weak classifiers and the actual number of weak classifiers, which roughly indicates the model's complexity. All weak classifiers are represented as single-depth binary decision trees on top of the selected LBP features.
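The flipping augmentation itself is straightforward; a minimal sketch using OpenCV (the cascade itself is trained with OpenCV's opencv_traincascade tool, with -featureType LBP):

```python
import random
import cv2

def random_flip(patch):
    """Randomly flip a training patch around its vertical (flip code 1)
    or horizontal (flip code 0) axis."""
    return cv2.flip(patch, random.choice([0, 1]))
```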
4.2 An Aggregate Channel Features
based Approach
In comparison to our first approach, the algorithm of (Dollár et al., 2010) proposes adding multiple invariant feature representations to the adaptive boosting process, as aggregate feature channels. By adding colour, gradient filters, Gabor wavelets, etc., the accuracy of the trained detectors increases compared to using a single feature channel.
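To make the channel idea concrete, the following simplified sketch computes a common ACF-style channel set (LUV colour, gradient magnitude and a 6-bin gradient orientation histogram) and aggregates it over 4 × 4 pixel blocks; it approximates, but is not, Dollár's MATLAB implementation:

```python
import cv2
import numpy as np

def compute_channels(bgr, n_orient=6):
    """Simplified ACF-style channels: 3 LUV + 1 gradient magnitude +
    n_orient orientation histogram channels, summed over 4x4 blocks."""
    luv = cv2.cvtColor(bgr, cv2.COLOR_BGR2LUV).astype(np.float32) / 255.0
    gray = cv2.cvtColor(bgr, cv2.COLOR_BGR2GRAY).astype(np.float32) / 255.0
    gx = cv2.Sobel(gray, cv2.CV_32F, 1, 0)
    gy = cv2.Sobel(gray, cv2.CV_32F, 0, 1)
    mag = np.sqrt(gx ** 2 + gy ** 2)
    orient = np.arctan2(gy, gx) % np.pi  # unsigned orientation in [0, pi)
    bins = np.minimum((orient / np.pi * n_orient).astype(int), n_orient - 1)
    hist = np.zeros(gray.shape + (n_orient,), np.float32)
    for b in range(n_orient):  # magnitude-weighted orientation histogram
        hist[..., b] = mag * (bins == b)
    channels = np.dstack([luv, mag[..., None], hist])  # H x W x 10
    h4, w4 = (s - s % 4 for s in gray.shape)  # aggregate over 4x4 blocks
    return channels[:h4, :w4].reshape(h4 // 4, 4, w4 // 4, 4, -1).sum(axis=(1, 3))
```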
We first of all train a model using a similar amount of positive training data (2000 samples) to the best performing model of the previous technique. However, we notice that the negative data might be gathered from patches that are also validated afterwards, since the single top left corner did not contain enough background patches to use, as in the Viola&Jones based approach. Therefore, two extra ACF-based models are trained, splitting the dataset into a lower (1741 training samples) and an upper (1914 training samples) image half. We train a model on one half and validate it on the other half of the image. In general the ACF algorithm uses far more negative training samples, gathered from the same image as the positive training data, leading up to 150,000 patches.
5 APPROACHES WITH DEEP
LEARNING
After training our boosted cascade models, we switch to deep learned models. We first try learning a complete architecture from scratch, without initialized weights, and then apply several transfer learning approaches, where the existing weights of a pre-trained deep model are fine-tuned towards application-specific weights, resulting in a model that can detect the new object class. Finally, we investigate the difference between classification and detection architectures in deep learning and their applicability to coconut tree detection in aerial imagery.
5.1 Learning a Complete New Deep
Learned Architecture
Although the literature advises against this, we first train a completely new model on the available case-specific data, with no initialization of the weights based on a previous training. This model seems to converge, judging from the loss rate over the number of training iterations, but in subsection 6.4 we discuss why this converged model is misleading and unusable in our application.
5.2 Freezing (n-1) Layers and
Fine-tuning Final Layer Weights
A second approach is to freeze the weights of the pre-trained convolutional layers and only re-train the final layer and its connections. This forces the deep learned model to make new constellations of existing features for a new object class. We apply this approach to the existing InceptionV3 model inside TensorFlow and fine-tune the final layer to classify coconut trees in aerial image patches, while freezing all the other convolutional layers.
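A minimal sketch of this frozen-layer transfer learning; the Keras API shown here is a stand-in for the original TensorFlow retraining script, and all hyperparameters are illustrative rather than the settings we used:

```python
import tensorflow as tf

# Pre-trained InceptionV3 backbone with its classification head removed.
base = tf.keras.applications.InceptionV3(weights="imagenet", include_top=False,
                                         pooling="avg", input_shape=(299, 299, 3))
base.trainable = False  # freeze all pre-trained convolutional layers

model = tf.keras.Sequential([
    base,
    tf.keras.layers.Dense(2, activation="softmax"),  # coconut tree vs background
])
model.compile(optimizer=tf.keras.optimizers.Adam(1e-3),
              loss="sparse_categorical_crossentropy", metrics=["accuracy"])
# model.fit(train_patches, train_labels, validation_data=...)
```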
One major advantage of this approach is that the amount of data needed for this kind of transfer learning is very small. Sample cases in the TensorFlow framework show that only 75 samples per class can already be enough to obtain satisfying results.
This approach only works if the object class to be detected is somehow related to the data contained in the initial dataset on which the model was trained. If the data is drastically different, however, as in the case of aerial imagery, then obtaining satisfying results using this approach is quite hard, as illustrated in subsection 6.4.1, and other approaches should be considered.
5.3 Fine-tuning Weights of All Layers
Instead of freezing the weights of all the pre-trained layers, we can also tolerate slight adaptations of the pre-trained weights of the convolutional layers. This allows the learned features to become more specific to our desired detection task, after which a constellation of those new fine-tuned features is learned on top. When doing so, setting a small learning rate is mandatory, otherwise the initial weights change too drastically too fast, prohibiting the model from converging to an optimal solution.
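Continuing the previous Keras sketch, full fine-tuning only differs in unfreezing the base network and lowering the learning rate; the value 1e-5 is an illustrative "small" rate, not the setting used in our experiments:

```python
import tensorflow as tf

base = tf.keras.applications.InceptionV3(weights="imagenet", include_top=False,
                                         pooling="avg", input_shape=(299, 299, 3))
model = tf.keras.Sequential([base, tf.keras.layers.Dense(2, activation="softmax")])
base.trainable = True  # all convolutional weights may now adapt slightly
model.compile(optimizer=tf.keras.optimizers.Adam(1e-5),  # small learning rate
              loss="sparse_categorical_crossentropy", metrics=["accuracy"])
# model.fit(train_patches, train_labels, validation_data=...)
```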
Using Darknet, we apply transfer learning with this fine-tuning approach to both the Darknet19 and the Densenet201 architectures, trained on ImageNet, with the goal of obtaining a deep learned classifier for our new case of coconut tree localisation and classification, using a smaller set of case-specific manually annotated image patches.
5.4 From Classification towards
Detection Architectures
Since using a classification network implies that we need a multi-scale sliding-window based approach for gathering image patches, we also try training a single-pass detection based model (the YOLOv2 architecture). The literature clearly states that these single-pass networks are much faster than their sliding-window counterparts.
Unfortunately, due to its coarse grid-based region proposals, this architecture is not able to cope with densely packed scenes, where object instances lie close together and slightly overlap. This triggers final output detections that cover multiple object instances instead of retrieving single object instances, and furthermore prevents the model from converging to an optimal configuration. This is a major problem in our application of coconut tree detection in aerial imagery, and thus this approach was abandoned.
6 RESULTS
This section discusses the various results we obtain with the different object detection approaches, focusing on our case of robust and accurate coconut tree detection.
6.1 Viola&Jones-based Object
Detection
Figure 3 displays the obtained precision-recall curves for the Viola&Jones boosted cascades of weak classifiers using local binary pattern features. For each detector we also report the average precision (AP), which is calculated as the area under the curve of the given precision-recall curve.
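Concretely, we integrate only over the measured precision-recall points, without extrapolating beyond them; a minimal sketch:

```python
import numpy as np

def average_precision(recall, precision):
    """AP as the area under the measured precision-recall points via
    trapezoidal integration, without extrapolating beyond the last
    measured recall value."""
    order = np.argsort(recall)
    return float(np.trapz(np.asarray(precision)[order],
                          np.asarray(recall)[order]))
```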
The closer the precision-recall curve lies to the top-right corner, the better the detector. Increasing the number of negative samples, which gives the model a better descriptive power for its negative class, seems to work well. This should in principle also yield a higher average precision, but we reckon our graph does not directly reflect this. Our OpenCV-based implementation does not allow generating more precision-recall points for the given data, and since we do not want to extrapolate the unknown data points, we do not take the area under this non-existing part into account. This gives a wrong impression of the achieved average precision.
We conclude that, given a fairly limited set of case-specific annotated training data and a limited training time of only two hours, we obtain a detector that is able to detect coconut trees with 90% precision at a recall of 80%. Furthermore, we notice that applying data augmentation helps boost the generalization properties of our boosted cascade models. The final model performs detections on a 10,000 × 10,000 pixel image within ten minutes.
6.2 ACF-based Object Detection
Figure 4 shows the obtained precision-recall curves for the ACF boosted cascades. We immediately notice that this framework is able to draw power from multiple feature channels and is thus able to obtain higher average precisions. Our best scoring model, trained on the bottom half of our dataset image and validated on the top half, achieves an average precision of 94.55%.
The optimal point of the best model, and thus the optimal setting of our detector, achieves 96% precision at a recall of 90%, which is quite remarkable given the very limited training time of only 30 minutes. The best performing model performs detections on the 10,000 × 10,000 pixel image within five minutes.
Figure 3: Precision-recall curves for the different Viola&Jones-based detection models accompanied by the number of training samples used and the achieved average precision. [Plot omitted; legend: Model 1 (1000 pos, 2500 neg), AP = 84.52%; Model 2 (1000 pos, 5000 neg), AP = 81.74%; Model 3 (1000 pos, 10000 neg), AP = 71.27%; Model 4 (2000 pos, 8000 neg), AP = 83.53%.]
Figure 4: Precision-recall curves for the different ACF-based detection models with their achieved average precision. [Plot omitted; legend: ACF trained on top left corner, AP = 92.44%; ACF trained on bottom half, tested on top half, AP = 94.55%; ACF trained on top half, tested on bottom half, AP = 90.34%.]
Figure 5: Precision-recall comparison between VJ and ACF models on the same validation dataset. [Plot omitted; legend: ACF, AP = 92.44%; V&J (1000 pos, 2500 neg), AP = 84.52%.]
6.3 Viola&Jones versus ACF
Since the detectors of subsections 6.1 and 6.2 are validated on different amounts of data, we decided to perform an additional comparative study.

Figure 5 shows the result of validating the best performing Viola&Jones and ACF detectors, both trained on the top left quarter and validated on the same remaining image parts. This clearly shows that ACF outperforms Viola&Jones, with a 7.92% higher average precision (92.44% versus 84.52%).
6.4 Deep Learned Object Classification
Our initial attempt at training a complete deep learning classification model from our limited set of training data did not produce usable results, although the model seems to converge. With a top1-accuracy (a validation metric used in large-scale classification benchmarks) of only 33% on a two-class problem (coconut tree or background), this trained model performs worse than random guessing of the class label, which, given a large enough dataset, should eventually converge to 50% top1-accuracy.
6.4.1 Transfer Learning with Frozen Layers
The transfer learning using TensorFlow is done with only 75 coconut tree samples and 75 background samples, randomly sampled from the dataset, because retraining only the final layer is computationally less demanding. All other layers are frozen in this set-up, meaning their weights cannot be changed. The remaining image content is used for validation and compared to the ground truth annotations. The trained model achieves a 77% top1-accuracy.
To be able to compare this accuracy to the accuracy of our previously trained boosted cascades, we calculate the precision and recall at pixel level. This results in a precision of 72% at a recall of 52%. Compared to the results obtained with our boosted cascades, we decide that this approach does not yield satisfying results, and it was therefore abandoned.
6.4.2 Transfer Learning by Fine-tuning All
Layers
Following the frozen-layer-model approach, we again suggest using pre-trained weights as initialization for model fine-tuning. However, instead of freezing the weights of all but the last convolutional layer, we allow the complete network to fine-tune its weights.
Figure 6: Loss-rate and average loss-rate curves during training for both (top) Darknet19 and (bottom) Densenet201 models. [Plots omitted; axes: number of iterations (×10^4) versus training loss.]
We started with the default Darknet19 network, consisting of nineteen convolutional layers, and then tried a similar approach with the more complex Densenet201 network, containing two hundred and one convolutional layers. The reason for testing both architectures is that the author of the Darknet framework illustrated that using an even deeper network achieves a higher top1-accuracy while being slower at inference time (Redmon, 2013). We decided to verify whether this behaviour was reproducible on our coconut tree dataset.
Figure 6 displays the loss rate versus the number of training iterations for both models. As can be seen, both models converge to a stable model given enough iterations. In order to avoid overfitting to our training data, we evaluated our deep learned classification models at several iteration intervals to determine the best model weights for our coconut tree classification task. The fast drop in loss rate is explained by the fact that we increase the batch size when training these models. This allows taking larger learning rate steps and at the same time acts as a safety measure against outliers. For our fine-tuned Darknet19 model we find that 10,000 iterations seem optimal, at a top1-accuracy of 95.2%, while for the fine-tuned Densenet201 model 20,000 iterations give the best performance, at a top1-accuracy of 97.4%.
6.4.3 Execution Speeds and Memory Footprints
of Deep Learning Approaches
Since we are using the classification architectures of Darknet instead of detection architectures, we are aware that we need to apply a sliding-window based evaluation on our large input image to perform coconut tree localisation.
We evaluate our models using a single NVIDIA TitanX GPU. The Darknet19 model evaluates our 100 × 100 pixel input patches at 265 FPS, while our Densenet201 model evaluates patches at 52 FPS. Table 2 gives an overview of the step size used for the sliding window in relation to the execution time of both of our models. Due to the nature of our images, performing a multi-scale analysis is unnecessary, since images are captured at a constant flight height. Combined with the fact that deep learning is quite robust to slight size changes, we stick to 100 × 100 pixel patches.
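A sketch of this single-scale sliding-window evaluation, where classify() is a hypothetical stand-in for one of the trained classifiers; note that the patch counts in Table 2 follow directly from ((10,000 − 100)/step + 1)², e.g. 199² = 39,601 positions at a 50 pixel step:

```python
def sliding_window(image, step=50, size=100):
    """Yield every size x size patch at the given step over the image."""
    h, w = image.shape[:2]
    for y in range(0, h - size + 1, step):
        for x in range(0, w - size + 1, step):
            yield x, y, image[y:y + size, x:x + size]

# classify() is a hypothetical wrapper around a trained Darknet model.
def detect(image, classify, step=50):
    return [(x, y) for x, y, patch in sliding_window(image, step)
            if classify(patch) == "coconut tree"]
```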
Considering a 50 pixel overlap between patches in both x and y directions, and thus a 50 pixel step size, the complete image can be evaluated in only two and a half minutes using our Darknet19 model. While this does increase when a smaller step size is selected, one can argue whether a smaller step size is actually needed, since there is already a 50% overlap of patches in both dimensions. Given that several further optimizations are possible (e.g. using multiple GPUs), we believe that even faster processing speeds are achievable. If we compare this to our boosted cascade based approaches, as shown in Table 3, our Viola&Jones model takes 10 minutes for a 10,000 × 10,000 pixel image, while the ACF model takes 5 minutes for the same resolution. Given the high top1-accuracy obtained with the deep learned models, one could definitely consider switching to these more advanced algorithms.
Finally, taking a look at the memory footprint of our deep learning models might be interesting for future research. For training on our NVIDIA TitanX instance, we made the batch sizes as large as possible to fill the complete 12GB of dedicated memory. However, at run time we process the image patch per patch, and thus the footprint is only 400MB for both models, which means the models can also be run on a low-end GPU, albeit slower.
6.5 Visual Results
Precision-recall curves or top1-accuracy results give a quantitative evaluation of the trained models, but for customers it is always interesting to see visual results as well. Therefore we developed a visualisation tool that allows visualising the output detections of any given model, with a specific colour code, as an overlay on top of the original input image, as seen in Figure 7. It shows the output of our VJ and ACF boosted cascade algorithms and of our deep learning classification approach. For visualisation purposes we need to select a fixed point on the precision-recall curve. This threshold is set at a precision of 90.46% and a recall of 81.12% for the Viola&Jones model, a precision of 90.55% and a recall of 86.43% for the ACF model, and a precision of 97.31% and a recall of 88.58% for the deep learning approach. Green patches are true positive detections (patches classified as coconut tree by the model and actually containing a coconut tree), red patches are false positives (patches classified as coconut tree by the model but not containing a coconut tree) and purple patches are false negatives (patches classified as background by the model but actually containing a coconut tree).
Table 2: Execution speeds for a full 10,000 × 10,000 pixel image for both deep learning models (Darknet19 and Densenet201), at different step sizes.

step   #patches    Darknet19   Densenet201
5px    3,924,361   4h          20h30m
25px   157,609     9m5s        50m20s
50px   39,601      2m30s       12m35s
Figure 7: Visual results for the (top) VJ boosted cascade model [P=90.46%, R=81.12%], the (middle) ACF boosted cascade model [P=90.55%, R=86.43%] and the (bottom) deep learned Darknet19 model [P=97.31%, R=88.58%] showing: (green) TP, (red) FP, (purple) FN.
Table 3: Configurations for the visual output, including precision, recall, training and inference time (for a 10,000 × 10,000 pixel image).

Model   Precision   Recall    Train   Infer
V&J     90.46%      81.12%    2h      10m
ACF     90.55%      86.43%    30m     5m
DN19    97.31%      88.58%    24h     2m30s
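A hypothetical sketch of how such an overlay can be rendered with OpenCV, assuming each evaluated patch position comes paired with its outcome label:

```python
import cv2

# BGR colour code for the three evaluation outcomes, as in Figure 7.
COLOURS = {"TP": (0, 255, 0), "FP": (0, 0, 255), "FN": (255, 0, 255)}

def draw_overlay(image, results, size=100, alpha=0.4):
    """results: iterable of (x, y, outcome) with outcome in {TP, FP, FN}."""
    overlay = image.copy()
    for x, y, outcome in results:
        cv2.rectangle(overlay, (x, y), (x + size, y + size),
                      COLOURS[outcome], thickness=-1)  # filled patch
    return cv2.addWeighted(overlay, alpha, image, 1 - alpha, 0)
```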
Comparing the different output images, we clearly see some expected behaviour. The VJ model suffers from a higher false positive rate than the ACF model. This can be explained by the fact that VJ does not take colour information into account and thus triggers several detections on coconut tree shadows, whereas ACF is more robust to this. Comparing the ACF model to the Darknet19 model, we see that the Darknet19 model has almost no false positive detections, hence the high precision at a high recall rate. However, the approach still suffers from false negative detections. We are convinced that this is partly due to the step size of 50 pixels used for this evaluation. Decreasing the step size towards 25 or even 10 pixels should further reduce the number of false negatives.
7 CONCLUSIONS
With this research we have proven the capabilities of both boosted cascade and deep learned detection models for coconut tree localisation in aerial images. Our best boosted cascade performs at an average precision of 94.55%, while our best deep learning model achieves a top1-accuracy of 97.4%. Although our deep learning pipeline evaluates twice as fast, we reckon that boosted cascades are still in the race, especially given their lower computational demands, but the high classification accuracy and speed of deep learning simply cannot be ignored.
As future work we suggest looking into region proposal networks to combine with our deep learning classification networks. This would reduce the number of image patches drastically and make the complete pipeline even faster. On top of that, we notice that more recent research focusses on combining the best of both worlds, as described in (Ouyang et al., 2017; Zhang et al., 2017). While using the principle of a boosted cascade to benefit from early rejection, the weak classifiers are built using convolutional neural network architectures, which promises a higher average precision in the long run.
ACKNOWLEDGEMENTS
This work is supported by KU Leuven, Campus De Nayer and Flanders Innovation & Entrepreneurship (VLAIO).
REFERENCES
Abadi, M., Agarwal, A., et al. (2015). TensorFlow: Large-scale machine learning on heterogeneous systems.

Ahonen, T., Hadid, A., and Pietikäinen, M. (2004). Face recognition with local binary patterns. In Proceedings of the European Conference on Computer Vision, pages 469-481.

Bradski, G. and Kaehler, A. (2000). The OpenCV library. Doctor Dobbs Journal, 25(11):120-126.

Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., and Fei-Fei, L. (2009). ImageNet: A large-scale hierarchical image database. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 248-255.

Dollár, P. (2005). Piotr's Computer Vision Matlab Toolbox (PMT). https://github.com/pdollar/toolbox.

Dollár, P., Belongie, S., and Perona, P. (2010). The fastest pedestrian detector in the west. In Proceedings of the British Machine Vision Conference, volume 2.

Dollár, P., Tu, Z., Perona, P., and Belongie, S. (2009). Integral channel features. In Proceedings of the British Machine Vision Conference, volume 2, pages 5-12.

Everingham, M., Van Gool, L., Williams, C. K., Winn, J., and Zisserman, A. (2010). The pascal visual object classes (VOC) challenge. International Journal of Computer Vision, 88(2):303-338.

Iandola, F., Moskewicz, M., Karayev, S., Girshick, R., Darrell, T., and Keutzer, K. (2014). DenseNet: Implementing efficient convnet descriptor pyramids. arXiv.

Jia, Y., Shelhamer, E., Donahue, J., Karayev, S., Long, J., Girshick, R., Guadarrama, S., and Darrell, T. (2014). Caffe: Convolutional architecture for fast feature embedding. In Proceedings of the International Conference on Multimedia, pages 675-678. ACM.

Krizhevsky, A., Sutskever, I., and Hinton, G. E. (2012). ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097-1105.

Lin, T.-Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., and Zitnick, C. L. (2014). Microsoft COCO: Common objects in context. In Proceedings of the European Conference on Computer Vision, pages 740-755. Springer.

Liu, W., Anguelov, D., Erhan, D., Szegedy, C., Reed, S., Fu, C.-Y., and Berg, A. C. (2016). SSD: Single shot multibox detector. In Proceedings of the European Conference on Computer Vision, pages 21-37. Springer.

Margineantu, D. D. and Dietterich, T. G. (1997). Pruning adaptive boosting. In Proceedings of the International Conference on Machine Learning, volume 97, pages 211-218.

Ouyang, W., Wang, K., Zhu, X., and Wang, X. (2017). Chained cascade network for object detection. In Proceedings of the IEEE International Conference on Computer Vision.

Puttemans, S., Van Ranst, W., and Goedemé, T. (2016a). Detection of photovoltaic installations in RGB aerial imaging: a comparative study. In Proceedings of GEOBIA.

Puttemans, S., Vanbrabant, Y., et al. (2016b). Automated visual fruit detection for harvest estimation and robotic harvesting. In Proceedings of the International Conference on Image Processing Theory, Tools and Applications.

Redmon, J. (2013). Darknet: Open source neural networks in C. http://pjreddie.com/darknet/.

Redmon, J., Divvala, S., Girshick, R., and Farhadi, A. (2016). You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 779-788.

Ren, S., He, K., Girshick, R., and Sun, J. (2015). Faster R-CNN: Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems, pages 91-99.

Shaikh, F., Sharma, A., Gupta, P., and Khan, D. (2016). A driver drowsiness detection system using cascaded AdaBoost. Imperial Journal of Interdisciplinary Research, 2(5).

Szegedy, C., Toshev, A., and Erhan, D. (2013). Deep neural networks for object detection. In Advances in Neural Information Processing Systems, pages 2553-2561.

Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., and Wojna, Z. (2016). Rethinking the inception architecture for computer vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2818-2826.

Targ, S., Almeida, D., and Lyman, K. (2016). Resnet in resnet: generalizing residual architectures. arXiv preprint arXiv:1603.08029.

Viola, P. and Jones, M. (2001). Rapid object detection using a boosted cascade of simple features. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, volume 1, pages I-511.

Zhang, K., Zhang, Z., Wang, H., Li, Z., Qiao, Y., and Liu, W. (2017). Detecting faces using inside cascaded contextual CNN. In Proceedings of the IEEE International Conference on Computer Vision.

Zheng, Y., Merkulov, A., and Bandari, M. (2016). Early breast cancer detection with digital mammograms using Haar-like features and AdaBoost algorithm. In Scientific Sensing and Imaging.