Using Whole and Part-based HOG Filters in Succession to Detect Cars in Aerial Images

Satish Madhogaria, Marek Schikora 1,2 and Wolfgang Koch

1 Dept. Sensor Data and Information Fusion, Fraunhofer FKIE, Wachtberg, Germany

2 Department of Computer Science, Technical University of Munich, Munich, Germany

Keywords:

Car Detection, Image Analysis, HOG, SVM, LSVM, Part Models, Aerial Images.

Abstract:

Vehicle detection in aerial images plays a key role in surveillance, transportation control and traffic monitoring. It is an important aspect of deploying autonomous Unmanned Aerial Systems (UAS) in rescue and surveillance missions. In this paper, we propose a two-stage algorithm for efficient detection of cars in aerial images. We discuss why a sophisticated detection technique may not give the best result when applied directly to large-scale images with complicated backgrounds. In the first stage, we use a relaxed version of HOG (Histogram of Oriented Gradients) and SVM (Support Vector Machine) to extract hypothesis windows. The second stage is based on discriminatively trained part-based models. We create a richer model for detection within the hypothesis windows by detecting and locating parts in the root object. The two-stage detection procedure not only improves the overall detection accuracy but also lets us take full advantage of the accuracy of sophisticated algorithms while avoiding their shortcomings in real scenarios. We analyze results obtained on a Google Earth data set and on images taken from a camera mounted beneath a flying aircraft. With our approach we achieve a recall rate of 90% with a precision of 94%.

1 INTRODUCTION

In this paper we address object detection in large-scale aerial images. In such images, car detection is one of the most challenging tasks, because cars appear very small and vary greatly in shape and size. Moreover, the appearance of an object within the observed scene changes frequently with flight altitude and camera orientation. The complexity of the problem and the scope for improving detection accuracy make it an important research topic. This work is motivated by the observation that, although aerial car detection has been attempted a number of times, there is still much room for improving the accuracy and efficiency of the task. Various approaches have been proposed for vehicle detection in aerial images: a neural network-based hierarchical model (Ruskone et al., 1996); gradient features used to create a generic model, with a Bayesian network for classification (Zhao and Nevatia, 2001); and extraction of geometric and radiometric features followed by detection with a top-down matching approach (Hinz, 2003; Nguyen et al., 2007). Han et al. (2006) proposed a two-stage method to detect people and vehicles that uses HOG+SVM as the final verification stage. HOG-based features (Dalal and Triggs, 2005) have consistently performed well in various object detection tasks; however, they have limitations for small objects such as cars in aerial images, because many details of the cars are not always visible. There have been attempts to combine HOG features with other feature extraction techniques to improve performance. The most recent is the work of Kembhavi et al. (2011), who combine HOG with Color Probability Maps and Pairs of Pixels to form a high-dimensional feature set and report good results. A comparison of results indicates that the proposed method improves detection performance.

The main aim of this work is to build an effective system that distinguishes cars from the background in aerial images with high accuracy. We propose a two-stage method for detecting vehicles in large-scale aerial images. We show that using the standard HOG filters (Dalal and Triggs, 2005) in two steps in series, one for root object detection and another for parts detection (Felzenszwalb et al., 2008), can greatly improve detection accuracy. First, we apply the HOG filter to extract hypothesis windows; then we apply part-based filters to each hypothesis window to detect parts at twice the resolution of the original image. Although part-based models are highly accurate, they are often avoided for big images because of efficiency issues. In addition, we show that when such sophisticated models are applied to large-scale images containing multiple small objects, they miss objects (see Figure 3). However, when selected windows are given as input to the part-based model, it performs impressively. In our approach, the first stage is more general and less constrained, while the second stage is strongly constrained by specific knowledge.

Figure 1: Proposed vehicle detection method.

Madhogaria S., Schikora M. and Koch W.

Using Whole and Part-based HOG Filters in Succession to Detect Cars in Aerial Images.

DOI: 10.5220/0004297406810686

In Proceedings of the International Conference on Computer Vision Theory and Applications (VISAPP-2013), pages 681-686

ISBN: 978-989-8565-47-1

© 2013 SCITEPRESS (Science and Technology Publications, Lda.)

We test our method on two different data sets. First, we create a library of training and test images from Google Earth. Our second set of test images consists of high-resolution images taken from a camera mounted on an aircraft. With our approach, we achieve a detection rate of more than 90% with a precision of 94%. We also show that our approach achieves higher accuracy than either step applied individually to the test images.

In Section 2, we describe the methods adapted for the vehicle detection task, followed by the performance analysis and results (Section 3) and the conclusion (Section 4).

2 SYSTEM OVERVIEW

Figure 1 shows how the two steps work in series. The overall system is based on HOG filters. In the first stage, a relaxed version of the HOG+SVM method is used to generate hypothesis windows. We deviate from the standard HOG+SVM (Dalal and Triggs, 2005) in several ways in order to achieve a negligible or very low miss rate. The hypothesis windows are generated at multiple orientations. These subwindows and their Cartesian coordinates in image space serve as input to the second stage. The second stage is highly constrained: part-based filters verify the presence of object parts in the hypothesis windows. The part-based filters are applied at double the resolution at which the single HOG filters are applied. The part filters give an overall score to each hypothesis window, and the decision whether it contains a car is made by thresholding that score. Finally, non-maximum suppression is used to remove overlapping windows.
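The final suppression step can be illustrated with a minimal sketch. The text does not specify which suppression variant is used, so the greedy, overlap-based scheme below is an assumption, as are the function names and the `max_overlap` value of 0.5.

```python
from typing import List, Tuple

# A box is (x_min, y_min, x_max, y_max) in image coordinates.
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def non_max_suppression(boxes: List[Box], scores: List[float],
                        max_overlap: float = 0.5) -> List[int]:
    """Greedy suppression: visit boxes from highest to lowest score and keep
    a box only if it does not overlap an already kept box too strongly.
    Returns the indices of the kept boxes, highest score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    kept: List[int] = []
    for i in order:
        if all(iou(boxes[i], boxes[k]) <= max_overlap for k in kept):
            kept.append(i)
    return kept
```

In this sketch the scores would be the second-stage (part-based) scores, so that among several overlapping windows around one car only the best-scoring window survives.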

2.1 Relaxed HOG+SVM Detection

To create a less constrained model, we use "histogram of oriented gradients" (HOG) feature descriptors (Dalal and Triggs, 2005) to extract features that can resemble a car. Since HOG features were introduced to detect people, there have been continual modifications to the standard HOG to improve the detection of people as well as other objects in images (Wang and Zhang, 2008; Monzo et al., 2011; Meng et al., 2012). In this work, we use HOG features and a linear SVM classifier (Cortes and Vapnik, 1995; Chang and Lin, 2001) to extract hypothesis windows. HOG features count the occurrences of gradient orientations within overlapping rectangular blocks in the search window. HOG filters are rectangular templates defining weights for features. Let x be an image subwindow and Θ(x) denote its extracted feature. x is labeled a "hypothesis window" if

$$f(x) > 0, \qquad f(x) = w \cdot \Theta(x) \qquad (1)$$

where w is the filter, obtained by training a linear SVM on positive and negative training samples.
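As a minimal sketch of the two ingredients of Equation (1), the code below builds a gradient-orientation histogram for a single cell and evaluates the linear score w · Θ(x). It is a simplification under stated assumptions: a full HOG descriptor concatenates many cells and normalizes them over overlapping blocks, which is omitted here, and the function names and the choice of 9 unsigned orientation bins are illustrative.

```python
import math
from typing import List, Sequence


def orientation_histogram(patch: Sequence[Sequence[float]],
                          bins: int = 9) -> List[float]:
    """Magnitude-weighted histogram of unsigned gradient orientations
    (0 to 180 degrees) over the interior pixels of one grayscale cell."""
    hist = [0.0] * bins
    bin_width = 180.0 / bins
    for y in range(1, len(patch) - 1):
        for x in range(1, len(patch[0]) - 1):
            gx = patch[y][x + 1] - patch[y][x - 1]  # central differences
            gy = patch[y + 1][x] - patch[y - 1][x]
            magnitude = math.hypot(gx, gy)
            if magnitude == 0.0:
                continue
            angle = math.degrees(math.atan2(gy, gx)) % 180.0
            index = min(bins - 1, int(angle / bin_width))  # guard rounding
            hist[index] += magnitude
    return hist


def filter_score(w: Sequence[float], theta: Sequence[float]) -> float:
    """f(x) = w . Theta(x), the linear filter response of Equation (1)."""
    if len(w) != len(theta):
        raise ValueError("filter and feature vector must have equal length")
    return sum(wi * ti for wi, ti in zip(w, theta))


def is_hypothesis_window(w: Sequence[float], theta: Sequence[float]) -> bool:
    """A subwindow is a hypothesis window when f(x) > 0."""
    return filter_score(w, theta) > 0.0
```

Here `theta` stands for the feature vector Θ(x) of one subwindow and `w` for the weights learned by the linear SVM; in practice both would be much longer than a single 9-bin histogram.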

Figure 2: (a) shows the HOG filter, (b) shows the higher-resolution part filters, and (c) shows the deformation model, which defines the cost of placing part filters inside the root filter. (d) shows the parts located in a hypothesis window.

To Achieve High Detection Rate in First Stage:

1. In the detection step, we keep the window stride low so as to obtain as many detections as possible around the same object. This in turn increases the probability that the object is detected in the second stage.
2. The threshold, defined by the distance between the feature and the SVM classifying plane, is kept lower than usual to improve the detection rate. We conducted several initial experiments with different threshold values and chose the one that detects nearly all the cars. Although this results in a high number of false detections, the overall performance of the detector is hardly affected, because of the highly accurate second stage of our algorithm.
3. We do not suppress overlapping windows in the first stage, since the detection rate improves slightly when the second classifier is given multiple windows around the same object.
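The three relaxations listed above can be sketched as a dense sliding-window scan that keeps every window above a lowered threshold and applies no suppression. The stride of 4 pixels, the function names and the `score_fn` callback are assumptions made for illustration; only the 48x96 window size comes from the text.

```python
from typing import Callable, Iterator, List, Tuple

Box = Tuple[int, int, int, int]  # (x_min, y_min, x_max, y_max)


def sliding_windows(img_w: int, img_h: int, win_w: int = 48,
                    win_h: int = 96, stride: int = 4) -> Iterator[Box]:
    """Yield every window position; a small stride gives many
    overlapping windows around each object."""
    for y in range(0, img_h - win_h + 1, stride):
        for x in range(0, img_w - win_w + 1, stride):
            yield (x, y, x + win_w, y + win_h)


def hypothesis_windows(windows: Iterator[Box],
                       score_fn: Callable[[Box], float],
                       threshold: float) -> List[Tuple[Box, float]]:
    """Keep every window whose score exceeds the relaxed threshold.
    Overlapping windows are deliberately not suppressed here."""
    kept = []
    for box in windows:
        score = score_fn(box)
        if score > threshold:
            kept.append((box, score))
    return kept
```

Lowering `threshold` or `stride` increases the number of windows passed to the second stage, which trades first-stage precision for recall.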

The window size also plays an important role in the performance of the detector. We chose a 48x96 window to represent the object. Experiments show that a window smaller than 48x96 reduces the detection rate. To detect objects at multiple scales, the input image is upscaled or downscaled depending on the altitude.

Rotation-invariant Detection in the First Stage. Because the first stage is relatively fast owing to its parallel implementation, we detect cars at all possible orientations in this stage. Since HOG features provide slight invariance to rotation (depending on the number of orientation bins), we do not use smaller angles but rotate the image in steps of 30° up to 150°. The detected window coordinates are rotated and translated back with respect to the input image. The saved window coordinates and the patches representing subwindows in the input image serve as input to the second stage. Since the image is rotated 6 times and processed to detect cars at each rotation angle, this adds to the overall time taken to evaluate the image. Currently, it takes less than 1 second on a 2.8 GHz Intel processor with an NVIDIA GeForce GT 430 graphics card to extract hypothesis windows at all rotations from a 1000x1000 image.
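Mapping a detection from a rotated image back to the input image can be sketched as follows. The sketch assumes that the image is rotated about its centre and that the rotated image shares that centre, so no extra translation is needed; if the rotated canvas is enlarged, an additional offset would have to be applied. The function names are illustrative.

```python
import math
from typing import Tuple

Point = Tuple[float, float]

# The six orientations described in the text: 0, 30, ..., 150 degrees.
ANGLES = list(range(0, 180, 30))


def rotate_point(x: float, y: float, angle_deg: float,
                 cx: float, cy: float) -> Point:
    """Rotate the point (x, y) by angle_deg about the centre (cx, cy)."""
    a = math.radians(angle_deg)
    dx, dy = x - cx, y - cy
    return (cx + dx * math.cos(a) - dy * math.sin(a),
            cy + dx * math.sin(a) + dy * math.cos(a))


def window_center_in_input(center_rotated: Point, angle_deg: float,
                           image_center: Point) -> Point:
    """Map a window centre found in the image rotated by angle_deg back
    into the input image by applying the inverse rotation."""
    return rotate_point(center_rotated[0], center_rotated[1],
                        -angle_deg, image_center[0], image_center[1])
```

The same inverse rotation applied to the four window corners gives the oriented window in the input image.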

2.2 Part-based Detection

We now build a model that is strongly constrained by part locations within the whole object. For this purpose we adapt the approach described in (Felzenszwalb et al., 2010) as the second-stage detection model. Part-based models build on the pictorial structures framework, first introduced in (Fischler and Elschlager, 1973). The main concept introduced in (Felzenszwalb et al., 2008) was the "latent SVM", which enables the use of part positions as latent variables. The latent SVM formulation of Equation (1) is

$$f_\beta(x) = \max_{z \in Z(x)} \beta \cdot \Theta(x, z) \qquad (2)$$

where β is the concatenation of the root (whole-object) filter, the part filters and the deformation cost weights; z are the latent values, in this case the part placements, with Z(x) the set of possible latent values for x; and Θ(x, z) is the concatenation of subwindow features and part deformation features. Part filters are defined at double the resolution of the root filter, which means that they represent finer edges than the root filter. The model for an object with n parts is defined by a root filter and a set of part models $(P_1, \ldots, P_n)$. To decide whether a hypothesis window contains a car, we score the window according to the best possible placement of the parts and threshold this score. A placement of a model in HOG feature space is defined by $z = (p_0, \ldots, p_n)$, where $p_0$ is the location of the root filter and $p_1, \ldots, p_n$ are the locations of the part filters. The score of a placement z is β · Θ(x, z), the quantity maximized in Equation (2). For further details on how the model is trained using latent variables, we recommend (Felzenszwalb et al., 2008).
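The maximization over part placements in Equation (2) can be sketched for a single hypothesis window with a fixed root location. The sketch assumes that each part is placed independently given the root, that part filter responses are already available as a map from location to response, and that the deformation cost is a single-weight quadratic penalty on the displacement from an anchor position. These are simplifications for illustration, not the trained model described above, and all names are hypothetical.

```python
from typing import Dict, List, Optional, Tuple

Loc = Tuple[int, int]


def score_window(root_score: float,
                 part_responses: List[Dict[Loc, float]],
                 anchors: List[Loc],
                 deform_weight: float = 0.1
                 ) -> Tuple[float, List[Optional[Loc]]]:
    """Score a window by the best placement of its parts.

    For every part, choose the location that maximizes
    (filter response - deformation cost), then add the root score.
    Returns the total score and the chosen part locations."""
    total = root_score
    placement: List[Optional[Loc]] = []
    for responses, (ax, ay) in zip(part_responses, anchors):
        best_loc, best_val = None, float("-inf")
        for (px, py), response in responses.items():
            cost = deform_weight * ((px - ax) ** 2 + (py - ay) ** 2)
            if response - cost > best_val:
                best_loc, best_val = (px, py), response - cost
        total += best_val
        placement.append(best_loc)
    return total, placement


def contains_car(score: float, threshold: float) -> bool:
    """Final decision: threshold the best-placement score."""
    return score > threshold
```

With 6 parts, as used in this work, `part_responses` and `anchors` would each hold six entries, one per part filter.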

Improving the Accuracy of the Part-based Detector:

1. When given a small search area, in this case the "hypothesis windows", the object detector automatically becomes more efficient, since detection need not be performed at multiple scales and rotations. In this case we have fixed-size windows on which parts are located using the part filters, and a confidence value is generated based on the location of the parts in the whole object.
2. We use 6 part filters, as this gives a slight improvement in detection rate compared with 4 or 5 parts.
3. Since the part models are used as the final deciding model, we can increase the threshold (the distance between the classifying plane and the feature vector) slightly to reduce false alarms while keeping the recall rate constant, thereby achieving greater precision in the overall detection.

Apart from improving the detection rate (see Table 1), the two-stage approach has several advantages. First, for effective detection in a sliding-window approach, the part-based decision model must be applied at all positions (and at all orientations, if we want rotation-invariant detection). Considering only the positions, the decision model would have to make decisions for more than 900,000 windows in a 1000x1000 image. Currently, these models are not fast enough to be used on such large images. With our approach, we generate hypothesis windows using the parallel implementation of HOG+SVM. The number of windows given as input to the second stage is reduced to a few hundred, as against close to a million if we were to evaluate the image directly with the part-based detection method. Second, rotation invariance and scale are handled in the much faster first stage of our algorithm. The second stage therefore does not need to evaluate multiple scales and orientations, which makes it more efficient as well as highly accurate.
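As a rough check on the scale of this reduction, the number of sliding-window positions can be counted directly. The sketch assumes a stride of one pixel, the 48x96 window of Section 2.1 and no border padding; none of these is stated for the count quoted above, so the result only indicates the order of magnitude.

```python
def count_window_positions(img_w: int, img_h: int,
                           win_w: int, win_h: int, stride: int = 1) -> int:
    """Number of window positions that fit fully inside the image."""
    if img_w < win_w or img_h < win_h:
        return 0
    nx = (img_w - win_w) // stride + 1
    ny = (img_h - win_h) // stride + 1
    return nx * ny


positions = count_window_positions(1000, 1000, 48, 96)  # 953 * 905
hypotheses = 500  # upper end of the 50 to 500 range given in Section 2.3
reduction = positions / hypotheses
print(positions, round(reduction))
```

Under these assumptions the part-based model is evaluated on well over a thousand times fewer windows than in an exhaustive scan of a 1000x1000 image, before orientations are even considered.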

2.3 Using Two Detectors in Sequence

In many cases, a number of weak classifiers are used in series, with the decision passed from one to the next. Normally, different sets of training samples are used to generate the weak classifiers, and their combination gives the final decision. In this case, however, we use two strong classifiers trained on the same set of training samples. To use two classifiers in series, we should try not to miss objects in the first stage, which is why we relax the detection parameters of the first stage. Given the range of our test images, we determine a detection threshold that ensures that a minimal number of cars are missed. Figure 5 shows one example in which we reduce the threshold value (from (a) to (c)) so that all cars are detected. This, however, also generates many false windows. Together, we call all of these "hypothesis windows". Depending on the size and complexity of the image, the number of such windows can be anywhere between 50 and 500 (note that rotation-invariant detection increases this number considerably). In this example, all the cars are detected with a threshold of 0.8, and we use the same threshold value for evaluating all our test images. Using the hypothesis windows from the first stage also allows us to increase the threshold of the part-based detector to reduce false alarms in the second stage. We also compare the results obtained separately from the standard HOG+SVM classifier (Dalal and Triggs, 2005), the part-based classifier (Felzenszwalb et al., 2010) and our approach (see Figure 3 and Table 1). For this comparison, we evaluate the images at a fixed orientation, as Felzenszwalb's part-based model is not rotation-invariant. Table 1 shows that our approach exceeds the detection rate of the part-based method by about 9 percentage points and that of the HOG+SVM method by about 26 percentage points. This demonstrates the advantage of using a whole filter and part filters in succession, as against part-based detection alone, in a large-scale image.

Figure 3: Example comparing 3 different approaches: (a) standard HOG+SVM approach; (b) part-based detection in the entire image. We see that a sophisticated approach such as HOG part-based models misses objects when applied to a large image. However, with our approach, in which hypothesis windows are given as input to the part-based approach, detection is improved to a great extent.

Table 1: Performance Comparison.

| | Dalal & Triggs | Felzenszwalb et al. | Our approach |
|---|---|---|---|
| No. of images processed (fixed orientation) | 32 | 32 | 32 |
| No. of cars present | 240 | 240 | 240 |
| Detection rate | 65.1% | 82.2% | 91.1% |
| False alarm rate | 42% | 5.2% | 6% |

Overall comparison of the 3 methods in terms of "detection rate" and "false alarm rate".
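The threshold selection described in this section (lowering the first-stage threshold until nearly all cars are detected) can be sketched as a search over candidate thresholds. The sketch assumes that first-stage scores are available for annotated cars in a set of tuning images; the function names and the example scores in the usage note are hypothetical.

```python
from typing import Iterable, List, Optional


def recall_at(threshold: float, car_scores: List[float]) -> float:
    """Fraction of annotated cars whose best first-stage score
    exceeds the threshold."""
    if not car_scores:
        raise ValueError("need at least one annotated car")
    return sum(s > threshold for s in car_scores) / len(car_scores)


def pick_relaxed_threshold(candidates: Iterable[float],
                           car_scores: List[float],
                           target_recall: float = 1.0) -> Optional[float]:
    """Return the highest candidate threshold that still reaches the
    target recall, or None if no candidate does. Choosing the highest
    such value keeps the number of false windows as low as possible."""
    for t in sorted(candidates, reverse=True):
        if recall_at(t, car_scores) >= target_recall:
            return t
    return None
```

For instance, with candidate thresholds 1.2, 1.0 and 0.8 (the values shown in Figure 5) and hypothetical car scores of 1.5, 1.1, 0.9 and 1.3, only the threshold 0.8 detects every car, so it is the one selected.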

3 RESULTS

We verify the performance of our method using images taken from Google Earth. The data set consists of 35 images with varying urban backgrounds and multiple cars in each image, with image sizes ranging from approximately 700x700 to 1200x1200. The training data (Figure 4) consist of about 200 car images and 600 non-car images. In these experiments, we kept the window size fixed at 48x96, because the size of cars stays more or less within a confined window size for a given altitude. For varying altitudes, the input image should be upscaled or downscaled depending on the height at which the image was taken. Figure 6 shows sample results, one from Google Earth and one from a flight experiment. In the first stage, the hypothesis windows are generated at multiple orientations. Each of these subwindows is validated by the part-based models in the second stage. Figure 8 displays a few more results obtained with our approach. The performance of our system is analyzed by means of the precision-recall curve shown in Figure 7. The precision and recall rates remain above 85% for all our test images. Table 2 summarizes the overall performance. With this method, we detect 90% of all cars with a precision of 94%. It is worth mentioning that, of the 374 cars present in the test data set, 21 were missed in the first stage itself because of occlusion or shadows, which means that the actual recall rate of the second stage stands at 95%.

Figure 4: Training data samples: (a) positive samples; (b) negative samples. Google 2011.

Figure 5: (a) Threshold = 1.2; (b) Threshold = 1.0; (c) Threshold = 0.8; (d) part-based detection applied to (c). The first stage is designed in such a way that it detects nearly all the cars in our test data set. From (a) to (c), lowering the threshold results in all the cars being detected. We call these detections the hypothesis windows, which are given as input to the part-based detection method. (d) shows the final result of the classifier.
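The second-stage recall quoted above can be reproduced from the reported counts. Because the overall recall is given only as a rounded 90%, the number of detected cars in this sketch is an estimate, and the function name is illustrative.

```python
from typing import Tuple


def stage_recalls(total_cars: int, missed_in_stage1: int,
                  overall_recall: float) -> Tuple[int, int, float, float]:
    """Split the overall recall into first-stage and second-stage recall.

    Returns (cars reaching stage 2, estimated cars detected,
    first-stage recall, second-stage recall)."""
    reaching_stage2 = total_cars - missed_in_stage1
    detected = round(overall_recall * total_cars)  # estimate from rounded rate
    stage1_recall = reaching_stage2 / total_cars
    stage2_recall = detected / reaching_stage2
    return reaching_stage2, detected, stage1_recall, stage2_recall


reaching, detected, r1, r2 = stage_recalls(374, 21, 0.90)
print(reaching, detected, round(r1, 3), round(r2, 3))
```

Of the 374 cars, 353 reach the second stage; about 337 detections out of these 353 correspond to a second-stage recall of roughly 95%, and the overall recall is the product of the two stage recalls.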

Figure 6: Sample results from our two-stage approach: (a) Google Earth image; (b) image from flight experiment. The rotation-invariant method is able to detect cars at all orientations with a high recall rate and good precision.

4 CONCLUSIONS

We presented a two-stage approach to detecting cars in aerial images. Instead of chaining several classifiers in series (which is the more usual practice), we apply two strong classifiers one after the other. In the process, we raise the detection rate of the first classifier so as not to miss objects in the first stage, and improve the precision of the second classifier. As a result, we achieve a high recall rate together with a very high precision rate. Although the results are very good in terms of accuracy, more work is required to make the system robust. Given that the proposed system performs well, we are interested in a faster implementation of sophisticated approaches such as part-based detection methods, so that objects can be detected in large images in real time. In addition, we expect to develop a more efficient rotation-invariance scheme for use in the first stage.

Figure 7: Performance of the two-stage algorithm on the Google Earth data set.

Figure 8: Some more results from Google Earth images.

Table 2: Performance of our two-stage approach.

| No. of images processed | No. of cars | Overall recall rate | Overall precision rate |
|---|---|---|---|
| 36 | 374 | 90% | 94% |

The overall recall and precision rates summarize the performance obtained with our approach.

REFERENCES

Chang, C.-C. and Lin, C.-J. (2001). LIBSVM: a library for support vector machines.

Cortes, C. and Vapnik, V. (1995). Support vector networks. In Machine Learning, volume 20, pages 273–297.

Dalal, N. and Triggs, B. (2005). Histograms of oriented gradients for human detection. In International Conference on Computer Vision & Pattern Recognition, volume 2, pages 886–893.

Felzenszwalb, P., McAllester, D., and Ramanan, D. (2008). A discriminatively trained, multiscale, deformable part model. In Computer Vision and Pattern Recognition, 2008. CVPR 2008. IEEE Conference on.

Felzenszwalb, P. F., Girshick, R. B., McAllester, D., and Ramanan, D. (2010). Object detection with discriminatively trained part-based models. IEEE Trans. Pattern Anal. Mach. Intell., 32(9).

Fischler, M. A. and Elschlager, R. A. (1973). The representation and matching of pictorial structures. IEEE Trans. Comput., 22.

Han, F., Shan, Y., Cekander, R., Sawhney, H., and Kumar, R. (2006). A two-stage approach to people and vehicle detection with HOG-based SVM. In The 2006 Performance Metrics for Intelligent Systems Workshop.

Hinz, S. (2003). Detection and counting of cars in aerial images. In International Conference on Image Processing.

Kembhavi, A., Harwood, D., and Davis, L. (2011). Vehicle detection using partial least squares. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 33(6):1250–1265.

Meng, X., Lin, J., and Ding, Y. (2012). An extended HOG model: SCHOG for human hand detection. In Systems and Informatics (ICSAI), 2012 International Conference on.

Monzo, D., Albiol, A., Albiol, A., and Mossi, J. (2011). Color HOG-EBGM for face recognition. In Image Processing (ICIP), 2011 18th IEEE International Conference on.

Nguyen, T., Grabner, H., Bischof, H., and Gruber, B. (2007). On-line boosting for car detection from aerial images. In International Conference on Research, Innovation and Vision for the Future.

Ruskone, R., Guigues, L., Airault, S., and Jamet, O. (1996). Vehicle detection on aerial images: A structural approach. In International Conference on Pattern Recognition, pages 900–904.

Wang, Q. J. and Zhang, R. B. (2008). LPP-HOG: A new local image descriptor for fast human detection. In Knowledge Acquisition and Modeling Workshop, 2008. KAM Workshop 2008. IEEE International Symposium on.

Zhao, T. and Nevatia, R. (2001). Car detection in low resolution aerial image. In International Conference on Computer Vision.

VISAPP2013-InternationalConferenceonComputerVisionTheoryandApplications

686