Table 3: Clopper-Pearson intervals for the experiment results.

Dataset   Feature   CP Interval
UIUC      Canny     [92.48%, 98.69%]
UIUC      SSE       [97.85%, 100.0%]
IAIR      Canny     [78.83%, 90.48%]
IAIR      SSE       [87.81%, 96.45%]
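The Clopper-Pearson interval is the exact binomial confidence interval, computed from quantiles of the beta distribution. A minimal sketch in Python with SciPy, assuming illustrative success counts (the raw counts behind Table 3 are not restated here):

```python
from scipy.stats import beta

def clopper_pearson(successes: int, trials: int, alpha: float = 0.05):
    """Exact two-sided Clopper-Pearson interval for a binomial proportion."""
    # Lower/upper bounds are beta quantiles; the edge cases avoid NaNs.
    lower = 0.0 if successes == 0 else beta.ppf(alpha / 2, successes, trials - successes + 1)
    upper = 1.0 if successes == trials else beta.ppf(1 - alpha / 2, successes + 1, trials - successes)
    return lower, upper

# Illustrative counts only: 105 of 108 test images localized correctly.
lo, hi = clopper_pearson(105, 108)
print(f"95% Clopper-Pearson interval: [{lo:.2%}, {hi:.2%}]")
```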
The overlap values are not normally distributed: the Shapiro-Wilk test (Shapiro and Wilk, 1965) yields p-values below $10^{-3}$ for all experiments. Comparing the independent groups Canny and SSE for both datasets with the Wilcoxon-Mann-Whitney test, we obtain p-values of $9.86 \times 10^{-6}$ and $1.872 \times 10^{-3}$ for UIUC and IAIR, respectively. Thus, the mean overlap value $a_0$ for the SSE edge detection tests is larger, at the 95% confidence level, than the mean overlap for the Canny edge detection tests.
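A minimal sketch of this testing procedure with SciPy, whose mannwhitneyu function implements the Wilcoxon-Mann-Whitney test; the arrays below are random placeholders standing in for the per-image overlap values $a_0$ of the two feature types:

```python
import numpy as np
from scipy.stats import shapiro, mannwhitneyu

rng = np.random.default_rng(0)
# Placeholder data: per-image overlap values for each feature type.
overlap_canny = rng.uniform(0.5, 0.9, size=100)
overlap_sse = rng.uniform(0.6, 1.0, size=100)

# Shapiro-Wilk: small p-values reject the normality assumption ...
for name, sample in (("Canny", overlap_canny), ("SSE", overlap_sse)):
    w, p = shapiro(sample)
    print(f"Shapiro-Wilk ({name}): W = {w:.3f}, p = {p:.2e}")

# ... so a nonparametric two-sample test compares the independent groups.
# One-sided alternative: SSE overlaps tend to be larger than Canny overlaps.
u, p = mannwhitneyu(overlap_sse, overlap_canny, alternative="greater")
print(f"Wilcoxon-Mann-Whitney: U = {u:.1f}, p = {p:.2e}")
```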
Additionally, we statistically evaluated the resulting localization errors for Canny and SSE edge features in the same way as described above. We obtain p-values of $5.118 \times 10^{-3}$ and $1.696 \times 10^{-3}$ for UIUC and IAIR, respectively. Therefore, the mean localization error for the SSE edge detection tests is lower, at the 95% confidence level, than the mean localization error for the Canny edge detection tests.
5 CONCLUSIONS
We have shown that the object localization performance obtained by the voting-based DGHT approach in real-world tasks with variable background and clutter can be significantly improved by a sophisticated edge detection algorithm, namely the Structured Edge Detector. This applies to general structured edge features without additional training effort and, in particular, to category-specific Structured Edge Detectors. More precisely, we obtained absolute improvements in localization accuracy of 3.53% and 7.64% on a car and a pedestrian localization task, respectively. We conclude that the DGHT framework can also be used successfully for object localization in real-world images with larger and more variable background.
In future work, we aim to integrate an intelligent
edge detection mechanism into the voting framework
and to explore strategies to handle object variability
(e.g. object size, rotation) as well as multi-object and
multi-class localization.
Figure 8: Error case: the detected pedestrian is not within the annotated height range of 130–170 px. Bounding box colors: yellow: prediction; green: ground truth annotation; red: not in ground truth, because height ∉ [130, 170] px. (Best viewed in color.)
REFERENCES
Agarwal, S., Awan, A., and Roth, D. (2004). Learning to detect objects in images via a sparse, part-based representation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 26(11):1475–1490.
Agarwal, S. and Roth, D. (2002). Learning a sparse representation for object detection. In Computer Vision – ECCV 2002, pages 113–127. Springer.
Andriluka, M., Roth, S., and Schiele, B. (2008). People-tracking-by-detection and people-detection-by-tracking. In IEEE Conference on Computer Vision and Pattern Recognition, pages 1–8. IEEE.
Arbelaez, P., Maire, M., Fowlkes, C., and Malik, J. (2011). Contour detection and hierarchical image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33(5):898–916.
Ballard, D. H. (1981). Generalizing the Hough transform to detect arbitrary shapes. Pattern Recognition, 13(2):111–122.
Breiman, L. (2001). Random forests. Machine Learning, 45(1):5–32.
Canny, J. (1986). A computational approach to edge detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 8(6):679–698.
Chaohui, Z., Xiaohui, D., Shuoyu, X., Zheng, S., and Min, L. (2007). An improved moving object detection algorithm based on frame difference and edge detection. In Fourth International Conference on Image and Graphics, pages 519–523. IEEE.
Dollár, P. and Zitnick, C. L. (2013). Structured forests for fast edge detection. In IEEE International Conference on Computer Vision, pages 1841–1848. IEEE.
Dollár, P. and Zitnick, C. L. (2014). Fast edge detection using structured forests. arXiv preprint arXiv:1406.5549.
Ecabert, O., Peters, J., Schramm, H., Lorenz, C., Von Berg, J., Walker, M. J., Vembar, M., Olszewski, M. E., Subramanyan, K., Lavi, G., et al. (2008). Automatic model-based segmentation of the heart in CT images. IEEE Transactions on Medical Imaging, 27(9):1189–1201.