The paper is organised as follows. Section 2 discusses related work, then Section 3 describes the details of the method. The results in Section 4 show that the method can distinguish planes from non-planes and reliably predict their orientation in a variety of situations, and we conclude in Section 5 with suggestions for future work.
2 RELATED WORK
A standard way to obtain geometry from a single image is the use of vanishing points – for example (Košecká and Zhang, 2005) rely on the orthogonality of planes to group lines and hypothesise rectangles, from which the pose of the camera can be recovered.
Similarly, (Mičušík et al., 2008) treat rectangle detection as a labelling problem, and use the detected planes’ orientation for wide baseline matching.
Another cue which may be exploited is the distinctive appearance of certain parts of images. The method most similar to our own is that of (Hoiem et al., 2007), which classifies ‘super-pixels’ into geometric classes, with orientations limited to being either horizontal, left, right, or front facing. A variety of features are used to create a coherent grouping from the initial super-pixels, resulting in an estimate of scene layout which has been used to create simple 3D models and for object recognition.
(Saxena et al., 2008) focus on the related task of estimating depth, by training on range data from a laser scanner. From absolute and relative depth estimates at individual regions, a Markov Random Field is used to find a consistent depth map over the whole image. This has been used for sophisticated 3D model building, and to drive a high-speed toy car (Michels et al., 2005).
These methods show considerable progress in understanding single images; however, they either rely on a restrictive ‘Manhattan’ assumption, or, when applicable to more general scenes, can only obtain very coarse orientation or depth.
3 METHOD
Here we give an overview of our method, with more details in subsequent subsections. First, we gather a database of training examples, and manually assign a class (plane or not plane) and orientation (normal vector). Then the class and orientation of new regions are estimated using a K-Nearest Neighbour classifier, with similarity between regions evaluated as follows.
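Before developing the similarity measure, a minimal sketch of the nearest-neighbour step may be useful. Here Euclidean distance stands in for our region similarity, and all identifiers are illustrative rather than taken from our implementation:

```python
import numpy as np

def knn_predict(query, train_feats, train_labels, train_normals, k=5):
    """Classify a region and estimate its normal from its k nearest
    training examples (Euclidean distance as a stand-in for the
    region similarity developed in this section)."""
    dists = np.linalg.norm(train_feats - query, axis=1)
    nearest = np.argsort(dists)[:k]
    # Majority vote for the plane / non-plane class.
    is_plane = np.mean(train_labels[nearest]) > 0.5
    normal = None
    if is_plane:
        # Average the planar neighbours' unit normals, then renormalise.
        planar = nearest[train_labels[nearest] == 1]
        normal = train_normals[planar].mean(axis=0)
        normal /= np.linalg.norm(normal)
    return is_plane, normal
```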
Figure 2: Examples of the training data we use, showing the manually selected region of interest and plane orientation (regions (a)-(d)); examples (e) and (f) were obtained by warping the original images.
We use histograms of oriented gradients to describe the local appearance at salient points in an image region; since these are not informative enough on their own, we accumulate information using a bag of words approach, applying a variant of Latent Semantic Analysis (Deerwester et al., 1990) for dimensionality reduction.
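A rough sketch of this stage follows; the HOG extraction and the visual vocabulary are taken as given, and the hard quantisation and truncated-SVD projection shown here are illustrative assumptions, not our exact pipeline:

```python
import numpy as np

def bag_of_words(descriptors, vocabulary):
    """Quantise HOG descriptors at salient points against a visual
    vocabulary and accumulate a word-count histogram."""
    # Assign each descriptor to its nearest visual word.
    dists = np.linalg.norm(
        descriptors[:, None, :] - vocabulary[None, :, :], axis=2)
    words = np.argmin(dists, axis=1)
    hist = np.bincount(words, minlength=len(vocabulary))
    return hist.astype(float)

def lsa_project(hist, basis):
    """Reduce a word histogram to a vector of latent 'topics'.
    `basis` holds the top left singular vectors of the training
    term-document matrix (truncated SVD, in the spirit of LSA)."""
    return basis.T @ hist
```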
The resulting vectors of latent ‘topics’ can be used for classification and orientation estimation, but performance is improved by also considering their spatial configuration, which we represent using a histogram augmented with means and covariances – a ‘spatiogram’ (Birchfield and Rangarajan, 2005); as far as we are aware, using a spatiogram with a bag of words is novel. Further technical details can be found in (Haines and Calway, 2011).
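A sketch of such a second-order spatiogram, assuming hard word assignments for simplicity (our version may differ in detail):

```python
import numpy as np

def spatiogram(positions, words, n_bins):
    """Second-order spatiogram: for each bin, the count of points
    plus the mean and covariance of their image positions
    (Birchfield and Rangarajan, 2005)."""
    counts = np.zeros(n_bins)
    means = np.zeros((n_bins, 2))
    covs = np.zeros((n_bins, 2, 2))
    for b in range(n_bins):
        pts = positions[words == b]
        counts[b] = len(pts)
        if len(pts) > 0:
            means[b] = pts.mean(axis=0)
            # Identity is a placeholder covariance for singleton bins.
            covs[b] = np.cov(pts.T) if len(pts) > 1 else np.eye(2)
    return counts, means, covs
```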
3.1 Training Data
We collect training images of planes and non-planes from a variety of outdoor locations; these have a resolution of 320 × 240 pixels, and have been corrected for radial distortion. For each image we mark a region of interest, and assign it to the plane or non-plane class as appropriate. To get the true orientation, corners of a quadrilateral are marked, corresponding to a real rectangle; this defines two orthogonal sets of parallel lines, whose intersections define vanishing points $v_1$ and $v_2$. From this we calculate $n$, the normal vector of the plane, using $n = K^T l$, where $l = v_1 \times v_2$ is the vanishing line of the plane and $K$ is the 3 × 3 matrix encoding the camera parameters (see figure 2).
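For concreteness, this computation can be sketched directly from the equations above; the convention that the four corners are given in order around the quadrilateral is our assumption:

```python
import numpy as np

def plane_normal(corners, K):
    """Normal of a plane from the four marked corners of a real
    rectangle, given as 2D points in order around the quadrilateral,
    with K the 3x3 camera calibration matrix."""
    a, b, c, d = [np.append(p, 1.0) for p in corners]  # homogeneous
    # A line through two points is their cross product; two lines
    # meet at their cross product. Opposite edges meet in the
    # vanishing points v1 and v2.
    v1 = np.cross(np.cross(a, b), np.cross(d, c))
    v2 = np.cross(np.cross(b, c), np.cross(a, d))
    l = np.cross(v1, v2)   # vanishing line of the plane
    n = K.T @ l            # n = K^T l
    return n / np.linalg.norm(n)
```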
We generate more training examples, to approximate planes seen from different viewpoints, by applying geometric transformations to the images. The simplest of these is to reflect about the vertical axis; we can also use the known relative pose of the planar regions to render new views from different locations,