the point-based approach, here it fails four times because too few SIFT feature matches are found. In contrast, our algorithm processes the whole sequence.
We observe that with our algorithm the reconstructed trajectory resembles the true trajectory most closely: it has an almost rectangular shape and less variation in the height dimension than the other approaches. This indicates that our rotation estimation is more robust. Please note that, due to inaccurate distance measurements of our robot, the overall scaling of the trajectories is not correct.
The average runtime per image on the “Office Circle” sequence is 3277 ms for the triplet approach, 146 ms for the point-based approach, and 85 ms for our approach.
We conclude that our method is better suited for the visual odometry task than the “Triplet” approach. We also demonstrated that in environments with little texture, a line-based approach can serve as a replacement for point-based relative pose estimation.
6 CONCLUSION
In this paper, we presented a novel relative pose estimation scheme using lines. In our approach, we estimate the 3D line directions by clustering parallel lines in the world and use this information throughout the whole processing pipeline. The direction information is used to guide the line matching and to compute the relative rotation. We also presented a visual odometry system based on our relative pose estimation.
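To illustrate the rotation step, the relative rotation between two frames can be recovered from matched 3D line directions with an orthogonal Procrustes solution (cf. Gower and Dijksterhuis, 2004). The following Python snippet is a minimal sketch of this idea only, not the exact implementation used here; the function name and the sign-handling heuristic are illustrative assumptions.

    # Sketch: relative rotation from matched unit 3D line directions
    # via an orthogonal Procrustes / SVD (Kabsch-style) solution.
    import numpy as np

    def rotation_from_directions(dirs_a, dirs_b):
        """Estimate R such that R @ a_i is close to b_i for matched unit directions.

        dirs_a, dirs_b: (N, 3) arrays of matched unit 3D line directions.
        Line directions are sign-ambiguous, so each pair is flipped to
        agree before building the cross-covariance matrix.
        """
        a = np.asarray(dirs_a, dtype=float)
        b = np.asarray(dirs_b, dtype=float)
        # Resolve the sign ambiguity of undirected lines.
        signs = np.sign(np.sum(a * b, axis=1, keepdims=True))
        signs[signs == 0] = 1.0
        a = a * signs
        # Cross-covariance and SVD; D corrects a possible reflection.
        H = a.T @ b
        U, _, Vt = np.linalg.svd(H)
        D = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])
        return Vt.T @ D @ U.T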
As the 3D line direction estimation is a crucial step, we evaluated it on synthetic data and showed how the use of “direction priors” in sequential processing boosts the accuracy.
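As a rough illustration of such a prior in sequential processing (illustrative only; the assignment rule, threshold, and names are assumptions, not our exact implementation), each line direction hypothesis in the current frame can first be assigned to the nearest direction estimated in the previous frame before the clusters are refined:

    # Sketch: seeding the direction clustering with priors from the previous frame.
    import numpy as np

    def assign_to_priors(line_dirs, prior_dirs, max_angle_deg=10.0):
        """Assign each unit line direction to the closest prior direction.

        Returns an array of prior indices, or -1 where no prior is within
        max_angle_deg (such lines would start a new cluster).
        """
        cos_thresh = np.cos(np.deg2rad(max_angle_deg))
        # |cos| handles the sign ambiguity of undirected lines.
        sims = np.abs(np.asarray(line_dirs) @ np.asarray(prior_dirs).T)  # (N, K)
        best = np.argmax(sims, axis=1)
        best[np.max(sims, axis=1) < cos_thresh] = -1
        return best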
Furthermore, we compared our relative pose estimation to the state-of-the-art approach of Elqursh and Elgammal (Elqursh and Elgammal, 2011) and showed that our method outperforms theirs in terms of accuracy and runtime. In particular, the runtime is reduced from seconds to milliseconds.
The visual odometry system was evaluated and compared to the state-of-the-art line-based relative pose estimation (Elqursh and Elgammal, 2011) and to a comparable point-based algorithm (Nister, 2004). We showed that our method is better suited to textureless indoor scenarios than both other approaches.
In the future, we want to extend our approach to a complete SLAM system. To reach this goal, we need to relax the restriction to small motions in the guided matching step; how this can be done remains future work.
REFERENCES
Antone, M. E. and Teller, S. (2000). Automatic recovery of relative camera rotations for urban scenes. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2000, pages 282–289.
Bartoli, A. and Sturm, P. (2005). Structure-from-motion using lines: Representation, triangulation, and bundle adjustment. Computer Vision and Image Understanding, 100(3):416–441.
Bazin, J., Demonceaux, C., Vasseur, P., and Kweon, I. (2010). Motion estimation by decoupling rotation and translation in catadioptric vision. Computer Vision and Image Understanding, 114(2):254–273.
Dempster, A. P., Laird, N. M., and Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B (Methodological), pages 1–38.
Duda, R. O. and Hart, P. E. (1972). Use of the Hough transformation to detect lines and curves in pictures. Communications of the ACM, 15(1):11–15.
Elqursh, A. and Elgammal, A. (2011). Line-based relative pose estimation. In 2011 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3049–3056.
Fischler, M. A. and Bolles, R. C. (1981). Random sample consensus: A paradigm for model fitting with applications to image analysis and automated cartography. Communications of the ACM, 24(6):381–395.
Gioi, R. v., Jakubowicz, J., Morel, J.-M., and Randall, G. (2010). LSD: A fast line segment detector with a false detection control. IEEE Transactions on Pattern Analysis and Machine Intelligence, 32(4):722–732.
Gower, J. C. and Dijksterhuis, G. B. (2004). Procrustes Problems, volume 3. Oxford University Press.
Hartley, R. I. (1994). Projective reconstruction from line correspondences. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, pages 903–907.
Hartley, R. I. and Zisserman, A. (2004). Multiple View Geometry in Computer Vision. Cambridge University Press, ISBN: 0521540518, second edition.
Hirose, K. and Saito, H. (2012). Fast line description for line-based SLAM. In Bowden, R., Collomosse, J., and Mikolajczyk, K., editors, British Machine Vision Conference 2012, pages 83.1–83.11.
Košecká, J. and Zhang, W. (2002). Video compass. In Goos, G., Hartmanis, J., Leeuwen, J., Heyden, A., Sparr, G., Nielsen, M., and Johansen, P., editors, Computer Vision — ECCV 2002, volume 2353 of Lecture Notes in Computer Science, pages 476–490. Springer Berlin Heidelberg, Berlin, Heidelberg.
Lowe, D. G. (2004). Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, 60(2):91–110.
Nister, D. (2004). An efficient solution to the five-point relative pose problem. IEEE Transactions on Pattern Analysis and Machine Intelligence, 26(6):756–770.