method can reduce accumulative errors by matching the trajectory from SfM to road maps, there are ambiguities in some scenes such as straight roads or Manhattan worlds.
On the other hand, in order to estimate absolute camera positions and postures, some methods estimate camera parameters directly from references without SfM (Pink et al., 2009; Noda et al., 2010). Pink et al. (Pink et al., 2009) used aerial images as references and estimated camera parameters based on feature matching between input images and aerial images. However, it is not easy to find good matches for all the images of a long video sequence, especially for scenes where unique landmarks cannot be observed. Although Mills (Mills, 2013) proposed a robust feature matching procedure that compares the orientation and scale of each match with the dominant orientation and scale identified by histogram analysis, it does not work well when a huge number of outliers exists. Noda et al. (Noda et al., 2010) relaxed the problem by generating mosaic images of the ground from multiple images for feature matching. However, accumulative errors are not considered in this work. Unlike these previous works, we estimate relative camera poses for all the frames using SfM, and we remove accumulative errors by selecting correct matches between the aerial image and ground-view images from candidates using a multi-stage RANSAC scheme.
3 FEATURE MATCHING
BETWEEN GROUND-VIEW
AND AERIAL IMAGES
In this section, we propose a robust method to obtain feature matches between ground-view and aerial images. As shown in Figure 1, the method is composed of (1) ground-view image rectification by homography, (2) feature matching, and (3) RANSAC. Here, in order to achieve robust matching, we propose new criteria for RANSAC with a consistency check of the orientation and scale from the feature descriptor. It should be noted that matching for all the input frames is not necessary in our pipeline. Even if only a few candidate matched frames are found, they can be effectively used as references in the BA stage.
3.1 Image Rectification by Homography
Before calculating feature matches using a feature de-
tector and a descriptor, as shown in Figure 1, we rec-
tify ground-view images so that texture patterns are
similar to those of the aerial image. In most cases, aerial images are taken from very far above the ground, and thus they can be assumed to be captured by an orthographic camera whose optical axis is aligned with the gravity direction. In order to rectify ground-view images, we also assume that the ground-view images contain the ground plane, whose normal vector is aligned with the gravity direction. Then, we compute the homography matrix using the gravity direction in the camera coordinate system, which can be estimated from the vanishing points of parallel lines or a gyroscope.
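The rectification above can be sketched as follows, assuming the intrinsic matrix K is known and the gravity direction in camera coordinates has already been estimated; the particular rotation construction and the sample values of K and the gravity vector are our illustrative assumptions, not from the paper.

```python
import numpy as np

def gravity_rectify_homography(K, g_cam):
    """Homography H = K R K^{-1} that virtually rotates the camera so
    its optical axis aligns with the gravity direction g_cam (a vector
    in camera coordinates, e.g. from vanishing points or a gyroscope).
    Warping with H makes ground-plane texture appear as seen from
    straight above, up to scale and translation."""
    g = g_cam / np.linalg.norm(g_cam)
    # Pick in-plane axes orthogonal to gravity (degenerate only if the
    # camera already looks straight down, i.e. g parallel to the
    # optical axis [0, 0, 1]).
    x = np.cross(g, np.array([0.0, 0.0, 1.0]))
    x /= np.linalg.norm(x)
    y = np.cross(g, x)
    R = np.stack([x, y, g])  # third row = gravity -> new optical axis
    return K @ R @ np.linalg.inv(K)

# Illustrative intrinsics and gravity direction (not from the paper).
K = np.array([[800.0, 0.0, 320.0],
              [0.0, 800.0, 240.0],
              [0.0, 0.0, 1.0]])
g_cam = np.array([0.0, 0.8, 0.6])
H = gravity_rectify_homography(K, g_cam)
# rectified = cv2.warpPerspective(frame, H, (w, h)) would then give
# the top-down-like view used for matching.
```

As a sanity check, the vanishing point of the gravity direction, K g_cam, is mapped by H to the principal point, i.e. the virtual camera looks exactly along gravity.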
3.2 Feature Matching
Feature matches between the rectified ground-view images and the aerial image are calculated with a feature detector and a descriptor. Here, we use the GPS data corresponding to the ground-view images to limit the search area in the aerial image. More concretely, we select a region of size l × l centered at the GPS position. In the experiment described later, l is set to 50 [m]. We employ SIFT (Lowe, 2004) in the experiment because of its robustness to changes in scale, rotation, and illumination.
3.3 RANSAC with Orientation and
Scale Check
As shown in Figure 1, tentative matches often include many outliers. To remove them, we use RANSAC with a consistency check of the orientation and scale parameters.
For matches between the rectified ground-view images and the aerial image, we can use the similarity transform, which is composed of a scale s, a rotation θ, and a translation τ. In the RANSAC procedure, we randomly sample two matches (the minimum number needed to estimate a similarity transform) to compute the similarity transform (s, θ, τ). Here we count the number of inliers, which satisfy

|a_k − (s R(θ) g_k + τ)| < d_th,    (1)

where a_k and g_k are the 2D positions of the k-th match in the aerial image and the rectified ground-view image, respectively, R(θ) is the 2D rotation matrix with rotation angle θ, and d_th is a threshold. After repeating the random sampling process, the sampled matches with the largest number of inliers are selected.
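The distance-based RANSAC described above can be sketched as follows, assuming the matches are given as 2D point arrays; the closed-form two-point similarity estimate and the iteration count are our illustrative choices (the paper specifies only the two-match sampling and the inlier test of Eq. (1)).

```python
import numpy as np

def similarity_from_two(g, a):
    """Closed-form similarity transform (s, R, t) with a = s R g + t,
    estimated from two correspondences g[0] -> a[0], g[1] -> a[1]."""
    dg, da = g[1] - g[0], a[1] - a[0]
    s = np.linalg.norm(da) / np.linalg.norm(dg)
    theta = np.arctan2(da[1], da[0]) - np.arctan2(dg[1], dg[0])
    c, si = np.cos(theta), np.sin(theta)
    R = np.array([[c, -si], [si, c]])
    t = a[0] - s * R @ g[0]
    return s, R, t

def ransac_similarity(g_pts, a_pts, d_th=3.0, iters=500, seed=0):
    """Distance-only RANSAC of Eq. (1): repeatedly sample two matches,
    fit a similarity transform, and keep the model whose inlier count
    |a_k - (s R(theta) g_k + t)| < d_th is largest."""
    rng = np.random.default_rng(seed)
    best_model = None
    best_inliers = np.zeros(len(g_pts), dtype=bool)
    for _ in range(iters):
        i, j = rng.choice(len(g_pts), size=2, replace=False)
        if np.allclose(g_pts[i], g_pts[j]):
            continue  # degenerate sample
        s, R, t = similarity_from_two(g_pts[[i, j]], a_pts[[i, j]])
        err = np.linalg.norm(a_pts - (s * (R @ g_pts.T).T + t), axis=1)
        inliers = err < d_th
        if inliers.sum() > best_inliers.sum():
            best_model, best_inliers = (s, R, t), inliers
    return best_model, best_inliers
```

This purely geometric criterion is the baseline that the orientation and scale consistency check below strengthens.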
The problem here is that the distance-based criterion above cannot successfully find correct matches when a huge number of outliers exists. In order to achieve more robust matching, we modify the RANSAC criterion by checking the consistency of
Sampling-based Bundle Adjustment using Feature Matches between Ground-view and Aerial Images    693