method can reduce accumulative errors by matching the trajectory from SfM to road maps, there are ambiguities in some scenes such as straight roads or Manhattan worlds.
On the other hand, in order to estimate absolute camera positions and postures, some methods estimate camera parameters directly from references without SfM (Pink et al., 2009; Noda et al., 2010). Pink et al. (Pink et al., 2009) used aerial images as references and estimated camera parameters based on feature matching between input images and aerial images. However, it is not easy to find good matches for all the images of a long video sequence, especially for scenes where unique landmarks cannot be observed. Although Mills (Mills, 2013) proposed a robust feature matching procedure that compares the orientation and scale of each match with the dominant orientation and scale identified by histogram analysis, it does not work well when a huge number of outliers exists. Noda et al. (Noda et al., 2010) relaxed the problem by generating mosaic images of the ground from multiple images for feature matching. However, accumulative errors are not considered in this work. Unlike these previous works, we estimate relative camera poses for all the frames using SfM, and we remove accumulative errors by selecting correct matches between the aerial image and ground-view images from candidates using a multi-stage RANSAC scheme.
3 FEATURE MATCHING
BETWEEN GROUND-VIEW
AND AERIAL IMAGES
In this section, we propose a robust method to obtain feature matches between ground-view and aerial images. As shown in Figure 1, the method is composed of (1) ground-view image rectification by homography, (2) feature matching, and (3) RANSAC. Here, in order to achieve robust matching, we propose new criteria for RANSAC with a consistency check of the orientation and scale from the feature descriptor. It should be noted that matching for all the input frames is not necessary in our pipeline. Even if only a few candidate matched frames are found, they can be effectively used as references in the BA stage.
3.1 Image Rectification by Homography
Before calculating feature matches using a feature de-
tector and a descriptor, as shown in Figure 1, we rec-
tify ground-view images so that texture patterns are
similar to those of the aerial image. In most cases, aerial images are taken from very far above the ground, and thus they can be assumed to be captured by an orthographic camera whose optical axis is aligned with the gravity direction. In order to rectify ground-view images, we also assume that the ground-view images contain the ground plane, whose normal vector is aligned with the gravity direction. Then, we compute the homography matrix using the gravity direction in the camera coordinate system, which can be estimated from the vanishing points of parallel lines or a gyroscope.
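The rectification above can be sketched as follows, assuming the intrinsic matrix K is known and the gravity direction in camera coordinates has already been estimated; the particular rotation construction and the sample values of K and the gravity vector are our illustrative assumptions, not from the paper.

```python
import numpy as np

def gravity_rectify_homography(K, g_cam):
    """Homography H = K R K^{-1} that virtually rotates the camera so
    its optical axis aligns with the gravity direction g_cam (a vector
    in camera coordinates, e.g. from vanishing points or a gyroscope).
    Warping with H makes ground-plane texture appear as seen from
    straight above, up to scale and translation."""
    g = g_cam / np.linalg.norm(g_cam)
    # Pick in-plane axes orthogonal to gravity (degenerate only if the
    # camera already looks straight down, i.e. g parallel to the
    # optical axis [0, 0, 1]).
    x = np.cross(g, np.array([0.0, 0.0, 1.0]))
    x /= np.linalg.norm(x)
    y = np.cross(g, x)
    R = np.stack([x, y, g])  # third row = gravity -> new optical axis
    return K @ R @ np.linalg.inv(K)

# Illustrative intrinsics and gravity direction (not from the paper).
K = np.array([[800.0, 0.0, 320.0],
              [0.0, 800.0, 240.0],
              [0.0, 0.0, 1.0]])
g_cam = np.array([0.0, 0.8, 0.6])
H = gravity_rectify_homography(K, g_cam)
# rectified = cv2.warpPerspective(frame, H, (w, h)) would then give
# the top-down-like view used for matching.
```

As a sanity check, the vanishing point of the gravity direction, K g_cam, is mapped by H to the principal point, i.e. the virtual camera looks exactly along gravity.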
3.2 Feature Matching
Feature matches between the rectified ground-view images and the aerial image are calculated with a feature detector and a descriptor. Here, we use the GPS data corresponding to the ground-view images to limit the search area in the aerial image. More concretely, we select a region of size l × l centered at the GPS position. In the experiment described later, l is set to 50 [m]. We employ SIFT (Lowe, 2004) in the experiment because of its robustness to changes in scale, rotation, and illumination.
3.3 RANSAC with Orientation and
Scale Check
As shown in Figure 1, tentative matches often include many outliers. To remove them, we use RANSAC with a consistency check of the orientation and scale parameters.
For matches between the rectified ground-view images and the aerial image, we can use the similarity transform, which is composed of a scale s, a rotation θ, and a translation τ. In the RANSAC procedure, we randomly sample two matches (the minimum number needed to estimate a similarity transform) to compute the similarity transform (s, θ, τ). Here we count the number of inliers, which satisfy

|a_k − (s R(θ) g_k + τ)| < d_th,    (1)

where a_k and g_k are the 2D positions of the k-th match in the aerial image and the rectified ground-view image, respectively, R(θ) is the 2D rotation matrix with rotation angle θ, and d_th is a threshold. After repeating the random sampling process, the sampled matches with the largest number of inliers are selected.
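The distance-based RANSAC described above can be sketched as follows, assuming the matches are given as 2D point arrays; the closed-form two-point similarity estimate and the iteration count are our illustrative choices (the paper specifies only the two-match sampling and the inlier test of Eq. (1)).

```python
import numpy as np

def similarity_from_two(g, a):
    """Closed-form similarity transform (s, R, t) with a = s R g + t,
    estimated from two correspondences g[0] -> a[0], g[1] -> a[1]."""
    dg, da = g[1] - g[0], a[1] - a[0]
    s = np.linalg.norm(da) / np.linalg.norm(dg)
    theta = np.arctan2(da[1], da[0]) - np.arctan2(dg[1], dg[0])
    c, si = np.cos(theta), np.sin(theta)
    R = np.array([[c, -si], [si, c]])
    t = a[0] - s * R @ g[0]
    return s, R, t

def ransac_similarity(g_pts, a_pts, d_th=3.0, iters=500, seed=0):
    """Distance-only RANSAC of Eq. (1): repeatedly sample two matches,
    fit a similarity transform, and keep the model whose inlier count
    |a_k - (s R(theta) g_k + t)| < d_th is largest."""
    rng = np.random.default_rng(seed)
    best_model = None
    best_inliers = np.zeros(len(g_pts), dtype=bool)
    for _ in range(iters):
        i, j = rng.choice(len(g_pts), size=2, replace=False)
        if np.allclose(g_pts[i], g_pts[j]):
            continue  # degenerate sample
        s, R, t = similarity_from_two(g_pts[[i, j]], a_pts[[i, j]])
        err = np.linalg.norm(a_pts - (s * (R @ g_pts.T).T + t), axis=1)
        inliers = err < d_th
        if inliers.sum() > best_inliers.sum():
            best_model, best_inliers = (s, R, t), inliers
    return best_model, best_inliers
```

This purely geometric criterion is the baseline that the orientation and scale consistency check below strengthens.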
The problem here is that the distance-based criterion above cannot successfully find correct matches when a huge number of outliers exists. In order to achieve more robust matching, we modify the RANSAC criterion by checking the consistency of
Sampling-based Bundle Adjustment using Feature Matches between Ground-view and Aerial Images    693