COMBINING ABSOLUTE POSITIONING AND VISION FOR WIDE
AREA AUGMENTED REALITY
Tom Banwell and Andrew Calway
Department of Computer Science, University of Bristol, Bristol, U.K.
Keywords: Visual SLAM, Localisation, Mapping, Absolute positioning, Augmented reality, Sensor fusion.

Abstract: One of the major limitations of vision based mapping and localisation is its inability to scale and operate over
wide areas. This restricts its use in applications such as Augmented Reality. In this paper we demonstrate
that the integration of a second absolute positioning sensor addresses this problem, allowing independent local
maps to be combined within a global coordinate frame. This is achieved by aligning trajectories from the two
sensors which enables estimation of the relative position, orientation and scale of each local map. The second
sensor also provides the additional benefit of reducing the search space required for efficient relocalisation.
Results illustrate the method working for an indoor environment using an ultrasound position sensor, building
and combining a large number of local maps and successfully relocalising as users move arbitrarily within the
map. To show the generality of the proposed method we also demonstrate the system building and aligning
local maps in an outdoor environment using GPS as the position sensor.
1 INTRODUCTION
A fundamental requirement for Augmented Reality
(AR) applications is to be able to localise the pose of
a mobile device with respect to the physical environ-
ment. In the past, work in this area has focused pri-
marily on localisation based on known structure in the
form of calibrated targets (Piekarski et al., 2004) or
models (Park et al., 2008; Pupilli and Calway, 2006).
However, more recent work has sought to remove this reliance on known structure by employing techniques able to operate in previously unseen environments. Of
particular interest has been the significant advances
made in vision based simultaneous localisation and
mapping (SLAM) systems which have their roots in
the Robotics literature, see for example the monocular
systems described in (Davison et al., 2007; Chekhlov
et al., 2006; Klein and Murray, 2007). These systems
have now reached a level of robustness where they
can be reliably used in a variety of AR applications
(Chekhlov et al., 2007; Castle et al., 2008). Although
the robustness and reliability of these visual SLAM
systems is impressive, the challenge of building very
large maps over wide areas still remains. One limit-
ing factor is computational effort, which for most al-
gorithms increases quadratically with the number of
features in the map. This can be addressed by adopting sub-mapping techniques to build consistent maps over relatively wide areas (Clemente et al., 2007;
Pinies and Tardos, 2008). Pinies and Tardos build
sub-maps of limited size before initialising a new map
referenced to the current pose, allowing the sharing of
common information. Remapping common features
in two sub-maps enables loop closure and the sub-
maps to be joined into a single global map (Pinies
and Tardos, 2008). Although these systems enable
a wider area to be mapped, they assume continuous
texture between the sub-maps and cannot handle sit-
uations where there is no texture or when the camera
has been kidnapped.
To overcome these limitations, Castle et al. (Castle et al., 2008) developed a system that does not assume continuous texture between sub-maps.
Their system enables a user to build sub-maps in ar-
eas of interest and relocalise in those maps once they
return to them. This allows users to build sub-maps
of spatially separated areas. However, the relative lo-
cation and orientation of the sub-maps are not known
and relocalisation may become an issue once there are
a large number of maps.
A kidnapped camera can provide no informa-
tion about its current position relative to a previous
map until it reobserves features from that map. One
method to recover such information would be to seek
support from an additional sensor.
Figure 1: User with mobile device and laptop desk.

Pinies et al. (Pinies et al., 2007) combine their monocular-camera system with an Inertial Measurement Unit, which improves their estimated trajectory and map. Newman et al. (Newman et al., 2001) devel-
map. Newman et al (Newman et al., 2001) devel-
oped a building-wide AR system by combining ultra-
sound position estimates and rotation estimates from
an inertial tracker. This allows the wearer of a head
mounted display to be positioned, but does not pro-
vide detailed information about the environment they
are in. The main focus of these systems has been im-
proving the overall system robustness or improving
the accuracy of the estimate.
In this work we overcome these limitations by using a second sensor, an absolute positioning system (APS). In textured areas we simultaneously estimate the global trajectory using the APS and the local trajectory using a monocular-camera system. Using a least-squares approach we estimate the transformation between the global and local trajectories and build a single, scalable, consistent global map. During periods when visual information is lost we use the APS to estimate 3-D position within the global map. This allows the system to relocalise efficiently when returning to a previously built local map, even when there are a very large number of maps.
Results show that by using a second sensor we can overcome the limitations of scalability and the need for continuous texture, and build many local maps whose positions are known relative to one another. This provides a novel AR capability: the ability to track in one local map while viewing AR content placed in other local maps that are not joined to the current one by continuous texture.
The remainder of the paper is organised as fol-
lows. The next section describes the general frame-
work for combining a mapping system with an APS.
The third section describes the implementation we
have developed based on this framework. Conclu-
sions are drawn in the final section.
2 POSITIONING AND VISION
This section describes the core method underlying our work, presented as a general framework. One of the advantages of our method is that it is directly applicable to any camera mapping system (for example, probabilistic or Structure from Motion approaches) and to any absolute positioning system (for example, Ultrasound, Ultra-Wideband or GPS).
Another advantage of this work is that we can create a very large number of local maps, each at an arbitrary position, orientation and scale, and combine them all into a single coordinate frame to provide one scalable global map. Crucially, the map and trajectory estimated by the local mapping system are locally correct but differ from the estimates in the global coordinate frame by an unknown similarity transformation. By simultaneously estimating the local trajectory and the global trajectory we can recover this alignment transformation and bring each local map into the global coordinate frame.
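To make the framework concrete, the sketch below shows one possible way of organising the data, assuming Python with numpy: each local map keeps its features and trajectory in its own frame, together with the similarity transform (scale, rotation, translation) that places it in the global frame once estimated. The class and field names are hypothetical and not taken from the authors' implementation.

```python
# Hypothetical data layout for the framework (names are ours, not the authors').
from dataclasses import dataclass, field
import numpy as np

@dataclass
class LocalMap:
    features: np.ndarray    # (N, 3) feature positions in the local frame
    trajectory: np.ndarray  # (T, 3) camera positions in the local frame
    timestamps: np.ndarray  # (T,) time of each trajectory sample
    s: float = 1.0          # scale of the local-to-global similarity
    R: np.ndarray = field(default_factory=lambda: np.eye(3))    # rotation
    t: np.ndarray = field(default_factory=lambda: np.zeros(3))  # translation
    aligned: bool = False   # True once (s, R, t) has been estimated

@dataclass
class GlobalMap:
    local_maps: list = field(default_factory=list)  # all aligned local maps

    def add(self, local_map: LocalMap) -> None:
        self.local_maps.append(local_map)
```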
2.1 Estimating Transformations
We begin by estimating the motion of a mobile device using two sensors: a monocular camera and an absolute positioning system. The two sensors are rigidly attached at a known offset. Each sensor estimates the mobile's trajectory in its own coordinate frame; our goal is to recover the transformation between the two coordinate frames.

To estimate the transformation we need to estimate two trajectories, propagated to a common time. After each measurement update a new position is estimated and stored in the trajectory set. This provides us with two sets of data, or trajectories: X for the vision system and Y for the absolute positioning system.
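The paper does not prescribe how the two streams are brought to a common time beyond propagating the filter estimates, so the sketch below uses simple linear interpolation of the APS positions at the camera timestamps as a stand-in; the function name and this simplification are assumptions for illustration.

```python
import numpy as np

def pair_trajectories(cam_t, cam_pos, aps_t, aps_pos):
    """Return matched position pairs (X, Y) sampled at the camera timestamps.

    cam_t: (T,) camera timestamps, cam_pos: (T, 3) vision-system positions.
    aps_t: (M,) APS timestamps,    aps_pos: (M, 3) absolute positions.
    """
    # Keep only camera samples that fall inside the APS time range.
    valid = (cam_t >= aps_t[0]) & (cam_t <= aps_t[-1])
    X = cam_pos[valid]
    # Linearly interpolate each APS coordinate at the retained camera times.
    Y = np.stack([np.interp(cam_t[valid], aps_t, aps_pos[:, k])
                  for k in range(3)], axis=1)
    return X, Y
```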
Given the two trajectories we now estimate the transformation between them. We find it best to estimate the transformation parameters over the full trajectories once the local map has been 'finished'. To estimate the desired transformation we use the least-squares approach introduced by Umeyama (Umeyama, 1991). This method is based on the Singular Value Decomposition, which is known for its numerical stability. We now give a brief overview of the method. First we find the means and variances of the two trajectories and then the cross-covariance $\Sigma_{xy}$ between them. We then compute the singular value decomposition $\Sigma_{xy} = UDV^T$. From these matrices we follow the steps described in Umeyama (Umeyama, 1991) to estimate the rotation $R$, translation $t$ and scale $s$ of the transformation. These are the parameters required to convert the local map into the global map.
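Assuming matched samples X (vision) and Y (APS) as above, a minimal numpy sketch of this alignment step might look as follows; it is illustrative rather than the authors' code, but follows the steps of Umeyama (1991) including the reflection correction.

```python
import numpy as np

def estimate_similarity(X, Y):
    """Estimate (s, R, t) such that Y ~= s * R @ x + t, after Umeyama (1991).

    X: (N, 3) positions from the local (vision) trajectory.
    Y: (N, 3) corresponding positions from the absolute positioning system.
    """
    mu_x, mu_y = X.mean(axis=0), Y.mean(axis=0)
    Xc, Yc = X - mu_x, Y - mu_y
    var_x = (Xc ** 2).sum() / len(X)        # variance of the local trajectory
    Sigma_xy = Yc.T @ Xc / len(X)           # 3x3 cross-covariance
    U, D, Vt = np.linalg.svd(Sigma_xy)
    S = np.eye(3)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:
        S[2, 2] = -1.0                      # avoid estimating a reflection
    R = U @ S @ Vt
    s = np.trace(np.diag(D) @ S) / var_x    # scale
    t = mu_y - s * R @ mu_x                 # translation
    return s, R, t
```

Applying the estimated parameters to the stored local trajectory should reproduce the APS trajectory up to measurement noise, which provides a simple sanity check on the alignment.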
2.2 Building the Global Map
Figure 2: (a) Relocalised within a local map: the mean-spheres in neighbouring maps can be seen. (b) Local maps aligned with a truly global coordinate frame.

Once the transformation parameters $s_j$, $R_j$ and $t_j$ have been estimated for local map $j$, they can be used to transform the local map into the global coordinate frame. The trajectory $X_{lj}$ of local map $j$ is transformed into the global coordinate system $X_{gj}$ as follows:

$$X_{gj} = s_j R_j X_{lj} + t_j \quad (1)$$

Each local feature $f_{li}$ in local map $j$ can be transformed to the global coordinate frame $f_{gi}$ as follows:

$$f_{gi} = s_j R_j f_{li} + t_j \quad (2)$$
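Under the same assumptions as the sketch in Section 2.1, equations (1) and (2) simply apply the estimated similarity to every trajectory point and feature of local map $j$, for example:

```python
import numpy as np

def to_global(points_local, s, R, t):
    """Apply X_g = s * R @ X_l + t to an (N, 3) array of local points (Eqs. 1, 2)."""
    return s * points_local @ R.T + t

# Hypothetical usage for one aligned local map:
# trajectory_global = to_global(local_map.trajectory, s_j, R_j, t_j)  # Eq. (1)
# features_global   = to_global(local_map.features,   s_j, R_j, t_j)  # Eq. (2)
```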
3 EXPERIMENTAL RESULTS
To demonstrate our method we conducted experi-
ments both indoors, using an ultrasound system, and
outdoors using GPS.
3.1 Indoor Office Environment
The hardware was set up as shown in Figure 1. The user has a handheld mobile device consisting of two sensors rigidly attached at a known offset: a calibrated handheld camera with a resolution of 320x240 pixels and a wide-angle lens, and an ultrasound receiver. To enable mobility a laptop
desk is used to carry the laptop. An ultrasound po-
sitioning system is used to provide estimates of 3-D
position (Randell and Muller, 2001). To provide the
local maps and trajectories we use the visual-SLAM
system developed by Chekhlov et al (Chekhlov et al.,
2006).
3.1.1 Building and Correcting Local Maps
Testing was performed in an indoor environment. A
sequence of steps can be seen in Figure 3 (the full ex-
periment can be seen in the attached video). The ul-
trasound system estimates the 3-D position of the mo-
bile device in the global coordinate frame. Although
in practice one would want to start a new map after a
loss of visual track, for ease of testing we allow the
user to control the building of local maps. The user
provides input to start building a local map. Once
the user has decided they have finished building the
local map they again provide input and the system
stops building the current local map. At this point the
system estimates the transformation to align the local
trajectory with the global trajectory and applies that
transformation to the local estimate.
3.1.2 Viewing into and Across Local Maps
One of the major advantages our method offers for
AR applications is the ability to see across disjoint
maps. To demonstrate this we placed rotating spheres
at the mean of aligned maps, as can be seen in Figure
3. After aligning six local maps the user entered the
relocalisation phase. The mobile returned to and relo-
calised in the second map (green sphere). As this lo-
cal map had been globally aligned the camera’s global
position could be estimated. All global graphics can then be seen in the camera view, allowing the user to 'look across' maps and see their contents. This is shown in Figure 2a. This capability is not achievable with other current single-camera systems.
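This 'look-across' behaviour follows from chaining the current map's similarity transform with the camera's local pose: a globally placed graphic is first mapped back into the current local frame (the inverse of Eq. 2) and then projected using the local camera pose. The sketch below illustrates this chain of transformations; the pinhole projection and all variable names are our assumptions, not the authors' rendering code.

```python
import numpy as np

def project_global_point(f_g, s_j, R_j, t_j, R_cw, p_c, K):
    """Project a global 3-D point into the current camera image.

    f_g:  (3,) point placed in the global frame (e.g. a sphere at another map's mean).
    s_j, R_j, t_j: similarity transform of the local map currently being tracked.
    R_cw: (3, 3) local-to-camera rotation,  p_c: (3,) camera centre in local coords.
    K:    (3, 3) pinhole intrinsics.  Returns (u, v) or None if behind the camera.
    """
    f_l = R_j.T @ (f_g - t_j) / s_j   # global -> local frame (inverse of Eq. 2)
    x_c = R_cw @ (f_l - p_c)          # local -> camera frame
    if x_c[2] <= 0:
        return None                   # point is behind the camera
    u = K @ (x_c / x_c[2])            # pinhole projection
    return u[0], u[1]
```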
3.1.3 Efficient Relocalisation
To demonstrate the second contribution of our work
we improve the relocalisation method developed by
Chekhlov et al (Chekhlov et al., 2008). Between
building local maps, if the user returns to a previously
mapped area, we seek to relocalise the camera. Once
there are a very large number of maps, it is unrealis-
tic to attempt to relocalise in all maps. Our contribu-
tion comes from the fact that our method provides us
with an estimate of the global position of the mobile
and all local maps. We use this information to decide
which local map(s) we should attempt to relocalise
in. Once the current local map has been ‘finished’
the system enters a relocalisation phase. A map is se-
lected as a candidate for relocalisation if it falls within a sphere centred on the mobile's position. The sphere defines a region in which there is a high chance of detecting features from the map; its radius is estimated from the distance between the mobile's current position and the mean of the local map.
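A minimal sketch of this candidate selection is given below. The paper does not give the exact radius rule, so the heuristic used here (a fixed margin plus a term that grows with each map's spatial extent) is an assumption; the essential point is that only maps whose means fall inside the sphere around the APS-estimated position are attempted.

```python
import numpy as np

def select_candidate_maps(mobile_pos, maps, base_radius=3.0):
    """Return the local maps worth attempting relocalisation in.

    mobile_pos: (3,) current APS position in the global frame.
    maps: objects with .mean_global (3,) and .features_global (N, 3) attributes.
    base_radius: margin in metres (the radius rule here is a hypothetical heuristic).
    """
    candidates = []
    for m in maps:
        extent = np.linalg.norm(m.features_global - m.mean_global, axis=1).max()
        radius = base_radius + extent   # assumed radius heuristic, not the authors'
        if np.linalg.norm(mobile_pos - m.mean_global) < radius:
            candidates.append(m)
    return candidates
```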
3.2 Combining GPS and Vision
To demonstrate the generality of our method we have
applied it using an alternative sensor, GPS. Using
GPS we have the potential to scale across very wide areas, and indeed the entire globe.

Figure 3: Screenshots from a map building sequence. The top row shows the building of local maps, with the locally estimated trajectory (green) and features (blue). The bottom row shows the same trajectory estimated in the global coordinate frame (blue dotted line) and the correctly transformed local maps (black). The arrows (yellow) represent the correct transformation of the local maps. The spheres (black, green, cyan and yellow) represent the means of the local maps.

This enables users to share
maps and combine them with applications such as
Google Earth. Using the same mapping system as de-
scribed in section 3.1 and the method of section 2 we
tested the system outside by walking around a build-
ing. Each wall of the building was locally mapped using the vision system; once finished, the local map was aligned with the global coordinate frame to update the global map. This can be seen in Figure 2b.
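The paper does not describe how GPS fixes are converted into a metric global frame for the alignment; a common choice over areas of this size is a local East-North-Up approximation about a reference fix, sketched below under that assumption.

```python
import numpy as np

EARTH_RADIUS = 6378137.0  # WGS-84 equatorial radius in metres

def gps_to_enu(lat_deg, lon_deg, alt, lat0_deg, lon0_deg, alt0):
    """Approximate East-North-Up coordinates (metres) of a GPS fix relative to a
    reference fix (lat0_deg, lon0_deg, alt0); valid over small areas only."""
    lat, lon = np.radians(lat_deg), np.radians(lon_deg)
    lat0, lon0 = np.radians(lat0_deg), np.radians(lon0_deg)
    east = (lon - lon0) * np.cos(lat0) * EARTH_RADIUS
    north = (lat - lat0) * EARTH_RADIUS
    up = alt - alt0
    return np.array([east, north, up])
```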
4 CONCLUSIONS
This work developed a new scalable mapping and lo-
calisation technique. It combines trajectories from
an absolute positioning system and a local mapping
system to produce a single map. The key contribu-
tions are the ability to map in areas where there is no
continuous texture and the correct alignment of local
maps into a single global coordinate system. The po-
sition and orientation of the local maps are known rel-
ative to each other. This provides the ability to view
across disjoint maps and improves the efficiency of re-
localisation. We demonstrated our method by build-
ing and transforming local maps to a global coordi-
nate frame using different sensors both indoors and
outdoors. Future work will focus on auto-calibrating
the ultrasound system to reduce the installation cost.
ACKNOWLEDGEMENTS
We are grateful to Walterio Mayol-Cuevas, Andrew
Gee and Denis Chekhlov for discussions and to Cliff
Randell and Henk Muller for help with the ultrasound
positioning system.
REFERENCES
Castle, R. O., Klein, G., and Murray, D. W. (2008). Video-
rate localization in multiple maps for wearable aug-
mented reality. In Proc 12th IEEE Int Symp on Wear-
able Computers 2008, pages 15–22.
Chekhlov, D., Gee, A., Calway, A., and Mayol-Cuevas,
W. (2007). Ninja on a plane: Automatic discovery
of physical planes for augmented reality using visual
slam. In International Symposium on Mixed and Aug-
mented Reality (ISMAR).
Chekhlov, D., Mayol-Cuevas, W., and Calway, A. (2008).
Appearance based indexing for relocalisation in real-
time visual slam. In 19th British Machine Vision
Conference, pages 363–372. BMVA.
Chekhlov, D., Pupilli, M., Mayol-Cuevas, W., and Calway,
A. (2006). Real-time and robust monocular slam using
predictive multi-resolution descriptors. In 2nd Inter-
national Symposium on Visual Computing.
Clemente, L., Davison, A., Reid, I., Neira, J., and Tardos, J.
(2007). Mapping large loops with a single hand-held
camera. In Robotics: Science and Systems.
Davison, A., Reid, I., Molton, N., and Stasse, O.
(2007). MonoSLAM: Real-Time Single Camera
SLAM. IEEE Transactions on Pattern Analysis and
Machine Intelligence, 29(6):1052–1067.
Klein, G. and Murray, D. (2007). Parallel tracking and
mapping for small AR workspaces. In Proc. Sixth
IEEE and ACM International Symposium on Mixed
and Augmented Reality (ISMAR’07), Nara, Japan.
Newman, J., Ingram, D., and Hopper, A. (2001). Aug-
mented reality in a wide area sentient environment.
In Augmented Reality, 2001. Proceedings. IEEE and
ACM International Symposium on, pages 77–86.
Park, Y., Lepetit, V., and Woo, W. (2008). Multiple 3d ob-
ject tracking for augmented reality. In Proc. Seventh
IEEE and ACM International Symposium on Mixed
and Augmented Reality, pages 117–120.
Piekarski, W., Avery, B., Thomas, B., and Malbezin, P.
(2004). Integrated head and hand tracking for indoor
and outdoor augmented reality. In VR ’04: Proceed-
ings of the IEEE Virtual Reality 2004, pages 11–276,
Chicago, IL.
Pinies, P., Lupton, T., Sukkarieh, S., and Tardos, J. D.
(2007). Inertial aiding of inverse depth slam using a
monocular camera. In IEEE International Conference
on Robotics and Automation, Roma, Italy.
Pinies, P. and Tardos, J. (2008). Large-scale slam building
conditionally independent local maps: Application to
monocular vision. IEEE Transactions on Robotics,
24(5):1094–1106.
Pupilli, M. and Calway, A. (2006). Real-time camera track-
ing using known 3d models and a particle filter. In
International Conference on Pattern Recognition.
Randell, C. and Muller, H. (2001). Low cost indoor posi-
tioning system. In Ubicomp 2001: Ubiquitous Com-
puting, pages 42–68.
Umeyama, S. (1991). Least-squares estimation of transformation parameters between two point patterns. IEEE Transactions on Pattern Analysis and Machine Intelligence, 13(4):376–380.