geometric transformations the system would need to
deal with. These are:
1. Pan.wmv: The camera is held with the image
plane parallel to the display and translated (in-
plane) from left to right at a constant distance of
30cm from the display.
2. Zoom.wmv: The camera is held central and par-
allel to the display and moved from a distance of
10cm to 50cm from it. This tests out-of-plane
translations (or equivalently, zooming out).
3. Rotate.wmv: The camera is held central and par-
allel to the display at a distance of 30cm, while it
is rotated (in-plane) through 180 degrees.
4. Skew.wmv: The camera starts by pointing to-
wards the centre of the display at a distance of
30cm. The camera then follows a roughly circu-
lar arc through 30 degrees, rotating about the dis-
play’s y-axis, until it is 5cm from the screen and
at a 60 degree angle to the screen.
Given an input video, our evaluation framework
computes the homography between each frame of the
video and the background image, exactly as the live
system would. For every frame processed, we record
the computed homography, the total processing time,
and the time taken by each component of the system
to execute, and save this data to disk for analysis.
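The framework's per-frame instrumentation can be sketched as follows. This is a minimal illustration, not the paper's implementation: the `run_pipeline` helper and its stage names are hypothetical, standing in for the real system's feature extraction, feature matching and homography estimation stages.

```python
import time

def run_pipeline(frames, stages):
    """Run each frame through a sequence of named stages, timing every
    stage, and collect one record per frame (as the evaluation
    framework logs homographies and timings for later analysis)."""
    records = []
    for i, frame in enumerate(frames):
        rec = {"frame": i, "timings_s": {}}
        data = frame
        for name, fn in stages:
            t0 = time.perf_counter()
            data = fn(data)                     # output of one stage feeds the next
            rec["timings_s"][name] = time.perf_counter() - t0
        rec["result"] = data                    # e.g. the final homography
        records.append(rec)
    return records
```

In the real system the stages would be the SURF extractor, the matcher and the homography solver; here any callables can be timed, which also makes the per-component percentages reported below easy to reproduce from the saved records.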
To facilitate measuring accuracy, we manually
selected, as accurately as possible, the pixel in the
reference image corresponding to the pixel at the
centre of each frame of each test video. These
manually selected points correspond to where the
cursor should appear on the remote display if the
system were being used as a pointing system. These
points serve as the ground truth position of the cur-
sor and allow the accuracy of the point projected on
to the remote display to be measured. More specif-
ically, for any frame of a test video, upon comput-
ing the homography, the cursor position (central cam-
era pixel) projected by this homography can be deter-
mined. This can be compared to the manually deter-
mined (ground-truth) cursor position for the frame to
give an indication of accuracy. Accuracy is measured
as the Euclidean distance, in pixels, between the com-
puted and manually selected cursor positions. This
measure is computed for every frame of every video
and saved to disk for analysis.
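The error measure just described amounts to projecting the central camera pixel through the computed homography and taking the Euclidean distance to the hand-labelled point. A self-contained sketch (pure NumPy; the function names are illustrative, not from the paper's code):

```python
import numpy as np

def project_point(H, p):
    """Project pixel p = (x, y) through a 3x3 homography H:
    lift to homogeneous coordinates, apply H, then dehomogenise."""
    x, y, w = H @ np.array([p[0], p[1], 1.0])
    return np.array([x / w, y / w])

def cursor_error(H, frame_centre, ground_truth):
    """Euclidean pixel distance between the projected frame centre
    and the manually selected ground-truth cursor position."""
    return float(np.linalg.norm(project_point(H, frame_centre) - ground_truth))
```

For example, with the identity homography and a ground-truth point offset by (3, 4) pixels from the frame centre, `cursor_error` returns 5.0, the per-frame value that is saved to disk.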
We evaluated the accuracy of system 1, where
the best 50% of feature matches are used to compute
the homography, and system 2, which employs sam-
ple consensus. An overview of average performance
across all videos for the two systems is shown in fig-
ure 4. The sample consensus approach (system 2)
shows superior performance, particularly in the dif-
ficult ‘skew’ video.
Examining the accuracy in detail on a frame-
by-frame basis, we see that for (i) in-plane trans-
lation, (ii) zooming and (iii) in-plane rotation, the
error is acceptably small and always within five
pixels. However, in the ‘skew’ video sequence, as
the camera’s angle from the remote display normal
approaches 30 degrees, the errors can become large,
particularly in system 1. This is because SURF
features are not invariant to general projective
transformations, so the quality of feature matching
deteriorates.
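The two match-selection strategies can be contrasted in a self-contained sketch (pure NumPy, not the paper's implementation). System 1 keeps the best 50% of matches by descriptor distance and fits once; system 2's sample consensus corresponds to a RANSAC loop over a direct linear transform (DLT) homography fit. Function names and parameter values here are illustrative.

```python
import numpy as np

def fit_homography(src, dst):
    """Direct linear transform: least-squares 3x3 homography mapping
    src points to dst points (null vector of the stacked constraints)."""
    A = []
    for (x, y), (u, v) in zip(src, dst):
        A.append([x, y, 1, 0, 0, 0, -u * x, -u * y, -u])
        A.append([0, 0, 0, x, y, 1, -v * x, -v * y, -v])
    _, _, Vt = np.linalg.svd(np.asarray(A, dtype=float))
    H = Vt[-1].reshape(3, 3)
    return H / H[2, 2]

def best_half(src, dst, distances):
    """System 1's strategy: keep the 50% of matches with the smallest
    descriptor distances, then fit a single homography to them."""
    keep = np.argsort(distances)[: len(distances) // 2]
    return fit_homography(src[keep], dst[keep])

def ransac_homography(src, dst, n_iter=200, tol=3.0, rng=None):
    """System 2's strategy (sample consensus): repeatedly fit to four
    random correspondences, keep the model with the most inliers,
    then refit on that inlier set."""
    if rng is None:
        rng = np.random.default_rng(0)
    best_inliers = np.zeros(len(src), dtype=bool)
    for _ in range(n_iter):
        idx = rng.choice(len(src), 4, replace=False)
        H = fit_homography(src[idx], dst[idx])
        proj = np.c_[src, np.ones(len(src))] @ H.T
        proj = proj[:, :2] / proj[:, 2:3]
        inliers = np.linalg.norm(proj - dst, axis=1) < tol
        if inliers.sum() > best_inliers.sum():
            best_inliers = inliers
    return fit_homography(src[best_inliers], dst[best_inliers]), best_inliers
```

The contrast explains the skew results: when matching deteriorates, bad matches can still have small descriptor distances and so survive system 1's 50% cut, whereas the consensus test rejects any match that is geometrically inconsistent with the majority.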
Figure 4: Mean pixel error results for the four movies.
On examining the time to process a frame, we
found no significant differences between the videos.
The mean frame rate across all frames is 0.64 frames
per second. This is clearly too slow for a direct
interaction device: the cursor would update only
every 1.56 seconds, resulting in a very low level of
usability. Within our timing results, feature
extraction took on average 17.8% of the time, feature
matching 74.4%, and computing the homography only
7.8%.
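These figures can be sanity-checked with simple arithmetic (the percentages are the measurements reported above):

```python
fps = 0.64
frame_period = 1.0 / fps  # seconds between cursor updates: about 1.56 s
shares = {"extraction": 0.178, "matching": 0.744, "homography": 0.078}
# Approximate wall-clock time per frame spent in each component.
component_s = {name: share * frame_period for name, share in shares.items()}
```

Feature matching alone accounts for roughly 1.16 s of each 1.56 s frame period, which is why the speed-up techniques in the next section target the matching stage first.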
6 USABILITY EVALUATION
Ideally, our system would be both responsive and
accurate. However, this is not currently achievable,
and some trade-off has to be made. To determine
where a good balance between speed and accuracy
lies, we conducted a user test, which was also used
to quantitatively measure perceived usability.
We evaluated three techniques to improve the sys-
tem update rate: (i) bounding the time over which fea-
ture matching can take place; (ii) reducing the sensed
PECCS 2011 - International Conference on Pervasive and Embedded Computing and Communication Systems