Photo-based Multimedia Applications using Image Features Detection

Rui Nóbrega and Nuno Correia
CITI, Departamento de Informática, Faculdade de Ciências e Tecnologia, Universidade Nova de Lisboa, Lisboa, Portugal

Keywords: Computer Vision, Augmented Reality, Computer Graphics.
Abstract:
This paper proposes a framework for the creation of interactive multimedia applications that take advantage
of detected features from user-captured photos. The goal is to create games, architectural and space planning
applications that interact with visual elements in the images such as walls, floors and empty spaces. The frame-
work takes advantage of a semi-automatic algorithm to detect scene elements and camera parameters. Using
the detected features, virtual objects can be inserted in the scene. In this paper several example applications
are presented and discussed, and the reliability of the detection algorithm is compared with other systems. The
presented solution analyses the photos using graph-cuts for segmentation, vanishing point detection and line
analysis to detect the scene elements. The main advantage of the proposed framework is the semi-automatic creation of the three-dimensional model used in mixed reality applications. This enables scenarios where the user can be responsible for the input scene without much prior knowledge or experience. The currently implemented examples include a furniture positioning system and a snake game played in a user-built maze in the real world. The proposed system is well suited to mobile multimedia applications where interaction is combined with images from the device's back camera.
1 INTRODUCTION
This paper presents a mixed reality system where vi-
sual features are acquired from user-captured photos
and used to create interactive applications. The main
motivation for this work is to create a framework that
enables the user to virtually reshape a photographed
scene. By adding virtual objects, changing colors
or changing lights the scene can be transformed into
something different. In this paper the main focus will
be given to the introduction of virtual objects that re-
act to the scene as if they had physical properties such
as mass or weight. Applications of this system include furniture testing simulations in user spaces, virtual modification of old photos, or spatially-aware games.
The main problem presented in this paper is how
to acquire spatial features from a single image that
will allow the seamless embedding of virtual objects
in the scene for augmented reality (AR) applications.
A user with no special skills should be able to acquire
and manipulate the image in order to generate the in-
put for the given AR applications.
The expected contributions of this work are the
introduction of a semi-automatic method for high-level feature extraction and several use-case proto-
types, which implement and make use of the given
features. The proposed use-case prototypes are: a
virtual reshaping tool (Figure 1) to contextually add
virtual objects in photos, and a virtual snake game
that uses a maze based on real-world objects or shapes (Figure 7).

Figure 1: Augmented reality system, taking advantage of high-level visual features detected from the image. The virtual box is aligned with the world and is laid on a virtual plane that emulates the floor.
The presented solution is based on an initial im-
age taken by the user. The image is analyzed to find
the floor, vanishing points and room orientation. The
augmented reality application introduces virtual ob-
jects with a guideline and snapping system to assist
the user in the introduction of objects as seen in Fig-
ure 1.
2 BACKGROUND AND RELATED
WORK
Most augmented reality applications depend on some
form of computer vision technique (Forsyth and
Ponce, 2002; Szeliski, 2010), recognizing markers or using image tracking. Most AR applications (Tillon
and Marchal, 2011; Turcsanyi-Szabo and Simon,
2011) use real-time video from cameras, and then try
to superimpose virtual objects and animations. One
known example is ARToolKit (http://www.hitl.washington.edu/artoolkit/), where virtual 3D objects are introduced above predefined markers.
In this work, the main focus is given to sys-
tems that construct a model of the world and use the
same model to enhance the AR application (Wagner
et al., 2010; Yu and Malik, 2008). In the PTAM
project (Klein and Murray, 2007), a model of a small
workspace is instantly built by translating a camera
sideways. The model detects a plane where games
can be played.
Many techniques can be used to acquire features
for scene reconstruction. Our current approach is
based on extraction of features from a single image.
Some projects focus on examining the perspective, given details such as vanishing lines and a reference plane (Criminisi et al., 2000); others analyse depth cues in the image and recover 3D structure by connecting small image planes (Saxena et al., 2009). The analysis consists essentially of detecting
interesting properties from the image lines (Gupta,
2010; Gupta et al., 2011). The goal is to detect struc-
ture without the help of markers (Simon et al., 2000).
Using the detected features, we intend to create AR applications with new forms of interaction (Del Pero et al., 2011; Gupta et al., 2011; Nóbrega and Correia, 2012; Lai et al., 2011). Most of the pro-
cessing requires the analysis of the image in search
for main lines and vanishing points (Del Pero et al.,
2011; Lee et al., 2009; Rother, 2002; Yu and Malik,
2008). Several techniques have also been studied to
avoid clutter and noise (Hedau and Hoiem, 2009), re-
construct architectural environments (Pollefeys et al.,
2007; Rother, 2002) and automatically calibrate cam-
eras (Pollefeys et al., 1998).
Many systems use several images to reconstruct
the scene, namely stereo techniques (Sinha et al.,
2009), parallel images (Klein and Murray, 2007), several photos (Debevec et al., 1996; Ishikawa et al.,
2011) or just plain video (Crandall et al., 2011; Polle-
feys et al., 2007) as in the structure from motion tech-
niques. Some systems rely on extra sensors or special
cameras such as depth cameras (Izadi et al., 2011) to
acquire additional information. Currently this work
focuses on cameras with no extra hardware, such as
the ones present on mobile devices or webcams.
Other systems rely on segmenting the scene into super-pixels (Liu and Gould, 2010; Gould et al., 2009; Kowdle et al., 2011) and inferring the underlying model. The Make3D project (Saxena et al., 2009) uses segmentation to automatically create an explorable 3D view from a picture or painting. In our work, segmentation is used to detect the floor (Rother and Kolmogorov, 2004) with a human-in-the-loop technique (Kowdle et al., 2011). Some projects (Karsch et al., 2011) render photo-realistic synthetic objects on top of photographs using a human-assisted, semi-automatic model.
Different approaches include using Bayesian inference (Furukawa et al., 2009) to detect scene orientation. Some authors give more attention to scenes
with clutter (Hedau and Hoiem, 2009; Wang and
Gould, 2010), while others are more interested in real-
time applications (Klein and Murray, 2007; Wagner
et al., 2010). The main difference of our system is the integration of single-view techniques into a human-assisted mixed reality application.
3 CONCEPT OVERVIEW
With the advent of tablets and smartphones with front and back cameras, new opportunities arise for multimedia applications that mix reality and virtual
content. Images and videos can be recorded instantly
and immediately used in the system or saved for later
interaction. Another obvious possibility is the real-
time interaction with the real world using the camera.
Figure 2: Handheld application using the camera. The current prototypes run on tablets and on regular computers with webcams.
Using real video or captured images as input for
interactive mobile applications (Figure 2) presents
several challenges that need to be addressed.
Photo-basedMultimediaApplicationsusingImageFeaturesDetection
299
First of all, there is a factor of uncertainty related to the user's skill in capturing good-quality images. There can be limitations related to the device camera itself, as the image can be blurred and the scene poorly illuminated. These are common problems in
all computer vision projects.
At the application level, users may have difficul-
ties in positioning or moving objects in a 3D space.
They may expect that the affordances of the real world
should apply to the virtual objects in the AR applica-
tion. As an example, virtual objects should collide with and be occluded by objects in the real world.
There are several properties that can be explored
in mixed reality in order to create rich multimedia
applications, such as 3D structure, segmentation, object detection and tracking. Combining several
techniques is sometimes difficult due to increasing
processing cost (image processing is usually a heavy
task). The main goal of this work is to create a frame-
work that can be used to simplify the process of cre-
ating a multimedia application based on images cap-
tured by the user. Currently the main focus is on 3D
structure detection from a single image and segmentation of user-assigned areas.
4 PROBLEM FORMULATION
The fundamental concept for the proposed system is
the possibility of creating mixed and augmented real-
ity applications that take advantage of detected visual
features inside images of real world spaces. These
images are obtained with regular cameras with no ad-
ditional hardware.
The research problem being addressed here is how
to embed virtual objects inside an image in such a way
that the virtual objects react to physical concepts (i.e.,
gravity, obstacles) inside the virtual world. As seen in section 2, several previous studies (Karsch et al., 2011) have found different answers to this problem.
The main problem can be divided into the follow-
ing sub-problems:
δ = f(I), (1)
φ = g(I, δ, C), and (2)
arApp(I, O, M), where M = {φ, δ, C} (3)
First (1) the input image I must be analyzed us-
ing f to search for low-level features δ such as edges,
straight lines or relevant points. Using δ, the im-
age I and the camera parameters C, the function g
(2) attempts to find high-level features φ such as vanishing points, horizon, room orientation or floor. The camera parameters C and all the detected features and constraints compose the virtual model M of the scene.
The final AR application (3) receives the model, the
virtual objects O and the input image I.
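Read as a pipeline, the formulation above suggests a small set of data types. The sketch below, with hypothetical type and field names and OpenCV-style containers, is one possible way to organize δ, φ, C and M in code; it only illustrates the formulation, not the framework's actual API.

```cpp
#include <opencv2/opencv.hpp>
#include <vector>

// Low-level features (delta = f(I)): edges, straight lines, relevant points.
struct LowLevelFeatures {
    cv::Mat edges;                              // binary edge map
    std::vector<cv::Vec4i> lineSegments;        // detected straight line segments
    std::vector<cv::KeyPoint> keypoints;        // relevant points
};

// High-level features (phi = g(I, delta, C)).
struct HighLevelFeatures {
    std::vector<cv::Point2d> vanishingPoints;   // phi1
    cv::Vec3d orientation;                      // phi2: rotation of the scene
    cv::Mat floorMask;                          // phi3: per-pixel floor mask m3
};

// Camera parameters C, assumed known or extrapolated from the vanishing points.
struct CameraParams {
    double fieldOfViewDeg;
    cv::Size imageSize;
};

// The scene model M = {phi, delta, C} consumed by arApp(I, O, M).
struct ARModel {
    LowLevelFeatures delta;
    HighLevelFeatures phi;
    CameraParams camera;
};

// Signatures mirroring equations (1) and (2).
LowLevelFeatures f(const cv::Mat& image);
HighLevelFeatures g(const cv::Mat& image,
                    const LowLevelFeatures& delta,
                    const CameraParams& camera);
```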
Our solution is based on the interpretation of the
main straight lines of the scene (Lee et al., 2009;
Rother, 2002) to find the vanishing points (φ1) and the scene orientation (φ2), and on segmentation techniques (Rother and Kolmogorov, 2004) to find the floor (φ3). The
current model focuses mainly on these three high-
level features but can be easily extended to incorpo-
rate others (Simon, 2006), such as walls or planar surfaces. These features are primarily obtained through automatic analysis of the image (f and g) but may need to be refined through a human-in-the-loop process (Kowdle et al., 2011) where the user can give simple hints to help the algorithm.
The image I is assumed to be an image of a Man-
hattan World scenario (Coughlan and Yuille, 1999;
Del Pero et al., 2011), where the floor is in the lower
part of the image and there is a large uncluttered space in the room on which the virtual objects can be laid.
The camera parameters C are assumed to be known,
and when they are not, they can sometimes be ex-
trapolated from the vanishing points, or several prior
models can be assumed (e.g., webcams).
5 ALGORITHM DETAILS
The implementation of our solution is summarized in
Figure 3 and the next subsections will detail the con-
struction of the augmented reality model M.
5.1 Input
The initial input of the problem is a raster image I
that meets the assumptions described in the previous
section. Additionally, the image should not be very
blurred and it should contain enough detail for feature
detection. In the current prototype, images can be ob-
tained through a webcam or by loading a file from the
disk. To help the user select a good picture, the system does a real-time analysis of the image to determine whether it has enough potential. To assess if the image is
complex enough for further analysis, FAST feature points (FAST in OpenCV, http://docs.opencv.org/modules/features2d/doc/feature_detection_and_description.html) and line segments are extracted. For the image to be accepted, the number of detected features must be above a certain threshold value (dependent on the resolution).
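A minimal sketch of this acceptance test, assuming OpenCV; the FAST threshold, Canny thresholds, Hough parameters and the acceptance counts below are illustrative placeholders, since the real values depend on the image resolution.

```cpp
#include <opencv2/opencv.hpp>
#include <vector>

// Real-time check: does the image contain enough detail for feature detection?
bool imageHasEnoughDetail(const cv::Mat& bgrImage) {
    cv::Mat gray;
    cv::cvtColor(bgrImage, gray, cv::COLOR_BGR2GRAY);

    // FAST corner-like feature points.
    std::vector<cv::KeyPoint> keypoints;
    cv::FAST(gray, keypoints, /*threshold=*/20, /*nonmaxSuppression=*/true);

    // Line segments: Canny edge map followed by a probabilistic Hough transform.
    cv::Mat edges;
    cv::Canny(gray, edges, 50, 150);
    std::vector<cv::Vec4i> segments;
    cv::HoughLinesP(edges, segments, 1, CV_PI / 180.0, 50,
                    /*minLineLength=*/30, /*maxLineGap=*/10);

    // Illustrative acceptance rule; the real thresholds depend on the resolution.
    return keypoints.size() > 100 && segments.size() > 20;
}
```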
GRAPP2013-InternationalConferenceonComputerGraphicsTheoryandApplications
300
Figure 3: Steps involved in semi-automatic construction of
two Augmented Reality applications. The contribution of
the user consists in providing an input image and roughly
selecting the floor. The AR model is then automatically in-
ferred.
5.2 Region Segmentation
The detection of the scene floor (φ3) is necessary for
the AR application to understand where to place the
virtual object. In order to detect the floor the system
requires the assistance of the user. Using the input
image the user has to roughly sketch a line around
what she considers to be the floor as seen in Fig-
ure 4. More complex approaches (e.g., statistical inference from a prior knowledge base) could be introduced to do this step automatically, but the current system does not focus on this topic.
The rough sketch is then used as input for an im-
plementation of the GrabCut algorithm (Rother and Kolmogorov, 2004) to define a mask m3 that states, for each pixel, whether it belongs to the floor class or not. This mask will later be important to verify colli-
sions with the floor.
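A condensed sketch of this step around OpenCV's grabCut. It assumes the user's rough sketch has already been rasterized into a binary image marking the probable floor region; the function and variable names are illustrative.

```cpp
#include <opencv2/opencv.hpp>

// userRegion: non-zero inside the user's rough floor selection.
// Returns m3: 255 where the pixel is classified as floor, 0 elsewhere.
cv::Mat extractFloorMask(const cv::Mat& bgrImage, const cv::Mat& userRegion) {
    // Seed GrabCut: everything is probable background, the user's rough
    // selection is probable foreground.
    cv::Mat mask(bgrImage.size(), CV_8UC1, cv::Scalar(cv::GC_PR_BGD));
    mask.setTo(cv::Scalar(cv::GC_PR_FGD), userRegion);

    cv::Mat bgdModel, fgdModel;
    cv::grabCut(bgrImage, mask, cv::Rect(), bgdModel, fgdModel,
                /*iterCount=*/5, cv::GC_INIT_WITH_MASK);

    // Keep definite and probable foreground as the floor class.
    cv::Mat m3 = (mask == cv::GC_FGD) | (mask == cv::GC_PR_FGD);
    return m3;
}
```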
Figure 4: Using the grabcut algorithm to extract the floor
mask (on the right). The user roughly sketches the green
line (best seen in color) around the floor and the algorithm
extracts the dominant floor pixels.
5.3 Vanishing Point Extraction
After acquiring the image and an estimate of where the floor lies, the scene must be analyzed in search of the remaining features (φ1 and φ2, as stated in g).
The vanishing points (φ1) are crucial to the augmented reality system: virtual objects must fade towards the same horizon as the scene. If at least two major vanishing points are found, the field-of-view can be extrapolated, and the world orientation (φ2) of the scene (detailed in the next section) can be identified and emulated.
Our algorithm for finding candidate vanishing
points (Lee et al., 2009; Rother, 2002) is briefly de-
scribed in the Algorithm 1 code block. It analyzes the main lines of the scene and tries to extrapolate the three most common line directions, bearing in mind that parallel lines share common vanishing points. Figure 5 illustrates the current algorithm. The main difference in our algorithm is that we analyze the scene several times (e × h) with different parameters, namely e different thresholds and erode/dilate operations in the Canny filter, and h different gap tolerances in the Hough transform line detector. To be precise, the implementation of the Hough line detector currently used returns line segments. The choice of the best set of parameters is delayed until the last moment, when a score function evaluates which one gives better results.
The score function helps to decide which param-
eters give a better, uncluttered view of the scene. To
calculate s, the first score, all lines are classified by their slope as Horizontal (0° ± 5°), Vertical (90° ± 5°) and Oblique (everything else). Considering only the Oblique line segments, the number of intersections int(i) with other line segments is counted for each line.
Photo-basedMultimediaApplicationsusingImageFeaturesDetection
301
Algorithm 1: Generating Candidate Vanishing Points.
    Let I be the input image.
    Let E[e] be the vector of e images containing Canny edges of the scene with different thresholds.
    Let L[e][h] be the vector of e × h line sets containing different detected lines for each E[i].
    Let S ← 0 be the score of the best line set.
    Let S′ ← ∅ be the reference to the best line set.
    for i = 1 until e
        E[i] ← canny(I, cannyParam_i)
        for j = 1 until h
            L[i][j] ← houghLines(E[i], lineParam_j)
            s ← score(L[i][j]), tests if the line set has enough information or excessive clutter
            if s > S then S ← s and S′ ← L[i][j]
            end if
        end for
    end for
    Let V be the set of line intersections.
    for some random pairs of line segments (l_i, l_j) taken from S′ (RANSAC)
        t_ij ← intersection of (l_i, l_j)
        if constraints(t_ij) hold then add t_ij to V
        end if
    end for
    v ← clustering(V), verify where the most predominant intersections are, according to x, y and k
    return v, a vector with 1 to 3 points
Line segments with a large number of intersections are considered clutter and have a lower weight in s, while longer line segments receive an extra weight w(i):

s = Σ_{i=1..n} w(i) / (int(i) + 1) (4)
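A sketch of how this score could be computed for one candidate line set, assuming segments in OpenCV's Vec4i format (x1, y1, x2, y2). The slope tolerances and the weighting follow the description above; the use of segment length as the weight w(i) is an assumption, since the text only states that longer segments receive extra weight.

```cpp
#include <opencv2/opencv.hpp>
#include <cmath>
#include <vector>

// Slope in degrees (0..180) of a segment stored as (x1, y1, x2, y2).
static double slopeDeg(const cv::Vec4i& s) {
    double a = std::atan2(double(s[3] - s[1]), double(s[2] - s[0])) * 180.0 / CV_PI;
    return a < 0 ? a + 180.0 : a;
}

// Horizontal: 0 +/- 5 degrees, Vertical: 90 +/- 5 degrees, Oblique: the rest.
static bool isOblique(const cv::Vec4i& s) {
    double a = slopeDeg(s);
    bool horizontal = (a <= 5.0 || a >= 175.0);
    bool vertical = std::fabs(a - 90.0) <= 5.0;
    return !horizontal && !vertical;
}

// True if the two segments properly cross each other.
static bool segmentsCross(const cv::Vec4i& s1, const cv::Vec4i& s2) {
    cv::Point2d p1(s1[0], s1[1]), p2(s1[2], s1[3]);
    cv::Point2d q1(s2[0], s2[1]), q2(s2[2], s2[3]);
    auto cross = [](cv::Point2d o, cv::Point2d a, cv::Point2d b) {
        return (a.x - o.x) * (b.y - o.y) - (a.y - o.y) * (b.x - o.x);
    };
    double d1 = cross(q1, q2, p1), d2 = cross(q1, q2, p2);
    double d3 = cross(p1, p2, q1), d4 = cross(p1, p2, q2);
    return d1 * d2 < 0 && d3 * d4 < 0;
}

// Score s of a line set: long oblique segments with few crossings score higher.
double scoreLineSet(const std::vector<cv::Vec4i>& lines) {
    double s = 0.0;
    for (size_t i = 0; i < lines.size(); ++i) {
        if (!isOblique(lines[i])) continue;
        int intersections = 0;                      // int(i)
        for (size_t j = 0; j < lines.size(); ++j)
            if (i != j && segmentsCross(lines[i], lines[j]))
                ++intersections;
        double dx = lines[i][2] - lines[i][0];
        double dy = lines[i][3] - lines[i][1];
        double w = std::sqrt(dx * dx + dy * dy);    // w(i): weight by length
        s += w / (intersections + 1);               // Equation (4)
    }
    return s;
}
```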
Having chosen the line set with the best score s, that line set is analyzed for intersections between its lines. For efficiency reasons, only some pairs of lines are tested, using Random Sample Consensus (RANSAC). Each intersection tij is tested against several constraints, namely that the intersection should lie outside the two line segments and that the line segments should have a slope difference larger than 10 degrees.
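A sketch of this sampling step under the same segment representation. The intersection is computed on the infinite lines through each segment, and the two constraints from the text (point outside both segments, slope difference above 10 degrees) are checked before accepting it; the number of random samples is an illustrative parameter.

```cpp
#include <opencv2/opencv.hpp>
#include <algorithm>
#include <cmath>
#include <cstdlib>
#include <vector>

// Intersection of the infinite lines through two segments (x1, y1, x2, y2).
// Returns false when the lines are (nearly) parallel.
static bool lineIntersection(const cv::Vec4i& a, const cv::Vec4i& b, cv::Point2d& out) {
    double x1 = a[0], y1 = a[1], x2 = a[2], y2 = a[3];
    double x3 = b[0], y3 = b[1], x4 = b[2], y4 = b[3];
    double denom = (x1 - x2) * (y3 - y4) - (y1 - y2) * (x3 - x4);
    if (std::fabs(denom) < 1e-9) return false;
    double t = ((x1 - x3) * (y3 - y4) - (y1 - y3) * (x3 - x4)) / denom;
    out = cv::Point2d(x1 + t * (x2 - x1), y1 + t * (y2 - y1));
    return true;
}

// True if the point lies outside the segment (projection parameter outside [0, 1]).
static bool outsideSegment(const cv::Vec4i& s, const cv::Point2d& p) {
    double dx = s[2] - s[0], dy = s[3] - s[1];
    double u = ((p.x - s[0]) * dx + (p.y - s[1]) * dy) / (dx * dx + dy * dy);
    return u < 0.0 || u > 1.0;
}

// Slope in degrees (0..180), same helper as in the scoring sketch.
static double slopeDeg(const cv::Vec4i& s) {
    double a = std::atan2(double(s[3] - s[1]), double(s[2] - s[0])) * 180.0 / CV_PI;
    return a < 0 ? a + 180.0 : a;
}

// Sample random pairs of segments and keep the intersections t_ij that pass the constraints.
std::vector<cv::Point2d> candidateIntersections(const std::vector<cv::Vec4i>& lines,
                                                int samples = 2000) {
    std::vector<cv::Point2d> V;
    if (lines.size() < 2) return V;
    for (int k = 0; k < samples; ++k) {
        size_t i = std::rand() % lines.size();
        size_t j = std::rand() % lines.size();
        if (i == j) continue;
        double diff = std::fabs(slopeDeg(lines[i]) - slopeDeg(lines[j]));
        diff = std::min(diff, 180.0 - diff);
        if (diff <= 10.0) continue;                                  // slope constraint
        cv::Point2d t;
        if (!lineIntersection(lines[i], lines[j], t)) continue;
        if (!outsideSegment(lines[i], t) || !outsideSegment(lines[j], t)) continue;
        V.push_back(t);                                              // accepted t_ij
    }
    return V;
}
```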
Figure 5: Vanishing point analysis for Figure 4. Top right: Canny filter analysis. Bottom right: detected line segments, with oblique lines in black. Left: main line intersection points in green and the main vanishing point detected using nearest-neighbors clustering.

In the end, there should be a large number of two-dimensional points populating the scene. The three vanishing points are extracted using a nearest-neighbors approach to cluster the most common groups of points, as seen in Figure 6. Each group must have a minimum number of points for the vanishing point to be considered, and the vanishing point is the average position of the cluster points. Sometimes it may not be possible to find the three dominant vanishing points; if no vanishing point is found, the algorithm tries the line set from L with the next best score s.
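One simple way to realize this clustering, sketched as a greedy radius-based grouping; the text only states that a nearest-neighbors clustering with a minimum cluster size is used, so the radius and minimum count below are assumptions.

```cpp
#include <opencv2/opencv.hpp>
#include <vector>

// Greedy radius-based grouping of intersection points: repeatedly take the
// point with most neighbours, average that neighbourhood into a vanishing
// point candidate, remove it, and stop after at most three clusters.
std::vector<cv::Point2d> clusterVanishingPoints(std::vector<cv::Point2d> pts,
                                                double radius = 25.0,
                                                size_t minPoints = 10) {
    std::vector<cv::Point2d> vps;
    while (vps.size() < 3 && pts.size() >= minPoints) {
        size_t bestIdx = 0, bestCount = 0;
        for (size_t i = 0; i < pts.size(); ++i) {
            size_t count = 0;
            for (size_t j = 0; j < pts.size(); ++j) {
                double dx = pts[i].x - pts[j].x, dy = pts[i].y - pts[j].y;
                if (dx * dx + dy * dy <= radius * radius) ++count;
            }
            if (count > bestCount) { bestCount = count; bestIdx = i; }
        }
        if (bestCount < minPoints) break;      // no cluster with enough support
        cv::Point2d center = pts[bestIdx], sum(0, 0);
        std::vector<cv::Point2d> rest;
        size_t members = 0;
        for (size_t j = 0; j < pts.size(); ++j) {
            double dx = center.x - pts[j].x, dy = center.y - pts[j].y;
            if (dx * dx + dy * dy <= radius * radius) { sum += pts[j]; ++members; }
            else rest.push_back(pts[j]);
        }
        vps.push_back(cv::Point2d(sum.x / members, sum.y / members));  // cluster average
        pts.swap(rest);
    }
    return vps;  // one to three vanishing point candidates
}
```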
5.4 3D World Orientation
The real-world orientation (φ2) depends on the direc-
tion in which the picture was taken. Currently we are
considering that the Y axis (vertical) is fixed and the
X and Z are free. The goal is to find the X and Z ori-
entation of the scene in the photo. Most Manhattan
world indoor scenes have parallel walls, the ceiling is
usually parallel to the ground and corners are usually
intersections of perpendicular planes.
Knowing the main vanishing points allows rotat-
ing the 3D virtual camera model in order to be aligned
with the real world point of view. This means that
when the depth of an object increases it will fade not
to the center but to the direction of the vanishing point
that corresponds to the Z axis. Furthermore objects
should look like they are parallel to the elements in
the scene (i.e., walls). Using the main vanishing point
(the one with the largest point cloud) the virtual world is rotated around two axes to align the objects with the scene. The rotations depend on the field-of-view (fov) and on the main vanishing point vp: roty is the rotation around the Y axis and rotx is the rotation around the X axis.
δ = tan(π · fov/360) (5)
diffx = vpx − middlex, diffy = vpy − middley (6)
roty = arctan(diffx · δ / middlex) (7)
rotx = arctan(diffy · δ / middley) (8)
The above are only valid for small rotations (< 45 degrees) because for larger values the distortion is too large. The detection of the scene orientation (φ2) is dependent on the success of the algorithm in detecting φ1. Additionally, the virtual floor (φ3) is assigned to
be a plane in the world with a predefined negative y
coordinate. This value is defined considering the average height at which people take photos, but it can later be changed by the user in the AR tool.
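Equations (5)–(8) translate almost directly into code. The sketch below assumes middle is the image centre and that the field-of-view is given in degrees; the returned angles are in radians.

```cpp
#include <opencv2/opencv.hpp>
#include <cmath>

// Rotations (in radians) that align the virtual camera with the main
// vanishing point vp, following Equations (5)-(8).
void orientationFromVanishingPoint(const cv::Point2d& vp, const cv::Size& imageSize,
                                   double fovDegrees, double& rotX, double& rotY) {
    cv::Point2d middle(imageSize.width / 2.0, imageSize.height / 2.0);

    double delta = std::tan(CV_PI * fovDegrees / 360.0);   // Equation (5)
    double diffX = vp.x - middle.x;                        // Equation (6)
    double diffY = vp.y - middle.y;

    rotY = std::atan(diffX * delta / middle.x);            // Equation (7)
    rotX = std::atan(diffY * delta / middle.y);            // Equation (8)
    // As noted above, only valid for rotations smaller than about 45 degrees.
}
```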
6 MIXED REALITY SYSTEM
The augmented reality (AR) framework takes into
consideration the detected features: vanishing points (φ1), scene orientation (φ2) and floor (φ3). The cur-
rent prototypes explore several possibilities that can
be achieved in AR applications using CV high-level
features.
Using the 3D oriented model M (introduced in
section 4) it is possible to define a virtual floor where
users can lay 3D meshes. The users just have to push
the objects back and forth, or left and right, in a 2D
model that allows this freedom. This is an important
feature since not all users have the skills to correctly
visualize and place an object in a 3D world. Reliev-
ing the user from having to match the object with the
floor and from having to orient the object to match the
direction of the walls helps in the process. Figures 3
and 6 are examples of the 3D oriented world model
with superimposed 3D objects that represent virtual
furniture.
The introduction of virtual content should be done
seamlessly, and the user should not be aware of all
the processing that is involved in the application. From the user's perspective, the interface should be effective,
with some simple instruction hints. The objects and
drawings should be placed with the correct perspec-
tive.
6.1 Implementation Details
The current framework, prototypes, examples and experiments were tested on a dual-core 2.53GHz computer with 4GB of RAM. Some usability tests presented in section 7 were conducted on a tablet with a single 1.33GHz processor with 4GB of RAM. All
webcams were used with a resolution of 640 by 480
pixels. Imported photos had different resolutions with
a minimum of 640 by 480.
The system is currently implemented in C++. The algorithms take advantage of the OpenCV library (http://opencv.willowgarage.com), especially its data types, the Hough-transform implementation, several image filters and image descriptors. The interface and the image/camera input are supported by the openFrameworks library (http://www.openframeworks.cc/).
6.2 Virtually Reshape a Room
One application for the detected features is a furniture
testing system where users can photograph the place
where they wish to add the piece and test virtually
if the object fits and looks pleasant in that spot. One
hypothesis, explored by the current prototype (Figure 6), is to create guidelines around the virtual objects indicating whether the object is placed on the floor or aligned with the scene. A snapping system eases the task of placing objects in the correct position.
The current guidelines are created taking into ac-
count the detected features φ. The snapping to the
floor guideline, seen in yellow in Figure 6, is shown
when the object reaches the invisible plane represent-
ing the ground (see subsection 5.4). Additionally the
object can be tested to verify if it is inside the floor
area or above scene objects. Using the previously cal-
culated floor mask m3 (see subsection 5.2) it is possible to test if the object is really on the ground. Knowing the current projection matrix (Szeliski, 2010) P, the model-view matrix V and the world coordinates w of the object (w = (x, y, z, 1)), it is possible through

w′ = P V w (9)

to obtain w′, the screen coordinates of the object. Testing if the point belongs to the floor mask (i.e., if m3(w′) is positive) tells whether the object is on the floor or in the rest of the scene. This
technique can be used in different ways. The virtual
objects can collide with the walls, floor and scene ob-
jects or can snap to the limits of the floor. The current
prototype does not deal with occluded floors.
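A sketch of this test, assuming 4×4 matrices in OpenCV's Matx44d format and a floor mask m3 with the same resolution as the photo. The perspective divide and viewport mapping, which Equation (9) leaves implicit, are written out explicitly; all names are illustrative.

```cpp
#include <opencv2/opencv.hpp>

// Projects a world-space point w = (x, y, z, 1) with w' = P V w (Equation 9),
// maps it to pixel coordinates and checks it against the floor mask m3.
bool isOnFloor(const cv::Matx44d& P, const cv::Matx44d& V,
               const cv::Vec4d& w, const cv::Mat& floorMask /* m3, CV_8UC1 */) {
    cv::Vec4d wp = P * V * w;                    // Equation (9)
    if (wp[3] <= 0) return false;                // behind the camera

    // Perspective divide to normalized device coordinates, then to pixels.
    double ndcX = wp[0] / wp[3];
    double ndcY = wp[1] / wp[3];
    int px = int((ndcX * 0.5 + 0.5) * floorMask.cols);
    int py = int((1.0 - (ndcY * 0.5 + 0.5)) * floorMask.rows);   // flip Y for image coords

    if (px < 0 || py < 0 || px >= floorMask.cols || py >= floorMask.rows) return false;
    return floorMask.at<unsigned char>(py, px) > 0;   // positive mask value: on the floor
}
```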
Figure 6: Example of the introduction of virtual objects that
are aligned with the scene and snapped to the floor (best seen in color).
The other implemented guideline appears when
the object has the same orientation as the other main scene components. This guideline takes advantage of the scene orientation (φ2) feature and is shown in
green in Figure 6. This could be an important feature in applications, since virtual objects can be presented as if parallel to the walls and scene objects, improving the user's immersion and facilitating the positioning of objects in the scene.
6.3 Snake in Real-world Maze
Although conceptually simple, the three detected features (φ1, φ2 and φ3) can be combined to recreate several well-known applications, now with interaction with the real world. The snake game is a well-known computer game in which a snake travels through a maze with obstacles and walls. The goal is to never hit a wall or obstacle and to survive as long as possible. Usually there are bonus objects to catch which earn the player points but have the side effect of increasing the size of the snake and/or speeding up the game.
In order to showcase the detected features, a
proof-of-concept application was created: a snake
game based on a user image. It is a 3D snake game
where the user creates the maze in the real world. The
maze can be made of any material; it only needs to be
sufficiently different from the background.
To prepare the game, the player should create a
maze on an empty plane (e.g., a floor or a table). To take advantage of the scene orientation (φ2), the objects composing the maze should be orthogonal, as in a Manhattan world, which improves the detection. The user takes a picture of the scene from the desired position and roughly sketches the floor, as seen in Figure 7. This will be used to detect the floor (φ3) of the maze. This
process was already described in subsection 5.2.
The image is then analyzed to find the main van-
ishing points (φ1) and the scene orientation (φ2). Currently, there is a previous step of scale adjustment of the snake, but in the future this could be inferred from the size of the maze.
Finally the snake is positioned on the virtual plane
and reacts to the detected floor (the maze in this con-
text), as seen in Figure 7. The player loses when the snake leaves the maze. In this system the coordinates of the head of the snake are constantly transformed using Equation 9 and tested against the mask m3 of the detected floor. With this simple game the user can instantly change the level in a very tangible and direct way. As mentioned, any place can be used as a scenario for the gameplay. Social interactions can arise where one user creates the labyrinth and another tries to complete it or survive in it for a certain amount of time. Games can be played in improbable scenarios such as monuments, casual areas or intimate places.

Figure 7: Real-world maze made from sheets of paper. Maze selection is based on the floor selection presented in subsection 5.2. Using the room orientation and the floor mask m3, a virtual snake travels through the paper maze.
6.4 Other Possible Applications
The current prototypes implement the features de-
scribed in the previous section. The goal is to con-
struct a generic framework, with these and other fea-
tures, so that different augmented reality applications
can be explored. Examples include games that can
be played in 3D in a user-designated scenario, space
sharing and editing between known users or virtually
reshaping a room with virtual objects.
7 EXPERIMENTS AND RESULTS
The current system extracts several features from an
input image and presents an augmented reality in-
terface based on the same features. Several tests
(Nóbrega and Correia, 2012) have been made to assess the effectiveness of the detection algorithms used.
The detection of the floor mask m3 produces good results on most surfaces and images. Failures are usually related to low-contrast images and to non-Lambertian surfaces with high reflectance.
The detection of the vanishing points shows a high degree of success in Manhattan (Coughlan and Yuille, 1999) scenes. In order to assess the reliability of the
vanishing point detection presented in subsection 5.3,
GRAPP2013-InternationalConferenceonComputerGraphicsTheoryandApplications
304
the algorithm was tested with several images from dif-
ferent datasets. The system relies on the visual infor-
mation available in each image, especially orthogonal
lines. Other factors must be taken into consideration, such as image clutter, misleading lines and blurred images. Note that since the goal is to use this system in an interactive multimedia application, processing times
should be kept as low as possible.
To evaluate the detection system, a large number of images was collected and tested. For each image the two major vanishing points were annotated and the processing time registered. Next, a visual application was created to evaluate the results and their correctness. The application, seen in Figure 8, presents sequentially all the images from the dataset with three lines. The red line represents the most probable vanishing point detected; this will be the main vanishing point used to signal the orientation of the scene and of the virtual objects. The green line represents the second most probable vanishing point. The blue (horizontal) line represents the horizon line of the scene. Depending on the confidence level of the detection, the system may present only some of the lines and, in the worst-case scenario, none. The goal of this application is to have a user manually analyse each image and classify it as Correctly detected or Incorrectly detected. The operation is quickly performed with two keyboard keys.
The algorithm was tested on four different datasets, each with its own characteristics:
General: mixed collection of photos used to test the algorithm, which includes indoor and outdoor photos as well as old, blurred and pixelated images.
Lab: photos from the interior of a building and a
university campus.
Flickr-Mix: Mix of photos obtained from Flickr (http://www.flickr.com/) with keywords such as House, House Interior and Buildings. It includes some landscape images with nature.
YorkUrbanDB: The York Urban Line Segment Database (http://www.elderlab.yorku.ca/YorkUrbanDB/) is an online database used by Coughlan and Yuille (Coughlan and Yuille, 1999), which contains images of urban environments, consisting mostly of scenes from the campus of York University and downtown Toronto, Canada. Most images follow the Manhattan-world assumption.
The results from the classification using the appli-
cation presented in Figure 8 can be seen in Table 1
and Figure 9.
Figure 8: The system automatically detected the most prob-
able main vanishing point (red line) and the second most
probable (green line). The blue line represents the detected
horizon line. The above images are screenshots of the man-
ual application used to verify the correctness of the auto-
matic detection algorithm.
Table 1: Scene Orientation and Detection Data and Computational Processing Times.

                Photos   Total Time (s)   Avg. Time (s)   Successful Detections
General             53            59.47            1.12                      43
Lab                 35            54.87            1.57                      30
Flickr-Mix         317            908.4            2.87                     182
YorkUrbanDB        102           190.58            1.87                      81
The results show a high reliability of the algorithm
with most datasets having a success rate around
80% (average combined success 66.27%). The sys-
tem worked better with Manhattan scenes, especially
house interiors. Outdoor images of city buildings also
perform well. This is the reason for the good results in
the Lab and YorkUrbanDB datasets. The Flickr-Mix dataset was not so successful, mostly due to problems related to outdoor nature elements such as trees and landscapes, and to modern and unconventional architecture. The Gen-
eral dataset had images of lobbies, floors, tables and
spaces suitable for the presented AR prototypes, and
it also had a higher success rate.

Figure 9: Success rate for each dataset.
These results show that, although not 100% re-
liable in all situations, the current algorithm works
in enough situations to be useful in a mixed reality
interactive system. In a multimedia application, the
user can be instructed to take pictures of places with enough line information, or systems such as the one presented in subsection 5.1 can be used to increase the success rate of the detection. Additionally, since the
processing time is under 3 seconds, the user can al-
ways take or use another picture if the first one was
not correct.
8 CONCLUSIONS AND FUTURE
DIRECTIONS
This paper introduces a new approach to the use of re-
constructed models from single images in AR appli-
cations. The main contributions are a feature detection system and an AR interface where virtual objects react to the scene. To validate the ideas proposed in the framework, two working prototypes are presented, taking full advantage of the discussed high-level features. The first is a multimedia application where virtual furniture and other objects of interest can be placed in the context of a house interior. The second is an application that makes use of ordinary physical objects to create the different game levels of a virtual game. These are some of the possibilities that can
be achieved with this system.
In the future, additional features could be pursued
to enhance the applications. The vanishing points
and horizon detection can also benefit greatly from
the use of accelerometers to detect the direction of
the floor, although this limits the system to devices
that have these technologies (excluding, for example,
most laptops and desktops). Our goal is to further
explore these prototypes and develop better user in-
terfaces to fully explore all the possible interactions
and enhance the quality of the experience. We are
also planning to release the software as open source so that other developers can create more complex applications, such as games and photo-editing tools, that run seamlessly on devices with a camera. This
could provide valuable insights about the current ver-
sion of the framework and the path that should be fol-
lowed regarding future developments.
ACKNOWLEDGEMENTS
This work was funded by FCT/MEC, through grant
SFRH/BD/47511/2008, and CITI/FCT/UNL through
grant PEst-OE/EEI/UI0527/2011.
REFERENCES
Coughlan, J. J. M. and Yuille, A. L. A. (1999). Manhat-
tan World: Compass Direction from a Single Image
by Bayesian Inference. In Proceedings of the Inter-
national Conference on Computer Vision- Volume 2
(ICCV ’99), volume 2, pages 1–10. IEEE Computer
Society.
Crandall, D., Owens, A., Snavely, N., and Huttenlocher, D.
(2011). Discrete-continuous optimization for large-
scale structure from motion. In Computer Vision and
Pattern Recognition (CVPR’11), 2011 IEEE Confer-
ence on, pages 3001–3008. IEEE Computer Society.
Criminisi, A., Reid, I., and Zisserman, A. (2000). Single
View Metrology. International Journal of Computer
Vision, 40(2):123–148.
Debevec, P. E., Taylor, C. J., and Malik, J. (1996). Mod-
eling and Rendering Architecture from Photographs
: A hybrid geometry- and image-based approach.
In Proceedings of the 23rd annual conference on
Computer graphics and interactive techniques (SIG-
GRAPH ’96), pages 11–20. ACM.
Del Pero, L., Guan, J., Brau, E., Schlecht, J., and Barnard,
K. (2011). Sampling bedrooms. In Computer Vision
and Pattern Recognition (CVPR’11), 2011 IEEE Con-
ference on, pages 2009–2016. IEEE Computer Soci-
ety.
Forsyth, D. A. and Ponce, J. (2002). Computer Vision: A
Modern Approach, volume 54. Prentice Hall.
Furukawa, Y., Curless, B., Seitz, S. M., and Szeliski, R.
(2009). Manhattan-world stereo. In Computer Vision
and Pattern Recognition (CVPR’09), 2009 IEEE Con-
ference on, volume 0, pages 1422–1429. IEEE Com-
puter Society.
Gould, S., Fulton, R., and Koller, D. (2009). Decompos-
ing a Scene into Geometric and Semantically Consis-
tent Regions. In Computer Vision, 2009 IEEE 12th International Conference on (ICCV’09), pages 1–8. IEEE Computer Society.
GRAPP2013-InternationalConferenceonComputerGraphicsTheoryandApplications
306
Gupta, A. (2010). Blocks world revisited: Image under-
standing using qualitative geometry and mechanics.
Computer Vision – ECCV 2010, 125(1-2):482–496.
Gupta, A., Satkin, S., Efros, A. A., and Hebert, M. (2011).
From 3D scene geometry to human workspace. In
Computer Vision and Pattern Recognition (CVPR’11),
2011 IEEE Conference on, pages 1961–1968. IEEE
Computer Society.
Hedau, V. and Hoiem, D. (2009). Recovering the spatial
layout of cluttered rooms. In Computer Vision, 2009
IEEE 12th International Conference on (ICCV’09).
IEEE Computer Society.
Ishikawa, T., Thangamani, K., Kourogi, M., Gee, A. P.,
Mayol-Cuevas, W., Hyun, J., and Kurata, T. (2011).
Interactive 3-D indoor modeler for virtualizing service
fields. Virtual Reality.
Izadi, S., Newcombe, R. A., Kim, D., Hilliges, O.,
Molyneaux, D., Hodges, S., Kohli, P., Shotton, J.,
Davison, A. J., and Fitzgibbon, A. (2011). Kinect-
Fusion: real-time dynamic 3D surface reconstruction
and interaction. In ACM SIGGRAPH 2011 Talks,
page 23. ACM.
Karsch, K., Hedau, V., and Forsyth, D. (2011). Render-
ing synthetic objects into legacy photographs. ACM
Trans. Graph, 30(6):1–12.
Klein, G. and Murray, D. (2007). Parallel Tracking and
Mapping for Small AR Workspaces. In 2007 6th IEEE and ACM International Symposium on Mixed and Augmented Reality (ISMAR’07), pages 1–10.
Kowdle, A., Chang, Y., and Gallagher, A. (2011). Active
Learning for Piecewise Planar 3D Reconstruction. In
Computer Vision and Pattern Recognition (CVPR’11),
2011 IEEE Conference on, pages 24–26. IEEE Com-
puter Society.
Lai, J., Chen, C., Wu, P., and Kao, C. (2011). Tennis real
play: an interactive tennis game with models from real
videos. In Proceedings of the 19th ACM international
conference on Multimedia - MM ’11, volume 3, pages
483–492.
Lee, D. C., Hebert, M., and Kanade, T. (2009). Geomet-
ric reasoning for single image structure recovery. In
Computer Vision and Pattern Recognition (CVPR’09),
2009 IEEE Conference on, volume 0, pages 2136–
2143. IEEE Computer Society.
Liu, B. and Gould, S. (2010). Single image depth estima-
tion from predicted semantic labels. In Computer Vi-
sion and Pattern Recognition (CVPR’10), 2010 IEEE
Conference on. IEEE Computer Society.
Nóbrega, R. and Correia, N. (2012). Magnetic augmented
reality: virtual objects in your space. In Proceeding
of the 2012 International Working Conference on Ad-
vanced Visual Interfaces (AVI’12), pages 3–6. ACM.
Pollefeys, M., Koch, R., and Gool, L. V. (1998). Self-
calibration and metric reconstruction in spite of vary-
ing and unknown internal camera parameters. Com-
puter Vision and Pattern Recognition (CVPR’98),
1998 IEEE Conference on, pages 90–95.
Pollefeys, M., Nistér, D., Frahm, J. M., Akbarzadeh, A.,
Mordohai, P., Clipp, B., Engels, C., Gallup, D., Kim,
S. J., Merrell, P., Salmi, C., Sinha, S., Talton, B.,
Wang, L., Yang, Q., Stewénius, H., Yang, R., Welch,
G., and Towles, H. (2007). Detailed Real-Time Urban
3D Reconstruction from Video. International Journal
of Computer Vision, 78(2-3):143–167.
Rother, C. (2002). A new Approach to Vanishing Point De-
tection in Architectural Environments. Elsevier, (Jan-
uary 2002):1–17.
Rother, C. and Kolmogorov, V. (2004). GrabCut: Interac-
tive Foreground Extraction using Iterated Graph Cuts.
ACM Transactions on Graphics (TOG).
Saxena, A., Sun, M., and Ng, A. Y. (2009). Make3D: Learn-
ing 3D Scene Structure from a Single Still Image.
IEEE Transactions on Pattern Analysis and Machine
Intelligence, 31(5):824–840.
Simon, G. (2006). Automatic online walls detection for im-
mediate use in AR tasks. In Mixed and Augmented Reality (ISMAR 2006), pages 4–7. IEEE Com-
puter Society.
Simon, G., Fitzgibbon, A. W., and Zisserman, A. (2000).
Markerless tracking using planar structures in the
scene. Proceedings IEEE and ACM International
Symposium on Augmented Reality ISAR 2000, 9:120–
128.
Sinha, S. N., Steedly, D., and Szeliski, R. (2009). Piece-
wise planar stereo for image-based rendering. 2009
IEEE 12th International Conference on Computer Vi-
sion (ICCV’09), pages 1881–1888.
Szeliski, R. (2010). Computer vision: algorithms and ap-
plications. Springer.
Tillon, A. and Marchal, I. (2011). Mobile augmented re-
ality in the museum: Can a lace-like technology take
you closer to works of art? In Mixed and Augmented
Reality - Arts, Media, and Humanities (ISMAR-AMH),
2011 IEEE International Symposium On, pages 41–47. IEEE Computer Society.
Turcsanyi-Szabo, M. and Simon, P. (2011). Augmenting
Experiences Bridge Between Two Universities. In
Augmented Reality (ISMAR), 2011, pages 7–13.
Wagner, D., Reitmayr, G., Mulloni, A., Drummond, T.,
and Schmalstieg, D. (2010). Real-time detection
and tracking for augmented reality on mobile phones.
IEEE Transactions on Visualization and Computer
Graphics, 16(3):355–368.
Wang, H. and Gould, S. (2010). Discriminative Learning
with Latent Variables for Cluttered Indoor Scene Un-
derstanding. In Proceedings of the 11th European
conference on Computer vision: Part IV (ECCV’10),
pages 497–510. Springer-Verlag.
Yu, S. X. and Malik, J. (2008). Inferring spatial layout
from a single image via depth-ordered grouping. 2008
IEEE Computer Society Conference on Computer Vi-
sion and Pattern Recognition Workshops, pages 1–7.