A SPACE- AND TIME-EFFICIENT MOSAIC-BASED ICONIC

MEMORY FOR INTERACTIVE SYSTEMS

Birgit Möller and Stefan Posch

Institute of Computer Science

Martin-Luther-University Halle-Wittenberg

Von-Seckendorff-Platz 1, 06099 Halle (Saale) / Germany

Keywords:

visual memory, mosaic images, online processing, memory- and time-efﬁciency, polytopes, multi-resolution.

Abstract:

One basic capability of interactive and mobile systems to cope with unknown situations and environments is

active, sequence-based visual scene analysis. Image sequences provide static as well as dynamic and also 2D

as well as 3D information about a certain scene. However, at the same time they require efﬁcient mechanisms

to handle their large data volumes. In this paper we introduce a new concept of a visual scene memory

for interactive mobile systems that supports these systems with a space- and time-efﬁcient data structure for

representing iconic information. The memory is based on a new kind of mosaic image called a multi-mosaic and allows image sequences of stationary rotating and zooming cameras to be stored and processed efficiently. Its key features are polytopial reference coordinate frames and an online data processing strategy. The polytopes

provide euclidean coordinates and thus allow the application of standard image analysis algorithms directly

to the data yielding easy access and analysis, while online data processing preserves system interactivity.

Additionally, mechanisms are included to properly handle multi-resolution data and to deal with dynamic

scenes. The concept has been implemented in terms of an integrated system that can easily be included as

an additional module in the architecture of interactive and mobile systems. As one prototypical example for

possible ﬁelds of application the integration of the memory into the architecture of an interactive multi-modal

robot is discussed emphasizing the practical relevancy of the new concept.

1 INTRODUCTION

To build efﬁcient interactive computer systems able

to cope with unknown situations and environments

is a present-day research topic. Using active cam-

eras enables task-directed data acquisition and pro-

vides increased ﬂexibility to such interactive systems.

However, since active visual scene analysis also im-

plies the processing and storage of complete image

sequences rather than of single frames, also efﬁcient

algorithms for data extraction and analysis as well as

appropriate data structures for compact representation

of the sequences are necessary ingredients.

Mosaicing algorithms are one possible approach

for representing image sequences in a compact and

space-efﬁcient way. The basic idea is to fuse all im-

ages of a sequence into one single frame. The re-

sulting mosaic image covers the ﬁeld of vision of the

complete sequence and thus may be viewed as extend-

ing the camera’s ﬁeld of vision both in space and time.

In this sense, it supplies interactive systems with a vi-

sual scene memory. Processing of image data may

be done online during acquisition, but also at anytime

later. Thus, e.g., a detailed scene analysis might be

performed some time after image acquisition using

only the visual scene memory without need for a time

consuming physical rescan of the scene nor the neces-

sity to store all images ever acquired by the system.

During the last years a large number of mosaicing approaches have been published. Most of these works are directed towards the generation of high-quality mosaic images as needed, e.g., in computer graphics applications. However, adopting the published algorithms for use with interactive and mobile systems poses several problems, since the general conditions differ considerably and put special demands on mosaicing algorithms. Especially in mobile systems

most of the time only limited storage and computa-

tional resources are available. Thus it is, e.g., not pos-

sible to process whole sequences simultaneously, as

it is often done in existing approaches (e.g., (Davis,

1998; Sawhney et al., 1998)). Besides, such a strat-

egy would severely impede a system’s interactivity.

On the one hand a mosaic image is only available after

all images have been processed, and on the other hand

the system’s ability to immediately address user interactions is limited. Hence, within this field of application, mosaicing algorithms that work in online mode and provide data access at all times are mandatory, which prevents many existing approaches from being transferred easily to this field.

Möller B. and Posch S. (2006). A SPACE- AND TIME-EFFICIENT MOSAIC-BASED ICONIC MEMORY FOR INTERACTIVE SYSTEMS. In Proceedings of the First International Conference on Computer Vision Theory and Applications, pages 413-421. DOI: 10.5220/0001375204130421. SciTePress.

Regarding interactive systems not only online data

processing and anytime access but also easy access

with respect to the structures and representations of

the data are of signiﬁcant importance. Especially the

direct applicability of existing image analysis tech-

niques to the mosaic data enables interactive systems

to efﬁciently exploit the visual data without need for

any algorithmic adaptations. Consequently, since the majority of existing image processing algorithms assume a euclidean reference frame, such a frame is a prerequisite for mosaic images to be applicable in this context and to gain broad acceptance.

In this paper we present a new concept for a

mosaic-based visual scene memory that meets the

abovementioned requirements for use with interac-

tive mobile systems. It is currently capable of rep-

resenting image sequences of rotating and zooming,

but stationary cameras. Although mobile systems are

non-stationary this is not a severe restriction at all

since visual data of a scene can usually adequately

be represented based on a set of mosaic ’screenshots’

taken from different positions within the scene. The

most important features of the memory are its sup-

port for an incremental online-update of the data and

the direct applicability of standard image analysis al-

gorithms due to its euclidean coordinate frame (see

next paragraph). It allows a large ﬁeld of vision and

implements efﬁciently varying resolutions of image

data as required for zooming active cameras. This pa-

per focusses on the representation of static scene data

within the memory. Nevertheless, mechanisms to additionally include dynamic as well as more abstract representations of image data are also integrated; for details refer to (Möller and Posch, 2002).

One important decision in generating mosaic im-

ages is to choose a suitable reference coordinate sys-

tem for projecting the data. As one of our primary

goals is to support the application of conventional

image analysis techniques, our memory requires a

euclidean reference frame. Spheres or cylinders are usually chosen to represent the large fields of vision of stationary cameras, but they do not deliver such a euclidean reference frame. Furthermore, they are difficult to represent (Bishop and McMillan, 1995), and data access as well as registration and integration of new data in an online fashion using such coordinate frames often requires explicit view rendering (e.g., (Shum and Szeliski, 2000)). Thus, a uniform representation and handling of the mosaic images in registration and integration as well as for data access is favorable. Due to these requirements our approach is based on polytopes approximating a sphere. Projecting the data onto the tiles of a polytope enclosing the camera center makes it possible to represent the entire field of vision of the camera, while at the same time image distortions are reduced and euclidean coordinates are provided.

Image sequences are typically captured with vary-

ing zoom. Using a single mosaic image with ﬁxed

resolution is as a consequence not adequate for these

sequences. Hence, our memory is hierarchically or-

ganized and nests differently scaled incarnations of

a polytope. The resulting data structure consists of

different image planes arranged as polytopes and is

capable of representing different levels of resolution.

We call this enhanced mosaic a multi-mosaic.

The remainder of this paper is organized as follows. Section 2 outlines the basic mosaicing algorithms, while section 3 introduces our new multi-mosaic concept together with implementation details. Section 4 presents some results before an exemplary application in the field of human-machine interaction is considered in section 5. The paper finishes with a conclusion in section 6.

2 MOSAIC IMAGE BASICS

Mosaic images are usually generated following a two-

step strategy. First, for each image of a sequence pa-

rameters of a suitable motion model are estimated to

compensate for the camera motion (registration). Af-

terwards all images are warped towards a common co-

ordinate frame and integrated by fusing their color in-

formation. Both steps can either be performed in an

ofﬂine or an online fashion.

In the ﬁrst case, all images of a sequence are

processed simultaneously in registration and integra-

tion. This yields a globally optimal mosaic repre-

sentation which is only accessible after all images

have been integrated. In interactive and mobile sys-

tems image data becomes available incrementally and

hence it is straightforward to process the data by fol-

lowing a continuous online strategy. This is accom-

plished by registering and integrating each image sep-

arately into the evolving mosaic image. With each

new image the mosaic is updated and a complete data

representation can be provided after each registration

step. However, it should be noted that the result-

ing representation is only locally optimal since im-

age registration and integration can only rely on the

current frame and the mosaic itself. Hence, inconsistent parameters and integration errors due to error accumulation cannot be avoided completely. Nevertheless, even in long-term mosaicing, as aimed at by our approach, image quality is still sufficient to enable further data analysis.

Registration. The parameter estimation is based on

a suitable model for the camera motion. This model

VISAPP 2006 - IMAGE ANALYSIS

Figure 1: Example mosaic demonstrating distortions and uncontrollable image growth during mosaicing an image sequence

of a rotating camera. The data is projected onto a single image plane yielding an inadequate coordinate frame.

mainly depends on the degrees of freedom of the camera and is often chosen to be a pure translation or an affine coordinate mapping. In our application we

use stationary rotating and zooming cameras. Their

motion can best be modeled by homographies (Hart-

ley and Zisserman, 2000) whose parameters are esti-

mated using projective ﬂow (Mann and Picard, 1996).

The main idea is to calculate the optical flow between two images constrained by the projective motion model. However, due to the non-linearities within the homographies the algorithm operates in an iterative framework based on piecewise linear homography approximations. Furthermore, a resolution hierarchy (Bergen et al., 1992) is used to cope with large

offsets. To reduce the inﬂuence of accumulating er-

rors in online image registration as outlined in the pre-

vious section our mosaics are generated in frame-to-

mosaic mode (Burt and Anandan, 1994). Parameters

for each image are estimated with regard to a suitable

clip reprojected from the mosaic image generated so

far. Thus all images formerly integrated at least im-

plicitly inﬂuence the current estimation and parame-

ter quality is improved without processing the whole

sequence simultaneously.
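To make the motion model concrete, the following Python sketch builds the homography that relates two views of a camera that only rotates about its optical center and changes its focal length. It uses the standard relation H = K_dst R K_src^(-1). The simplified intrinsics (principal point at the image origin, square pixels), the function names and the restriction to a pan rotation are assumptions made for illustration; the sketch does not reproduce the projective flow estimation used in the paper.

```python
import math


def matmul(a, b):
    """Product of two 3x3 matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]


def intrinsics(f):
    """Simplified camera matrix: focal length f, principal point at (0, 0)."""
    return [[f, 0.0, 0.0], [0.0, f, 0.0], [0.0, 0.0, 1.0]]


def inverse_intrinsics(f):
    """Inverse of the simplified camera matrix."""
    return [[1.0 / f, 0.0, 0.0], [0.0, 1.0 / f, 0.0], [0.0, 0.0, 1.0]]


def pan_rotation(angle_rad):
    """Rotation about the vertical (y) axis, used here as an example motion."""
    c, s = math.cos(angle_rad), math.sin(angle_rad)
    return [[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]]


def rotation_zoom_homography(f_src, f_dst, rotation):
    """H = K_dst * R * K_src^-1 for a stationary rotating and zooming camera.

    `rotation` maps source camera coordinates to destination camera coordinates.
    """
    return matmul(intrinsics(f_dst), matmul(rotation, inverse_intrinsics(f_src)))


def apply_homography(h, x, y):
    """Map the point (x, y) through h using homogeneous coordinates."""
    u = h[0][0] * x + h[0][1] * y + h[0][2]
    v = h[1][0] * x + h[1][1] * y + h[1][2]
    w = h[2][0] * x + h[2][1] * y + h[2][2]
    return u / w, v / w
```

With an identity rotation the homography reduces to a pure scaling by f_dst / f_src, which corresponds to zooming; with equal focal lengths it describes a pure rotation.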

Integration. During image integration the color in-

formation of all sequence images is merged to give

single mosaic pixel values. This can be accomplished

by fusing the values of all image pixels that are pro-

jected onto a mosaic pixel, e.g., calculating an average

or median. However, in long-term mosaicing contin-

uously averaging pixel values causes image blurring

which is primarily due to small registration errors un-

avoidable in online mosaic generation. Thus, an inte-

gration method needs to be applied that provides high-

quality images even for a long period of time.

In the literature several quite sophisticated methods for (offline) mosaic image quality enhancement can be found (e.g., (Capel, 2004)). However, for efficiency reasons and because online integration is required, we rely on a simpler but equally appropriate strategy. One single image is selected as the source for each mosaic pixel so that averaging of pixel values is avoided. The source images are

selected based on the time stamps of the input im-

ages. Each mosaic pixel is assigned the pixel value

from that input image providing the most recent data.

Thus, new information is directly integrated whenever

it becomes available yielding a continuous data up-

date. This strategy results in a segmentation of the

mosaic image into regions with each region originat-

ing from a different sequence image. Due to changes

in the lighting conditions or camera exposure settings

visible seams in the mosaic image might appear at the

boundaries between different regions. They are elim-

inated by applying linear or sigmoid blending func-

tions along region boundaries.
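The integration rule described above can be sketched in a few lines of Python. The sketch assumes that the new frame has already been warped into mosaic coordinates and comes with a validity mask; the per-pixel time stamp array, the function names and the particular sigmoid parameterization of the blending weight are illustrative assumptions, not details taken from the implementation.

```python
import math


def integrate_latest(mosaic, stamps, warped, valid, timestamp):
    """Copy pixels of a warped frame into the mosaic where they are the newest data.

    mosaic, stamps, warped and valid are equally sized 2D lists; stamps holds
    the time stamp of the frame that currently provides each mosaic pixel.
    """
    for r, row in enumerate(warped):
        for c, value in enumerate(row):
            if valid[r][c] and timestamp >= stamps[r][c]:
                mosaic[r][c] = value
                stamps[r][c] = timestamp


def sigmoid_weight(distance, width):
    """Weight of the newer region at a signed distance from a region boundary.

    Negative distances lie inside the older region, positive ones inside the
    newer region; `width` controls how soft the transition is.
    """
    return 1.0 / (1.0 + math.exp(-distance / width))


def blend(old_value, new_value, weight):
    """Mix two pixel values with the given weight of the newer value."""
    return (1.0 - weight) * old_value + weight * new_value
```

Because exactly one frame provides each pixel, no averaging takes place; blending is confined to a narrow band around region boundaries.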

3 MULTI-MOSAICS

Mosaic images share a wide variety of applications

ranging from image-based rendering in computer

graphics and virtual reality to computer vision appli-

cations and visual scene analysis. Depending on cam-

era motion and intended area of application, which

in our case is given by mobile interactive systems, a

suitable reference coordinate system has to be chosen

for the mosaic images as already mentioned. In most

work it is deﬁned as a single image plane, e.g., (Mann

and Picard, 1996). However, in the case of a stationary camera performing large rotations, projecting all images onto a single plane usually results in undesirably large distortions (fig. 1). They cause an excessive

growth of the mosaic image area and consequently

enforce extensive use of pixel interpolations which

results in low image quality. This in turn hampers

registration and integration and renders image analy-

sis nearly impossible. Using a cylinder (Bishop and

McMillan, 1995) or sphere (Coorg and Teller, 2000)

as coordinate frame avoids distortions, but dealing

with spherical coordinates in image registration, in-

tegration and especially in further analysis steps is

bulky and often yields undesirable incompatibilities

to existing software modules.

3.1 Polytopial Coordinate Frames

Both requirements, representing the complete field of vision of rotating cameras and providing euclidean coordinates for image processing, can ideally be met by employing polytopes to define the reference coordinate frame in registration and integration as well as in the representation of the mosaic images themselves. Up to now such frames have primarily been used in offline rendering applications (e.g., (Shum and Szeliski, 2000)); however, they also offer great flexibility for online scene modelling and representation tasks as discussed in this paper.

Figure 2: Left, polytopial coordinate frame with FIP at-

tached, and right, hierarchical structure of visual scene

memory based on nested polytope incarnations.

The polytopes are centered at the optical center of

the camera with their tiles regularly arranged around

tangentially to the sphere (ﬁg. 2, left). The origin of

the 3D coordinate frame for the polytope is located at

the center of the camera and for convenience its z axis

is arranged parallel to the optical axis of the camera

when acquiring the ﬁrst image of the sequence. Each

tile of the polytope owns a local orthogonal 2D co-

ordinate system. These coordinate frames as well as

projective transformations between neighboring tiles

are computed ofﬂine during an initialization step. All

transformations and the neighborhood relations are

represented in a graph data structure (ﬁg. 3) providing

easy data access and saving time in online generation

and update.
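A possible layout of such a graph is sketched below in Python. Each tile stores its local axes and a dictionary of edges to neighboring tiles, labeled with the homography that maps coordinates on one tile to coordinates on the other by central projection through the camera center. The class and function names, the row-wise basis layout and the two-tile example are assumptions for illustration; the formula H = D^(-1) B_dst^T B_src D with D = diag(1, 1, d) follows from intersecting viewing rays with tangent planes at distance d.

```python
from dataclasses import dataclass, field


def transpose(m):
    return [list(col) for col in zip(*m)]


def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]


@dataclass
class Tile:
    name: str
    basis: list  # rows u, v, n: in-plane axes and outward normal (orthonormal)
    edges: dict = field(default_factory=dict)  # neighbor name -> 3x3 homography


def tile_to_tile_homography(src, dst, distance):
    """Homography from 2D coordinates on src to 2D coordinates on dst.

    Both tiles are tangent planes at the same distance from the camera center.
    """
    d = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, distance]]
    d_inv = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0 / distance]]
    # dst.basis has rows u, v, n and therefore equals B_dst^T;
    # transpose(src.basis) has columns u, v, n and therefore equals B_src.
    return matmul(d_inv, matmul(dst.basis, matmul(transpose(src.basis), d)))


def connect(a, b, distance):
    """Store the neighborhood relation and both edge labels once, offline."""
    a.edges[b.name] = tile_to_tile_homography(a, b, distance)
    b.edges[a.name] = tile_to_tile_homography(b, a, distance)


def apply_homography(h, x, y):
    u = h[0][0] * x + h[0][1] * y + h[0][2]
    v = h[1][0] * x + h[1][1] * y + h[1][2]
    w = h[2][0] * x + h[2][1] * y + h[2][2]
    return u / w, v / w
```

Since all edge labels are computed once during initialization, online processing only needs dictionary lookups and 3x3 matrix applications.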

The scaling of the polytope is initially chosen according to the focal length of the camera, under the assumption that image pixels and pixels on the polytope share the same scaling and aspect ratio. The focal length is currently obtained using an offline calibration strategy based on a 3D calibration pattern (Hartley and Zisserman, 2000) and a functional mapping between hardware parameters and the corresponding focal length in online mode. In principle, self-calibration techniques (de Agapito et al., 1999)

Figure 3: Implicit representation of polytope geometry: the

graph data structure stores neighborhood relations in terms

of its edges as well as homographies valid between various

tiles in terms of corresponding edge labels.

might also be applied, but they have proven to be too

unstable in long-term mosaicing so far.
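How the focal length fixes the scale of the polytope can be illustrated with a short Python sketch. It assumes that each tile is a tangent plane whose distance from the camera center equals the focal length in pixels, selects the tile whose normal is closest to a given viewing direction, and intersects the viewing ray with that tile. The dictionary layout of a tile and the function names are hypothetical.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def best_tile(tiles, direction):
    """Tile whose outward normal encloses the smallest angle with the ray."""
    return max(tiles, key=lambda tile: dot(tile["normal"], direction))


def project_to_tile(tile, direction, focal_length):
    """Intersect a viewing ray with the tile plane and return local 2D coordinates.

    The tile plane is tangent to a sphere of radius focal_length (in pixels),
    so a pixel on the tile has the same scale as an image pixel.
    """
    cos_angle = dot(tile["normal"], direction)
    if cos_angle <= 0.0:
        raise ValueError("ray does not hit the front side of this tile")
    scale = focal_length / cos_angle
    point = [scale * c for c in direction]
    return dot(tile["u"], point), dot(tile["v"], point)
```

For a ray that deviates from the tile normal by an angle a within the plane spanned by the normal and the u axis, the resulting coordinate has magnitude f tan(a), which shows why distortions stay small as long as each tile covers only a limited range of angles.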

3.2 Online Registration and Integration

One problem during online generation of the multi-

mosaic representation results from discontinuities be-

tween neighboring tiles of the polytope. Obviously

the number of these discontinuities grows with the

number of tiles the polytope owns. On the other hand

this number should be large for a good approxima-

tion to the sphere and a reduction of distortions. We

typically use polytopes with about 20 to 30 tiles (e.g.

rhombicuboctahedrons, see ﬁg. 2). Nevertheless, efﬁ-

cient handling of the memory data structure requires

an elaborate approach for dealing with these disconti-

nuities, as presented below.

Integrating a new image into the multi-mosaic rep-

resentation requires registration of the image and pro-

jecting its data onto related tiles. For registration a

suitable reference image is needed which in frame-to-

mosaic mode is deﬁned as a clip of the current mo-

saic representation. This clip might be chosen as part

of the single tile that best approximates the orientation of the new image. However, in the worst case this results in an overlap of only slightly more than 50% between the clip and the image to be registered. Most of the time this is not sufficient to guarantee robust parameter estimation (and it disregards available information in any case). As an alternative, we can construct a larger reference frame by clipping not only

from one tile, but also from neighboring tiles. How-

ever, constructing such a composed reference image

and back projecting the integrated new data to the

polytope is a time-consuming procedure. In a naive

implementation this would have to be repeated for

each frame of the image sequence.

Focus Image Plane. We solve this problem by introducing an additional image plane, the so-called focus image plane (FIP). It serves as a kind of "cache" that stores recently acquired image data and grants easy access to it. The FIP is attached to the

polytope (ﬁg. 2, left) and used as reference in regis-

tration and integration, thus, masking the underlying

topological structure of the multi-mosaic representa-

tion. New image data is registered according to the

data on the FIP and also integrated into it following

the strategy mentioned in section 2. Integration into

the polytope itself is accomplished only when the po-

sition of the FIP needs to be updated. This is the case

if parts of the integration area of a new image do not

intersect with the domain of the focus image plane

any longer due to signiﬁcant discrepancies between

the orientation of the current image plane and the one

of the FIP. As the size of the FIP is usually chosen two

to three times larger than that of the input images this

occurs only after several images have been integrated

depending on speed and/or size of rotation angles of

the camera.

Focus Image Plane Update. To detect when the position of the FIP needs to be updated, the shortest distance between the integration area of the current image on the FIP and the FIP boundaries is monitored. If this distance falls below a certain threshold, the image data of the FIP is projected onto the related polytope tiles. These

are detected by projecting the rectangular bounding

box of available image data onto the polytope and cal-

culating intersections with domains of single tiles. A

pointer to the tile meeting the orientation of the FIP

best is always kept in memory for efﬁciency reasons

and used as the starting point for copying the data. If parts of the FIP data project outside the domain of a tile, the data update is continued recursively on neighboring tiles until the complete image data has been integrated into some tile of the polytope. In this procedure only tiles whose orientations differ by no more than 80 degrees from that of the FIP are checked.
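The recursive update can be pictured as a traversal of the tile graph that starts at the best-matching tile. The Python sketch below is a simplified illustration: tile normals and neighbor lists are passed as plain dictionaries, and the callback `receives_data`, which reports whether FIP data falls into the domain of a tile, is a hypothetical stand-in for the bounding box intersection test described above.

```python
import math

MAX_ANGLE_DEG = 80.0  # orientation limit stated in the text


def angle_deg(a, b):
    """Angle between two unit vectors in degrees."""
    cos_angle = sum(x * y for x, y in zip(a, b))
    return math.degrees(math.acos(max(-1.0, min(1.0, cos_angle))))


def tiles_to_update(start, fip_normal, normals, neighbors, receives_data):
    """Collect the tiles that take over FIP data, starting at the best tile.

    normals:       tile name -> unit normal of the tile
    neighbors:     tile name -> list of neighboring tile names
    receives_data: callable(tile name) -> True if FIP data projects onto the tile
    """
    visited = set()
    result = []

    def visit(name):
        if name in visited:
            return
        visited.add(name)
        # Tiles facing too far away from the FIP are not checked any further.
        if angle_deg(normals[name], fip_normal) > MAX_ANGLE_DEG:
            return
        if not receives_data(name):
            return
        result.append(name)
        for neighbor in neighbors[name]:
            visit(neighbor)

    visit(start)
    return result
```

The angle test prunes the recursion early, so only a small neighborhood of the start tile is visited in each update.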

Orientation and position of the FIP are updated af-

ter copying the data. The new parameters are chosen

according to the position and focal length of the cur-

rent input frame. Additionally the history of the mo-

tion path of the camera during the last frames is taken

into account. It is quadratically extrapolated based

on the assumption of smooth camera motion and thus

helps to minimize the overall number of FIP updates

necessary. Finally, new reference data is projected to

the new FIP. To efﬁciently identify the tiles providing

image data to the new FIP, the same strategy as for

copying data onto the polytope is applied.
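The quadratic extrapolation of the camera path can be illustrated as follows. The Python sketch fits a parabola through the last three orientation samples and evaluates it a few frames ahead. Describing the orientation by pan and tilt angles, using exactly three equally spaced samples and the function names are assumptions made for this example.

```python
def extrapolate_quadratic(p0, p1, p2, steps_ahead=1):
    """Extrapolate three equally spaced samples with the parabola through them.

    The samples are taken at t = 0, 1, 2; the value at t = 2 + steps_ahead is
    returned (Lagrange form of the interpolating polynomial).
    """
    t = 2 + steps_ahead
    return (p0 * (t - 1) * (t - 2) / 2.0
            - p1 * t * (t - 2)
            + p2 * t * (t - 1) / 2.0)


def predict_orientation(history, steps_ahead=1):
    """Predict (pan, tilt) from the last three entries of the motion history."""
    (pan0, tilt0), (pan1, tilt1), (pan2, tilt2) = history[-3:]
    return (extrapolate_quadratic(pan0, pan1, pan2, steps_ahead),
            extrapolate_quadratic(tilt0, tilt1, tilt2, steps_ahead))
```

Placing the new FIP ahead of the current frame along the predicted path keeps subsequent frames inside the FIP for longer, which is the reason the extrapolation reduces the number of FIP updates.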

3.3 Multi-Resolution Data

Representing image sequences that contain different levels of resolution puts special demands on a mosaic data structure. Usually only a single resolution can adequately be represented within one mosaic image. Integrating image data with higher resolution requires downsampling these data, causing a loss of information. Conversely, inserting low-resolution data into a mosaic with higher resolution requires interpolating the data and thus enlarges the data volume without gaining more information.

In our memory several differently scaled instances

of the polytope are nested into each other (ﬁg. 2, right)

covering a discrete set of resolutions. Depending on

the current focal length of the camera the polytope en-

tity is chosen for data integration that best meets the

focal length of the current input image. The granu-

larity of the resolution hierarchy can freely be con-

ﬁgured by the user and is, thus, highly ﬂexible. In

particular, each distance between adjacent levels can

individually be deﬁned according to the required level

of representation detail in certain resolutions.
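Selecting the polytope instance for a new image can be sketched as a nearest-neighbor search over the configured levels. In the Python sketch below, each level is described by the focal length it is scaled for, and closeness is measured on a logarithmic scale so that it depends on the scale ratio. Both the distance measure and the example level values are assumptions; the text only states that the level best meeting the current focal length is chosen.

```python
import math


def select_level(level_focal_lengths, focal_length):
    """Index of the level whose nominal focal length is closest in scale ratio."""
    return min(
        range(len(level_focal_lengths)),
        key=lambda i: abs(math.log(level_focal_lengths[i] / focal_length)),
    )


# Hypothetical, unevenly spaced resolution hierarchy (focal lengths in pixels).
LEVELS = [500.0, 1000.0, 2500.0]
```

The uneven spacing of the example levels reflects that the distance between adjacent levels can be configured individually.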

In contrast to standard and commonly used multi-scale representations like, e.g., Gaussian resolution pyramids or wavelets, the proposed structure provides direct data access in all resolutions without need for intermediate image reconstruction. Furthermore, the representation of image data may be restricted to only a few levels of resolution and need not be carried out for all available resolutions. On each level

only data is represented that was actually provided by

the camera. Consequently, the memory allows simul-

taneous representation of image data of a single scene

part that might have been acquired at completely dif-

ferent points in time and with varying zoom.

3.4 Sparse Memory Representation

Integrating an image sequence into a mosaic image

signiﬁcantly reduces the amount of iconic data to be

represented in the multi-mosaic. However, covering

the whole potential ﬁeld of vision at all resolutions

still requires a lot of memory space. This can cause

performance problems especially in mobile systems.

Since most of the time not all parts of a scene are actually scanned and explored in all resolutions anyway, the space needed can be reduced by using a sparse memory representation: only those tiles of a multi-mosaic that actually contain image data are instantiated. Hence, memory for a single tile is allocated only after the camera has scanned the corresponding regions of the scene and data needs to be stored.

Although restricting the representation to tiles that actually contain data reduces the memory space needed, levels of high resolution call for an additional segmentation of tiles into subcells. In these levels the size of a single tile usually exceeds the size of the acquired images several times over, and only a few regions of interest need to be represented. Thus, the single tiles are further segmented into subcells. Their actual number is chosen individually for each tile and is derived from the ratio of the input image size to the tile size so that the number of subcells that have to be checked on integrating

Figure 4: Multi-mosaic resulting from a scene scan covering approximately 180° in horizontal direction, rendered using Open Inventor. Black regions indicate areas where waste memory was allocated, and gray regions were added to better illustrate the underlying structure of the complete polytope representation.

new image data is kept small. Subcells where mem-

ory has to be allocated are determined by projecting

rectangular bounding boxes of available image data

onto the tiles. This leads to a few regions on the polytopes where memory is allocated without image data being present (black regions in the example images). Nevertheless, we prefer this scheme to a polygonal approximation of valid image regions, since the latter complicates memory handling and is less time efficient.
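The lazy allocation of subcells can be sketched with a dictionary keyed by subcell index. In the Python sketch below a tile allocates pixel buffers only for those subcells that are touched by the rectangular bounding box of newly projected image data. The class name, the square subcells of fixed size and the three-channel byte buffers are assumptions for illustration.

```python
class SparseTile:
    """One polytope tile whose pixel memory is allocated per subcell on demand."""

    def __init__(self, width, height, cell_size, channels=3):
        self.width = width
        self.height = height
        self.cell_size = cell_size
        self.channels = channels
        self.cells = {}  # (column, row) -> bytearray, created lazily

    def cells_for_bbox(self, x_min, y_min, x_max, y_max):
        """Subcell indices touched by the box [x_min, x_max) x [y_min, y_max)."""
        # Clip the bounding box to the domain of the tile.
        x_min, y_min = max(0, x_min), max(0, y_min)
        x_max, y_max = min(self.width, x_max), min(self.height, y_max)
        if x_min >= x_max or y_min >= y_max:
            return []
        cols = range(x_min // self.cell_size, (x_max - 1) // self.cell_size + 1)
        rows = range(y_min // self.cell_size, (y_max - 1) // self.cell_size + 1)
        return [(c, r) for r in rows for c in cols]

    def allocate(self, x_min, y_min, x_max, y_max):
        """Allocate all subcells under the bounding box; return how many were new."""
        new_cells = 0
        for key in self.cells_for_bbox(x_min, y_min, x_max, y_max):
            if key not in self.cells:
                size = self.cell_size * self.cell_size * self.channels
                self.cells[key] = bytearray(size)
                new_cells += 1
        return new_cells

    def allocated_bytes(self):
        """Memory currently held by this tile, in bytes."""
        return sum(len(buffer) for buffer in self.cells.values())
```

Because whole subcells are allocated for a bounding box, some allocated pixels may never receive image data, which corresponds to the black regions mentioned above.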

4 IMPLEMENTATION & RESULTS

The concept of the multi-mosaics presented in this pa-

per has been implemented in terms of an integrated

system that might be included as an additional module

in the architecture of interactive systems. In this section two exemplary memory representations are discussed in detail, while section 5 presents a prototypical integration of the new concept into the architecture of a mobile robot.

Figure 4 shows a multi-mosaic representation

based on image data including only one single level

of resolution. The scene represented is the same as in

ﬁgure 1, but this time the image data is projected onto

an adequately scaled rhombicuboctahedron. Large

distortions are no longer present in the mosaic and the

image quality is sufﬁcient to allow for image analy-

sis algorithms to be applied directly to the mosaic. It

should be mentioned that registration and integration errors cannot be avoided completely over time. Especially in long-term mosaicing small errors accumulate and could only be eliminated by registering all images simultaneously. This is not feasible for an online processing strategy, but errors are reduced by applying the frame-to-mosaic mode as described. Furthermore, the integration heuristic of copying new data region-wise to the mosaic minimizes the effects of small errors, but causes some blurring at region boundaries within the mosaic. Since most image analysis algorithms apply smoothing to the data anyway, the mosaic provides sufficient quality for image processing.

Figure 5: Hierarchical representation of a scene explored.

Two areas of the scene (whiteboard and shelves) have been

scanned in detail as apparent in the levels of higher resolu-

tion of the multi-mosaic image.

An example for the multi-resolution visual memory

is shown in ﬁg. 5. The camera ﬁrst scanned the whole

scene in coarse resolution to get an overview. Sub-

sequently the whiteboard and shelves were manually

selected for detailed exploration. The corresponding

representations for different resolution levels of the

shelf are magniﬁed in ﬁg. 6. Although the books in

the shelf are visible even in coarse resolution, their ti-

tle and other details are only accessible by zooming

in. The same holds, e.g., for text written on the white-

board. Contrarily, neither the whiteboard itself nor

other parts of the shelf need to be represented in larger

detail. The memory, thus, yields a compact represen-

tation of the scene requiring only minimal memory

resources to store important scene data.

5 MULTI-MOSAICS IN PRACTICE

The multi-mosaic data structure yields an efﬁcient

iconic representation of image sequences as acquired

by interactive and mobile systems. Due to the online

data processing and the euclidean coordinate frame

the memorized data provides a suitable base for more

efﬁcient image sequence analysis in interactive and

particularly in mobile systems. Such systems often

explore new environments by ﬁrst collecting data and

Figure 6: Two example images of low- (left) and high-

resolution data (right) extracted from the multi-mosaic in

ﬁg. 5. Details are only visible in high resolution, however,

there are usually only few sections of a scene where such

details need to be represented for scene analysis anyway.

then building some kind of map to be subsequently

used in navigation. However, maps are most of the

time not based on visual input and vision-based ap-

proaches have only been used in a few publications

(e.g. (Ishiguro and Tsuji, 1996)). This is mainly due

to the fact that pure navigation and localization tasks

can often be better solved relying on more robust

range data. However, if a mobile robot is supposed to

perform some tasks of scene analysis as well, visual

information is indispensable. Moreover it is essen-

tial that the visual data is adequately represented and

easily accessible, ruling out rather indirect representa-

tions like, e.g., wavelets or mipmaps. One large ﬁeld

of application for such kind of representations is intu-

itive human-machine-interaction where visual data is

one of the most important sources of information.

As a prototypical example for such an application

the visual memory has been integrated into the architecture of a mobile robot (Möller et al., 2005).

The robot is supposed to support a human user in his

everyday life at home by performing tasks like, e.g.,

searching and fetching objects (home tour scenario).

Due to the fact that tasks the robot has to perform are

highly user-dependent and may exhibit a large vari-

ance one key capability of the robot is autonomous

learning of new tasks by instruction. Therefore the

robot provides the human user with intuitive multi-

modal communication facilities like speech and ges-

ture recognition as well as processing of visual in-

formation for efﬁciently analyzing and understanding

human instructions. Besides, another important in-

gredient in multi-modal learning is object recognition

and learning since nearly all tasks more or less deal

with objects. The robot solves this task following an

appearance-based object recognition strategy.

In appearance-based object recognition, objects are recognized by matching current images of the object against previously acquired views of different objects stored in the scene model, e.g., using appropriate measures of image correlation. The more such views are available, the better in principle the final recognition results will be. However, acquiring different views of an object is sometimes quite time-consuming and difficult for a mobile robot. In the worst case it has to manoeuvre completely around the object to get these views, which on the one hand might be difficult, e.g., due to obstacles on the ground, and on the other hand limits the robot’s interactivity during this period of time, boring the human user.
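As an illustration of matching by image correlation, the following minimal sketch uses normalized cross-correlation, one common correlation measure. The paper does not state which measure the robot uses, so this choice and the names `ncc`, `recognize`, and the toy patches are assumptions made purely for illustration.

```python
import math


def ncc(a, b):
    """Normalized cross-correlation of two equally sized grayscale patches."""
    fa = [p for row in a for p in row]
    fb = [p for row in b for p in row]
    if len(fa) != len(fb):
        raise ValueError("patches must have the same size")
    ma = sum(fa) / len(fa)
    mb = sum(fb) / len(fb)
    num = sum((x - ma) * (y - mb) for x, y in zip(fa, fb))
    den = math.sqrt(sum((x - ma) ** 2 for x in fa)
                    * sum((y - mb) ** 2 for y in fb))
    # A constant patch has zero variance; treat it as uncorrelated.
    return num / den if den > 0 else 0.0


def recognize(query, stored_views):
    """Return (label, score) of the stored view that correlates best."""
    label, view = max(stored_views, key=lambda item: ncc(query, item[1]))
    return label, ncc(query, view)


# Hypothetical toy data: two stored 2x2 views and one noisy query image.
views = [
    ("cup", [[10, 200], [10, 200]]),
    ("book", [[200, 200], [10, 10]]),
]
query = [[20, 220], [25, 210]]
label, score = recognize(query, views)
```

With more stored views per object, the maximum in `recognize` is taken over a larger set, which is why additional views tend to improve recognition in principle.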

To cope with these problems of view acquisition we adopt multi-mosaic images as a kind of visual memory in the robot’s architecture. This idea originates from the observation that the robot sometimes just idles while waiting for a communication partner. During this time the robot already gathers visual information about the scene and especially about the objects it contains. However, at the time of acquisition this visual information is irrelevant to the robot and, hence, is usually discarded, which makes it necessary to rescan the scene later when the data is actually needed. This can be avoided by having the robot build up an iconic mosaic-based visual map of the environment during idle periods. In this way all visual data ever acquired is kept in a compact and space-efficient way, allowing the robot to refer to it afterwards in concrete communication and learning situations. At the same time expensive hardware-based re-explorations of a scene are avoided. Figures 7 and 8 show an example of such a representation, extracted views of an object to be learned and, finally, results from a post-processing step.
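The idea of storing mosaics during idle periods and cutting object views out of them later could be organized roughly as sketched below. The class names `MultiMosaic` and `SceneMemory`, the indexing by acquisition position, and the rectangular subimage extraction are hypothetical simplifications, not the data structures of the actual system.

```python
from dataclasses import dataclass, field


@dataclass
class MultiMosaic:
    """One mosaic acquired at a fixed robot position (hypothetical container)."""
    position: tuple  # (x, y) world position of the robot during acquisition
    pixels: list     # 2D list of gray values standing in for the mosaic image

    def extract_view(self, top, left, height, width):
        """Cut a rectangular subimage out of the mosaic."""
        return [row[left:left + width]
                for row in self.pixels[top:top + height]]


@dataclass
class SceneMemory:
    """Set of mosaics, indexed by the position at which each was acquired."""
    mosaics: dict = field(default_factory=dict)

    def store(self, mosaic):
        self.mosaics[mosaic.position] = mosaic

    def nearest(self, position):
        """Return the mosaic acquired closest to the given world position."""
        return min(
            self.mosaics.values(),
            key=lambda m: (m.position[0] - position[0]) ** 2
            + (m.position[1] - position[1]) ** 2,
        )


# Idle period: mosaics are stored although no task needs them yet.
memory = SceneMemory()
memory.store(MultiMosaic((0.0, 0.0),
                         [[r * 10 + c for c in range(6)] for r in range(4)]))
memory.store(MultiMosaic((3.0, 1.0), [[0] * 6 for _ in range(4)]))

# Later, during object learning: reuse stored data instead of rescanning.
view = memory.nearest((0.5, 0.2)).extract_view(top=1, left=2, height=2, width=3)
```

The point of the sketch is only that a later request is answered from stored data, so no camera or platform motion is needed at that time.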

Figure 7: An exemplary multi-mosaic image from the scene memory as generated from a 60° camera pan.

Figure 8: (a), (c) Subimages as extracted from various multi-mosaics during object learning; (b), (d) exemplary results of a subsequent color segmentation step.

The memory representation itself consists of a set

of various multi-mosaic images each acquired at a

speciﬁc position in the world. Since the concept of the


multi-mosaics forces the robot to remain at a single position during image acquisition, it is not possible to gather visual information while moving. Nevertheless, this is not a serious drawback. Most of the time a scene can be represented even better by scanning it from some key positions within the scene than by monitoring all the different paths the robot pursues. Besides, building mosaics while the robot is moving raises further problems. On the one hand, no closed-form solution for modelling the camera motion is available, since homographies do not hold in such situations; on the other hand, appropriate mosaic images such as manifold mosaics (Peleg et al., 2000) often exhibit (perspective) distortions of the scene data, hampering easy analysis and scene understanding. The multi-mosaics are currently used primarily to extract additional views of an object in object learning and recognition. However, they could also be used for extracting 3D data of the scene, provided that the 3D world positions of the mosaics are given as proposed, e.g., in (Teller, 1998).
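The remark on homographies can be made explicit with the standard pinhole camera relations (see, e.g., Hartley and Zisserman, 2000). The symbols below are introduced here for illustration only: $K$ and $K'$ denote the intrinsic calibration matrices of the two views, $R$ the relative rotation, $\mathbf{t}$ the relative translation, and $Z$ the depth of the observed scene point in the first camera frame.

```latex
\begin{align}
  \mathbf{x}' &\simeq K' R K^{-1}\, \mathbf{x}
      && \text{(rotation and zoom only)} \\
  \mathbf{x}' &\simeq K' R K^{-1}\, \mathbf{x} + \frac{1}{Z}\, K' \mathbf{t}
      && \text{(additional translation } \mathbf{t} \neq \mathbf{0}\text{)}
\end{align}
```

In the first case a single $3 \times 3$ matrix maps all pixels regardless of scene depth. In the second case the mapping depends on the depth $Z$ of each point, so for a general, non-planar scene no single homography relates the two images.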

6 SUMMARY AND CONCLUSIONS

Active scene analysis and exploration is gaining increasing importance in computer vision. Since analyzing image sequences of active cameras has proven a suitable basis for extracting useful information from a scene, interactive and mobile systems are nowadays often equipped with active sensing devices. The visual scene memory based on multi-mosaics presented in this paper fits naturally into this framework as an additional module between active data acquisition on the one hand and its analysis on the other. The memory is based on a polytopial reference coordinate system. In contrast to spheres and cylinders, the polytopes provide a Euclidean reference frame and, hence, allow the direct application of standard image analysis techniques. This is important for interactive systems since they can work on the memorized data just as on the originally acquired input images. Furthermore, based on this Euclidean mosaic representation and the chosen data processing strategy, the data within the memory is easily updated in an online fashion. Incremental parameter estimation and integration heuristics are used in combination with the focus image plane. The latter masks the underlying polytope structure of the memory and thus allows efficient data access despite the discontinuities present in the memory itself. Given these techniques the memory works quite stably in practice; nevertheless, future work is needed to investigate more robust online parameter estimation techniques and mechanisms for automatically detecting registration errors.
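To illustrate why a polytope yields planar, Euclidean coordinates, the following sketch projects a viewing direction onto the faces of a cube centered at the camera. The choice of a cube, the face labels, and the function name `direction_to_cube_face` are assumptions made for this example; the polytopes used by the memory may differ.

```python
def direction_to_cube_face(x, y, z):
    """Project a viewing direction onto a cube centered at the camera.

    Returns (face, u, v), where face is one of "+x", "-x", "+y", "-y",
    "+z", "-z" and u, v in [-1, 1] are planar coordinates on that face.
    """
    ax, ay, az = abs(x), abs(y), abs(z)
    if max(ax, ay, az) == 0:
        raise ValueError("direction must be non-zero")
    # The dominant component selects the face; dividing by it performs a
    # central projection onto the plane at distance 1 from the center.
    if ax >= ay and ax >= az:
        return ("+x" if x > 0 else "-x", y / ax, z / ax)
    if ay >= az:
        return ("+y" if y > 0 else "-y", x / ay, z / ay)
    return ("+z" if z > 0 else "-z", x / az, y / az)


face, u, v = direction_to_cube_face(2.0, 1.0, -1.0)
```

Each face acts as an ordinary perspective image plane, so straight lines in the scene remain straight within a face, unlike on a sphere or cylinder. The discontinuities occur only at the edges between faces.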

The memory is well suited for interactive and mobile systems that have to store and later access image sequences. In particular, systems for human-machine interaction benefit significantly from the memory, as it provides an improved and more efficient exploitation of available visual data and yields the higher flexibility necessary to act in dynamically changing environments as well as to communicate intuitively with human beings.

REFERENCES

Bergen, J., Anandan, P., Hanna, K., and Hingorani, R. (1992). Hierarchical model-based motion estimation. In ECCV, pages 237–252.

Bishop, G. and McMillan, L. (1995). Plenoptic modeling: An image-based rendering system. In SIGGRAPH Computer Graphics Proceedings, pages 39–46. Annual Conference Series.

Burt, P. and Anandan, P. (1994). Image stabilization by registration to a reference mosaic. In Image Understanding Workshop, pages 1:425–434, Monterey, CA.

Capel, D. (2004). Image Mosaicing and Super-resolution. Springer.

Coorg, S. and Teller, S. (2000). Spherical mosaics with quaternions and dense correlation. International Journal of Computer Vision, 37(3):259–273.

Davis, J. (1998). Mosaics of scenes with moving objects. In CVPR, pages (1):97–100, Santa Barbara, USA.

de Agapito, L., Hartley, R., and Hayman, E. (1999). Linear self-calibration of a rotating and zooming camera. In IEEE Int. Conference on Computer Vision and Pattern Recognition, pages 15–21.

Hartley, R. and Zisserman, A. (2000). Multiple View Geometry in Computer Vision. Cambridge University Press.

Ishiguro, H. and Tsuji, S. (1996). Image-based memory of environment. In Proc. of Int. Conf. on Intelligent Robots and Systems (IROS ’96), pages 634–639.

Mann, S. and Picard, R. (1996). Video orbits of the projective group: A new perspective on image mosaicing. Technical Report 338, MIT Media Laboratory Perceptual Computing Section.

Möller, B. and Posch, S. (2002). Analysis of object interactions in dynamic scenes. In Pattern Recognition, Proc. of DAGM Symp., LNCS 2449, pages 361–369, Zurich, Switzerland. Springer.

Möller, B., Posch, S., Haasch, A., Fritsch, J., and Sagerer, G. (2005). Interactive object learning for robot companions using mosaic images. In Proc. of Int. Conf. on Intelligent Robots and Systems (IROS), pages 371–376, Edmonton, Canada.

Peleg, S., Rousso, B., Rav-Acha, A., and Zomet, A. (2000). Mosaicing on adaptive manifolds. PAMI, 22(10):1144–1154.

VISAPP 2006 - IMAGE ANALYSIS


Sawhney, H. S., Hsu, S., and Kumar, R. (1998). Robust video mosaicing through topology inference and local to global alignment. In ECCV, pages 103–119, Freiburg.

Shum, H.-Y. and Szeliski, R. (2000). Systems and experiment paper: Construction of panoramic image mosaics with global and local alignment. IJCV, 36(2):101–130.

Teller, S. (1998). Toward urban model acquisition from geo-located images. In Proc. of Pacific Graphics, pages 45–51, Singapore.
