TOWARDS HUMAN INSPIRED SEMANTIC SLAM
Dominik Maximilián Ramík, Christophe Sabourin and Kurosh Madani
Signals, Images, and Intelligent Systems Laboratory (LISSI / EA 3956), Université Paris-Est
Senart Institute of Technology, Avenue Pierre Point, 77127 Lieusaint, France
Keywords: SLAM, Semantics, Humanoid robotics, Human inspired, Image segmentation, Scene interpretation.
Abstract: Robotic SLAM attempts to teach robots what human beings do nearly effortlessly: to navigate in an unknown environment and to map it at the same time. In spite of huge advances in this area, today's SLAM solutions are not yet ready to enter the real world. In this paper, we survey the state of the art in existing SLAM techniques and identify semantic SLAM as one of the prospective directions in robotic mapping research. We position our initial research in this field and propose a human-inspired concept of SLAM based on understanding the scene via its semantic analysis. First simulation results, using a virtual humanoid robot, are presented to illustrate our approach.
1 INTRODUCTION
In mobile robotics, the ability of self-localization is crucial: knowing precisely where the robot is and what kind of objects surround it at a given moment enables it to navigate autonomously. An informal definition describes Simultaneous Localisation And Mapping (SLAM) as a process in which a mobile robot explores an unknown environment, creates a map of it, and simultaneously uses that map to infer its own position. The real environment is usually complex, dynamic, and not easy to interpret; this complexity makes SLAM a challenging task. A comprehensive overview of today's most common SLAM techniques can be found in (Durrant-Whyte, et al., 2006a), (Durrant-Whyte, et al., 2006b) or (Muhammad, et al., 2009). Although significant advances have been achieved since its beginnings (Thrun, et al., 2008), SLAM is not yet a solved problem. Performing SLAM in dynamic environments (Hahnel, et al., 2003) and understanding the mapped environment by including semantics in maps (Nüchter, et al., 2008) are among the current challenges.
In this paper, the state of the art in SLAM is investigated. A relatively new field of research is identified, which attempts to perform SLAM with the aid of semantic information extracted from sensors. As one of the research interests of our laboratory (LISSI) is autonomous robotics, notably in relation to humanoid robots, we are convinced that research on semantic SLAM will bring a useful contribution. We position our initial research in this field, drawing our inspiration from the human way of navigating. Contrary to the precise and “global” approach of most current SLAM techniques, the human way of doing it is based on a very fuzzy description of the world and gives preference to the local surroundings of the navigator. A simulation using the humanoid robot Nao is presented to demonstrate some of the proposed ideas; the real Nao will be used in our further work.
The paper is organized as follows: Section 2 focuses on the state of the art in semantic SLAM. In the third section, our approach to image segmentation and scene interpretation is discussed. Section 4 gives an overview of our humanoid robotic platform. The fifth section presents our initial results, and the paper concludes with Section 6.
2 SEMANTIC SLAM
One of the latest research directions in the field of SLAM is so-called semantic SLAM. The concept may be perceived as very important for future mobile robots, especially humanoid ones, which will interact with humans and perform tasks in human-made environments. In fact, this interaction is one of the important motives for employing semantics in robotic SLAM, as humanoids are particularly expected to share their living space with humans and to communicate with them.
One way of adding semantics to SLAM is the introduction of human spatial concepts into maps. Humans usually do not use metric coordinates to locate themselves but rather object-centric concepts (“I am near the sink” rather than “I am at [12, 59]”), and they fluently switch between reference points rather than using global coordinates. Moreover, the presence of certain objects is often an important clue for recognizing a place. This problem is addressed in (Vasudevan, et al., 2007). There, the world is represented topologically, and place recognition is performed based on the probability of the presence of objects in an indoor environment. The work presents a study aimed at understanding the human concept of place recognition; it proposes that humans recognize places by the presence or absence of significant objects. Place classification by the presence of objects has also been used in (Galindo, et al., 2005), where low-level spatial information is linked to high-level semantics. Their robot interfaced with humans and performed tasks based on high-level commands, involving the robot's “understanding” of the meaning of place names for path planning. However, object recognition is black-boxed there. In (Persson, et al., 2007), a system is developed to map an outdoor area, generating a semantic map with buildings and non-buildings labelled. In (Nüchter, et al., 2008), a more general system is presented: a robot equipped with a 3D laser scanner moves in an indoor environment and constructs a 3D semantic map. The processing is based on Prolog clauses encoding pre-designed prior knowledge about the environment, enabling the robot to reason about it. In (Ekvall, et al., 2006), object recognition is performed by a robot equipped with a laser range finder and a camera; a semantic structure is extracted from the environment and integrated into the robot's map. Another semantic mapping technique, including an attention system, is shown in (Meger, et al., 2008).
3 IMAGE SEGMENTATION AND
SCENE INTERPRETATION
Section 2 showed the pertinence of semantic SLAM to state-of-the-art robotic mapping. It is on this field that we focus our research. Our motivation comes from the natural ability of human beings to navigate seamlessly in complex environments. To describe a place, we often use very fuzzy linguistic expressions and approximations (see (Vasudevan, et al., 2007)), in contrast to current SLAM algorithms. An interesting point is that people are able to infer the distance of an object from its apparent size and their experience of the object's true size. Recognition of objects and understanding of their nature are an integral part of this “human SLAM”. We believe that the application of semantics and human-inspired scene description could bring a considerable benefit to the development of robust SLAM applications for autonomous robotics.
For scene interpretation, the image has to be segmented first. Although many image segmentation algorithms exist (see (Lucchese, et al., 2001) for a survey), not all are suitable for mobile robotics, due to the need for real-time processing. We implement a fast algorithm that breaks the input image into parts containing similar colors, paying less attention to brightness. We have chosen the YCbCr color model, with the Y channel dedicated to the luminance component of the image and the other two channels, Cb and Cr, containing respectively the blue and the red chrominance components. Unlike RGB, the YCbCr model separates luminance and color into different channels, making it more practical for our purposes.
Our algorithm works in a coarse-to-fine manner. First, the contrast is stretched and a median filter is applied to the Cr and Cb components. Then the first available pixel not belonging to an already detected segment is chosen as a seed point. Eq. 1 captures how a seed point is used to extract the segment of interest S. P stands for all the pixels in the image, p is the currently examined pixel, and the seed pixel is denoted p_s. The predicate C(p, p_s) is true if there exists a four-connected path between its arguments, and I stands for a pixel's intensity. A pixel of the image belongs to the segment S if the difference between its intensity and that of the seed pixel is smaller than a threshold ε and it is four-connected to the seed pixel:

∀p ∈ P: C(p, p_s) ∧ |I(p) − I(p_s)| < ε ⇒ p ∈ S. (1)
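A minimal sketch of this preprocessing and region growing, in Python with OpenCV and numpy (the library choice, the threshold eps, and the filter kernel size are our assumptions; the paper does not specify them):

```python
import cv2
import numpy as np
from collections import deque

def preprocess(frame: np.ndarray):
    """Convert a BGR frame to YCrCb, then contrast-stretch and
    median-filter the chroma channels as described above."""
    y, cr, cb = cv2.split(cv2.cvtColor(frame, cv2.COLOR_BGR2YCrCb))
    stretch = lambda c: cv2.normalize(c, None, 0, 255, cv2.NORM_MINMAX)
    return y, cv2.medianBlur(stretch(cr), 5), cv2.medianBlur(stretch(cb), 5)

def grow_segment(channel: np.ndarray, seed: tuple, eps: int = 10) -> np.ndarray:
    """Region growing of Eq. 1: a pixel joins segment S if it is
    four-connected to S and its intensity differs from the seed
    intensity by less than eps (an illustrative threshold)."""
    h, w = channel.shape
    segment = np.zeros((h, w), dtype=bool)
    seed_val = int(channel[seed])
    queue = deque([seed])
    segment[seed] = True
    while queue:
        r, c = queue.popleft()
        for nr, nc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
            if (0 <= nr < h and 0 <= nc < w and not segment[nr, nc]
                    and abs(int(channel[nr, nc]) - seed_val) < eps):
                segment[nr, nc] = True
                queue.append((nr, nc))
    return segment
```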
Applying this on both chroma sub-images, we obtain segments denoted S_Cr and S_Cb. A new segment S is then obtained, following Eq. 2, as the intersection of the segments found on the two chroma sub-images, without the pixels already belonging to an existing segment (denoted S_all):

S = (S_Cr ∩ S_Cb) \ S_all. (2)
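Continuing the sketch above, one step of the coarse pass could then read as follows (with `cr`, `cb` the filtered chroma channels and `assigned` playing the role of S_all):

```python
# Grow from the same seed on both chroma channels, intersect the two
# masks (Eq. 2), and drop pixels already claimed by earlier segments.
s_cr = grow_segment(cr, seed)
s_cb = grow_segment(cb, seed)
segment = s_cr & s_cb & ~assigned   # S = (S_Cr ∩ S_Cb) \ S_all
assigned |= segment
```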
At the end of the scan, a provisional map of detected segments is available, but the image is often oversegmented. In a second step, all segments are sorted by area and, beginning with the largest one, the segmentation is run again. This time, the seed point is determined as the pixel of the segment's skeleton whose distance to its closest contour pixel is maximal. By this step, similar segments from the previous step are merged. The final step is the construction of a luminance histogram for each segment: if multiple significant clusters are found in the histogram, the segment is broken up accordingly to separate them.
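The sketch below continues the example above. The seed choice uses a distance transform, which is equivalent to taking the skeleton pixel farthest from the contour; the histogram-split rule and min_frac are our assumptions, since the paper does not detail its cluster test:

```python
# Seed for the refinement pass: the mask pixel farthest from the contour.
dist = cv2.distanceTransform(mask.astype(np.uint8), cv2.DIST_L2, 3)
seed = np.unravel_index(np.argmax(dist), mask.shape)

def split_by_luminance(y: np.ndarray, mask: np.ndarray,
                       min_frac: float = 0.2) -> list:
    """Break a segment in two when its luminance histogram holds two
    significant clusters; here we threshold at the mean Y value and
    split only if both halves are sizeable."""
    t = y[mask].mean()
    low, high = mask & (y <= t), mask & (y > t)
    if min(low.sum(), high.sum()) >= min_frac * mask.sum():
        return [low, high]
    return [mask]
```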
Next, the segments are labeled with linguistic terms describing their adjacency to each other, their horizontal and vertical position, and their span in the image. The average color, its variance, and the compactness Q of each segment are also computed, the latter following Eq. 3, where n denotes the area of the segment and o the number of contour pixels:

Q = 4πn / o². (3)
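Compactness could be computed as in the sketch below (the contour-pixel extraction detail is our choice; the paper does not specify it):

```python
import numpy as np

def compactness(mask: np.ndarray) -> float:
    """Isoperimetric quotient Q = 4*pi*n / o**2 (Eq. 3), where n is the
    segment area in pixels and o the number of contour pixels. A disc
    scores close to 1; ragged shapes score near 0."""
    n = int(mask.sum())
    # Contour pixels: mask pixels with at least one 4-neighbour outside.
    padded = np.pad(mask, 1)
    interior = (padded[:-2, 1:-1] & padded[2:, 1:-1]
                & padded[1:-1, :-2] & padded[1:-1, 2:])
    o = int((mask & ~interior).sum())
    return 4 * np.pi * n / (o ** 2) if o else 0.0
```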
These features are used in a set of linguistic rules encoding prior knowledge about the world. The aim is to determine the nature of the segments and their appurtenance to objects of the perceived environment. For example, a compact segment found at mid-height and surrounded by the wall is considered a “window”, small compact segments adjacent to the floor are denoted “boxes”, a wide-span greyish segment adjacent to the ceiling is labeled a “wall”, and so on.
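A toy rule base in this spirit might look as follows (all descriptor names and thresholds are illustrative, not the authors' actual rules):

```python
def label_segment(seg: dict) -> str:
    """Assign a semantic label from fuzzy descriptors computed earlier
    (position, span, compactness Q, colour, adjacency). Illustrative only."""
    if (seg["adjacent_to"] == "ceiling" and seg["h_span"] == "wide"
            and seg["colour"] == "greyish"):
        return "wall"
    if (seg["v_pos"] == "middle" and seg["Q"] > 0.6
            and seg["surrounded_by"] == "wall"):
        return "window"
    if (seg["adjacent_to"] == "floor" and seg["Q"] > 0.6
            and seg["size"] == "small"):
        return "box"
    return "unknown"
```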
4 NAO, THE HUMANOID ROBOT
The robotic platform we use is described in this section. It is based on Nao, a humanoid robot manufactured by Aldebaran Robotics (http://www.aldebaran-robotics.com). The robot is about 58 cm tall, weighs slightly over 4 kg, and has 25 DOF. Among other sensors, it is equipped with two non-stereo 640x480 px CMOS cameras. For simulations, a virtual version of Nao is available for the Webots simulation program developed by Cyberbotics (http://www.cyberbotics.com/). For development, we have chosen the URBI language created by Gostai (http://www.gostai.com/) and aimed specifically at robotics. It allows fast development of complex behaviours for robots and provides a simple way of managing parallel processes. LibURBI connectors allow the user to develop custom objects using the so-called UObject architecture and to plug them into the language. These objects can be developed in C++, Matlab, or Java. For the demo simulation presented in the next section, we used the simulated robot mentioned above; the real one will be used in our further research.
Figure 1: A view from the robot's random walking sequence. The left image is the original one; the right one shows the segments detected during the segmentation phase.

The task itself may not be perceived as strictly specific to humanoid robots. However, the motivation to use humanoid robots comes from the fact that they are designed specifically to interact with humans and to act in a human-made environment. The concepts we exploit here come from the human approach to navigation and orientation in space, so embedding such human-inspired semantic SLAM capabilities into a humanoid robotic platform seems pertinent to us.
5 RESULTS
As a demonstration of some of the principles mentioned above, we present a simulation using Webots, in which a virtual Nao walks through a room containing objects (cubes) of different colors. The YCbCr image acquired by Nao's front camera is segmented using the fast segmentation algorithm described in the preceding section (see Fig. 1). The processing takes several tens of milliseconds per 320x240 frame on a 2 GHz Intel Core 2 Duo CPU.

Once the image is segmented, all segments are labeled and interpreted by a set of prior-knowledge rules. Segments can even be merged using these rules, to cope with partial occlusions. The “semantic” information is then used to approximate the actual distance of objects. Given an object of type “window”, its typical size is looked up in memory (at this stage, the dimensions are known a priori, as the actual learning of object sizes is to be addressed in future work). The size information is used together with the apparent size of the object to compute its approximate distance (see Fig. 2).
This is described by Eq. 4 (simplified to the horizontal dimension only). The distance d to an object is its estimated real width w_real divided by the tangent of its apparent angular width, obtained from its width in pixels w_px, the horizontal field of view ϕ, and the width w_img of the image in pixels:

d = w_real / tan(w_px · ϕ / w_img). (4)
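A sketch of this estimate (the function and parameter names are ours):

```python
import math

def approx_distance(w_real: float, w_px: int, fov: float, w_img: int) -> float:
    """Eq. 4: rough distance from apparent size. The object's angular
    width is w_px * fov / w_img radians; dividing its assumed real width
    by the tangent of that angle approximates the distance. Meant only
    to decide "near" vs "far", not to be metrically exact."""
    return w_real / math.tan(w_px * fov / w_img)

# e.g. a window assumed 1 m wide, spanning 80 of 320 px under a 0.8 rad
# horizontal field of view, comes out at roughly 4.9 m.
print(approx_distance(1.0, 80, 0.8, 320))
```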
The aim of this calculation is not to infer the exact distance of an object, but rather to determine whether it is “far” or “near” in the context of the surrounding world. This can help in the subsequent creation of the map of the location. Giving up the precise metric position of every object in the mapped world and replacing it with rough metrics and human expressions like “near to” or “beside” is believed to enable faster and more robust SLAM algorithms. Using such “object landmarks” to navigate in an environment is certainly more meaningful than using, e.g., simple point features as in classical SLAM.
Precise metric information of course still has its role here, but only in specific cases such as avoiding close obstacles, approaching an object to grasp it, and notably when the robot is learning the typical sizes of objects so that it can infer their distance when they are seen again.
Figure 2: The same view as in Fig. 1 after the interpretation phase. Some of the detected objects are labeled. The opposite wall is also labeled with its approximate distance with respect to the robot.
6 CONCLUSIONS
State-of-the-art SLAM techniques have been discussed in this paper. In spite of great advances in past years, a generally usable SLAM solution is still missing. We identify the pertinence of semantic SLAM to the future of mobile robotics and present our initial research in this field, inspired by the human way of navigating and describing places. We show the concept of a prospective semantic SLAM algorithm driven by object recognition and the use of human spatial concepts.

For describing a scene by semantic means, a fast and efficient image segmentation algorithm is an important starting point; a part of our future work will be dedicated to the further development of such an algorithm. Another part will focus on developing the semantic SLAM algorithms outlined in this paper. They will subsequently be implemented and verified in an indoor environment on the real Nao robot.
REFERENCES
Durrant-Whyte, H. and Bailey, T. Simultaneous Localisation and Mapping (SLAM): Part I The Essential Algorithms. IEEE Robotics and Automation Magazine. 2006a, Vol. 13, No. 2, pp. 99-110.
Bailey, T. and Durrant-Whyte, H. Simultaneous Localisation and Mapping (SLAM): Part II State of the Art. IEEE Robotics and Automation Magazine. 2006b, Vol. 13, No. 3, pp. 108-117.
Ekvall, S., Jensfelt, P. and Kragic, D. Integrating Active Mobile Robot Object Recognition and SLAM in Natural Environments. International Conference on Intelligent Robots and Systems (IROS 2006). Beijing: IEEE, 2006, pp. 5792-5797.
Galindo, C., et al. Multi-Hierarchical Semantic Maps for Mobile Robotics. International Conference on Intelligent Robots and Systems (IROS 2005). Edmonton: IEEE, 2005, pp. 2278-2283.
Hahnel, D., et al. Map Building with Mobile Robots in Dynamic Environments. Proceedings of the IEEE International Conference on Robotics and Automation. Taipei: IEEE, 2003, Vol. 2, pp. 1557-1563.
Lucchese, L. and Mitra, S. K. Color image segmentation: A state-of-the-art survey. Proc. Indian Nat. Sci. Acad. (INSA-A). 2001, Vol. 67-A, pp. 207-221.
Meger, D., et al. Curious George: An attentive semantic robot. Robotics and Autonomous Systems. Amsterdam: North-Holland Publishing Co., 2008, Vol. 56, pp. 503-511.
Muhammad, N., Fofi, D. and Ainouz, S. Current state of the art of vision based SLAM. Image Processing: Machine Vision Applications II, Proceedings of the SPIE. 2009, Vol. 7251, p. 72510F.
Nüchter, A. and Hertzberg, J. Towards semantic maps for mobile robots. Robotics and Autonomous Systems. Amsterdam: North-Holland Publishing Co., 2008, Vol. 56, pp. 915-926.
Persson, M., et al. Probabilistic Semantic Mapping with a Virtual Sensor for Building/Nature detection. International Symposium on Computational Intelligence in Robotics and Automation. Jacksonville: IEEE, 2007, pp. 236-242.
Thrun, S. and Leonard, J. J. Simultaneous Localization and Mapping. [ed.] B. Siciliano and O. Khatib. Springer Handbook of Robotics. Berlin Heidelberg: Springer-Verlag, 2008, Chapter 37.
Vasudevan, S., et al. Cognitive maps for mobile robots: an object based approach. Robotics and Autonomous Systems. Amsterdam: North-Holland Publishing Co., 2007, Vol. 55, pp. 359-371.