Real-time People Detection and Mapping System for a Mobile Robot
using a RGB-D Sensor
Francisco F. Sales, David Portugal and Rui P. Rocha
Institute of Systems and Robotics, University of Coimbra, Coimbra, Portugal
Keywords:
People Detection, Mapping, Mobile Robot, RGB-D Sensor, ROS.
Abstract:
In this paper, we present a robotic system capable of mapping indoor, cluttered environments and, simul-
taneously, detecting people and localizing them with respect to the map, in real-time, using solely a Red-
Green-Blue and Depth (RGB-D) sensor, the Microsoft Kinect, mounted on top of a mobile robotic platform
running Robot Operating System (ROS). The system projects depth measurements onto a plane for mapping pur-
poses, using a grid-based Simultaneous Localization and Mapping (SLAM) approach, and pre-processes the
sensor’s point cloud to lower the computational load of people detection, which is performed using a classical
technique based on Histogram of Oriented Gradients (HOG) features, and a linear Support Vector Machine
(SVM) classifier. Results show the effectiveness of the approach and the potential to use the Kinect in real
world scenarios.
1 INTRODUCTION
One of the main uses of robots is to replace humans
in unpleasant situations, such as repetitive manufac-
turing tasks and dangerous environments. In these
harsh scenarios, robots are usually mobile and should
be able to explore, map and detect people. For example, in Search and Rescue (SaR) missions and after an industrial accident involving the leakage of toxic substances, they can be used to assist human first responders (Rocha et al., 2013).
Such missions are critical and of extreme impor-
tance because their accomplishment might save many
lives. As a consequence, human rescue teams are of-
ten subject to specialized training. However, they usually face a lack of technological equipment and put themselves at risk in the process. Thus, Robotics plays a
fundamental role by reducing this risk, and can be a
great resource to human rescue teams.
Detecting people and mapping the environment
are key tasks in Robotics for SaR missions and other
applications. Since these environments are usually dangerous, mobile robots must be endowed with appropriate locomotion skills and provide accurate results, and the whole system should be affordable given the risk taken in such harsh environments.
This work has been supported by the CHOPIN research project (PTDC/EEA-CRO/119000/2010) funded by "Fundação para a Ciência e a Tecnologia".
In this work, we use an RGB-D sensor, the Kinect,
on top of a mobile robotic platform running Robot
Operating System (ROS) (Quigley et al., 2009), to
map the environment and detect victims from visual
cues. To do so, we project depth measurements to 2D
and run a 2D Simultaneous Localization and Mapping
(SLAM) algorithm. At the same time, we pre-process
the point cloud and compute 3D clusters that might
contain people. Afterwards, we run a HOG-based
classifier on a corresponding portion of the coloured
image to assess the presence of people. Finally, we as-
sociate the obtained map and the detections to localize
people in the map. Although we use a Kinect sensor,
our approach can be applied with any RGB-D sensor.
Note however that, for outdoor scenarios, the Kinect
is unusable due to infra-red interference induced by
the sun, but if the depth measures are made available
by more capable sensors under those conditions, our
approach is still applicable.
This paper is organized as follows: Section 2 re-
views important related work; Section 3 presents the
proposed system; Section 4 describes the experimen-
tal setup and validates the mapping and people de-
tection modules; Section 5 presents and discusses the
results of the integrated system; and in Section 6 we
draw conclusions and suggest future work.
2 RELATED WORK
There has been considerable research on SLAM and
people detection with laser range finders (LRFs),
stereo cameras and, recently, with RGB-D sensors.
Surprisingly, it is not common to integrate both ef-
forts, i.e. building a map of the environment, localiz-
ing the robot with respect to the map and, simultane-
ously, identifying people within the environment. A
recent approach was proposed in (Soni and Sowmya,
2013). However, in contrast to our approach, it is not built around a single RGB-D sensor; such a single-sensor constraint requires performance-oriented approaches in order to conduct these tasks in near real-time while still obtaining sufficiently accurate results.
2.1 Mapping
Most popular 2D SLAM algorithms rely on probabilistic techniques to cope with noise and estimation errors. There
are some popular implementations based on Kalman
Filters and Particle Filters (Dissanayake et al., 2001).
An alternative approach is graph-based SLAM. In this
case, algorithms use the data to build a graph com-
posed of estimated poses, local maps and their rela-
tions, in order to compute a consistent global map.
ROS, the robotic framework used in this paper, already provides a set of 2D SLAM implementations, such as GMapping, HectorSLAM and KartoSLAM.
More recently, 3D mapping has also been inten-
sively studied. However, it often relies on stereo cam-
eras (Konolige and Agrawal, 2008), range scanners
(Triebel and Burgard, 2005), (May et al., 2009), or
monocular cameras (Clemente et al., 2007), thus re-
quiring heavy computation, including aligning con-
secutive frames, detecting loop closures, and the glob-
ally consistent alignment of all processed frames.
The approach used for frame alignment depends
on the data to process. However, the Iterative Clos-
est Point (ICP) algorithm is a popular technique for
3D mapping applications (Droeschel et al., 2009).
For stereo cameras, Scale-Invariant Feature Trans-
form (SIFT) features (Lowe, 2004), as well as fast
descriptors based on random trees (Calonder et al., 2008) computed for keypoints, such as Features from Accelerated Segment Test (FAST) (Rosten and Drummond, 2006), are often applied. Also, sparse feature points can be aligned over consecutive frames via
RANdom SAmple Consensus (RANSAC) (Fischler
and Bolles, 1981).
Regarding the loop closure problem, most tech-
niques rely on image matching between keyframes.
In graph-based techniques, whenever a loop closure is
detected, the correspondence between data frames can
be used as a constraint in the pose graph, which rep-
resents the spatial relationship between frames. The optimization of these pose graphs yields a globally aligned set of frames. In this context, bundle
adjustment (Triggs et al., 2000) simultaneously op-
timizes the pose graph and a map. Other alternatives
have also been explored, such as the g2o framework
(Kuemmerle et al., 2011).
With the recent widespread adoption of RGB-D sensors, most SLAM approaches have been adapted to sensors providing dense 3D depth data. This adaptation was required due to the limited field of view (FoV), usually around 60°, and the less precise
depth measurements. The first constraint can cause
problems in the ICP alignment due to the lack of spa-
tial structure, and only a few approaches have been
presented that can deal with this particular issue, e.g.,
the combination of a time-of-flight camera and a CCD
camera makes it viable to localize the robot (Prusak
et al., 2008).
Recently, with the popularization of RGB-D sen-
sors, an approach was presented which uses sparse
keypoint matches between consecutive RGB images
as an initialization to the ICP algorithm (Henry et al.,
2010). However, it has been concluded through ex-
perimentation that expensive ICP is not always re-
quired. Still, 3D mapping has clearly shown to require
more computational effort than 2D mapping.
2.2 People Detection from Visual Cues
People detection is important for various Robotics
applications. Much effort has been put into human-robot interaction over the past few years so that robots
can engage and interact with people in a friendly
way (Ferreira et al., 2013). Detecting and localizing
people is essential before initiating such interaction.
However, some of this research has relied solely on
2D visual information provided by cameras (Menezes
et al., 2003). Some methods involve statistical train-
ing based on local features, such as HOG (Dalal and
Triggs, 2005) and Edge Orientation Histogram (EOH) (Levi and Weiss, 2004), while other methods involve
extracting interest points in the image, such as SIFT
features. Recently, with the popularization of 3D sen-
sors, much research has been done on people detec-
tion. This is also important for intelligent vehicles to
avoid collisions. In this context, there is interesting
work, such as (Premebida et al., 2009), (Keller et al.,
2011), (Llorca et al., 2012).
Another relevant approach using 3D information
was proposed by (Satake and Miura, 2009), wherein
depth templates are used to detect the upper human
body. In (Bajracharya et al., 2009), a reduction of the
ICINCO2014-11thInternationalConferenceonInformaticsinControl,AutomationandRobotics
468
point cloud to a 2.5D map is performed to keep the computational effort low, with detection then based on different 2D features.
Later on, a method that combines both depth in-
formation and color images to detect people was in-
troduced (Spinello and Arras, 2011). A HOG-based
detector is used to identify human bodies from image
data and the Histogram of Oriented Depths (HOD)
method is introduced for dense depth data that derives
from HOG; and, finally, Combo-HOD probabilisti-
cally combines HOG and HOD.
Recently, a method that does not require a Graph-
ics Processing Unit (GPU) implementation and still
presents accurate and real-time results was presented
(Munaro et al., 2012). The only drawback is that it assumes people stand on the ground plane; consequently, it does not present accurate results for people standing considerably above or below that plane, i.e. it performs poorly for people climbing stairs or sitting behind a table. It processes information from
the point cloud by downsampling it. Then, it esti-
mates the ground plane with a RANSAC-based least
square method so that it can be removed, thus sep-
arating clusters that might contain people. For each
of these clusters, a HOG-based people detector is ap-
plied to the corresponding part of the RGB image.
2.3 Statement of Contributions
In this work, we aim to provide insight into performing SLAM and human detection and localization
simultaneously, while achieving reliable results and
acceptable performance using solely one RGB-D sen-
sor in the mobile robot. Even though much research
has been conducted on mapping and people detection
with RGB-D sensors, both subjects are not often in-
tegrated to assemble a functional system for applica-
tions such as SaR missions, where providing rescue
teams with a map of the environment and localizing
possible victims is of inestimable value.
3 SYSTEM OVERVIEW
As seen in Fig. 1, the proposed system uses a Kinect
sensor and comprises two major modules: the People
Detection module and the Mapping module. Addi-
tionally, it runs under ROS, the most widely used robotics framework, which provides a set of tools, libraries, drivers and other resources that make developing robot applications easier and provide hardware abstraction (Quigley et al., 2009). The data from the Kinect was retrieved using the OpenNI driver (ROS Wiki: http://wiki.ros.org/openni_kinect, accessed 2014-06-21) and the driver used for the Pioneer 3-DX mobile robot was ROSARIA (ROS Wiki: http://wiki.ros.org/ROSARIA, accessed 2014-06-21), both already available in ROS.
3.1 Mapping
Although the Kinect allows performing full RGB-D mapping, our goal is to run a SLAM algorithm alongside other tasks, such as people detection and, eventually, autonomous exploration.
We opted to project the depth measurements provided by the sensor onto the floor plane and simulate a 2D Laser Scan in order to reduce the computational cost. This is represented by the "Depth to LaserScan" block in Fig. 1. It processes the columns of the depth image and creates a vector with the minimum depth value per column, thus yielding a vector of 640 distance measures, i.e. a 2D scan.
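As an illustration, the following Python sketch shows one possible way to implement this column-wise minimum; the function name and the handling of invalid readings are our assumptions, and the conversion of column indices to bearing angles via the camera intrinsics (needed to build an actual LaserScan message) is omitted.

# Minimal sketch (assumed implementation) of the "Depth to LaserScan" idea:
# for each of the 640 columns of the Kinect depth image, keep the closest
# valid depth reading, producing a vector of ranges, i.e. a virtual 2D scan.
import numpy as np

def depth_image_to_virtual_scan(depth_m: np.ndarray) -> np.ndarray:
    """depth_m: HxW depth image in metres (0 or NaN where invalid)."""
    depth = depth_m.copy()
    # Treat invalid readings (zeros/NaNs) as "no return" so they never win the minimum.
    depth[~np.isfinite(depth) | (depth <= 0.0)] = np.inf
    # Minimum depth per column -> one range per image column (640 for the Kinect).
    return depth.min(axis=0)

# Example with a synthetic 480x640 depth frame.
scan = depth_image_to_virtual_scan(np.random.uniform(0.5, 5.0, (480, 640)))
print(scan.shape)  # (640,)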
The 2D range measurements are used as an input
to the GMapping algorithm (Grisetti et al., 2007), al-
ready available in ROS, along with odometry infor-
mation provided by the robot’s driver.
This SLAM algorithm was selected for several
reasons. Firstly, considering our performance con-
straints, it does not present a high computational bur-
den. Secondly, the Kinect has a narrow FoV, which can cause problems in scan matching, so the mobile robot's odometry can greatly improve results. Fi-
nally, it was shown to be robust in testing and experi-
ments, when compared to other SLAM approaches.
3.2 People Detection
Several people detection algorithms do not take into
consideration 3D information, while others use that
information to improve results. However, the authors
of (Munaro et al., 2012) proposed an algorithm that
uses the point cloud generated to lower the compu-
tational load of classical people classifiers. Further-
more, ROS provides access to the Point Cloud Li-
brary (PCL) (Rusu and Cousins, 2011), which con-
tains algorithms to process 3D data from RGB-D sen-
sors. Therefore, the technical implementation of the
algorithm becomes much simplified.
The algorithm firstly processes the point cloud, dividing the space into volumetric pixels (voxels) with an edge length of 0.06 m, and replaces the 3D points falling in a common voxel by the voxel's centroid. Therefore, we obtain a reduced number of
Real-timePeopleDetectionandMappingSystemforaMobileRobotusingaRGB-DSensor
469
points and also a point cloud with approximately constant point density, which no longer varies with the distance from the sensor.
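A minimal numpy sketch of an equivalent voxel-grid downsampling step is given below; the paper relies on PCL's implementation, so this stand-alone version is for illustration only.

# Hedged sketch of voxel-grid downsampling: points falling in the same
# 0.06 m voxel are replaced by their centroid, giving an approximately
# constant point density.
import numpy as np

def voxel_downsample(points: np.ndarray, leaf: float = 0.06) -> np.ndarray:
    """points: Nx3 array of XYZ coordinates in metres."""
    # Integer voxel index of every point.
    idx = np.floor(points / leaf).astype(np.int64)
    # Group points by voxel and average them (the voxel centroid).
    _, inverse, counts = np.unique(idx, axis=0, return_inverse=True, return_counts=True)
    centroids = np.zeros((counts.size, 3))
    np.add.at(centroids, inverse, points)
    return centroids / counts[:, None]

cloud = np.random.uniform(-2.0, 2.0, (10000, 3))
print(voxel_downsample(cloud).shape)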
Figure 1: System Overview. RGB-D data from the Kinect (depth image, point cloud and RGB image) feeds the Mapping module (Depth to LaserScan and GMapping, combined with odometry from the Pioneer 3-DX robot) and the People Detection module (visual-based people detection and People Locator), producing the map and the people detections.
With a filtered point cloud, and under the assumption that people stand on the ground, the ground plane's coefficients are estimated and updated at every frame using a least squares method, making the estimate robust to small changes such as those experienced when a mobile robot is moving. At this stage, points located in the ground plane are removed by discarding every point whose distance to the estimated ground plane is lower than a threshold of 6 cm. As a consequence, the remaining clusters are no longer connected by this common plane.
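The following sketch illustrates the idea with a simple RANSAC plane fit followed by the 6 cm cut; the number of iterations and the sampling strategy are our assumptions, not details from the paper.

# Hedged sketch of the ground-removal step: fit a plane with a simple RANSAC
# loop (3 random points per iteration) and discard every point closer than
# 6 cm to the estimated ground plane.
import numpy as np

def fit_plane_ransac(points, iters=200, inlier_thresh=0.06, rng=np.random.default_rng(0)):
    best_plane, best_inliers = None, 0
    for _ in range(iters):
        sample = points[rng.choice(len(points), 3, replace=False)]
        # Plane normal from the three sampled points.
        n = np.cross(sample[1] - sample[0], sample[2] - sample[0])
        norm = np.linalg.norm(n)
        if norm < 1e-9:
            continue  # degenerate (collinear) sample
        n = n / norm
        d = -n.dot(sample[0])
        dist = np.abs(points @ n + d)
        inliers = int((dist < inlier_thresh).sum())
        if inliers > best_inliers:
            best_plane, best_inliers = (n, d), inliers
    return best_plane

def remove_ground(points, plane, thresh=0.06):
    n, d = plane
    dist = np.abs(points @ n + d)
    return points[dist >= thresh]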
After this first stage of point cloud processing,
the different clusters can now be computed by la-
belling neighbouring 3D points on the basis of their
Euclidean distances. In our case, we started by con-
sidering that points closer than a threshold of 2 times
the voxel edge belonged to the same cluster. How-
ever, this process may lead to errors, e.g. dividing par-
tially occluded people into different clusters, or merg-
ing different people in the same cluster when they are
near each other. As for the second issue, the algorithm uses the position of the heads, which are generally neither so close nor occluded, to divide these clusters into sub-clusters, so that people previously merged in a single
cluster are separated into different clusters.
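A possible single-linkage implementation of this Euclidean clustering is sketched below; it omits the head-based subdivision step and assumes a tolerance of twice the 0.06 m voxel edge (0.12 m).

# Hedged sketch of Euclidean clustering on the downsampled cloud: points
# closer than the tolerance are linked into the same cluster via a
# breadth-first search over a k-d tree neighbourhood.
import numpy as np
from scipy.spatial import cKDTree

def euclidean_clusters(points: np.ndarray, tolerance: float = 0.12):
    tree = cKDTree(points)
    unvisited = set(range(len(points)))
    clusters = []
    while unvisited:
        seed = unvisited.pop()
        queue, cluster = [seed], [seed]
        while queue:
            neighbours = tree.query_ball_point(points[queue.pop()], tolerance)
            new = [i for i in neighbours if i in unvisited]
            unvisited.difference_update(new)
            queue.extend(new)
            cluster.extend(new)
        clusters.append(np.array(cluster))
    return clusters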
For the clusters obtained earlier, a HOG-based de-
tector (Dalal and Triggs, 2005) is applied to the por-
tion of the RGB image corresponding to the fixed
aspect ratio bounding box that contains the whole
cluster. This process includes the computation of
HOG features and their application to a trained linear SVM; the SVM was trained using the well-known INRIA Person Dataset (http://pascal.inrialpes.fr/data/human). The SVM is a learning model that allows
us to classify the data based on its training.
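For illustration, the sketch below applies OpenCV's stock HOG person detector to the image region of a cluster; note that the paper trains its own linear SVM on the INRIA dataset, whereas this example relies on OpenCV's default people detector and a hypothetical bounding-box input.

# Hedged sketch of the classification step using OpenCV's HOG descriptor
# (Dalal & Triggs features) with a pre-trained linear SVM, applied only to
# the RGB sub-image corresponding to a 3D cluster.
import cv2

hog = cv2.HOGDescriptor()
hog.setSVMDetector(cv2.HOGDescriptor_getDefaultPeopleDetector())

def classify_cluster_roi(rgb_image, bbox):
    """bbox: (x, y, w, h) of the fixed-aspect-ratio box enclosing a 3D cluster."""
    x, y, w, h = bbox
    roi = rgb_image[y:y + h, x:x + w]
    # detectMultiScale returns rectangles and confidence weights for detections in the ROI.
    rects, weights = hog.detectMultiScale(roi, winStride=(8, 8))
    return len(rects) > 0, weights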
Figure 2: Summarized overview of the system in ROS.
Boxes refer to ROS nodes and arcs to topics.
3.3 Integration
Although both modules run in the same system, their
data is not in the same reference frame: detections
are made on the Kinect frame which is different from
the reference frame of the map. To deal with this is-
sue, we created an additional ROS node (see node map_people_manager in Fig. 2) that subscribes to the
detections, transforms their coordinates to map coor-
dinates, using ROS tools, and manages the detections,
avoiding multiple detections of the same person in the
same position. Also, it publishes the corresponding
markers to allow the visualization of the map and the
detections at their real relative position in the map. The relevant portion of the rqt_graph of the ROS system is presented in Fig. 2.
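A simplified rospy sketch of such a node is shown below; the topic names, the marker styling and the 0.5 m merging radius are illustrative assumptions rather than the actual implementation.

#!/usr/bin/env python
# Hedged sketch of a node in the spirit of map_people_manager: it listens for
# detections expressed in the Kinect frame, transforms them into the map frame
# with tf, drops detections too close to an already stored one, and publishes
# RViz markers. Topic names and the merging radius are assumptions.
import rospy, tf
from geometry_msgs.msg import PointStamped
from visualization_msgs.msg import Marker

class PeopleManager(object):
    def __init__(self):
        self.listener = tf.TransformListener()
        self.people = []          # detections already stored, in map coordinates
        self.marker_pub = rospy.Publisher("people_markers", Marker, queue_size=10)
        rospy.Subscriber("person_detections", PointStamped, self.on_detection)

    def on_detection(self, msg):
        # Transform the detection from the sensor frame to the map frame.
        self.listener.waitForTransform("map", msg.header.frame_id,
                                       msg.header.stamp, rospy.Duration(0.5))
        p = self.listener.transformPoint("map", msg)
        # Ignore the detection if someone was already recorded within 0.5 m (assumed radius).
        if any((p.point.x - q.x) ** 2 + (p.point.y - q.y) ** 2 < 0.5 ** 2 for q in self.people):
            return
        self.people.append(p.point)
        m = Marker()
        m.header.frame_id = "map"
        m.type, m.action, m.id = Marker.SPHERE, Marker.ADD, len(self.people)
        m.pose.position = p.point
        m.pose.orientation.w = 1.0
        m.scale.x = m.scale.y = m.scale.z = 0.3
        m.color.r, m.color.a = 1.0, 1.0
        self.marker_pub.publish(m)

if __name__ == "__main__":
    rospy.init_node("map_people_manager")
    PeopleManager()
    rospy.spin()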
4 EXPERIMENTAL SETUP AND
VALIDATION
In order to validate the people detection and mapping
solution, we used the experimental setup depicted in
Fig. 3, with the addition of a laptop on the robot’s plat-
form. The Kinect sensor was tilted 8° up so that the
operating range for people detection is not affected
by the relative position to the ground plane, i.e. point
clouds will contain the whole person instead of half
body at closer distances. The test scenario was indoor
and was located in AP4ISR lab of the Institute of Sys-
tems and Robotics of the Univ. of Coimbra (ISR-UC).
Our experimental work was divided into three stages:
mapping method validation, people detection valida-
tion, and integrated system validation.
ICINCO2014-11thInternationalConferenceonInformaticsinControl,AutomationandRobotics
470
Figure 3: Mobile Robot (Pioneer 3-DX) with a Microsoft Kinect mounted on top.
4.1 Mapping Validation
In order to validate the mapping task with the Kinect
sensor, we attached to our robot a Hokuyo URG-
04LX-UG01 LRF to produce maps to be compared
with the ones obtained using the Kinect and the
method described in sec. 3.1. The environment tested
was a lab arena with approximately 4.6 × 4.0m, as il-
lustrated in Fig. 4. The robot was teleoperated using
an ssh remote connection, while running GMapping.
Figure 4: Photo of the test area (left) and ground truth map
(right).
Figure 5: Maps produced with the Hokuyo URG-04LX-
UG01 LRF sensor (left) and the Kinect sensor (right).
By visually comparing both maps, in Fig. 5, it be-
comes evident that the Kinect is not as accurate as the
LRF and that its limited FoV, range, and lower accu-
racy have a negative impact on the results obtained.
Nevertheless, both maps are easily interpretable by
the human eye. We computed the pixel-wise matching between the generated maps and the ground truth to assess their quality, and obtained acceptable matching rates, as shown in Table 1. In order to compute the match-
ing metric, we binarized the maps obtained and the
ground truth, calculated the best fit alignment by ro-
tating the maps, and computed the pixel-wise match
of each pixel in the image.
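A possible implementation of this metric is sketched below, assuming both maps are rendered at the same resolution and size; the brute-force 1° rotation search and the binarization threshold are our assumptions.

# Hedged sketch of the map-quality metric: binarize the generated map and the
# ground truth, search for the rotation that best aligns them, and report the
# fraction of matching pixels.
import numpy as np
from scipy.ndimage import rotate

def matching_rate(map_img: np.ndarray, truth_img: np.ndarray, occ_thresh=127) -> float:
    # Binarize: occupied (dark pixels in a PGM occupancy map, assumed) vs. free/unknown.
    m = (map_img < occ_thresh).astype(np.uint8)
    t = (truth_img < occ_thresh).astype(np.uint8)
    best = 0.0
    for angle in range(0, 360):
        r = rotate(m, angle, reshape=False, order=0)
        best = max(best, float((r == t).mean()))
    return best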
Table 1: Pixel-wise matching rates.
Maps Matching Rate
Ground truth - Laser 96.9 %
Ground truth - Kinect 94.3 %
4.2 People Detection Validation
Despite the availability of some datasets, they do not comply with the constraints of our hardware setup in Fig. 3, mostly because the Kinect is only 24 cm above the ground and is therefore tilted up to acquire visual information containing people. In order to validate our people
detection method, we captured a dataset of about 100
frames that was manually annotated with the people
present in each frame. It contains one person walk-
ing in several directions at a distance of 1 to 4 meters from the camera (an example is shown in Fig. 6). Therefore, we have a dataset of binary decisions. This
way, we were able to acquire data in similar condi-
tions to the final intended applications.
We applied the people detection method imple-
mented in ROS to process point clouds of the dataset,
and extract results (true positives, true negatives, false
positives and false negatives) in Receiver Operating
Characteristic (ROC) curves (see Fig. 7 and Fig. 8).
The ROC curve is a graphical plot which illustrates the performance of a binary classifier system as the discrimination threshold is varied.
Real-timePeopleDetectionandMappingSystemforaMobileRobotusingaRGB-DSensor
471
Figure 6: Example of a point cloud from the dataset.
Figure 7: ROC curve on our dataset (True Positive Rate vs. False Positive Rate).
Afterwards, we applied the method to cross val-
idate with the Kinect Tracking Precision (KTP)
Dataset from (Munaro et al., 2012), which contains
sequences of multiple people captured from a static
camera.
Fig. 7 relates the True Positive and False Positive
Rates (TPR and FPR). Perfect results are near the top
left corner with 0% FPR and 100% TPR. Fig. 8 shows
the precision and recall percentages for each exper-
iment. The ideal result is situated on the top right
corner with 100% recall and precision. Fig. 9 is the
Detection Error Tradeoff (DET) curve which relates
the False Rejection Rate (FRR) in percentage and the
number of False Positives per Frame (FPPF). The best
result is located on the bottom left corner.
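For reference, these quantities follow the standard definitions in terms of true/false positives (TP, FP) and true/false negatives (TN, FN) accumulated over the annotated frames:

\begin{align*}
\mathrm{TPR} &= \frac{TP}{TP+FN}, & \mathrm{FPR} &= \frac{FP}{FP+TN}, & \mathrm{FRR} &= \frac{FN}{TP+FN} = 1-\mathrm{TPR},\\
\mathrm{Precision} &= \frac{TP}{TP+FP}, & \mathrm{Recall} &= \mathrm{TPR}, & \mathrm{FPPF} &= \frac{FP}{\text{number of frames}},\\
\mathrm{Accuracy} &= \frac{TP+TN}{TP+TN+FP+FN}.
\end{align*}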
We observed in the results obtained with our dataset that the method is very robust against false detections, showing a low FPR even for a high TPR, e.g. 86.87% TPR for 0.00% FPR. This is also visi-
ble in the high precision shown in Fig. 8, even for high
recall values, e.g. 100% precision for 86.67% recall.
Figure 8: Precision-Recall for our dataset and for the KTP dataset.
Figure 9: Detection Error Tradeoff (DET) for our dataset and for the KTP dataset (FRR vs. FPPF).
We were able to achieve an accuracy of 92.98% which
shows the reliability of the method.
Note however that this analysis is performed inde-
pendently for each processed point cloud (each frame
provided by Kinect). In real world applications, with
depth data from Kinect at 30 fps, we will capture sev-
eral frames for each person, which allows us to gain
certainty when detecting a person in short time inter-
vals. In the case of SaR missions, it is very impor-
tant to lower the false positives as much as possible
to avoid wasting resources and time while keeping a high TPR to be capable of detecting all the victims.
The low number of frames and the presence of only one person are apparent in the curves and led us to run the method on the KTP dataset. The results obtained on the KTP dataset were understandably not as good as the ones with our dataset, since the former
is a more complex dataset containing up to 5 peo-
ple in the same sequence. Still, the single-frame accuracy of the method is sufficient considering the number of frames we can process when
detecting each person. We did not compute the ROC
curve for this dataset because the ROC curve assesses a binary classification problem, whereas the KTP dataset barely contains binary decisions due to the nature of the tracking problem. Still, we can conclude from the Precision-Recall
and DET curves that the results are accurate enough
for our intended applications, e.g. 72.39% precision
for 57.51% recall and 0.39 FPPF for 42.49% FRR.
5 EXPERIMENTAL TESTS WITH
A MOBILE ROBOT
To validate the system, we used the previously pre-
sented setup in an indoor uncontrolled and cluttered
environment in the AP4ISR lab of ISR-UC. The envi-
ronment contains desks, tables, placards, chairs, hard-
ware and all kinds of objects (see Fig. 10). The system performed all the computation and also displayed the results in real-time.
Figure 10: Photo of the environment used to test the whole
system.
The environment was reliably mapped (see
Fig. 11). We note that the 2D mapping method uses the desks' top edges in the mapping, so the space under them is not considered due to the point cloud
downsampling. On the other hand, for people detec-
tion purposes, the whole space is considered, there-
fore if a detection is made under a table, it would still
be represented in the map.
In our experiments, we used three subjects standing at different locations: two real humans (one male and one female) and a human model (seen in Fig. 10).
In all experiments, the robot was able to map fairly
accurately and detect all of them. Fig. 11 shows a
picture of the ROS visualization software rviz with
the ongoing construction of the map and the detec-
tions made so far. The experiment depicted lasted 4.3 minutes. The robot was teleoperated with a Wii Re-
mote Control connected via bluetooth to the laptop
mounted on top of the robot. The system depicted
in Fig. 1 and in Fig. 2 was run on a laptop with an
Intel Core i7-4700MQ CPU, 16GB of RAM, Ubuntu
12.04, and ROS Hydro. We computed the average CPU load over the experiment, which resulted in 44.71% CPU usage, and frames were processed at an average of 16.07 fps. This frame rate could be increased by parallelizing the code on a GPU. The results also demonstrate that the system performs well in real world scenarios and its computational load leaves
room to incorporate further modules in the system,
such as additional sensory cues, e.g. audio input, and
perform other tasks in parallel, such as autonomous
navigation and exploration.
Figure 11: Map obtained and people detected.
6 CONCLUSION
This paper proposed an integrated system that is able
to successfully map the environment, localize the
robot with respect to the map and, simultaneously, de-
tect and localize people within the environment, while
relying solely on an RGB-D sensor. However, we in-
tend in our future work to have a system that is also
able to autonomously explore the environment. This
will probably require an upgrade of the current hard-
ware setup of Fig. 3, due to the narrow FoV of the
Kinect, which may cause unreliable navigation. Also, since our goal is a system that can be used to perform SaR missions, the robot should be able to navigate towards victims and eventually interact with them. We intend
to study the use of a second Kinect sensor to achieve
a wider FoV. This improvement does not imply great costs and should yield safer navigation.
In the future, we also intend to take advantage of
other capabilities of the Kinect sensor, such as pro-
cessing audio information from its microphone array
to improve people detection results. Furthermore, we
would like to test the system in other applications
such as automated patrolling and surveillance with
robotic teams (Portugal and Rocha, 2013).
Real-timePeopleDetectionandMappingSystemforaMobileRobotusingaRGB-DSensor
473
REFERENCES
Bajracharya, M., Moghaddam, B., Howard, A., Brennan,
S., and Matthies, L. H. (2009). A fast stereo-based
system for detecting and tracking pedestrians from a
moving vehicle. Int. Journal of Robotics Research,
28(11-12):1466–1485.
Clemente, L., Davison, A., Reid, I., Neira, J., and Tardós, J. D. (2007). Mapping large loops with a single hand-
held camera. Proc. Robotics: Science and Systems
Conf.
Dalal, N. and Triggs, B. (2005). Histograms of oriented
gradients for human detection. Int. Conf. on Computer
Vision & Pattern Recognition, 2:886–893.
Dissanayake, M. G., Newman, P., Clark, S., Durrant-
Whyte, H. F., and Csorba, M. (2001). A solution to the
simultaneous localization and map building (SLAM)
problem. Rob. & Automation, IEEE Tr. on, 17(3):229–
241.
Droeschel, D., May, S., Holz, D., Ploeger, P., and Behnke,
S. (2009). Robust ego-motion estimation with ToF
cameras. European Conf. on Mobile Robots, pages
187–192.
Ferreira, J. F., Lobo, J., Bessière, P., Castelo-Branco, M.,
and Dias, J. (2013). A Bayesian framework for active
artificial perception. IEEE Trans. on Cybernetics (Part
B), 43(2):699–711.
Fischler, M. A. and Bolles, R. C. (1981). RANdom SAmple
Consensus: a paradigm for model fitting with appli-
cations to image analysis and automated cartography.
Comm. of the ACM, 24(6):381–395.
Grisetti, G., Stachniss, C., and Burgard, W. (2007).
Improved techniques for grid mapping with Rao-
Blackwellized particle filters. IEEE Trans. on
Robotics, 23:2007.
Henry, P., Krainin, M., Herbst, E., Ren, X., and Fox, D.
(2010). RGB-D mapping: Using depth cameras for
dense 3D modeling of indoor environments. Experi-
mental Robotics, 79:477–491.
Keller, C. G., Enzweiler, M., Rohrbach, M., Fernan-
dez Llorca, D., Schnorr, C., and Gavrila, D. M. (2011).
The benefits of dense stereo for pedestrian detection.
ITS, IEEE Trans. on, 12(4):1096–1106.
Konolige, K. and Agrawal, M. (2008). FrameSLAM:
From bundle adjustment to real-time visual mapping.
Robotics, IEEE Trans. on, 24(5):1066–1077.
Kuemmerle, R., Grisetti, G., Strasdat, H., Konolige, K., and
Burgard, W. (2011). g2o: A general framework for
graph optimization. ICRA, pages 3607–3613.
Levi, K. and Weiss, Y. (2004). Learning object detection
from a small number of examples: the importance of
good features. CVPR, pages 53–60.
Llorca, D., Sotelo, M., Hellín, A., Orellana, A., Gavilan,
M., Daza, I., and Lorente, A. (2012). Stereo regions-
of-interest selection for pedestrian protection: A sur-
vey. Transportation research part C: emerging tech-
nologies, 25:226–237.
Lowe, D. G. (2004). Distinctive image features from scale-
invariant keypoints. Int. Journal Computer Vision,
60(2):91–110.
May, S., Droeschel, D., Holz, D., Fuchs, S., Malis, E.,
Nüchter, A., and Hertzberg, J. (2009). 3D mapping
with ToF cameras. Journal of Field Robotics, sp. is-
sue on 3D Mapping.
Menezes, P., Brethes, L., Lerasle, F., Danes, P., and Dias, J.
(2003). Visual tracking of silhouettes for human-robot
interaction. pages 971–976.
Calonder, M., Lepetit, V., and Fua, P. (2008).
Keypoint signatures for fast learning and recognition.
In European Conf. on Computer Vision.
Munaro, M., Basso, F., and Menegatti, E. (2012). Tracking
people within groups with RGB-D data. IROS, pages
2101–2107.
Portugal, D. and Rocha, R. P. (2013). Distributed multi-
robot patrol: A scalable and fault-tolerant framework.
Robotics & Auton. Syst., 61(12):1572–1587.
Premebida, C., Ludwig, O., and Nunes, U. (2009). Lidar
and vision-based pedestrian detection system. Journal
of Field Robotics, 26(9):696–711.
Prusak, A., Melnychuk, O., Roth, H., Schiller, I., and Koch,
R. (2008). Pose estimation and map building with a
time-of-flight camera for robot navigation. Int. Jour-
nal Intell. Syst. Technol. Appl., 5(3/4):355–364.
Quigley, M., Conley, K., Gerkey, B. P., Faust, J., Foote, T.,
Leibs, J., Wheeler, R., and Ng, A. Y. (2009). ROS: an
open-source robot operating system. ICRA Workshop
on Open Source Software.
Rocha, R. P., Portugal, D., Couceiro, M., Araujo, F.,
Menezes, P., and Lobo, J. (2013). The CHOPIN
project: Cooperation between Human and rObotic
teams in catastroPhic INcidents. SSRR, pages 1–4.
Rosten, E. and Drummond, T. (2006). Machine learning
for high-speed corner detection. In European Conf.
on Computer Vision, pages 430–443.
Rusu, R. and Cousins, S. (2011). 3D is here: Point Cloud
Library (PCL). ICRA, pages 1–4.
Satake, J. and Miura, J. (2009). Multiple-Person Tracking
for a Mobile Robot Using Stereo. MVA Conf., pages
273–277.
Soni, B. and Sowmya, A. (2013). Victim detection and
localisation in an urban disaster site. ROBIO, pages
2142–2147.
Spinello, L. and Arras, K. O. (2011). People detection in
RGB-D data. IROS, pages 3838–3843.
Triebel, R. and Burgard, W. (2005). Improving simultane-
ous localization and mapping in 3D using global con-
straints. AAAI, 20(3):1330.
Triggs, B., Mclauchlan, P., Hartley, R., and Fitzgibbon, A.
(2000). Bundle adjustment – a modern synthesis. Vi-
sion Algorithms: Theory and Practice, LNCS, pages
298–375.
ICINCO2014-11thInternationalConferenceonInformaticsinControl,AutomationandRobotics
474