Thermal and 3D Kinect Sensor Fusion for Robust People Detection using
Evolutionary Selection of Supervised Classifiers
L. Susperregi¹, E. Jauregi², B. Sierra², J. M. Martínez-Otzeta¹, E. Lazkano² and A. Ansuategui¹
¹TEKNIKER-IK4, Autonomous and Smart Systems Unit, Eibar, Spain
²Department of Computer Science and Artificial Intelligence, University of the Basque Country, Donostia/San Sebastián, Spain
Keywords:
Computer Vision, Machine Learning, Robotics, 3D People Detection, Estimation of Distribution Algorithms.
Abstract:
In this paper we propose a novel approach for combining information from multiple low-cost sensors for people detection on a mobile robot. Robustly detecting people is a key capability needed for robots that operate in populated environments. Several works show the advantages of fusing data coming from complementary sensors. The Kinect sensor offers a rich data set at a significantly low cost; however, it has some limitations when used on a mobile platform, mainly because its detection algorithms rely on images captured by a static camera. To cope with these limitations, this work is based on the fusion of a Kinect and a thermopile array sensor mounted on top of a mobile platform. We propose the evolutionary selection of supervised people detection classifiers built using several computer vision transformations. Experimental results carried out with a mobile platform in a manufacturing shop floor show that the percentage of wrongly classified images obtained using only the Kinect is drastically reduced with the proposed classification algorithms and with the combination of the three information sources.
1 INTRODUCTION
Service robots, now and in the near future, perform-
ing tasks as assistants, guides, tutors, or social com-
panions in human populated settings such as muse-
ums, hospitals, etc. pose two main challenges: on the one hand, robots must be able to adapt to complex, unstructured environments and, on the other hand,
robots must interact with humans. While interacting
with the environment, the robot must navigate, de-
tect and avoid obstacles (Morales et al., 2011). A
requirement for natural Human Robot interaction is
the robot’s ability to accurately and robustly detect
and localize the persons around it in real-time. This
problem is a challenging one, quite difficult when a
low cost camera is the only available sensor (Yao and
Odobez, 2011).
This article describes the realization of a human
detection system based on low-cost sensing devices.
Recently, research on sensing components and software led by Microsoft has provided useful results for extracting human kinematics (the Kinect motion sensor device (Kinect, )).
Within this article, the service proposed for the mobile robot is to approach the closest person in the room, i.e. to approach that person up to a given distance and to verbally interact with him or her. This "engaging" behaviour can be useful in potential robot services such as tour guiding, health care or information providing. Once the target person has been chosen, the robot plans a trajectory and navigates to the desired position. To accomplish this, the robot must be able to detect human presence in its vicinity, and it cannot be assumed that the person faces the robot, since the robot acts proactively.
Kinect offers a rich data set at a significantly low
cost. While the Kinect is a great addition to robotics
there are some limitations. First, the depth map is
only valid for objects more than 80cm away from
the sensing device. Second, the Kinect uses an IR projector together with an IR camera, which means that sunlight can negatively affect the measurements, since the sun emits in the IR spectrum. Third, the Kinect's detection algorithms rely on human activities captured by a static camera. In mobile robot applications the sensor setup is embedded in the robot, which is usually moving. As a consequence, the robot is expected to operate in environments which are highly dynamic, cluttered, and frequently subject to illumination changes. To cope with this, this work is based on the hypothesis that the combination of the Kinect and a thermopile array sensor (the low cost Heimann HTPA thermal sensor (HTPA, )) can significantly improve the robustness of human detection. Thermal vision
helps to overcome some of the problems related to
colour vision sensors, since humans have a distinctive
thermal profile compared to non-living objects and
there are no major differences in appearance between
different persons in a thermal image. Another advan-
tage is that the sensor data does not depend on light
conditions and people can also be detected in com-
plete darkness. Therefore it is a promising research
direction to combine the advantages of different sen-
sor sources because each sensing modality has com-
plementary benefits and drawbacks.
This article outlines the design and development
of a multimodal human detection system. The chosen approach is:
- To combine machine learning paradigms with computer vision techniques in order to perform image classification: first we apply transformations using computer vision techniques and afterwards we perform classification using machine learning paradigms.
- To combine the resulting classifiers obtained by this new image classification paradigm. Apart from using all the classifiers obtained (paradigms × transformations), we use a new approach to multi-classifier construction in which a previous selection of classifiers is performed.
We have experimented in a real manufacturing shop
floor where machines and humans share the space in
performing production activities. The experiments seem promising, considering that the percentage of wrongly classified images obtained using only the Kinect's detection algorithms is drastically reduced.
2 RELATED WORK
People detection and tracking systems have been
studied extensively due to the increasing demand for advanced robots that must integrate natural human-robot interaction capabilities in order to perform specific tasks for humans or in collaboration with them. As a complete review on people detection is beyond the scope of this work (an extensive review can be found in (Schiele, 2009)), we focus on the most closely related work.
People detection solutions that can be used on mobile robots should cope with several requirements:
- Camera and other sensors are usually not static, since they are mounted on a moving platform. As a consequence, many algorithms aimed at surveillance applications are not applicable.
- Fast (real-time): the computational load of the used algorithms should be low in order to perform real-time detection.
- Non-invasive: normal human activity is unaffected.
To our knowledge, two approaches are commonly used for detecting people on a mobile robot: purely vision-based techniques, and approaches combining vision with other modalities, normally range sensors such as laser scanners or sonars, as in (Guan et al., 2007). Methods for people detection in colour
images extract features based on skin colour, face,
clothes and motion information such as (Bellotto and
Hu, 2010). All methods for detecting and track-
ing people in colour images on a moving platform
face similar problems and their performance depends
heavily on the current light conditions, viewing angle,
distance to persons, and variability of appearance of
people in the image.
Most existing combined vision-thermal methods (St-Laurent et al., 2006; Hofmann et al., 2011; Johnson and Bajcsy, 2008; Thi Thi Zin and Hama, 2011) concern non-mobile video monitoring applications, and especially pedestrian detection, where the pose of the camera is fixed. Some works (Gundimada et al., 2010) show the advantages of using thermal images for face detection. They suggest that the fusion of both visible and thermal based face recognition methodologies yields better overall performance.
As yet, however, there is hardly any published
work on using thermal sensor information to detect
humans on mobile robots. The main reason for the
limited number of applications using thermal vision
so far is probably the relatively high price of this sen-
sor. (Treptow et al., 2005) shows the use of ther-
mal sensors and grey scale images to detect people
on a mobile robot. A drawback of most of these ap-
proaches is the sequential integration of the sensory
cues. People are detected by thermal information only
and are subsequently verified by visual or auditory
cues.
Most of the abovementioned approaches use predefined body model features for the detection of people. Few works consider the application of learning techniques. (Arras et al., 2007) proposes to use supervised learning (AdaBoost) to create a people detector with the most informative features. (Mozos et al., 2010) builds classifiers able to detect a particular body part, such as a head, an upper body or a leg, using laser data.
Combination of classifiers has been widely used
as a useful approach in several machine learning tasks
(Kuncheva, 2004). In the field of people detection
several authors have used this approach, like (Oliveira
et al., 2010), who use histograms of oriented gradients
Thermaland3DKinectSensorFusionforRobustPeopleDetectionusingEvolutionarySelectionofSupervisedClassifiers
103
(HOGs) and local receptive fields (LRFs), which are
provided by a convolutional neural network, and are
classified by multilayer perceptrons (MLPs) and sup-
port vector machines (SVMs) combining classifiers
by majority vote and fuzzy integral.
3 PROPOSED APPROACH
We propose a multimodal approach, which can be
characterized by the fact that all used sensory cues are
concurrently processed. The proposed detection sys-
tem is based on a Kinect motion sensor device for the
XBOX 360 and a HTPA thermal sensor developed by
Heimann (HTPA, ), mounted on top of an RMP Segway mobile platform, which is shown in Figure 1.
Figure 1: The used robotic platform: a Segway RMP 200
provided with the Kinect and the thermal sensor.
We aim at applying a new approach to combine
machine learning paradigms with computer vision
techniques in order to perform image classification.
Our approach is divided into three phases: transfor-
mation using computer vision techniques, classifica-
tion using machine learning paradigms and optimal
combination of classifiers using a previous classifier
selection by means of EDA (Estimation of Distribu-
tion Algorithms) (Mühlenbein and Paaß, 1996).
1. Computer vision transformations. In order to have
different views of the images, different modifi-
cations over the original pictures are performed.
The main goal of this phase is to have variabil-
ity in the aspect the picture offers, so that differ-
ent values are obtained for the same pixel posi-
tions. As it has been mentioned before, we aim at
using three input images (colour, depth, tempera-
ture) to construct a classifier. To enrich the input
database, we have decided to build some variants
using the matrix obtained in the original images,
applying computer vision related transformations.
In this way, and for each of the three data sources,
a set of equivalent images is obtained, and a set of databases is constructed, one for each of the transformations used.
To achieve this, we combine some standard image
related algorithms (edge detection, gaussian filter,
binarization, and so on) in order to obtain differ-
ent views of the images, and afterward, we apply
some standard machine learning classifiers taking
into account the pixel values of the different mod-
ifications of the pictures. From the original train-
ing database collected, a new training database is
obtained for each of the computer vision transfor-
mation used, summing up a total of 24 databases
for each device.
2. In the classification phase, the system learns a
classifier from a hand-labeled dataset of images
(abovementioned original and transformations).
As classifiers we use five well-known ML supervised classification algorithms with completely different approaches to learning and a long tradition in different classification tasks: IB1, Naive-Bayes, Bayesian Network, C4.5 and SVM.
3. Then, the goal of our fusion process is to maxi-
mize the benefits of each modality by intelligently
fusing their information, and by overcoming the
limitations of each modality alone.
Considering the large number of possible classifier combinations (24 × 5 for each sensor), we attempt to get an optimal solution by selecting a subset of classifiers which obtains better results from the accuracy point of view. An evolutionary algorithm, the Estimation of Distribution Algorithm (EDA), is used to perform the selection.
3.1 Data Sources
As stated before, two kinds of data sources are used, coming from the Kinect sensor and the thermopile array.
Kinect 3D Images. The Kinect provides 3D images: it uses near-infrared light to illuminate the scene, and the sensor chip measures the disparity between the projected IR pattern and the pattern observed by the IR camera. It provides a 640x480 distance (depth) map in real time (30 fps).
In addition to the depth sensor the Kinect also pro-
vides a traditional 640x480 RGB image.
ICINCO2013-10thInternationalConferenceonInformaticsinControl,AutomationandRobotics
104
Figure 2: Thermopile image.
Thermal Images. The HTPA allows measuring the temperature distribution of the environment in applications where very high resolutions are not necessary, such as person detection, surveillance of temperature-critical surfaces, hotspot or fire detection, energy management and security applications. The sensor only offers a 32x31 image, which gives a rough map of the temperature of the environment, as shown in Figure 2. The benefits of this technology are its low cost, very small power consumption, small size, and the high sensitivity of the system.
3.2 Computer Vision Transformations
Three types of image data taken in parallel (colour, distance, temperature) are used to build a classifier whose goal is to identify whether a person is in the field of view of the robot or not. Figure 3 shows an example of the three different images obtained. Each image is treated as a gray-scale one, and the value of each pixel (its position in the matrix) is considered as a predictor variable for the machine learning database construction, summing up n × m features, m being the number of columns and n the number of rows in the image. Each image corresponds to a single case in the generated database.
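As an illustration only (the paper gives no code), the following minimal Python sketch shows how one image becomes a single case of such a database; the function name and the use of OpenCV/NumPy are our own assumptions.

```python
# Minimal sketch (not the authors' original code): turning one image into a
# database row of n x m pixel features plus a YES/NO class label.
import cv2          # assumed image library; any other would do
import numpy as np

def image_to_case(path, size=(32, 24), person_present=False):
    """Load an image, reduce it to `size`, and flatten it into predictor variables."""
    img = cv2.imread(path, cv2.IMREAD_GRAYSCALE)   # every source treated as gray-scale
    img = cv2.resize(img, size)                     # 32 columns x 24 rows -> 768 pixels
    features = img.flatten().astype(float)          # P1 ... P(n*m) predictor variables
    label = 1 if person_present else 0              # 1 = YES (person), 0 = NO
    return features, label
```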
In order to have different views of the images, dif-
ferent modifications over the original pictures are per-
formed. The main goal of this phase is to have vari-
ability in the aspect the picture offers, so that different
values are obtained for the same pixel positions.
We have selected some of the most common trans-
formations offered by related software, in order to
show the benefits of the proposed approach making
use of simple algorithms. Table 1 presents the trans-
formations used, as well as a brief description of each
of them. It is worth pointing out that any other CV transformation could be used apart from the selected ones.
Figure 3: Image preprocessing and training database creation: for each data source (Kinect image, Kinect distances, thermal sensor temperatures), every image becomes one case whose predictor variables P1 ... Pn×m are its pixel values and whose category is YES or NO.
Table 1: Used image transformations.
Transform Command Effect
Transf. 1 Convolve Apply a convolution kernel to the image
Transf. 2 Despeckle Reduce the speckles within an image
Transf. 3 Edge Detect edges in the image
Transf. 4 Enhance Apply a filter to enhance a noisy image
Transf. 5 Equalize Perform histogram equalization
Transf. 6 Gamma Perform a gamma correction
Transf. 7 Gaussian Reduce image noise and detail levels
Transf. 8 Lat Local adaptive thresholding
Transf. 9 Linear-Str. Linear with saturation histogram stretch
Transf. 10 Median Apply a median filter to the image
Transf. 11 Modulate Vary the brightness, saturation, and hue
Transf. 12 Negate Negate the image
Transf. 13 Radial-blur Radial blur the image
Transf. 14 Raise Create a 3-D effect
Transf. 15 Selective-blur Blur pixels within a contrast threshold
Transf. 16 Shade Shade the image
Transf. 17 Sharpen Sharpen the image
Transf. 18 Shave Shave pixels from the image edges
Transf. 19 Sigmoidal Increase the contrast
Transf. 20 Transform Affine transform image
Transf. 21 Trim Trim image edges
Transf. 22 Unsharp Sharpen the image
Transf. 23 Wave Alter an image along a sine wave
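The commands in Table 1 correspond to standard image-processing operations; as a hedged illustration (the exact kernels and parameters used in the paper are not specified), the sketch below approximates a few of them with OpenCV so that each transformed image can feed an extra training database.

```python
# Illustrative sketch only: approximating a few Table 1 transformations with
# OpenCV; parameters are placeholders, not the paper's actual settings.
import cv2

def example_transforms(gray):
    """Return a dict of transformed versions of a gray-scale image (uint8 array)."""
    return {
        "gaussian": cv2.GaussianBlur(gray, (3, 3), 0),   # Transf. 7: noise reduction
        "median":   cv2.medianBlur(gray, 3),             # Transf. 10: median filter
        "edge":     cv2.Canny(gray, 50, 150),            # Transf. 3: edge detection
        "negate":   cv2.bitwise_not(gray),               # Transf. 12: negative image
        "equalize": cv2.equalizeHist(gray),              # Transf. 5: histogram equalization
    }
```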
3.3 Machine Learning Classifiers
As classifiers we use five well known ML supervised
classification algorithms (Mitchell, 1997) with com-
pletely different approaches to learning and a long
tradition in different classification tasks: IB1, Naive-
Bayes, Bayesian Network, C4.5 and SVM. Then, the
goal of our fusion process is to maximize the bene-
fits of each modality by intelligently fusing their in-
formation, and by overcoming the limitations of each
modality alone.
Thermaland3DKinectSensorFusionforRobustPeopleDetectionusingEvolutionarySelectionofSupervisedClassifiers
105
IB1. The IB1 (Aha et al., 1991) is a case-based,
Nearest-Neighbor classifier. To classify a new test
sample, all training instances are stored and the training instance nearest to the test sample is found; its class is returned as the predicted class of the test sample.
Naive-Bayes. The Naive-Bayes (NB) rule (Cestnik, 1990) uses the Bayes theorem to predict the class for each case, assuming that the predictor variables are independent given the category. To classify a new sample characterized by d variables X = (X_1, X_2, ..., X_d), the NB classifier applies the following rule:

$$c_{NB} = \arg\max_{c_j \in C} \; p(c_j) \prod_{i=1}^{d} p(x_i \mid c_j)$$

where c_{NB} denotes the class label predicted by the Naive-Bayes classifier and the possible classes of the problem are grouped in C = {c_1, ..., c_l}.
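As a worked sketch of the rule above (our illustration, assuming discrete pixel features and probabilities that have already been estimated from the training data), the decision can be computed in log space to avoid numerical underflow:

```python
# Sketch of the Naive-Bayes decision rule above; priors and conditional
# probabilities are assumed to be pre-estimated (e.g. by counting).
import numpy as np

def nb_predict(x, priors, cond_prob):
    """priors[c] = p(c); cond_prob[c][i][v] = p(X_i = v | c). Returns the argmax class."""
    best_class, best_score = None, -np.inf
    for c, p_c in priors.items():
        # sum of logs instead of product of probabilities (many pixel variables)
        score = np.log(p_c) + sum(np.log(cond_prob[c][i][v]) for i, v in enumerate(x))
        if score > best_score:
            best_class, best_score = c, score
    return best_class
```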
Bayesian Networks. A Bayesian network, belief
network or directed acyclic graphical model is a prob-
abilistic graphical model that represents a set of ran-
dom variables and their conditional independencies
via a directed acyclic graph (DAG). Probabilistic clas-
sifiers assign to the new case the most likely class given the observed data. In this paper we have used
Bayesian Networks as classification models (Sierra
et al., 2009).
C4.5. The C4.5 (Quinlan, 1993) represents a classi-
fication model by a decision tree. It is run with the de-
fault values of its parameters. The tree is constructed
in a top-down way, dividing the training set and be-
ginning with the selection of the best variable in the
root of the tree.
Support Vector Machines (SVM). SVMs are a set of related supervised learning methods used for classification and regression. Viewing input data as two
sets of vectors in an n-dimensional space, an SVM
will construct a separating hyperplane in that space,
one which maximizes the margin between the two
data sets. To calculate the margin, two parallel hyper-
planes are constructed, one on each side of the sep-
arating hyperplane, which are pushed up against the
two data sets (Meyer et al., 2003).
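The paper does not name a specific toolkit; purely as an illustration, scikit-learn offers close analogues of four of the five paradigms (a general Bayesian network classifier has no direct scikit-learn equivalent and is omitted here), and the decision tree is a stand-in rather than an exact C4.5 implementation.

```python
# Hypothetical instantiation of the single-classifier paradigms with scikit-learn
# analogues; the paper's actual implementation and parameters are not specified.
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC

classifiers = {
    "IB1":  KNeighborsClassifier(n_neighbors=1),  # 1-nearest-neighbour, as in IB1
    "NB":   GaussianNB(),                          # Naive-Bayes on pixel values
    "C4.5": DecisionTreeClassifier(),              # decision tree as a C4.5 stand-in
    "SVM":  SVC(),                                 # support vector machine
    # A Bayesian network classifier would be added here with a dedicated library.
}
```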
3.4 Combination of Classifiers
In order to finally classify the targets as human or non-human, the estimates of the Kinect-based classifiers have to be combined with the estimate of the thermal-based classifier. After building the individual classifiers (5 × 24 = 120 for each sensor), the aim is to combine the output of the different classifiers to obtain a more robust final people detector.
Figure 4: Stacked Generalization schema: the outputs of the single classifiers (SVM, IB1, BN, C4.5, NB) for a new case are fed to a meta-classifier that produces the final decision.
Two approaches are performed and compared:
1. Stacked Generalization approach: a standard multi-classifier to combine the 360 classifiers.
2. Classifier Subset Selection Stacking: a selection of some of the classifiers is done first, and then the combination is performed among the selected classifiers.
Stacked Generalization. The last step is to com-
bine the results of the classifiers obtained for the three
sensors (colour, distance, temperature). To achieve
this, we use a bi-layer Stacked Generalization ap-
proach (Wolpert, 1992; Sierra et al., 2001) in which
the decision of each of the 360 single classifiers is
combined by means of another method, the so called
meta-classifier. Figure 4 shows the typical way a classification is performed with this multiclassifier. It has to be noticed that the second
layer classifier could be any function, including a sim-
ple vote approach among the used classifiers.
We have used this multiclassifier to combine the different classifiers learned on each type of image. It is worth noticing that this is done for comparison purposes only, as our proposal is to use only some of those classifiers, reducing the computational load while increasing the obtained accuracy.
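A minimal sketch of the stacked generalization scheme of Figure 4 is given below (our illustration under stated assumptions, not the authors' code): the out-of-fold predictions of the first-layer classifiers become the input features of the meta-classifier, so the second layer is not trained on overfitted outputs.

```python
# Minimal stacked generalization sketch: one column of meta-features per
# (transformation, classifier) pair, then a meta-classifier on top.
import numpy as np
from sklearn.model_selection import cross_val_predict
from sklearn.naive_bayes import GaussianNB

def train_stacking(first_layer, datasets, y, meta=GaussianNB(), cv=10):
    """first_layer: list of (name, classifier); datasets[name] = feature matrix X;
    y: 0/1 class labels (1 = person present)."""
    meta_features = []
    for name, clf in first_layer:
        X = datasets[name]
        # out-of-fold predictions, so the meta-level sees honest estimates
        meta_features.append(cross_val_predict(clf, X, y, cv=cv))
        clf.fit(X, y)                      # refit on all data for later use
    Z = np.column_stack(meta_features)     # one column per single classifier
    meta.fit(Z, y)                         # learn the meta-classifier
    return first_layer, meta
```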
Classifier Subset Selection Stacking. The new multiclassifier paradigm, which extends the Stacked Generalization approach, is shown in Figure 5. As can be seen, we added to the multiclassifier an intermediate phase in which a subset of the classifiers belonging to the first layer is selected. The criterion used to make the selection depends on the goal of the classification task; we have decided to use the classification accuracy in our case.
The way the classifiers are selected (and discarded) is not unique; based on our previous experience, we decided to use Estimation of Distribution Algorithms to perform the so called Classifier Subset Selection (CSS), which reduces the number of classifiers to be used in the final model, decreasing in this way the computational payload while increasing the obtained accuracy.
Figure 5: Classifier Subset Selection Stacking. In the learning phase all image transformations (T1 ... Tn) are combined with all classifiers (C1 ... Cm); a subset of (transformation, classifier) pairs, e.g. (T1, C2), (T2, C1), (T20, C1), is then selected, and only the selected classifiers are applied to a new case before the meta-classifier produces the final decision.
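As a sketch of how a candidate subset could be scored during this selection (our own formulation: the paper only states that classification accuracy is the criterion), a binary mask over the single classifiers can be evaluated by the cross-validated accuracy of the stacked combination restricted to the selected outputs; this fitness is what the EDA described next would maximize.

```python
# Sketch of a CSS fitness function (our assumption): accuracy of the stacked
# combination restricted to the classifiers selected by a 0/1 mask.
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

def subset_fitness(mask, meta_features, y, meta=GaussianNB(), cv=10):
    """mask: 0/1 vector over single classifiers; meta_features: (n_cases, n_classifiers)."""
    if mask.sum() == 0:
        return 0.0                                    # an empty subset is useless
    Z = meta_features[:, mask.astype(bool)]           # keep only selected classifier outputs
    return cross_val_score(meta, Z, y, cv=cv).mean()  # estimated accuracy of the combination
```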
3.4.1 Estimation of Distribution Algorithms
Estimation of distribution algorithms (EDAs) have been successfully applied to combinatorial optimization (Inza et al., 2000). They combine statistical learning with population-based search in order to automatically identify and exploit certain structural properties of optimization problems: instead of the crossover and mutation operators of genetic algorithms, at each generation an EDA estimates a probability distribution from the most promising individuals and samples new candidate solutions from it. In our case, each candidate solution encodes which of the single classifiers enter the final combination, and its fitness is the accuracy of the resulting stacked classifier.
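The paper does not specify the exact EDA variant used; the sketch below (an assumption for illustration) uses the simplest one, UMDA with univariate marginals, to search the space of classifier subsets with the fitness function sketched above.

```python
# Illustrative UMDA-style EDA for classifier subset selection; each individual
# is a binary inclusion mask over the available single classifiers.
import numpy as np

def umda_select(fitness, n_classifiers, pop_size=50, n_best=25, generations=30, rng=None):
    if rng is None:
        rng = np.random.default_rng(0)
    probs = np.full(n_classifiers, 0.5)                  # initial marginal probabilities
    best_mask, best_fit = None, -np.inf
    for _ in range(generations):
        pop = (rng.random((pop_size, n_classifiers)) < probs).astype(int)  # sample population
        fits = np.array([fitness(ind) for ind in pop])                      # evaluate fitness
        elite = pop[np.argsort(fits)[-n_best:]]                             # keep the best half
        probs = elite.mean(axis=0).clip(0.05, 0.95)       # re-estimate marginal distribution
        if fits.max() > best_fit:
            best_fit, best_mask = fits.max(), pop[fits.argmax()].copy()
    return best_mask, best_fit
```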
4 EXPERIMENTAL SETUP
The manufacturing plant is a real manufacturing shop
floor where machines and humans share the space in
performing production activities. The shop floor in
Figure 6 can be characterized as an industrial envi-
ronment, with high ceilings, fluorescent light bulbs,
high windows, etc. The lighting conditions vary greatly from one day to another and even among different locations along the path covered by the robot.
Figure 6: Manufacturing plant.
Method. These are the steps of the experimental phase:
1. Collect a database of images containing the three data types captured by the two sensors: a 640x480 depth map in real time (30 fps), a 640x480 RGB image, and a 32x31 thermopile array.
2. Reduce the image sizes from 640x480 to 32x24, and convert colour images to gray-scale ones.
3. For each image, apply 23 computer vision algorithms, obtaining 23 new databases for each image type. Thus, we have 24 data sets for each image type.
4. Build 120 classifiers per image type, applying the 5 machine learning algorithms to each of the training data sets (5 × 24).
5. Apply 10-fold cross-validation using the 5 different classifiers on each of the previous databases, summing up a total of 3 × 24 × 5 = 360 validations (see the sketch after this list).
6. Select a combined classifier among the 360 different models using two approaches: (1) a multiclassifier to combine all the classifiers learned on each type of image; (2) the Classifier Subset Selection stacking approach.
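As referenced in step 5, the following sketch shows one plausible way to organize the 360 ten-fold cross-validations; the data structures and function names are illustrative assumptions, not the authors' implementation.

```python
# Illustrative sketch of steps 4-5: 10-fold cross-validation of every classifier
# on every (image type, transformation) database.
from sklearn.model_selection import cross_val_score

def evaluate_all(databases, classifiers, y, cv=10):
    """databases[(image_type, transform)] = feature matrix X; classifiers[name] = estimator."""
    results = {}
    for key, X in databases.items():               # 3 image types x 24 databases
        for name, clf in classifiers.items():      # 5 paradigms -> 360 validations
            results[(key, name)] = cross_val_score(clf, X, y, cv=cv).mean()
    return results                                  # accuracies as in Tables 2-5
```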
Training Data Sets. The training data set is com-
posed of 1064 samples. The input to the supervised
algorithms is composed of 301 positive and 764 neg-
ative examples. The set of positive examples contains
people at different positions and dressed with differ-
ent clothing in a typical manufacturing environment.
The set of negative examples is composed of images
without people in the image and with other objects
in the environment such as machines, tables, chairs,
walls, etc.
To obtain the positive and negative examples the
robot was operated in an unconstrained indoor envi-
ronment (the manufacturing plant). At the same time,
image data was collected with a frequency of 1Hz.
During robot motion the images were hand-labeled as positive examples if a person was visually present in the image, and as negative examples otherwise.
Thermaland3DKinectSensorFusionforRobustPeopleDetectionusingEvolutionarySelectionofSupervisedClassifiers
107
5 EXPERIMENTAL RESULTS
Performance of the people detection system is eval-
uated in terms of detection rates and false positives
or negatives. In order to make the classification fast (a real-time response is expected), we first transform the colour images into 32 × 24 gray-scale ones, and likewise reduce the infrared (depth) images to a 32 × 24 matrix. Hence we have to deal with 768 predictor variables, instead of the 307200 × 3 (colours) of the original images taken by the Kinect camera.
First of all, we have used the five classifiers on the reduced original databases (32 × 24 for images and distances, 32 × 31 for thermal pictures). Table 2 shows the 10 fold cross-validation accuracy obtained. The best result, 92.11%, is obtained for the original thermal image database using SVM as classifier. The accuracy of the Kinect's real-time detection algorithms on the same images was quite poor (37.50%), as the robot was moving around the environment and the Kinect was designed to be used as a static device. As a matter of fact, this was the origin of the presented research.
Table 2: 10 Fold cross-validation accuracy percentage obtained for each classifier using original images.
Data source BN NB C4.5 K-NN SVM
RGB 89.20 71.74 82.63 90.89 85.35
Depth 86.29 68.64 83.29 90.89 84.04
Thermal 89.67 86.10 87.79 91.74 92.11
The same accuracy validation process has been
applied to each image transformation on each im-
age format. Table 3 shows the results obtained by
each classifier on the 23 transformed image databases.
The best result is obtained by the C4.5 classifier af-
ter transforming the images using Transformation 7
(Gaussian one).
After performing the validation over the distance
images, the results shown in Table 4 are obtained. The
best result is obtained again by the C4.5 classifier af-
ter transforming the images using Transformation 7
(Gaussian one), with a 92.82 accuracy.
Finally, the classifiers are applied to the thermal
images, obtaining the results shown in Table 5. In
this case we obtain the best result (93.52) for the SVM classifier, and for two of the used transformations (Transf. 8, Lat, and Transf. 9, Linear-Stretch). Moreover, the obtained result is identical for both transformations, so they are redundant and, if selected, only one of them needs to be used in the final combination, yielding the same results.
Table 3: Images: 10 fold cross-validation accuracy percent-
age obtained for each classifier using each of the proposed
transformations.
Images BN NB C4.5 K-NN SVM
Transf. 1 89.20 71.74 90.89 82.63 85.35
Transf. 2 87.89 72.30 90.99 84.41 86.29
Transf. 3 83.19 74.84 87.98 75.87 81.41
Transf. 4 88.92 71.92 90.89 82.44 86.20
Transf. 5 86.76 71.64 89.77 80.47 80.66
Transf. 6 87.98 71.36 90.89 83.29 86.29
Transf. 7 87.79 64.79 91.83 85.92 84.79
Transf. 8 76.81 78.03 85.07 71.36 76.90
Transf. 9 88.54 73.90 91.17 81.31 84.98
Transf. 10 87.98 69.48 90.70 82.82 84.69
Transf. 11 85.54 72.96 91.55 82.07 85.26
Transf. 12 88.92 71.74 90.89 82.63 85.35
Transf. 13 88.73 68.64 90.99 82.63 85.45
Transf. 14 88.83 71.74 90.89 83.76 85.54
Transf. 15 89.20 71.74 90.89 82.63 85.35
Transf. 16 83.85 75.12 86.38 77.93 81.78
Transf. 17 89.77 71.46 90.23 83.00 82.44
Transf. 18 88.73 71.55 90.61 82.35 85.35
Transf. 19 88.17 70.61 91.46 82.82 86.10
Transf. 20 89.11 70.99 90.80 82.63 84.98
Transf. 21 89.20 71.74 90.89 82.63 85.35
Transf. 22 88.83 71.36 90.33 82.35 82.72
Transf. 23 88.73 72.30 90.80 83.85 85.82
5.1 Final Combination
The last step is to combine the results of the classifiers obtained, 120 for each sensor. To do that, we first use a Stacking classifier (Wolpert, 1992) in which the decision of each single classifier is combined by means of another classifier (the so called meta-classifier). Table 6 shows the obtained results. As can be seen, the best obtained accuracy is 95.31%, using a Bayesian Network as meta-classifier. It significantly improves the result of the best single classifier (93.52 for the thermal images).
It is worth mentioning that the best classifier combination obtained uses a total of 51 single classifiers, and that all three sensors are used, i.e., transformations and their related single classifiers have been selected for each of the sensors. Although the number of classifiers could be seen as high for real-time image processing, it has to be taken into account that we use small images which are quickly transformed, and that the classifiers, once constructed, give the classification result in milliseconds. Classifier parallelization could also be used to obtain a faster answer, as all of the single classifiers can be executed independently, but it is not really necessary in this case.
ICINCO2013-10thInternationalConferenceonInformaticsinControl,AutomationandRobotics
108
Table 4: Distances: 10 fold cross-validation accuracy per-
centage obtained for each classifier using each of the pro-
posed transformations.
Distances BN NB C4.5 K-NN SVM
Transf. 1 86.29 68.64 90.89 83.29 84.04
Transf. 2 86.38 68.45 91.27 83.38 82.91
Transf. 3 83.66 78.87 87.23 78.97 81.60
Transf. 4 86.10 68.54 90.89 82.91 83.29
Transf. 5 85.35 70.80 90.89 80.38 81.97
Transf. 6 86.38 70.33 90.61 82.25 83.76
Transf. 7 85.92 66.95 92.86 85.26 84.23
Transf. 8 83.19 73.62 84.04 73.15 78.40
Transf. 9 85.26 67.70 90.33 83.00 83.19
Transf. 10 85.54 68.92 92.30 85.16 85.35
Transf. 11 84.69 68.26 90.99 81.50 82.35
Transf. 12 86.67 68.64 90.89 83.38 84.04
Transf. 13 85.35 68.08 92.21 82.54 83.29
Transf. 14 86.57 68.73 90.89 83.76 84.13
Transf. 15 86.29 68.64 90.89 83.29 84.04
Transf. 16 83.66 78.69 87.14 80.38 85.35
Transf. 17 85.63 71.27 90.52 82.25 81.50
Transf. 18 85.63 66.20 89.77 82.72 82.54
Transf. 19 86.48 70.05 90.89 83.85 83.94
Transf. 20 86.67 69.01 90.70 83.29 83.85
Transf. 21 85.45 70.33 91.36 83.29 82.82
Transf. 22 85.73 71.08 90.42 81.78 81.60
Transf. 23 85.92 68.64 91.27 80.47 83.10
6 CONCLUSIONS AND FUTURE WORK
This paper presented a people detection system for mobile robots using a 3D camera and thermal vision, and provided a thorough evaluation of its performance. The system uses a combination of Computer Vision and Machine Learning paradigms. This approach was designed to manage three kinds of input images (depth, colour and temperature) to detect people.
ple. We showed that the detection of a person is im-
proved by cooperatively classifying the feature matrix
computed from the input data, where we made use
of Computer Vision transformations and supervised
learning techniques to obtain the classifiers. Our algo-
rithm performed well across a number of experiments
in a real manufacturing plant. This work serves as an
introduction to the potential of multi-sensor fusion in
the domain of people detection on mobile platforms.
In the near future we envisage:
- To extend the approach to other scenarios, in particular toward a museum scenario.
- To develop trackers combining/fusing visual cues using particle filter strategies, including face recognition, in order to track people or gestures.
- To integrate with the robot's navigation planning ability, so as to explicitly consider the human in the loop during robot movement.
Table 5: Thermal sensor: 10 fold cross-validation accu-
racy percentage obtained for each classifier using each of
the proposed transformations.
Thermal images BN NB C4.5 K-NN SVM
Transf. 1 89.67 86.10 91.74 87.79 92.11
Transf. 2 90.99 84.32 92.39 91.46 92.58
Transf. 3 89.30 86.67 90.80 86.29 92.39
Transf. 4 89.11 83.85 92.49 89.39 90.33
Transf. 5 85.73 84.60 92.77 90.33 85.63
Transf. 6 89.67 85.92 91.74 87.79 91.83
Transf. 7 86.57 82.16 89.67 87.79 89.95
Transf. 8 89.11 85.92 91.64 84.04 93.52
Transf. 9 90.80 88.08 92.39 87.89 93.52
Transf. 10 84.98 81.97 86.29 80.56 85.63
Transf. 11 71.74 71.74 71.74 71.74 71.74
Transf. 12 89.77 85.63 91.74 87.79 92.11
Transf. 13 90.05 84.69 92.77 90.14 91.08
Transf. 14 89.11 86.01 91.08 87.89 91.83
Transf. 15 89.67 86.10 91.74 87.79 92.11
Transf. 16 89.48 86.85 91.17 90.33 89.95
Transf. 17 89.67 87.23 91.74 87.04 90.99
Transf. 18 89.11 85.63 91.55 85.63 89.86
Transf. 19 89.67 85.07 91.83 87.79 91.83
Transf. 20 89.77 86.01 91.74 87.79 92.68
Transf. 21 83.57 47.89 84.41 82.54 72.02
Transf. 22 89.77 85.82 91.92 87.79 91.17
Transf. 23 90.05 85.45 92.02 90.33 91.27
Table 6: Multiclassifier combination: 10 fold cross-
validation accuracy percentages obtained.
Metaclassifier BN NB C4.5 K-NN SVM
Results (360 classifiers) 95.31 94.93 94.27 94.93 94.27
Results (CSS) 98.87 96.53 96.06 98.12 97.37
- To use other single-classifier paradigms, and other transformations.
- To use more complex computer vision approaches (SIFT, SFOP and so forth).
ACKNOWLEDGEMENTS
The work described in this paper was partially con-
ducted within the ktBOT project and funded by
KUTXA Obra Social, the Basque Government Re-
search Team grant and the University of the Basque
Country UPV/EHU, under grant UFI11/45 (BAILab).
REFERENCES
Aha, D., Kibler, D., and Albert, M. (1991). Instance-based
learning algorithms. Machine Learning, 6:37–66.
Arras, K. O., Martinez, O., and Burgard, M. W. (2007).
Using boosted features for detection of people in 2d
Thermaland3DKinectSensorFusionforRobustPeopleDetectionusingEvolutionarySelectionofSupervisedClassifiers
109
range scans. In Proc. of the IEEE Intl. Conf. on
Robotics and Automation.
Bellotto, N. and Hu, H. (2010). A bank of unscented
kalman filters for multimodal human perception with
mobile service robots. International Journal of Social
Robotics, 2(2):121–136.
Cestnik, B. (1990). Estimating probabilities: a crucial task
in machine learning. In Proceedings of the European
Conference on Artificial Intelligence, pages 147–149.
Guan, F., Li, L., Ge, S., and Loh, A. P. (2007). Robust hu-
man detection and identification by using stereo and
thermal images in human-robot interaction. Interna-
tional Journal of Information Acquisition, 4(2):1–22.
Gundimada, S., Asari, V. K., and Gudur, N. (2010). Face
recognition in multi-sensor images based on a novel
modular feature selection technique. Inf. Fusion,
11:124–132.
Hofmann, M., Kaiser, M., Aliakbarpour, H., and Rigoll, G.
(2011). Fusion of multi-modal sensors in a voxel oc-
cupancy grid for tracking and behaviour analysis. In
In: Proc. 12th Intern. Workshop on Image Analysis
for Multimedia Interactive Services (WIAMIS), Delft,
The Netherlands.
HTPA. Heimann sensor. http://www.heimannsensor.com/index.php.
Inza, I., Larrañaga, P., Etxeberria, R., and Sierra, B. (2000).
Feature subset selection by bayesian networks based
optimization. Artificial Intelligence, 123(1-2):157–
184.
Johnson, M. J. and Bajcsy, P. (2008). Integration of thermal
and visible imagery for robust foreground detection in
tele-immersive spaces. In Information Fusion, 2008
11th International Conference on.
Kinect. Kinect sensor. http://en.wikipedia.org/wiki/Kinect.
Kuncheva, L. I. (2004). Combining Pattern Classifiers:
Methods and Algorithms. John Wiley and Sons, Inc.
Meyer, D., Leisch, F., and Hornik, K. (2003). The sup-
port vector machine under test. Neurocomputing,
55(1):169–186.
Mitchell, T. (1997). Machine Learning. McGraw Hill.
Morales, N., Toledo, J., Acosta, L., and Arnay, R. (2011).
Real-time adaptive obstacle detection based on an im-
age database. CVIU, 115(9):1273–1287.
Mozos, O. M., Kurazume, R., and Hasegawa, T. (2010).
Multi-part people detection using 2D range data. In-
ternational Journal of Social Robotics, 2(1):31–40.
Mühlenbein, H. and Paaß, G. (1996). From recombination
of genes to the estimation of distributions. In Lecture
Notes in Computer Science: Parallel Solving from Na-
ture IV, volume 1411, pages 178–187. Springer Ver-
lag.
Oliveira, L., Nunes, U., and Peixoto, P. (2010). On explo-
ration of classifier ensemble synergism in pedestrian
detection. IEEE Transactions on Intelligent Trans-
portation Systems, 11(1):16–27.
Quinlan, J. (1993). C4.5: Programs for Machine Learning.
Morgan Kaufmann Publishers.
Schiele (2009). Visual People Detection - Different Models,
Comparison and Discussion. In IEEE International
Conference on Robotics and Automation (ICRA).
Sierra, B., Lazkano, E., Jauregi, E., and Irigoien, I. (2009).
Histogram distance-based bayesian network struc-
ture learning: A supervised classification specific ap-
proach. Decision Support Systems, 48(1):180–190.
Sierra, B., Serrano, N., Larrañaga, P., Plasencia, E. J., Inza, I., Jiménez, J. J., Revuelta, P., and Mora, M. L. (2001).
Using bayesian networks in the construction of a bi-
level multi-classifier. A case study using intensive care
unit patients data. Artificial Intelligence in Medicine,
22(3):233–248.
St-Laurent, L., Prévost, D., and Maldague, X. (2006). Ther-
mal imaging for enhanced foreground-background
segmentation. In The 8th Quantitative Infrared Ther-
mography (QIRT) conference, Padova, Italy.
Thi Thi Zin, Hideya Takahashi, T. T. and Hama, H. (2011).
Fusion of infrared and visible images for robust per-
son detection. Image Fusion. InTech, Available from:
http://www.intechopen.com/articles/show/title/fusion-
of-infrared-and-visible-images-for-robust-person-
detection.
Treptow, A., Cielniak, G., and Duckett, T. (2005). Ac-
tive people recognition using thermal and grey im-
ages on a mobile security robot. In Proceedings of
the 2005 IEEE/RSJ International Conference on In-
telligent Robots and Systems (IROS 2005), Edmonton,
Canada.
Wolpert, D. H. (1992). Stacked generalization. Neural Net-
works, 5:241–259.
Yao, J. and Odobez, J.-M. (2011). Fast human detection
from joint appearance and foreground feature subset
covariances. Computer Vision and Image Understand-
ing, 115(10):1414–1426.
ICINCO2013-10thInternationalConferenceonInformaticsinControl,AutomationandRobotics
110