Improving User Experience via Motion Sensors in an
Ambient Intelligence Scenario
Giuseppe Lo Re, Marco Morana and Marco Ortolani
DICGIM, University of Palermo, Viale delle Scienze, ed. 6, Palermo, Italy
Keywords:
HCI, Sensor Networks, Ambient Intelligence, Embedded Systems.
Abstract:
Ambient Intelligence (AmI) is a new paradigm in Artificial Intelligence that aims at exploiting the information
about the environment state in order to adapt it to the user preferences. AmI systems are usually based on
several cheap and unobtrusive sensing devices that allow for continuous monitoring in different scenarios.
In this work we present a gesture recognition module for the management of an office environment using a
motion sensor device, namely Microsoft Kinect, as the primary interface between the user and the AmI system.
The proposed gesture recognition method is based on both RGB and depth information for detecting the hand
of the user and a fuzzy rule for determining the state of the detected hand. The shape of the hand is interpreted
as one of the basic symbols of a grammar expressing a set of commands for the actuators of the AmI system.
In order to maintain a high level of pervasiveness, the Kinect sensor is connected to a miniature computer
capable of real-time processing.
1 INTRODUCTION
With the widespread diffusion of cheap and unob-
trusive sensing devices, nowadays it is possible to
perform continuous monitoring of a wide range of
different environments. The availability of an ever-increasing amount of data acquired by such sensing devices has piqued the interest of the scientific community in producing novel methods for combining raw measurements in order to understand what is happening in the monitored scenario.
Many works in the literature address the problem of heterogeneous data analysis with the aim of obtaining a unitary representation of the observed
scene. In particular, Ambient Intelligence (AmI) is
a new paradigm in Artificial Intelligence that aims
at exploiting the information about the environment
state in order to adapt it to the user preferences. Thus,
the intrinsic requirement of any AmI system is the
presence of pervasive sensory devices; moreover, due to the primary role of the end user, an additional requirement is to provide the system with efficient HCI functionalities. Given the high level of pervasiveness achievable with currently available sensing and actuating devices, the use of equally unobtrusive interfaces is mandatory.
In this work we present a system for the manage-
ment of an office environment, namely the rooms of
a university department, using a motion sensor de-
vice, i.e. Microsoft Kinect, as the primary interface
between the user and the AmI system. In our architec-
ture, the sensory component is implemented through
a Wireless Sensor and Actuator Network (WSAN),
whose nodes are equipped with off-the-shelf sensors
for measuring such quantities as indoor and outdoor
temperature, relative humidity, ambient light expo-
sure and noise level. Such networks (De Paola et al.,
2012b) do not only passively monitor the environ-
ment, but represent the tool allowing the system to
interact with the surrounding world. WSANs are the active part of the system and make it possible to modify the environment according to the observed data, high-level goals (e.g., energy efficiency), and user preferences.
In our vision, Kinect represents both a sensor (since it is used for monitoring tasks such as people counting) and a controller for the actuators. In particular, the people counting algorithm we developed is based on an optimized version of the method natively implemented by the Kinect libraries, taking into account the limited computational resources of our target device. The actuator control is performed by training a fuzzy classifier to recognize some simple gestures (i.e., open/closed hands) in order to produce a set of commands opportunely structured by means of a grammar. The use of a grammar for the comprehension of the visual commands is also exploited
since we make use of the parser in order to fine-tune the behavior of the fuzzy recognizer.
The paper is organized as follows: related work is presented in Section 2, while the proposed system architecture is described in Section 3. An experimental deployment realized in our department is discussed in Section 4. Conclusions follow in Section 5.
2 RELATED WORK
The core of our proposal involves the use of a reli-
able gesture detection device, and we selected Mi-
crosoft Kinect as a promising candidate to this aim.
Kinect is based on the hardware reference design and the structured-light decoding chip provided by PrimeSense, an Israeli company which also provides a framework, OpenNI, that supplies a set of APIs to be implemented by the sensor devices, and another set of APIs, NITE, to be implemented by the middleware components. Moreover, PrimeSense has recently released a proprietary Kinect-like sensor, called the PrimeSense 3D Sensor.
Although Kinect has been on the market for only a couple of years, it has already attracted a number of researchers, thanks to the availability of open-source and multi-platform libraries that reduce the cost of developing new algorithms. A survey of the sensor and of the corresponding libraries is presented in (Kean et al., 2011; Borenstein, 2012).
In (Xia et al., 2011) a method for detecting human bodies using depth information acquired by the Kinect is presented. The authors perform the detection task by applying some state-of-the-art computer vision techniques; however, their system runs on a traditional PC, so that computationally intensive tasks (e.g., 3D modeling) cannot be implemented on a low-power device. The problem of segmenting humans by using the Kinect is also addressed in (Gulshan et al., 2011), while the authors of (Raheja et al., 2011) focus on hand tracking. They present an intuitive solution by detecting the palm first and then the fingers; however, in order to obtain hand images at a resolution high enough to be processed successfully, the user is forced to stay close to the Kinect.
In our vision, the Kinect sensor represents an “eye” that observes the user acting freely in the environment, collects information and forwards user requests to a reasoner, according to the architecture described in (De Paola et al., 2012a). The remote devices act as the terminations of a centralized sentient reasoner that is responsible for the intelligent processing. Higher-level information is extracted from sensed data in order to produce the necessary actions to adapt the environment to the users' requirements. A set of actuators finally takes care of putting the planned modifications to the environment state into practice.

Figure 1: Block diagram of the proposed gesture recognition algorithm.
3 SYSTEM OVERVIEW
In the context of Ambient Intelligence, a key require-
ment is that the presence of the monitoring and con-
trol infrastructure is hidden from the users, so as to
provide a smoother way for them to interact with the
system. For the purpose of the present discussion, we
will specifically consider the possibility for the users
to interact with the available actuators, as naturally as
possible, by controlling their operation mode and by
querying them about their current state. For instance,
the user can control some actuators (e.g., the air conditioning system or the lighting) by providing a sequence of commands that builds up a complex configuration, e.g., turn on the air conditioning system, set the temperature to a certain degree, set the fan speed to a particular value, set the air flow to a specified angle, and so on.
Figure 1 shows a block diagram for the HCI mod-
ule of our system; namely it depicts the core compo-
nents of the proposed gesture recognition algorithm.
The actions of the users are captured by Kinect and
analyzed by the Fuzzy Gesture Recognizer. The rec-
ognized input symbols are then processed by our in-
terpreter and, at each step (i.e. for every recognized
gesture), a set of the next admissible input symbols is
provided as feedback to the fuzzy classifier.
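The interplay among these components can be summarized as a simple control loop. The following Python sketch is purely illustrative: the class and method names (kinect.frames, recognizer.classify, interpreter.consume, interpreter.start_symbols) are hypothetical placeholders and do not correspond to the actual APIs of our implementation.

# Illustrative control loop for the HCI module of Figure 1.
# All class/method names below are hypothetical placeholders.
def hci_loop(kinect, recognizer, interpreter):
    admissible = interpreter.start_symbols()       # symbols accepted at the beginning
    for frame in kinect.frames():                  # RGB-D frames sampled at a fixed rate
        symbol = recognizer.classify(frame, admissible)
        if symbol is None:                         # no admissible gesture in this frame
            continue
        admissible = interpreter.consume(symbol)   # parse step; returns next admissible symbols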
3.1 Fuzzy Gesture Recognition
Several vision-based systems have been proposed
during the last 40 years for simple gesture detection
and recognition. However, the main challenge of any
computer vision approach is to obtain satisfactory re-
sults not only in a controlled testing environment, but
also in complex scenarios with unconstrained lighting
conditions, e.g., a home environment or an office. For
this reason, image data acquired by multiple devices
are usually merged in order to increase the system re-
liability. In particular, range images, i.e., 2D images
PECCS2013-InternationalConferenceonPervasiveandEmbeddedComputingandCommunicationSystems
30
Figure 2: The gesture recognition method. For each frame (first row), the hand is detected by means of OpenNI / NITE APIs.
The coordinates of the detected hands are used to define a search window whose size is proportionally to the distance from the
Kinect. RGB-D information allow to obtain an hand mask (second row) that is described in terms of its roundness. A fuzzy
technique (third row) is finally applied to express the uncertainty about the hand shape.
in which each pixel contains the distance between the
sensor and a point in the scene, provide very useful
information about the elements of the scene, e.g., a
moving person, but range sensors used to obtain them
are very expensive.
For the considered scenario, we found that Kinect represents the most suitable device in terms of both cost and functionality, since it is equipped
with ten input/output components that make it possi-
ble to sense the users and their interaction with the
surrounding environment. The Kinect sensor rests
upon a base which contains a motor that allows for
controlling the tilt angle of the cameras (30 degrees
up or down). Starting from the bottom of the device, three adjacent microphones are located on the right side, while a fourth microphone is placed on the left side. A 3-axis accelerometer can be used for measuring the position of the sensor, while an LED indicator shows its state. However, the core of the Kinect is represented by the vision system, composed of: an RGB camera with standard VGA resolution (i.e., 640×480 pixels); an IR projector that shines a grid of infrared dots over the scene; and an IR camera that captures the infrared light. The factory calibration of the Kinect makes it possible to know the exact position of each projected dot against a surface at a known distance
from the camera. Such information is then used to
create depth images of the observed scene (i.e., pixel
values represent distances) that capture the object po-
sition in a three-dimensional space.
An example of hand tracking using Kinect comes
with the OpenNI/NITE packages.
However, the APIs are based on a global skeleton
detection method, so that the hand is defined just as
the termination of the arm and no specific information
about the hand state (e.g., an open hand vs a fist) is
provided. For this reason, such an approach is useful only as the first step of our detection procedure, since it allows us to define the image area where the hand is located.
The coordinates of the detected hand are used to
define a search window, whose size is chosen accord-
ing to a heuristic rule based on the distance z from the
Kinect. In our system, the depth information is com-
bined with data acquired by the RGB camera and a
classification algorithm (Lai et al., 2011) is applied
in order to define the hand mask within the search
window. Each hand mask is then normalized with re-
spect to scale and the final binary region is described
in terms of its roundness. In particular, the roundness
of the hand is efficiently computed as the variance of
the set of distances between each point along the hand
border and the center of mass of the hand region:
\sigma^2 = \frac{1}{n} \sum_{i=1}^{n} (x_i - \mu)^2 \qquad (1)

where µ are the coordinates of the center of mass and x_i are the coordinates of the n points along the border.
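As an illustration, the descriptor can be computed in a few lines of Python; the sketch below follows the textual definition given above (the variance of the distances between the border points and the center of mass) and assumes that the border of the normalized hand mask is available as an array of 2D coordinates. The function name is a placeholder.

import numpy as np

def roundness(border_points):
    """Roundness of the hand region (Eq. (1)): the variance of the distances
    between each point along the border and the center of mass."""
    pts = np.asarray(border_points, dtype=float)   # (n, 2) array of border coordinates x_i
    center = pts.mean(axis=0)                      # center of mass (mu) of the region
    dists = np.linalg.norm(pts - center, axis=1)   # distance of each border point from mu
    return float(np.var(dists))                    # variance of the n distances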
This feature provides useful information for discriminating between open or half-open hands, which have a low level of roundness, and closed hands, which appear almost round. Moreover, in order to better deal with the uncertainty of visual features, the concept of hand shape is modeled through fuzzy logic rules based on the roundness values:
IF roundness IS low THEN hand is open
IF roundness IS normal THEN hand is half closed
ImprovingUserExperienceviaMotionSensorsinanAmbientIntelligenceScenario
31
IF roundness IS high THEN hand is closed
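A minimal sketch of how these rules can be evaluated is reported below. Triangular membership functions and the specific breakpoints are illustrative assumptions only (in the actual system the classifier is trained on labelled hand masks), and the roundness value is assumed to be normalized to the interval [0, 1].

def tri(x, a, b, c):
    """Triangular membership function: zero outside [a, c], peaking at b."""
    if x <= a or x >= c:
        return 0.0
    return (x - a) / (b - a) if x <= b else (c - x) / (c - b)

def hand_state(r):
    """Evaluate the three rules above on a normalized roundness value r.
    The breakpoints are illustrative, not the trained ones."""
    degrees = {
        "open":        tri(r, -0.01, 0.0, 0.45),   # roundness IS low
        "half closed": tri(r,  0.25, 0.5, 0.75),   # roundness IS normal
        "closed":      tri(r,  0.55, 1.0, 1.01),   # roundness IS high
    }
    return max(degrees, key=degrees.get), degrees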
Fig. 2 shows the steps of the recognition proce-
dure. The high-level reasoning for recognizing the
hand gesture is performed by using the output of the
fuzzy rule in conjunction with a set of grammar rules.
3.2 Gesture Language Description
As already mentioned, Kinect provides an effective
way of capturing the input from the user, which in
principle could be directly translated into commands;
however, the mere recognition of hand gestures may
prove inadequate to cover with sufficient detail the
broad spectrum of possible instructions. The fuzzy
recognizer described above, for instance, is able to tell
only two hand gestures apart and, while those could
be sufficient to translate a set of commands accord-
ing to a binary alphabet, such coding would produce
lengthy words, and would be too cumbersome to be
of any practical use.
Our goal consists in providing the users with a
tool able to let them express a (relatively) broad set
of commands, starting from elementary and custom-
ary gestures; to this aim, we regard the set of possi-
ble commands and queries as a language, which can
be precisely defined with the notation borrowed from
formal language theory.
For our purposes, we define such language by
specifying a simple grammar, expressed in the usual
BNF notation (Aho et al., 2007); from this point of
view, the hand gestures can be regarded as the sym-
bols of the underlying alphabet, assuming we can
sample the images acquired by the Kinect at a fixed frequency (i.e., we can identify repetitions of the
same symbol); moreover, we will consider an addi-
tional symbol representing a separator, corresponding
to the case when no gesture is made. The following
alphabet will thus constitute the basis for the subse-
quent discussion:
Σ = {◦, •, −};

with ◦ indicating the open hand, • the fist, and − the separator; it is clear, however, that the alphabet can be easily extended by acting on the fuzzy recognizer.
Such an alphabet is used to code the basic keywords, such as those identifying the beginning of a statement; upon this, we devised a basic grammar capturing a gesture language that expresses simple queries and commands to the actuators. So, for instance, the proper sequence of gestures performed by the user will be understood as the query keyword, representing the beginning of the corresponding statement, and similarly for the other “visual lexemes”.
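To give a concrete flavor of this step, the sketch below shows how a stream of recognized symbols could be grouped into keywords. The gesture codes in the codebook are hypothetical examples, not the actual sequences adopted in our deployment ('o' stands for the open hand, 'f' for the fist, '-' for the separator).

# Hypothetical codebook mapping gesture sequences to keywords.
CODEBOOK = {
    ("o", "f", "o"): "query",
    ("f", "f", "o"): "act",
}

def lex(symbols, codebook=CODEBOOK):
    """Group raw gesture symbols into keywords: symbols are buffered until a
    separator is seen, then the buffered sequence is looked up in the codebook
    (unknown sequences are silently discarded)."""
    buf = []
    for s in symbols:
        if s == "-":
            keyword = codebook.get(tuple(buf))
            if keyword is not None:
                yield keyword
            buf = []
        else:
            buf.append(s)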
The grammar we used is a context-free grammar,
which is completely defined in Backus-Naur Form
(BNF) by the following productions¹:
P → Slist
Slist → stat | stat Slist
stat → query | cmd
query → query id [status | value]
cmd → act on id start cmdLoop stop
cmdLoop → [increase | decrease] | [increase | decrease] cmdLoop
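For illustration, a recognizer for these productions can be written as a short recursive-descent parser. The sketch below operates on keyword tokens (such as those produced by the lexing step sketched earlier) and is only meant to make the structure of the grammar concrete; it is not our actual interpreter.

QUERY, ACT, ON, START, STOP = "query", "act", "on", "start", "stop"
INC, DEC, STATUS, VALUE = "increase", "decrease", "status", "value"
KEYWORDS = {QUERY, ACT, ON, START, STOP, INC, DEC, STATUS, VALUE}

class GestureParser:
    """Recursive-descent recognizer for the productions listed above."""

    def __init__(self, tokens):
        self.toks, self.pos = list(tokens), 0

    def peek(self):
        return self.toks[self.pos] if self.pos < len(self.toks) else None

    def eat(self, expected):
        if self.peek() != expected:
            raise SyntaxError(f"expected {expected!r}, got {self.peek()!r}")
        self.pos += 1

    def parse(self):                        # P -> Slist;  Slist -> stat | stat Slist
        while self.peek() is not None:
            self.stat()

    def stat(self):                         # stat -> query | cmd
        self.query() if self.peek() == QUERY else self.cmd()

    def query(self):                        # query -> 'query' id [status | value]
        self.eat(QUERY)
        self.ident()
        if self.peek() in (STATUS, VALUE):
            self.eat(self.peek())

    def cmd(self):                          # cmd -> 'act' 'on' id 'start' cmdLoop 'stop'
        self.eat(ACT); self.eat(ON); self.ident(); self.eat(START)
        if self.peek() not in (INC, DEC):   # cmdLoop needs at least one step
            raise SyntaxError("expected 'increase' or 'decrease'")
        while self.peek() in (INC, DEC):
            self.eat(self.peek())
        self.eat(STOP)

    def ident(self):                        # id: any non-keyword actuator/sensor name
        if self.peek() is None or self.peek() in KEYWORDS:
            raise SyntaxError("missing identifier")
        self.pos += 1

# Example: 'act on hvac start increase increase stop'
GestureParser([ACT, ON, "hvac", START, INC, INC, STOP]).parse()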
Despite the simplicity of the devised language, its
grammar is able to capture an acceptable range of in-
structions given by the user in a natural and unobtru-
sive way; the software running on the motion detec-
tion sensor provides the input symbols which are then
processed by our interpreter and translated into com-
mands/queries.
Such a structured approach also gives us the opportunity to exploit the potentialities of the parser used to process the visual language; in particular, as is customary practice, our interpreter performs the recognition of an input sequence as a word of the language by building an internal data structure that matches the syntax of the sentence recognized so far; moreover, in order to keep the process computationally manageable, a set of the next admissible input symbols is constructed at each step (i.e., for every input symbol).
In our system, we exploit this information in order to tune the fuzzy recognizer tied to the motion sensor by means of a weighting mechanism (Cho and Park, 2000; Alcalá et al., 2003). Namely, the fact that at a given time instant only some of the possible input symbols (i.e., gestures) are expected provides an invaluable feedback, which may be exploited by the fuzzy recognizer to avoid useless computations.
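A possible realization of such a weighting is sketched below: the membership degrees produced by the fuzzy recognizer are simply masked by the set of admissible symbols returned by the parser. Function and argument names are illustrative; the actual mechanism follows the weighted-rule schemes cited above.

def weighted_decision(memberships, admissible):
    """Combine the fuzzy membership degrees (symbol -> degree in [0, 1]) with
    the parser feedback: symbols the grammar cannot accept at this step are
    discarded, and the best admissible symbol is returned (or None)."""
    weighted = {s: d for s, d in memberships.items() if s in admissible}
    if not weighted:
        return None
    best = max(weighted, key=weighted.get)
    return best if weighted[best] > 0.0 else None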
Although the effect of such feedback might appear al-
most negligible when only two different gestures are
considered, the addition of more symbols is straight-
forward in our system, and such a feedback can heav-
ily improve the efficiency of the fuzzy classifier by
preliminarily discarding inadmissible symbols. For
instance, we may conceivably consider some easily
recognizable gestures involving the use of both hands
and their relative position in order to allow the defini-
tion of a more complex grammar.
¹The sets of terminal and non-terminal symbols, and the start symbol, are implicitly defined by the productions themselves.
PECCS2013-InternationalConferenceonPervasiveandEmbeddedComputingandCommunicationSystems
32
4 CASE STUDY
The proposed method is part of a system aiming for
timely and ubiquitous observations of an office en-
vironment, namely a department building, in order
to fulfill constraints deriving both from the specific
user preferences and from considerations on the over-
all energy consumption.
The system reasons on high-level concepts such as “air quality”, “lighting conditions”, and “room occupancy level”, each one referring to physical measurements captured by a physical layer. Since the system must be able to learn the user preferences, ad-hoc sensors for capturing the interaction between users and actuators are needed, similarly to what is described in (Morana et al., 2012). The plan of one office, giving an example of the adopted solutions, is shown in Figure 3. The devices labelled as A and B are used as a first, rough level of control of the user's presence.
The sensing infrastructure is realized by means of
a WSAN, whose nodes (Fig. 3-E) are able to mea-
sure temperature, relative humidity, ambient light ex-
posure and noise level. Sensor nodes can be located
close to several points of interest, e.g., door, window,
user’s desk; moreover, nodes equipped with outdoor
sensors can also be installed on the building facade,
close to the office windows, in order to monitor out-
door temperature, relative humidity, and light expo-
sure. Actuators are able to change the state of the
environment by acting on some measures of interest.
The air-conditioning system (Fig. 3-D), the curtain
and rolling shutter controllers (Fig. 3-F, G), and the
lighting regulator (Fig. 3-I) address this task by mod-
ifying the office temperature and lighting conditions.
Both temperature and light management are funda-
mental for the energy efficiency of the office, but in
order to achieve better results a more accurate anal-
ysis of the power consumption is required. For this
reason, we used an energy monitoring unit for each device (e.g., air-conditioner, lights, PCs) and an energy meter (Fig. 3-C) for the overall monitoring of each office. The users' interaction with the actuators is captured via the Kinect sensor (Fig. 3-H), which is also responsible for detecting the presence of people and counting how many of them are inside the office. An additional contribution to detecting the user's presence is given by video sensors integrated with wireless sensor nodes (Fig. 3-J), which can be used to perceive high-level features such as who is in the office.
The monitoring infrastructure is based on the IRIS
Mote produced by Crossbow, equipped with a number
of sensors (i.e., temperature, humidity, light intensity,
noise level, CO2). The IRIS is a 2.4 GHz Mote mod-
ule designed specifically for deeply embedded sensor networks.

Figure 3: Monitored office.

Other ad-hoc sensors and actuators (e.g.,
curtain reader and controller) can be connected with
the WSAN by means of standard protocols (e.g., Zig-
Bee, EIA RS-485).
The Kinect is connected to a miniature fanless PC (a fit-PC2i), with an Intel Atom Z530 1.6 GHz CPU running Linux Mint, that guarantees real-time processing of the observed scene with minimal obtrusiveness and power consumption (i.e., 6 W).
Several tests have been performed in order to separately evaluate both the fuzzy gesture recognizer and the interpreter. The former has been tested under varying lighting conditions and poses, showing an acceptable level of robustness. This is mainly due to the primary role of the depth information in detecting the hand mask, which compensates for the lower quality of the RGB data. Results showed that about 70% of the gestures (i.e., masks of the hands) are correctly classified when the user acts in a range of 1.5 to 3.5 meters from the Kinect. Performance degrades at greater distances due to the physical limits of the infrared sensor.
The interpreter functionalities have been prelim-
inarily verified by means of a synthetic generator of
gestures allowing for the validation of both the alpha-
bet and the grammar we chose. The overall system
has been tested by conducting a set of experiments
involving 8 different individuals. Each person was
positioned in front of Kinect at a distance within the
sensor range and was asked to interact with the de-
vice by performing a random sequence of 20 gestures
chosen from a predefined set of 10 commands (i.e.,
turn on the light, turn off the light, turn on HVAC,
turn off HVAC, set the temperature, set the airflow
angle, open the door, lock the door, open the curtains,
close the curtains) and 10 queries (i.e., get the instan-
taneous energy consumption, get the monthly energy
consumption, get the list of active appliances, get the
temperature, get the humidity level, get the position
of the sensors, get the position of the actuators, get
the state of the sensors, get the state of the actuators,
get the state of the system).
ImprovingUserExperienceviaMotionSensorsinanAmbientIntelligenceScenario
33
The proposed system was able to correctly clas-
sify the input gestures in 83.75% of the cases, cor-
responding to 134 positives out of 160 inputs. Such
a result shows that, compared to the standalone fuzzy recognizer, representing the set of possible commands
and queries as a language increases the performance
of the system.
5 CONCLUSIONS
In this work we presented a system for the management of an office environment by means of an unobtrusive sensing device, i.e., the Kinect. Such a sensor is equipped with a number of input/output devices that make it possible to sense the users and their interaction with the surrounding environment. We considered a scenario where the whole environment is permeated with small pervasive sensor devices; for this reason, the Kinect is coherently connected to a miniature fanless computer with reduced computation capabilities.
The control of the actuators of the AmI system
(e.g., air-conditioning, curtain and rolling shutter) is
performed by the Kinect by recognizing some simple
gestures (i.e., open/closed hands) opportunely struc-
tured by means of a grammar.
Once the hand of the user has been detected, some
local processing is done using RGB-D data and the
obtained hand region is described according to its
roundness. Such a descriptor is verified by means of
a fuzzy procedure that predicts the state of the hand
with a certain level of accuracy. Each state (i.e., open
or closed) represents a symbol of a grammar that de-
fines the corresponding commands for the actuators.
The construction of a real prototype of the moni-
toring and controlling system allowed for exhaustive
testing of the proposed method. Experimental results
showed that the system is able to perform efficiently
on a miniature computer while maintaining a high
level of accuracy both in terms of image analysis and
gesture recognition.
Although the effectiveness of the system has been
evaluated considering only two different gestures, the
addition of more symbols is straightforward. As fu-
ture work we may conceivably consider some easily
recognizable gestures involving the use of both hands
and their relative position in order to allow the defini-
tion of a more complex grammar.
ACKNOWLEDGEMENTS
This work is supported by the SMARTBUILDINGS
project, funded by POR FESR SICILIA 2007-2013.
REFERENCES
Aho, A., Lam, M., Sethi, R., and Ullman, J. (2007). Compilers: Principles, Techniques, and Tools. Pearson/Addison Wesley.
Alcalá, R., Casillas, J., Cordón, O., and Herrera, F. (2003). Linguistic modeling with weighted double-consequent fuzzy rules based on cooperative coevolutionary learning. Integr. Comput.-Aided Eng., 10(4):343–355.
Borenstein, G. (2012). Making Things See: 3D Vision With
Kinect, Processing, Arduino, and MakerBot. Make:
Books. O’Reilly Media, Incorporated.
Cho, J.-S. and Park, D.-J. (2000). Novel fuzzy logic control
based on weighting of partially inconsistent rules us-
ing neural network. Journal of Intelligent Fuzzy Sys-
tems, 8(2):99–110.
De Paola, A., Cascia, M., Lo Re, G., Morana, M., and Ortolani, M. (2012a). User detection through multi-sensor fusion in an AmI scenario. In Information Fusion (FUSION), 2012 15th International Conference on, pages 2502–2509.
De Paola, A., Gaglio, S., Lo Re, G., and Ortolani, M. (2012b). Sensor9k: A testbed for designing and experimenting with WSN-based ambient intelligence applications. Pervasive and Mobile Computing (Elsevier), 8(3):448–466.
Gulshan, V., Lempitsky, V., and Zisserman, A. (2011). Humanising GrabCut: Learning to segment humans using the Kinect. In Computer Vision Workshops (ICCV Workshops), 2011 IEEE International Conference on, pages 1127–1133.
Kean, S., Hall, J., and Perry, P. (2011). Meet the Kinect: An Introduction to Programming Natural User Interfaces. Apress, Berkeley, CA, USA, 1st edition.
Lai, K., Bo, L., Ren, X., and Fox, D. (2011). Sparse distance learning for object recognition combining RGB and depth information. In Robotics and Automation (ICRA), 2011 IEEE International Conference on, pages 4007–4013.
Morana, M., De Paola, A., Lo Re, G., and Ortolani, M.
(2012). An Intelligent System for Energy Efficiency
in a Complex of Buildings. In Proc. of the 2nd IFIP
Conference on Sustainable Internet and ICT for Sus-
tainability.
Raheja, J., Chaudhary, A., and Singal, K. (2011). Tracking of fingertips and centers of palm using Kinect. In Computational Intelligence, Modelling and Simulation (CIMSiM), 2011 Third International Conference on, pages 248–252.
Xia, L., Chen, C.-C., and Aggarwal, J. (2011). Human detection using depth information by Kinect. In Computer Vision and Pattern Recognition Workshops (CVPRW), 2011 IEEE Computer Society Conference on, pages 15–22.
PECCS2013-InternationalConferenceonPervasiveandEmbeddedComputingandCommunicationSystems
34