is the immediate reward R_a(s, s') received after transitioning to state s' from state s by action a. These
rewards are a combination of two objectives, i.e., an
energy consumption penalty and a reward given by
the user. The latter is a predefined constant for dif-
ferent situations that can occur. For instance, when
the machine is turned off but at the same time a user
wanted coffee, the current policy does not meet that specific user's profile and the policy is manually overruled. For a general audience, however, this is not necessarily a bad policy if the algorithm has deduced that the probability of somebody wanting a beverage was in fact very low and that, from an energy-consumption point of view, it was not worthwhile to keep the device turned on. In such a case, the system is provided with a negative feedback signal indicating the user's inconvenience. On the other hand, when the device is turned on at the same time that a user requests a beverage, the policy actually suits the current user and the system anticipated the expected usage well. In those cases, a positive reward is provided to the system.
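A minimal sketch of this user-feedback signal is given below; the constant values and the function name are hypothetical illustrations, not the predefined constants used in our experiments:

```python
# Hypothetical constants illustrating the user-feedback signal; the actual
# values are predefined per situation and tuned per device.
USER_SATISFIED = 1.0        # device on while a beverage was requested
USER_INCONVENIENCED = -1.0  # device off while a beverage was requested

def user_feedback(device_on: bool, beverage_requested: bool) -> float:
    """Return the user part of the reward for one timeslot."""
    if beverage_requested:
        return USER_SATISFIED if device_on else USER_INCONVENIENCED
    # No request: the user signal is neutral and the energy penalty dominates.
    return 0.0
```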
The former reward signal is a measure indicating the quality of a certain action a in terms of power consumption. These rewards are device-dependent and allow the learning algorithm on top to learn over time whether leaving the device in idle mode reduces energy sufficiently for the current state s of S or whether a shutdown is needed. By specifying a certain cost for cold-starting the device, in accordance with the real-life cost, the algorithm could also learn to power the device on x minutes before a timeslot in which a lot of consumption is expected. In general, the learning algorithm will have to deduce which future timeslots are expected to have a positive difference between the consumption reward signal and the user satisfaction feedback signal. For the moment, these two reward signals are combined by scalarization.
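A minimal sketch of such a scalarization, assuming a simple weighted sum; the weight w is a hypothetical tuning parameter, not a value from our experiments:

```python
def scalarized_reward(energy_wh: float, user_signal: float, w: float = 0.5) -> float:
    """Combine the energy-consumption penalty and the user-feedback signal
    into the single scalar reward consumed by the learning algorithm."""
    energy_penalty = -energy_wh   # higher consumption -> lower reward
    return w * energy_penalty + (1.0 - w) * user_signal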
To conclude, our MDP is graphically represented in Figure 3 and is mathematically formalized as follows: M = <S, A, P, R>, where S = {On, Off, Booting} and A = {Do nothing, Press switch}. The transitions between the different states are deterministic, resulting in a probability function P that is shown in Figure 3. The reward function R is device-specific and we elaborate on this function in the sections below.
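A compact sketch of M is given below, with one plausible reading of the deterministic transitions of Figure 3; how the Booting state resolves, and what a switch press during booting does, are our own assumptions:

```python
# The MDP M = <S, A, P, R> of Figure 3 (reward function R omitted,
# since it is device-specific and defined in the experiments section).
STATES = ("On", "Off", "Booting")
ACTIONS = ("Do nothing", "Press switch")

# Deterministic transition function P: one plausible reading of Figure 3.
TRANSITIONS = {
    ("On", "Do nothing"): "On",
    ("On", "Press switch"): "Off",
    ("Off", "Do nothing"): "Off",
    ("Off", "Press switch"): "Booting",   # boots for x minutes
    ("Booting", "Do nothing"): "On",      # assumption: boot finishes, device is on
    ("Booting", "Press switch"): "Off",   # assumption: abort the boot
}

def transition(state: str, action: str) -> str:
    """Return the successor state s' for taking `action` in `state`."""
    return TRANSITIONS[(state, action)]
```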
4 EXPERIMENTS
At each of the data points, each representing a point in time, the FQI algorithm described in Section 2.1.1 will decide which action to take from the action space given
[Figure 3: A general model for almost every household device. States: On, Off, Booting; transitions: Press switch, Boot for x minutes, Do nothing.]
the current hour, interval of 10 minutes and presence set, with 24, 6 and 2^6 possible values, respectively. These figures result in a large state space of 9,216 possible combinations. In our setting, the FQI algorithm was first trained with the data of one single simulated day and the control policy was tested on one new day after every training step, after which this test sample was also added to the list of training samples to increase the training set's size. Thus, an on-line learning setting was created. In our experiments, we opted for the Tree-Based FQI algorithm with a classification and regression tree (CART) and we averaged our results over 10 individual trials.
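As an illustration of this state representation, the sketch below enumerates the feature space and encodes a single timeslot; the exact encoding (a tuple of six presence booleans next to the hour and interval indices) is our own assumption rather than the representation used in the experiments:

```python
# Hypothetical encoding of the state space described above: hour of day,
# 10-minute interval within the hour, and a presence set over 6 occupants.
HOURS, INTERVALS, OCCUPANTS = 24, 6, 6

def state_space_size() -> int:
    """24 hours x 6 intervals x 2^6 presence sets = 9,216 combinations."""
    return HOURS * INTERVALS * 2 ** OCCUPANTS

def encode_state(hour: int, interval: int, presence: tuple) -> list:
    """Flatten one timeslot into the feature vector fed to Tree-Based FQI."""
    assert len(presence) == OCCUPANTS
    return [hour, interval, *map(int, presence)]

print(state_space_size())                       # 9216
print(encode_state(7, 3, (1, 0, 0, 1, 0, 0)))   # [7, 3, 1, 0, 0, 1, 0, 0]
```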
For the reward signals in our MDP M, we mimicked the properties of a real-life espresso maker in our simulation framework. Using the same appliance
monitoring equipment, we have tried to capture the
real-life power consumption of the device under dif-
ferent circumstances. After measuring the power con-
sumption of the machine for a few weeks, we came to
the following conclusions:
• We noticed that, for our industrial coffee maker,
the start-up time was very fast. In just over one
minute, the device heated the water up to the boil-
ing temperature and the beverage could be served.
The power consumption of actually making coffee
is around 940 Watts per minute.
• When the machine was running in idle mode, the device used only around 2 Watts most of the time. However, every ten minutes, the coffee maker re-heated its water automatically. On average, this results in an energy consumption of 5 Watts per minute in idle mode.
• The device does not consume any power when
turned off.
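Based on these measurements, a simple per-minute consumption model, and the corresponding energy penalty that feeds into the scalarized reward, might be sketched as follows; the function names are our own, and the assumption that the short boot phase draws full heating power is ours as well:

```python
# Measured per-minute power figures for the espresso maker (Watts).
POWER_BREWING = 940.0   # while actually making coffee
POWER_IDLE_AVG = 5.0    # idle average, including the 10-minute re-heat cycle
POWER_OFF = 0.0         # no consumption when the device is switched off

def power_per_minute(state: str, brewing: bool = False) -> float:
    """Average power drawn during one minute in the given device state."""
    if state == "Off":
        return POWER_OFF
    if state == "Booting":
        return POWER_BREWING   # assumption: the heater runs at full power
    return POWER_BREWING if brewing else POWER_IDLE_AVG

def energy_penalty(state: str, minutes: int = 10, brewing: bool = False) -> float:
    """Negative reward proportional to the consumption of one timeslot."""
    return -power_per_minute(state, brewing) * minutes
```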
The reward signals to identify the user’s satisfaction
or inconvenience, when the device was turned on and
off, respectively, can be tuned to obtain schedules for