is the immediate reward R_a(s, s') received after transitioning to state s' from state s by action a. These
rewards are a combination of two objectives, i.e., an
energy consumption penalty and a reward given by
the user. The latter is a predefined constant for dif-
ferent situations that can occur. For instance, when
the machine is turned off but at the same time a user
wanted coffee, the current policy does not meet that specific user's profile and the policy is manually overruled. For a general audience, however, this is not necessarily a bad policy if the algorithm has deduced that the probability of somebody wanting a beverage was in fact very low and that, from an energy-consumption point of view, it was not worthwhile to keep the device turned on. In such a case, the system is provided with a negative feedback signal indicating the user's inconvenience. On the other hand, when the device is turned on at the same time that a user requests a beverage, the policy actually suits the current user and the system anticipated the expected usage well. In those cases, a positive reward is provided to the system.
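A minimal sketch of this user-feedback signal is given below; the constant values and the function name are hypothetical illustrations, not the predefined constants used in our experiments:

```python
# Hypothetical constants illustrating the user-feedback signal; the actual
# values are predefined per situation and tuned per device.
USER_SATISFIED = 1.0        # device on while a beverage was requested
USER_INCONVENIENCED = -1.0  # device off while a beverage was requested

def user_feedback(device_on: bool, beverage_requested: bool) -> float:
    """Return the user part of the reward for one timeslot."""
    if beverage_requested:
        return USER_SATISFIED if device_on else USER_INCONVENIENCED
    # No request: the user signal is neutral and the energy penalty dominates.
    return 0.0
```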
The former reward signal is a measure indicating the quality of a certain action a in terms of power consumption. These rewards are device-dependent and allow the learning algorithm on top to learn over time whether leaving the device in idle mode reduces energy sufficiently for the current state s of S or whether a shutdown is needed. By specifying a certain cost for cold-starting the device, in accordance with the real-life cost, the algorithm could also learn to power the device on x minutes before a timeslot in which a lot of consumption is expected. In general, the learning algorithm will have to deduce which future timeslots are expected to have a positive difference between the consumption reward signal and the user satisfaction feedback signal. For the moment, these two reward signals are combined by scalarization.
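A minimal sketch of such a scalarization, assuming a simple weighted sum; the weight w is a hypothetical tuning parameter, not a value from our experiments:

```python
def scalarized_reward(energy_wh: float, user_signal: float, w: float = 0.5) -> float:
    """Combine the energy-consumption penalty and the user-feedback signal
    into the single scalar reward consumed by the learning algorithm."""
    energy_penalty = -energy_wh   # higher consumption -> lower reward
    return w * energy_penalty + (1.0 - w) * user_signal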
To conclude, our MDP is graphically represented in Figure 3 and is mathematically formalized as follows: M = <S, A, P, R>, where S = {On, Off, Booting} and A = {Do nothing, Press switch}. The transitions between the different states are deterministic, resulting in a probability function P that is shown in Figure 3. The reward function R is device-specific and we elaborate on this function in the sections below.
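A compact sketch of M is given below, with one plausible reading of the deterministic transitions of Figure 3; how the Booting state resolves, and what a switch press during booting does, are our own assumptions:

```python
# The MDP M = <S, A, P, R> of Figure 3 (reward function R omitted,
# since it is device-specific and defined in the experiments section).
STATES = ("On", "Off", "Booting")
ACTIONS = ("Do nothing", "Press switch")

# Deterministic transition function P: one plausible reading of Figure 3.
TRANSITIONS = {
    ("On", "Do nothing"): "On",
    ("On", "Press switch"): "Off",
    ("Off", "Do nothing"): "Off",
    ("Off", "Press switch"): "Booting",   # boots for x minutes
    ("Booting", "Do nothing"): "On",      # assumption: boot finishes, device is on
    ("Booting", "Press switch"): "Off",   # assumption: abort the boot
}

def transition(state: str, action: str) -> str:
    """Return the successor state s' for taking `action` in `state`."""
    return TRANSITIONS[(state, action)]
```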
4 EXPERIMENTS
At each of the data points, each representing a point in time, the FQI algorithm described in Section 2.1.1 will decide which action to take from the action space given
[Figure 3: A general model for almost every household device. States: On, Off, Booting; transitions: Press switch, Boot for x minutes, Do nothing.]
the current hour, interval of 10 minutes and presence set, with 24, 6 and 2^6 possible values, respectively. These figures result in a large state space of 9,216 possible combinations. In our setting, the FQI algorithm was first trained with the data of one single simulated day and the control policy was tested on one new day after every training step, after which this test sample was also added to the list of training samples to increase the training set's size. Thus, an on-line learning setting was created. In our experiments, we opted for the Tree-Based FQI algorithm with a classification and regression tree (CART) and we averaged our results over 10 individual trials.
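As an illustration of this state representation, the sketch below enumerates the feature space and encodes a single timeslot; the exact encoding (a tuple of six presence booleans next to the hour and interval indices) is our own assumption rather than the representation used in the experiments:

```python
# Hypothetical encoding of the state space described above: hour of day,
# 10-minute interval within the hour, and a presence set over 6 occupants.
HOURS, INTERVALS, OCCUPANTS = 24, 6, 6

def state_space_size() -> int:
    """24 hours x 6 intervals x 2^6 presence sets = 9,216 combinations."""
    return HOURS * INTERVALS * 2 ** OCCUPANTS

def encode_state(hour: int, interval: int, presence: tuple) -> list:
    """Flatten one timeslot into the feature vector fed to Tree-Based FQI."""
    assert len(presence) == OCCUPANTS
    return [hour, interval, *map(int, presence)]

print(state_space_size())                       # 9216
print(encode_state(7, 3, (1, 0, 0, 1, 0, 0)))   # [7, 3, 1, 0, 0, 1, 0, 0]
```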
For the reward signals in our MDP M, we mimicked the properties of a real-life espresso maker in our simulation framework. Using the same appliance
monitoring equipment, we have tried to capture the
real-life power consumption of the device under dif-
ferent circumstances. After measuring the power con-
sumption of the machine for a few weeks, we came to
the following conclusions:
• We noticed that, for our industrial coffee maker,
the start-up time was very fast. In just over one
minute, the device heated the water up to the boil-
ing temperature and the beverage could be served.
The power consumption of actually making coffee
is around 940 Watts per minute.
• When the machine was running in idle mode, the device used only around 2 Watts most of the time. However, every ten minutes, the coffee maker re-heated its water automatically. On average, this results in an energy consumption of 5 Watts per minute in idle mode.
• The device does not consume any power when
turned off.
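Based on these measurements, a simple per-minute consumption model, and the corresponding energy penalty that feeds into the scalarized reward, might be sketched as follows; the function names are our own, and the assumption that the short boot phase draws full heating power is ours as well:

```python
# Measured per-minute power figures for the espresso maker (Watts).
POWER_BREWING = 940.0   # while actually making coffee
POWER_IDLE_AVG = 5.0    # idle average, including the 10-minute re-heat cycle
POWER_OFF = 0.0         # no consumption when the device is switched off

def power_per_minute(state: str, brewing: bool = False) -> float:
    """Average power drawn during one minute in the given device state."""
    if state == "Off":
        return POWER_OFF
    if state == "Booting":
        return POWER_BREWING   # assumption: the heater runs at full power
    return POWER_BREWING if brewing else POWER_IDLE_AVG

def energy_penalty(state: str, minutes: int = 10, brewing: bool = False) -> float:
    """Negative reward proportional to the consumption of one timeslot."""
    return -power_per_minute(state, brewing) * minutes
```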
The reward signals to identify the user’s satisfaction
or inconvenience, when the device was turned on and
off, respectively, can be tuned to obtain schedules for