tation, respectively. Section 5 shows the experimentation of the
proposals over a set of benchmark functions and a real energy
consumption dataset. Finally, conclusions and future research are
discussed in Section 6.
2 FUNDAMENTALS OF SYMBOLIC REGRESSION
Regression analysis (M., 2007) is one of the basic tools of scientific
research. It is used to fit a functional model that represents a
relationship between independent and dependent variables.
Traditionally, this kind of problem has been solved with algebraic
methods, where the researcher provides a hypothesis about a functional
model with a set of parameters, and the goal is to optimize these
parameters for the studied dataset. Equation 1 shows the parametric
model of regression analysis, where $\bar{x} = (x_1, x_2, \ldots, x_n)$
stands for the set of independent variables of the data, $f$ is the
functional model hypothesis, $\bar{w} = (w_1, w_2, \ldots, w_m)$ are
the parameters of the model, and $\bar{y} = (y_1, y_2, \ldots, y_l)$
are the dependent variables of the problem. In these cases, since
$\bar{x}$ and $\bar{y}$ are the problem data, and $f$ is a function
established as model hypothesis, the regression problem is solved by
finding the best values for the parameters $\bar{w}$, which are not
known in advance. As an example, the simplest case in regression
analysis is the well-known linear regression, whose functional model
can be written as $y = f(\langle w_1, w_2 \rangle, x) = w_1 \cdot x + w_2$.

$$y = f(\bar{w}, \bar{x}) \tag{1}$$
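As an illustration of the parametric setting, the linear case can be sketched in a few lines of Python. This is a minimal sketch and not part of the original work: it assumes that the parameters are chosen by ordinary least squares (the text only says they are optimized), and the name `fit_linear` and the sample data are hypothetical.

```python
def fit_linear(xs, ys):
    """Least-squares estimate of (w1, w2) for the model y = w1 * x + w2.

    Assumes at least two distinct x values, so that sxx is non-zero.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of squared deviations of x, and of co-deviations of x and y.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    w1 = sxy / sxx
    w2 = mean_y - w1 * mean_x
    return w1, w2


# Hypothetical data generated from y = 2x + 1 (no noise).
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [2.0 * x + 1.0 for x in xs]
w1, w2 = fit_linear(xs, ys)
print(w1, w2)  # prints "2.0 1.0"
```

Here the structure of the model (a straight line) is fixed by the researcher in advance, and only the two parameters are estimated from the data.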
A limitation of classical regression arises when the data properties
are unknown in advance: it is difficult to find a pattern that explains
the dataset, and it is therefore hard to establish a suitable
hypothesis for the function $f$. This problem is even less tractable in
the multivariate case, where graphical analyses lack enough
expressiveness to show relations between multidimensional data.
To overcome these limitations, symbolic regression attempts to
generalize the traditional problem of regression analysis by assuming
that $f$ is unknown, and by developing techniques targeted at finding a
suitable model $\tilde{f}$ and parameters $\bar{w}$ that minimize an
error expression such as $\|\bar{y} - \tilde{f}(\bar{w}, \bar{x})\|$.
In symbolic regression it is assumed that $\bar{y}$ and $\bar{x}$ are
the only components known in advance. Therefore, the goal of symbolic
regression is to find an algebraic expression that models the behaviour
of the dependent data as a function of the independent data. Techniques
like genetic programming (Langdon, 1998) have been developed to solve
this problem.
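The difference from the parametric case can be sketched as a search over model structures as well as parameters. The Python fragment below is only an illustrative assumption: it enumerates a tiny, hypothetical set of candidate structures of the form w * g(x), fits w by least squares for each, and keeps the structure with the smallest Euclidean error norm. Real symbolic regression methods, such as genetic programming, explore a far larger expression space than this exhaustive toy search.

```python
import math

# Hypothetical candidate structures, each of the form f(w, x) = w * g(x).
CANDIDATES = {
    "w * x": lambda x: x,
    "w * x^2": lambda x: x * x,
    "w * sin(x)": math.sin,
    "w * exp(x)": math.exp,
}


def fit_scale(g, xs, ys):
    """Return the least-squares w for y ~ w * g(x) and the norm ||y - w * g(x)||."""
    gs = [g(x) for x in xs]
    denom = sum(v * v for v in gs)
    w = sum(v * y for v, y in zip(gs, ys)) / denom if denom else 0.0
    err = math.sqrt(sum((y - w * v) ** 2 for v, y in zip(gs, ys)))
    return w, err


def best_model(xs, ys):
    """Fit every candidate structure and return the one with the lowest error."""
    results = {name: fit_scale(g, xs, ys) for name, g in CANDIDATES.items()}
    name = min(results, key=lambda k: results[k][1])
    w, err = results[name]
    return name, w, err


# Hypothetical data generated from y = 3 * x^2 (no noise).
xs = [0.5, 1.0, 1.5, 2.0, 2.5]
ys = [3.0 * x * x for x in xs]
name, w, err = best_model(xs, ys)
print(name, w, err)
```

Only the data are given to `best_model`; both the selected structure and its parameter are outputs of the search.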
Symbolic regression techniques have been applied in a wide variety of
real applications. Besides being used to solve mathematical
optimization problems, they have found practical application in
decision making in economics, chemical process optimization, etc. For
instance, (Duffy and Engle-Warnick, 2002) describes how symbolic
regression can be used to uncover simple data-generating functions that
have the flavor of strategic rules in economic decisions. In (McKay et
al., 1997), symbolic regression is used to model chemical process
systems, addressing problems involving a vacuum distillation column and
a chemical reactor system. In (Schmidt and Lipson, 2010), the authors
explore the use of symbolic regression to perform unsupervised learning
by searching for implicit relationships; specifically, they present a
successful method based on implicit derivatives. In (Davidson et al.,
2001), the authors use symbolic regression in two real-world problems:
approximating the Colebrook-White equation and rainfall-runoff
modelling.
In this work, we test the feasibility of the use of
symbolic regression to model energy consumption in
a set of public buildings, under the hypothesis that the
resulting models obtained from the symbolic regres-
sion approach will be useful for high-level decision
making processes regarding energy efficiency. The
following subsection describes in depth the basis of
our approach, which is based on genetic programming.
2.1 Introduction to Genetic Programming
Genetic programming (Langdon, 1998) can be seen as a supervised
learning method based on biological evolution. The fundamentals of
genetic programming are inspired by genetic algorithms, and it has been
used in previous works to solve problems such as symbolic regression
(Alonso et al., 2009), digital signal processing (Alcazar and Sharman,
1996), solving differential equations (Tsoulos and Lagaris, 2006),
evolving robotic behaviours (Lazarus and Hu, 2001), grammatical
inference (Lankhorst, 1994), automatic program generation (Koza, 1994),
etc. If we focus on the problem of symbolic regression, the goal of
genetic programming is to evolve a set of algebraic expressions encoded
as chromosomes, according to the Darwinian evolution principles of
genetic algorithms. The fitness measure is the minimization of an error
function that measures how well each expression explains the behaviour
of the dependent variables with respect to the independent variables in
a specific dataset.
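The evolutionary loop described above can be sketched as follows. This is a minimal sketch under stated assumptions, not the method used in this work: the tree encoding as nested tuples, the function and terminal sets, tournament selection, subtree crossover, elitism, the node limit, and all numeric settings are hypothetical choices made only for illustration.

```python
import math
import operator
import random

# Illustrative function set (binary operators) and terminal set.
FUNCTIONS = {"+": operator.add, "-": operator.sub, "*": operator.mul}
TERMINALS = ["x", 1.0, 2.0]
MAX_NODES = 50  # assumed size limit that keeps expressions from bloating


def random_tree(rng, depth):
    """Grow a random expression: a terminal or a tuple (op, left, right)."""
    if depth == 0 or rng.random() < 0.3:
        return rng.choice(TERMINALS)
    op = rng.choice(list(FUNCTIONS))
    return (op, random_tree(rng, depth - 1), random_tree(rng, depth - 1))


def evaluate(tree, x):
    """Evaluate an expression tree at the point x."""
    if isinstance(tree, tuple):
        op, left, right = tree
        return FUNCTIONS[op](evaluate(left, x), evaluate(right, x))
    return x if tree == "x" else tree


def fitness(tree, xs, ys):
    """Sum of squared errors over the dataset; lower is better."""
    total = 0.0
    for x, y in zip(xs, ys):
        d = evaluate(tree, x) - y
        total += d * d
    return total if math.isfinite(total) else math.inf


def subtrees(tree, path=()):
    """Yield (path, node) for every node; a path is a tuple of child indices."""
    yield path, tree
    if isinstance(tree, tuple):
        yield from subtrees(tree[1], path + (1,))
        yield from subtrees(tree[2], path + (2,))


def replace(tree, path, new):
    """Return a copy of tree with the node at path replaced by new."""
    if not path:
        return new
    parts = list(tree)
    parts[path[0]] = replace(tree[path[0]], path[1:], new)
    return tuple(parts)


def crossover(rng, a, b):
    """Replace a random subtree of a with a random subtree of b."""
    path, _ = rng.choice(list(subtrees(a)))
    _, donor = rng.choice(list(subtrees(b)))
    return replace(a, path, donor)


def mutate(rng, tree):
    """Replace a random subtree with a freshly grown one."""
    path, _ = rng.choice(list(subtrees(tree)))
    return replace(tree, path, random_tree(rng, 2))


def tournament(rng, population, scores, k=3):
    """Pick k individuals at random and return the fittest of them."""
    picks = rng.sample(range(len(population)), k)
    return population[min(picks, key=lambda i: scores[i])]


def evolve(xs, ys, pop_size=60, generations=30, seed=0):
    """Return (best tree, its error, best error recorded per generation)."""
    rng = random.Random(seed)
    population = [random_tree(rng, 3) for _ in range(pop_size)]
    history = []
    for _ in range(generations):
        scores = [fitness(t, xs, ys) for t in population]
        best = min(range(pop_size), key=lambda i: scores[i])
        history.append(scores[best])
        children = [population[best]]  # elitism: the best individual survives
        while len(children) < pop_size:
            parent = tournament(rng, population, scores)
            child = crossover(rng, parent, tournament(rng, population, scores))
            if rng.random() < 0.2:
                child = mutate(rng, child)
            if sum(1 for _ in subtrees(child)) > MAX_NODES:
                child = parent  # reject oversized offspring
            children.append(child)
        population = children
    scores = [fitness(t, xs, ys) for t in population]
    best = min(range(pop_size), key=lambda i: scores[i])
    return population[best], scores[best], history


# Hypothetical dataset sampled from y = x^2 + x.
xs = [i / 2.0 for i in range(-4, 5)]
ys = [x * x + x for x in xs]
best, score, history = evolve(xs, ys)
print(best, score)
```

Each chromosome is an algebraic expression, and the error on the dataset plays the role of the fitness measure. Because the best individual is copied unchanged into the next generation, the best error recorded in `history` never increases, although nothing in this sketch guarantees that the generating expression is recovered.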