consider the crop characteristics and soil factors. The
FAO Penman-Monteith method (Allen et al., 1998) is recommended as
the sole method for determining ET0. However, its use is complex
and requires that all climate variables be present. For this
reason, several procedures have been developed for estimating
missing climatic parameters.
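
To make the data requirement concrete, the daily FAO-56 Penman-Monteith equation of Allen et al. (1998) can be sketched as follows. This is a minimal illustration and not the implementation used in this paper: the function and parameter names are ours, net radiation, soil heat flux and the vapour pressures are assumed to be already computed, and the default atmospheric pressure is an assumed sea-level value.

```python
import math


def saturation_vapour_pressure(t_c):
    """Saturation vapour pressure e0(T) in kPa for air temperature in deg C."""
    return 0.6108 * math.exp(17.27 * t_c / (t_c + 237.3))


def slope_vapour_pressure_curve(t_c):
    """Slope of the saturation vapour pressure curve (kPa per deg C)."""
    return 4098.0 * saturation_vapour_pressure(t_c) / (t_c + 237.3) ** 2


def fao56_et0(t_mean, u2, rn, g, es, ea, pressure_kpa=101.3):
    """Daily reference evapotranspiration ET0 in mm/day (FAO-56 Penman-Monteith).

    t_mean: mean air temperature at 2 m (deg C)
    u2: wind speed at 2 m (m/s)
    rn, g: net radiation and soil heat flux (MJ m-2 day-1)
    es, ea: saturation and actual vapour pressure (kPa)
    pressure_kpa: atmospheric pressure (kPa); the default is an assumption
    """
    delta = slope_vapour_pressure_curve(t_mean)
    gamma = 0.665e-3 * pressure_kpa  # psychrometric constant (kPa per deg C)
    numerator = (0.408 * delta * (rn - g)
                 + gamma * 900.0 / (t_mean + 273.0) * u2 * (es - ea))
    denominator = delta + gamma * (1.0 + 0.34 * u2)
    return numerator / denominator
```

Every argument has to be supplied, so a single failed sensor is enough to make the equation inapplicable without first estimating the missing variable.
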
Climatic data can also be analyzed with data mining techniques.
Data mining refers to the application of techniques and algorithms
that recognize patterns and models in data and thereby generate
knowledge. Several papers analyze meteorological and climatic
data, such as (Xavier et al., 2016), (Hendrawan and Murase, 2011),
(Rahimikhoob, 2014) and (Sawalkar and Dixit, 2015).
We also aim to analyze such data using data mining. In this paper,
we address the following research question: Is it possible to
estimate reference evapotranspiration without loss of accuracy
regardless of whether all variables are available? To answer it,
we use a dataset of historical series generated by a weather
station at UFC Quixadá, Ceará, Brazil. The prediction models were
created using the data mining technique M5’ proposed in
(Wang and Witten, 1996). M5’ created more than one function to
calculate the reference evapotranspiration, and these functions
refer specifically to the crops present in the environments where
the climatic data were collected.
Another technique for generating prediction models is regression,
which learns a function that maps a data item to a real-valued
prediction variable (Fayyad et al., 1996). In this work, we also
apply linear regression models to estimate the reference
evapotranspiration from climatic data.
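
As an illustration of regression in this sense, the sketch below fits a linear function by ordinary least squares with NumPy. The data are synthetic and generated only for this example; the attribute names and the coefficients of the generating relation are assumptions and are not taken from the weather-station dataset or from our results.

```python
import numpy as np

# Synthetic data for illustration only (NOT the weather-station dataset).
rng = np.random.default_rng(0)
n = 200
temperature = rng.uniform(20.0, 35.0, n)  # deg C
humidity = rng.uniform(30.0, 90.0, n)     # %
wind_speed = rng.uniform(0.5, 5.0, n)     # m/s

# Hypothetical relation used only to generate a target column.
et0 = 0.15 * temperature - 0.02 * humidity + 0.3 * wind_speed + 1.0
et0 = et0 + rng.normal(0.0, 0.05, n)

# Design matrix with a leading column of ones for the intercept.
X = np.column_stack([np.ones(n), temperature, humidity, wind_speed])

# Ordinary least squares: coefficients minimising ||X w - et0||^2.
coef, *_ = np.linalg.lstsq(X, et0, rcond=None)


def predict(t, h, w):
    """Estimate ET0 from the fitted linear function."""
    return coef[0] + coef[1] * t + coef[2] * h + coef[3] * w
```

The fitted function maps each data item (one row of attribute values) to a single real-valued estimate, which is the sense of regression used above.
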
However, the data collected from weather stations can be
inaccurate or missing for several reasons, such as sensor failure,
calibration problems, wireless transmission loss or environmental
noise. Values may also be missing because of data storage problems
or datalogger power failures. To overcome these problems, feature
selection techniques can help to handle fluctuating, inaccurate or
imprecise sensor readings properly and so prevent wrong decisions
from being made.
Related papers usually propose models for calculating the
reference evapotranspiration that do not use all attributes of the
climatic data. Moreover, as climatic data are multidimensional,
reducing the number of attributes so that irrelevant, redundant or
non-significant data are removed (Liu and Yu, 2005) also saves
computation time in the analysis of these data.
Data pre-processing is a significant step in the knowledge
discovery process, since quality decisions must be based on
quality data. Feature selection is a data reduction technique
whose goal is to find a minimum set of attributes such that the
resulting probability distribution of the data classes is as close
as possible to the original distribution obtained using all
attributes (Karegowda et al., 2010).
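
One simple way to reduce the attribute set is a filter that ranks attributes by their absolute Pearson correlation with the target and keeps those above a cut-off. The sketch below shows this idea only as an example of feature selection; it is not necessarily the selection method applied in this work, and the threshold, attribute names and toy values are assumptions.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0.0 or var_y == 0.0:
        return 0.0  # a constant column carries no information
    return cov / math.sqrt(var_x * var_y)


def select_features(columns, target, threshold=0.3):
    """Keep attributes whose |correlation| with the target reaches threshold.

    columns: dict mapping attribute name -> list of values
    threshold: hypothetical cut-off, chosen here only for illustration
    """
    scores = {name: abs(pearson(values, target))
              for name, values in columns.items()}
    kept = [name for name, score in scores.items() if score >= threshold]
    return sorted(kept, key=lambda name: scores[name], reverse=True)


# Toy data (not from the weather station): one informative attribute,
# one constant attribute and one attribute unrelated to the target.
target = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
columns = {
    "temperature": [10.0, 12.1, 13.9, 16.2, 18.0, 20.1],
    "battery_voltage": [12.0, 12.0, 12.0, 12.0, 12.0, 12.0],
    "noise": [2.0, 1.0, 3.0, 3.0, 1.0, 2.0],
}
selected = select_features(columns, target)
```

A filter of this kind scores each attribute independently, so it is cheap to compute but cannot detect attributes that are redundant with one another.
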
Accordingly, we propose to apply feature selection before
generating the prediction model for reference evapotranspiration
(ET0). In this paper, the model that predicts the reference
evapotranspiration was generated from the selected attributes. Our
solution generated two models, one by applying the M5’ algorithm
and the other by applying linear regression. We performed a series
of experiments to compare the models generated from the original
data (without feature selection) with those generated after
feature selection, in order to discover which approach yields the
more accurate model.
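
The comparison described above can be sketched as follows: the same learner is trained once on all attributes and once on a selected subset, and both models are scored on held-out data. The example uses linear regression on synthetic data; the split, the error measure (RMSE) and the selected attribute indices are assumptions made for illustration and do not reproduce our experimental setup.

```python
import numpy as np


def fit_linear(X, y):
    """Least-squares coefficients for y ~ [1, X]."""
    A = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef


def rmse(coef, X, y):
    """Root mean squared error of the linear model on (X, y)."""
    A = np.column_stack([np.ones(len(X)), X])
    return float(np.sqrt(np.mean((A @ coef - y) ** 2)))


# Synthetic data for illustration only: two informative attributes
# followed by three irrelevant ones.
rng = np.random.default_rng(1)
X = rng.normal(size=(300, 5))
y = 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(0.0, 0.1, 300)

train, test = slice(0, 200), slice(200, 300)
selected = [0, 1]  # assumed output of a feature-selection step

rmse_all = rmse(fit_linear(X[train], y[train]), X[test], y[test])
rmse_sel = rmse(fit_linear(X[train][:, selected], y[train]),
                X[test][:, selected], y[test])
print(f"all attributes: {rmse_all:.3f}  selected: {rmse_sel:.3f}")
```

If the two errors are close, the reduced model is preferable because it needs fewer sensor readings to produce an estimate.
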
The remainder of this paper is structured as follows. Section 2
reports our solution and Section 3 presents the experiments
performed. Section 4 presents the related work. Finally, in
Section 5, we draw conclusions and propose future work.
2 METHODOLOGY
We aim to discover a model that predicts the ET0 value from the
data collected by a weather station. To achieve this goal, we used
an adaptation of the KDD (Knowledge Discovery in Databases)
process described in (Fayyad et al., 1996) to drive our
methodology. The following subsections correspond to the steps of
the process and describe how they were executed.
2.1 Data Collection
The first stage consists of collecting the climatic data generated
by the weather station. These data describe the climatic
conditions monitored by the station from 16 June to 19 October
2016 in the city of Quixadá, Ceará, Brazil. The data collection
was performed through a serial connection with the data logger of
the station, provided by the PC200W software (pc2, 2016). The
dataset was stored in CSV files.
The original dataset contains 3191 tuples of numeric type, has no
missing values, and is composed of the attributes described in
Table 1.
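
A check of this kind (number of tuples, numeric type, absence of missing values) can be sketched with the Python standard library. The column names, timestamps and values below are invented stand-ins, since the real attribute names are those of Table 1 and the station files are not reproduced here.

```python
import csv
import io

# Stand-in for a station export; the column names and values are made up.
sample_csv = io.StringIO(
    "timestamp,air_temp,rel_humidity,wind_speed\n"
    "2016-06-16 00:00,24.1,78.0,1.2\n"
    "2016-06-16 01:00,23.8,,1.0\n"
    "2016-06-16 02:00,23.5,80.5,0.9\n"
)


def load_numeric_rows(handle, numeric_columns):
    """Return (rows, missing); missing counts empty or non-numeric cells."""
    rows, missing = [], 0
    for record in csv.DictReader(handle):
        row = {}
        for name in numeric_columns:
            try:
                row[name] = float(record[name])
            except (TypeError, ValueError):
                row[name] = None  # mark the cell as missing
                missing += 1
        rows.append(row)
    return rows, missing


rows, missing = load_numeric_rows(
    sample_csv, ["air_temp", "rel_humidity", "wind_speed"])
```

For a file with no gaps, `missing` is zero and `len(rows)` gives the number of tuples.
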