used to model the dataset and state the assumption along with its justification.
The problem of regression is to estimate the value of a dependent variable (known as the response variable) based on the values of one or more independent variables (known as feature variables). We model the data as tuples {X, y}, where X is an ordered set of variables (properties) {x_1, x_2, ..., x_n} and y is the variable to be predicted. Here each x_i is a feature variable.
Formally, the problem has the following inputs:
• An ordered set of feature variables X, i.e. {x_1, x_2, ..., x_n}.
• A set of tuples called the training dataset, D = {(X_1, y_1), (X_2, y_2), ..., (X_m, y_m)}.
The output is an estimated value of y for the given X.
Mathematically, this can be represented as

y = f(X, D, parameters),     (1)

where parameters are the arguments that the function f() takes. These are generally set by the user and tuned by trial and error.
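To make equation (1) concrete, the sketch below shows the shape of the problem in Python: a model is built from the training dataset D and user-set parameters, and it then returns an estimate of y for a given X. The names (fit_regressor, estimate) and the placeholder model are hypothetical and purely illustrative, not the method of this paper.

```python
# Illustrative sketch of equation (1): build an estimator f from the training
# dataset D = {(X_i, y_i)} and user-set parameters, then estimate y for a new X.
from typing import Callable, List, Sequence, Tuple

TrainingTuple = Tuple[Sequence[float], float]      # one training tuple (X_i, y_i)

def fit_regressor(D: List[TrainingTuple], **parameters) -> Callable[[Sequence[float]], float]:
    """Return an estimator f(X) learned from D; `parameters` are user-set."""
    # Placeholder model: predict the mean response, ignoring X entirely.
    mean_y = sum(y for _, y in D) / len(D)
    return lambda X: mean_y

D = [([1.0, 2.0], 3.0), ([2.0, 1.0], 4.0)]         # toy training dataset
f = fit_regressor(D)                                # parameters left at defaults
print(f([1.5, 1.5]))                                # estimated y for the given X
```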
We assume that the dependent variable depends only on the independent variables and on nothing else. This is the sole assumption we make. If it is not satisfied, accurate estimates cannot be obtained even with the best regression algorithms available. In the next section we discuss related work on this topic.
3 RELATED WORK
Before presenting our algorithms, we briefly review related work done in the recent past.
The most common statistical regression approach is linear regression (Seber and Lee, 1999), which assumes that the entire dataset follows a linear relationship between the response variable and the feature variables. Linear regression does not perform well if the relationship between the variables is not linear.
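As an illustration only, the following NumPy sketch fits an ordinary-least-squares line to an invented toy dataset; it simply makes the linearity assumption concrete and is not part of the compared methods.

```python
# Minimal ordinary-least-squares fit: y is modeled as a linear function of X.
import numpy as np

X = np.array([[1.0], [2.0], [3.0], [4.0]])          # feature matrix (m x n)
y = np.array([2.1, 3.9, 6.2, 7.8])                  # response vector

A = np.hstack([X, np.ones((X.shape[0], 1))])        # append an intercept column
coef, *_ = np.linalg.lstsq(A, y, rcond=None)        # least-squares solution

x_new = np.array([5.0, 1.0])                        # new point plus intercept term
print(x_new @ coef)                                 # predicted y, about 9.85
```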
One of the most widely used regression algorithms is KNN, i.e. K-Nearest Neighbors. In this algorithm the distance of the test tuple from every training tuple is computed, and the output is derived from the k nearest tuples. It is resistant to outliers. There is no pre-processing; every computation is performed at run time. Hence it has high computational complexity and performs poorly as far as efficiency is concerned. Moreover, it requires the parameter k to be tuned, which asks the user to have some understanding of the domain.
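A minimal from-scratch sketch of KNN regression, assuming Euclidean distance and a toy dataset, makes the run-time cost visible: every prediction scans the entire training set, and the value of k must be supplied by the user.

```python
# From-scratch k-nearest-neighbor regression as described above.
import math

def knn_predict(D, x_query, k=3):
    """D is a list of (X, y) tuples; return the mean y of the k nearest X."""
    nearest = sorted(D, key=lambda t: math.dist(t[0], x_query))[:k]  # full scan
    return sum(y for _, y in nearest) / len(nearest)

D = [([0.0], 1.0), ([1.0], 2.0), ([2.0], 3.0), ([10.0], 50.0)]
print(knn_predict(D, [1.5], k=2))                   # averages the two closest: 2.5
```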
Another class of regression algorithms is Support Vector Machines, SVM (Smola and Scholkopf, 1998). Support Vector Machines are a very specific class of algorithms, characterized by the use of kernels, the absence of local minima, the sparseness of the solution, and capacity control obtained by acting on the margin or on the number of support vectors. SVMs try to linearly separate the dataset and use this separation for prediction. SVMs suffer from several disadvantages, such as the choice of kernel, handling of discrete data, construction of multi-class classifiers, selection of kernel function parameters, high algorithmic complexity, and extensive memory requirements.
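For illustration, a support-vector regressor can be obtained from an off-the-shelf library such as scikit-learn (an assumption made here for the example, not something used in this paper); the kernel and its parameters must still be chosen by the user, which is one of the difficulties noted above.

```python
# Support-vector regression on a toy dataset; kernel and parameters are user-set.
from sklearn.svm import SVR

X = [[0.0], [1.0], [2.0], [3.0]]                    # training features
y = [0.1, 1.2, 1.9, 3.1]                            # training responses

model = SVR(kernel="rbf", C=1.0, epsilon=0.1)       # kernel choice is up to the user
model.fit(X, y)
print(model.predict([[1.5]]))                       # estimated response
```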
Another popular class of regression algorithms is decision trees. These algorithms build a tree, which is later used for decision making. They generally do not require any data cleaning. Since the problem of constructing an optimal decision tree is NP-complete, heuristic algorithms that make locally optimal decisions are used. One of the major problems with decision trees is that the tree can become too big and complex.
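As a brief illustration (again assuming scikit-learn is available), a regression tree is grown greedily, and its size can be limited, for example with a depth cap, to counter the tendency of the tree to become too large.

```python
# Greedy (locally optimal) regression tree; max_depth limits tree complexity.
from sklearn.tree import DecisionTreeRegressor

X = [[0.0], [1.0], [2.0], [3.0]]
y = [0.0, 1.0, 4.0, 9.0]

tree = DecisionTreeRegressor(max_depth=2)           # cap depth to keep the tree small
tree.fit(X, y)
print(tree.predict([[2.5]]))                        # piecewise-constant estimate
```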
Neural networks (Haykin, 2009) are another class of data mining approaches that have been applied to regression. However, neural networks are complex closed-box models, and hence an in-depth analysis of the results obtained is not possible. Data mining applications typically demand an open-box model in which the prediction can be explained to the user, since it is to be used for decision support.
One of the latest works in the field of regression is PAGER (Desai et al., 2010). Being derived from nearest-neighbor methods, PAGER is simple and also resilient to outliers. It assigns weights to the feature variables based on how much they influence the value of the response variable. But as it is derived from KNN, it also suffers from poor run-time performance.
Our work shares a resemblance with segmented or piecewise regression (Ritzema, 1994). However, upon analysis, the techniques are entirely different. In segmented regression the independent variables are partitioned into segments. In our method, the response variable is partitioned into two groups to facilitate a binary-search-based methodology.
Our work also seems to share a resemblance with binary logistic regression (Hilbe and Joseph, 2009). However, the technique is again entirely different: in binary logistic regression the response variable is assumed to follow a binomial logit model, and the parameters of this model are learned from the training data.
4 BINGR
In this section we present the BINGR algorithm, followed by experimental results in the next section. The algorithm is straightforward and follows a divide-and-conquer style of approach.