(SVM) is a new data mining paradigm that has been applied to regression. However, these techniques rest on complex abstract mathematics, which makes them difficult to implement, maintain, embed, and modify as the situation demands. Neural networks are another class of data mining approaches that have been used for regression and for dimensionality reduction, as in Self-Organizing Maps (S. Haykin). However, neural networks are complex, and hence an in-depth analysis of the results they produce is not possible. Ensemble-based learning (L. Breiman, 1996; R. E. Schapire, 1999) is a new approach to regression. A major problem associated with ensemble-based learning is determining the relative importance of each individual learner.
Decision Trees. One of the first data mining approaches to regression was the regression tree (L. Breiman, J. Friedman, R. Olshen, and C. Stone, 1984), a variation of the decision tree in which the predicted output values are stored at the leaf nodes. Since a tree has finitely many leaves, the predicted output is limited to a finite set of values, which is at odds with the task of predicting a continuous variable, as regression requires.
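To make this limitation concrete, the following minimal sketch uses scikit-learn's DecisionTreeRegressor (chosen purely for illustration; it is not part of the works cited): however many query points are scored, the tree can emit at most as many distinct predictions as it has leaves.

# Sketch: a regression tree predicts only the finite set of values
# stored at its leaves, even when the response is continuous.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))
y = np.sin(X).ravel()                      # a continuous response

tree = DecisionTreeRegressor(max_leaf_nodes=8).fit(X, y)

X_query = np.linspace(0, 10, 1000).reshape(-1, 1)
preds = tree.predict(X_query)
print(len(np.unique(preds)))               # at most 8 distinct values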
k-Nearest Neighbour. Another class of data mining approaches that have been used for regression is nearest neighbour techniques (E. Fix and J. L. H. Jr., 1951; P. J. Rousseeuw and A. M. Leroy, 1987). These methods estimate the response value as a weighted sum of the responses of the neighbours, where each weight is inversely proportional to the neighbour's distance from the input tuple. These algorithms are simple and reasonably outlier-resistant, but their accuracy is relatively low, both because determining the correct number of neighbours is difficult and because they assume that all dimensions contribute equally.
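As a concrete illustration of this weighting scheme, the sketch below estimates a response as the inverse-distance-weighted average of the k closest training tuples; the function name and parameters are our own illustrative choices, not taken from the cited works.

# Sketch of distance-weighted kNN regression: each neighbour's response
# is weighted by the inverse of its distance to the query tuple.
import numpy as np

def knn_regress(X_train, y_train, query, k=5, eps=1e-9):
    dists = np.linalg.norm(X_train - query, axis=1)  # distance to each tuple
    idx = np.argsort(dists)[:k]                      # the k closest tuples
    w = 1.0 / (dists[idx] + eps)                     # inverse-distance weights
    return np.sum(w * y_train[idx]) / np.sum(w)

# Usage on a toy 2-d training set:
X_train = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [2.0, 2.0]])
y_train = np.array([1.0, 2.0, 2.0, 5.0])
print(knn_regress(X_train, y_train, np.array([0.5, 0.5]), k=3))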
From the data mining perspective, we have recently developed a nearest neighbour based algorithm, PAGER (Desai A., Singh H., 2010), which enhances the power of nearest neighbour predictors. In addition, PAGER eliminates the problems associated with nearest neighbour methods, such as the choice of the number of neighbours and the differing importance of dimensions. However, PAGER suffers from a high time complexity of O(n log n), from bias due to its use of only the two closest neighbours, and from a decrease in performance in the presence of noisy neighbours. In addition, PAGER assigns equal weight to all neighbours, which does not reflect the true setting, as closer neighbours tend to be more important than distant ones. In this paper we use the framework of PAGER and present a new algorithm, SEAR, which retains the desirable properties of PAGER while eliminating the major drawbacks mentioned above.
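To see why depending on only the two closest neighbours is fragile, consider the following one-dimensional sketch; it is our own illustrative reconstruction of the idea, not PAGER's published pseudocode. A single noisy neighbour pulls the fitted line, and hence the prediction, far from the expected value.

# Illustrative only (not PAGER's actual procedure): predict by the line
# through the two closest neighbours; one noisy neighbour corrupts it.
import numpy as np

def two_neighbour_line(x_train, y_train, xq):
    order = np.argsort(np.abs(x_train - xq))[:2]   # two closest points
    (x1, x2), (y1, y2) = x_train[order], y_train[order]
    slope = (y2 - y1) / (x2 - x1)                  # assumes distinct x values
    return y1 + slope * (xq - x1)

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([1.0, 2.0, 3.0, 4.0])      # clean data: y = x
print(two_neighbour_line(x, y, 2.4))    # ~2.4, as expected
y[1] = 5.0                              # a single noisy neighbour
print(two_neighbour_line(x, y, 2.4))    # 4.2, pulled far off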
3 SEAR ALGORITHM
In this section we present the SEAR (Scalable, Efficient, Accurate and Robust) regression algorithm. SEAR uses the nearest neighbour paradigm, which makes it simple and outlier-resilient. In addition, SEAR is scalable and efficient. These desirable features make SEAR an attractive alternative to existing approaches.
In the remainder of this section we use the notation shown in Table 1. First, in Section 3.1, we present an algorithm with low space and time complexity for computing the k nearest neighbours. Section 3.2 then removes the over-dependence on the two closest neighbours for constructing a line. Finally, in Section 3.3, we describe our technique for eliminating noisy neighbours.
Table 1: Notation.

k        Number of closest neighbours used for prediction.
D        The training data.
d        Number of dimensions in D.
n        Number of training tuples in D.
A_i      Denotes a feature.
X        The feature vector space, X = (A_1, ..., A_d).
T        A tuple in X-space, T = (t_1, ..., t_d).
T_RID    The tuple in D with record id RID.
v_i,RID  Value of attribute A_i of T_RID.
N_L,i    Value of attribute A_i for the L-th closest neighbour of T.
y        The response variable.
3.1 Approximate kNN
SEAR rectifies the problems of high dimensionality and high response time by providing a variation that reduces the space complexity by a factor of d and the time complexity by a factor of n. This is achieved by a heuristic that approximately computes the k nearest neighbours in about log(n) time for k << n. For this we generate a list L_i for every attribute A_i, where each row is a two-tuple consisting of a tuple id and an attribute value. The list is sorted on the attribute values. For any tuple T and a corresponding attribute A_i, we search the list for t_i using binary search (Knuth