
the classification of 2 million 10-dimensional points in 15 seconds on a Pentium-4 (2.4 GHz, 256 MB RAM, Linux). In order to deal with very large data sets (at least one billion points), the incremental version is derived from the computation of E^T E and d = E^T De.
3 INCREMENTAL PSVM
3.1 Row-incremental PSVM
(Fung and Mangasarian, 2002) have proposed to split the training data set E into blocks of lines E_i, D_i and to compute E^T E and d = E^T De from these blocks: E^T E = ∑ E_i^T E_i and d = ∑ d_i = ∑ E_i^T D_i e. At each step, we only need to load the (blocksize)×(n+1) matrix E_i and the (blocksize)×1 vector D_i e to compute E^T E and d = E^T De. We only need to store in memory an (n+1)×(n+1) matrix and an (n+1)×1 vector, although the data set is of the order of one billion data points. The authors have performed the linear classification of one billion data points in 10-dimensional input space into two classes in less than 2 hours and 26 minutes on a Pentium II.
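As an illustration, a minimal NumPy sketch of this row-incremental scheme is given below. The block layout, the function name row_incremental_psvm and the toy data generator are our own assumptions, not the authors' implementation; only the accumulation E^T E = ∑ E_i^T E_i, d = ∑ E_i^T D_i e and the final (n+1)×(n+1) solve come from the description above.

```python
import numpy as np

def row_incremental_psvm(data_blocks, nu=1.0):
    """Row-incremental PSVM sketch: accumulate E^T E and d = E^T De
    over row blocks, then solve the (n+1)x(n+1) system once.

    data_blocks yields pairs (A_i, y_i): A_i is a (blocksize x n) block of
    training points and y_i its +1/-1 class labels (i.e. D_i e).
    """
    EtE, d = None, None
    for A_i, y_i in data_blocks:
        E_i = np.hstack([A_i, -np.ones((A_i.shape[0], 1))])   # E_i = [A_i  -e]
        if EtE is None:
            EtE = np.zeros((E_i.shape[1], E_i.shape[1]))
            d = np.zeros(E_i.shape[1])
        EtE += E_i.T @ E_i                                     # E^T E = sum E_i^T E_i
        d += E_i.T @ y_i                                       # d = sum E_i^T D_i e
    # solve (I/nu + E^T E) [w; b] = d
    wb = np.linalg.solve(np.eye(EtE.shape[0]) / nu + EtE, d)
    return wb[:-1], wb[-1]                                     # weights w and threshold b

# toy usage: five blocks of 1000 points in 10 dimensions
rng = np.random.default_rng(0)
blocks = []
for _ in range(5):
    A_i = rng.normal(size=(1000, 10))
    blocks.append((A_i, np.where(A_i[:, 0] > 0, 1.0, -1.0)))
w, b = row_incremental_psvm(blocks, nu=100.0)
```

Only the (n+1)×(n+1) and (n+1)×1 accumulators survive between blocks, which is what makes the scheme independent of the total number of data points.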
3.2 Column-incremental PSVM
The algorithm described in the previous section can handle data sets with a very large number of datapoints and a small number of attributes. But some applications (like bioinformatics or text mining) involve data sets with a very large number of attributes and few training data points. To adapt the algorithm to this problem, we have applied the Sherman-Morrison-Woodbury formula to the linear equation system (4) and obtain:
[w_1 w_2 ... w_n b]^T = (I/ν + E^T E)^-1 E^T De
                      = ν E^T [De − (I/ν + EE^T)^-1 EE^T De]    (5)
where E = [A  -e]
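The equivalence between the two expressions in (5) follows from the Sherman-Morrison-Woodbury identity and can be checked numerically; the short sketch below (our own, with arbitrary sizes m = 20 and n = 500) compares both sides on random data.

```python
import numpy as np

rng = np.random.default_rng(1)
m, n, nu = 20, 500, 1.0                       # few points, many attributes
A = rng.normal(size=(m, n))
y = rng.choice([-1.0, 1.0], size=m)           # the labels De

E = np.hstack([A, -np.ones((m, 1))])          # E = [A  -e]

# left-hand side of (5): inverts an (n+1) x (n+1) matrix
lhs = np.linalg.solve(np.eye(n + 1) / nu + E.T @ E, E.T @ y)

# right-hand side of (5): only an m x m matrix is inverted
EEt = E @ E.T
rhs = nu * E.T @ (y - np.linalg.solve(np.eye(m) / nu + EEt, EEt @ y))

print(np.allclose(lhs, rhs))                  # expected: True
```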
The solution of (5) depends on the inversion of the m×m matrix (I/ν + EE^T) instead of the (n+1)×(n+1) matrix (I/ν + E^T E) in (4). The cost of storage and computation thus depends on the number of training data points, so this formulation can handle data sets with a very large number of attributes and few training data points. We have followed the row-incremental algorithm to construct a column-incremental algorithm able to deal with a very large number of dimensions. The data are split into blocks of columns E_i and we perform the incremental computation of EE^T = ∑ E_i E_i^T. At each step, we only need to load the m×(blocksize) matrix E_i to compute EE^T. Between two incremental steps, we only need to store in memory the m×m matrix EE^T, although the dimension of the input space is very high.
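A minimal sketch of this column-incremental variant might look as follows; the list of column blocks, the second pass used to recover [w; b] = E^T u and the function name are our own assumptions about how the accumulation EE^T = ∑ E_i E_i^T and the solution of (5) fit together.

```python
import numpy as np

def column_incremental_psvm(column_blocks, y, nu=1.0):
    """Column-incremental PSVM sketch.

    column_blocks: list of (m x blocksize) column blocks E_i of E = [A  -e]
                   (the trailing -e column sits in the last block).
    y:             +1/-1 class labels, i.e. the vector De.
    """
    m = y.shape[0]

    # first pass: accumulate the m x m matrix EE^T = sum E_i E_i^T
    EEt = np.zeros((m, m))
    for E_i in column_blocks:
        EEt += E_i @ E_i.T

    # solve the m x m system of (5) once
    u = nu * (y - np.linalg.solve(np.eye(m) / nu + EEt, EEt @ y))

    # second pass: [w; b] = E^T u, recovered block by block
    # (in practice each block would be re-read from disk here)
    wb = np.concatenate([E_i.T @ u for E_i in column_blocks])
    return wb[:-1], wb[-1]                    # weights w and threshold b
```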
With these two formulations of the linear incremental PSVM, we are able to deal with very large data sets (large either in the number of training data points or in the number of attributes, but not yet both simultaneously). We have used them to classify bio-medical data sets with interesting results in terms of learning time and classification accuracy (Do & Poulet, 2003). The parallel and distributed versions of the two incremental PSVM algorithms can be found in (Poulet & Do, 2003) and (Poulet, 2003).
4 BOOSTING OF PSVM
For mining massive data sets with simultaneously large numbers (at least 10^4) of datapoints and attributes, there are at least two problems to solve: the learning time and the memory requirement. The PSVM algorithm and its incremental versions need to store and invert a matrix of size m×m (or n×n). To scale PSVM to large data sets, we have applied the boosting approach to the PSVM algorithm. We briefly explain the mechanism of boosting of PSVM; more details about boosting can be found in (Freund & Schapire, 1999). Boosting is a general method for improving the accuracy of any given learning algorithm. The boosting algorithm repeatedly calls a given weak or base learning algorithm k times so that each boosting step concentrates mostly on the errors produced by the previous step. To achieve this goal, we need to maintain a distribution of weights over the training examples. Initially, all weights are set equally, and at each boosting step the weights of incorrectly classified examples are increased so that the weak learner is forced to focus on the hard examples in the training set. The final hypothesis is a weighted majority vote of the k weak hypotheses. In our approach, we consider the PSVM algorithm as the weak learner; thus at each boosting step we can sample a subset of the training set according to the distribution of weights over the training examples. Note that PSVM only processes this subset (smaller than the original training set). The subset size is inversely proportional to the number of boosting steps.
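A rough sketch of this boosting scheme, with PSVM as the weak learner, is given below. The AdaBoost-style error and weight update, the fixed subset_size parameter (whereas the subset size above is taken inversely proportional to the number of boosting steps) and all function names are our own illustration, not the exact procedure used here.

```python
import numpy as np

def psvm_fit(A, y, nu=1.0):
    """Weak learner: linear PSVM, [w; b] = (I/nu + E^T E)^-1 E^T De."""
    E = np.hstack([A, -np.ones((A.shape[0], 1))])
    wb = np.linalg.solve(np.eye(E.shape[1]) / nu + E.T @ E, E.T @ y)
    return wb[:-1], wb[-1]

def psvm_predict(A, w, b):
    return np.sign(A @ w - b)

def boost_psvm(A, y, k=10, subset_size=1000, nu=1.0, seed=0):
    """Boosting of PSVM: at each of the k steps, sample a subset according to
    the current weight distribution, train a PSVM on it, and increase the
    weights of the misclassified examples."""
    rng = np.random.default_rng(seed)
    m = A.shape[0]
    weights = np.full(m, 1.0 / m)                     # initially uniform
    hypotheses = []
    for _ in range(k):
        idx = rng.choice(m, size=min(subset_size, m), p=weights)
        w, b = psvm_fit(A[idx], y[idx], nu)           # weak PSVM on the subset only
        pred = psvm_predict(A, w, b)
        err = np.clip(weights[pred != y].sum(), 1e-10, 1 - 1e-10)
        alpha = 0.5 * np.log((1 - err) / err)         # vote of this weak hypothesis
        weights *= np.exp(-alpha * y * pred)          # misclassified points get heavier
        weights /= weights.sum()
        hypotheses.append((alpha, w, b))

    def predict(X):
        # final hypothesis: weighted majority vote of the k weak PSVMs
        votes = sum(a * psvm_predict(X, w, b) for a, w, b in hypotheses)
        return np.sign(votes)
    return predict
```

Each weak PSVM only solves a small (n+1)×(n+1) system on its subset, so neither the full data set nor a full m×m matrix has to be held in memory at once.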
Row-incremental or column-incremental PSVM can be adapted to handle large subsets with high performance concerning