2 TEXT CATEGORIZATION
Text categorization can be defined as the assignment of natural language texts to one or more predefined categories, based on their content. Automatic text categorization can play an important role in information management tasks, such as text retrieval, routing and filtering. To accomplish automatic text categorization (Baeza-Yates and Ribeiro-Neto, 1999), the set of documents, typically strings of characters, must first be converted into a representation that the learning machine can handle, with features usually being reduced and/or extracted. Afterwards a data mining phase takes place, as represented in Figure 1.
Figure 1: Automatic text categorization. Text documents pass through pre-processing (parsing: stemming and stop-word removal; dictionary building; cleaning; scaling; building train and test sets), document representation (space reduction and feature extraction, with features summarized in numerical vector form suitable for data mining), and learning (training to build a model, testing, evaluating, validating); as a result, documents are classified into predefined classes.

More specifically, the task of text categorization can
be divided into several sub-tasks: A) pre-processing,
B) parsing by applying stemming and removing stop
words (Silva and Ribeiro, 2003), C) dictionary build-
ing with the terms and their document frequency, D)
cleaning less frequent words, E) scaling, F) building
the train and test sets, G) training, H) testing, I) application of ensemble strategies, where partial classifiers are joined to benefit from the synergies between them, and J) evaluation of the classifiers.
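A minimal sketch of sub-tasks A) to F) follows, assuming scikit-learn as a stand-in for the tooling; the library, the toy documents and the parameter choices are illustrative and not those of the original experiments.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

# Toy stand-ins for the news articles (hypothetical, for illustration only).
docs = [
    "oil prices rose as crude supply tightened",
    "grain and wheat exports fell this quarter",
    "the company reported higher quarterly earnings",
    "crude oil futures climbed on supply worries",
]
labels = ["crude", "grain", "earn", "crude"]

# B)-E) in one step: stop-word removal, dictionary building with document
# frequencies, cleaning of terms that occur in fewer than 2 documents
# (min_df), and TF-IDF scaling. Stemming (part of sub-task B) is not built
# in and would need an external stemmer such as NLTK's PorterStemmer.
vectorizer = TfidfVectorizer(stop_words="english", min_df=2)
X = vectorizer.fit_transform(docs)  # documents become numerical vectors

# F) building the train and test sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.25, random_state=0)
print(X.shape, len(y_train), len(y_test))
```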
For the experiments the Reuters-21578 collection of articles (R21578) was used, which is publicly available at http://kdd.ics.uci.edu/databases/reuters21578/reuters21578.html. It is a financial corpus of news articles averaging 200 words each. The R21578 collection contains about 12,000 articles, classified into 118 possible categories. We use only 10 categories (earn, acq, money-fx, grain, crude, trade, interest, ship, wheat, corn), which cover 75% of the items and constitute an accepted benchmark. R21578 is a very heterogeneous corpus: the number of articles assigned to each category varies widely. There are articles not assigned to any of the categories and articles assigned to more than 10 categories; on the other hand, some categories have only one assigned article while others have thousands.
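The corpus is distributed in several packagings; the sketch below assumes the NLTK copy of the same collection (an assumption about distribution, not the loader used in the paper) and filters it down to the ten benchmark categories.

```python
import nltk
nltk.download("reuters", quiet=True)  # assumes NLTK's packaging of R21578
from nltk.corpus import reuters

TOP10 = {"earn", "acq", "money-fx", "grain", "crude",
         "trade", "interest", "ship", "wheat", "corn"}

docs, cats_per_doc = [], []
for fid in reuters.fileids():
    cats = TOP10.intersection(reuters.categories(fid))
    if cats:  # keep articles with at least one of the ten benchmark labels
        docs.append(reuters.raw(fid))
        cats_per_doc.append(sorted(cats))  # articles may carry several labels

print(len(docs), "of", len(reuters.fileids()), "articles kept")
```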
3 KERNEL-BASED LEARNING MACHINES AND ENSEMBLE STRATEGIES
Support Vector Machines (SVM) and Relevance Vector Machines (RVM), which achieve state-of-the-art results on several problems, are used in conjunction with ensemble learning techniques for the purpose of text classification.
Support Vector Machines. SVMs were introduced by Vapnik (Vapnik, 1998) based on the Structural Risk Minimization principle, as an alternative to the traditional Empirical Risk Minimization principle. Given
N input-output samples (x_i, t_i), i = 1, ..., N, a general two-class or binary classification problem is to find a classifier with decision function y(x) such that t_i = y(x_i), where t_i ∈ {−1, +1} is the class label for the input vector x_i. From the multiple hyperplanes
that can separate the training data without error, a lin-
ear SVM chooses the one with the largest margin. The
margin is the distance from the hyperplane to the clos-
est training examples, called support vectors. The set
of support vectors also includes the training examples
inside the margin and the misclassified ones.
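As a toy illustration of the margin and the support vectors, the sketch below trains a soft-margin linear SVM with scikit-learn's SVC on synthetic two-dimensional data; this is only the textbook construction just described, not the paper's setup.

```python
import numpy as np
from sklearn.svm import SVC

# Two synthetic Gaussian clusters, one per class.
rng = np.random.RandomState(0)
X = np.vstack([rng.randn(20, 2) - 2, rng.randn(20, 2) + 2])
t = np.array([-1] * 20 + [+1] * 20)  # class labels t_i in {-1, +1}

clf = SVC(kernel="linear", C=1.0).fit(X, t)

w, b = clf.coef_[0], clf.intercept_[0]
print("margin width:", 2.0 / np.linalg.norm(w))       # distance between margins
print("support vectors:", len(clf.support_vectors_))  # examples on/inside margin
print("y(x) for first points:", clf.predict(X[:3]))   # sign of decision function
```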
SVM Ensemble. We explored different parameters
for SVM learning (Joachims, 2007), resulting in four
different learning machines: (i) linear default kernel,
(ii) RBF kernel, (iii) linear kernel with trade-off be-
tween training error and margin set to 100, and (iv)
linear kernel with the cost-factor, by which errors on positive examples outweigh errors on negative examples, set to 2.
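The sketch below restates these four parameterizations with scikit-learn's SVC, on the assumption that SVMlight's error/margin trade-off maps to the C parameter and its cost-factor roughly to a class weight on the positive class; majority voting is shown only as one simple way to join the partial classifiers.

```python
from sklearn.svm import SVC
from sklearn.ensemble import VotingClassifier

members = [
    ("linear", SVC(kernel="linear")),                          # (i) default linear kernel
    ("rbf", SVC(kernel="rbf")),                                # (ii) RBF kernel
    ("linear_c100", SVC(kernel="linear", C=100)),              # (iii) error/margin trade-off 100
    ("linear_j2", SVC(kernel="linear", class_weight={1: 2})),  # (iv) positive errors weighted 2x
]

# Majority vote over the four partial classifiers (one illustrative
# combination rule; the paper's ensemble strategies are discussed below).
ensemble = VotingClassifier(estimators=members, voting="hard")
# ensemble.fit(X_train, y_train); predictions = ensemble.predict(X_test)
```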
Relevance Vector Machines. RVMs, proposed by Tipping (Tipping, 2001), are probabilistic non-linear models that use Bayesian theory to obtain sparse solutions for regression and classification. The RVM has a functional form identical to that of the Support Vector Machine, but provides probabilistic classification.
The number of relevance vectors does not grow linearly with the size of the training set, and the models are usually much sparser, resulting in faster performance on test data at a comparable generalization error. The overall training complexity is O(N^3), however, implying a long learning phase for very large sample sizes.
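Since the RVM shares the SVM's functional form, its prediction can be written as a sparse kernel expansion over relevance vectors. The sketch below illustrates only that form, with hypothetical weights, relevance vectors and an RBF kernel chosen for the example; the O(N^3) Bayesian training that would produce them is omitted.

```python
import numpy as np

def rbf_kernel(x, z, gamma=0.5):
    return np.exp(-gamma * np.sum((x - z) ** 2))

def rvm_predict(x, relevance_vectors, weights, bias):
    """y(x) = sigmoid(sum_n w_n K(x, x_n) + b): a sparse kernel expansion,
    read as the probability that x belongs to the +1 class."""
    s = sum(w * rbf_kernel(x, xn)
            for w, xn in zip(weights, relevance_vectors))
    return 1.0 / (1.0 + np.exp(-(s + bias)))

# Two relevance vectors stand in for the (typically small) sparse set;
# these values are hypothetical, not the output of any trained model.
rv = [np.array([0.0, 0.0]), np.array([2.0, 2.0])]
print(rvm_predict(np.array([1.8, 2.1]), rv, weights=[-1.5, 2.0], bias=0.1))
```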
RVM Ensemble. The size and the number of the training sets used in RVM ensemble modeling depend on the available computational power, but more training examples usually result in greater diversity and better performance. In our case, seven