Supervised Learning requires a manual classification of a group of texts into a predefined set of categories. This result is then used to train and build an automatic classifier able to categorize any text into the predefined set of categories.
According to (Huang, 2001), there are two key factors for successful supervised learning. One is feature extraction, which should accurately represent the contents of a text in a compact and efficient manner; the other is classifier design, which should take maximum advantage of the properties inherent to the texts in order to achieve the best possible results. Huang studied several algorithms for both factors and concluded that the LSA algorithm is the most appropriate for feature extraction and the SVM for the classifier.
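As a rough illustration of the pipeline recommended by Huang, the sketch below combines LSA-based feature extraction with a linear SVM using scikit-learn; the toy corpus, the labels and the parameter values are purely illustrative assumptions of ours and are not taken from the cited works:

    # LSA features (TF-IDF followed by truncated SVD) feeding a linear SVM.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.decomposition import TruncatedSVD
    from sklearn.svm import LinearSVC
    from sklearn.pipeline import make_pipeline

    texts = [
        "stock markets fell as oil prices rose",
        "the central bank raised interest rates again",
        "the home team scored a late winning goal",
        "the injured striker missed the final match",
    ]
    labels = ["economy", "economy", "sports", "sports"]  # manual classification

    model = make_pipeline(
        TfidfVectorizer(),
        TruncatedSVD(n_components=2),  # the LSA dimension; see Section 3
        LinearSVC(),
    )
    model.fit(texts, labels)
    print(model.predict(["interest rates and oil prices moved again"]))

Note that the value passed as n_components is exactly the LSA dimension whose estimation is the subject of Section 3.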
(Debole and Sebastiani, 2003) and (Ishii et al., 2006) both agree with Huang on using LSA for feature extraction, but they introduced some changes to the feature extraction process. While the former included a number of “supervised variants” of TFIDF weighting, the latter complemented the LSA by introducing the concept of data grouping.
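To give a concrete idea of a supervised variant of TFIDF weighting, the sketch below replaces the IDF factor with a category-dependent chi-square score; this only illustrates the general principle, and the actual weighting functions of (Debole and Sebastiani, 2003) differ in their details:

    # Supervised term weighting: TF x category-aware score instead of TF x IDF.
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.feature_selection import chi2

    texts = ["oil prices rose", "interest rates fell again", "the team won the final match"]
    labels = [0, 0, 1]                    # two toy categories

    vectorizer = CountVectorizer()
    tf = vectorizer.fit_transform(texts)  # raw term frequencies
    scores, _ = chi2(tf, labels)          # one supervised score per term
    weighted = tf.toarray() * scores      # supervised weighting of each term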
Although supervised learning can obtain good results, it requires a large number of texts (values reported in the literature vary between 500 and 1400) and their manual classification to train the final classifier.
Unsupervised Learning tries to overcome the disadvantages of the supervised approaches by replacing the manual classification of a large number of texts with an automatic classification (often called bootstrapping). By doing so we are able to greatly reduce the costs and the need for human intervention.
Unfortunately, the automatic classification of texts used by unsupervised learning can cause various misclassifications, introducing noise into the training of the classifier and affecting its final performance, which is traditionally worse than that of supervised learning.
Since most unsupervised approaches require a list of representative keywords for each category, some authors have tried to improve the bootstrapping quality by developing algorithms that help select the best keywords for each category. (Liu et al., 2004) used a clustering algorithm to identify the most important words of each cluster of texts. The user could then inspect the ranked list and select a small set of representative keywords for each category.
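A minimal sketch of this keyword-suggestion idea, assuming a TF-IDF representation and k-means clustering (the corpus and the number of clusters are illustrative choices, not those of the cited work):

    # Cluster unlabelled texts and rank the strongest terms of each centroid
    # so that a user can pick representative keywords per category.
    import numpy as np
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.cluster import KMeans

    texts = [
        "oil prices rose on the stock market",
        "the central bank raised interest rates",
        "the striker scored the winning goal",
        "the team lost the final match",
    ]

    vectorizer = TfidfVectorizer()
    X = vectorizer.fit_transform(texts)
    km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

    terms = np.array(vectorizer.get_feature_names_out())
    for c, centroid in enumerate(km.cluster_centers_):
        top = terms[np.argsort(centroid)[::-1][:5]]
        print("cluster %d: %s" % (c, ", ".join(top)))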
(Barak et al., 2009) went a step further by completely automating the process. Their approach attempts to automatically extract possible keywords using only the category name as a starting point. The authors also introduced a novel scheme that models both lexical references, based on certain relations present in WordNet and Wikipedia, and contextual references, using the LSA model. From the resulting model they extract the necessary keywords.
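The lexical-reference part of such an approach can be sketched very roughly with WordNet alone, expanding a category name through its synonyms and hyponyms; the function name and the cut-off below are ours, and the Wikipedia and LSA components of the cited work are not reproduced here:

    # Expand a category name into candidate keywords via WordNet relations.
    # Requires the WordNet data: nltk.download('wordnet').
    from nltk.corpus import wordnet as wn

    def expand_category(name, max_keywords=20):
        keywords = set()
        for synset in wn.synsets(name, pos=wn.NOUN):
            for lemma in synset.lemmas():                  # synonyms
                keywords.add(lemma.name().replace("_", " "))
            for hypo in synset.hyponyms():                 # more specific terms
                keywords.update(l.name().replace("_", " ") for l in hypo.lemmas())
        return sorted(keywords)[:max_keywords]

    print(expand_category("sport"))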
(Gliozzo et al., 2005) tried to minimize the number of misclassifications of the bootstrapping by first preprocessing the text to remove all words that are not nouns, verbs, adjectives or adverbs. The resulting set of words is then represented using LSA. An algorithm based on unsupervised estimation of Gaussian Mixtures is then applied to differentiate between relevant and non-relevant information, using statistics from unclassified texts. According to the authors, an SVM classifier trained with the results of this bootstrapping algorithm achieved results comparable to a supervised solution.
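A rough sketch of this kind of bootstrapping, under our own simplifying assumptions (a tiny corpus, a single seed string per category, and scikit-learn components in place of the exact algorithms of the cited work):

    # Keep only content words, map texts and a category seed into an LSA space,
    # and fit a two-component Gaussian mixture on the text-to-category
    # similarities to separate relevant from non-relevant texts.
    # Requires nltk.download('punkt') and nltk.download('averaged_perceptron_tagger').
    import nltk
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.decomposition import TruncatedSVD
    from sklearn.metrics.pairwise import cosine_similarity
    from sklearn.mixture import GaussianMixture

    CONTENT_TAGS = ("NN", "VB", "JJ", "RB")   # nouns, verbs, adjectives, adverbs

    def content_words(text):
        tagged = nltk.pos_tag(nltk.word_tokenize(text))
        return " ".join(w for w, tag in tagged if tag.startswith(CONTENT_TAGS))

    texts = [
        "oil prices rose sharply on the stock market today",
        "the central bank unexpectedly raised interest rates",
        "the striker scored a superb goal in the final match",
        "my cat sleeps most of the afternoon",
    ]
    category_seed = "economy market prices interest rates"

    documents = [content_words(t) for t in texts] + [category_seed]
    X = TfidfVectorizer().fit_transform(documents)
    lsa = TruncatedSVD(n_components=2).fit_transform(X)

    scores = cosine_similarity(lsa[:-1], lsa[-1:])          # text-to-category similarity
    gmm = GaussianMixture(n_components=2, random_state=0).fit(scores)
    print(gmm.predict(scores))  # which component is "relevant" must still be decided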
In summary, although supervised learning approaches present the best results, they require an expert to manually classify a large number of texts, which is an arduous and monotonous task with an enormous associated cost. On the other hand, unsupervised learning avoids the manual classification by including a bootstrapping technique, but requires specific knowledge about the algorithms in use (e.g. LSA) and about the problem domain. Indeed, when we use an approach that reduces the dimension of the features extracted from the text, such as the LSA algorithm, the selection of that dimension is very important, since its value affects the final results. From our analysis of the several existing proposals based on the LSA algorithm, we did not find a clear explanation of how to choose the best dimension for the LSA. In most cases its value is chosen after several iterations, taking into account the specific context of the problem at hand.
To overcome this, to minimize human intervention, and to offer good results, we propose in this paper a solution to automatically estimate the “optimum” dimension for the LSA algorithm, taking into account only the number of texts.
3 LSA DIMENSION ESTIMATION
The solution that we developed for the bootstrapping step starts by reducing the size of the vocabulary, removing useless words that only introduce noise into the categorization. We remove words contained in a stopword list and the least frequent words, i.e. words that appear fewer than three times (Joachims, 1997). By removing the least frequent words we are able to eliminate typos and reduce the noise in the vocabulary. Additionally, by reducing the size of the vocabulary we will reduce the complexity of the problem and the computational cost of all the following algorithms.
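A minimal sketch of this vocabulary-reduction step using scikit-learn (note that min_df counts the number of texts containing a word, which only approximates the raw occurrence count used above):

    # Discard stopwords and words occurring in fewer than three texts.
    from sklearn.feature_extraction.text import CountVectorizer

    texts = [
        "oil prices rose on the market",
        "market prices fell after the announcement",
        "analysts expect prices to rise on the market",
        "the team won the match",
    ]

    vectorizer = CountVectorizer(stop_words="english", min_df=3)
    term_counts = vectorizer.fit_transform(texts)
    print(vectorizer.get_feature_names_out())   # only the surviving vocabulary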