Authors:
Jorge Fernandes; Andreia Artífice and Manuel J. Fonseca
Affiliation:
INESC-ID/ IST/ Technical University of Lisbon, Portugal
Keyword(s):
LSA, LSA dimension, Unsupervised text classification, Bootstrapping.
Related Ontology Subjects/Areas/Topics:
Artificial Intelligence; Clustering and Classification Methods; Computational Intelligence; Evolutionary Computing; Information Extraction; Knowledge Discovery and Information Retrieval; Knowledge-Based Systems; Machine Learning; Soft Computing; Symbolic Systems
Abstract:
Collections of information have now reached considerable sizes, making it hard to find and explore a particular subject. One way to address this problem is text classification, where a theme or category is assigned to a text based on an analysis of its content. However, existing approaches to text classification demand effort and a high level of expertise from users, which puts them out of reach of the common user. Another problem is that current approaches are optimized for a specific problem and cannot easily be adapted to another context. In particular, unsupervised methods based on the LSA algorithm require users to define the dimension to use in the algorithm. In this paper we describe an approach that makes text classification more accessible to common users, by providing a formula to estimate the LSA dimension from the number of texts used during the bootstrapping process. Experimental results show that our formula for estimating the LSA dimension allows us to create unsupervised solutions that achieve results similar to those of supervised approaches.
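To illustrate where the LSA dimension enters the computation, here is a minimal sketch of LSA as a truncated SVD of a term-document matrix. It is not the authors' implementation: the abstract does not state their estimation formula, so `k` below is a hand-picked placeholder, and the function name `lsa_embed` and the toy matrix are assumptions made for illustration only.

```python
import numpy as np

def lsa_embed(term_doc, k):
    """Project documents into a k-dimensional latent space via truncated SVD.

    term_doc: array of shape (n_terms, n_docs) with term counts.
    k: LSA dimension (the value a user normally has to choose).
    Returns an array of shape (n_docs, k).
    """
    u, s, vt = np.linalg.svd(term_doc, full_matrices=False)
    k = min(k, len(s))  # cannot keep more dimensions than singular values
    # Document coordinates are the columns of S_k V_k^T, returned as rows.
    return (np.diag(s[:k]) @ vt[:k]).T

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy term-document matrix (rows: terms, columns: documents).
# Documents 0-1 share "sports" terms, documents 2-3 share "politics" terms.
counts = np.array([
    [2, 1, 0, 0],  # "goal"
    [1, 2, 0, 0],  # "match"
    [0, 0, 3, 1],  # "vote"
    [0, 0, 1, 3],  # "election"
], dtype=float)

k = 2  # placeholder; the paper estimates this from the number of bootstrapping texts
docs = lsa_embed(counts, k)

same_topic = cosine(docs[0], docs[1])   # two sports documents
cross_topic = cosine(docs[0], docs[2])  # sports vs. politics
```

In this toy case, documents on the same topic end up pointing in the same direction in the reduced space, while documents on different topics are orthogonal. Choosing `k` too small or too large changes how well such groupings separate, which is why an automatic estimate of the dimension removes a tuning step from the user.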