Authors:
Fabio Clarizia; Francesco Colace; Massimo De Santo; Luca Greco and Paolo Napoletano
Affiliation:
University of Salerno, Italy
Keyword(s):
Text classification, Term extraction, Probabilistic topic model.
Related Ontology Subjects/Areas/Topics:
Artificial Intelligence; Clustering and Classification Methods; Computational Intelligence; Evolutionary Computing; Information Extraction; Knowledge Discovery and Information Retrieval; Knowledge-Based Systems; Machine Learning; Mining Text and Semi-Structured Data; Soft Computing; Symbolic Systems
Abstract:
Text classification methods have shown high accuracy when evaluated on supervised classification tasks over
large datasets. However, because these classifiers need to learn from many examples to perform well on a test
set, difficulties can arise when they are employed in real contexts. Most users of a practical system do not
want to spend a long time labeling documents only to obtain a better level of accuracy. They prefer algorithms
that achieve high accuracy without requiring a large amount of manual labeling.
In this paper we propose a new supervised method for single-label text classification, based on a mixed Graph
of Terms, that is capable of achieving good performance, in terms of accuracy, when the size of the training
set is 1% of the original. The mixed Graph of Terms can be automatically extracted from a set of documents
following a kind of term clustering technique weighted by the probabilistic topic model. The method has been
tested on the top 10 classes of the ModApte split of the Reuters-21578 dataset and learnt on 1% of the
original training set. Results have confirmed the discriminative property of the graph and shown that
the proposed method is comparable with existing methods learnt on the whole training set.
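
To make the 1% training regime concrete, the following is a minimal sketch of how a small labeled subset could be drawn from a larger training set. The abstract does not say how the 1% subset was selected, so the per-class random sampling, the function name `sample_training_subset`, and its parameters are assumptions made purely for illustration, not the authors' procedure.

```python
import random
from collections import defaultdict


def sample_training_subset(labeled_docs, fraction=0.01, seed=0):
    """Draw roughly `fraction` of the documents of each class at random.

    `labeled_docs` is an iterable of (document, label) pairs.
    Assumption: sampling is done per class and every class keeps at
    least one document; the source does not specify either choice.
    """
    rng = random.Random(seed)

    # Group documents by their single label.
    by_label = defaultdict(list)
    for doc, label in labeled_docs:
        by_label[label].append(doc)

    subset = []
    for label in sorted(by_label):
        docs = by_label[label]
        # Keep at least one example so no class disappears entirely.
        k = max(1, round(len(docs) * fraction))
        subset.extend((doc, label) for doc in rng.sample(docs, k))
    return subset
```

With a hypothetical training set of 1000 documents in one class and 500 in another, `sample_training_subset(docs)` would keep 10 and 5 documents respectively. A fixed `seed` keeps the draw reproducible, which matters when results on such a small subset are compared across runs.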