2 BACKGROUND
In this work, the underlying scenario is text catego-
rization, where source items are textual documents
(e.g., webpages, online news, scientific papers, and
e-books).
According to Luhn (1958), only a relatively
small number of terms in a document are meaningful
for an ML or IR task. Non-informative terms that
frequently occur in a document are called stopwords. Such terms
are mainly pronouns, articles, prepositions, conjunctions,
frequent verb forms, etc. (Silva and Ribeiro,
2003). In principle, stopwords are expected to occur
in every document. The work of Francis and Kucera
(1983) shows that the ten most frequent terms in the
English language typically account for between 20 and 30
percent of all term occurrences in a document
collection. Furthermore, Hart (1994) estimates that
over 50% of all terms in an English document belong
to a set of about 135 common terms in the Brown corpus
(Kucera and Francis, 1967).
Stopwords are not only expected to have a very
low discriminant value, but they can also introduce
noise in an IR task (Rijsbergen, 1979). For
these reasons, a stoplist is usually built with terms
that should be filtered out during the document representation
process, since they reduce retrieval effectiveness.
Traditionally, stoplists have included
only the most frequently occurring terms of
a specific language. Several systems have been
developed to suggest stoplists automatically.
SMART (Salton, 1971) was the first system to
automatically build a stoplist, containing 571 English
terms. Fox (1989) initially proposed only 421 terms,
and then derived a stoplist from the Brown corpus
(Francis and Kucera, 1983). This set was typically
adopted as standard stoplist in many subsequent re-
search works and systems (Fox, 1992).
Nonetheless, the use of fixed stoplists across dif-
ferent document collections could negatively affect
the performance of a system. In English, for example,
a text classifier might encounter problems with terms
such as “language c”, “vitamin a”, “IT engineer”, or
“US citizen” where the forms “c”, “a”, “it”, or “us”
are usually removed (Dolamic and Savoy, 2010; Lo
et al., 2005). In other words, we deem that each document
collection is unique, making it useful to devise
methods and algorithms able to automatically build
a different stoplist for each collection, with the goal of
maximizing the performance of an ML or IR system.
Several metrics are used to weight terms for iden-
tifying a stoplist in a document collection. The most
common metric is the TF-IDF (Salton and McGill,
1984), in which the weight is given as a product of
two parts: the term frequency (TF), i.e., the frequency
of a term in a document; and the inverse document
frequency (IDF), i.e., the inverse of the number of documents
in the collection in which the term occurs. The
use of TF-IDF makes it possible to rank terms, filtering
those that frequently appear in a document collection
(Silva and Ribeiro, 2003). A further approach to
find stopwords is the use of entropy as a discriminant
measure (Sinka and Corne, 2003b). Entropy, here,
is correlated with the frequency variance of a given
term over multiple documents, meaning that terms
with very high frequency in some documents, but very
low frequency in others, will have higher entropy than
terms with similar frequency in all documents of the
collection. The list of terms is ordered by ascending
entropy to reveal terms that have a greater probability
of being noisy. Further works define automated stop-
words extraction techniques by focusing on statistical
approaches (Hao and Hao, 2008; Wilbur and Sirotkin,
1992).
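As an illustration of these two weighting schemes, the following sketch ranks the terms of a toy corpus by TF-IDF and by entropy. The corpus, the unsmoothed IDF form, and the use of per-document relative frequencies in the entropy computation are our own simplifying assumptions, not the exact formulations of the cited works:

```python
import math

# Toy corpus: three tokenized documents (an assumption for illustration).
docs = [
    "the cat sat on the mat".split(),
    "the dog chased the cat".split(),
    "quantum entanglement of the photon pairs".split(),
]
N = len(docs)

def tf_idf(term):
    # Collection-wide term frequency times the (unsmoothed) inverse
    # document frequency: a term present in every document scores zero.
    tf = sum(d.count(term) for d in docs)
    df = sum(1 for d in docs if term in d)
    return tf * math.log(N / df)

def entropy(term):
    # Entropy of the term's frequency distribution over documents:
    # an even spread yields high entropy, concentration in a few
    # documents yields low entropy.
    freqs = [d.count(term) for d in docs]
    total = sum(freqs)
    return sum(-f / total * math.log2(f / total) for f in freqs if f > 0)

# "the" appears in all documents, so its TF-IDF collapses to zero
# while its entropy is close to the maximum log2(N).
stopword_candidates = sorted(
    {t for d in docs for t in d}, key=lambda t: (tf_idf(t), -entropy(t))
)
print(stopword_candidates[0])  # "the" ranks first as a stopword candidate
```

Under either ranking, a cut-off (a fixed list size or a score threshold) then determines which terms enter the stoplist.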
As the most acknowledged approaches do not assign
a value to the discriminant power of a term, we use
novel metrics able to measure it, with the goal of
identifying stopwords for a document collection. The
adopted metrics are the discriminant and the characteristic
capability, defined in a previous work (Armano,
2014). The former is expected to increase with
the ability to distinguish a given category
from the others. The latter is expected to grow
with how frequent and common a term is
across all categories. In our work, terms having a low
discriminant value and high characteristic value are
considered stopwords.
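As a sketch of this selection rule, assume each term has already been scored with the two capabilities. The scores and the thresholds below are hypothetical placeholders, not values produced by the actual metrics of (Armano, 2014), whose definitions follow in the next section:

```python
# Hypothetical sketch: a term is flagged as a stopword when its
# discriminant capability falls below a threshold delta and its
# characteristic capability exceeds a threshold gamma. All scores
# and thresholds here are illustrative placeholders.

def select_stopwords(scores, delta=0.1, gamma=0.8):
    """scores maps each term to a (discriminant, characteristic) pair."""
    return {t for t, (d, c) in scores.items() if d < delta and c > gamma}

toy_scores = {
    "the":     (0.02, 0.95),  # common everywhere, weakly discriminant
    "quantum": (0.70, 0.10),  # rare overall, strongly category-specific
    "paper":   (0.05, 0.60),  # weakly discriminant but not common enough
}
print(select_stopwords(toy_scores))  # only "the" meets both conditions
```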
3 THE ADOPTED METRICS
In this paper, we adopt two novel metrics able to pro-
vide relevant information to researchers in several IR
and ML tasks (Armano, 2014). The proposal consists
of two unbiased metrics, i.e., independent of the
imbalance between positive (P) and negative (N) samples.
For a binary classifier, the former means that the
item belongs to the considered category C, whereas
the latter means that the item belongs to the alternate
category C̄ (i.e., the set of remaining categories). The
metrics rely on the classical four basic components
of a confusion matrix, i.e., true positives (TP), true
negatives (TN), false positives (FP), and false negatives
(FN). The most acknowledged metrics, e.g., precision,
recall (or sensitivity), and accuracy, are calculated
in terms of such entries. Similarly, our metrics
rely on some of the classical metrics (see Table 1 for
ICAART 2015 - International Conference on Agents and Artificial Intelligence