3 SIMULATION
As was mentioned in the introduction, the aim of this
work is to study empirically which of the similar-
ity measures presented in Section 2.2 can be used
to compare experience-based evaluation sets repre-
sented by IFSs. Hence, in this section we describe
both the learning and the evaluation processes that
were used to obtain the IFSs that represent the sim-
ulated experience-based evaluation sets.
3.1 Learning Process
In this part we describe the data, scenarios and algo-
rithm that were employed to simulate how a human
editor categorizes newswire stories.
3.1.1 Learning Data
We made use of the Reuters Corpora Volume I
(RCV1) (Rose et al., 2002), which is a collection of
manually categorized newswire stories provided by
Reuters, Ltd. Specifically, we made use of the cor-
rected version RCV1.v2, which is available (and fully
described) in (Lewis et al., 2004). This collection
has 804414 newswire stories, each assigned to one or
more (sub) categories within three main categories:
Topics, Regions and Industries.
We made use of the 23149 newswire stories in the
training file lyrl2004_tokens_train.dat to learn how
to categorize newswire stories into one or more of
the following categories from Topics: ECAT, E11,
E12, GSCI, GSPO, GTOUR, GVIO, CCAT, C12, C13,
GCAT, G15, GDEF, GDIP, GDIS, GENT, GENV,
GFAS, GHEA and GJOB. The interested reader is re-
ferred to (Lewis et al., 2004) for a full description of
these categories.
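As a rough sketch, the training file can be read into per-story token lists. The `.I <docid>` / `.W` record layout assumed below is our reading of the tokenized RCV1-v2 files, and `parse_tokens_file` is an illustrative name, not part of the original setup:

```python
def parse_tokens_file(lines):
    """Parse documents from a lyrl2004-style token file into
    {docid: [tokens]}, assuming a `.I <docid>` / `.W` record layout
    with blank-line separators (treat this layout as an assumption)."""
    docs, did, toks = {}, None, []
    for line in lines:
        line = line.strip()
        if line.startswith(".I "):
            if did is not None:          # close the previous document
                docs[did] = toks
            did, toks = int(line[3:]), []
        elif line == ".W" or not line:   # section marker / separator
            continue
        else:
            toks.extend(line.split())
    if did is not None:                  # close the last document
        docs[did] = toks
    return docs
```

Each story's token list can then be restricted to the chosen Topics categories via the usual label files.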
3.1.2 Learning Scenarios
We established the following scenarios to learn how
to categorize newswire stories into each of the chosen
categories:
- R0: All the stories in the training data keep the
original assignment of the training category.
- R20, R40, R60, R80, R100: The assignment of the
training category is flipped to the opposite of its
original state in 20%, 40%, 60%, 80% and 100% of
the stories in the training data, respectively; the
assignment in the remaining stories is preserved.
The stories whose assignment is flipped are selected
by simple random sampling.
For instance, consider the story with code 2286,
which was assigned to the category ECAT. In the sce-
nario R20, if the training category is ECAT and the
story is selected for flipping, the story is treated
as a nonmember of ECAT.
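The sampling step above can be sketched as follows; `flip_labels`, the seed, and the toy story set are illustrative, assuming category membership is stored as one boolean per story:

```python
import random

def flip_labels(labels, flip_rate, seed=0):
    """Return a copy of `labels` (story id -> membership boolean) with the
    membership inverted for a simple random sample of flip_rate * len(labels)
    stories, as in scenario R<p>; the rest keep their original state."""
    rng = random.Random(seed)
    story_ids = sorted(labels)
    k = round(flip_rate * len(story_ids))
    flipped = set(rng.sample(story_ids, k))
    return {sid: (not m if sid in flipped else m) for sid, m in labels.items()}

# Toy example: story 2286 is a member of ECAT, the others are not.
ecat = {2286: True, 2287: False, 2288: False, 2289: True, 2290: False}
r100 = flip_labels(ecat, 1.0)   # R100: every assignment is inverted
r0 = flip_labels(ecat, 0.0)     # R0: all assignments are preserved
```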
3.1.3 Learning Algorithm
We made use of an algorithm based on support vec-
tor machines, or SVMs for short (Vapnik, 1995; Vap-
nik and Vapnik, 1998), which are grounded in
statistical learning theory. Specifically, we
made use of the application of SVMs to the text cat-
egorization problem proposed in (Joachims, 1998),
which has demonstrated superior results on this
problem (Lewis et al., 2004).
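For illustration, a trained linear SVM assigns a story to a category according to the sign of w · x + b, with one binary classifier per category in the setting of (Joachims, 1998); the function name and toy weights below are a hedged sketch, not the actual trained model:

```python
def svm_decision(weights, bias, story_vec):
    """Binary decision of a trained linear SVM for one category:
    the story is assigned to the category when w . x + b > 0.
    `weights` and `story_vec` map words to real values; words absent
    from `weights` contribute nothing to the score."""
    score = sum(weights.get(f, 0.0) * v for f, v in story_vec.items())
    return score + bias > 0

# Made-up weights for a category and a toy weighted story vector.
w_cat = {"stock": 1.5, "exchang": 0.8, "sport": -2.0}
assert svm_decision(w_cat, -0.5, {"stock": 2.0, "exchang": 1.0}) is True
```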
In the context of the text categorization problem,
the words in a newswire story are the features that
determine whether the story belongs or not to a cate-
gory. This follows an intuition in which, according to
his/her experience, a person focuses on the words in a
document to decide whether it fits or not into a given
category.
To use the SVM algorithm, each story must be
modeled as a vector whose components are the words
in the story. A story might contain words such as
‘the’, ‘of’ or ‘at’ that have a negligible impact on
the categorization decision, or words such as ‘learn-
ing’, ‘learned’ or ‘learn’ that share a common stem.
To simplify the vector representation, such words are
usually filtered out or reduced to their stem by dif-
ferent algorithms. Hence, for the sake of reproduc-
ibility of the simulation, we made use of the stories
in the training file lyrl2004_tokens_train.dat (Lewis
et al., 2004), whose words are already filtered and
stemmed. For
example, the story with code 2320 has the following
words: tuesday, stock, york, seat, seat, nys, level, mil-
lion, million, million, sold, sold, current, off, exchang,
exchang, exchang, bid, prev, sale, mln.
Since the impact of the words on the categoriza-
tion decision could be different, a weight should be
assigned to each word. Thus, to compute the (ini-
tial) weight of a word in a story (or document), as
suggested in (Lewis et al., 2004), we applied the
equation
weight(f, x) = (1 + ln n(f, x)) ln(|X′|/n(f, X′)),   (11)
which is a kind of tf-idf weighting given in (Buckley
et al., 1994), where X′ is the training collection (i.e.,
the collection of stories in lyrl2004_tokens_train.dat),
x ∈ X′ is a story, f is a word in x, n(f, x) is the num-
ber of occurrences of f in x, n(f, X′) is the num-
ber of stories in X′ that contain f, and |X′| is the
number of stories in X′ (i.e., |X′| = 23149). For ex-
FCTA 2015 - 7th International Conference on Fuzzy Computation Theory and Applications