that the number H of concepts can vary from a minimum of 5 to a maximum of 20, and considering that it is an integer we conclude that the number of possible values for H is 15. We have empirically chosen these values for H considering that we wish to have at most |T'_p| ≈ 300.³
Since ψ_ij and ρ_is are probabilities, and hence real values, we have that ν ∈ [0, 1] and each µ_i ∈ [0, 1]. It means that if we use a step of 1% to explore the entire set [0, 1], then we have 100 possible values for ν and 100 for each µ_i, which makes 100 × 100 × H × 15 possible values of Λ, that is 750,000 for H = 5 and 3,000,000 for H = 20.
To limit such a space we can reduce the number of parameters; for instance, we can consider µ_i = µ, ∀i ∈ [1, · · · , H], thus obtaining 150,000 possible values of Λ, independently of H.
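For concreteness, this reduced grid can be enumerated directly. The following is a minimal sketch in Python, assuming a 1% step and the shared threshold µ; the helper name candidate_grid and the exact range used for H are ours, not the paper's.

    def candidate_grid(h_values=range(5, 20)):
        """Hypothetical helper: enumerate the configurations Λ = (ν, µ, H)
        under the simplification µ_i = µ for all i; h_values stands in for
        the 15 integer values of H counted above."""
        levels = [k / 100 for k in range(1, 101)]  # 1% step over (0, 1]
        return [(nu, mu, h) for nu in levels for mu in levels for h in h_values]

    print(len(candidate_grid()))  # 100 * 100 * 15 = 150,000, independently of H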
Searching for the best solution is still not easy, and it does not provide an accurate solution, because of the large number of possible values and because of the linear exploration strategy of the set [0, 1] we are employing. In fact, by analysing how the values of ψ_ij and ρ_is are distributed over the set [0, 1], we note that they are not uniformly distributed. This means that many values of ψ_ij and ρ_is are likely closer than 1%, with the consequence that if the thresholds ν and µ are chosen through that linear exploration, then many values will be treated in the same way. To solve this problem one could reduce the step from 1% to 0.1%, thus obtaining more accuracy in the exploration of the set [0, 1]. The problem in this case is that the solution space grows by two orders of magnitude (1,000 × 1,000 threshold pairs instead of 100 × 100), so this way is not feasible. Another way to reduce the space is the
application of a clustering method, such as the K-means algorithm, to all the ψ_ij and ρ_is values (Bishop, 2006). In this way we can have a space of possible values extracted by a non-uniform procedure adapted directly to the observed numbers rather than to the set to which the numbers belong. Following this approach and choosing, for instance, 10 classes of values for ν and µ, we obtain that the space of possible Λ is 10 × 10 × 15, that is 1,500. As a consequence, the optimum solution can be obtained exactly after exploring the entire space of solutions. This reduction allows us to compute a mGT from a repository composed of a few documents in a reasonable time; for instance, for 10 documents it takes about 30 seconds on a Mac OS X based computer with a 2.66 GHz Intel Core i7 CPU and 8 GB of RAM. Otherwise we would need an algorithm based on a random search procedure in big solution spaces (for instance, Evolutionary Algorithms would be suitable for this purpose), which can be very slow.
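A minimal sketch of this quantization step, assuming scikit-learn's KMeans; the arrays psi_values and rho_values below are placeholder stand-ins for the observed ψ_ij and ρ_is probabilities, and all variable names are ours.

    import numpy as np
    from sklearn.cluster import KMeans

    # Placeholder stand-ins for the observed probabilities; in the real
    # system these would be collected from the learned graph structures.
    rng = np.random.default_rng(0)
    psi_values = rng.beta(2, 5, size=500)    # stands in for all ψ_ij
    rho_values = rng.beta(2, 5, size=2000)   # stands in for all ρ_is

    def threshold_candidates(values, n_classes=10):
        """Cluster the observed probabilities into n_classes groups and use
        the sorted cluster centers as the candidate thresholds."""
        km = KMeans(n_clusters=n_classes, n_init=10)
        km.fit(np.asarray(values).reshape(-1, 1))
        return np.sort(km.cluster_centers_.ravel())

    nu_candidates = threshold_candidates(psi_values)  # 10 candidates for ν
    mu_candidates = threshold_candidates(rho_values)  # 10 candidates for µ
    # With the 15 values for H this gives 10 x 10 x 15 = 1,500 configurations.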
³This number is the one usually employed with Support Vector Machines, which have been shown to be among the best classifiers.
4 INDUCTIVE CONSTRUCTION OF THE CLASSIFIER
The inductive construction of a ranking classifier for category c_i ∈ C usually consists in the definition of a function CSV_i : D → [0, 1] that, given a document d_j, returns a categorization status value CSV_i(d_j) for it, i.e. a number between 0 and 1 that represents the evidence for the fact that d_j ∈ c_i; in other words, it is a measure of vector closeness in the |T'_p|-dimensional space. Following this criterion, each document is then ranked according to its CSV_i value, and so the system works as a document-ranking text classifier, namely a “soft” decision based classifier.
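The text does not commit to a specific form for CSV_i; purely as an illustration of such a closeness-based soft score, the sketch below uses cosine similarity between a document vector and a category prototype. Both names are hypothetical.

    import numpy as np

    def csv_score(doc_vec, category_prototype):
        """Hypothetical CSV_i: cosine closeness in the |T'_p|-dimensional
        feature space; with non-negative term weights the result lies in
        [0, 1]."""
        den = np.linalg.norm(doc_vec) * np.linalg.norm(category_prototype)
        if den == 0.0:
            return 0.0
        return float(np.dot(doc_vec, category_prototype)) / den

    # Sorting the repository by csv_score gives the document-ranking behaviour.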
As we have discussed in the previous sections, we need a binary classifier, also known as a “hard” classifier, capable of assigning to each document a value T or F according to the vector closeness. A way to turn a soft classifier into a hard one is to define a threshold γ_i such that CSV_i(d_j) ≥ γ_i is interpreted as T while CSV_i(d_j) < γ_i is interpreted as F. We have adopted an experimental method, namely CSV thresholding (Sebastiani, 2002), which consists in testing different values for γ_i on a subset of the training set (the validation set) and choosing the value which maximizes effectiveness. Different γ_i's have been chosen for the different c_i's.
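A minimal sketch of this thresholding step, assuming F1 on the validation set as the effectiveness measure; the function name and inputs are ours, not from the paper.

    def choose_gamma(csv_scores, true_labels, candidates):
        """Return the threshold γ_i that maximizes F1 on the validation set.
        csv_scores: CSV_i(d_j) for each validation document;
        true_labels: True iff the document actually belongs to c_i."""
        def f1(gamma):
            pred = [s >= gamma for s in csv_scores]
            tp = sum(p and t for p, t in zip(pred, true_labels))
            fp = sum(p and not t for p, t in zip(pred, true_labels))
            fn = sum(t and not p for p, t in zip(pred, true_labels))
            if tp == 0:
                return 0.0
            precision, recall = tp / (tp + fp), tp / (tp + fn)
            return 2 * precision * recall / (precision + recall)
        return max(candidates, key=f1)

    # Example: gamma_i = choose_gamma(scores, labels, [k / 100 for k in range(101)])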
Table 1: mGT for the topic Corn (see Fig. 2).

    Conceptual Level
    Concept i    Concept j    Relation Factor (ψ_ij)
    corn         us           4.0
    ···          ···          ···

    Word Level
    Concept i    Word s       Relation Factor (ρ_is)
    corn         south        2.0
    corn         us           1.96
    corn         export       1.69
    corn         africa       1.0
    ···          ···          ···
    us           south        1.17
    us           taiwan       1.0
    ···          ···          ···
5 EVALUATION
We have considered a classic text classification prob-
lem performed on the Reuters-21578 repository. This
is a collection of 21,578 newswire articles, originally
collected and labeled by Carnegie Group, Inc. and
Reuters, Ltd. The articles are assigned classes from
a set of 118 topic categories. A document may be