3.1 The TF-IDF Method
The idea of automatic text retrieval systems based
on the identification of text content and associated
identifiers dates back to the 1950s, but it was Gerard
Salton in the late 1970s and the 1980s who laid the
foundations of the relation between these
identifiers and the texts they represent (Salton &
Buckley, 1988). Salton suggested that every
document D could be represented by a vector of terms tk
and a set of weights wdk, where wdk is the weight
of the term tk in document D, that is to say, its
importance in the document.
A TW system should improve retrieval effectiveness in terms
of two main factors, recall and precision. Recall
reflects the requirement that the objects most relevant
to the user must be retrieved. Precision reflects the
requirement that irrelevant objects must be rejected (Ruiz
& Srinivasan, 1998). Recall may be defined as the
number of retrieved relevant objects divided by the
total number of relevant objects in the collection.
Precision, on the other hand, is the number of retrieved
relevant objects divided by the total number of retrieved
objects. Recall improves if high-frequency terms are used,
as such terms make it possible to retrieve many objects,
including the relevant ones. Precision improves if
low-frequency terms are used, as specific terms
isolate the relevant objects from the non-relevant
ones. In practice, compromise solutions are used,
with terms that are frequent enough to reach a
reasonable level of recall without lowering precision
too much.
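As an illustration of the two measures, the following minimal sketch computes precision and recall for a single query. The function name precision_recall and the object identifiers are hypothetical and serve only as an example; recall is computed here against the set of relevant objects.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    retrieved: identifiers of the objects returned by the system
    relevant:  identifiers of the objects that are actually relevant
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)  # retrieved relevant objects
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical example: 4 objects retrieved, 3 relevant, 2 in common.
retrieved = ["d1", "d2", "d3", "d4"]
relevant = ["d2", "d4", "d7"]
p, r = precision_recall(retrieved, relevant)
print(f"precision={p:.2f} recall={r:.2f}")
```

Retrieving more objects can only keep or raise recall, while it tends to lower precision, which is the trade-off described above.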
Therefore, terms that are mentioned often in
individual objects seem to be useful for improving
recall. This suggests the use of a factor named
Term Frequency (TF), which is the frequency of
occurrence of a term in an object. On the other hand,
another factor should favor the terms concentrated in
a few documents of the collection. The inverse
document frequency (IDF) varies inversely with
the number of objects (n) to which the term is
assigned in an N-object collection. A typical IDF
factor is log(N/n) (Salton & Buckley, 1988). A
usual formula to describe the weight of a term j in
document i is:
\[ w_{ij} = tf_{ij} \times idf_{j} \qquad (1) \]
This formula has been modified and improved by
many authors to achieve better results in IR and IE
(Lee et al., 1997; Liu et al., 2001; Zhao & Karypis,
2002; Lertnattee & Theeramunkong, 2002; Xu et al.,
2003).
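The basic weighting in formula (1) can be sketched as follows, using the typical IDF factor log(N/n) mentioned above. This is a minimal illustration under stated assumptions, not the implementation used by any of the cited authors: the function name tf_idf, the use of raw counts as TF, the natural logarithm and the tokenized sample documents are all choices made for the example.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: weight} dict per document.

    documents: list of token lists (one list per document).
    TF is the raw count of the term in the document;
    IDF is log(N / n), with N documents and n documents containing the term.
    """
    N = len(documents)
    df = Counter()  # n: number of documents containing each term
    for tokens in documents:
        df.update(set(tokens))

    weights = []
    for tokens in documents:
        tf = Counter(tokens)
        weights.append(
            {term: count * math.log(N / df[term]) for term, count in tf.items()}
        )
    return weights


# Hypothetical three-document collection.
docs = [
    ["fuzzy", "logic", "term", "weighting"],
    ["term", "frequency", "term", "weighting"],
    ["fuzzy", "retrieval"],
]
w = tf_idf(docs)
print(w[1]["term"])   # term occurs twice in document 1 and in 2 of 3 documents
print(w[0]["logic"])  # term occurs once and in only 1 of 3 documents
```

With this IDF factor, a term that appears in every document of the collection receives a weight of zero, since log(N/N) = 0.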
3.2 The FL-based Method
The TF-IDF method works reasonably well, but it
has the disadvantage of not considering two aspects
that are key for us.
The first is the degree of identification of the object
if only the considered index term is used. This parameter
has a strong influence on the final value of a term
weight if the degree of identification is high: the
more a keyword identifies an object, the higher the
value of the corresponding term weight.
Nevertheless, this parameter creates two practical
disadvantages when it comes to assigning term weights
in an automated and systematic way. On the one hand,
the degree of identification cannot be deduced from any
characteristic of a keyword, so it must be specified
by the System Administrator. On the other hand,
the same keyword may have a different relationship
with every object.
The second parameter is related to join terms. For
example, the index term ‘term weighting’
constitutes a join term. Every single term in a
join term has a lower value than it would have if it
did not belong to it. However, if all the single
terms of a join term appear together, the term weight
must be higher. A join term may truly identify an object,
whereas the appearance of only one of its single
terms may refer to another object.
The consideration of these two parameters
together with classical TF and IDF determines the
weight of an index term for every subset in every
level. The FL-based method gives a solution to both
problems and also offers two main advantages.
The solution to both problems is to create a table
with all the keywords and their corresponding
weights for every object. This table is created
in the phase of keyword extraction from standard
questions. Imprecision hardly affects the
working method because both term
weighting and information extraction are based on
fuzzy logic, which minimizes the effect of possible
variations in the assigned weights. The way information
is extracted also helps to overcome this
imprecision. In addition, the FL-based method
offers two important advantages: on the one hand, term
weighting is automated; on the other hand, the level
of expertise required of an operator is lower. This
operator would not need to know anything about how the
FL engine works, but only how many times
a term appears in any subset and the answer to
these questions: a) Does a keyword undoubtedly
ICEIS 2009 - International Conference on Enterprise Information Systems