3 CONCURRENT WORDS AND
ATTACHING WEIGHT
3.1 Concurrent Words
Concurrent words (C words) are two short unit FA
words connected by particles (e.g. the, in, and) and
used to associate fields. The importance of C words
can be expressed by ranking the weights of their
short unit FA words. This importance depends
especially on the appearance frequency and on the
association fields of the short unit FA words: the
frequency of a short unit word indicates its field rank,
and the number of overlapping fields indicates the
degree of ambiguity of the short unit word.
In this paper, it is assumed that no rank 1 short
unit FA words are C words because rank 1 FA words
refer to specific fields and it is not necessary to
converge association fields.
3.1.1 Attaching Weight
Generally, to extract the words that characterize a
file, the weight function TF x IDF is attached to each
word (TF is the frequency with which a word
appears in the file and IDF is the inverse document
frequency). However, not every word with high
frequency characterizes a file. For example, particles
(the, to, etc.) appear often in a file, but they are not
characteristic words. On the other hand, some
characteristic words have relatively low frequency,
so IDF attaches a high weight to those characteristic
words, taking into account the number of files in
which a word appears. The IDF value is given by
log(N/df(t)), where N is the total number of files and
df(t) is the number of files that include word t.
TF x IDF is given by:
$$W(d,t) = TF(d,t) \times IDF(t) \qquad (1)$$
where TF(d,t) is the normalized frequency of word t
in file d.
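Equation (1) can be illustrated with a minimal sketch. The toy files, the normalization of TF by file length, and the use of the natural logarithm are assumptions made here for illustration; the paper does not state which normalization or log base it uses.

```python
import math
from collections import Counter

def tf(doc_tokens):
    """Normalized term frequency: word count divided by file length (assumed normalization)."""
    counts = Counter(doc_tokens)
    total = len(doc_tokens)
    return {word: c / total for word, c in counts.items()}

def idf(docs):
    """IDF(t) = log(N / df(t)), with N files and df(t) files containing t (natural log assumed)."""
    n = len(docs)
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))  # count each word once per file
    return {word: math.log(n / d) for word, d in df.items()}

def tf_idf(docs):
    """W(d, t) = TF(d, t) x IDF(t) for every file d and every word t in d."""
    idf_values = idf(docs)
    return [{w: f * idf_values[w] for w, f in tf(tokens).items()} for tokens in docs]

# Hypothetical toy files, not taken from the paper's data set
docs = [
    "the striker scored the goal".split(),
    "the senate passed the bill".split(),
    "the goal of the bill".split(),
]
weights = tf_idf(docs)
```

In this toy data the particle "the" occurs in every file, so its IDF is log(3/3) = 0 and its weight vanishes despite its high frequency, while "striker", which occurs in a single file, receives the largest weight in the first file.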
This research applies TF x IDF to the normalized
frequency of a word α in one field A. So, the weight
of a short unit word α can be defined as:
$$\mathrm{Weight}_A(\alpha) = \mathrm{Freq}_A(\alpha) \times \log\left(\frac{N}{\mathrm{Category\_num}(\alpha)}\right) \qquad (2)$$
where Freq_A(α) is the normalized frequency of word
α in field A, N is the total number of fields and
Category_num(α) is the number of fields containing α.
In the same way, the weight of word β in field A can
be calculated:
$$\mathrm{Weight}_A(\beta) = \mathrm{Freq}_A(\beta) \times \log\left(\frac{N}{\mathrm{Category\_num}(\beta)}\right)$$
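Consider a C word α+β in a field A. The weight of
the C word is:
$$\mathrm{Weight}_A(\alpha+\beta) = \mathrm{Weight}_A(\alpha) + \mathrm{Weight}_A(\beta) = \mathrm{Freq}_A(\alpha) \times \log\left(\frac{N}{\mathrm{Category\_num}(\alpha)}\right) + \mathrm{Freq}_A(\beta) \times \log\left(\frac{N}{\mathrm{Category\_num}(\beta)}\right) \qquad (3)$$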
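The per-word weight and the C word weight defined above can be sketched as follows. The function names, the dictionary layout, and the numeric statistics are hypothetical, and the natural logarithm is an assumption, since the log base is not stated.

```python
import math

def field_weight(freq, n_fields, category_num):
    """Weight of one short unit word in a field: Freq x log(N / Category_num)."""
    return freq * math.log(n_fields / category_num)

def c_word_weight(stats_a, stats_b, n_fields):
    """Weight of a C word: sum of the weights of its two short unit words."""
    return (field_weight(stats_a["freq"], n_fields, stats_a["category_num"])
            + field_weight(stats_b["freq"], n_fields, stats_b["category_num"]))

# Hypothetical statistics for two short unit words in the same field
N_FIELDS = 100
alpha = {"freq": 0.02, "category_num": 5}    # appears in few fields
beta = {"freq": 0.02, "category_num": 100}   # appears in every field

w = c_word_weight(alpha, beta, N_FIELDS)
```

With these hypothetical numbers, the word that appears in every field contributes log(100/100) = 0, so the C word weight comes entirely from the less ambiguous word.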
The following cases are examples of weight
according to degree of importance of C words:
Case (1): C words with high frequency are
confirmed to be improper for use as CFA words.
In field <Soccer>:
foreigner: Freq. = 52, Category_num(foreigner) = 35
athlete: Freq. = 535, Category_num(athlete) = 57
foreigner and athlete: Freq. = 52, Cross_Category_num = 26
foreigner and athlete (frequency rank) = 13
foreigner and athlete (rank by weighting according to
degree of importance) = 408
W_new(foreigner and athlete) =
52 x log (133/35) + 535 x log (133/57)
52 x = 58.27 26
In field <Soccer>, the concurrent relation of
“foreigner” and “athlete” has a frequency of 52. If C
words are ranked by frequency, this pair receives a
relatively high rank of 13 in field <Soccer>. So,
considering only frequency, the C words might
appear to be important. However, the concurrent
relation of “foreigner” and “athlete” is not
characteristic of field <Soccer>, because “foreigner”
and “athlete” appear in all sub-fields of field
<SPORTS>.
Ranking “foreigner” and “athlete” by weighting
according to degree of importance gives a relatively
low rank of 408. So, the C words “foreigner” and
“athlete” are not CFA words in field <Soccer>.
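The two per-word terms in the <Soccer> example can be checked numerically. The sketch below is only an arithmetic aid: it assumes N = 133 (read off the log(133/...) terms), uses the raw counts 52 and 535 as the frequencies exactly as the example does, and tries two log bases because the base is not stated. It evaluates only the sum of the two per-word terms; it does not model how Cross_Category_num enters the reported W_new value of 58.27.

```python
import math

N_FIELDS = 133  # assumption: total number of fields, read off the example's log(133/...) terms

def weight_sum(log):
    """Return the 'foreigner' term, the 'athlete' term, and their sum for a given log function."""
    foreigner = 52 * log(N_FIELDS / 35)   # Freq. = 52, Category_num = 35
    athlete = 535 * log(N_FIELDS / 57)    # Freq. = 535, Category_num = 57
    return foreigner, athlete, foreigner + athlete

# The log base is not given in the text, so both common choices are printed.
for name, log in (("log10", math.log10), ("ln", math.log)):
    foreigner, athlete, total = weight_sum(log)
    print(f"{name}: foreigner={foreigner:.2f} athlete={athlete:.2f} sum={total:.2f}")
```

Under either base the "athlete" term dominates the sum, because "athlete" is far more frequent in the field than "foreigner".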
4 EVALUATION RESULTS
4.1 Field Systems and Test Data
To verify the efficiency of the new method
described in this paper, about 38,000 articles were
selected from a data set of 20 Newsgroups from the
CNN Web Site (1995-2001). The articles cover
various topics related to sports, computers, politics,
economics, etc. Articles were accumulated by
searching article titles for keywords that exist in the
field tree system.
4.2 Method Evaluation
Precision and Recall are evaluated to show how well
weighting according to degree of importance
NEW METHOD USING DECLINABLE WORDS AND CONCURRENT WORDS TO CREATE A LARGE NUMBER
OF FA WORDS