[Figure: precision (77–86%) plotted against K for VSM and EMD.]
Figure 1: Precision rate on K = 1, 10, 20, 30, 40, 50.
Table 2: The number of correct documents and error documents for VSM and EMD.

                          VSM correct    VSM error
    EMD correct               2,359          217
    EMD error                    94          338
test documents were labeled with several categories. Hence, when an estimated category ĉ is included in the categories labeled in a test document, we consider that the text categorization algorithm has labeled the document correctly.
To evaluate each method based on this idea, we used the precision rate over the test documents:

    precision = (number of correctly labeled documents) / (number of all test documents)    (17)
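Equation (17) can be sketched in a few lines. The function and variable names below are illustrative, not from the paper; a document counts as correct when its estimated category appears in its set of gold labels.

```python
# Minimal sketch of the evaluation rule behind Equation (17).
# Names (estimates, gold_label_sets) are illustrative assumptions.

def precision(estimates, gold_label_sets):
    """Fraction of test documents whose estimated category appears
    in the document's set of correct categories."""
    correct = sum(1 for est, gold in zip(estimates, gold_label_sets)
                  if est in gold)
    return correct / len(estimates)

# Toy usage: 3 of the 4 estimated categories fall inside the gold sets.
ests = ["grain", "trade", "ship", "corn"]
golds = [{"grain", "wheat"}, {"trade"}, {"crude"}, {"corn"}]
print(precision(ests, golds))  # 0.75
```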
The precision rate depends on the value of K in Equations (15) and (16). Figure 1 shows the precision rates of cosine similarity (VSM) and our proposed method (EMD) at K = 1, 10, 20, 30, 40, 50. EMD was superior to the conventional method, VSM, at every value of K. Both methods reached their maximum precision at K = 10: 81.6% for VSM and 85.6% for EMD, a difference of about 4.0 percentage points.
To examine categorization behavior in more detail, Table 2 shows the numbers of correct and error documents for VSM and EMD. EMD produced fewer error documents than VSM, consistent with its higher precision rate.
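As a sanity check (my addition, not from the paper), the counts in Table 2 reproduce the reported precision rates to within rounding:

```python
# Recompute the precision rates from the Table 2 confusion counts.
both_correct = 2359   # correct under both EMD and VSM
emd_only = 217        # correct under EMD, error under VSM
vsm_only = 94         # error under EMD, correct under VSM
both_error = 338      # error under both

total = both_correct + emd_only + vsm_only + both_error
emd_precision = (both_correct + emd_only) / total
vsm_precision = (both_correct + vsm_only) / total

print(total)                         # 3008 test documents
print(f"{emd_precision:.1%}")        # 85.6%
print(f"{vsm_precision:.1%}")        # 81.5% (reported as 81.6%)
```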
Figure 1 confirms that taking the dependency between indexing words into account improves the precision rate. Table 2 suggests, however, that our method can still be improved: it sometimes relates a word to too many words, or to contextually unrelated ones, and therefore fails to label some documents that VSM labels correctly. Relating a word to appropriate words increases the similarity between documents that VSM cannot match, which reduces the number of errors made by VSM. On the other hand, relating a word to too many words inflates the similarity between unrelated documents, which introduces errors that do not occur under VSM. We therefore need to investigate definitions of the distance between indexing words beyond the conditional probability P(Ti|Tj).
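The word distance discussed above can be sketched as follows. This is my illustrative reconstruction, not the paper's exact formula: P(Ti|Tj) is estimated from sentence-level co-occurrence counts, and the mapping to a distance (here simply 1 − P(Ti|Tj)) is an assumption.

```python
# Hedged sketch: estimate P(Ti|Tj) from sentence co-occurrence and
# derive a word distance. The 1 - P mapping is an assumption.
from collections import Counter
from itertools import permutations

sentences = [
    ["oil", "price", "rise"],
    ["oil", "export", "price"],
    ["grain", "export"],
]

occur = Counter()     # number of sentences containing Tj
cooccur = Counter()   # number of sentences containing both Ti and Tj
for sent in sentences:
    words = set(sent)
    for w in words:
        occur[w] += 1
    for ti, tj in permutations(words, 2):
        cooccur[(ti, tj)] += 1

def cond_prob(ti, tj):
    """P(Ti|Tj): fraction of sentences with Tj that also contain Ti."""
    return cooccur[(ti, tj)] / occur[tj] if occur[tj] else 0.0

def word_distance(ti, tj):
    """Illustrative distance: smaller when Ti occurs often given Tj."""
    return 1.0 - cond_prob(ti, tj)

print(word_distance("price", "oil"))   # 0.0 (price in every oil sentence)
print(word_distance("grain", "oil"))   # 1.0 (never co-occur)
```

Note that this distance is asymmetric, since P(Ti|Tj) generally differs from P(Tj|Ti); the discussion above is about whether mappings other than the conditional probability would relate words more appropriately.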
5 CONCLUSION
We proposed a text categorization method that uses the Earth Mover's Distance as a similarity measure. The method computes the similarity between documents while taking the dependency between words into account. The distance between words is defined by the conditional probability that one word occurs with the other in the same sentence. We confirmed on the Reuters-21578 text categorization test collection that the proposed method is superior to a conventional method based on cosine similarity.
In future work we will investigate definitions of the distance between words beyond the conditional probability, and improve the proposed method.
TEXT CATEGORIZATION USING EARTH MOVER'S DISTANCE AS SIMILARITY MEASURE