TEXT CATEGORIZATION USING EARTH MOVER’S DISTANCE AS SIMILARITY MEASURE

Hidekazu Yanagimoto, Sigeru Omatu

Abstract

We propose a text categorization system that uses the Earth Mover’s Distance (EMD) as the similarity measure between documents. Many text categorization systems adopt the Vector Space Model and use cosine similarity as the similarity measure between documents. This approach assumes that the words in a document are uncorrelated, because each word corresponds to an orthogonal axis of the vector space. However, this assumption is undesirable when documents contain many synonyms and polysemous words. The EMD does not require this assumption because it is computed as the solution of a transportation problem. To compute the EMD while taking dependencies among words into account, we define the distance between words, which is needed to compute the EMD, using the co-occurrence frequency of the words. We evaluate the proposed method on the ModApte split of the Reuters-21578 text categorization test collection and confirm that it improves the precision of text categorization.
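
To make the transportation-problem view concrete, the following is a minimal sketch and not the authors' implementation. It treats each document as a bag of word counts and computes the EMD as a minimum-cost flow between the two bags, normalized by the total flow. The abstract does not state the exact word-distance formula, so `cooccurrence_distance` is a hypothetical stand-in (one minus the Jaccard overlap of the documents containing each word). The toy corpus and all function names are likewise assumptions for illustration.

```python
from itertools import combinations


def cooccurrence_distance(corpus):
    """Hypothetical word distance: 1 - Jaccard overlap of the sets of
    documents containing each word (an assumption, not the paper's formula)."""
    postings = {}
    for doc_id, doc in enumerate(corpus):
        for word in set(doc):
            postings.setdefault(word, set()).add(doc_id)
    dist = {(w, w): 0.0 for w in postings}
    for a, b in combinations(sorted(postings), 2):
        overlap = len(postings[a] & postings[b]) / len(postings[a] | postings[b])
        dist[a, b] = dist[b, a] = 1.0 - overlap
    return dist


def emd(p, q, dist):
    """EMD between two bags of words (dict: word -> positive integer count),
    solved as a min-cost flow with successive shortest paths."""
    nodes = ["S"] + [("p", w) for w in p] + [("q", w) for w in q] + ["T"]
    index = {node: i for i, node in enumerate(nodes)}
    n, s, t = len(nodes), 0, len(nodes) - 1
    edges = []                      # [target, residual capacity, cost]
    graph = [[] for _ in range(n)]  # edge k and edge k ^ 1 are a forward/reverse pair

    def add_edge(u, v, cap, cost):
        graph[u].append(len(edges))
        edges.append([v, cap, cost])
        graph[v].append(len(edges))
        edges.append([u, 0, -cost])

    limit = min(sum(p.values()), sum(q.values()))  # total flow to be moved
    for w, count in p.items():
        add_edge(s, index["p", w], count, 0.0)
    for w, count in q.items():
        add_edge(index["q", w], t, count, 0.0)
    for a in p:
        for b in q:
            add_edge(index["p", a], index["q", b], limit, dist[a, b])

    flow, cost = 0, 0.0
    while True:
        # Bellman-Ford: cheapest augmenting path in the residual graph.
        best = [float("inf")] * n
        via = [None] * n
        best[s] = 0.0
        for _ in range(n - 1):
            changed = False
            for u in range(n):
                if best[u] == float("inf"):
                    continue
                for k in graph[u]:
                    v, cap, c = edges[k]
                    if cap > 0 and best[u] + c < best[v] - 1e-12:
                        best[v], via[v], changed = best[u] + c, k, True
            if not changed:
                break
        if via[t] is None:
            break
        # Find the bottleneck capacity, then push that much flow along the path.
        push, v = float("inf"), t
        while v != s:
            push = min(push, edges[via[v]][1])
            v = edges[via[v] ^ 1][0]
        v = t
        while v != s:
            k = via[v]
            edges[k][1] -= push
            edges[k ^ 1][1] += push
            v = edges[k ^ 1][0]
        flow += push
        cost += push * best[t]
    return cost / flow if flow else 0.0


# Toy corpus (hypothetical): "car" and "automobile" often appear together.
corpus = [
    ["car", "automobile", "engine"],
    ["car", "automobile", "road"],
    ["stock", "market", "price"],
    ["stock", "market", "trade"],
    ["car", "road"],
]
dist = cooccurrence_distance(corpus)
print(emd({"car": 1, "engine": 1}, {"automobile": 1, "engine": 1}, dist))
print(emd({"car": 1, "engine": 1}, {"stock": 1, "market": 1}, dist))
```

In this sketch, two documents that share no terms ("car" versus "automobile") still receive a small distance, whereas their cosine similarity in the Vector Space Model would be zero. For realistic vocabulary sizes a dedicated transportation or linear-programming solver would be a better fit than this simple successive-shortest-path loop.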



Paper Citation


in Harvard Style

Yanagimoto H. and Omatu S. (2007). TEXT CATEGORIZATION USING EARTH MOVER’S DISTANCE AS SIMILARITY MEASURE. In Proceedings of the Ninth International Conference on Enterprise Information Systems - Volume 3: ICEIS, ISBN 978-972-8865-90-0, pages 632-635. DOI: 10.5220/0002406606320635


in Bibtex Style

@conference{iceis07,
author={Hidekazu Yanagimoto and Sigeru Omatu},
title={TEXT CATEGORIZATION USING EARTH MOVER’S DISTANCE AS SIMILARITY MEASURE},
booktitle={Proceedings of the Ninth International Conference on Enterprise Information Systems - Volume 3: ICEIS},
year={2007},
pages={632-635},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0002406606320635},
isbn={978-972-8865-90-0},
}


in EndNote Style

TY - CONF
JO - Proceedings of the Ninth International Conference on Enterprise Information Systems - Volume 3: ICEIS
TI - TEXT CATEGORIZATION USING EARTH MOVER’S DISTANCE AS SIMILARITY MEASURE
SN - 978-972-8865-90-0
AU - Yanagimoto H.
AU - Omatu S.
PY - 2007
SP - 632
EP - 635
DO - 10.5220/0002406606320635