DOCUMENT RELATION ANALYSIS BASED ON COMPRESSIBILITY VECTOR

Nuo Zhang, Daisuke Matsuzaki, Toshinori Watanabe, Hisashi Koga

2009

Abstract

A great many electronic documents are now easily accessible, so a method that can evaluate documents and extract their significant content would be valuable. Similarity analysis and topic extraction are widely used document relation analysis techniques. Most existing methods rely on dictionary-based morphological analysis, which falls short as the Internet grows rapidly: new terms appear faster than dictionaries can be updated. In this study, we propose a novel document relation analysis (topic extraction) method based on a compressibility vector. The method requires no morphological analysis and evaluates input documents automatically. We examine it on model documents and the Reuters-21578 dataset, for both relation analysis and topic extraction, and show its effectiveness in simulations.
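The abstract does not spell out how a compressibility vector is built, so the following is only a minimal sketch under stated assumptions, not the authors' implementation. It assumes the vector holds one compression ratio per reference text, and it substitutes zlib's DEFLATE with a preset dictionary for whatever compressor the paper actually uses. The function names `compression_ratio`, `compressibility_vector`, and `cosine_similarity`, and the sample texts, are hypothetical.

```python
import math
import zlib


def compression_ratio(text: str, reference: str) -> float:
    """Compressed size of `text`, primed with `reference`, over its raw size.

    Assumption: a zlib preset dictionary stands in for the dictionary-based
    compressor of the original method.
    """
    data = text.encode("utf-8")
    compressor = zlib.compressobj(zdict=reference.encode("utf-8"))
    compressed = compressor.compress(data) + compressor.flush()
    return len(compressed) / len(data)


def compressibility_vector(text: str, references: list[str]) -> list[float]:
    """One compression ratio per reference text (lower means more shared content)."""
    return [compression_ratio(text, ref) for ref in references]


def cosine_similarity(u: list[float], v: list[float]) -> float:
    """Cosine of the angle between two compressibility vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


# Hypothetical reference texts, one per topic.
finance_ref = "the bank raised interest rates and the stock market fell on inflation fears"
sports_ref = "the team scored a late goal and won the football match in extra time"

doc = "interest rates rose as the stock market reacted to inflation fears"
vec = compressibility_vector(doc, [finance_ref, sports_ref])
```

Because the vector is computed directly on bytes, no tokenizer or morphological dictionary is involved, which is the property the abstract emphasizes. Documents can then be compared through their vectors, for example with `cosine_similarity`.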



Paper Citation


in Harvard Style

Zhang N., Matsuzaki D., Watanabe T. and Koga H. (2009). DOCUMENT RELATION ANALYSIS BASED ON COMPRESSIBILITY VECTOR. In Proceedings of the International Conference on Agents and Artificial Intelligence - Volume 1: ICAART, ISBN 978-989-8111-66-1, pages 255-260. DOI: 10.5220/0001660202550260


in Bibtex Style

@conference{icaart09,
author={Nuo Zhang and Daisuke Matsuzaki and Toshinori Watanabe and Hisashi Koga},
title={DOCUMENT RELATION ANALYSIS BASED ON COMPRESSIBILITY VECTOR},
booktitle={Proceedings of the International Conference on Agents and Artificial Intelligence - Volume 1: ICAART},
year={2009},
pages={255-260},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0001660202550260},
isbn={978-989-8111-66-1},
}


in EndNote Style

TY - CONF
JO - Proceedings of the International Conference on Agents and Artificial Intelligence - Volume 1: ICAART
TI - DOCUMENT RELATION ANALYSIS BASED ON COMPRESSIBILITY VECTOR
SN - 978-989-8111-66-1
AU - Zhang N.
AU - Matsuzaki D.
AU - Watanabe T.
AU - Koga H.
PY - 2009
SP - 255
EP - 260
DO - 10.5220/0001660202550260