DOCUMENT RELATION ANALYSIS BASED ON COMPRESSIBILITY VECTOR
Nuo Zhang, Daisuke Matsuzaki, Toshinori Watanabe, Hisashi Koga
2009
Abstract
A great many electronic documents are now easily accessible, so a method that can evaluate documents and extract their significant content would be beneficial. Similarity analysis and topic extraction are widely used document relation analysis techniques. Most existing methods rely on dictionary-based morphological analysis. Such methods fall short when the Internet grows quickly and new terms appear faster than a dictionary can be updated. In this study, we propose a novel document relation analysis (topic extraction) method based on a compressibility vector. The proposed method does not require morphological analysis and can evaluate input documents automatically. We examine the method on a model document and the Reuters-21578 dataset, for both relation analysis and topic extraction. The effectiveness of the proposed method is shown in simulations.
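The abstract does not spell out how a compressibility vector is built, so the following is only a minimal sketch under stated assumptions: it assumes the vector holds one compression ratio per reference text, each obtained by compressing the input document with a dictionary primed from that reference. It uses zlib's DEFLATE with a preset dictionary as a stand-in compressor, which is not necessarily the coder used in the paper. The function names and the sample texts are hypothetical.

```python
import zlib


def compressed_size(data: bytes, dictionary: bytes = b"") -> int:
    """Return the DEFLATE-compressed size of data, optionally primed with a dictionary."""
    if dictionary:
        # zdict seeds the compressor with byte strings it may reference.
        comp = zlib.compressobj(level=9, zdict=dictionary)
    else:
        comp = zlib.compressobj(level=9)
    return len(comp.compress(data) + comp.flush())


def compressibility_vector(document: str, references: list[str]) -> list[float]:
    """One compression ratio per reference text (lower means more shared content).

    Assumption: ratio = compressed size with the reference as dictionary,
    divided by the raw size of the document.
    """
    data = document.encode("utf-8")
    if not data:
        raise ValueError("document must not be empty")
    return [
        compressed_size(data, ref.encode("utf-8")) / len(data)
        for ref in references
    ]


# Hypothetical sample texts, not taken from the paper or from Reuters-21578.
ref_oil = "opec cut production quotas and crude oil prices rose sharply in london trading"
ref_grain = "wheat and corn harvest forecasts were raised after heavy rain in the midwest"
doc = "crude oil prices rose sharply as opec cut production quotas"

vector = compressibility_vector(doc, [ref_oil, ref_grain])
```

Under these assumptions, a document compresses better against a reference it shares phrasing with, so the first component of `vector` is smaller than the second. No word segmentation or dictionary of terms is needed, since the compressor works directly on bytes.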
References
- Ziv, J. and Lempel, A. (Sept. 1978). Compression of individual sequences via variable-rate coding. IEEE Trans. Inf. Theory, IT-24(5):530-536.
- Reuters Ltd. (Mar. 2008). Reuters-21578 text categorization test collection. Reuters-21578 dataset from http://www.daviddlewis.com/resources/testcollections/reuters21578/.
- Porter, M. (Mar. 2008). The Porter stemming algorithm. http://tartarus.org/~martin/PorterStemmer/.
- van Rijsbergen, C. J. (Mar. 2008). Stop word list. http://ftp.dcs.glasgow.ac.uk/idom/irresources/linguistic-utils/stop-words.
- Saito, K. (2005). Multiple topic detection by parametric mixture models (PMM): automatic web page categorization for browsing. NTT Technical Review, 3(3):15-18.
- Bell, T. C., Cleary, J. G. and Witten, I. H. (1990). Text Compression. Prentice-Hall.
- Watanabe, T., Sugawara, K. and Sugihara, H. (May 2002). A new pattern representation scheme using data compression. IEEE Trans. PAMI, 24(5):579-590.
Paper Citation
in Harvard Style
Zhang N., Matsuzaki D., Watanabe T. and Koga H. (2009). DOCUMENT RELATION ANALYSIS BASED ON COMPRESSIBILITY VECTOR. In Proceedings of the International Conference on Agents and Artificial Intelligence - Volume 1: ICAART, ISBN 978-989-8111-66-1, pages 255-260. DOI: 10.5220/0001660202550260
in Bibtex Style
@conference{icaart09,
author={Nuo Zhang and Daisuke Matsuzaki and Toshinori Watanabe and Hisashi Koga},
title={DOCUMENT RELATION ANALYSIS BASED ON COMPRESSIBILITY VECTOR},
booktitle={Proceedings of the International Conference on Agents and Artificial Intelligence - Volume 1: ICAART},
year={2009},
pages={255-260},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0001660202550260},
isbn={978-989-8111-66-1},
}
in EndNote Style
TY - CONF
JO - Proceedings of the International Conference on Agents and Artificial Intelligence - Volume 1: ICAART
TI - DOCUMENT RELATION ANALYSIS BASED ON COMPRESSIBILITY VECTOR
SN - 978-989-8111-66-1
AU - Zhang N.
AU - Matsuzaki D.
AU - Watanabe T.
AU - Koga H.
PY - 2009
SP - 255
EP - 260
DO - 10.5220/0001660202550260