DOCUMENTS REPRESENTATION BASED ON INDEPENDENT COMPRESSIBILITY FEATURE SPACE

Nuo Zhang, Toshinori Watanabe

Abstract

Two well-known feature representation methods, the bag-of-words and N-gram models, have been widely used in natural language processing, text mining, and web document analysis. A novel Pattern Representation scheme using Data Compression (PRDC) has been proposed for data representation. PRDC can process not only linguistic text but also other multimedia data effectively. Although PRDC performs better than the traditional methods in some situations, it still suffers from the problems of dictionary selection and feature space construction. In this study, we propose a method for PRDC to construct an independent compressibility space, and we compare the proposed method with the two other representation methods and with PRDC. Performance is compared in terms of clustering ability. Experimental results show that the proposed method performs better than PRDC and the other two methods.
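As a rough illustration of what a compressibility feature might look like, the sketch below maps a document to a vector of compression ratios, one per dictionary. It is not the authors' implementation: the abstract does not specify the compressor or how the dictionaries are built, so zlib's DEFLATE with a preset dictionary is used here purely as a stand-in, and the names compressed_size and compressibility_vector are hypothetical. The step that makes the space independent is not covered.

```python
import zlib


def compressed_size(data: bytes, dictionary: bytes) -> int:
    """Size in bytes of `data` after DEFLATE compression primed with `dictionary`.

    Assumption: zlib's preset-dictionary mechanism stands in for the
    dictionary-based compressor used by PRDC.
    """
    compressor = zlib.compressobj(level=9, zdict=dictionary)
    return len(compressor.compress(data) + compressor.flush())


def compressibility_vector(text: str, dictionaries: list[bytes]) -> list[float]:
    """Compression ratio (compressed size / original size) per dictionary.

    A lower ratio means the dictionary describes the text better.
    """
    data = text.encode("utf-8")
    if not data:
        raise ValueError("text must not be empty")
    return [compressed_size(data, d) / len(data) for d in dictionaries]


if __name__ == "__main__":
    # Hypothetical dictionaries built from two topics.
    finance = b"the stock market fell sharply as investors sold their shares"
    sports = b"the goalkeeper saved a penalty in front of a packed stadium"
    doc = "the stock market fell sharply as investors sold their shares"
    print(compressibility_vector(doc, [finance, sports]))
```

Documents with similar vectors compress well under the same dictionaries, so the vectors can be passed to any standard clustering routine. Which dictionaries to use is exactly the selection problem the abstract raises.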



Paper Citation


in Harvard Style

Zhang N. and Watanabe T. (2010). DOCUMENTS REPRESENTATION BASED ON INDEPENDENT COMPRESSIBILITY FEATURE SPACE. In Proceedings of the 2nd International Conference on Agents and Artificial Intelligence - Volume 1: ICAART, ISBN 978-989-674-021-4, pages 217-222. DOI: 10.5220/0002704402170222


in Bibtex Style

@conference{icaart10,
author={Nuo Zhang and Toshinori Watanabe},
title={DOCUMENTS REPRESENTATION BASED ON INDEPENDENT COMPRESSIBILITY FEATURE SPACE},
booktitle={Proceedings of the 2nd International Conference on Agents and Artificial Intelligence - Volume 1: ICAART},
year={2010},
pages={217-222},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0002704402170222},
isbn={978-989-674-021-4},
}


in EndNote Style

TY - CONF
JO - Proceedings of the 2nd International Conference on Agents and Artificial Intelligence - Volume 1: ICAART
TI - DOCUMENTS REPRESENTATION BASED ON INDEPENDENT COMPRESSIBILITY FEATURE SPACE
SN - 978-989-674-021-4
AU - Zhang N.
AU - Watanabe T.
PY - 2010
SP - 217
EP - 222
DO - 10.5220/0002704402170222