Dimension Reduction with Coevolutionary Genetic Algorithm for Text Classification

Tatiana Gasanova, Roman Sergienko, Eugene Semenkin, Wolfgang Minker

Abstract

Running classification algorithms on large text corpora is time-consuming, so it is important to reduce the dimension of text classification problems. We propose a dimension reduction method based on hierarchical agglomerative clustering of terms, with cluster weights optimized by a cooperative coevolutionary genetic algorithm. The method was applied to five different corpora using several classification methods and different text preprocessing techniques. It significantly reduces the dimension of the text classification problem. After clustering and cluster weight optimization, classification effectiveness either increases or decreases only insignificantly.
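
As a rough illustration of the idea in the abstract, the sketch below groups terms by agglomerative clustering and then represents a document by one weighted feature per term cluster. Everything specific here is an assumption made for illustration, not the authors' implementation: the per-term vectors, the centroid-distance merge rule (the abstract does not state the linkage criterion), and the names `cluster_terms` and `reduce_document`. The cluster weights are left at a placeholder value of 1.0; in the paper they are optimized by a cooperative coevolutionary genetic algorithm, which is not reproduced here.

```python
from itertools import combinations


def cluster_terms(term_vectors, n_clusters):
    """Greedy agglomerative clustering: merge the two closest clusters
    until only n_clusters remain (centroid distance, an assumed criterion)."""
    # Start with one singleton cluster per term.
    clusters = [[term] for term in sorted(term_vectors)]

    def centroid(cluster):
        dims = len(term_vectors[cluster[0]])
        return [sum(term_vectors[t][d] for t in cluster) / len(cluster)
                for d in range(dims)]

    def distance(a, b):
        # Squared Euclidean distance between cluster centroids.
        return sum((x - y) ** 2 for x, y in zip(centroid(a), centroid(b)))

    while len(clusters) > n_clusters:
        i, j = min(combinations(range(len(clusters)), 2),
                   key=lambda p: distance(clusters[p[0]], clusters[p[1]]))
        clusters[i] = clusters[i] + clusters[j]  # merge j into i (i < j)
        del clusters[j]
    return clusters


def reduce_document(term_counts, clusters, cluster_weights):
    """Map a bag-of-words dict to one weighted feature per term cluster."""
    return [w * sum(term_counts.get(t, 0) for t in cluster)
            for cluster, w in zip(clusters, cluster_weights)]


# Toy data, made up for this example: each term is described by a
# 2-dimensional vector (for instance, its relevance to two classes).
term_vectors = {
    "refund": [0.9, 0.1],
    "payment": [0.8, 0.2],
    "invoice": [0.85, 0.15],
    "router": [0.1, 0.9],
    "modem": [0.2, 0.8],
}

clusters = cluster_terms(term_vectors, n_clusters=2)
weights = [1.0] * len(clusters)  # placeholder; the paper optimizes these
document = {"refund": 2, "invoice": 1, "modem": 1}
features = reduce_document(document, clusters, weights)
```

In this toy case five term dimensions collapse into two cluster features, which is the kind of reduction the abstract describes; how much is gained on real corpora depends on the term representation and the number of clusters chosen.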



Paper Citation


in Harvard Style

Gasanova T., Sergienko R., Semenkin E. and Minker W. (2014). Dimension Reduction with Coevolutionary Genetic Algorithm for Text Classification. In Proceedings of the 11th International Conference on Informatics in Control, Automation and Robotics - Volume 1: ICINCO, ISBN 978-989-758-039-0, pages 215-222. DOI: 10.5220/0005020702150222


in Bibtex Style

@conference{icinco14,
author={Tatiana Gasanova and Roman Sergienko and Eugene Semenkin and Wolfgang Minker},
title={Dimension Reduction with Coevolutionary Genetic Algorithm for Text Classification},
booktitle={Proceedings of the 11th International Conference on Informatics in Control, Automation and Robotics - Volume 1: ICINCO},
year={2014},
pages={215-222},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0005020702150222},
isbn={978-989-758-039-0},
}


in EndNote Style

TY - CONF
JO - Proceedings of the 11th International Conference on Informatics in Control, Automation and Robotics - Volume 1: ICINCO
TI - Dimension Reduction with Coevolutionary Genetic Algorithm for Text Classification
SN - 978-989-758-039-0
AU - Gasanova T.
AU - Sergienko R.
AU - Semenkin E.
AU - Minker W.
PY - 2014
SP - 215
EP - 222
DO - 10.5220/0005020702150222