Arabic Text Classification using Bag-of-Concepts Representation

Alaa Alahmadi, Arash Joorabchi, Abdulhussain E. Mahdi

Abstract

With the exponential growth of Arabic text in digital form, the need to organize, navigate and browse large collections of Arabic documents efficiently has increased. Text Classification (TC) is one of the important subfields of data mining. The Bag-of-Words (BOW) representation model, the traditional way to represent text for TC, takes into account only the frequency with which each term occurs in a document. It therefore ignores important semantic relationships between terms and treats synonymous words as unrelated features. To address this problem, this paper describes the application of a Bag-of-Concepts (BOC) text representation model to Arabic text. The proposed model uses the Arabic Wikipedia as a knowledge base for concept detection. The BOC model is used to generate a Vector Space Model, which in turn is fed into a classifier to categorize a collection of Arabic text documents. Two machine-learning classifiers have been deployed to evaluate the effectiveness of the proposed model in comparison with the traditional BOW model. The results of our experiment show that the proposed BOC model achieves higher classification accuracy than the BOW model.
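The difference between the two representations can be illustrated with a small Python sketch. It is not the authors' implementation: the `CONCEPTS` lookup table is a hypothetical stand-in for concept detection against the Arabic Wikipedia, the English tokens are toy data, and cosine similarity is used only to make the effect visible. Under these assumptions, two documents that use different words for the same ideas share almost nothing as bags of words but become identical as bags of concepts.

```python
from collections import Counter
import math

# Hypothetical term-to-concept lookup (an assumption for illustration only).
# In the paper this role is played by concept detection against the Arabic Wikipedia.
CONCEPTS = {
    "car": "Automobile",
    "automobile": "Automobile",
    "doctor": "Physician",
    "physician": "Physician",
}


def bag_of_words(tokens):
    """Count raw term occurrences; synonyms stay separate features."""
    return Counter(tokens)


def bag_of_concepts(tokens):
    """Count detected concepts; synonyms collapse into one feature."""
    return Counter(CONCEPTS[t] for t in tokens if t in CONCEPTS)


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values()) * sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


doc1 = "the doctor bought a car".split()
doc2 = "a physician sold an automobile".split()

bow_sim = cosine(bag_of_words(doc1), bag_of_words(doc2))
boc_sim = cosine(bag_of_concepts(doc1), bag_of_concepts(doc2))

print(f"BOW similarity: {bow_sim:.2f}")
print(f"BOC similarity: {boc_sim:.2f}")
```

In this toy case the only term the two documents share is "a", so the BOW vectors are nearly orthogonal, while the BOC vectors coincide. The resulting count vectors are the kind of Vector Space Model rows that would then be passed to a classifier.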

Paper Citation


in Harvard Style

Alahmadi A., Joorabchi A. and Mahdi A. E. (2014). Arabic Text Classification using Bag-of-Concepts Representation. In Proceedings of the International Conference on Knowledge Discovery and Information Retrieval - Volume 1: KDIR, (IC3K 2014) ISBN 978-989-758-048-2, pages 374-380. DOI: 10.5220/0005138103740380


in Bibtex Style

@conference{kdir14,
author={Alaa Alahmadi and Arash Joorabchi and Abdulhussain E. Mahdi},
title={Arabic Text Classification using Bag-of-Concepts Representation},
booktitle={Proceedings of the International Conference on Knowledge Discovery and Information Retrieval - Volume 1: KDIR, (IC3K 2014)},
year={2014},
pages={374-380},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0005138103740380},
isbn={978-989-758-048-2},
}


in EndNote Style

TY - CONF
JO - Proceedings of the International Conference on Knowledge Discovery and Information Retrieval - Volume 1: KDIR, (IC3K 2014)
TI - Arabic Text Classification using Bag-of-Concepts Representation
SN - 978-989-758-048-2
AU - Alahmadi A.
AU - Joorabchi A.
AU - Mahdi A. E.
PY - 2014
SP - 374
EP - 380
DO - 10.5220/0005138103740380