INITIAL EXPERIMENTS WITH EXTRACTION OF STOPWORDS IN HEBREW

Yaakov HaCohen-Kerner, Shmuel Yishai Blitz

2010

Abstract

Stopwords are regarded as meaningless in terms of information retrieval. Various stopword lists have been constructed for English and a few other languages. However, to the best of our knowledge, no stopword list has been constructed for Hebrew. In this ongoing work, we present an implementation of three baseline methods that attempt to extract stopwords from a data set of Israeli daily news. Two of the methods are state-of-the-art methods previously applied to other languages, and the third is proposed by the authors. A comparison of the behavior of these three methods with Zipf's law shows that Zipf's law successfully describes the distribution of the top-occurring words according to these methods.
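
The abstract does not name the three methods, so the following is only a minimal sketch of the general idea, not the authors' implementation. It assumes a plain term-frequency ranking as a stand-in baseline and compares the observed counts of the top-ranked words with an idealized Zipf's law, f(r) = f(1) / r. The function names, the whitespace tokenization, and the toy corpus are all hypothetical.

```python
from collections import Counter


def rank_by_frequency(documents):
    """Return (word, count) pairs sorted by descending count, ties broken alphabetically."""
    counts = Counter()
    for text in documents:
        # Assumption: whitespace tokenization; real Hebrew text would need
        # proper tokenization and handling of prefixes and punctuation.
        counts.update(text.split())
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


def zipf_expected(ranked, top_k):
    """Expected counts under an idealized Zipf's law: f(r) = f(1) / r."""
    top_count = ranked[0][1]
    limit = min(top_k, len(ranked))
    return [top_count / rank for rank in range(1, limit + 1)]


# Toy corpus (hypothetical), standing in for a collection of news articles.
toy_documents = [
    "של את של על של את",
    "של על את של",
    "ספר של את",
]

ranked = rank_by_frequency(toy_documents)
expected = zipf_expected(ranked, top_k=3)

# Print each top-ranked word with its observed count and the Zipf prediction.
for (word, observed), predicted in zip(ranked, expected):
    print(word, observed, round(predicted, 2))
```

On a real corpus, the top-ranked words would be the stopword candidates, and the gap between the observed and predicted counts gives a rough indication of how closely the ranking follows Zipf's law.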


Paper Citation


in Harvard Style

HaCohen-Kerner Y. and Yishai Blitz S. (2010). INITIAL EXPERIMENTS WITH EXTRACTION OF STOPWORDS IN HEBREW. In Proceedings of the International Conference on Knowledge Discovery and Information Retrieval - Volume 1: KDIR, (IC3K 2010) ISBN 978-989-8425-28-7, pages 449-453. DOI: 10.5220/0003093104490453


in Bibtex Style

@conference{kdir10,
author={Yaakov HaCohen-Kerner and Shmuel Yishai Blitz},
title={INITIAL EXPERIMENTS WITH EXTRACTION OF STOPWORDS IN HEBREW},
booktitle={Proceedings of the International Conference on Knowledge Discovery and Information Retrieval - Volume 1: KDIR, (IC3K 2010)},
year={2010},
pages={449-453},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0003093104490453},
isbn={978-989-8425-28-7},
}


in EndNote Style

TY - CONF
JO - Proceedings of the International Conference on Knowledge Discovery and Information Retrieval - Volume 1: KDIR, (IC3K 2010)
TI - INITIAL EXPERIMENTS WITH EXTRACTION OF STOPWORDS IN HEBREW
SN - 978-989-8425-28-7
AU - HaCohen-Kerner Y.
AU - Yishai Blitz S.
PY - 2010
SP - 449
EP - 453
DO - 10.5220/0003093104490453