Stopwords Identification by Means of Characteristic and Discriminant Analysis

Giuliano Armano, Francesca Fanni, Alessandro Giuliani

Abstract

Stopwords are non-significant terms that occur frequently in a document and carry little meaning. They should be removed as noise. Traditionally, two approaches to building a stoplist have been used: the first considers the most frequent terms in a language (e.g., an English stoplist), while the second includes the most frequently occurring terms in a document collection. In several tasks, e.g., text classification and clustering, documents are typically grouped into categories. We propose a novel approach aimed at automatically identifying specific stopwords for each category. The proposal relies on two unbiased metrics that allow the informative content of each term to be analyzed: one measures its discriminant capability and the other its characteristic capability. For each term, the former is expected to be high when the term distinguishes a category from the others, whereas the latter is expected to be high when the term is frequent and common across all categories. A preliminary study and experiments have been performed to support this insight. Results confirm that, for each domain, the metrics easily identify specific stoplists, which include both classical and category-dependent stopwords.
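
The idea of scoring each term by a discriminant and a characteristic value can be illustrated with a minimal Python sketch. The abstract does not give the formulas, so the definitions used here are an assumption made for illustration only: the discriminant score is taken as the difference between the fraction of in-category and out-of-category documents containing the term, and the characteristic score as their sum minus one. The function names, the toy corpus, and the thresholds `phi_min` and `delta_max` are likewise hypothetical.

```python
def term_metrics(docs, category):
    """Return {term: (delta, phi)} for one category.

    docs is a list of (label, set_of_terms) pairs.
    ASSUMPTION (not stated in the abstract):
      delta = P(term | category) - P(term | other categories)
      phi   = P(term | category) + P(term | other categories) - 1
    """
    inside = [terms for label, terms in docs if label == category]
    outside = [terms for label, terms in docs if label != category]
    if not inside or not outside:
        raise ValueError("need documents both inside and outside the category")

    vocabulary = set().union(*inside, *outside)
    metrics = {}
    for term in vocabulary:
        p_in = sum(term in d for d in inside) / len(inside)
        p_out = sum(term in d for d in outside) / len(outside)
        metrics[term] = (p_in - p_out, p_in + p_out - 1)
    return metrics


def category_stopwords(docs, category, phi_min=0.8, delta_max=0.2):
    """Terms that are common everywhere (high phi) and barely discriminant
    (|delta| small). Both thresholds are illustrative, not from the paper."""
    metrics = term_metrics(docs, category)
    return sorted(
        term
        for term, (delta, phi) in metrics.items()
        if phi >= phi_min and abs(delta) <= delta_max
    )


# Hypothetical toy corpus: each document is reduced to its set of terms.
docs = [
    ("sport", {"the", "match", "goal"}),
    ("sport", {"the", "team", "goal"}),
    ("finance", {"the", "market", "stock"}),
    ("finance", {"the", "bank", "stock"}),
]

print(category_stopwords(docs, "sport"))
```

Under these assumed definitions, a term that appears in every document of every category scores high on the characteristic axis and near zero on the discriminant axis, which is the profile the abstract associates with stopwords. A term confined to one category scores high on the discriminant axis instead.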



Paper Citation


in Harvard Style

Armano G., Fanni F. and Giuliani A. (2015). Stopwords Identification by Means of Characteristic and Discriminant Analysis. In Proceedings of the International Conference on Agents and Artificial Intelligence - Volume 2: ICAART, ISBN 978-989-758-074-1, pages 353-360. DOI: 10.5220/0005194303530360


in Bibtex Style

@conference{icaart15,
author={Giuliano Armano and Francesca Fanni and Alessandro Giuliani},
title={Stopwords Identification by Means of Characteristic and Discriminant Analysis},
booktitle={Proceedings of the International Conference on Agents and Artificial Intelligence - Volume 2: ICAART},
year={2015},
pages={353-360},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0005194303530360},
isbn={978-989-758-074-1},
}


in EndNote Style

TY - CONF
JO - Proceedings of the International Conference on Agents and Artificial Intelligence - Volume 2: ICAART
TI - Stopwords Identification by Means of Characteristic and Discriminant Analysis
SN - 978-989-758-074-1
AU - Armano G.
AU - Fanni F.
AU - Giuliani A.
PY - 2015
SP - 353
EP - 360
DO - 10.5220/0005194303530360