Stopwords Identification by Means of Characteristic and Discriminant Analysis
Giuliano Armano, Francesca Fanni, Alessandro Giuliani
2015
Abstract
Stopwords are non-significant terms that occur frequently in a document and carry little meaning. Like noise, they should be removed. Traditionally, two approaches to building a stoplist have been used: the first takes the most frequent terms of a language (e.g., an English stoplist), while the second takes the most frequently occurring terms in a document collection. In several tasks, e.g., text classification and clustering, documents are typically grouped into categories. We propose a novel approach aimed at automatically identifying specific stopwords for each category. The proposal relies on two unbiased metrics that make it possible to analyze the informative content of each term: one measures its discriminant capability and the other its characteristic capability. For each term, the former is expected to be high when the term is able to distinguish a category from the others, whereas the latter is expected to be high when the term is frequent and common across all categories. A preliminary study and experiments have been performed, supporting this insight. Results confirm that, for each domain, the metrics easily identify specific stoplists, which include both classical and category-dependent stopwords.
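The abstract does not state formulas for the two metrics, so the following Python sketch is only an illustration under stated assumptions. It assumes that the discriminant score of a term is the difference between its document frequency inside and outside a category, and that the characteristic score is the sum of those two frequencies minus one. The function names, the thresholds, and the toy corpus are hypothetical and are not taken from the paper.

```python
def term_scores(docs, category):
    """Score every term against one category.

    docs: list of (label, set_of_terms) pairs.
    Returns {term: (delta, phi)} where, under the assumptions of this sketch,
      delta = p_in - p_out       (discriminant: separates the category)
      phi   = p_in + p_out - 1   (characteristic: common everywhere)
    p_in / p_out are the fractions of documents inside / outside the
    category that contain the term.
    """
    inside = [terms for label, terms in docs if label == category]
    outside = [terms for label, terms in docs if label != category]
    if not inside or not outside:
        raise ValueError("need documents both inside and outside the category")

    vocabulary = set().union(*(terms for _, terms in docs))
    scores = {}
    for term in vocabulary:
        p_in = sum(term in terms for terms in inside) / len(inside)
        p_out = sum(term in terms for terms in outside) / len(outside)
        scores[term] = (p_in - p_out, p_in + p_out - 1.0)
    return scores


def candidate_stopwords(scores, phi_min=0.5, delta_max=0.25):
    """Terms that are highly characteristic but barely discriminant.

    The thresholds are arbitrary illustration values, not tuned ones.
    """
    return sorted(
        term
        for term, (delta, phi) in scores.items()
        if phi >= phi_min and abs(delta) <= delta_max
    )


# Hypothetical toy corpus: two categories, two documents each.
docs = [
    ("sport", {"the", "match", "team"}),
    ("sport", {"the", "team", "goal"}),
    ("tech", {"the", "code", "bug"}),
    ("tech", {"the", "code", "team"}),
]

scores = term_scores(docs, "sport")
print(candidate_stopwords(scores))
```

In this toy corpus, "the" appears in every document, so it scores 0 on the discriminant axis and 1 on the characteristic axis and is flagged as a candidate stopword. A term such as "team" is frequent as well but still leans toward one category, so the discriminant threshold keeps it out of the list.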
Paper Citation
in Harvard Style
Armano G., Fanni F. and Giuliani A. (2015). Stopwords Identification by Means of Characteristic and Discriminant Analysis. In Proceedings of the International Conference on Agents and Artificial Intelligence - Volume 2: ICAART, ISBN 978-989-758-074-1, pages 353-360. DOI: 10.5220/0005194303530360
in Bibtex Style
@conference{icaart15,
author={Giuliano Armano and Francesca Fanni and Alessandro Giuliani},
title={Stopwords Identification by Means of Characteristic and Discriminant Analysis},
booktitle={Proceedings of the International Conference on Agents and Artificial Intelligence - Volume 2: ICAART},
year={2015},
pages={353-360},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0005194303530360},
isbn={978-989-758-074-1},
}
in EndNote Style
TY - CONF
JO - Proceedings of the International Conference on Agents and Artificial Intelligence - Volume 2: ICAART
TI - Stopwords Identification by Means of Characteristic and Discriminant Analysis
SN - 978-989-758-074-1
AU - Armano G.
AU - Fanni F.
AU - Giuliani A.
PY - 2015
SP - 353
EP - 360
DO - 10.5220/0005194303530360