An N-gram Based Distributional Test for Authorship Identification

Kostas Fragos, Christos Skourlas

Abstract

In this paper, a novel method for the authorship identification problem is presented. Based on character-level text segmentation, we study the distributions of the disputed text’s N-grams within the authors’ text collections. The distribution that behaves most abnormally is identified using the Kolmogorov-Smirnov test, and the corresponding author is selected as the correct one. Our method is evaluated on the test sets of the 2004 ALLC/ACH Ad-hoc Authorship Attribution Competition, and its performance is comparable with the best results achieved by the competition’s participants. The main advantage of our method is that it is a simple, non-parametric approach to authorship attribution that does not require building authors’ profiles from training data. Moreover, the method is language independent and does not require word segmentation for languages such as Chinese or Thai. There is also no need for text pre-processing or higher-level processing, thus avoiding the use of taggers, parsers, feature selection strategies, or other language-dependent NLP tools.
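
The abstract does not spell out the procedure, so the following Python fragment is only a minimal sketch of the general idea, not the authors' implementation. It assumes that each candidate author is represented by the relative frequencies of the disputed text's character N-grams inside that author's collection, and that "most abnormal" means the largest two-sample Kolmogorov-Smirnov statistic against the pooled samples of the other authors. The function names (`char_ngrams`, `ks_statistic`, `relative_frequencies`, `most_deviant_author`), the default `n=3`, and the toy texts are all hypothetical.

```python
from bisect import bisect_right
from collections import Counter


def char_ngrams(text, n):
    """Return all overlapping character n-grams of text (no tokenization)."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def ks_statistic(sample_a, sample_b):
    """Two-sample Kolmogorov-Smirnov statistic: largest gap between empirical CDFs."""
    a, b = sorted(sample_a), sorted(sample_b)
    points = sorted(set(a) | set(b))
    return max(
        abs(bisect_right(a, x) / len(a) - bisect_right(b, x) / len(b))
        for x in points
    )


def relative_frequencies(disputed_grams, collection_text, n):
    """Relative frequency of each disputed n-gram type in one author's collection."""
    counts = Counter(char_ngrams(collection_text, n))
    total = sum(counts.values()) or 1  # guard against collections shorter than n
    return [counts[g] / total for g in disputed_grams]


def most_deviant_author(disputed, collections, n=3):
    """Assumed decision rule: pick the author whose sample deviates most from the rest.

    Requires at least two authors and a disputed text of at least n characters.
    """
    grams = sorted(set(char_ngrams(disputed, n)))
    samples = {
        author: relative_frequencies(grams, text, n)
        for author, text in collections.items()
    }
    scores = {}
    for author, sample in samples.items():
        # Pool the samples of every other candidate author.
        rest = [v for other, s in samples.items() if other != author for v in s]
        scores[author] = ks_statistic(sample, rest)
    return max(scores, key=scores.get), scores


if __name__ == "__main__":
    # Toy collections, invented purely for illustration.
    collections = {
        "A": "the cat sat on the mat and the hat",
        "B": "zzzz qqqq xxxx zzzz",
        "C": "wwww kkkk jjjj wwww",
    }
    author, scores = most_deviant_author("the cat sat on the mat", collections)
    print(author, scores)
```

Because the sketch works directly on raw character N-grams, it needs no tokenizer, tagger, or training step, which is the property the abstract emphasizes. A library routine such as SciPy's `scipy.stats.ks_2samp` could replace the hand-written `ks_statistic` if p-values are also wanted.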



Paper Citation


in Harvard Style

Fragos K. and Skourlas C. (2006). An N-gram Based Distributional Test for Authorship Identification. In Proceedings of the 3rd International Workshop on Natural Language Understanding and Cognitive Science - Volume 1: NLUCS, (ICEIS 2006) ISBN 978-972-8865-50-4, pages 139-148. DOI: 10.5220/0002474701390148


in Bibtex Style

@conference{nlucs06,
author={Kostas Fragos and Christos Skourlas},
title={An N-gram Based Distributional Test for Authorship Identification},
booktitle={Proceedings of the 3rd International Workshop on Natural Language Understanding and Cognitive Science - Volume 1: NLUCS, (ICEIS 2006)},
year={2006},
pages={139-148},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0002474701390148},
isbn={978-972-8865-50-4},
}


in EndNote Style

TY - CONF
JO - Proceedings of the 3rd International Workshop on Natural Language Understanding and Cognitive Science - Volume 1: NLUCS, (ICEIS 2006)
TI - An N-gram Based Distributional Test for Authorship Identification
SN - 978-972-8865-50-4
AU - Fragos K.
AU - Skourlas C.
PY - 2006
SP - 139
EP - 148
DO - 10.5220/0002474701390148