Authorship Attribution using Variable Length Part-of-Speech Patterns
Yao Jean Marc Pokou, Philippe Fournier-Viger, Chadia Moghrabi
2016
Abstract
Identifying the author of a book or document is a research topic with numerous real-life applications. A number of algorithms have been proposed for the automatic authorship attribution of texts. However, finding distinct and quantifiable features that accurately identify the author of a text, or at least narrow the range of likely authors, remains an important challenge. In this paper, we propose a novel approach for authorship attribution that relies on the discovery of variable-length sequential patterns of parts of speech to build signatures representing each author's writing style. An experimental evaluation was carried out using 30 books by 10 authors from Project Gutenberg, totaling 2,615,856 words. Results show that the proposed approach can accurately classify texts most of the time using a very small number of variable-length patterns. The approach is also shown to perform better with variable-length patterns than with fixed-length patterns (bigrams or trigrams).
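The idea of a part-of-speech signature can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not specify the mining algorithm or the classification rule, so the sketch assumes contiguous tag patterns, a top-k signature ranked by relative frequency, and a simple overlap score. The function names (pattern_frequencies, signature, similarity, attribute), the parameters k and max_len, and the toy tag sequences are all hypothetical.

```python
from collections import Counter
from typing import Dict, Sequence, Tuple

Pattern = Tuple[str, ...]


def pattern_frequencies(tags: Sequence[str], min_len: int = 1,
                        max_len: int = 3) -> Dict[Pattern, float]:
    """Relative frequency of each contiguous POS pattern of length min_len..max_len."""
    counts: Counter = Counter()
    for n in range(min_len, max_len + 1):
        for i in range(len(tags) - n + 1):
            counts[tuple(tags[i:i + n])] += 1
    total = sum(counts.values())
    return {p: c / total for p, c in counts.items()} if total else {}


def signature(tags: Sequence[str], k: int = 5,
              max_len: int = 3) -> Dict[Pattern, float]:
    """Keep the k most frequent patterns (ties broken by pattern order)."""
    freqs = pattern_frequencies(tags, max_len=max_len)
    top = sorted(freqs.items(), key=lambda kv: (-kv[1], kv[0]))[:k]
    return dict(top)


def similarity(sig_a: Dict[Pattern, float], sig_b: Dict[Pattern, float]) -> float:
    """Assumed overlap score: sum of the smaller frequency over shared patterns."""
    return sum(min(f, sig_b[p]) for p, f in sig_a.items() if p in sig_b)


def attribute(unknown_tags: Sequence[str],
              author_signatures: Dict[str, Dict[Pattern, float]],
              k: int = 5) -> str:
    """Return the author whose signature overlaps most with the unknown text."""
    unknown = signature(unknown_tags, k=k)
    return max(author_signatures,
               key=lambda a: similarity(unknown, author_signatures[a]))


# Toy, already-tagged sequences (Penn Treebank style tags), for illustration only.
author_a_tags = ["DT", "JJ", "NN", "VBZ", "DT", "JJ", "NN"] * 3
author_b_tags = ["PRP", "VBD", "RB", "PRP", "VBD", "RB", "IN"] * 3
unknown_tags = ["DT", "JJ", "NN", "VBD", "DT", "JJ", "NN"]

signatures = {
    "author_a": signature(author_a_tags),
    "author_b": signature(author_b_tags),
}
print(attribute(unknown_tags, signatures))  # prints "author_a"
```

Because patterns of several lengths compete for the same k slots, a signature can mix single tags with longer sequences, which is the sense in which the patterns are variable-length. A real system would obtain the tags from a part-of-speech tagger and would likely use a sequential pattern mining algorithm instead of exhaustive counting.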
Paper Citation
in Harvard Style
Pokou Y., Fournier-Viger P. and Moghrabi C. (2016). Authorship Attribution using Variable Length Part-of-Speech Patterns. In Proceedings of the 8th International Conference on Agents and Artificial Intelligence - Volume 2: ICAART, ISBN 978-989-758-172-4, pages 354-361. DOI: 10.5220/0005710103540361
in Bibtex Style
@conference{icaart16,
author={Yao Jean Marc Pokou and Philippe Fournier-Viger and Chadia Moghrabi},
title={Authorship Attribution using Variable Length Part-of-Speech Patterns},
booktitle={Proceedings of the 8th International Conference on Agents and Artificial Intelligence - Volume 2: ICAART},
year={2016},
pages={354-361},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0005710103540361},
isbn={978-989-758-172-4},
}
in EndNote Style
TY - CONF
JO - Proceedings of the 8th International Conference on Agents and Artificial Intelligence - Volume 2: ICAART
TI - Authorship Attribution using Variable Length Part-of-Speech Patterns
SN - 978-989-758-172-4
AU - Pokou Y.
AU - Fournier-Viger P.
AU - Moghrabi C.
PY - 2016
SP - 354
EP - 361
DO - 10.5220/0005710103540361