Efficient Social Network Multilingual Classification using Character, POS n-grams and Dynamic Normalization

Carlos-Emiliano González-Gallardo, Juan-Manuel Torres-Moreno, Azucena Montes Rendón, Gerardo Sierra

2016

Abstract

In this paper we describe a dynamic normalization process applied to multilingual social network documents (Facebook and Twitter) to improve performance on the author profiling task for short texts. After normalization, character n-grams and POS-tag n-grams are extracted to capture as much as possible of the stylistic information encoded in the documents (emoticons, character flooding, capital letters, references to other users, hyperlinks, hashtags, etc.). Experiments with an SVM classifier reached up to 90% performance.
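
As a rough illustration of the pipeline the abstract outlines, the sketch below replaces hyperlinks, user references and hashtags with placeholder tokens and then counts character n-grams. The abstract does not give the paper's actual normalization rules, so the placeholder tokens (`<url>`, `<user>`, `<tag>`), the regular expressions and the function names are assumptions made for illustration only. POS tagging and the SVM classifier are left out.

```python
import re
from collections import Counter

# Hypothetical patterns and placeholders: the paper's real normalization
# rules are not stated in the abstract, so these are illustrative only.
URL_RE = re.compile(r"https?://\S+")
USER_RE = re.compile(r"@\w+")
HASHTAG_RE = re.compile(r"#\w+")


def normalize(text: str) -> str:
    """Replace hyperlinks, user references and hashtags with placeholders."""
    # URLs go first so that '@' or '#' inside a link is not matched later.
    text = URL_RE.sub("<url>", text)
    text = USER_RE.sub("<user>", text)
    text = HASHTAG_RE.sub("<tag>", text)
    return text


def char_ngrams(text: str, n: int) -> Counter:
    """Count overlapping character n-grams, keeping case and punctuation."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


if __name__ == "__main__":
    message = "Sooooo HAPPY :) thanks @friend http://example.com #party"
    clean = normalize(message)
    print(clean)
    print(char_ngrams(clean, 3).most_common(5))
```

Case, punctuation and repeated characters are deliberately left untouched here, so that cues such as capital letters, emoticons and character flooding remain visible in the n-gram counts.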


Paper Citation


in Harvard Style

González-Gallardo C., Torres-Moreno J., Montes Rendón A. and Sierra G. (2016). Efficient Social Network Multilingual Classification using Character, POS n-grams and Dynamic Normalization. In Proceedings of the 8th International Joint Conference on Knowledge Discovery, Knowledge Engineering and Knowledge Management - Volume 1: KDIR, (IC3K 2016) ISBN 978-989-758-203-5, pages 307-314. DOI: 10.5220/0006052803070314


in Bibtex Style

@conference{kdir16,
author={Carlos-Emiliano González-Gallardo and Juan-Manuel Torres-Moreno and Azucena Montes Rendón and Gerardo Sierra},
title={Efficient Social Network Multilingual Classification using Character, POS n-grams and Dynamic Normalization},
booktitle={Proceedings of the 8th International Joint Conference on Knowledge Discovery, Knowledge Engineering and Knowledge Management - Volume 1: KDIR, (IC3K 2016)},
year={2016},
pages={307-314},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0006052803070314},
isbn={978-989-758-203-5},
}


in EndNote Style

TY - CONF
JO - Proceedings of the 8th International Joint Conference on Knowledge Discovery, Knowledge Engineering and Knowledge Management - Volume 1: KDIR, (IC3K 2016)
TI - Efficient Social Network Multilingual Classification using Character, POS n-grams and Dynamic Normalization
SN - 978-989-758-203-5
AU - González-Gallardo C.
AU - Torres-Moreno J.
AU - Montes Rendón A.
AU - Sierra G.
PY - 2016
SP - 307
EP - 314
DO - 10.5220/0006052803070314