The GENIE Project - A Semantic Pipeline for Automatic Document Categorisation

Angel L. Garrido, Maria G. Buey, Sandra Escudero, Alvaro Peiro, Sergio Ilarri, Eduardo Mena

Abstract

Automatic text categorisation systems are a type of software attracting growing interest, owing not only to their use in documentation environments but also to their potential application in properly tagging documents on the Web. Many approaches have been proposed to address this problem, using statistical methods, natural language processing tools, ontologies and lexical databases. Nevertheless, few empirical evaluations have compared the influence of the different tools used to solve these problems, particularly in a multilingual environment. In this paper we propose a multi-language rule-based pipeline system for automatic document categorisation and, using several corpora of documents, we empirically compare the results of applying techniques that rely on statistics and supervised learning with the results of applying the same techniques supported by smarter tools based on language semantics and ontologies. GENIE is being applied in real environments, which shows the potential of the proposal.


Paper Citation


in Harvard Style

Garrido A., Buey M., Escudero S., Peiro A., Ilarri S. and Mena E. (2014). The GENIE Project - A Semantic Pipeline for Automatic Document Categorisation. In Proceedings of the 10th International Conference on Web Information Systems and Technologies - Volume 2: WEBIST, ISBN 978-989-758-024-6, pages 161-171. DOI: 10.5220/0004750601610171


in Bibtex Style

@conference{webist14,
author={Angel L. Garrido and Maria G. Buey and Sandra Escudero and Alvaro Peiro and Sergio Ilarri and Eduardo Mena},
title={The GENIE Project - A Semantic Pipeline for Automatic Document Categorisation},
booktitle={Proceedings of the 10th International Conference on Web Information Systems and Technologies - Volume 2: WEBIST},
year={2014},
pages={161-171},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0004750601610171},
isbn={978-989-758-024-6},
}


in EndNote Style

TY - CONF
JO - Proceedings of the 10th International Conference on Web Information Systems and Technologies - Volume 2: WEBIST
TI - The GENIE Project - A Semantic Pipeline for Automatic Document Categorisation
SN - 978-989-758-024-6
AU - Garrido A.
AU - Buey M.
AU - Escudero S.
AU - Peiro A.
AU - Ilarri S.
AU - Mena E.
PY - 2014
SP - 161
EP - 171
DO - 10.5220/0004750601610171