ONTOLOGY-DRIVEN CONCEPTUAL DOCUMENT CLASSIFICATION
Gordana Pavlović-Lažetić, Jelena Graovac
2010
Abstract
Document classification based on the lexical-semantic network WordNet is presented. Two types of document classification in Serbian were tested: classification based on chosen concepts from the Serbian WordNet (SWN), and proper-names-based classification. Conceptual document classification criteria are constructed from hierarchies rooted in a set of chosen concepts (first case) or from hierarchies rooted in some of the proper names' hypernyms (second case). A classifier of the first type is trained and then tested on the indexed and already classified Ebart corpus of Serbian newspapers (476,917 articles). Precision, recall, and F-measure show that this type of classification is promising although incomplete, mainly because of SWN incompleteness. For proper-names-based classification, a proper names ontology based on the SWN is presented. A distance-based similarity measure is defined, based on Euclidean and Manhattan distances. Classification of a subset of the Contemporary Serbian Language Corpus is presented.
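The abstract names its building blocks (Euclidean and Manhattan distances; precision, recall, and F-measure) without giving formulas. The sketch below shows only the textbook definitions of those quantities. It is not the paper's similarity measure, whose exact form the abstract does not state. The function names, and the assumption that documents and categories are represented as equal-length numeric vectors (for example, concept frequencies), are illustrative choices, not taken from the source.

```python
import math


def euclidean(u, v):
    """Standard Euclidean (L2) distance between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def manhattan(u, v):
    """Standard Manhattan (L1) distance between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    return sum(abs(a - b) for a, b in zip(u, v))


def precision_recall_f(tp, fp, fn):
    """Precision, recall, and balanced F-measure from raw counts.

    tp: documents correctly assigned to a category
    fp: documents wrongly assigned to it
    fn: documents of the category that were missed
    Returns 0.0 for any quantity whose denominator is zero.
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure
```

A document vector would be compared against each category vector with `euclidean` or `manhattan`, with a smaller distance read as higher similarity. How the paper combines the two distances into one measure is left open here.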
Paper Citation
in EndNote Style
TY - CONF
JO - Proceedings of the International Conference on Knowledge Discovery and Information Retrieval - Volume 1: KDIR, (IC3K 2010)
TI - ONTOLOGY-DRIVEN CONCEPTUAL DOCUMENT CLASSIFICATION
SN - 978-989-8425-28-7
AU - Pavlović-Lažetić G.
AU - Graovac J.
PY - 2010
SP - 383
EP - 386
DO - 10.5220/0003063903830386
in Harvard Style
Pavlović-Lažetić G. and Graovac J. (2010). ONTOLOGY-DRIVEN CONCEPTUAL DOCUMENT CLASSIFICATION. In Proceedings of the International Conference on Knowledge Discovery and Information Retrieval - Volume 1: KDIR, (IC3K 2010), ISBN 978-989-8425-28-7, pages 383-386. DOI: 10.5220/0003063903830386
in Bibtex Style
@conference{kdir10,
author={Gordana Pavlović-Lažetić and Jelena Graovac},
title={ONTOLOGY-DRIVEN CONCEPTUAL DOCUMENT CLASSIFICATION},
booktitle={Proceedings of the International Conference on Knowledge Discovery and Information Retrieval - Volume 1: KDIR, (IC3K 2010)},
year={2010},
pages={383-386},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0003063903830386},
isbn={978-989-8425-28-7},
}