ing to the class "država", the top 10 belong to daily
and weekly newspapers ("Politika" and "Nin").
5 CONCLUSIONS
Document classification may benefit from both common
and proper names. Proper names are especially suitable
for subclassifying documents that have already been
assigned to general classes such as news, articles,
science, politics, or finance. Two types of
classification based on the lexical-semantic network
WordNet are presented. Classification based on WordNet
basic concepts (common words) proved quite successful
in classifying a large newspaper archive into several
classes. A distance-based classification driven by an
ontology of proper names is then presented, where the
conceptual hierarchies of proper names follow the
structure of the Serbian WordNet. Results of this
classification applied to an untagged part of the
corpus of contemporary Serbian, of around 23 million
words, are presented.
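As an illustration only, the idea of distance-based classification over a concept hierarchy can be sketched as follows. This is a minimal sketch under stated assumptions, not the implementation used in this work: the toy hierarchy, the name-to-concept mapping, the edge-count distance, and the function names (`tree_distance`, `classify`) are all hypothetical and are not taken from the Serbian WordNet or from the ontology of proper names.

```python
from collections import Counter

# Hypothetical toy hierarchy (child -> parent); NOT taken from the Serbian WordNet.
PARENT = {
    "entity": None,
    "location": "entity",
    "organization": "entity",
    "country": "location",
    "city": "location",
    "newspaper": "organization",
    "bank": "organization",
}

# Hypothetical mapping from proper names to concepts in the toy hierarchy.
NAME_TO_CONCEPT = {
    "Srbija": "country",
    "Beograd": "city",
    "Politika": "newspaper",
    "Nin": "newspaper",
}


def path_to_root(concept):
    """Return the list of concepts from `concept` up to the root."""
    path = []
    while concept is not None:
        path.append(concept)
        concept = PARENT[concept]
    return path


def tree_distance(a, b):
    """Number of edges on the path between two concepts (assumed measure)."""
    depth_a = {c: i for i, c in enumerate(path_to_root(a))}
    for steps_b, c in enumerate(path_to_root(b)):
        if c in depth_a:
            return depth_a[c] + steps_b
    raise ValueError("concepts do not share a root")


def classify(tokens, class_concepts):
    """Pick the class concept with the smallest mean distance to the names found.

    Returns None when the text contains no known proper name.
    """
    counts = Counter(NAME_TO_CONCEPT[t] for t in tokens if t in NAME_TO_CONCEPT)
    total = sum(counts.values())
    if total == 0:
        return None

    def mean_distance(cls):
        return sum(n * tree_distance(c, cls) for c, n in counts.items()) / total

    return min(class_concepts, key=mean_distance)


if __name__ == "__main__":
    doc = ["Politika", "Nin", "Beograd"]
    print(classify(doc, ["location", "organization"]))
```

Other distance measures (for example, ones weighted by depth in the hierarchy) could be substituted for `tree_distance` without changing the rest of the sketch.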
Our future plans involve improving the proposed
classification framework and its implementation, in
particular enlarging SWN and adding lookup in EWN, as
well as developing new classification methods based on
character and word patterns in texts. We also plan to
define new distance measures and to perform a multi-way
comparison of different classification methods applied
to different types of corpora.
ACKNOWLEDGEMENTS
The work presented has been financially supported by
the Ministry of Science and Technological Development
of the Republic of Serbia, Project No. 148921A.
KDIR 2010 - International Conference on Knowledge Discovery and Information Retrieval