ing to the class "država", the top 10 belong to daily
and weekly newspapers ("Politika" and "Nin").
Document classification may benefit from both com-
mon and proper names. Proper names are especially
suitable when applied for subclassification of docu-
ments already classified into common classes such
as news, articles, science, politics, finance, etc. Two
types of classification based on the lexical-semantic
network WordNet are presented. Classification based on
WordNet basic concepts (common words) proved quite
successful in classifying a large newspaper archive
into several classes. A distance-based classification
driven by an ontology of proper names is then pre-
sented, where conceptual hierarchies of proper names
follow the structure of the Serbian WordNet. Results
of classification applied to an untagged part of the cor-
pus of contemporary Serbian, of around 23 million
words, are presented.
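
As an illustration of the distance-based idea summarized above, the following is a minimal sketch only, not the method as implemented in this work. It assumes a toy hierarchy of proper names (invented for the example, not taken from the Serbian WordNet), uses the number of edges through the lowest common ancestor as the distance, and assigns a document to the class concept with the smallest mean distance to the proper names found in it. The names `PARENT`, `tree_distance` and `classify` are hypothetical.

```python
# Toy hierarchy (hypothetical, NOT taken from the Serbian WordNet): child -> parent.
PARENT = {
    "Beograd": "city",
    "Novi Sad": "city",
    "city": "location",
    "Dunav": "river",
    "river": "location",
    "Politika": "newspaper",
    "newspaper": "organization",
    "location": "entity",
    "organization": "entity",
}


def path_to_root(node):
    """Return the nodes from `node` up to the root, inclusive."""
    path = [node]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path


def tree_distance(a, b):
    """Edges on the path between a and b via their lowest common ancestor.

    This edge-count measure is an assumption made for the sketch.
    """
    depth_from_a = {n: i for i, n in enumerate(path_to_root(a))}
    for j, n in enumerate(path_to_root(b)):
        if n in depth_from_a:
            return depth_from_a[n] + j
    raise ValueError("nodes are not in the same hierarchy")


def classify(names, class_concepts):
    """Pick the class concept with the smallest mean distance to the known names."""
    known = [n for n in names if n in PARENT]
    if not known:
        return None  # no proper name from the hierarchy occurs in the document
    return min(
        class_concepts,
        key=lambda c: sum(tree_distance(n, c) for n in known) / len(known),
    )


if __name__ == "__main__":
    doc_names = ["Beograd", "Dunav", "Politika"]
    print(classify(doc_names, ["location", "organization"]))
```

In this toy setting a document mentioning two place names and one newspaper is assigned to the "location" concept, because the place names lie closer to it in the hierarchy than to "organization".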
Our future plans involve improvement of the proposed
classification framework and its implementation
(enlargement of SWN and lookup in EWN), as
well as development of new classification methods
based on character and word patterns in texts. We
also plan to define new distance measures and to per-
form a multi-way comparison of different classifica-
tion methods applied to different types of corpora.

The work presented has been financially supported by
the Ministry of Science and Technological Development
of the Republic of Serbia, Project No. 148921A.
