ONTOLOGY-DRIVEN CONCEPTUAL DOCUMENT CLASSIFICATION

Gordana Pavlović-Lažetić and Jelena Graovac

Faculty of Mathematics, University of Belgrade, Studentski trg 16, Belgrade, Serbia

Keywords:

Document classification, Wordnet, SWN, Ontology, Proper name.

Abstract:

Document classification based on the lexical-semantic network wordnet is presented. Two types of document classification in Serbian have been experimented with: classification based on chosen concepts from the Serbian WordNet (SWN), and proper names-based classification. Conceptual document classification criteria are constructed from hierarchies rooted in a set of chosen concepts (first case) or from hierarchies rooted in some of the proper names' hypernyms (second case). A classifier of the first type is trained and then tested on an indexed and already classified Ebart corpus of Serbian newspapers (476917 articles). Precision, recall and F-measure show that this type of classification is promising, although incomplete, mainly because of the incompleteness of the SWN. In the context of proper names-based classification, a proper names ontology based on the SWN is presented. A distance-based similarity measure is defined, based on Euclidean and Manhattan distances. Classification of a subset of the Contemporary Serbian Language Corpus is presented.

1 INTRODUCTION

Different document processing tasks, such as document retrieval, web page search, information extraction and the like, may benefit greatly from conceptual document classification. For example, a search for the term Moon will be more efficient if conducted in a class corresponding to celestial bodies than in the whole set of documents, thus increasing the precision of the search. On the other hand, in order to search for celestial bodies, it is reasonable to expand the conceptual hierarchy and to search for Moon as well, thus increasing the recall.

There are different pre-specified classification schemes, e.g., EAGLES Guidelines (EAGLES, 1996), Library of Congress Classification (LCC, 2009), Reuters news archive (Reuters, 2010). Ebart (Ebart, 2010) is the largest digital media documentation in Serbia, with more than 1,200,000 completely indexed texts from fifteen daily and weekly newspapers, collected since 2003. It is classified in the usual way into internal politics, foreign politics, society, economics, chronicle and crime, culture and entertainment, sport, media, etc.

This paper proposes a document classification in Serbian based on wordnet and distance-based algorithms. Wordnet is a semantic lexical resource that addresses the issue of conceptual hierarchies (Miller, 1995) and is under development for Serbian (SWN, Serbian WordNet (Krstev et al., 2004)). Wordnet is a de facto standard for semantic networks, and it has so far been used in document classification mainly as a source of synonyms (Scott and Matwin, 1998), (Rosso et al., 2004), (Rodriguez et al., 1996).

As the first experiment, classification has been applied to a part of the Ebart corpus (articles from the Serbian daily newspaper "Politika"). Each class is defined by assigning to it a set of concepts from the wordnet, together with all of their hyponyms. An article is assigned to the class for which it contains the largest number of occurrences (counted with repetitions) of concepts assigned to that class. The results have been validated using the statistical measures of precision and recall and their combination, the F-measure.

As the second experiment, classification based on an ontology of proper names for a predefined set of classes has been applied to a subset of the corpus of contemporary Serbian language developed at the Faculty of Mathematics, University of Belgrade. Distance-based similarity measures are defined that assign each document to the class at the lowest distance, taking into account the ratio of the number of occurrences of proper names from the corresponding conceptual hierarchy in the document to their number in the whole collection.

Pavlović-Lažetić G. and Graovac J.

ONTOLOGY-DRIVEN CONCEPTUAL DOCUMENT CLASSIFICATION.

DOI: 10.5220/0003063903830386

In Proceedings of the International Conference on Knowledge Discovery and Information Retrieval (KDIR-2010), pages 383-386

ISBN: 978-989-8425-28-7

© 2010 SCITEPRESS (Science and Technology Publications, Lda.)

2 LEXICAL RESOURCES FOR CONCEPTUAL DOCUMENT CLASSIFICATION IN SERBIAN

Lexical resources for Serbian have been developed within the Human Language Technologies Group at the Faculty of Mathematics, University of Belgrade (HLTG, 2010). Besides the Serbian language corpus and multilingual parallel corpora, especially important are the system of electronic morphological dictionaries of Serbian and the lexical-semantic network Serbian WordNet, SWN (Vitas et al., 2003).

The Serbian WordNet was initiated within BalkaNet (BWN), the Balkan wordnet project (2002-2004), which aimed at producing a multilingual database with wordnets for five Balkan languages (Greek, Turkish, Bulgarian, Romanian and Serbian) as well as Czech (Tufis et al., 2004). BWN is based on the model of EuroWordNet (EWN), a multilingual database with wordnets for Dutch, Italian, Spanish, German, French, Czech and Estonian (Vossen, 1998). The structure of these wordnets is basically the same as the structure of the Princeton WordNet (PWN) for English (Fellbaum, 1998). A wordnet consists of synsets, sets of synonymous words each representing a concept, with basic semantic relations between them forming a semantic network.

At the moment, SWN consists of around 15000 synsets. There are 9 noun hierarchies (top ontologies) rooted at the generic concepts "abstraction", "act", "event", "entity", "phenomenon", "group", "psychological feature", "state" and "possession". The hierarchies have different depths (up to 12 levels). Somewhere in the middle of these hierarchies there are concepts that are neither too general nor too specific. Those concepts are called "basic concepts" and they are used for classification.

Figure 1 represents the conceptual hierarchy having the proper name "Zeus" at its leaf, as well as all the other proper names that are hyponyms of Zeus's hypernym "Greek deity" (other Greek deities). If "Greek deity" is used as the basic concept describing a document class, then all of these proper names, the names of Greek deities, will be part of the class description.
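The expansion of a basic concept into a class description can be sketched in a few lines. This is only an illustration under stated assumptions: the `HYPONYMS` map is a hand-written toy hierarchy (not an excerpt from the SWN), and the function name `expand` is hypothetical.

```python
from collections import deque

# Toy hyponym map (concept -> direct hyponyms). The entries are illustrative
# placeholders and are NOT taken from the actual SWN.
HYPONYMS = {
    "deity": ["Greek deity"],
    "Greek deity": ["Zeus", "Apollo", "Athena"],
}


def expand(concept, hyponyms):
    """Return the concept together with all of its direct and indirect hyponyms."""
    seen = {concept}
    queue = deque([concept])
    while queue:
        current = queue.popleft()
        for child in hyponyms.get(current, []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen


# Class description for a class built on the basic concept "Greek deity".
class_description = expand("Greek deity", HYPONYMS)
```

The traversal is breadth-first and only follows hyponym links downwards, so concepts above the chosen basic concept (here "deity") never enter the class description.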

3 DOCUMENT CLASSIFICATION BASED ON CHOSEN CONCEPTS

The first experiment has been performed on a subset of the Ebart corpus consisting of articles in the following columns: sport, economics, politics, culture and entertainment, chronicle and crime, published from 2003 to 2006. There are 476917 such articles. The classification process proceeds as follows:

1. the chosen columns (article types) are assigned to a predefined set of classes (we experimented with 3, 4 and 5 classes);

2. key words for each column and each class are identified as the most frequent words in a set of articles from the given column / class (training set);

3. SWN concepts containing the chosen key words, along with all of their hyponyms, are assigned to the corresponding classes;

4. class assignment functions are defined for an article (from the test set) in different ways, the simplest being the maximum number of occurrences of literals from the hierarchy rooted in the concepts assigned to the class, possibly filtered by domains.
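The simplest class assignment function from step 4 can be sketched as follows. This is a minimal illustration, not the implementation used in the experiment: the `sport` literals are the English translations listed for that class, the `economics` literals are invented placeholders, and plain tokenization stands in for the lemmatization that inflected Serbian text would need.

```python
import re
from collections import Counter

# Literals per class. "sport" uses the translated key words given for that
# class; the "economics" literals are hypothetical placeholders.
CLASS_LITERALS = {
    "sport": {"contest", "competition", "team", "result",
              "victory", "triumph", "defeat", "sport"},
    "economics": {"market", "price", "bank"},
}


def assign_class(text, class_literals):
    """Pick the class whose literals occur most often in the text.

    Occurrences are counted with repetitions. Returns (label, scores);
    label is None when no literal of any class occurs. Ties are resolved
    in favour of the class listed first.
    """
    tokens = Counter(re.findall(r"\w+", text.lower()))
    scores = {
        label: sum(tokens[word] for word in literals)
        for label, literals in class_literals.items()
    }
    best = max(scores, key=scores.get)
    return (best if scores[best] > 0 else None), scores


article = ("The team celebrated a victory after the contest; "
           "the bank sponsored the team.")
label, scores = assign_class(article, CLASS_LITERALS)
```

Here `team` is counted twice, so the article scores 4 for `sport` against 1 for `economics`. Filtering by domains, mentioned in step 4, is left out of this sketch.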

The chosen columns have been assigned the following SWN concepts and key words (translated into English):

SPORT

event → social event → contest, competition

group → social group → team

event → ... → result, victory, triumph, defeat

act → activity → recreation → sport

and similarly for the classes ECONOMICS, POLITICS, CULTURE AND ENTERTAINMENT and CHRONICLE AND CRIME.

Figure 1: Proper names that are in the same conceptual hierarchy (Greek deity) as the given proper name (Zeus).

3.1 Results

Table 1 shows the results of the experiment: precision, recall and F-measure for classification into three classes (classification into four and five classes was evaluated in the same way).

KDIR 2010 - International Conference on Knowledge Discovery and Information Retrieval


Table 1: Results of 3-classification.

class       precision   recall    F-measure
sport       97.30%      87.21%    91.98%
politics    91.27%      76.20%    83.06%
economics   59.19%      93.57%    72.51%
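The F-measure column is consistent with the balanced F-measure, i.e. the harmonic mean of precision and recall. The short check below assumes that definition (the text does not state which weighting was used) and recomputes the last column from the first two; the names `f_measure` and `table1` are illustrative.

```python
def f_measure(precision, recall):
    """Balanced F-measure: the harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall) in percent, as reported in Table 1.
table1 = {
    "sport": (97.30, 87.21),
    "politics": (91.27, 76.20),
    "economics": (59.19, 93.57),
}

for label, (p, r) in table1.items():
    print(f"{label}: {f_measure(p, r):.2f}%")
```

Rounded to two decimals, the computed values match the reported 91.98%, 83.06% and 72.51%.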

It may be noticed that the source of the texts does not favor this approach, since the daily press is characterized by concise and factual expression. The approach is expected to produce even better results on corpora with a rich vocabulary.

Instead of chosen concepts, document classification may also be based on the most productive concepts (Tomašević and Pavlović-Lažetić, 2008) in each of the wordnet top ontology hierarchies.

4 PROPER NAME ONTOLOGY-BASED CLASSIFICATION

Since document retrieval is usually performed on proper names, document classification may be based on an ontology of proper names. Proper names for document classification in Serbian may be extracted from the SWN.

Classification criteria may be constructed from the hierarchy rooted in the most productive concept that is in a hypernym relation with the proper name, the so-called proper name's ontology term. For example, for the proper name "Zemlja" (Eng. Earth), the third-level hypernym "nebesko telo" (Eng. celestial body) may be used as the proper name ontology term, and the hierarchy rooted at "nebesko telo" would constitute the basis for the definition of the corresponding class. Out of all the SWN synsets, about 10% contain proper names. Ontology terms are found at different levels of proper name hypernymy (usually the third or fourth).
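Looking up a hypernym at a given level can be sketched as below. This is only an illustration: the intermediate nodes `hypernym_1` and `hypernym_2` are hypothetical placeholders rather than SWN entries, and a fixed level stands in for the productivity criterion, which is not spelled out here.

```python
# Hypothetical hypernym links (word -> direct hypernym). Only the end points
# "Zemlja" and "nebesko telo" come from the text; the intermediate nodes are
# placeholders.
HYPERNYM = {
    "Zemlja": "hypernym_1",
    "hypernym_1": "hypernym_2",
    "hypernym_2": "nebesko telo",
}


def hypernym_at_level(name, level, hypernym):
    """Walk `level` steps up the hypernym chain; None if the chain is shorter."""
    current = name
    for _ in range(level):
        current = hypernym.get(current)
        if current is None:
            return None
    return current


ontology_term = hypernym_at_level("Zemlja", 3, HYPERNYM)
```

With this toy chain the third-level hypernym of "Zemlja" is "nebesko telo", and asking for a fourth level returns `None` because the chain ends there.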

Starting from the proper names in the present SWN noun hierarchies, we have come up with a proper names ontology schema consisting of 23 proper name ontology terms, e.g., entity/person, entity/inhabitant, entity/city, entity/celestial body, group/dynasty, group/organization, etc.

4.1 Classification Method

Classification of documents based on the proper names ontology is performed using a distance-based approach. There are as many classes as there are proper name ontology hierarchies considered; at the moment there are around 20. Each class $C_i \in C$ is defined by a tuple $t_i = (C_{i1}, C_{i2}, \ldots, C_{ik_i})$, where $C_{im}$ represents a proper name from the corresponding hierarchy. The tuple $t_i$ is a representative of the class $C_i$. Given a database $D = \{D_1, D_2, \ldots, D_n\}$ of documents, each document $D_j \in D$ is assigned to a class $C_i$ such that $\mathrm{dist}(D_j, C_i) \le \mathrm{dist}(D_j, C_l)$ for each $C_l \in C$, $C_l \ne C_i$.

Distance may be defined by different formulas for similarity (dissimilarity) between database documents or between a database document and a class. It also depends on how the numeric measure of similarity on a single attribute (proper name) is defined. We considered (and experimented with) variations of two document distance formulas, the Euclidean and Manhattan distance measures (Tan et al., 2006).
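A distance-based assignment of this kind can be sketched as follows. The exact formulas are not given here, so everything specific is an assumption made for illustration: each attribute is the ratio of a proper name's occurrences in the document to its occurrences in the whole collection, the class representative is an all-ones vector over the class's names, and distances are divided by the largest possible distance so that classes with tuples of different length are comparable. All proper names other than "Zemlja", as well as all counts, are invented.

```python
import math


def name_ratios(doc_counts, collection_counts, names):
    """Per-name ratio: occurrences in the document / occurrences in the collection."""
    return [
        doc_counts.get(n, 0) / collection_counts[n] if collection_counts.get(n, 0) else 0.0
        for n in names
    ]


def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def manhattan(x, y):
    return sum(abs(a - b) for a, b in zip(x, y))


def classify(doc_counts, collection_counts, class_names, dist):
    """Assign the document to the class with the lowest normalised distance."""
    distances = {}
    for label, names in class_names.items():
        x = name_ratios(doc_counts, collection_counts, names)
        ideal = [1.0] * len(names)   # assumed representative of the class
        zero = [0.0] * len(names)    # document containing none of the names
        # Assumed normalisation: divide by the largest possible distance.
        distances[label] = dist(x, ideal) / dist(zero, ideal)
    return min(distances, key=distances.get), distances


# Hypothetical class tuples and occurrence counts.
class_names = {
    "nebesko telo": ["Zemlja", "Mesec", "Mars"],
    "država": ["Srbija", "Grčka"],
}
collection_counts = {"Zemlja": 40, "Mesec": 20, "Mars": 10, "Srbija": 100, "Grčka": 50}
doc_counts = {"Zemlja": 8, "Mesec": 5, "Srbija": 2}

label_euclidean, _ = classify(doc_counts, collection_counts, class_names, euclidean)
label_manhattan, _ = classify(doc_counts, collection_counts, class_names, manhattan)
```

For this toy document both distance functions select the class "nebesko telo", since the document holds a much larger share of the collection's occurrences of celestial body names than of country names.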

4.2 Results

The classification based on the ontology of proper names has been applied to a subset of the corpus of contemporary Serbian language developed at the Faculty of Mathematics, University of Belgrade, and accessible on the web at http://korpus.matf.bg.ac.rs. The collection experimented with consists of 234 non-tagged texts with a total size of around 23 million words, encompassing texts from newspapers and journals ("Politika", "Journal of the Serbian Orthodox Church", "Viva", "Svet", "NIN", "Magazin", "Ilustrovana politika"), fiction, textbooks, chapters from literary works, etc.

For experimental purposes, we chose five conceptual hierarchies from the SWN rooted at the proper names ontology terms "država" (Eng. country), "dinastija" (Eng. dynasty), "nebesko telo" (Eng. celestial body), "manastir" (Eng. monastery) and "hrišćanski praznik" (Eng. Christian holiday).

Since the SWN itself is in an early development phase, the corresponding conceptual hierarchies contain a limited number of proper names, between 3 (for "dinastija") and 16 (for "država"). The classification is performed following the procedure for crisp classification described in the previous section. There were 195 documents containing at least one occurrence of at least one proper name from at least one of the chosen conceptual hierarchies. Although the similarity measures were somewhat different, the classifications obtained by the Euclidean and Manhattan distance formulas were identical.

Assignment of documents to classes is quite successful. For example, among the 59 documents classified as belonging to the class "nebesko telo", the top 10 belong to geography textbooks, science fiction books and newspaper texts dealing with the subject. Among the documents classified as belonging to the class "država", the top 10 belong to daily and weekly newspapers ("Politika" and "NIN").

5 CONCLUSIONS

Document classification may benefit from both common and proper names. Proper names are especially suitable for subclassification of documents already classified into common classes such as news, articles, science, politics, finance, etc. Two types of classification based on the lexical-semantic network wordnet are presented. Classification based on wordnet basic concepts (common words) proved quite successful in classifying a large newspaper archive into several classes. A distance-based classification driven by an ontology of proper names is then presented, where the conceptual hierarchies of proper names follow the structure of the Serbian WordNet. Results of classification applied to an untagged part of the corpus of contemporary Serbian, of around 23 million words, are presented.

Our future plans involve improvement of the proposed classification framework and its implementation (enlargement of the SWN and lookup in EWN), as well as development of new classification methods based on character and word patterns in texts. We also plan to define new distance measures and to perform a multi-way comparison of different classification methods applied to different types of corpora.

ACKNOWLEDGEMENTS

The work presented has been financially supported by the Ministry of Science and Technological Development of the Republic of Serbia, Project No. 148921A.

REFERENCES

EAGLES (1996). Preliminary Recommendations on Text Typology, EAGLES Document EAG-TCWG-TTYP/P. Expert Advisory Group on Language Engineering Standards, European Commission.

Ebart (2010). Aktuelna arhiva. Medijska dokumentacija Ebart, http://www.arhiv.rs.

Fellbaum, C. (1998). Wordnet: An Electronic Lexical Database. The MIT Press.

HLTG (2010). Resursi srpskog jezika. Human Language Technologies Group, http://korpus.matf.bg.ac.rs, Faculty of Mathematics, University of Belgrade.

Krstev, C., Pavlović-Lažetić, G., Vitas, D., and Obradović, I. (2004). Using textual and lexical resources in developing Serbian wordnet. In Romanian J. Sci. Tech. Inform. (Special Issue on Balkanet), 7(1-2), pages 147–161. Romanian Academy.

LCC (2009). Library of Congress Classification Outline. http://www.loc.gov/catdir/cpso/lcco/, U.S. government.

Miller, G. (1995). Wordnet: A lexical database. In Comm. ACM 38(11), pages 39–41. ACM, Association for Computing Machinery.

Reuters (2010). Site Archive. Thomson Reuters Corporate, http://in.reuters.com/resources/archive/in/index.html.

Rodriguez, M., Gomez-Hidalgo, J., and Diaz-Agudo, B. (1996). Using wordnet to complement training information in text categorization. In Proceedings of the AAAI Spring Symposium on Machine Learning in Information Access, Bulgaria.

Rosso, P., Molina, A., Pla, F., Jiménez, D., and Vidal, V. (2004). Text categorization and information retrieval using wordnet senses. In CICLing 2004, Lecture Notes in Computer Science, 2945, pages 596–600. Springer-Verlag.

Scott, S. and Matwin, S. (1998). Text classification using wordnet hypernyms. In Usage of WordNet in Natural Language Processing Systems, 1st International Wordnet Conference. http://www.ceid.upatras.gr/Balkanet/files/balkanet-elsnet-ko-accept.pdf.

Tan, P., Steinbach, M., and Kumar, V. (2006). Introduction to Data Mining. Addison-Wesley.

Tomašević, J. and Pavlović-Lažetić, G. (2008). Productivity of concepts in Serbian wordnet. In Proceedings of the Sixth Language Technologies Conference: proceedings of the 11th International Multiconference Information Society - IS 2008, pages 86–91.

Tufis, D., Cristea, D., and Stamou, S. (2004). Balkanet: Aims, methods, results and perspectives. A general overview. In Romanian J. Sci. Tech. Inform. (Special Issue on Balkanet), 7(1-2), pages 9–43. Romanian Academy.

Vitas, D., Pavlović-Lažetić, G., Krstev, C., Popović, L., and Obradović, I. (2003). Processing Serbian written texts: An overview of resources and basic tools. In Proceedings of the International Workshop on Balkan Language Resources and Tools, Thessaloniki, pages 97–104.

Vossen, P. (1998). EuroWordnet: A Multilingual Database with Lexical Semantic Networks. Kluwer Academic Publishers, Dordrecht.
