Lexicon based Algorithm for Domain Ontology Merging and Alignment

Tomasz Boiński and Henryk Krawczyk

Faculty of Electronics, Telecommunications and Informatics, Gdańsk University of Technology,
11/12 Gabriela Narutowicza Street, 80-233 Gdańsk-Wrzeszcz, Poland

Keywords: Ontology Matching, Algorithm.

Abstract: More and more systems contain some kind of knowledge describing their field of operation. In many cases such knowledge is stored as an ontology. To enable interoperability of such systems, those ontologies need to be matched quickly. This paper presents a lexicon-based algorithm for merging and aligning OWL ontologies. The proposed similarity levels are introduced, the algorithm is described, and test results demonstrating its quality are presented.

1 INTRODUCTION

Merging and aligning domain ontologies is a complex process. A set of predefined procedures is needed for proper integration (Fig. 1).

Figure 1: Procedures ensuring ontology integration.

The quality of ontology integration is influenced by three factors:

1. Quality and integrity of the input ontologies, a key factor for proper ontology integration. Avoiding typical mistakes (Goczyła, 2011) or basing ontology construction on well-known guidelines (Gómez-Pérez et al., 2002) allows the creation of consistent and flexible ontologies. This way integration introduces new knowledge into the output ontology.

2. Usage of a precise and well-formed vocabulary. In particular, proper selection of core concepts (Boiński, 2012) determines whether integration can be performed seamlessly.

3. Methodology used during input ontology development. Usage of one of the well-known and accepted methodologies such as Methontology (Fernandez et al., 1997), NeOn (NeOn Project, 2010) or the methodology described by Noy and McGuinness (Noy et al., 2001) improves the quality of the ontology (Boiński, 2012). All those methodologies take into account the need for future integration, which, combined with the usage of tools like Protégé (Stanford University School of Medicine, 2010; Gennari et al., 2003; Noy et al., 2000) or OCS (Boiński, 2012), allows the creation of consistent and formally correct ontologies.

With the above criteria satisfied, the creation of a lexicon-based algorithm for ontology integration becomes possible.

The following sections present the proposed similarity measures and the algorithm itself, followed by results demonstrating its correctness.

2 LEVELS OF SIMILARITY BETWEEN ONTOLOGY ELEMENTS

Integration of knowledge represented by two ontologies can be based on the ability to compare elements of those ontologies, i.e. classes, individuals, relations and properties. Measures of semantic and syntactic similarity between concepts are therefore needed. Those measures lay the foundation for the algorithm described in the next section.

Boiński T. and Krawczyk H.
Lexicon based Algorithm for Domain Ontology Merging and Alignment.
DOI: 10.5220/0004092903210326
In Proceedings of the International Conference on Knowledge Engineering and Ontology Development (KEOD-2012), pages 321-326
ISBN: 978-989-8565-30-3
© 2012 SCITEPRESS (Science and Technology Publications, Lda.)

The measures proposed in this paper are derived directly from the pragmatic approach to ontologies represented by both Hovy (Hovy, 1998) and Euzenat and Valtchev (Euzenat and Valtchev, 2004). The proposed solution extends them with the possibilities introduced by modern lexicons like WordNet (Fellbaum, 1998). Four mutually complementary levels of similarity are proposed:

1. Lexical similarity $P_{lex}$ of classes $K_i$ ($i = 1, 2$) and individuals $B_i$ ($i = 1, 2$)

The basic measure of similarity between concepts. Classes and individuals are looked up in the WordNet dictionary, and their similarity and mutual relation are derived from the WordNet structure. Based on this measure one can determine whether concepts are identical, disjoint or logically encapsulating one another, which means that the class hierarchy and the membership of individuals can be determined. The WordNet dictionary allows both direct string matching and synonym look-up thanks to the organisation of lemmas into so-called synsets.

2. Semantic similarity $P_{sem}$ of classes $K_i$ ($i = 1, 2$) and individuals $B_i$ ($i = 1, 2$)

When lexical similarity cannot be calculated, the proposed algorithm uses semantic similarity as a secondary measure. This level of similarity takes values from the range $[0, 1]$ and allows determining whether concepts are similar, disjoint or logically encapsulating one another. The WordNet-based Lin algorithm (Lin, 1998; Lin, 1993) is used as the basis for this measure.

The Lin algorithm requires the lemmas describing the concepts to be present in the WordNet dictionary, which is not always the case. When a lemma is missing, the Levenshtein edit distance (Levenshtein, 1966) is used as a fallback way of calculating similarity. In that case only similarity and disjointness of concepts can be determined.

To improve performance, strings that carry no information, such as pronouns and words shorter than three letters, are eliminated from the lemma of the concept. WordNet contains only 305 noun entries of length two or less (0.26% of all nouns contained in the lexicon); moreover, most of them are abbreviations and Roman numerals.

The threshold similarity value that separates similar and disjoint concepts was based on results obtained from comparing human judgement ($sim$) with values obtained from the Lin algorithm ($sim_{Lin}$) (Boiński, 2012).

Pairs with high $sim_{Lin}$ (car - automobile, gem - jewel, journey - voyage, boy - lad, coast - shore, asylum - madhouse, magician - wizard, midday - noon and furnace - stove) were all judged very similar by humans. The human-assigned similarity $sim$ of the pairs tool - implement and brother - monk was lower than expected when compared with $sim_{Lin}$; however, people closely connected with the English language (a Northern Ireland resident, an English teacher, etc.) marked those pairs as highly similar.

The most problematic were the pairs bird - cock and bird - crane. In both cases the second lemma is logically encapsulated by the first. According to the Lin algorithm both pairs have a relatively high $sim_{Lin}$, 0.74 and 0.72 respectively. Unfortunately, it is not possible to determine which concept encapsulates which based on that value alone. Most of the human testers decided, however, that when only a similarity value is available and no other information exists, such concepts can be safely merged.

The remaining concept pairs were judged different by both humans and the Lin algorithm. Based on that test it was decided that $P_{sem} < 0.7$ means different concepts and $P_{sem} \geq 0.7$ means similar concepts. This observation is consistent with results obtained by other research groups, e.g. the developers of Falcon-AO (Jian et al., 2005).

3. Similarity of comments $P_{kom}$ attached to classes $K_i$ ($i = 1, 2$) and individuals $B_i$ ($i = 1, 2$)

The third level of similarity between ontology elements is the similarity between the comments attached to those elements. The two comments are treated as the two parts of a bipartite graph, where words are mapped to nodes of the graph and similarities between words are mapped to edges between those nodes. Similarity between the comments is then reduced to the problem of finding a maximal assignment between the two parts of the graph, which can be solved using the Hungarian method (Kuhn, 1955). Similarities between nodes are calculated using the aforementioned methods.

Finally, the value of $P_{kom}$ is calculated as the ratio of the achieved maximal assignment to the number of elements in the longer of the two comments (Equation 1). This ensures that the value of $P_{kom}$ always lies within the range $[0, 1]$.

$$P_{kom} = \frac{\sum_{(l, r) \in M} sim(l, r)}{\max(|L|, |R|)} \qquad (1)$$

where $L$ and $R$ are the sets of words of the two comments, $M$ is the maximal assignment between them and $sim(l, r)$ is the similarity of the assigned words $l$ and $r$.

4. Structural similarity $P_{str}$ of classes $K_i$ ($i = 1, 2$) and individuals $B_i$ ($i = 1, 2$)

This level of similarity takes into account the structural placement of a given concept with regard to its nearest neighbours. Both the type and the direction of each relation are taken into account. This similarity can be described by Equation 2:

$$P_{str} = \frac{\sum_{i} \min(r_i^{1}, r_i^{2})}{\sum_{i} \max(r_i^{1}, r_i^{2})} \qquad (2)$$

where $r_i^{j}$ is the number of occurrences of relation $i$ (where $i \in \{$subsumption, membership, equality, disjointness, union, intersection$\}$) in which the given concept of ontology $j$ takes part.

This level of similarity is used when no other measure can be applied, as it provides the least information.
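The comment similarity of Equation 1 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the tokenizer, the exact-match `word_similarity` (a stand-in for the Lin and Levenshtein measures described for the semantic level) and all function names are introduced here only for the example. The assignment step uses a standard pure-Python formulation of the Hungarian method.

```python
import re


def tokens(comment):
    """Lower-case words of at least three letters (shorter ones are dropped)."""
    return [w for w in re.findall(r"[a-z]+", comment.lower()) if len(w) >= 3]


def word_similarity(a, b):
    """Hypothetical stand-in for the word-level measure: exact match only."""
    return 1.0 if a == b else 0.0


def hungarian(cost):
    """Minimum-cost assignment for an n x m matrix with n <= m.

    Returns (row, column) pairs; every row is assigned exactly one column.
    """
    n, m = len(cost), len(cost[0])
    inf = float("inf")
    u = [0.0] * (n + 1)      # row potentials
    v = [0.0] * (m + 1)      # column potentials
    p = [0] * (m + 1)        # p[j] = row currently matched to column j
    way = [0] * (m + 1)      # back-pointers along the augmenting path
    for i in range(1, n + 1):
        p[0] = i
        j0 = 0
        minv = [inf] * (m + 1)
        used = [False] * (m + 1)
        while True:
            used[j0] = True
            i0, delta, j1 = p[j0], inf, 0
            for j in range(1, m + 1):
                if not used[j]:
                    cur = cost[i0 - 1][j - 1] - u[i0] - v[j]
                    if cur < minv[j]:
                        minv[j], way[j] = cur, j0
                    if minv[j] < delta:
                        delta, j1 = minv[j], j
            for j in range(m + 1):
                if used[j]:
                    u[p[j]] += delta
                    v[j] -= delta
                else:
                    minv[j] -= delta
            j0 = j1
            if p[j0] == 0:
                break
        while j0:            # flip the matching along the augmenting path
            j1 = way[j0]
            p[j0] = p[j1]
            j0 = j1
    return [(p[j] - 1, j - 1) for j in range(1, m + 1) if p[j]]


def comment_similarity(left, right, sim=word_similarity):
    """P_kom: weight of the maximal assignment over the longer comment length."""
    short, long_ = sorted((tokens(left), tokens(right)), key=len)
    if not short:
        return 0.0
    # Negate similarities so that minimising cost maximises similarity.
    cost = [[-sim(a, b) for b in long_] for a in short]
    total = sum(sim(short[i], long_[j]) for i, j in hungarian(cost))
    return total / len(long_)


part = "A part of a book having its own title."
chapter = "A chapter (or section or whatever) of a book having its own title."
p_kom = comment_similarity(part, chapter)
```

With exact-match word similarity, the two comments quoted in Section 4.1 share five of the eight retained words of the longer comment, so `p_kom` is 5/8. Passing a lexicon-based measure as `sim` would also give credit to near-synonyms.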

Lexical similarity is proposed as the basic similarity measure. With the constant development of many lexicons this approach seems more and more justified. The current version 3.0 of WordNet contains 155287 different English words with 206941 word-meaning pairs (Princeton University, 2010). It is highly probable that most of the words used in concept descriptions will be found within that lexicon. This way connections between concepts can be derived from WordNet, thus enriching the output ontology. The further levels of similarity allow mapping concepts onto one another even when they are not found in the lexicon used, making the proposed algorithm usable in general scenarios.
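For lemmas missing from the lexicon, the Levenshtein fallback described for the semantic level could look roughly as follows. The paper does not state how the edit distance is turned into a value in $[0, 1]$, so the normalisation by the longer string, the small pronoun list and all function names below are assumptions made for illustration only.

```python
import re

# Illustrative subset only; the paper does not list the pronouns it removes.
PRONOUNS = {"she", "him", "her", "its", "they", "them", "you", "our"}
THRESHOLD = 0.7  # border between similar and disjoint concepts


def informative_tokens(label):
    """Drop pronouns and words shorter than three letters."""
    words = re.findall(r"[A-Za-z]+", label)
    return [w for w in words if len(w) >= 3 and w.lower() not in PRONOUNS]


def levenshtein(a, b):
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(previous[j] + 1,                # deletion
                               current[j - 1] + 1,             # insertion
                               previous[j - 1] + (ca != cb)))  # substitution
        previous = current
    return previous[-1]


def string_similarity(a, b):
    """Assumed normalisation: 1 - distance / length of the longer string."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def classify(label_a, label_b):
    """Only SIMILAR or DISJOINT can be decided from an edit distance."""
    score = string_similarity(label_a.lower(), label_b.lower())
    return "SIMILAR" if score >= THRESHOLD else "DISJOINT"
```

A spelling variant such as Organisation / organization differs by one substitution and is classified as similar, while a label replaced by a random string falls well below the 0.7 threshold. Encapsulation cannot be detected this way, which matches the limitation noted in the text.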

3 THE PROPOSED ALGORITHM

The proposed algorithm is based on lexical and semantic analysis of the integrated ontologies and operates on ontologies stored in the OWL language. Furthermore, it was observed that in most ontologies concept names are usually represented by nouns and most of the information is either explicitly represented by classes and relations between them or can be easily derived and transformed into such a representation. It was also assumed that the input ontologies represent the same domain. The strictness of this assumption depends on the lexicon used, and satisfying it usually improves the quality of the output ontology (Boiński, 2012).

The WordNet dictionary was proposed as a knowledge base that allows adding and extending relations in the output ontology (Boiński, 2012). The proposed algorithm takes two ontologies as input and produces one ontology as its output. The output ontology can be either a merge (in terms of unification of URIs in the OWL language) or an alignment (in terms of importing the source ontologies and including only the mapping between their elements).

The proposed algorithm goes as follows:

 1: program OntologyMerger {
 2:   function combineSubTrees(node_A, node_B, node_C) {
 3:     for all child_A of node_A do
 4:       for all child_B of node_B do
 5:         result = compare(child_A, child_B);
 6:         if result == EQUAL then
 7:           node = combine(child_A, child_B);
 8:           add node as child of node_C;
 9:           combineSubTrees(child_A, child_B, node);
10:         else if result == DISJOINT then
11:           add child_A as child of node_C;
12:           add child_B as child of node_C;
13:         else if result == A_ENCAPSULATES_B then
14:           add child_A as child of node_C;
15:           place child_B in subtree of child_A;
16:         else if result == B_ENCAPSULATES_A then
17:           add child_B as child of node_C;
18:           place child_A in subtree of child_B;
19:         end if
20:       end for
21:     end for
22:   }
23:   function placeInSubTree(node, root) {
24:     for all child of root do
25:       result = compare(node, child);
26:       if result == EQUAL then
27:         newNode = combine(node, child);
28:         add newNode as child of root;
29:         combineSubTrees(node, child, newNode);
30:       else if result == DISJOINT then
31:         add child with subtree as child of root;
32:         add node with subtree as child of root;
33:       else if result == NODE_ENCAPSULATES_CHILD then
34:         add node as child of root;
35:         placeInSubTree(child, node);
36:       else if result == CHILD_ENCAPSULATES_NODE then
37:         add child as child of root;
38:         placeInSubTree(node, child);
39:       end if
40:     end for
41:   }
42:   input Ontology_A;
43:   input Ontology_B;
44:   output Ontology_C;
45:   combineSubTrees(root_of_A, root_of_B, root_of_C);
46:   return Ontology_C;
47: }

The main work of the algorithm is performed by two functions:

• combineSubTrees, which merges the subtrees of node A and node B and adds the result to node C as its subtree (lines 2-22),

• placeInSubTree, which finds the best placement of node in the subtree of root, recursively looking for the proper placement in terms of logical encapsulation (lines 23-41).

The basic concept behind both functions is the same, thus only the first one is described in detail. The algorithm starts by reading two ontologies A and B (lines 42 and 43). Ontology C is the output of the algorithm and is returned when it finishes (line 46). In OWL ontologies there is always a common class owl:Thing that is the root of the ontology. The algorithm starts its work from this node of each ontology, thus combining the subtree of owl:Thing in ontology A with the subtree of owl:Thing in ontology B. The result is added to owl:Thing of ontology C (line 45) by the first function, which takes three parameters:

• the analysed node of ontology A,

• the analysed node of ontology B,

• the node of ontology C which should be the root for the combination of the subtrees of the other two parameters.

Note that the ontology is not required to be a tree. The ontology is, however, analysed at the level of a concept and its direct descendants, and this fragment can be considered a tree.

The algorithm starts by comparing every element of ontology A with every element of ontology B at the same level of detail (loops starting at lines 3 and 4). The action performed depends on the result of the comparison:

1. if the concepts in both ontologies are determined to be equal (according to the measures presented in Section 2), they are merged (or connected with an equivalence property) (line 7) and added to the output ontology (line 8). Finally, their subtrees are combined into one tree attached to this new node (line 9).

2. if the concepts are determined to be different, they are added with their subtrees to the output ontology (lines 10-12). No further analysis of those subtrees is performed.

3. if the meaning of the node from ontology A is more general than the meaning of the node from ontology B (lines 13-15), then the node from ontology A is added to the output ontology (line 14) and the node from ontology B is placed within the subtree of the node from ontology A (line 15). The proper place for this node is looked up by the function placeInSubTree.

4. if the meaning of the node from ontology B is more general than the meaning of the node from ontology A (lines 16-18), the operations are analogous to those in point 3.
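The four cases above can be sketched in Python as follows. This is one possible reading of the pseudocode rather than the authors' implementation. The `Node` class, the `Relation` enum and `toy_compare` are hypothetical names introduced for the example. Two simplifications are assumed: nodes without a counterpart are added once after all pairs have been compared (to avoid duplicate children), and an encapsulated node is attached directly under the more general one instead of being routed through placeInSubTree.

```python
from dataclasses import dataclass, field
from enum import Enum, auto


class Relation(Enum):
    EQUAL = auto()
    DISJOINT = auto()
    A_ENCAPSULATES_B = auto()
    B_ENCAPSULATES_A = auto()


@dataclass
class Node:
    name: str
    children: list = field(default_factory=list)


def combine_subtrees(node_a, node_b, node_c, compare):
    """Merge the children of node_a and node_b under node_c."""
    used_a, used_b = set(), set()
    for child_a in node_a.children:
        for child_b in node_b.children:
            if id(child_a) in used_a:
                break
            if id(child_b) in used_b:
                continue
            result = compare(child_a, child_b)
            if result is Relation.DISJOINT:
                continue
            used_a.add(id(child_a))
            used_b.add(id(child_b))
            if result is Relation.EQUAL:
                merged = Node(child_a.name)
                node_c.children.append(merged)
                combine_subtrees(child_a, child_b, merged, compare)
            elif result is Relation.A_ENCAPSULATES_B:
                # Simplification: attach directly instead of placeInSubTree.
                node_c.children.append(
                    Node(child_a.name, child_a.children + [child_b]))
            else:
                node_c.children.append(
                    Node(child_b.name, child_b.children + [child_a]))
    # Nodes without a counterpart are added once, together with their subtrees.
    node_c.children += [c for c in node_a.children if id(c) not in used_a]
    node_c.children += [c for c in node_b.children if id(c) not in used_b]


def toy_compare(a, b):
    """Hypothetical compare(): name equality plus one hard-coded hypernym pair."""
    if a.name.lower() == b.name.lower():
        return Relation.EQUAL
    if (a.name, b.name) == ("Person", "Author"):
        return Relation.A_ENCAPSULATES_B
    return Relation.DISJOINT


ontology_a = Node("Thing", [Node("Publication", [Node("Book")]), Node("Person")])
ontology_b = Node("Thing", [Node("publication", [Node("Article")]), Node("Author")])
ontology_c = Node("Thing")
combine_subtrees(ontology_a, ontology_b, ontology_c, toy_compare)
```

In this toy run the two Publication nodes are merged and their disjoint children Book and Article end up side by side, while Author is placed below the more general Person. A real `compare` would apply the similarity levels of Section 2 instead of the hard-coded rules.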

The compare(node A, node B) function (lines 5 and 25) is used to determine whether concepts are similar, disjoint or logically encapsulating. It utilises the similarity levels introduced in Section 2. First it checks lexical and then semantic similarity. Furthermore, wherever possible, sibling and parent-child relations between the compared concepts are determined. Finally, if the two previous measures did not prove concept similarity, comment and structural similarity are used according to Equation 3.

$$P = w_{str} P_{str} + w_{kom} P_{kom} \qquad (3)$$

where:

$P_{str}$ - structural similarity derived from the type and number of relations in which the analysed concepts take part (Equation 2),

$P_{kom}$ - semantic similarity of comments attached to the analysed concepts (Equation 1),

$w_{str}$, $w_{kom}$ - weights of the aforementioned similarities; as a result of tests, their values were determined to be $w_{str} = 0.3$ and $w_{kom} = 0.7$.

As in the other cases, it was assumed that $P \geq 0.7$ means the concepts are identical and $P < 0.7$ means the concepts are different.
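Equations 2 and 3 can be illustrated together with a short Python sketch. The function names and the way relations are encoded (one `(type, direction)` tuple per occurrence) are assumptions made for this example; only the weights and the 0.7 threshold come from the text.

```python
from collections import Counter

W_STR, W_KOM = 0.3, 0.7   # weights reported for Equation 3
THRESHOLD = 0.7           # P >= 0.7 means identical concepts


def structural_similarity(relations_a, relations_b):
    """P_str: sum of per-relation minima over sum of per-relation maxima."""
    count_a, count_b = Counter(relations_a), Counter(relations_b)
    kinds = set(count_a) | set(count_b)
    shared = sum(min(count_a[k], count_b[k]) for k in kinds)
    total = sum(max(count_a[k], count_b[k]) for k in kinds)
    return shared / total if total else 0.0


def combined_similarity(p_str, p_kom):
    """Equation 3: weighted sum of structural and comment similarity."""
    return W_STR * p_str + W_KOM * p_kom


def decide(p_str, p_kom):
    score = combined_similarity(p_str, p_kom)
    return "EQUAL" if score >= THRESHOLD else "DISJOINT"


# Hypothetical relation lists: one (relation type, direction) per occurrence.
concept_a = [("subsumption", "out"), ("subsumption", "in"),
             ("disjointness", "out")]
concept_b = [("subsumption", "out"), ("subsumption", "in")]
p_str = structural_similarity(concept_a, concept_b)
```

With the values from the second row of Table 1 ($P_{str} = 0.67$, $P_{kom} = 0.75$) the weighted sum is $0.3 \cdot 0.67 + 0.7 \cdot 0.75 \approx 0.73$, which agrees with the final similarity reported there and lies above the threshold.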

The second function (placeInSubTree) performs its task in a similar way. Similarity between elements is calculated in the same way and the main loop is similar. The difference is that instead of combining two subtrees it locates the best logical placement of one node (the first parameter) in the subtree of the other (the second parameter).
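A simplified Python sketch of this placement step is shown below. It is an illustration under assumptions, not the published implementation: the `PARENT` table is a hypothetical stand-in for the lexicon-based compare function, the merge in the EQUAL case is reduced to adopting the children, and when the new node is more general than an existing child only that first child is moved beneath it.

```python
from dataclasses import dataclass, field

# Hypothetical "is a kind of" facts standing in for the WordNet-based compare().
PARENT = {"crane": "bird", "cock": "bird", "bird": "animal"}


@dataclass
class Node:
    name: str
    children: list = field(default_factory=list)


def is_ancestor(general, specific):
    """True if `general` is reachable from `specific` by following PARENT."""
    while specific in PARENT:
        specific = PARENT[specific]
        if specific == general:
            return True
    return False


def compare(node, child):
    if node.name == child.name:
        return "EQUAL"
    if is_ancestor(node.name, child.name):
        return "NODE_ENCAPSULATES_CHILD"
    if is_ancestor(child.name, node.name):
        return "CHILD_ENCAPSULATES_NODE"
    return "DISJOINT"


def place_in_subtree(node, root):
    """Insert `node` at the most specific position below `root`."""
    for index, child in enumerate(root.children):
        result = compare(node, child)
        if result == "EQUAL":
            child.children.extend(node.children)   # simplified merge
            return
        if result == "NODE_ENCAPSULATES_CHILD":
            node.children.append(child)            # node becomes the new parent
            root.children[index] = node
            return
        if result == "CHILD_ENCAPSULATES_NODE":
            place_in_subtree(node, child)          # descend and try again
            return
    root.children.append(node)                     # disjoint from every child


root = Node("animal", [Node("bird"), Node("fish")])
place_in_subtree(Node("crane"), root)
place_in_subtree(Node("cock"), root)
```

Both crane and cock are recognised as more specific than bird, so the recursion descends into bird and appends them there as siblings, since they are disjoint from each other.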

4 RESULTS

The proposed algorithm was tested using selected ontologies developed by the Ontology Alignment Evaluation Initiative as input for the EON Ontology Alignment Contest (Euzenat, 2004) and a specially developed security ontology (Boiński, 2012).

4.1 OAEI Ontologies

The tests were performed using the reference ontology (describing the Bibtex structure), its modifications and one unrelated ontology. All ontologies used in the tests are publicly available at http://oaei.ontologymatching.org/2006/benchmarks/. The results obtained by the algorithm were then compared with those provided by the contest's authors.

Table 1: Results of comparing concepts InCollection and Chapter with their modified equivalents dcsqdcsqd and dzqndbzq.

Lemma A        Lemma B     max(P_lex, P_sem)   P_kom   P_str   P      Result
InCollection   dcsqdcsqd   0.10                1.00    1.00    1.00   EQUAL
InCollection   dzqndbzq    0.00                0.75    0.67    0.73   EQUAL
Chapter        dcsqdcsqd   0.11                0.75    0.67    0.73   EQUAL
Chapter        dzqndbzq    0.00                1.00    1.00    1.00   EQUAL

All tests yielded positive results. The following scenarios were considered:

• merging with an identical ontology: all classes were connected with final similarity equal to 1.0. One additional connection was introduced (between Address and Reference) because of the domain overlap, as the source ontologies provide no additional information stating disjointness of those two concepts,

• merging with a completely different ontology: the Bibtex ontology was combined with an ontology describing food and wines, and all concepts were correctly marked as different,

• merging with similar ontologies stored using more general dialects of the OWL language: all elements of the source and target ontologies were described as identical with similarity equal to 1.0. As in the first test, one additional connection was introduced (between Address and Reference),

• merging with an identical ontology with removed labels (comment and structural similarity only): the ontology was merged with an identical one in which labels were replaced by random, meaningless strings. The algorithm based its work solely on the comments and structure of both ontologies and calculated most of the connections correctly. The only problem was with connecting the concepts InCollection and Chapter with their respective matches in the modified ontology, as they are located within the same structure and have similar comments ("A part of a book having its own title." and "A chapter (or section or whatever) of a book having its own title." respectively). Thus the matches to concepts dcsqdcsqd and dzqndbzq could not be determined correctly and the algorithm marked all four concepts as similar. Details of the results of those calculations are presented in Table 1.

In all cases the algorithm produced satisfactory results, showing that it is useful for small, domain-oriented ontologies.

4.2 Security Ontology

The security ontology (available at http://ocs.kask.eti.pg.gda.pl and, as OWL, at http://kask.eti.pg.gda.pl) was created both manually and using the proposed algorithm. It consists of three modules:

modules:

• Risk Core Concepts module, created by integrating ontologies based on the ENISA dictionary (ENISA, 2006; Enisa, 2010) (43 classes and 28 properties), the NIST dictionary (Guttman and Roback, 1995) (70 classes and 23 properties) and a chapter of Sommerville's book "Software Engineering" (Sommerville, 2006) (40 classes and 22 properties). After integration the ontology consists of 122 classes and 63 properties.

• Basic Security Concepts module, based on the Avizienis taxonomy (Avizienis et al., 2004) (269 classes and 91 properties).

• Safety and Security Requirements module, based on the Firesmith taxonomy (Firesmith, 2005a; Firesmith, 2005b) (195 classes and 56 properties).

The ontologies included in the Risk Core Concepts module, and later all three modules, were merged using the proposed algorithm and the obtained ontologies were compared with the results of manual integration.

In the first step the Risk Core Concepts module was created; the algorithm performed 1170 comparisons between 153 classes. Of those comparisons only 13 were incorrect (1.11%). Some of them resulted from errors in the source ontologies, which were then corrected. Later on the three modules were integrated into the Security ontology. The algorithm performed 1956 comparisons, of which only 31 were incorrect (1.59%). The resulting ontology was very similar to the one obtained manually. In addition, some of the errors in the source ontologies were discovered during the automated integration and could be corrected.

5 CONCLUSIONS

In all test cases the proposed algorithm proved to be useful. The tests show its usability for both small and large domain ontologies, with around 98% of comparisons being correct. In all cases, however, the algorithm was dependent on the quality of the input ontologies and of the external lexicons used during concept mapping. With further development of such lexicons the quality of the obtained results can be enhanced and ontologies from a wider range of domains can be merged.

The algorithm can be easily implemented as a lightweight library and used in any kind of application managing OWL or RDF ontologies. Such usage can further improve interoperability between systems in a heterogeneous environment by enabling them to understand messages sent to each other and map them to local knowledge bases represented as OWL ontologies.

ACKNOWLEDGEMENTS

This work was funded by the National Science Centre under the grant N N516 476440.

REFERENCES

Avizienis, A., Laprie, J., Randell, B., and Landwehr, C. (2004). Basic concepts and taxonomy of dependable and secure computing. Dependable and Secure Computing, IEEE Transactions on, 1(1):11–33.

Boiński, T. (2012). Procedures for merging and alignment of domain ontologies, PhD thesis [in Polish]. Faculty of Electronics, Telecommunications and Informatics, Gdańsk University of Technology.

ENISA (2006). Risk management: implementation principles and inventories for risk management/risk assessment methods and tools. Technical report.

Enisa (2010). Enisa: a European Union Agency - Glossary of Risk Management. [Online; viewed 01-12-2010].

Euzenat, J. (2004). Eon ontology alignment contest. [Online; viewed 27-01-2012].

Euzenat, J. and Valtchev, P. (2004). Similarity-based ontology alignment in OWL-lite. In ECAI 2004: 16th European Conference on Artificial Intelligence, August 22-27, 2004, Valencia, Spain: including Prestigious Applicants [sic] of Intelligent Systems (PAIS 2004): proceedings, page 333. IOS Press.

Fellbaum, C. (1998). WordNet: An electronic lexical database. The MIT Press.

Fernandez, M., Gomez-Perez, A., and Juristo, N. (1997). Methontology: from ontological art towards ontological engineering. In Proceedings of the AAAI97 Spring Symposium Series on Ontological Engineering, pages 33–40.

Firesmith, D. (2005a). A taxonomy of safety-related requirements. In International Workshop on High Assurance Systems (RHAS'05).

Firesmith, D. (2005b). A taxonomy of security-related requirements. In International Workshop on High Assurance Systems (RHAS'05). Citeseer.

Gennari, J., Musen, M., Fergerson, R., Grosso, W., Crubézy, M., Eriksson, H., Noy, N., and Tu, S. (2003). The evolution of Protégé: an environment for knowledge-based systems development. International Journal of Human-Computer Studies, 58(1):89–123.

Goczyła, K. (2011). Ontologie w systemach informatycznych. Akademicka Oficyna Wydawnicza EXIT, Warszawa.

Gómez-Pérez, A., Corcho, O., and Fernandez-Lopez, M. (2002). Ontological Engineering. Springer-Verlag, London, Berlin.

Guttman, B. and Roback, E. (1995). An introduction to computer security: the NIST handbook. DIANE Publishing.

Hovy, E. (1998). Combining and standardizing large-scale, practical ontologies for machine translation and other uses. In Proceedings of the 1st International Conference on Language Resources and Evaluation (LREC), pages 535–542. Citeseer.

Jian, N., Hu, W., Cheng, G., and Qu, Y. (2005). Falcon-AO: Aligning ontologies with Falcon. In Integrating Ontologies Workshop Proceedings, page 85. Citeseer.

Kuhn, H. (1955). The Hungarian method for the assignment problem. Naval Research Logistics Quarterly, 2(1-2):83–97.

Levenshtein, V. (1966). Binary codes capable of correcting deletions, insertions, and reversals. In Soviet Physics Doklady, volume 10, pages 707–710.

Lin, D. (1993). Principle-based parsing without overgeneration. In Proceedings of the 31st Annual Meeting on Association for Computational Linguistics, pages 112–120. Association for Computational Linguistics.

Lin, D. (1998). An information-theoretic definition of similarity. In Proceedings of the 15th International Conference on Machine Learning, volume 1, pages 296–304. Citeseer.

NeOn Project (2010). NeOn Book. http://www.neon-project.org/nw/NeOn Book. [Online; viewed 09-10-2010].

Noy, N., Fergerson, R., and Musen, M. (2000). The knowledge model of Protege-2000: Combining interoperability and flexibility. Knowledge Engineering and Knowledge Management Methods, Models, and Tools, pages 69–82.

Noy, N., McGuinness, D., et al. (2001). Ontology development 101: A guide to creating your first ontology.

Princeton University (2010). WordNet Database Statistics. http://wordnet.princeton.edu/wordnet/man/wnstats.7WN.html. [Online; viewed 22-11-2010].

Sommerville, I. (2006). Software Engineering. 8th edition. Harlow, UK: Addison-Wesley.

Stanford University School of Medicine (2010). Stanford Center for Biomedical Informatics Research. http://protege.stanford.edu. [Online; viewed 09-10-2010].

KEOD2012-InternationalConferenceonKnowledgeEngineeringandOntologyDevelopment

326