majority of them share the idea that the weight of a concept corresponds to the probability that a resource selected at random is characterized by a set of features including one representing that concept, or one of its descendants in the ontology. Hence, the higher the weight of a concept, the lower its specificity. For instance, the concept student has a smaller weight than person, since the former is more specific than the latter. Therefore, in formulating a query, the lower the weights of the concepts, the higher their selective power, and the more focused the returned answer set.
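As a concrete reading of this probabilistic interpretation, the following sketch (with a hypothetical toy ontology and dataset) computes the weight of a concept as the fraction of resources annotated with that concept or with one of its descendants.

```python
# Hypothetical toy example: weight(c) = fraction of resources annotated
# with c or with one of c's descendants in the ISA hierarchy.

# ISA hierarchy: child -> parent (tree-shaped)
parent = {"student": "person", "professor": "person", "person": "thing",
          "vehicle": "thing"}

# Each resource is annotated with a set of concepts (its feature vector)
annotations = {
    "r1": {"student"},
    "r2": {"professor", "vehicle"},
    "r3": {"person"},
    "r4": {"vehicle"},
}

def ancestors_or_self(concept):
    """Return the concept together with all of its ancestors up to the root."""
    result = {concept}
    while concept in parent:
        concept = parent[concept]
        result.add(concept)
    return result

def weight(concept):
    """Probability that a randomly selected resource has a feature equal to
    the concept or to one of its descendants."""
    hits = sum(
        1 for feats in annotations.values()
        if any(concept in ancestors_or_self(f) for f in feats)
    )
    return hits / len(annotations)

print(weight("person"))   # 0.75: r1, r2, r3 all carry person or a descendant
print(weight("student"))  # 0.25: only r1 (more specific, smaller weight)
```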
The performance of a semantic search engine depends on the semantic matchmaking method and on the approach used to weight the reference ontology. In this paper, we focus on the analysis of four different approaches for weighting the concepts of an ontology, and we carry out an experiment in order to assess the analyzed weighting methods.
The presented methods are divided into two groups (Sánchez et al., 2011): (i) extensional methods (also known as distributional methods), where the concept weights are derived by taking into account both the topology of the ISA hierarchy and the content of the resource space, also referred to as the dataset, and (ii) intensional methods (also known as intrinsic methods), where the concept weights are derived on the basis of the topology of the ISA hierarchy alone.
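The distinction can be summarized by the inputs each family relies on; the sketch below (with hypothetical type and function names, and bodies left as stubs) only contrasts the two signatures: an extensional method inspects both the ISA hierarchy and the annotated dataset, whereas an intensional method inspects the hierarchy alone.

```python
from typing import Dict, Set

# child -> parent links of a tree-shaped ISA hierarchy
Hierarchy = Dict[str, str]
# resource identifier -> set of annotating concepts
Dataset = Dict[str, Set[str]]

def extensional_weight(concept: str, isa: Hierarchy, data: Dataset) -> float:
    """Extensional (distributional): depends on the hierarchy AND the dataset,
    e.g. the annotation-based probability sketched above."""
    ...

def intensional_weight(concept: str, isa: Hierarchy) -> float:
    """Intensional (intrinsic): depends on the hierarchy topology only,
    e.g. a function of the number of descendants of the concept."""
    ...
```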
In this paper, we selected the semantic similarity method SemSim (Formica et al., 2013) in order to carry out the assessment of the four methods. In the mentioned paper, the authors show that SemSim outperforms the most representative similarity methods proposed in the literature, i.e., Dice, Cosine, Jaccard, and Weighted Sum. The SemSim method requires: i) a dataset consisting of a set of resources annotated according to a given ontology, and ii) a method for associating weights with the concepts of the ontology. On this basis, SemSim computes the semantic similarity between a given user request and any annotated resource in the dataset.
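As a rough illustration of this kind of feature-vector matchmaking, and not of the actual SemSim algorithm detailed in (Formica et al., 2013), the sketch below scores a request vector against a resource feature vector by pairing concepts through an information-content-based similarity derived from the concept weights; the toy ontology, the weights, and the greedy pairing strategy are illustrative assumptions.

```python
import math

# Hypothetical concept weights (probabilities) over a toy ISA hierarchy
parent = {"student": "person", "professor": "person", "person": "thing",
          "car": "vehicle", "vehicle": "thing"}
weight = {"thing": 1.0, "person": 0.6, "student": 0.2, "professor": 0.3,
          "vehicle": 0.4, "car": 0.25}

def ic(c):
    """Information content of a concept derived from its weight."""
    return -math.log(weight[c])

def ancestors_or_self(c):
    result = [c]
    while c in parent:
        c = parent[c]
        result.append(c)
    return result

def concept_sim(c1, c2):
    """Lin-style similarity based on the least common subsumer (LCS)."""
    common = [a for a in ancestors_or_self(c1) if a in ancestors_or_self(c2)]
    lcs = max(common, key=ic)  # most informative shared ancestor
    denom = ic(c1) + ic(c2)
    return 2 * ic(lcs) / denom if denom > 0 else 1.0

def vector_sim(request, resource):
    """Greedy one-to-one pairing of request and resource concepts
    (an illustrative stand-in for an optimal matching)."""
    resource = set(resource)
    total = 0.0
    for rc in request:
        if not resource:
            break
        best = max(resource, key=lambda oc: concept_sim(rc, oc))
        total += concept_sim(rc, best)
        resource.remove(best)
    return total / max(len(request), 1)

print(vector_sim(["student", "car"], ["professor", "vehicle"]))  # ~0.58
```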
With respect to the present work, in the mentioned paper only two weighting methods were considered, i.e., the frequency and the probabilistic approaches, which correspond here to the Annotation Frequency Method and the Top Down Topology Method, respectively. Note that, in order to be consistent with the results given in (Formica et al., 2013), in this paper we keep the same experimental setting, in particular the reference ontology and the dataset presented in that work.
The next section gives a brief overview of ontology weighting. Section 3 provides the basic notions concerning weighted ontologies and ontology-based feature vectors, and proposes a probabilistic model for weighted ontologies. Section 4 describes the four methods in detail. Section 5 illustrates the assessment of the methods and, finally, Section 6 concludes.
2 RELATED WORK
According to the extensional methods, also referred to as distributional (Sánchez et al., 2011), the information content of a concept is in general estimated from the frequency distribution of terms in text corpora. Hence, this type of method is based on the extensional semantics of the concept itself, as its probability can be derived from the number of occurrences of the concept in the text corpora. This approach was
used in (Jiang and Conrath, 1997), (Resnik, 1995),
and (Lin, 1998) to assess semantic similarity between
concepts. Other proposals include the inverse docu-
ment frequency (IDF) method, and the method based
on the combination of term frequency (TF) and the
IDF (Manning et al., 2008). In our work, we de-
rived the concept frequency method and the annota-
tion frequency method, respectively, from those used
in (Resnik, 1995) and the IDF.
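To make these two extensional ingredients concrete, the sketch below computes a Resnik-style corpus probability, which counts the occurrences of a concept together with those of its descendants, and a plain IDF score over annotated resources; the toy counts and names are hypothetical.

```python
import math

# Hypothetical ISA links (child -> parent) and toy data
parent = {"student": "person", "professor": "person"}
corpus_counts = {"person": 40, "student": 25, "professor": 10}  # term occurrences
annotations = {"r1": {"student"}, "r2": {"person"},
               "r3": {"professor"}, "r4": {"student"}}

def _has_ancestor(c, ancestor):
    while c in parent:
        c = parent[c]
        if c == ancestor:
            return True
    return False

def descendants_or_self(concept):
    return {c for c in corpus_counts
            if c == concept or _has_ancestor(c, concept)}

def resnik_ic(concept):
    """Corpus-based IC: -log p(c), where p(c) counts occurrences of the
    concept and of all of its descendants (Resnik-style)."""
    total = sum(corpus_counts.values())
    freq = sum(corpus_counts[c] for c in descendants_or_self(concept))
    return -math.log(freq / total)

def idf(concept):
    """IDF over the dataset: log(N / number of resources annotated with c)."""
    n_docs = len(annotations)
    df = sum(1 for feats in annotations.values() if concept in feats)
    return math.log(n_docs / df) if df else float("inf")

print(resnik_ic("person"), resnik_ic("student"))  # person is less informative
print(idf("student"), idf("professor"))           # rarer annotations score higher
```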
According to the intensional methods, also referred to as intrinsic (Sánchez et al., 2011), the information content is computed starting from the conceptual relations existing between concepts and, in particular, from the taxonomic structure of concepts. In this regard, one of the most relevant methods is presented in (Seco et al., 2004). It is based on the number of hyponyms of a concept and the maximum number of concepts in the taxonomy. In (Meng et al., 2012), the authors present a method derived from (Seco et al., 2004), but they also consider the degree of generality of concepts and, hence, their depth in the taxonomy.
In (Sánchez et al., 2011), the authors claim that the taxonomical leaves are enough to describe and differentiate two concepts, because ad-hoc abstractions (e.g., abstract entities) rarely appear in a universe of discourse but have an impact on the size of the hyponym tree. In (Hayuhardhika et al., 2013), the authors propose to use a density factor to estimate concept weights on the basis of the sum of the inward and outward connections of a concept with other concepts against the total number of connections in the ontology. Finally, just to mention one more example, (Abioui et al., 2018) takes into account both the taxonomic structure and other semantic relationships to compute the weights of concepts.
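As an example of an intrinsic scheme, the sketch below implements a Seco-style information content that depends only on the number of hyponyms of a concept and on the total number of concepts in the taxonomy, following the commonly reported form IC(c) = 1 - log(hypo(c) + 1) / log(N); the toy hierarchy is hypothetical.

```python
import math

# Hypothetical tree-shaped ISA hierarchy: child -> parent
parent = {"student": "person", "professor": "person",
          "person": "thing", "car": "vehicle", "vehicle": "thing"}
concepts = {"thing", "person", "student", "professor", "vehicle", "car"}

def hyponyms(concept):
    """All direct and indirect descendants of a concept."""
    result = set()
    for c in concepts:
        cur = c
        while cur in parent:
            cur = parent[cur]
            if cur == concept:
                result.add(c)
                break
    return result

def seco_ic(concept):
    """Intrinsic IC: 1 - log(hypo(c) + 1) / log(|concepts|).
    Leaves get IC = 1, the root gets IC = 0; no dataset is needed."""
    return 1 - math.log(len(hyponyms(concept)) + 1) / math.log(len(concepts))

print(seco_ic("thing"))    # 0.0  (the root subsumes every other concept)
print(seco_ic("person"))   # ~0.39 (intermediate generality)
print(seco_ic("student"))  # 1.0  (leaf concept)
```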
In this work, first of all we focus on a tree-shaped
taxonomy organized as an ISA hierarchy and, within