the information content of the concepts in the RO.
The information content of a concept c is computed
according to the well-known expression: -log(p(c)),
where p(c) is the probability that a document deals
with the concept c. Here, a crucial problem is how to
obtain the value of p(c). The large majority of
proposals (see the next section on related work) use
the probabilities derived from WordNet frequencies
(WordNet, 2010). However, as shown in the related
work section, such measures are not very accurate
and are often not available for all possible concepts.
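As a concrete reading of the expression above: assuming base-2 logarithms (the base is not fixed here), a concept with p(c) = 1/8 has information content

\[ -\log_2 p(c) = -\log_2 \tfrac{1}{8} = 3 \text{ bits}, \]

so rarer concepts carry more information.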
In a previous work (Formica et al., 2008), we
adopted a probabilistic approach, proposing an
alternative to the measures provided by WordNet. In
this work, we adopt a frequency approach: since
we operate within a cluster of enterprises, and
therefore in a closed UDR, we are in a "controlled"
situation where it is possible to replace the estimate
of a probability with the factual measure of the
relative frequency of the concepts in the UDR. The
relative frequency of a concept is obtained from the
number of resources containing the concept over the
total number of digital resources in the UDR. In
particular, in this paper we present an experimental
result showing that the frequency approach
correlates more strongly with human judgment than
both the probabilistic approach introduced in
(Formica et al., 2008) and some representative
methods defined in the literature.
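To make the computation concrete, here is a minimal sketch, assuming each resource in the UDR is represented by the set of RO concepts annotating it (the data structure and names are ours, for illustration only):

```python
import math

def relative_frequencies(udr_annotations):
    """freq(c): number of resources containing concept c over the
    total number of digital resources in the UDR."""
    total = len(udr_annotations)
    counts = {}
    for concepts in udr_annotations.values():
        for c in concepts:
            counts[c] = counts.get(c, 0) + 1
    return {c: n / total for c, n in counts.items()}

def information_content(frequencies):
    """IC(c) = -log(freq(c)); the logarithm base is a free choice."""
    return {c: -math.log(f) for c, f in frequencies.items() if f > 0}

# Toy UDR with three annotated digital resources.
udr_annotations = {
    "doc1": {"Accommodation", "FarmHouse"},
    "doc2": {"Accommodation", "Meal"},
    "doc3": {"Meal"},
}
freq = relative_frequencies(udr_annotations)  # e.g., Accommodation -> 2/3
ic = information_content(freq)                # e.g., Accommodation -> -log(2/3)
```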
The SemSim method is organized in two phases: a
preparatory phase and an execution phase. The
preparatory phase sets up the semantic
infrastructure by: (i) developing a RO; (ii)
providing a semantic annotation for each document in
the UDR; and (iii) analyzing the documents in the UDR
to determine the relative frequency of the concepts
in the RO. This phase is time-consuming and
costly, but it takes place only once, at the constitution
of the cluster of enterprises, and is then followed
only by periodic updates. The execution phase,
performed on-the-fly at request time, consists of
the following steps (sketched in code below): (a) the
semantic annotation of the user request; (b) the
matchmaking between the semantic annotation of the
user request and the semantic annotation of each
document in the UDR, yielding a semantic similarity
measure; and (c) the ranking of the documents by
descending similarity degree.
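A rough sketch of steps (b) and (c) follows, assuming the request has already been annotated (step (a)); the Jaccard matchmaker is only a stand-in for the actual SemSim measure of Section 5, and all names below are ours:

```python
def rank_documents(request_annotation, udr_annotations, matchmake):
    """Step (b): match the request annotation against each document's
    annotation; step (c): rank by descending similarity degree."""
    scores = {doc: matchmake(request_annotation, annotation)
              for doc, annotation in udr_annotations.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

def jaccard(a, b):
    """Placeholder matchmaker (set overlap), NOT the SemSim measure."""
    return len(a & b) / len(a | b) if a | b else 0.0

udr_annotations = {
    "doc1": {"Accommodation", "FarmHouse"},
    "doc2": {"Accommodation", "Meal"},
    "doc3": {"Meal"},
}
print(rank_documents({"Accommodation", "Meal"}, udr_annotations, jaccard))
# [('doc2', 1.0), ('doc3', 0.5), ('doc1', 0.3333...)]
```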
The rest of the paper is structured as follows. In
the next section, related work is discussed. In Section
3, some basic notions used in SemSim are recalled.
In Section 4, the probabilistic approach is recalled,
the frequency approach is introduced, and the
weighted reference ontology of our running example
is presented. In Section 5, the SemSim method for
evaluating semantic similarity is given. In Section 6,
an assessment of the SemSim method is presented.
Finally, Section 7 concludes the paper.
2 RELATED WORK
In the vast literature available (see, for instance,
(Alani and Brewster, 2005), (Euzenat and Shvaiko,
2007), (Madhavan and Halevy, 2003), (Maguitman
et al., 2005)), we restrict our focus to the
proposals most closely related to our approach. We wish
to emphasize that the focus of our work is both on the
assignment of weights to the concepts of a reference
ontology and on the method to compute the similarity
between concept vectors. The following subsections
address these two aspects.
2.1 The Weight Assignment
In the large majority of papers in the
literature (Euzenat and Shvaiko, 2007), (Maguitman
et al., 2005), the assignment of weights to the concepts
of a reference ontology (or a taxonomy) is
performed by using WordNet (WordNet, 2010); see,
for instance, (Kim and Candan, 2006), (Li et al.,
2003), and also (Resnik, 1995) and (Lin, 1998), which
inspired our method. WordNet (a lexical ontology
for the English language) provides, for a given
concept (noun), the natural language definition,
hypernyms, hyponyms, synonyms, etc., and also a
measure of the frequency of the concept. The latter
is obtained by using noun frequencies from the
Brown Corpus of American English (Francis and
Kucera, 1979). Then, the SemCor project (Fellbaum
et al., 1997) made a step forward by linking
subsections of the Brown Corpus to senses in the
WordNet lexicon (with a total of 88,312 observed
nouns). We did not adopt the WordNet frequencies
for two reasons. Firstly, we deal with specialised
domains (e.g., systems engineering, tourism, etc.),
requiring specialised domain ontologies. WordNet is
a generic lexical ontology (i.e., not focused on a
specific domain) that contains only simple terms. In
fact, multi-word terms are not reported (e.g., terms
such as "seaside cottage" or "farm house" are not
defined in WordNet). Secondly, there are concepts
in WordNet for which the frequency is not given
(e.g., "accommodation") or is statistically insignificant,
as in the case of "meal" (whose frequency is 20).
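Both issues can be probed directly. The following is a minimal check using NLTK's WordNet interface (our tooling choice for illustration; the paper itself does not mention NLTK):

```python
from nltk.corpus import wordnet as wn  # requires: nltk.download('wordnet')

def wordnet_noun_frequency(word):
    """Sum the SemCor-derived tag counts over all noun senses of a word."""
    return sum(lemma.count() for lemma in wn.lemmas(word, pos=wn.NOUN))

# Multi-word terms: WordNet stores collocations with underscores,
# but "seaside cottage" has no entry at all.
print(wn.synsets("seaside_cottage"))   # []

# Counts can be missing or very small; the paper reports 20 for "meal".
print(wordnet_noun_frequency("meal"))
```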
Concerning weight assignment, the proposal in (Fang
et al., 2005) makes joint use of an ontology
and a typical Natural Language Processing method
based on term frequency and inverse document
frequency (tf-idf).
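For reference, a bare-bones rendering of tf-idf weighting (the standard formulation; the exact combination used in (Fang et al., 2005) may differ):

```python
import math

def tf_idf(term, doc, corpus):
    """Classic weighting: tf(t, d) * log(N / df(t)), where df(t) is the
    number of documents containing the term and N the corpus size."""
    tf = doc.count(term)
    df = sum(1 for d in corpus if term in d)
    return tf * math.log(len(corpus) / df) if df else 0.0

corpus = [["farm", "house", "meal"], ["meal", "meal"], ["accommodation"]]
print(tf_idf("meal", corpus[1], corpus))  # 2 * log(3/2) ≈ 0.81
```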