clouds. We then compute the semantic relatedness be-
tween the two concept clouds and use it to determine
the relatedness between the initial word pair.
The rest of the paper is organized as follows: Sec-
tion 2 discusses related work. Section 3 gives a de-
tailed description of our approach for calculating
semantic relatedness and its underlying algorithms.
Section 4 contains the experimental results and an
evaluation of our method by comparison with other
similar methods. We conclude the paper in Section 5.
2 RELATED WORK
The problem of determining the semantic relatedness
between two words has been an area of interest to
researchers from several areas for a long time. Some
very preliminary approaches (Rada et al., 1989) cal-
culated the similarity between two words on the ba-
sis of the number of edges in the term hierarchy cre-
ated by indexing of articles. Similar edge-counting
based methods were also applied on existing knowl-
edge repositories such as Roget’s Thesaurus (Jarmasz
and Szpakowicz, 2003) or WordNet (Hirst and St-
Onge, 1998) to compute the semantic relatedness.
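As a minimal sketch of such edge counting, relatedness can be read off the shortest path between two terms in a hierarchy. The toy is-a hierarchy below is an invented illustration, not data from the cited work:

```python
from collections import deque

def path_length(graph, a, b):
    """Shortest number of edges between two terms, via BFS."""
    if a == b:
        return 0
    seen = {a}
    queue = deque([(a, 0)])
    while queue:
        node, dist = queue.popleft()
        for neighbor in graph.get(node, []):
            if neighbor == b:
                return dist + 1
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, dist + 1))
    return None  # terms are not connected

# Toy term hierarchy, stored as an undirected adjacency list
edges = [("car", "vehicle"), ("truck", "vehicle"),
         ("bicycle", "vehicle"), ("vehicle", "artifact")]
graph = {}
for u, v in edges:
    graph.setdefault(u, []).append(v)
    graph.setdefault(v, []).append(u)

print(path_length(graph, "car", "truck"))  # 2 (car -> vehicle -> truck)
```

Fewer intervening edges are taken to mean higher relatedness, which is exactly the intuition the later, depth-aware measures refine.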
To improve the preliminary approaches to calcu-
lating the semantic relatedness between words, more
sophisticated methods have been proposed. Instead
of simply relying on the number of connecting edges,
Leacock and Chodorow (1998) have proposed to take
the depth of the term hierarchy into consideration.
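Their measure is commonly written as follows (this is the standard formulation, not an equation quoted from this paper):

```latex
\mathrm{sim}_{LC}(c_1, c_2) = -\log \frac{\mathrm{len}(c_1, c_2)}{2D}
```

where $\mathrm{len}(c_1, c_2)$ is the shortest path between the two concepts in the hierarchy and $D$ is the maximum depth of the hierarchy, so the same path length counts for less in a deeper taxonomy.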
Other groups have proposed to use the descriptions of
words present in dictionaries (Lesk, 1986) and tech-
niques such as LSA (Deerwester et al., 1990) to com-
pute semantic relatedness. However, due to the very
limited size of WordNet as a knowledge base and the
absence of well known named entities (e.g., Harry
Potter) in WordNet, researchers have started to look
for more comprehensive knowledge bases.
The advent of Wikipedia in 2001 fulfilled
the need for a more comprehensive knowledge base.
Many techniques that use Wikipedia to compute se-
mantic relatedness have been developed in recent
years. Among others, Strube and Ponzetto (2005)
have used Wikipedia to determine semantic related-
ness. Their results outperform those obtained us-
ing WordNet, hence showing the effectiveness of
Wikipedia in determining the similarity between two
words. Gabrilovich and Markovitch (2007) have de-
veloped a technique, called Explicit Semantic Anal-
ysis (ESA), to represent the meaning of words in
a high dimensional space of concepts derived from
Wikipedia. Experimental results show that ESA out-
performs the method given by (Strube and Ponzetto,
2005). Chernov et al. (2006) have suggested using
the links between categories present on
Wikipedia to extract semantic information. Milne and
Witten (2008) have proposed the use of links between
articles of Wikipedia rather than its categories to de-
termine semantic relatedness between words. Zesch
et al. (2008) have proposed to use Wiktionary, a
comprehensive wiki-based dictionary and thesaurus,
for the computation of semantic relatedness. Although
Wikipedia has proven to be a better knowledge base
than WordNet, many terms (e.g., 1980 movies) are
still unavailable on Wikipedia. This has motivated
the use of the whole web as the knowledge base for
calculating semantic relatedness.
Bollegala et al. (2007) have proposed to use page
counts and text snippets extracted from result pages
of web searches to measure semantic relatedness be-
tween words. They achieve a high correlation mea-
sure of 0.83 on the Miller-Charles benchmark dataset.
Sahami and Heilman (2006) have used a similar mea-
sure. Cilibrasi et al. (2007) have proposed to compute
the semantic relatedness using the normalized Google
distance (NGD), in which they used Google to de-
termine how closely related two words are on the ba-
sis of their frequency of occurring together in web
documents. Chen et al. (2006) have proposed to ex-
ploit the text snippets returned by a Web search engine
as an important measure in computing the semantic
similarity between two words.
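For reference, NGD is commonly defined in terms of search-engine page counts (again the standard formulation, not an equation taken from this paper):

```latex
\mathrm{NGD}(x, y) =
  \frac{\max\{\log f(x), \log f(y)\} - \log f(x, y)}
       {\log N - \min\{\log f(x), \log f(y)\}}
```

where $f(x)$ is the number of pages containing term $x$, $f(x, y)$ the number containing both terms, and $N$ the total number of indexed pages; words that rarely co-occur relative to their individual frequencies receive a large distance.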
The approach in (Salahli, 2009) is the closest to
our approach, as it uses the related terms of two words
to determine the semantic relatedness between the
words. However, the major drawback of the approach
proposed in (Salahli, 2009) is that the related terms
are manually selected. In contrast, our approach au-
tomatically retrieves the terms most relevant to a
given word. Furthermore, Salahli compares
the related terms to the original query. In our ap-
proach, we compute the semantic relatedness between
two words using the semantic similarity between their
generated concept clouds. To the best of our knowl-
edge, such an approach has not been proposed yet.
3 PROPOSED APPROACH
The steps of our proposed approach are shown in Fig-
ure 1. We use a two-phase procedure to compute the
semantic relatedness between two words. The first
phase involves the use of a Concept Extractor (Kulka-
rni and Caragea, 2009) to identify concepts related to
the given pair of words and to generate their concept
clouds. In the second phase, we use web-based coef-
ficients (Cosine, Jaccard, Dice, Overlap) to compute
KDIR 2009 - International Conference on Knowledge Discovery and Information Retrieval
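The coefficients named above have standard set-based forms, which can be sketched over two concept clouds as follows. The example clouds and words are illustrative assumptions, not the paper's data, and the paper's web-based variants may substitute page counts for set sizes:

```python
import math

def jaccard(a, b):
    """Intersection over union of two concept sets."""
    return len(a & b) / len(a | b)

def dice(a, b):
    """Twice the intersection over the summed set sizes."""
    return 2 * len(a & b) / (len(a) + len(b))

def overlap(a, b):
    """Intersection over the smaller set."""
    return len(a & b) / min(len(a), len(b))

def cosine(a, b):
    """Intersection over the geometric mean of set sizes."""
    return len(a & b) / math.sqrt(len(a) * len(b))

# Hypothetical concept clouds for two words (invented example)
cloud1 = {"fruit", "tree", "red", "pie", "orchard"}
cloud2 = {"fruit", "tree", "yellow", "peel"}

print(round(jaccard(cloud1, cloud2), 3))  # 0.286
print(overlap(cloud1, cloud2))            # 0.5
```

All four coefficients lie in [0, 1] and grow with the overlap of the clouds, so any of them can serve as the relatedness score for the original word pair.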