(Wormuth and Becker, 2004). Thus, it helps users to
structure a domain of interest (Ganter et al., 1997;
Wille, 2009). It models the world of data through the
use of objects and attributes (Cole and Eklund,
1999). Ganter et al. (1999) applied the concept lattice
from formal concept analysis. This approach has the
advantage that users can refine their queries by
searching well-structured graphs. These graphs,
known as formal concept lattices, are composed of a
set of documents and a set of terms. Effectively, it
reduces the need to set bound restrictions for
managing the number of documents to be retrieved
(Tam, 2004).
2.2 Related Work on Similarity
Measures between Two Words
Traditionally, a number of approaches to finding
synonyms have been published. Methodologies to
automatically discover synonyms from large corpora
have been a popular topic in a variety of language
processing tasks (Sánchez and Moreno, 2005;
Senellart and Blondel, 2008; Blondel and Senellart,
2011; Van der Plas and Tiedemann, 2006). There are
two kinds of approaches to identifying synonyms.
The first kind of approach uses a general
dictionary (Wu and Zhou, 2003). In the area of
synonym extraction, it is common to use the lexical
information in a dictionary (Veronis and Ide, 1990).
In the dictionary-based case, similarity is determined
from the definition of each word in the dictionary.
This kind of approach is conducted through learning
algorithms based on the information in the dictionary
(Lu et al., 2010; Vickrey et al., 2010). Wu and Zhou
(2003) proposed a method of synonym identification
using a bilingual dictionary and corpus. The bilingual
approach works as follows: first, the bilingual
dictionary is used to translate the target word;
second, the authors used two bilingual corpora that
mean precisely the same; then, they calculated the
probability of the coincidence degree. The results of
the bilingual method are remarkable in comparison
with the monolingual cases. Another study builds a
graph of lexical information from a dictionary;
similarity is computed for each word but limited to
its nearby words in the graph. This similarity
measure was evaluated on a set of related terms
(Ho and Fairon, 2004).
The second kind of approach to identifying
synonyms considers the context of the target word
and computes the similarity of lexical distributions
in a corpus (Lin, 1998). In the case of distributional
approaches, similarity is determined by context.
Thus, it is important to compute how similar words
are in a corpus. Distributional similarity for synonym
identification is used in order to find related words
(Curran and Moens, 2002).
There have been many works that measure the
similarity of words, such as distributional similarity
(Lin et al., 2003). Landauer and Dumais (1997)
proposed a similarity measure that solves TOEFL
synonym tests by using latent semantic analysis.
Lin (1998) proposed several methodologies to
identify the most probable candidate among similar
words by using a few distance measures. Turney
(2001) presented the PMI-IR method, which is
calculated from web data. He evaluated this measure
on the TOEFL test, in which the system has to select
the most probable synonym candidate among four
words. Lin et al. (2003) proposed two ways of
finding synonyms among distributionally related
words. The first is to examine the overlap of the
translations of semantically similar words in multiple
bilingual dictionaries. The second is to look through
designed patterns so as to filter out antonyms.
There has been much research on measuring
similarity to identify synonyms. However, the
dictionary-based approaches have been applied only
to a specific task or domain (Turney, 2001). Hence,
these existing approaches are hard to apply to the
changeable web. Moreover, the context-based
similarity methods deal with unstructured web
documents and take much time to analyze, since they
need pre-processing such as morphological analysis.
Therefore, this paper proposes a methodology to
automatically measure the semantic similarity
between two words by using keyword-based
structured data from the web.
3 METHOD TO MEASURE
SIMILARITY
In this section, we demonstrate the method to
measure the semantic similarity between two distinct
words. This paper defines a ‘query’ as a target word
for which we would like to compute semantic
similarity. A pair of queries is defined as
Q = (q1, q2), which is the set of two different words
q1 and q2.
The overall procedure to estimate the semantic
similarity between the two queries of Q is composed
of three phases, as shown in Figure 1: the
preprocessing, analysis, and calculation phases. In
the preprocessing phase, base data for the analysis
are collected and refined for each query. Let us
assume that the query pair is Q = (contamination,
pollution). The set of web
WEBIST 2014 - International Conference on Web Information Systems and Technologies