
(Wormuth and Becker, 2004). Thus, it helps users to structure a domain of interest (Ganter et al., 1997; Wille, 2009). It models the world of data through objects and attributes (Cole and Eklund, 1999). Ganter et al. (1999) applied the concept lattice from formal concept analysis. The advantage of this approach is that users can refine their queries by searching well-structured graphs. These graphs, known as formal concept lattices, are composed of a set of documents and a set of terms. Effectively, this removes the need to set bound restrictions for managing the number of documents to be retrieved (Tam, 2004).
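As a minimal sketch of such a document-term concept lattice, the following enumerates all formal concepts of a toy context by closing each term subset under the two derivation operators; the documents, terms, and the naive enumeration strategy are illustrative assumptions, not the cited authors' implementation.

```python
from itertools import chain, combinations

# Toy document-term incidence context (illustrative names only).
context = {
    "doc1": {"water", "pollution"},
    "doc2": {"water", "contamination"},
    "doc3": {"air", "pollution"},
}

def common_terms(docs):
    """Terms shared by every document in `docs` (the ' operator)."""
    sets = [context[d] for d in docs]
    return set.intersection(*sets) if sets else set(chain(*context.values()))

def docs_with(terms):
    """Documents containing every term in `terms` (the ' operator)."""
    return {d for d, ts in context.items() if terms <= ts}

def formal_concepts():
    """Enumerate all (documents, terms) pairs closed under both maps."""
    all_terms = set(chain(*context.values()))
    concepts = set()
    for r in range(len(all_terms) + 1):
        for terms in combinations(sorted(all_terms), r):
            extent = docs_with(set(terms))   # documents having these terms
            intent = common_terms(extent)    # closure: their shared terms
            concepts.add((frozenset(extent), frozenset(intent)))
    return concepts

for extent, intent in sorted(formal_concepts(), key=lambda c: len(c[0])):
    print(sorted(extent), sorted(intent))
```

Each printed pair is one node of the lattice; a user refines a query by moving to a concept with a smaller extent and a larger intent.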
2.2  Related Work on Similarity Measures between Two Words
Traditionally, a number of approaches to finding synonyms have been published. Methodologies for automatically discovering synonyms from large corpora have been a popular topic across a variety of language processing tasks (Sánchez and Moreno, 2005; Senellart and Blondel, 2008; Blondel and Senellart, 2011; Van der Plas and Tiedemann, 2006). There are two kinds of approaches to identifying synonyms.
The first kind of approach uses a general dictionary (Wu and Zhou, 2003). In the area of synonym extraction, it is common to use the lexical information in a dictionary (Veronis and Ide, 1990). In the dictionary-based case, similarity is decided by the definition of each word in a dictionary. This kind of approach is conducted through learning algorithms based on the information in the dictionary (Lu et al., 2010; Vickrey et al., 2010). Wu and Zhou (2003) proposed a method of synonym identification using a bilingual dictionary and corpus. The bilingual approach works as follows: first, the bilingual dictionary is used to translate the target word; second, two bilingual corpora with precisely the same meaning are used; finally, the probability of the coincidence degree between the translations is calculated. The results of the bilingual method are remarkable in comparison with the monolingual cases. Another line of research builds a graph of lexical information from a dictionary, where the similarity computed for each word is limited to nearby words in the graph. This similarity measure was evaluated on a set of related terms (Ho and Fairon, 2004).
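As a rough illustration only, the coincidence idea behind the bilingual method might be sketched as below; the translation sets and probabilities are made-up placeholders, and this is not Wu and Zhou's actual estimation procedure, which learns translation probabilities from bilingual corpora.

```python
# Each word is mapped to a toy probability distribution over its
# translations (invented numbers for illustration only).
translations = {
    "pollution":     {"污染": 0.7, "公害": 0.3},
    "contamination": {"污染": 0.8, "沾染": 0.2},
}

def coincidence(word1, word2):
    """Overlap of two translation distributions (sum of elementwise minima)."""
    t1, t2 = translations[word1], translations[word2]
    return sum(min(t1[t], t2[t]) for t in t1.keys() & t2.keys())

print(coincidence("pollution", "contamination"))  # 0.7 in this toy example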
The second kind of approach to identifying synonyms considers the context of the target word and computes the similarity of lexical distributions from a corpus (Lin, 1998). In the case of distributional approaches, similarity is decided by context; thus, it is important to compute how similar words are within a corpus. Distributional similarity for synonym identification is used to find related words (Curran and Moens, 2002). There have been many works measuring the similarity of words, such as distributional similarity (Lin et al., 2003). Landauer and Dumais (1997) proposed a similarity measure that solves TOEFL synonym tests by using latent semantic analysis. Lin (1998) proposed several methodologies to identify the most probable candidate among similar words by using a few distance measures. Turney (2001) presented the PMI-IR method, which is calculated from web data; he evaluated this measure on the TOEFL test, in which the system has to select the most probable synonym candidate among four words. Lin et al. (2003) proposed two ways of finding synonyms among distributionally related words: the first looks for overlap in the translations of semantically similar words across multiple bilingual dictionaries; the second looks through designed patterns so as to filter out antonyms.
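As a minimal sketch of the PMI-IR idea, the snippet below ranks TOEFL-style candidates by a pointwise-mutual-information score; the hit counts are invented stand-ins for the web search counts Turney obtained from a real search engine.

```python
import math

# Invented counts standing in for web hit counts on a TOEFL-style item.
HITS = {
    "imposed": 1000, "believed": 2000,
    ("levied", "imposed"): 40, ("levied", "believed"): 5,
}

def pmi_ir(problem, choice):
    """Rank candidates by log[hits(problem AND choice) / hits(choice)];
    the hits(problem) term is constant across choices and is dropped."""
    return math.log(HITS[(problem, choice)] / HITS[choice])

# Select the most probable synonym among the candidates.
choices = ["imposed", "believed"]
print(max(choices, key=lambda c: pmi_ir("levied", c)))  # -> imposed
```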
There is a great deal of research on measuring similarity to identify synonyms. However, the dictionary-based approaches have been applied only to specific tasks or domains (Turney, 2001); hence, this existing research is hard to apply to the ever-changing web. Moreover, the context-based similarity methods deal with unstructured web documents and take much time to analyze, since they require pre-treatment such as morphological analysis. Therefore, this paper proposes a methodology to automatically measure the semantic similarity relation between two words by using keyword-based structured data from the web.
3 METHOD TO MEASURE SIMILARITY
In this section, we demonstrate the method to measure semantic similarity between two distinct words. This paper defines a 'query' as a target word for which we would like to compute semantic similarity. A pair of queries is defined as Q = (q1, q2), which is the set of two different words q1 and q2.
The overall procedure for estimating the semantic similarity between the two queries of Q is composed of three phases, as shown in Figure 1: the preprocessing, analysis, and calculation phases. In the preprocessing phase, base data for the analysis are collected and refined for each query. Let us assume that the query pair is Q = (contamination, pollution). The set of web