and Witten, 2008), the anchor texts of links and the
link structure itself are used to measure relatedness.
They use link counts weighted by the probability of
the link occurring on the page (inspired by tf-idf) as a
vector representation of an article and compute the
cosine similarity between these vectors as the similarity
measure. This may look very similar to our approach,
but we apply tf-idf to the terms rather than the links,
together with a directly computable measure for the
link structure, so we can calculate our measure online
with only two requests to the Wikipedia API. Thus, we
combine a link-based measure with the second category,
text based measures.
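The following is a minimal sketch of the tf-idf cosine similarity used on the terms; it is not the system's implementation (which is not shown in this paper), and the use of scikit-learn and the example texts are our own illustrative assumptions:

    # Illustrative sketch: tf-idf cosine similarity between two article texts.
    # Library choice (scikit-learn) and the example strings are assumptions.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    def tfidf_similarity(text_a: str, text_b: str) -> float:
        """Return the cosine similarity of the tf-idf vectors of two texts."""
        vectors = TfidfVectorizer().fit_transform([text_a, text_b])
        return float(cosine_similarity(vectors[0], vectors[1])[0, 0])

    print(tfidf_similarity("bananas contain enzymes",
                           "the enzyme was identified in Berlin"))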
Text based measures take an example corpus of
documents that are known to relate to the two terms
and then calculate the semantic distance between the
two document sets, thereby splitting the problem of
relatedness between terms into two problems: choosing
a suitable data set and calculating the semantic
distance between the documents. There are many
semantic distances to choose from: Lee distance,
tf-idf cosine similarity (Ramos, 2003), Jaro-Winkler
distance, and approximate string matching (Navarro,
2001), to name just a few. In (Islam and Inkpen, 2008),
Semantic Text Similarity (STS) was developed as a
variant of the Longest Common Subsequence (LCS)
algorithm combined with other methods. It is optimized
for very short texts, such as single sentences and
phrases, and was evaluated using definitions from a
dictionary.
Explicit Semantic Analysis (ESA) (Gabrilovich
and Markovitch, 2007) uses Wikipedia, just like our
approach, and calculates a complete matrix of term-to-
concept relatedness, which can be further refined by
introducing human judgments. Unlike our approach,
however, it requires processing the whole of Wikipedia
in a non-linear process, which is very expensive and
has not been replicated at that scale since.
There are other approaches that mix link and
text analysis, such as (Nakayama et al., 2008), which
extracts explicit relationships such as Apple is a Fruit,
Computer is a Machine, or Colorado is a U.S. state. The
goal of this paper, however, is not to use Wikipedia to
find relationships that conform to established standards
and semantics, but quite the opposite: to produce
explanatory text suited for unusual relationships.
3 RELATIONSHIP EXTRACTION
3.1 Architecture
The RelationWik Extractor was built as a web
information system. From a user's point of view, its
function is quite simple. The articles for which a
relationship is sought, along with a few parameters, are
entered on a web site, and the system shows the results
as both a score and snippets illustrating the connection
from both sides.
The Wikipedia articles are downloaded directly
via the Wikipedia API. The text is then scanned for
additional information such as links, templates, etc.,
and stripped of its Wikipedia syntax. Both the text and
the meta-information are stored in a database cache.
The results of the algorithms are visualized with PHP
and the Google Chart API.
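As an illustration only, a hedged sketch of fetching an article's raw wikitext over the MediaWiki API follows; the system's own download, caching, and syntax-stripping code is not shown in this paper, and the function name below is hypothetical:

    # Sketch: downloading an article's wikitext via the MediaWiki API (action=parse).
    # The caching and wiki-syntax stripping performed by the actual system are
    # only hinted at in the trailing comment; this is not the system's code.
    import requests

    API_URL = "https://en.wikipedia.org/w/api.php"

    def fetch_wikitext(title: str) -> str:
        """Return the raw wikitext of a Wikipedia article."""
        params = {"action": "parse", "page": title,
                  "prop": "wikitext", "format": "json"}
        response = requests.get(API_URL, params=params, timeout=10)
        response.raise_for_status()
        return response.json()["parse"]["wikitext"]["*"]

    wikitext = fetch_wikitext("Banana")
    # Links appear as [[Target|label]] in the wikitext, so they can be collected
    # as meta-information before the markup is stripped for the text analysis.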
3.2 Calculating Relatedness
For the actual calculation of the relationship, two dif-
ferently approaches are used. One is based on the link
structure and the other on the textual closeness of the
texts. A third approach is a mixture of both.
The first algorithm measures the connectedness of
the terms by studying inlinks and outlinks. When
looking at connections that span several hops, it
becomes clear that the connection can be quite thin.
For example, Banana and Berlin are connected by an
enzyme that occurs in the banana and was identified
in Berlin. This gets worse for connections with even
more intermediaries. Therefore, we decided to ignore
all connections involving more than one intermediary.
The connections with one intermediary that made the
most sense occurred in the scenario where both articles
link to the same article. This occurs, for example, when
both of the given articles A and B are connected to a
category or another larger super-concept by linking to
it. In terms of computation time, it is also the fastest
possible link analysis, since outlinks are the easiest
links to extract.
Following this argumentation, we only look at
articles that either have intersecting outgoing links or
have a link from A to B or from B to A. All other
pairs are given a relatedness of zero. Connected articles
A and B receive a base relatedness b of 0.5 and a
boost of 0.1 for each additional connection (e.g. 0.7
when A and B link to each other and also link to at
least one third article). This base value is further
modified by the number of backlinks of the linked-to
article in relation to the links from the other article.
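A minimal sketch of this base score, written by us for illustration rather than taken from the system, is given below; the final adjustment by backlink counts is only indicated by a comment, since its exact formula is not spelled out in this section:

    # Sketch of the link-based base relatedness: 0.5 for any connection,
    # plus 0.1 for each additional type of connection. Function and
    # parameter names are illustrative, not the system's.
    def link_relatedness(title_a: str, outlinks_a: set,
                         title_b: str, outlinks_b: set) -> float:
        """Base relatedness from direct links and shared outgoing links."""
        connections = 0
        if title_b in outlinks_a:      # A links to B
            connections += 1
        if title_a in outlinks_b:      # B links to A
            connections += 1
        if outlinks_a & outlinks_b:    # both link to at least one third article
            connections += 1
        if connections == 0:
            return 0.0                 # unconnected pairs score zero
        score = 0.5 + 0.1 * (connections - 1)
        # score would then be further modified by the backlinks of the
        # linked-to article in relation to the links from the other article
        return score

With all three connection types present, this yields 0.5 + 2 * 0.1 = 0.7, matching the example given above for two articles that link to each other and share at least one outgoing link.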