cover various forms and inflections of the words
that represent a concept. For example, inflections of
the word 'molecule', such as 'molecules' and
'molecular', appear as anchor texts in Wikipedia
articles that link to the 'Molecule' article. Anchor
text also captures structurally different phrases
with similar meanings (for example, the anchor texts
"possessed by the devil", "Demonic possession" and
"Control over human form" link to the same article
and thus have similar meanings), as well as
acronyms (the anchor text 'WWW' links to the
article 'World Wide Web').
The Wikipedia concept vocabulary is stored in a
dictionary-based data structure along with the target
articles it represents. Article titles and redirect
titles implicitly have a one-to-one relationship with
concepts, but the same anchor text can have more than
one target concept. To resolve this
anchor-to-multi-concept relation, we employ a simple
measure that establishes one target article per
anchor text in the vocabulary set: for every
multi-concept anchor, the article that is linked
most often by that anchor text across all Wikipedia
inter-article links is taken as its final target
concept. This may seem a trivial way to resolve
target conflicts for anchor text, but it saves the
significant computational expense that a more
adaptive but more complex approach such as TF-IDF
would incur.
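The most-frequent-target rule described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `resolve_anchor_targets`, the `(anchor_text, target_article)` input format and the sample links (including the 'WWW (film)' target) are assumptions made for the example.

```python
from collections import Counter, defaultdict


def resolve_anchor_targets(links):
    """Map each anchor text to its most frequently linked target article.

    `links` is an iterable of (anchor_text, target_article) pairs, one pair
    per inter-article link (an assumed input format).
    """
    counts = defaultdict(Counter)
    for anchor, target in links:
        counts[anchor][target] += 1
    # most_common(1) picks the highest count; on a tie, Counter keeps the
    # target that was seen first.
    return {anchor: c.most_common(1)[0][0] for anchor, c in counts.items()}


# Hypothetical link data for illustration only.
links = [
    ("WWW", "World Wide Web"),
    ("WWW", "World Wide Web"),
    ("WWW", "WWW (film)"),
    ("molecules", "Molecule"),
]
vocabulary = resolve_anchor_targets(links)
```

Each anchor costs one counter update per link, which is the source of the low computational cost noted above; the result is a plain dictionary from anchor text to a single target article.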
3.3 Mining Concept Thesaurus from
Wikipedia Links
We condense the knowledge of Wikipedia article
content and links into a Concept Thesaurus (CT),
which is not just a set of synonyms but a set
capturing all logical relations between concepts
(relations like ‘water’ to ‘ocean’). Analyzing article
text shows that it contains many such
logically related concept phrases. Many of these
phrases are turned into hyperlinks that connect to
their own articles. We capitalize on these links and
narrow our focus to mutually linked articles,
which we treat as related concepts because the
relation is validated by a two-way link created by
human intelligence. We also pay attention to linked
concepts that share a common domain, in other
words, that belong to common categories.
3.3.1 Cross Link Analysis
The first step examines a concept and all the
inter-article links from the article explaining that
concept (we call the article under examination
'A', and also use 'A' to denote the concept itself).
Among the articles linked from 'A', those
that link back to 'A' represent mutually
related concepts.
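As a minimal sketch of this step, assume the link structure is available as a dictionary mapping each article title to the set of titles it links to. The name `cross_linked`, the dictionary layout and the sample articles are illustrative assumptions, not part of the original method description.

```python
def cross_linked(article, out_links):
    """Return the articles linked from `article` that also link back to it.

    `out_links` maps an article title to the set of titles it links to
    (an assumed representation of the inter-article link structure).
    """
    return {
        other
        for other in out_links.get(article, set())
        if other != article and article in out_links.get(other, set())
    }


# Hypothetical link data for illustration only.
out_links = {
    "Water": {"Ocean", "Hydrogen", "Ice"},
    "Ocean": {"Water", "Sea"},
    "Hydrogen": {"Water"},
    "Ice": {"Snow"},
}
mutual = cross_linked("Water", out_links)
```

Here 'Ice' is dropped because it does not link back to 'Water', while 'Ocean' and 'Hydrogen' are kept as mutually related concepts.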
3.3.2 Link Co-category Analysis
In the next step, we look into one-way links from
'A'. This step also involves the category pages linked
to 'A' (the categories of 'A'). An article that has a link
from 'A' and belongs to one of the categories of 'A'
represents a related concept. Note that
a small but significant number of
Wikipedia categories relate to article status and
do not indicate a concept domain (categories like
‘stub article’, ‘articles to be deleted’ etc.). We
implement a filter that excludes such categories
from the analysis.
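The co-category check with the status-category filter might look like the sketch below. The names `co_category_related` and `STATUS_CATEGORIES`, the category labels and the sample data are all hypothetical. For simplicity the sketch tests every article linked from 'A' and does not separate out those that also link back.

```python
# Assumed labels for status categories; the real filter list is not
# given in the text.
STATUS_CATEGORIES = {"Stub articles", "Articles to be deleted"}


def co_category_related(article, out_links, categories,
                        status_categories=STATUS_CATEGORIES):
    """Return articles linked from `article` that share a domain category.

    `out_links` maps a title to the set of titles it links to, and
    `categories` maps a title to its set of category names (both assumed
    representations).
    """
    # Drop status categories so they cannot create a spurious match.
    domain = categories.get(article, set()) - status_categories
    return {
        other
        for other in out_links.get(article, set())
        if other != article and domain & categories.get(other, set())
    }


# Hypothetical data for illustration only.
out_links = {"Water": {"Hydrogen peroxide", "Ice", "Ocean"}}
categories = {
    "Water": {"Hydrogen compounds", "Stub articles"},
    "Hydrogen peroxide": {"Hydrogen compounds"},
    "Ice": {"Stub articles"},
    "Ocean": {"Oceanography"},
}
related = co_category_related("Water", out_links, categories)
```

'Ice' shares only a status category with 'Water', so the filter excludes it; without the filter it would be reported as related.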
3.3.3 Relation Specific (RS) Score
After determining related concepts, we determine how
closely related two articles are, based on the
overall Wikipedia link structure as well as text
analysis of the articles related to 'A'. All candidate
concepts related to 'A' are given an RS score. If an
article 'AR' is related to 'A' (and hence the concepts
represented by 'A' and 'AR' are related), then
RS ('A', 'AR') = count (title ('A'), text ('AR')) / inLink ('AR')
where RS (‘A’, ‘AR’) is the RS score of the relation
between ‘A’ and ‘AR’, count (str, tx) is the number
of times the string ‘str’ appears in the text ‘tx’,
and inLink (article) is the number of backward links
to the article. Ding et al. (Ding, 2005) explain a
backward link as a simple relation between articles
‘A1’ and ‘A2’: ‘A1’ has a backward link from ‘A2’
if ‘A2’ has a hyperlink targeting ‘A1’.
Articles with a relatively high number of
other articles linking to them represent more
generic concepts and are therefore given a low RS
score by the above formula. A complete analysis of
the Wikipedia article link structure, category network
and article text using the above three steps yields a
CT of related concepts with their respective RS scores.
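The RS formula can be illustrated with a short sketch. The text does not specify how occurrences are counted, so the case-insensitive, non-overlapping substring match used here is an assumption, as are the function names, the zero in-link guard and the sample text.

```python
def count_occurrences(needle, text):
    """Count case-insensitive, non-overlapping occurrences of `needle`.

    The matching rule is an assumption; the source only says "number of
    times the string appears in the text".
    """
    return text.lower().count(needle.lower())


def rs_score(title_a, text_ar, in_links_ar):
    """RS(A, AR) = count(title(A), text(AR)) / inLink(AR)."""
    if in_links_ar == 0:
        # Assumed guard: a related article normally has at least one
        # backward link, so this branch should rarely be reached.
        return 0.0
    return count_occurrences(title_a, text_ar) / in_links_ar


# Hypothetical article text and in-link counts for illustration only.
text_ocean = "The ocean holds most of Earth's water. Sea water is salty."
specific = rs_score("Water", text_ocean, in_links_ar=4)
generic = rs_score("Water", text_ocean, in_links_ar=40)
```

With the same number of title mentions, the article with ten times as many backward links receives a score ten times lower, which is how the formula penalizes generic concepts.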
Algorithm 1 implements the CT extraction
process. Lines 7 to 14 extract cross linked articles
for every concept. The link co-category analysis is
carried out by lines 15 to 20. Lines 21 to 29
determine the RS scores for all related concepts. The
output of this algorithm is the final CT.