reflect some of the graph properties (network den-
sity, node depth, link strength, etc.) (Jiang and
Conrath, 1997).
3. Hybrid: a combination of the former two (Jiang
and Conrath, 1997; Leacock and Chodorow,
1998).
Word sense ambiguity exists in all natural languages.
Several works have tried to solve this problem
using Wikipedia. One of the first approaches
compares the lexical context around the ambiguous
concept with the disambiguation candidates found in
Wikipedia (Bunescu and Pasca, 2006).
Some authors explored combining Wikipedia labels,
the definitions on disambiguation pages, and WordNet
definitions to learn the true meaning of the
sentences (Mihalcea, 2007).
Lexical databases, such as WordNet, have been
explored as knowledge bases to measure the semantic
similarity between words or expressions. However,
WordNet provides generic definitions and a somewhat
rigid categorization that does not reflect the intuitive
semantic meaning that a human might assign to a con-
cept.
Other works in this particular field aim to combine
the traditional approaches with Wikipedia information
as an auxiliary source to improve the results
(Ratinov et al., 2011). One of the most common
problems with such approaches is the time needed to
perform the computation. With that in mind, the tests
were reduced to a limited set of Wikipedia
information.
Based on this overview, we believe that great
progress on the disambiguation problem using
Wikipedia as a base is still achievable.
3 PROBLEM
Currently, Wikipedia is mainly used as a tool to
extract semantic knowledge. It has over 4 million
articles and a well-structured category network,
which allows us to extract the information necessary
to disambiguate an ambiguous term.
In our particular case, we have a generic entity
described by a list of features. We want to find a
Wikipedia article that represents the semantic
content of each feature. The problem is that some
topics lead us to a disambiguation page
2
, a
non-article page that lists the various meanings of
the ambiguous term and links to the articles that
cover them.
2
http://en.wikipedia.org/wiki/Wikipedia:Disambiguation
The challenge is to find the most appropriate
article from the list provided by the disambiguation
page within acceptable time and with acceptable
efficiency.
To disambiguate an article, a context is first
required. This context consists of all the
non-ambiguous articles in the list of features. It
is used to calculate the proximity between the
context and every candidate article listed on the
disambiguation page; the article closest to the
context should be the most suitable one.
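The selection step above can be sketched as follows. This is a minimal illustration, not the paper's actual proximity function: the bag-of-words representation and the Jaccard overlap measure are assumptions made for the example.

```python
# Sketch of the context-based selection step: build a context from the
# non-ambiguous articles and pick the candidate closest to it.
# The similarity measure (Jaccard overlap over bags of words) is an
# illustrative assumption, not the method described in the paper.

def bag_of_words(text):
    """Lowercased token set as a crude bag-of-words representation."""
    return set(text.lower().split())

def jaccard(a, b):
    """Jaccard similarity between two token sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def pick_closest(context_texts, candidate_texts):
    """Return the candidate article text closest to the combined context."""
    context = set()
    for text in context_texts:
        context |= bag_of_words(text)
    return max(candidate_texts,
               key=lambda cand: jaccard(context, bag_of_words(cand)))
```

Any similarity function over article representations could be substituted for `jaccard` without changing the overall selection scheme.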
In short, our problem is to find Wikipedia articles
that semantically represent the features and disam-
biguate the ambiguous articles quickly and efficiently.
4 PROPOSED DISAMBIGUATION
METHOD
Search for Articles
Considering the problem described in the previous
section, it is first necessary to find an article
that semantically represents each feature. There are
two basic ways to find articles from a feature:
1. Find a Wikipedia article directly from the
feature by literally comparing the text of the
feature with the title of the article.
2. Decompose the feature in order to obtain simpler
sub-features and use them to find the Wikipedia
articles. This technique can lead to semantic
deviations, so it should be avoided or carefully
handled.
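The first strategy, direct title matching, can be sketched as below. The in-memory title index and the naive singular/plural variants are illustrative assumptions; in practice the lookup would go against the Wikipedia title index itself.

```python
# Sketch of direct title matching: compare the feature text literally
# (case-insensitively) against article titles, also trying a naive
# plural and singular variant. 'titles' stands in for an assumed index
# of Wikipedia article titles.

def find_article(feature, titles):
    """Return the title matching the feature literally, or None."""
    lowered = {t.lower(): t for t in titles}
    for variant in (feature, feature + "s", feature.rstrip("s")):
        hit = lowered.get(variant.lower())
        if hit:
            return hit
    return None
```

Real English pluralization is of course more involved; the point is only that the literal comparison needs no decomposition of the feature.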
The solution we found was to develop a set of
methods that find the correct article for about 60%
or more of the features.
The developed methods are:
• Direct Search: Finds a Wikipedia article by
literally comparing the text of the feature, in both
singular and plural form, with the title of the
article. This method is reused by the other methods.
• Regex
3
: Finds regular expressions in the text of
the feature and treats the feature according to the
type of regular expression found. Most of the
regexes are designed to decompose features
containing the word "and" or punctuation marks such
as commas and colons in their text; these elements
are very common and easily decomposed because they
generate predictable structures.
3
http://en.wikipedia.org/wiki/Regular_expression
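The regex-based decomposition of the bullet above can be sketched as follows. The split pattern is an illustrative assumption: the actual regular expressions used by the method are not given in the text.

```python
import re

# Illustrative sketch of the Regex method's decomposition step: split a
# feature's text on the word "and", commas, and colons to obtain the
# simpler sub-features mentioned above. The exact patterns used by the
# paper's method are an assumption here.
SPLIT_PATTERN = re.compile(r"\s*(?:\band\b|,|:)\s*")

def decompose(feature):
    """Split a feature into candidate sub-features on 'and', ',' and ':'."""
    parts = SPLIT_PATTERN.split(feature)
    return [p for p in parts if p]  # drop empty fragments
```

Each resulting sub-feature can then be passed to Direct Search; the word boundaries (`\b`) keep words such as "sand" from being split on the embedded "and".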
KDIR 2014 - International Conference on Knowledge Discovery and Information Retrieval