Human readers would quickly notice that the
sentence is a definition of a politician, despite
the missing concept and words. This is made
possible due to the existence of two dependency
paths: ROOT → is → politician, and
politician → leader → of. The former acts as a
context indicator, signalling the type of definiendum
being described, whereas the latter yields content that
is very likely to be found across descriptions of this
particular context indicator (politician). A key
difference from the vast majority of TREC systems
is that the inference is drawn by using contextual
information conveyed by several descriptions of
politicians, instead of by using additional sources
that provide information about a particular
definiendum (e.g., “Gordon Brown”).
In our approach, context indicators and their
corresponding dependency paths are learnt from
abstracts provided by Wikipedia. Specifically,
contextual n-gram language models are constructed
on top of these contextual dependency paths in
order to recognise sentences conveying definitions.
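As an illustration of this step, the following is a minimal sketch of a contextual n-gram model over dependency-path tokens, here a bigram model with add-one smoothing trained separately per context indicator. The class name, the toy training paths, and the smoothing choice are assumptions for the sketch, not the paper's actual implementation:

```python
import math
from collections import Counter, defaultdict

class PathBigramModel:
    """Toy bigram language model over dependency-path tokens,
    trained separately per context indicator (an assumed
    simplification of the paper's contextual n-gram models)."""

    def __init__(self):
        self.bigrams = defaultdict(Counter)   # context -> (tok_a, tok_b) counts
        self.unigrams = defaultdict(Counter)  # context -> tok counts

    def train(self, context, paths):
        # a path is a token list, e.g. ["politician", "leader", "of"]
        for path in paths:
            tokens = ["<s>"] + path + ["</s>"]
            for a, b in zip(tokens, tokens[1:]):
                self.bigrams[context][(a, b)] += 1
                self.unigrams[context][a] += 1

    def score(self, context, path, alpha=1.0):
        """Add-one-smoothed log-probability of a path under a context."""
        tokens = ["<s>"] + path + ["</s>"]
        vocab = max(len(self.unigrams[context]), 1)
        logp = 0.0
        for a, b in zip(tokens, tokens[1:]):
            num = self.bigrams[context][(a, b)] + alpha
            den = self.unigrams[context][a] + alpha * vocab
            logp += math.log(num / den)
        return logp

model = PathBigramModel()
model.train("politician", [["politician", "leader", "of"],
                           ["politician", "member", "of"]])
seen = model.score("politician", ["politician", "leader", "of"])
unseen = model.score("politician", ["politician", "tributary", "of"])
# a path observed in training scores higher than an unseen one
```

Under this scheme, a candidate sentence whose extracted paths score highly under some context's model is more likely to be a definition in that context.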
Unlike other QA systems (Hildebrandt et al., 2004),
definition patterns are applied at the surface level
(Soubbotin, 2001), and key named entities are
identified using named-entity recognizers (NER) [1].
Preprocessed sentences are then parsed by using a
lexicalised dependency parser [2], and the resulting
lexical trees are used for building a treebank of
lexicalised definition sentences. As an example, the
following trees extracted from the treebank represent
two highly-frequent definition sentences:
(1) Concept was born in Entity, Entity.
(2) Concept is a tributary of the Entity in Entity.
The treebank contains trees for 1,900,642 different
sentences in which each entity is replaced with a
placeholder. This placeholder allows us to reduce the
sparseness of the data and to obtain more reliable
frequency counts. For the same reason, we did not
distinguish between different categories of entities,
and capitalised adjectives were mapped to the same
placeholder.
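The masking step might be approximated as follows. The capitalised-token heuristic below is only a rough stand-in for the NER-based replacement described above, and the function name and example are illustrative:

```python
import re

def mask_entities(sentence, concept):
    """Replace the definiendum with 'Concept' and remaining runs of
    capitalised tokens with 'Entity' (a rough stand-in for the
    NER-based replacement; a real system would use NER labels)."""
    masked = sentence.replace(concept, "Concept")
    # collapse runs of capitalised words, skipping the placeholders
    pattern = r"\b(?!Concept\b)(?!Entity\b)[A-Z]\w*(?:\s+[A-Z]\w*)*"
    return re.sub(pattern, "Entity", masked)

result = mask_entities(
    "Gordon Brown is a politician and leader of the Labour Party.",
    "Gordon Brown")
print(result)
# -> Concept is a politician and leader of the Entity.
```

Note that this crude heuristic would also mask capitalised sentence-initial words; the NER-based step it stands in for does not have that problem.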
From the sentences in the treebank, our method
identifies potential Context Indicators. These involve
words that signal what is being defined or what type
of descriptive information is being expressed.
Context indicators are recognised by walking
through the
dependency tree starting from the root node. Since
only sentences matching definition patterns are taken
into account, there are some regularities that help to
find the respective context indicator. Occasionally,
the root node itself is a context indicator. However,
whenever the root node is a word contained in the
surface patterns (e.g., is, was, and are), the method
walks down the hierarchy. In the case that the root
has several children, the first child (different from
the concept) is interpreted as the context indicator.
Note that the method must sometimes go down one
more level in the tree, depending on the expression
holding the relationship between nodes (i.e.,
“part/kind/sort/type/class/first of”). Furthermore, the
lexical parser used outputs trees that meet the
projection constraint, hence the word order of the
sentence is preserved. Overall, 45,698 different
context indicators were obtained during parsing.
Table 1 shows the most frequent indicators acquired
with our method, where P(c_s) is the probability of
finding a sentence triggered by the context indicator
c_s within the treebank.

[1] http://nlp.stanford.edu/software/CRF-NER.shtml
[2] http://nlp.stanford.edu/software/lex-parser.shtml
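These selection rules can be sketched as a short tree walk. The node layout (`{"word": ..., "children": [...]}` with children kept in surface order) and all names below are assumptions for the sketch, not the paper's data structures:

```python
# Hypothetical node layout: {"word": str, "children": [nodes]},
# children kept in surface (projective) order.
SURFACE_ROOTS = {"is", "was", "are"}
RELATIONAL = {"part", "kind", "sort", "type", "class", "first"}

def context_indicator(root):
    """Walk down from the root until a content word is reached
    (a sketch of the selection rules described above)."""
    node = root
    # step below copular roots that belong to the surface patterns
    if node["word"].lower() in SURFACE_ROOTS:
        candidates = [c for c in node["children"] if c["word"] != "Concept"]
        if not candidates:
            return node["word"]
        node = candidates[0]
    # go one level further for "part/kind/sort/type/class/first of"
    if node["word"].lower() in RELATIONAL and node["children"]:
        of = next((c for c in node["children"]
                   if c["word"].lower() == "of"), None)
        if of and of["children"]:
            node = of["children"][0]
    return node["word"]

tree = {"word": "is",
        "children": [{"word": "Concept", "children": []},
                     {"word": "politician",
                      "children": [{"word": "leader", "children": []}]}]}
print(context_indicator(tree))  # -> politician
```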
Table 1: Some Interesting Context Indicators.

Indicator   P(c_s) × 10^4    Indicator      P(c_s) × 10^4
born        1,5034           company        1,32814
album       1,46046          game           1,31932
member      1,45059          organization   1,31836
player      1,38362          band           1,31794
film        1,37389          song           1,3162
town        1,37243          author         1,31601
school      1,35213          term           1,31402
village     1,35021          series         1,31388
station     1,34465          politician     1,30075
son         1,33464          group          1,29767
Next, candidate sentences are grouped according to
the obtained context indicators. Consequently, highly
frequent directed dependency paths within a
particular context are hypothesised to significantly
characterise the meaning when describing an instance
of the corresponding context indicator. This is
strongly based on the extended distributional
hypothesis (Lin and Pantel, 2001), which states that
if two paths tend to occur in similar contexts, their
meanings tend to be similar. In addition, the
relationship between two entities in a sentence is
almost exclusively concentrated in the shortest path
between the two entities in the undirected version of
the dependency graph (Bunescu and Mooney, 2005).
Hence, one entity can be interpreted as the
definiendum, and the other can be any entity within
the sentence. Therefore, paths linking a particular
type of definiendum with a class of entity relevant to
its type will be highly frequent in the context
(e.g., politician → leader → of → ENTITY).
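One way to realise this counting is to enumerate directed root-to-descendant paths in each tree and tally them per context. The dictionary-based node layout (`{"word": ..., "children": [...]}`), the length bounds as parameters, and the helper names are all illustrative assumptions:

```python
from collections import Counter

def directed_paths(node, min_len=2, max_len=5):
    """Enumerate directed token paths with min_len..max_len nodes,
    starting at every node of the tree (hypothetical node layout:
    {"word": str, "children": [nodes]})."""
    paths = []

    def walk(n, prefix):
        prefix = prefix + [n["word"]]
        if min_len <= len(prefix) <= max_len:
            paths.append(tuple(prefix))
        if len(prefix) < max_len:
            for c in n["children"]:
                walk(c, prefix)

    def start(n):
        walk(n, [])          # begin a path at this node
        for c in n["children"]:
            start(c)         # ...and at every descendant

    start(node)
    return paths

# toy tree for the paper's running example
tree = {"word": "politician",
        "children": [{"word": "leader",
                      "children": [{"word": "of",
                                    "children": [{"word": "ENTITY",
                                                  "children": []}]}]}]}
counts = Counter()
counts.update(directed_paths(tree))
print(counts[("politician", "leader", "of")])  # -> 1
```

Aggregating such counters over all sentences grouped under one context indicator yields the per-context path frequencies the method relies on.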
For each context, all directed paths containing two
to five nodes are extracted. Longer paths are not taken
into consideration as they are likely to indicate weaker
USING DEPENDENCY PATHS FOR ANSWERING DEFINITION QUESTIONS ON THE WEB