SURFACE LANGUAGE MODELS FOR DISCOVERING
TEMPORALLY ANCHORED DEFINITIONS ON THE WEB
Producing Chronologies as Answers to Definition Questions
Alejandro Figueroa
Deutsches Forschungszentrum für Künstliche Intelligenz - DFKI, Stuhlsatzenhausweg 3, D-66123, Saarbrücken, Germany
Keywords:
Web question answering, Definition questions, N-gram language models, Temporally anchored definitions.
Abstract:
This work presents a data-driven definition question answering (QA) system that outputs a set of temporally
anchored definitions as answers. This system builds surface language models on top of a corpus automatically
acquired from Wikipedia abstracts, and ranks answer candidates in agreement with these models afterwards.
Additionally, this study deals at greater length with the impact of several surface features in the ranking of
temporally anchored answers.
1 INTRODUCTION
Nowadays, the systematic growth and diversification
of information, published daily on the web, poses
continuing and strong challenges. One of these ma-
jor challenges is assisting users in finding relevant an-
swers to natural language queries such as definition
questions (e.g., “Who is Flavius Josephus?”).
In practical terms, open-domain QA is situated at
the frontier of Natural Language Processing (NLP)
and modern information retrieval, being an appealing
alternative to the retrieval of full-length documents.
Strictly speaking, users of QA systems specify their
information needs in the form of natural-language
questions. This means they eliminate any artificial
constraints or features that are sometimes imposed by
a particular input syntax (e.g., boolean operators).
More often than not, QA systems take advantage
of the fact that answers to specific questions are fre-
quently concentrated in small fragments of text doc-
uments. This advantage helps QA systems to return
brief answer strings extracted from the text collection.
In a sense, it is up to the QA system to analyse the
content of full-length documents and identify these
small and relevant text fragments.
Essentially, definition questions are a particular
category of fact–seeking queries about some topic or
concept (a.k.a. definiendum). This type of query has
become especially interesting in recent years, because
of the increasing number of submissions by users to
web search engines (Rose and Levinson, 2004).
More specifically, definition QA systems aim
usually at finding a set of relevant and/or factual
pieces of information (a.k.a. nuggets) about the
definiendum. Nuggets are comprised of distinct
kinds of pieces of information including relations
with people or locations, biographical events and
special attributes. Some illustrative nuggets about the
definiendum “Flavius Josephus” are as follows:
also known as Yosef Ben Matityahu
Jewish
historian and apologist
born AD 37
recorded the destruction of Jerusalem in AD 70
wrote the Jewish War in AD 75
wrote Antiquities of the Jews in AD 94
In 71 became Roman citizen
Since it is necessary to provide enough context to
ensure readability, definition QA systems output the
corresponding set of sentences. The complexity of
this task, however, is causing definition QA systems
to split the problem into subtasks that independently
address distinct nugget types. This work deals with
one of those classes: biographical events.
Biographical events, like “born AD 37”, are the part of
answers to definition questions that encompasses
chronologies of the most remarkable events or
achievements related to the definiendum.
2 RELATED WORK
For starters, (Alonso et al., 2009) introduced the no-
tion of the time-centred snippet as a useful way of rep-
resenting documents for exploratory search and doc-
ument retrieval. The core idea is profiting from sen-
tences that carry relevant units of time (chronons) for
building document surrogates.
Specifically, (Alonso et al., 2009) noticed that
chronons can be incorporated into web-pages as meta-
data or in the form of temporal expressions. The vital
aspect of these chronons is their relevance for presen-
tation and for highlighting the importance of a doc-
ument given a query. More precisely, they are a key
factor in the construction of more descriptive snippets
that include essential temporal information. In or-
der to detect chronons, (Alonso et al., 2009) analysed
documents for detecting temporal anchors by means
of time-based linguistic tools.
As for selected sentences, (Alonso et al., 2009)
only took into consideration sentences containing ex-
plicit temporal expressions. The length of these se-
lected sentences was bound. For the purpose of rank-
ing sentences, (Alonso et al., 2009) made allowances
for the position of the temporal expression within the
sentence, the sentence's position and length, and
features regarding the particular chronon: appearance
order and its frequency in the document and within
the sentence. Since (Alonso et al., 2009) applied
this ranking function to a web-corpus, the features
they utilised were chiefly on the surface level. Sentences
are thus ranked, and the top-ranked ones are sorted and
presented as a temporal snippet. An interesting finding
of (Alonso et al., 2009) is the fact that users were
concerned about the lack of time-sensitive informa-
tion, that is, they were keen on seeing time-sensitive in-
formation within search results. In particular, users
found temporally anchored snippets as surrogates of
documents very useful and the presentation of sorted
temporal information interesting.
By contrast, (Paşca, 2008) utilised temporally an-
chored text snippets to answer definition questions.
The difference between both strategies lies in the fact
that temporally anchored answers to definition ques-
tions must be biographical, and the chronon must be
closely related to the definiendum, whereas tempo-
rally anchored sentences representing a document can
be more diverse in nature.
Essentially, (Paşca, 2008) also focused on tech-
niques that lack deep linguistic processing for discov-
ering temporally anchored answers. Frequently, this
type of answer must be extracted from several docu-
ments, not only because of completeness, but also as
a means to increase the redundancy. In this way def-
inition QA systems boost the probability of detecting
a larger set of reliable and diverse answers that are
temporally anchored, and build richer chronologies
afterwards. Therefore, definition QA systems require
efficient strategies that can quickly process massive
collections of documents. In particular, (Paşca, 2008)
processed one billion documents corresponding to the
2003 Web snapshot of Google. To be more precise,
they solely used HTML tags removal, sentence detec-
tion and part-of-speech (POS) tags.
In addition, (Paşca, 2008) took advantage of a re-
stricted set of regular expressions to detect dates: iso-
lated year (four-digit numbers, e.g., 1977); or simple
decade (e.g., 1970s); or month name and year (e.g.,
January 1534); or month name, day number and year
(e.g., August, 1945). In order to increase the accu-
racy of their date matching strategy, potential dates
are discarded if they are immediately followed by a
noun or noun modifier, or immediately preceded by a
noun. Further, four lexico-syntactic surface patterns
were used for selecting answer candidates:
P1: <Date [,|-|(|nil] [when] Snippet [,|-|)|.]>
P2: <[StartSent] [In|On] Date [,|-|(|nil] Snippet [,|-|)|.]>
P3: <[StartSent] Snippet [in|on] Date [EndSent]>
P4: <[Verb] [OptionalAdverb] [in|on] Date>
As a means to avoid overmatching sentences formed by
complex linguistic phenomena, they required nuggets to
contain a verb and to carry no pronoun. (Paşca, 2008)
additionally ensured that both P2 and P3 match the start
of the sentence, and that the nugget in P4 contains a noun
phrase. Since the aim is building a method with limited
linguistic knowledge, this noun phrase was approximated by
the occurrence of a noun, adjective or determiner.
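To make the kind of date matching described above concrete, the following is a hedged sketch of such regular expressions in Python; the exact expressions used by (Paşca, 2008) are not reproduced in the paper, so the patterns and the function name below are illustrative assumptions:

import re

MONTH = (r"(January|February|March|April|May|June|July|August|"
         r"September|October|November|December)")

DATE_PATTERNS = [
    re.compile(MONTH + r" \d{1,2}, \d{4}"),   # month name, day number and year
    re.compile(MONTH + r" \d{4}"),            # month name and year
    re.compile(r"\b\d{4}s\b"),                # simple decade, e.g., 1970s
    re.compile(r"\b\d{4}\b"),                 # isolated year, e.g., 1977
]

def find_date(sentence):
    # Return the first date-like match in the sentence, or None.
    for pattern in DATE_PATTERNS:
        match = pattern.search(sentence)
        if match:
            return match.group(0)
    return None

The patterns are tried from the most to the least specific, so a decade such as "1970s" is not mistaken for an isolated year; the noun-context filters described above are not modelled here.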
Also, (Paşca, 2008) biased their ranking strategy
in favour of: (a) snippets contained in a higher num-
ber of documents, and (b) snippets that carry fewer
non-stop terms. By the same token, they preferred
snippets that matched query words as a term to scat-
ter query matches. Lastly, (Paşca, 2008) ranked dates
in accordance with the relevance of the snippets sup-
porting that date, and in each date, snippets are also
ranked relatively to one another.
3 CORPUS ACQUISITION
Contrary to (Paşca, 2008; Alonso et al., 2009), our
approach is data-driven. In short, it aims essentially
at learning regularities from training sentences (positive
examples) that are deemed to convey temporally anchored
information about the definiendum. More precisely, these
positive examples are acquired from
abstracts provided by the January 2008 snapshot of
Wikipedia.
This acquisition process is motivated by the fol-
lowing observations: (a) sentences within abstracts
are more reliable than those in the body of the article,
and (b) sentences in abstracts are very likely to yield
strongly related descriptions about the topic of the ar-
ticle (definiendum). Consequently, sentences carrying
dates across Wikipedia abstracts are likely to yield
temporally anchored definitions about their respective
topic.
Specifically, Wikipedia abstracts are extracted by means of
the structure supplied by the articles. By abstract, we mean
the first section of the document, which is typically a
succinct summary of the key aspects and the most important
achievements and events of the corresponding topic. Also,
it is worth noting that only articles fulfilling the
following criteria were taken into consideration:
1. Definiendums whose orthographical composition consists
solely of numbers, letters, hyphens and periods.
2. Definiendums not corresponding to purpose-built pages
such as lists (e.g., “List of economists”) and categories
(e.g., “Category of magmas”).
Subsequently, selected abstracts are preprocessed as
follows. Firstly, sentences are identified by means of
JavaRap (http://wing.comp.nus.edu.sg/qiu/NLPTools/).
Secondly, sentences carrying dates are recognised by means
of the SuperSense Tagger
(http://sourceforge.net/projects/supersensetag/), and those
not containing dates formed by at least one number (e.g.,
1930 and March 1945) were eliminated. Thirdly, co-references
are resolved; for this purpose, the replacements presented
by (Keselj and Cox, 2004; Abou-Assaleh et al., 2005) were
utilised. Fourthly, all instances of the definiendum (the
title of the page) are replaced with a placeholder
(CONCEPT); this way the learnt models are prevented from
overfitting any strong dependence between some lexical
properties of the definiendums and their respective
definitions across the training set. Partial matches were
also substituted, as long as the partial match did not
consist of a stop-word only. Fifthly, duplicate sentences
are detected by simple string matching, and all sentences
that do not contain the placeholder CONCEPT were filtered
out. Finally, all content in parentheses was moved to the
end of the sentence as a means of preventing a distortion
in the subsequent learning process. Overall, a set of
2,351,708 sentences was extracted corresponding to 942,242
definiendums from wide-ranging topics:
2,190,630 were used for training and 161,078 for de-
velopment belonging to 879,089 and 63,153 distinct
definiendums, respectively. Some training examples
include:
CONCEPT[Ocosingo, Chiapas] was given city
status on 31 July 1979 .
From 1976 to 2001 , the CONCEPT[Le Studio] was
used by many Canadian and international artists to
record hit albums .
The CONCEPT[Le Travail Movement] lasted only
7 months , from Sept 1936 to April 1937 .
CONCEPT[Timelike Infinity] is a 1992 science
fiction book by Stephen Baxter .
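A simplified sketch of the placeholder substitution and filtering steps is given below. Sentence splitting, date recognition and co-reference resolution rely on the external tools named above and are not reproduced; the regular expressions and the function name are illustrative assumptions rather than the actual implementation:

import re

def prepare_training_sentence(sentence, definiendum):
    # Replace every instance of the definiendum (page title) with CONCEPT.
    sentence = re.sub(re.escape(definiendum), "CONCEPT", sentence, flags=re.IGNORECASE)
    # Discard sentences lacking the placeholder or a number-bearing date.
    if "CONCEPT" not in sentence or not re.search(r"\d", sentence):
        return None
    # Move parenthesised content to the end of the sentence.
    parentheticals = re.findall(r"\([^)]*\)", sentence)
    sentence = re.sub(r"\s*\([^)]*\)", "", sentence)
    return (sentence.strip() + " " + " ".join(parentheticals)).strip()

For instance, applying it to "Timelike Infinity (novel) is a 1992 science fiction book ." with the definiendum "Timelike Infinity" would yield "CONCEPT is a 1992 science fiction book . (novel)".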
As a means of assessing the quality of the acquired sets,
one hundred randomly selected training sentences were
manually annotated. Of these sentences, 17% were misleading
examples. Basically, these originated from numbers
mislabelled as dates, wrongly detected sentence boundaries,
and wrong inferences drawn by the replacement and extraction
heuristics. This acquisition process nonetheless provided
good accuracy (about 83% of the sampled sentences were
genuine positives), especially considering that it does not
require manual annotations.
In contrast, the testing set was derived from sentences
taken from full web documents, since the ultimate goal is
definition QA systems operating on the web. These test
sentences were obtained by submitting to a search engine
(MSN Search, http://www.live.com) definiendums that did not
supply training/development sentences. A maximum of 50 hits
were requested of the search engine per definiendum, from
which only documents corresponding to web snippets
containing the exact match of the definiendum were
downloaded and processed. To be more precise, document
processing consisted of removing HTML tags and splitting
the documents into sentences by means of openNLP
(http://opennlp.sourceforge.net). Subsequently, only
sentences carrying the exact match of the definiendum and a
number were chosen, and every instance of the definiendum
was replaced with the placeholder CONCEPT. As a result,
5,773 testing sentences were obtained, belonging to 1,008
distinct definiendums from wide-ranging topics.
(+)On October 31 , 1948 , the CON-
CEPT[American Vecturist Association] was
formed in New York City out of interest sparked
from Mr. Moore ’s newsletter .
(-)Posted by: CONCEPT[Chuck Moore] April
19, 2007 8:35 AM
(-)Starting in 1997 , the Union began to work
with non-student instructional staff to join CON-
CEPT[CUPE 3902].
(+)Beginning in July of 2001 , Dr. CONCEPT[Yuri
Pichugin] joined the CI team as Director of Re-
search .
As shown in the illustrative examples, these 5,773 testing
sentences were manually labelled as positive or negative.
This manual annotation yielded 1,149 positive and 4,624
negative samples, respectively. We intentionally opted not
to force the test set to be balanced, but rather kept the
distributions as they were found on the web; this way the
experimental scenarios remain as realistic as possible.
Certainly, temporally anchored definitions are much fewer in
number than their counterparts.
To sum up, reliable sets of positive training and
development sentences were automatically acquired
from Wikipedia abstracts. In each sentence, the oc-
currence of dates was validated by means of linguistic
processing. Conversely, the testing set was obtained
from the web, and numbers are interpreted as the in-
dicator of potential dates. This key difference is due
chiefly to the fact that models were trained off-line,
and linguistic tools were utilised for increasing the re-
liability of the positive training set, and consequently,
of the models. On the other hand, testing sets are as-
sessed in real time. For this reason, dates are discrim-
inated from numbers on the grounds of the similar-
ity between their respective contexts and the contexts
learnt by the models.
A final remark concerns on-line resources that yield
temporally anchored descriptions, for instance:
www.theday2day.com, www.thisdaythatyear.com and
www.worldofquotes.com. We also acquired 519,240 positive
examples from eight different on-line resources. We
realised, however, that two aspects make them less
attractive: (1) the kinds of descriptions they cover are
narrow, mainly births and deaths of people; and (2) the
definiendum must be manually identified. For these two
reasons, these sentences were not taken into account in
this work.
4 N-GRAM LANGUAGE MODELS
Since our corpus acquisition procedure produces two
sets (development and training) consisting solely of
positive examples, only strategies that can learn from
one-class are considered. In language models, a test
sentence S = w_1 \ldots w_l is scored according to the
probability P(S) of its sequence of words. This probability
is usually decomposed into the product of the likelihoods of
smaller sequences of n words (Goodman, 2001; Figueroa and
Atkinson, 2009):

P(S) \approx \prod_{i=1}^{l} P(w_i \mid w_{i-n+1} \ldots w_{i-1})
where l denotes the length of the test sentence S.
Typically, the length n of the sentence fragments ranges
between one and five. Accordingly, P(w_i \mid w_{i-n+1}
\ldots w_{i-1}) is the probability of seeing the word w_i
after the fragment w_{i-n+1} \ldots w_{i-1}. These
probabilities are normally approximated by means of the
Maximum Likelihood Estimate:

P(w_i \mid w_{i-n+1} \ldots w_{i-1}) \approx \frac{count(w_{i-n+1} \ldots w_i)}{count(w_{i-n+1} \ldots w_{i-1})}
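A minimal sketch of these maximum-likelihood estimates, assuming the training sentences have already been tokenised; the function names are illustrative, not those of the actual system:

from collections import defaultdict

def collect_counts(sentences, n):
    # Count every k-gram (k = 1..n) over tokenised training sentences.
    counts = defaultdict(int)
    for tokens in sentences:
        for k in range(1, n + 1):
            for i in range(len(tokens) - k + 1):
                counts[tuple(tokens[i:i + k])] += 1
    return counts

def mle_prob(counts, history, word):
    # P(word | history) = count(history + word) / count(history);
    # with an empty history this reduces to the unigram estimate.
    numerator = counts.get(tuple(history) + (word,), 0)
    if history:
        denominator = counts.get(tuple(history), 0)
    else:
        denominator = sum(c for ngram, c in counts.items() if len(ngram) == 1)
    return numerator / denominator if denominator else 0.0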
In order to deal with unknown words when ranking test
sentences, a dummy token was arbitrarily added to the model.
This token was associated with a frequency value of one;
unigram probabilities were then defined by the following
formula:

P(w_i) \approx \frac{count(w_i)}{\sum_{w} count(w)}
Subsequently, the obtained n-gram language
model is smoothed by interpolating with shorter n-
grams (Goodman, 2001) as follows:
P_{inter}(w_i \mid w_{i-n+1} \ldots w_{i-1}) = \lambda_n P(w_i \mid w_{i-n+1} \ldots w_{i-1}) + (1 - \lambda_n) P_{inter}(w_i \mid w_{i-n+2} \ldots w_{i-1})
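Building on the counting sketch above, the interpolation and the final Rank(S) = log P(S) score could be computed roughly as follows. The lambdas argument maps each n-gram order to its weight (e.g. the values in Table 1), and the unseen-event guard is an assumption of this sketch, since the paper handles unknown words with a dummy token instead:

import math

def interpolated_prob(counts, lambdas, history, word):
    # P_inter(w | h) = lambda_n * P_MLE(w | h) + (1 - lambda_n) * P_inter(w | shorter h),
    # where n = len(history) + 1; the recursion bottoms out at the unigram estimate.
    if not history:
        return mle_prob(counts, (), word)          # mle_prob from the previous sketch
    lam = lambdas[len(history) + 1]
    return (lam * mle_prob(counts, history, word)
            + (1.0 - lam) * interpolated_prob(counts, lambdas, history[1:], word))

def rank(counts, lambdas, tokens, n):
    # Rank(S) = log P(S): sum log-probabilities to avoid arithmetic underflow.
    score = 0.0
    for i, word in enumerate(tokens):
        history = tuple(tokens[max(0, i - n + 1):i])
        p = interpolated_prob(counts, lambdas, history, word)
        score += math.log(p) if p > 0 else math.log(1e-12)   # crude guard for unseen events
    return score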
Lastly, the rank of a test sentence S is calculated as
Rank(S) = log(P(S)) as a means of preventing arithmetic
underflow when dealing with long sentences. As for the
smoothing parameters, their optimal values were estimated by
means of the Expectation-Maximization (EM) algorithm. In
essence, this algorithm is run for each interpolation level,
that is, λ_5 is computed by interpolating pentagrams and
tetragrams, λ_4 by interpolating tetragrams and trigrams,
and so on. Accordingly, the first row in table 1 lists all
parameter values for the proposed models. For the sake of
clarity, the remaining two rows will be discussed later.
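A hedged sketch of how one such weight could be fitted with EM on the development set, one interpolation level at a time; this is a generic deleted-interpolation style recipe, and the function signature is an assumption rather than the procedure actually implemented:

def estimate_lambda(events, higher_prob, lower_prob, iterations=20):
    # EM for one interpolation level: events are (history, word) pairs from the
    # development set; higher_prob and lower_prob return the probabilities of the
    # higher- and lower-order models. Returns the mixture weight lambda_n.
    lam = 0.5
    for _ in range(iterations):
        responsibility = 0.0
        for history, word in events:
            high = lam * higher_prob(history, word)
            low = (1.0 - lam) * lower_prob(history, word)
            if high + low > 0.0:
                responsibility += high / (high + low)   # E-step: posterior of the higher-order model
        lam = responsibility / len(events)              # M-step: re-estimate the weight
    return lam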
Table 1: Lambda Estimates.

                                     λ_2     λ_3     λ_4     λ_5
Language Models                      0.91    0.41    0.005   0.0001
Language Models + Google N-grams     0.80    0.26    0.014   0.0001
Language Models + CD                 0.91    0.43    0.078   0.0001
5 EXPERIMENTS AND RESULTS
Our language models were tried on the testing set ac-
quired in section 3. A baseline was implemented that
assigns a random score to each sentence, and we ad-
ditionally studied the impact of some of the features
of (Paşca, 2008; Alonso et al., 2009) in the ranking.
For the sake of simplicity, from now on, the presented
system is called ChronosDefQA.
As to evaluation metrics, we made use of precision at k.
This measurement is the ratio of sentences that are actual
definitions among the first k positions of the ranking (per
definiendum). In our experiments, we made allowances solely
for the top five positions, as these proved sufficient to
draw a clear distinction between the different approaches.
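A small sketch of this measurement, assuming per-definiendum ranked lists of boolean relevance labels; the helper names and the choice of k as denominator are illustrative assumptions:

def precision_at_k(ranked_labels, k):
    # ranked_labels: booleans (True = actual definition) in ranking order for one definiendum.
    top = ranked_labels[:k]
    return sum(top) / float(k) if top else 0.0

def average_precision_at_k(rankings, k):
    # rankings: mapping from definiendum to its ranked label list; averages precision at k.
    return sum(precision_at_k(labels, k) for labels in rankings.values()) / len(rankings)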
Table 2: Results achieved by the baseline and different configurations of ChronosDefQA (average precision at k).

                                     k=1     k=2     k=3     k=4     k=5
Baseline (random ranking)            0.154   0.123   0.085   0.080   0.098
(1) = ChronosDefQA
    unigrams                         0.166   0.127   0.079   0.061   0.073
    bigrams                          0.179   0.143   0.088   0.079   0.090
    trigrams                         0.181   0.138   0.092   0.095   0.107
    tetragrams                       0.184   0.136   0.092   0.079   0.088
    pentagrams                       0.181   0.143   0.092   0.086   0.097
ChronosDefQA + Google N-grams (all)
    unigrams                         0.124   0.110   0.070   0.056   0.067
    bigrams                          0.123   0.111   0.069   0.075   0.092
    trigrams                         0.129   0.109   0.072   0.067   0.081
    tetragrams                       0.131   0.115   0.081   0.068   0.083
    pentagrams                       0.137   0.118   0.072   0.071   0.086
ChronosDefQA + Google N-grams (ngram)
    unigrams                         0.150   0.121   0.080   0.081   0.096
    bigrams                          0.148   0.122   0.071   0.077   0.093
    trigrams                         0.151   0.123   0.073   0.067   0.081
    tetragrams                       0.146   0.122   0.071   0.075   0.091
    pentagrams                       0.149   0.125   0.060   0.084   0.101
Table 2 stresses the achievements obtained in
terms of average precision at k by the baseline and dif-
ferent configurations of ChronosDefQA. These con-
figurations varied the level of the n-gram language
models (n = 1 . . . 5). In light of the results, it can be
concluded:
1. In general, ChronosDefQA outperforms our base-
line in the top-three ranked positions. This suggests
that our corpus in conjunction with n-gram language
models enhances the performance.
2. More interestingly, trigram language models pro-
duced better results consistently, though higher level
models were also taken into consideration in this eval-
uation. This outcome is also corroborated later by
other variations of ChronosDefQA (see tables 3 and
4).
3. The reason to prefer language models for ranking answer
candidates is that only positive samples are available.
Thus, ChronosDefQA knows n-gram distributions in temporally
anchored definitions, while at the same time, it lacks data
about distributions of n-grams in general text (sentences)
containing numbers. As a means of inferring this
information, while avoiding the manual annotation of a set
of sentences of about the same size as the positive set,
negative (general text) language models were deduced from
Google N-grams.
First, the respective frequencies of the n-grams in
the positive models were taken from Google N-grams.
Only these n-grams were considered in order to pre-
vent inducing a bias in the negative models. Second,
language models are built on top of these frequencies.
Third, as to interpolation parameters, since we do not
have annotated negative sentences, the positive devel-
opment set was utilised for their tuning. Accordingly,
the second row in table 1 shows the obtained values.
Two distinct alternatives were attempted. The first one,
signalled by (all) in table 2, independently ranks answer
candidates with both language models, and divides both
scores afterwards. The second, indicated by (ngram) in
table 2, divides each P_{inter}(w_i \mid w_{i-n+1} \ldots
w_{i-1}) by its counterpart from the negative model, and
outputs the corresponding rank(S). In general, both
approaches were detrimental to performance, suggesting that
Google N-grams did not provide a good approximation of the
negative set.
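For concreteness, the two variants could be realised roughly as follows; this is a hedged sketch under the reading that the division is by the negative (Google N-grams) model, and the callables stand in for the two interpolated models:

import math

def ratio_rank_all(tokens, positive_rank, negative_rank):
    # (all): rank the candidate with both models independently, then divide the two scores.
    return positive_rank(tokens) / negative_rank(tokens)

def ratio_rank_ngram(tokens, positive_prob, negative_prob, n=3):
    # (ngram): divide each interpolated probability by its negative-model counterpart
    # and accumulate the corresponding log score.
    score = 0.0
    for i, word in enumerate(tokens):
        history = tuple(tokens[max(0, i - n + 1):i])
        ratio = positive_prob(history, word) / negative_prob(history, word)
        score += math.log(ratio)
    return score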
4. Besides, (Alonso et al., 2009) utilised the length of the
sentence as an attribute in their ranking strategy. Equally,
results (2) in table 3 highlight the results achieved by
ChronosDefQA when the score function rank(S) is divided by
the number of tokens in S. This resulted in an enhancement
with respect to (1) for models that account for features of
higher order than unigrams. Specifically, the precision of
the top-ranked sentence in the trigram model increased from
0.181 to 0.194 (7.18%), while the ranking in the case of the
unigram model was inclined to worsen.
5. A key aspect of rank(S) is that it assigns higher scores
to sentences in agreement with their likelihood of being
temporally anchored definitions. Since the ranking is forced
to output five sentences, for some definiendums, misleading
sentences can still be incorporated into the output, despite
their low scores. The reason for this is two-fold: (a) few
or no reliable answers were distinguished in the test set,
and (b) there was no genuine answer for a particular
definiendum.
Table 3: Results achieved by distinct configurations of ChronosDefQA (average precision at k).

                                     k=1     k=2     k=3     k=4     k=5
(2) = (1) + Length
    unigrams                         0.156   0.108   0.073   0.076   0.091
    bigrams                          0.194   0.150   0.110   0.109   0.122
    trigrams                         0.194   0.151   0.102   0.117   0.128
    tetragrams                       0.188   0.151   0.104   0.094   0.103
    pentagrams                       0.191   0.141   0.097   0.099   0.111
(3) = (2) + Thresholds
    unigrams                         0.156   0.108   0.073   0.076   0.091
    bigrams                          0.203   0.158   0.117   0.117   0.131
    trigrams                         0.206   0.161   0.110   0.127   0.139
    tetragrams                       0.197   0.158   0.110   0.102   0.112
    pentagrams                       0.199   0.147   0.103   0.106   0.118
(4) = (3) + Number Substitution
    unigrams                         0.156   0.116   0.073   0.074   0.088
    bigrams                          0.232   0.195   0.143   0.197   0.211
    trigrams                         0.230   0.206   0.250   0.266   0.275
    tetragrams                       0.247   0.196   0.180   0.205   0.221
    pentagrams                       0.249   0.229   0.129   0.137   0.161
(5) = (3) + Number Substitution & Redundancy
    unigrams                         0.182   0.140   0.087   0.094   0.106
    bigrams                          0.328   0.250   0.229   0.127   0.243
    trigrams                         0.350   0.294   0.210   0.25    0.273
    tetragrams                       0.307   0.253   0.184   0.260   0.286
    pentagrams                       0.290   0.226   0.156   0.202   0.229
For this reason, we experimentally checked for a set of
reliable score thresholds to tackle this issue. Accordingly,
these thresholds are: unigrams (-13.0), bigrams (-4.0) and
all other models (-2.5). The top-five answers were cut off
at any point where an answer finished with a score smaller
than the respective threshold.
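A minimal sketch of this cut-off under the thresholds quoted above, assuming the candidates already carry the length-normalised scores of configuration (2); the helper name is illustrative:

THRESHOLDS = {1: -13.0, 2: -4.0}   # unigrams and bigrams; all higher orders use -2.5

def cut_off_top_five(ranked_answers, order):
    # ranked_answers: (sentence, length-normalised score) pairs in descending order.
    threshold = THRESHOLDS.get(order, -2.5)
    output = []
    for sentence, score in ranked_answers[:5]:
        if score < threshold:
            break                   # trim the list at the first score below the threshold
        output.append(sentence)
    return output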
Results (3) in table 3 emphasise the new outcomes, showing
that trigram models perform the best. More precisely, this
model improved the performance of its counterpart in (2) as
follows: top-one (6.18%), top-two (6.6%), top-three (7.8%),
top-four (8.5%) and top-five (8.6%). Since the ranking is
more likely to be trimmed at lower positions, the impact of
this attribute tends to be stronger as greater values of k
are considered. This outcome also reaffirms our first
conclusion, because greater improvements are achieved when
cutting at lower positions, that is, when discarding
candidates with low scores.
6. In order to boost the similarities between the
models and test sentences, numbers across train-
ing and development sentences were replaced with a
placeholder (CD). Consequently, new language mod-
els were trained, and accordingly, the new interpola-
tion parameters are shown in the third row of table 1.
Subsequently, test sentences are assessed by (3), but
making use of these new language models along with
substituting numbers in test sentences with CD.
Results (4) indicate the relevance of this substitution, as
it considerably improves the performance for almost all
models in relation to (3). For instance, the trigram model
finished with better precision at all levels when contrasted
to its counterpart in (3): top-one (11.65%), top-two
(27.95%), and top-three, top-four and top-five (90+%). These
outcomes reaffirm the positive contribution of our language
models to the ranking. In this configuration, the pentagram
model outperformed the trigram model at k = 1 and 2,
revealing the importance of fuller syntactic structures.
Since numbers were replaced by a placeholder, the
probability of matching some longer, more reliable
structures increased, bringing about improvements at all
levels with respect to (3), these enhancements being greater
than those of the trigram model for the first and second
ranked answers.
7. Incidentally, (Paşca, 2008) clustered answers according
to dates, under the assumption that dates supported by more
answer candidates are more likely to be the most reliable
and relevant chronons for a particular definiendum.
Following this observation, ChronosDefQA builds a histogram
of the numbers appearing across answer candidates, and
removes all the numbers with a frequency of one. It
additionally discards dates where at least one of their
instances does not occur preceded by a, the, in or on.
Contrary to (Paşca, 2008), ChronosDefQA accounts for any
sort of number. The highest-ranked instance of each date is
moved to the top of the rank by adding the score of the
highest-ranked answer to its score.
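A sketch of this redundancy step, assuming answer candidates arrive as (sentence, score) pairs sorted by descending score; the helper name is illustrative, and the filter requiring each instance to be preceded by a, the, in or on is left out for brevity:

import re
from collections import Counter

def promote_redundant_dates(candidates):
    # candidates: (sentence, score) pairs in descending score order.
    frequencies = Counter()
    for sentence, _ in candidates:
        frequencies.update(set(re.findall(r"\d+", sentence)))
    reliable = {number for number, freq in frequencies.items() if freq > 1}
    top_score = candidates[0][1] if candidates else 0.0
    promoted, seen = [], set()
    for sentence, score in candidates:
        new = (set(re.findall(r"\d+", sentence)) & reliable) - seen
        if new:                      # highest-ranked instance of a reliable number
            score += top_score       # boost it by the score of the top-ranked answer
            seen |= new
        promoted.append((sentence, score))
    return sorted(promoted, key=lambda pair: pair[1], reverse=True)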
Broadly speaking, results (5) reveal a marked im-
provement with respect to (4) for all configurations,
and for almost all levels of precision, demonstrating
the relevance of the local contextual information.
8. Previous configurations, excluding those utilising
Google N-grams, bias the ranking in favour of answer
candidates containing positive evidence. Following
the observation of (Paşca, 2008), configuration (6)
in table 4 underlines the outcomes achieved when re-
moving sentences containing pronouns. Results re-
veal an improvement in relation to (5).
9. Following the same line, ChronosDefQA eliminates answer
candidates that contain expressions of the form
“in”/“on”/“the”/“at (the) CONCEPT”. Like expunging pronouns,
the outcome (7) obtained by this removal bettered the
results achieved by (6).
10. One aspect that makes the language models pre-
sented in this work less attractive is their assump-
tion about characterising an answer candidate S as
sequences of n words. Many times insertions and/or
deletions of words can make a genuine answer look
spurious or, in the best case, less reliable. As a means
of investigating the impact of insertions and deletions
of words, we learnt sets of five and four ordered words
Table 4: Results obtained by extra configurations of ChronosDefQA (average precision at k).

                                     k=1     k=2     k=3     k=4     k=5
(6) = (5) + Pronouns
    unigrams                         0.190   0.149   0.100   0.092   0.104
    bigrams                          0.341   0.255   0.231   0.25    0.267
    trigrams                         0.360   0.298   0.222   0.259   0.279
    tetragrams                       0.315   0.257   0.194   0.270   0.298
    pentagrams                       0.299   0.232   0.161   0.197   0.223
(7) = (6) + PP
    unigrams                         0.194   0.156   0.109   0.089   0.100
    bigrams                          0.353   0.242   0.250   0.258   0.275
    trigrams                         0.376   0.302   0.220   0.270   0.288
    tetragrams                       0.325   0.269   0.198   0.230   0.257
    pentagrams                       0.308   0.236   0.172   0.174   0.193
(8) = (7) + Four-Words Orderings
    unigrams                         0.211   0.174   0.131   0.108   0.118
    bigrams                          0.381   0.365   0.318   0.290   0.304
    trigrams                         0.385   0.368   0.359   0.308   0.326
    tetragrams                       0.345   0.309   0.269   0.315   0.341
    pentagrams                       0.318   0.283   0.215   0.243   0.265
(9) = (7) + Five-Words Orderings
    unigrams                         0.205   0.149   0.125   0.083   0.094
    bigrams                          0.393   0.296   0.224   0.341   0.364
    trigrams                         0.415   0.330   0.222   0.400   0.424
    tetragrams                       0.358   0.276   0.200   0.260   0.290
    pentagrams                       0.328   0.249   0.180   0.195   0.217
that co-occur in windows of ten tokens across training
sentences, similarly to (Figueroa, 2008). These or-
dered words were constrained to start with the place-
holder CONCEPT and end with the placeholder CD.
Some illustrative highly frequent tuples are:
<CONCEPT,born,in,CD>; <CONCEPT,served,as,CD>;
<CONCEPT,album,released,in,CD>;
Answer candidates matching these tuples are re-ranked
analogously to configuration (5). Generally, tuples
consisting of four words caused greater enhancements at
k = 3, 4, 5 for all models, whereas tuples formed by five
words did so at k = 1 and 2. Nonetheless, both brought about
an improvement with respect to the previous configuration,
signalling that the context between the definiendum and the
potential date, along with its flexible word ordering, is a
conspicuous attribute of temporally anchored definitions.
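A hedged sketch of how such ordered word tuples could be harvested from a placeholder-substituted training sentence; the function name and the exhaustive enumeration are assumptions about one possible implementation:

from itertools import combinations

def ordered_word_tuples(tokens, size=4, window=10):
    # Ordered word tuples of the given size that co-occur within a ten-token window,
    # constrained to start with CONCEPT and end with CD.
    tuples = set()
    for start in range(len(tokens)):
        span = tokens[start:start + window]
        if not span or span[0] != "CONCEPT":
            continue
        for positions in combinations(range(1, len(span)), size - 1):
            candidate = ("CONCEPT",) + tuple(span[i] for i in positions)
            if candidate[-1] == "CD":
                tuples.add(candidate)
    return tuples

For example, ordered_word_tuples("CONCEPT was born in CD .".split()) contains (CONCEPT, born, in, CD) and (CONCEPT, was, born, CD), among others.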
6 CONCLUSIONS AND FUTURE
WORK
This work presented an automatic acquisition method
of temporally anchored definitions from Wikipedia.
These acquired definitions are then used for language
models that rank temporally anchored answers to def-
inition questions. This work also studied the effects
of different attributes that have a significant impact on
ranking answer candidates.
As future work, we envision the enrichment of the
acquisition process with more linguistic knowledge;
this way more reliable training and development sets
can be obtained. By the same token, contextual word
orderings and window lengths could be learnt from
the respective dependency trees, and employed at the
surface level to rank test sentences afterwards.
ACKNOWLEDGEMENTS
The material used in this work is available
at: http://www.inf.udec.cl/atkinson/biographical-
EventsSets.tar.gz. Thanks to John Atkinson for aiding
in publishing these data-sets.
This work was partially supported by a research
grant from the German Federal Ministry of Edu-
cation, Science, Research and Technology (BMBF)
to the DFKI project TAKE (01IW08003) and the
EC-funded project QALL-ME - FP6 IST-033860
(http://qallme.fbk.eu).
REFERENCES
Abou-Assaleh, T., Cercone, N., Doyle, J., Keselj, V., and
Whidden, C. (2005). DalTREC 2005 QA System Jel-
lyfish: Mark-and-Match Approach to Question An-
swering. In Proceedings of TREC 2005. NIST.
Alonso, O., Baeza-Yates, R., and Gertz, M. (2009). Ef-
fectiveness of temporal snippets. In Workshop on
Web Search Result Summarization and Presentation
(WWW 2009).
Figueroa, A. (2008). Mining Wikipedia for Discovering
Multilingual Definitions on the Web. In 4th Inter-
national Conference on Semantics, Knowledge and
Grid, pages 125–132.
Figueroa, A. and Atkinson, J. (2009). Using Dependency
Paths For Answering Definition Questions on The
Web. In 5th International Conference on Web Infor-
mation Systems and Technologies, pages 643–650.
Goodman, J. T. (2001). A bit of progress in language mod-
eling. Computer Speech and Language, 15:403–434.
Keselj, V. and Cox, A. (2004). DalTREC 2004: Question
Answering Using Regular Expression Rewriting. In
Proceedings of TREC 2004. NIST.
Paşca, M. (2008). Answering Definition Questions via
Temporally-Anchored Text Snippets. In Proceedings
of IJCNLP.
Rose, D. E. and Levinson, D. (2004). Understanding user
goals in web search. In WWW ’04: Proceedings of
the 13th international conference on World Wide Web,
pages 13–19.