QUESTION ANSWERING AS A KNOWLEDGE DISCOVERY TECHNIQUE
Yllias Chali
University of Lethbridge, 4401 University Drive
Lethbridge, Alberta, Canada, T1K 3M4
Keywords:
Knowledge and Information Extraction, Question Answering.
Abstract:
The size of the publicly indexable world-wide web has provably surpassed several billion documents, and as yet its growth shows no sign of leveling off. Search engines are therefore increasingly challenged when trying to maintain current indices using exhaustive crawling. Focused retrieval provides more direct access to relevant information. In this paper, we investigate various aspects of discovering knowledge about entities, such as people, places, and groups, and about complex events.
1 INTRODUCTION
The size of the publicly indexable world-wide web has provably surpassed several billion documents, and as yet its growth shows no sign of leveling off. Dynamic content on the web is also growing, as time-sensitive materials such as news, financial data, entertainment, and schedules become widely disseminated via the web. Search engines are therefore increasingly challenged when trying to maintain current indices using exhaustive crawling. Even state-of-the-art crawlers such as AltaVista's Scooter, which reportedly crawls ten million pages per day, can take weeks to crawl the web exhaustively.
This vast rise in the amount of online text available, together with the demand for access to different types of information, has, however, led to a renewed interest in a broad range of Information Retrieval (IR) related areas that go beyond simple document retrieval, such as focused retrieval, topic detection and tracking, summarization, multimedia retrieval (e.g., image, video and music), software engineering, chemical and biological informatics, text structuring, text mining, and genomics (Voorhees, 2003a; Voorhees, 2003b).
Focused Retrieval (FR) is a relatively new area of research that deals with retrieving specific information (i.e., a passage, an answer to a question, or an XML element) in response to a query, rather than whole documents, as state-of-the-art information retrieval systems (search engines) do (Harabagiu et al., 2003; Moldovan et al., 1999; Roth et al., 2002; Moldovan et al., 2002). This means that focused retrieval systems may well become the next generation of search engines. What is left to be done to allow focused retrieval systems to become the next generation of search engines? The answer is higher accuracy and more efficient extraction.
In this paper, we investigate various aspects of focused retrieval applications such as question answering, passage retrieval, and element retrieval. We propose techniques to extract useful information about entities such as people, places, and groups, and about complex events. To achieve this goal, we need to develop mechanisms to answer questions about these entities and complex events. For instance, considering "Abraham Lincoln", we can extract the information that answers the following questions:
Who is Abraham Lincoln?
When was he president?
When did he die?
How did he die?
Who shot him?
etc.
This paper is organized as follows. Section 2 presents the pre-processing technique, which consists of normalizing the questions. We then describe our question answering system in detail. Finally, we present an evaluation of the system and conclude with some future work.
2 QUESTION NORMALIZATION
The questions are not only about entities but can also be about complex events such as "the visit of Prince Charles and Camilla to California". We call the overall theme of the questions the target, and the questions are grouped by target. Targets are mainly people, places, groups, and events. For instance, we could have a target like "Space Shuttles", and we would have all possible questions about this target, or more specific questions about narrower topics such as "Spaceship Columbia". Our goal is to collect and discover several pieces of useful information about a specific target. To accomplish this goal, we need to answer all the possible questions about the target. For instance, if we consider the Canada general election event as a target, we will have a scenario including the following questions:
Target: Canada general election
Questions:
When was the last Canada general election?
Why was the election called?
Which political parties participated in the election?
Who was leading the Liberal party?
Who was leading the Conservative party?
What were the election results?
How many seats did each party get in the parliament house?
What are the changes with the new government?
How long is the Canadian government's mandate?
etc.
The question normalization module takes the in-
formation given by the target and the questions, and
changes the questions to incorporate that information.
This means that questions can refer to the target of the
questions, or even to other questions. Our system re-
solves these references so that it can answer the ques-
tions one at a time.
These types of references were also investigated by Schone et al. (2004). We classify these references into three types:
Reference to the target by a pronoun
Reference to the target by an entity
Implied Reference
Our system resolves the pronouns of the question
first, then proceeds to resolve the other entities.
2.1 Pronoun Resolution
Our system assumes that pronouns refer to the target of the question, unless the target already appears in the question. It considers two types of pronouns: personal and possessive.
Personal pronouns just need a direct replacement
with the target. An example of this is the ques-
tion, “Where was he born?”, with the target, “Wal-
ter Mosley”, which will be changed to “Where was
Walter Mosley born?". The personal pronouns we consider are it, he, she, they, him, and her.
The possessive pronouns will involve more than a
direct replacement. An “’s” will be added after the
target, once the target replaces the pronoun. An ex-
ample of this is the question, “Who is her coach?”, for
the target, “Jennifer Capriati”, which will be changed
to “Who is Jennifer Capriati’s coach?”
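To make the pronoun-resolution step concrete, the following is a minimal Python sketch of the procedure described above. The function name, the token-based matching, and the handling of the ambiguous pronoun "her" are assumptions of this sketch rather than the exact implementation.

PERSONAL = {"it", "he", "she", "they", "him"}
POSSESSIVE = {"his", "its", "their"}

def resolve_pronouns(question: str, target: str) -> str:
    """Replace pronouns referring to the target, unless the target
    already appears in the question."""
    if target.lower() in question.lower():
        return question
    tokens = question.split()
    out = []
    for i, tok in enumerate(tokens):
        word = tok.strip("?,.!").lower()
        if word in PERSONAL:
            out.append(target)                # direct replacement
        elif word in POSSESSIVE:
            out.append(target + "'s")         # possessive: append "'s"
        elif word == "her":
            # crude disambiguation: "her" followed by a further word is
            # treated as possessive ("her coach"), otherwise as personal
            out.append(target + "'s" if i + 1 < len(tokens) else target)
        else:
            out.append(tok)
    return " ".join(out)

# Examples from the text:
#   resolve_pronouns("Where was he born?", "Walter Mosley")
#     -> "Where was Walter Mosley born?"
#   resolve_pronouns("Who is her coach?", "Jennifer Capriati")
#     -> "Who is Jennifer Capriati's coach?"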
2.2 Entity Resolution
These are references to the target, or to a previous question, expressed through the type of entity being referred to. An example is the entity "the cult" in the question, "Who was the leader of the cult?". This entity refers to the target, "Heaven's Gate". These entities start with "the" or "this", and they can refer to three things: the target, an answer from a previous question, or the answer to the current question.
First, our system checks for a pattern correlation between the entity and the target. For instance, the
question, “When was the agreement made?”, has the
entity “the agreement” which corresponds to the tar-
get “Good Friday Agreement”, so the entity will be
replaced by the target.
Each time an entity represents an answer, the en-
tity is saved along with its corresponding answer.
Then, if that entity appears again, it will be replaced
with the answer, if it has not already been replaced
by the target. For the question, “What are titles of
the group’s releases?”, the entity that needs resolu-
tion is “group”, which does not correspond to the tar-
get “Fred Durst”. It does, however, correspond to
the focus of a previous question, “What is the name
of Durst’s group?” Therefore, the entity “group”
from the question, “What are titles of the group’s re-
leases?”, will be replaced by the answer of the previ-
ous question.
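The following Python sketch illustrates this entity-resolution logic under simplifying assumptions: the pattern correlation is approximated by checking whether the head noun of the entity appears in the target, and answers to previous questions are kept in a plain dictionary keyed by the entity they resolve. Neither detail is claimed to be the exact implementation.

def resolve_entity(question: str, entity: str, target: str,
                   answer_cache: dict) -> str:
    """Replace a definite entity ("the X" / "this X") with the target
    or with the answer to an earlier question."""
    head = entity.split()[-1].lower()       # "the agreement" -> "agreement"

    # 1. Correlation with the target, e.g. "agreement" appears in
    #    "Good Friday Agreement".
    if head in target.lower():
        return question.replace(entity, target)

    # 2. Otherwise, reuse the cached answer of a previous question whose
    #    focus matched this entity (e.g. an answer stored under "group").
    if head in answer_cache:
        return question.replace(entity, answer_cache[head])

    # 3. No resolution possible; leave the question untouched.
    return question

# Example from the text:
#   resolve_entity("When was the agreement made?", "the agreement",
#                  "Good Friday Agreement", {})
#     -> "When was Good Friday Agreement made?"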
2.3 Implied References
Implied references are when the target is implied,
but not explicitly stated. An example of this is the
question, “Who was President of the United States
at the time?”, for the target, “Teapot Dome scandal”.
The question would ideally be reformed to “Who was
President of the United States at the time of the Teapot
Dome scandal?”, but our system does not reform the
question in such a way. Our system will include the
target in the query for these questions without reform-
ing the question itself.
Some questions are more difficult than this and
need to be treated differently. The question, "How many are there now?", asks for a count of an entity, which in this case is the target, "Kibbutz". Therefore, if our system cannot determine the entity that is to be counted, it considers the target to be that entity.
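A small sketch of this fallback behaviour is given below, under the assumptions stated in this subsection: the target is simply added to the retrieval query without rewriting the question, and a "How many ...?" question with no identifiable countable entity falls back to the target. The keyword filter and the function signature are illustrative only.

def build_query_terms(question: str, target: str, counted_entity=None):
    """Return the retrieval query terms and the entity to be counted."""
    terms = [w.strip("?,.") for w in question.split() if len(w) > 3]
    if target.lower() not in question.lower():
        terms.extend(target.split())        # implied reference: add the target
    if question.lower().startswith("how many") and counted_entity is None:
        counted_entity = target             # e.g. count the target "Kibbutz"
    return terms, counted_entity

# Example from the text:
#   build_query_terms("How many are there now?", "Kibbutz")
#     -> (['many', 'there', 'now', 'Kibbutz'], 'Kibbutz')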
3 OUR SYSTEM
Once the questions are normalized, each question is
answered individually by our question answering sys-
tem. Figure 1 outlines the general architecture of our
system. In the following subsections, we detail each of its components.
Figure 1: System Overview. The question and the documents are processed by a pipeline consisting of the Question Classifier, Document Retriever, Document Tagger, Answer Extractor, and Answer Ranker, which produces the answer.
3.1 Question Classification
Questions are classified by first assigning them to one of the following categories: who, when, why, how, where, and what. If a question is not easily classified as one of the above, it is classified as a what question.
After a question is categorized, the named entity (NE) type that its answer will take is determined. For what questions, this may involve discovering the question focus. The focus of a question is the part of the question that tells what type of entity the answer will be. For instance, the question "What city is home of the CN Tower?" has the focus "city". We use a group of patterns to discover the focus of what questions.
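As an illustration, a minimal Python sketch of this classification step is shown below. The single regular expression used to find the focus is a stand-in for the group of patterns mentioned above, not the actual pattern set.

import re

CATEGORIES = ("who", "when", "why", "how", "where", "what")

def classify(question: str) -> str:
    """Assign the question to a category, defaulting to 'what'."""
    first = question.lower().split()[0]
    return first if first in CATEGORIES else "what"

# Illustrative focus pattern for 'what' questions, e.g.
# "What city is home of the CN Tower?" -> "city".
FOCUS_PATTERN = re.compile(r"^what\s+(?:is\s+the\s+)?([a-z]+)\b", re.IGNORECASE)

def question_focus(question: str):
    m = FOCUS_PATTERN.match(question.strip())
    return m.group(1).lower() if m else None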
3.2 Document Retrieval
We use Managing Gigabytes (MG) (Witten et al., 1999) as our information retrieval system. We
separate each document into paragraphs, and index
each paragraph as if it were a document. When a
question is being processed, the question classifica-
tion module creates a boolean query for MG to re-
trieve the documents.
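The sketch below shows the paragraph-level indexing idea and the construction of a boolean query from the question keywords. The stop-word list, the blank-line paragraph splitting, and the '&' conjunction syntax are assumptions of this sketch; the actual indexing and querying are done with MG.

STOPWORDS = {"the", "a", "an", "of", "in", "to", "is", "was", "were", "did",
             "who", "when", "why", "how", "where", "what", "which"}

def split_into_paragraphs(document_text: str):
    """Treat each blank-line separated paragraph as its own indexable unit."""
    return [p.strip() for p in document_text.split("\n\n") if p.strip()]

def boolean_query(question: str) -> str:
    """Conjoin the content words of the question, e.g.
    'When was the last Canada general election?'
    -> 'last & canada & general & election'."""
    words = [w.strip("?,.").lower() for w in question.split()]
    content = [w for w in words if w and w not in STOPWORDS]
    return " & ".join(content)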
3.3 Document Tagging
We use Collins Parser (Collins, 1996) and OAK Tag-
ger (Sekine, 2002) to tag the documents that are re-
trieved by MG. Collins Parser tags the word depen-
dencies from the documents, and the OAK Tagger
tags chunked parts of speech and named entities that
correspond to the answer type of the question. Both the parsed documents and the documents tagged by the OAK Tagger are then sent to the answer extractor.
3.4 Answer Extraction
The two sets of tagged documents will have their an-
swers extracted differently.
For the parsed set of documents, the question
parse tree will be used to fill in the missing informa-
tion from the parse tree of the documents. If an en-
tity can be found such that it can complete the answer
parse tree, it is passed to the answer ranker.
For the OAK tagged documents, if they contain a
named entity corresponding to the answer type of the
question, they are then passed to the answer ranker.
For some other questions, patterns for both parsed
documents and the chunked part of speech documents
will extract possible answers to be ranked by the an-
swer ranker module.
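The named-entity branch of the extraction can be sketched as follows; the (tag, token) pair representation of the tagger output is an assumption of this sketch, not the OAK Tagger's actual interface.

def extract_candidates(tagged_paragraphs, answer_type: str):
    """tagged_paragraphs: list of paragraphs, each a list of
    (ne_tag, token) pairs, e.g. [("CITY", "Toronto"), ("O", "is"), ...].
    Any token whose tag matches the expected answer type becomes a
    candidate answer."""
    candidates = []
    for paragraph in tagged_paragraphs:
        for ne_tag, token in paragraph:
            if ne_tag == answer_type:
                candidates.append((token, paragraph))   # keep context for ranking
    return candidates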
3.5 Answer Ranking
If the answer type of the question corresponds to a
tagged named entity, all the entities extracted from
the tagged documents will be considered possible an-
swers. They will be ranked by how many times they
appear in the possible answer list, how close they ap-
pear to words from the question, and whether they appear
in the list of entities extracted from the parsed docu-
ments. If the answer type is not a named entity, the
entities extracted from the parsed documents will be
considered possible answers and will be ranked only
on frequency.
For factoid questions, the top ranked possible an-
swer is given as the answer to the question if it
achieves a rank above the threshold for the type of
question. For list questions, the possible answers that
achieve a rank higher than the threshold will be given
as answers. For other questions, possible answers that appear more than twice are given as answers. This is because our patterns sometimes extract useless information; if a piece of information about a target is important, it will usually be extracted more than once, whereas a useless fact should only be extracted once from the set of documents.
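The ranking heuristics can be summarised by the following sketch, which scores candidates by frequency, by proximity to question words in their source paragraph, and by overlap with the candidates found in the parsed documents, and then applies a per-type threshold. The weights, the proximity window, and the threshold handling are assumptions of this sketch.

from collections import Counter

def rank_candidates(candidates, question_words, parsed_candidates, window=10):
    """candidates: (token, paragraph) pairs as produced by the extractor."""
    freq = Counter(token for token, _ in candidates)
    scores = {}
    for token, paragraph in candidates:
        words = [w for _, w in paragraph]
        score = freq[token]                         # frequency in the candidate list
        if token in parsed_candidates:
            score += 2                              # found by both extractors
        if token in words:                          # proximity to question words
            pos = words.index(token)
            nearby = words[max(0, pos - window):pos + window]
            score += sum(1 for q in question_words if q in nearby)
        scores[token] = max(scores.get(token, 0), score)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

def select_answers(ranked, question_type, thresholds, list_question=False):
    """Top candidate above the per-type threshold for factoid questions;
    all candidates above the threshold for list questions."""
    t = thresholds.get(question_type, 0)
    above = [(c, s) for c, s in ranked if s > t]
    if list_question:
        return above
    return above[0] if above else None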
4 EVALUATION
The TREC question answering track provides the test-
ing data to evaluate the accuracy of the systems. It
consists of sets of documents and questions/answers
related to those sets of documents. We evaluate our system on these data. The results of the evaluation are shown in Table 1.
Our system is not yet ready for all the types of questions that are asked in the TREC QA track collection. This difficulty arises because we mainly train our system on the questions and answers, and we do not have a corpus of questions large enough to cover classifications for all of them. Therefore, we lose accuracy when attempting to answer such questions.
Table 1: Evaluation Results shown by Question Categories.
Question Type Success
Who 0.317
When 0.328
Why 0.245
How 0.265
Where 0.345
What 0.294
List 0.308
Others 0.145
Overall 0.281
5 CONCLUSIONS
We presented a system that extracts information about
entities and events given a pool of questions related to
that entity or event. We created categories for all the questions and extracted rules to classify questions into each of these categories. The system also includes
syntactic features and part of speech features for the
question classification and answer extraction.
Our system still needs some improvements. The overall improvement is expected to come primarily from an expanded classification of questions and the addition of dependency features to answer finding (Li and Roth, 2005; Pinchak and Lin, 2006). We hope to carry on this research and obtain even greater improvements.
REFERENCES
Chali, Y. and Dubien, S. (2004). University of Lethbridge’s
participation in TREC-2004 QA track. In Proceedings
of the Thirteenth Text REtrieval Conference.
Collins, M. (1996). A new statistical parser based on bi-
gram lexical dependencies. In Proceedings of ACL-
96, pages 184–191, Copenhagen, Denmark.
Harabagiu, S., Moldovan, D., Clark, C., Bowden, M.,
Williams, J., and Bensley, J. (2003). Answer min-
ing by combining extraction techniques with abduc-
tive reasoning. In Proceedings of the Twelfth Text RE-
trieval Conference, pages 375–382.
Li, X. and Roth, D. (2005). Learning question classifiers:
The role of semantic information. Journal of Natural
Language Engineering.
Moldovan, D., Harabagiu, S., Girju, R., Morarescu, P., Lac-
tusu, F., Novischi, A., Badulescu, A., and Bolohan, O.
(2002). LCC tools for question answering. In Pro-
ceedings of the Eleventh Text REtrieval Conference.
Moldovan, D., Harabagiu, S., Pasca, M., Mihalcea, R.,
Goodrum, R., Girju, R., and Rus, V. (1999). LASSO:
A tool for surfing the answer net. In Proceedings of
the 8th Text REtrieval Conference.
Pinchak, C. and Lin, D. (2006). A probabilistic answer
type model. In Proceedings of the 11th Conference
of the European Chapter of the Association for Com-
putational Linguistics, pages 393 – 400.
Roth, D., Cumby, C., Li, X., Morie, P., Nagarajan, R., Riz-
zolo, N., Small, K., and Yih, W. (2002). Question-
answering via enhanced understanding of questions.
In Proceedings of the Eleventh Text REtrieval Confer-
ence.
Schone, P., Ciany, G., McNamee, P., Mayfield, J., Bassi, T., and Kulman, A. (2004). Question answering with QACTIS at TREC-2004. In Proceedings of the Thirteenth Text REtrieval Conference.
Sekine, S. (2002). Proteus project oak system (English sen-
tence analyzer), http://nlp.nyu.edu/oak.
Voorhees, E. M. (2003a). Overview of the TREC 2002
Question Answering track. In Proceedings of the
Eleventh Text REtrieval Conference.
Voorhees, E. M. (2003b). Overview of the TREC 2003
Question Answering track. In Proceedings of the
Twelfth Text REtrieval Conference.
Witten, I., Moffat, A., and Bell, T. (1999). Managing Gi-
gabytes: Compressing and Indexing Documents and
Images. Morgan Kaufmann.