TOWARDS A LANGUAGE INDEPENDENT

ENCODING OF DOCUMENTS

A Novel Approach to Multilingual Question Answering

Pushpraj Shukla, Amitabha Mukherjee, Achla Raina

Indian Institute of Technology, Kanpur, India – 208016

stract: Given source text in several languages, can one answer queries in

some other language, without translating any of the sources into the language

of the questioner? In this paper we try to address this question as we report

our work on a restricted domain, multilingual Question – Answering system,

with current implementations for source text in English and questions posed in

English and Hindi. The cross - language functionality has been achieved by

converting the queries and the documents to an intermediate representation, an

inter-lingua called UNL. Built under the premise of a homogeneous

language-independent encoding, the UNL can be used as the predicate

knowledge base on which inferences can be performed effectively.

1 INTRODUCTION

Natural language question answering systems aim at making information access

easier by using natural-language questions directly and effectively as search queries

by exploiting their semantic structure, without asking for additional human effort to

translate the questions into sets of keywords. Much of the current research in question

answering systems focuses on open domains. The availability of large volumes of

data (e.g. documents extracted from the World Wide Web) has prompted the

development of systems that focus on shallow text processing. But there are many

document sets in restricted domains that are potentially valuable as a source for

question answering systems. For example, the documentation pages of Unix and

Linux systems would make an ideal corpus for Q/A systems targeted at users that

want to know how to use these operating systems. There is a wealth of information in

other technical documentation such as software manuals, car maintenance manuals,

and encyclopaedias of specific areas such as medicine. Users interested in these

specific areas would benefit from Q/A systems targeted to their areas of interest.

Multilinguality and cross-linguality have emerged as issues of great interest to the

Language Engineering community that deals with Question Answering systems.

Many organizations worldwide are grappling with problems like the following: Given

source text in several European languages, would it be possible to demonstrate

semantic understanding in some other language (like Hindi) without explicitly

translating any of the sources into the language of the questioner? This is a difficult

question to address at first. We present here a tool in this direction which can be

treated as a mechanism for mapping any language into a uniform language-

Shukla P., Mukherjee A. and Raina A. (2004).

TOWARDS A LANGUAGE INDEPENDENT ENCODING OF DOCUMENTS - A Novel Approach to Multilingual Question Answering.

In Proceedings of the 1st International Workshop on Natural Language Understanding and Cognitive Science, pages 116-125

DOI: 10.5220/0002685901160125

 SciTePress

independent predicate structure and thus make the task of multilingual Q/A more

realistic.

Traditional approaches classify cross–language information management systems as

follows:

1. systems that translate queries into the target language, or

2. systems that do the document collection into the source language, or both,

3. systems that convert queries and documents into an intermediate representation

(inter–lingua).

We adopted the last of the above schemes to implement the multilingual features in

the Q/A System. The inter-lingua used for the above purpose was the “Universal

Networking Language” (UNL) (Uchida, 2001; www.undl.org

). Our choice of the

inter-lingua depended on the functionality and the language independent features it

provided. With its set of “Universal Words”(UWs) having well defined universal

interpretations and ontological information embedded in them, a small and simple

binary predicate structure and a knowledge base connecting the UW’s as a weighted

graph of relations, UNL proved to be the most fit. Added to it is the world wide effort

in developing en-converters and de-converters for converting languages into UNL and

vice versa, which if fruitful, would allow us to extended our Q/A model to

substantially unrestricted, open – domains.

Figure 1: The System Implementation Flowchart

117

2 SYSTEM OVERVIEW

Figure 1 above displays the control flow between the major components of the Q/A

module. Two major tasks are involved in the system implementation. The first is that

of document processing, which is to be done once for each document we wish to be

queried. This converts the document into the interlingua, with added inference rules.

The other involves query processing and answer extraction and is done once for every

query. Our present work is based on an English corpus for safe drinking water built

from documents obtained from the official websites of EPA and WHO (Ground

Water and Drinking Water, EPA[Online]; WHO:Fact Sheets [Online]).

3 DOCUMENT PROCESSING

The corpus is manually annotated and processed through a Predicate Generator (the

UNL Parser) to get the complete set of semantic predicates for the document (the

“UNL expression” for the corpus). Some common known facts are also added to

provide context information which would be commonly known to the human reader

but is not available from the text itself.

For example, the NL corpus sentence, At some level, minerals are considered

contaminants that can make water unsafe, is annotated as follows:

<c>At some{<qua,>n} level.n.@pl. @entry</c> {<man,>p} mineral.@pl

{<gol,>p} are consider.p contaminant{1} .@pl{<obj,<p} that {<1}{<agt,>p}

can make.p.@ possible water{<obj,<p}<c>unsafe{<or,>p} or even{<man ,>p}

unsafe.p.@entry</c>{<gol ,<p}.

The corpus sentence in its annotated form is input to the UNL parser to generate the

UNL parsed graph, represented as a list of relations or predicates.

3.1 Semantic Equivalence

The same information may be expressed in very different ways:

1. Safe water can be obtained through boiling and distillation.

2. We can obtain safe water through boiling and distillation.

3. We can get safe water through boiling and distillation.

4. The methods for making safe water are boiling and distillation.

5. One can make safe water by boiling or distillation.

6. While distilling results in pure water, for practical purposes, boiling is sufficient to

make water safe for drinking.

Fortunately, a part of this problem (eg. active vs passive voice) is resolved by the

UNL encoding process – thus (1) and (2) will result in the same UNL structure:

agt(get(agt>thing,obj>thing,src>thing).@possible, we);

obj(get(agt>thing,obj>thing,src>thing), water(icl>liquid));

mod(water(icl>liquid), safe(mod<thing));

man(get(agt>thing,obj>thing,src>thing), through(icl>how(obj>thing)));

obj(through(icl>how(obj>thing)), :01);

and :01(boiling(icl>act), distillation(icl>act));

118

Even (3) which uses the word “get” which is used here in the same sense as “obtain”

results in the same universal word get(agt>thing,obj>thing,src> thing) and thus result

in the same UNL structure. However sentences (4 and 5) use “make” which has a

different UW, and these are handled in the inference engine by using rules for

unifying similar UWs when used in the context of water. Very wide variations such as

(6), which requires added pragmatic knowledge such as “pure water is safe”, and also

results in a set of two conjunctive UNL expressions (one for the “while” clause, and

the other for the main clause) can be handled but since the set of such constructs is

very large, they are not handled in the current version.

3.2 The Inference Engine

Although UNL structure provides a first order logic encoding of natural language, it is

not designed for making semantic inferences, and an inference engine needs to be

built for this purpose. The rules embedded in the process implement the First Order

Logic in order to obtain new inferences. For example, given the facts “Water- borne

diseases are caused by ingestion of contaminated water. and cholera is a water-borne

disease”, one may infer that “cholera is caused by ingestion of contaminated water.”

A meta-rule for this situation, incorporated as part of the inference rule base is that,

given:

agt(cause(icl>abstract thing).@entry:1);

obj(cause(icl>abstract thing).@entry:2);

nam(2:3);

which says that variable 1 causes 2, and 2 is a type of 3. Given this set of UNL

relations, the meta-rule says that one can infer:

agt(cause(icl>abstract thing).@entry:,1:);

obj(cause(icl>abstract thing).@entry:,3:);

i.e. 3 is caused by 1. The current system is designed to be tested only on a simple

Question and Answer mechanism. We use single-tiered inferences, and construct a

complete set of all possible inferences that can be made from the given text and the

pragmatic rules.

4 THE QUESTION – ANSWER MODULE

Every query is processed to get an answer template that represents the form of the

potential answer corresponding to the question. This template is input to an HPSG

Parser (Sharma et al, 2002) which outputs a pseudo UNL expression corresponding to

the proposed answer template. The pseudo UNL expression is subsequently subject to

a structure matching with the UNL document. The multilinguality of the system

comes from the multilingual characteristic of the parser and the language –

independent semantic structure provided by the inter-lingua (UNL).

119

4.1 Question Analysis and Processing – Generating the

Answer Template

The goal of the question analysis phase is to determine the focus of the question and

the expected type of answer for it. We then generate an answer template that

represents the form of the answer corresponding to a question, with a label “X”

introduced at the “answer slot” (the position where the answer key-word or key-

phrase needs to fit on). “X” represents the entity to be searched for as an answer. In

this work we use a set of transformational rules to arrive at the answer template.

These rules have been categorized on the basis of different question types, like the

“wh” questions, the “yes/no” questions, the “how” questions etc. The rules in each

category check different features of the question. In particular the rule may detect the

presence of a particular word or phrase occurrence, like words of a given part of

speech or those of a particular semantic category. The rules introduce the label “X” at

the location of the keyword or the key phrase in the answer pattern. For example, the

rule, HOW:AUX:1:V(PPL) > 1:AUX:V(PPL):BY:X, works upon a question such as

How is water contaminated? which is transformed to its corresponding answer

template Water is contaminated by X.

Most of these rules are language dependent. We have used a rule base of around

50 transformation rules for English and around 20 rules for Hindi, which offers a

relatively easier answer type identification and template generation.

An example could be – 1:kha^M:V(pre) > 1:X

:maoM:V(pre) which matches

questions like, “jala kha^M bahta hO?”(Where does water flow?) and have a easy

transformation to the answer “jala naidyaaoM maoM bahta hO”, while its English

counterpart will have to follow from a rule like where:does:1:V(base):2 >

1:V(pres):Prep(pos):X:2 to produce “water flows in rivers.” which requires a verb-

form transformation also.

There will be cases where more than one transformation rule applies for a given

question. In such cases, we let all the possible answer templates pass on to the next

phase. This is very common in case of complex query structures, for example the

“how” queries in English and the “kOsao” queries in Hindi. No answer selection is

done at this stage.

Thus the queries in one language are converted into the corresponding answer

templates in the same language and passed on to the next phase – the semantic parser.

4.2 From the Natural Language (NL) Answer Template to

the NL semantic predicates

The answer template is converted into a pseudo UNL representation by a multilingual

HPSG Parser (Sharma et al, 2002) which operates on lexicons specifying the

semantic selection (as against the categorical selection) properties of phrasal heads.

We have build separate semantic lexicons for English and Hindi with shallow

syntactic rules. Semantic relation attributes are mostly used in the lexicon instead of

syntactic subcategory features since the parsed answer form needs to be unified with a

database that is in the UNL semantic predicate format, i.e. the UNL Document. The

UNL structure uses relations that are defined in terms of semantic features such as

agency, place, etc. Therefore, these relations need to be identified in the parsed

answer form for structure matching to be possible. To take examples of a lexical

120

entries the said semantic feature information, let us look at the verb “makes” and its

corresponding verb “banaata” in Hindi:

<makes>@V(pre,sg){agt|!|obj|~gol|~plc}

< banaata >@V(pre,sg,male){/agt|~plc|~obj|~gol\|!}

So, for the word “makes”, we can capture the syntactic information of it being a

singular verb(sg) and used in present tense(pre) while the semantic description

"{agt|!|obj|~gol}" captures the fact that ”makes” heads a phrase which takes the form

of an Agent relation followed by the verb itself and then an Object relation, an

optional Goal relation and an optional Place(location) relation”. Note that agt, obj ,gol

and plc are all UNL relations. “banaata” has similar semantic subcategorization

though the positioning of the various sub-phrases in the phrase that “banaata” heads

is different and is specific to the Hindi language. The parser thus needs different

lexicons for parsing answer templates from different languages .

The Nominal heads which take on the roles agent, object and goal are entered in

the two lexicon were as follows:

<impurities>@agt(pl){~qua|~mod|!|~plc}

<water>@obj(sg){~qua|~mod|!|~plc}

<unsafe>@gol{!}

<kITaNau >@agt(pl){~qua|~mod|!|~prep}

<jala>@obj(sg){~qua|~mod|!|~prep}

<Asaurixat >@gol{!}

The HPSG parser reads the lexicons and states relations as given in lexical entries.

From the parsed tree thus obtained, we can get the relation between two nodes, which

would essentially be the label attached with the child node. To take an example of

how the pseudo-enconverter works, given the questions in Hindi(and English), “jala

kao kaOna Asaurixat banaata hO ?”(What makes water unsafe?), we generate the

answer templates jala kao X Asaurixat banaata hO (X makes water unsafe), with

the transformational rules 1: kaOna:2 > 1:X:2 (and what:V:1 > X:V:1)

The parsed outputs are as follows:

( ( X agt(sg)) makes V(pre,sg) (water obj(sg)) (unsafe gol) )

+-X makes water unsafe

|+-X_agt(sg)

|+-makes water unsafe

| | +-makes_V(pre,sg)

| | +-water_obj(sg)

| | +-unsafe_gol

( ( ( *jala _agt(sg){}* ( * kao _prep{}* ) ) (

*X_obj(){}*)(*Asaurixat_gol{}*)*banaata V(pre,sg,male){}*)*hO _aux{}* )

+- jala kao X Asaurixat banaata hO

+- jala kao X

Asaurixat banaata

| +-jala kao

| | +-jala _agt(sg)

| | +- kao _prep

| +-X_obj(sg)

| +- Asaurixat _gol

| +- banaata _ V(pre,sg,male)

121

+- hO _aux

The list of semantic predicates produced is –

agt(makes,X)

obj(makes,water)

gol(makes,unsafe)

agt(banaata,X)

obj(banaata, jala)

gol(banaata, Asaurixat)

4.3 From NL semantic predicates to inter-lingua

The set of predicates obtained thus far have the attributes in the Query Language. In

this phase these NL words are mapped to Universal Words (UW’s of UNL) to

produce the UNL semantic predicates( the UNL expression), very similar to the form

in which the Document to be queried had been converted using manual annotation

and predicate generation (Section 2).

So it is in this phase that the query ( or the Answer template ) representations,

which had so far been in different languages, are brought to a homogeneous language-

independent encoding suitable for search.

The UW Dictionaries give us the mapping form NL words to the UWs . Mainly

built to cater the needs of the machine translation groups, the UW Dictionaries are

expected to be huge and comprehensive. But since we did not have any formal

product release yet, we implemented our own dictionaries to cover our corpus needs.

The typical entries in the UW Dictionaries are like –

The English dictionary –

[makes] {} “make(agt>thing,obj>thing,src>thing)” (VER,PRE,SG) <ENGLISH>

[water] {} “water(icl>liquid)” (NOUN,SG) <ENGLISH>

[unsafe] {} “unsafe(aoj>thing)” (ADJ) <ENGLISH>

The Hindi Dictionary -

[banaata ] {} “make(agt>thing,obj>thing,src>thing)” (VER,SG,MALE,TAA) <HINDI>

[jala] {} “water(icl>liquid)” (NOUN,SG) <HINDI>

[Asaurixat ] {} “unsafe(aoj>thing)” (ADJ) <HINDI>

As can be seen, along with the translation of the words , the UW Dictionaries also

give us additional information regarding features of a word. By using the two

dictionaries, and processing some additional information that was carried over from

the last phase (for example the tense information we carried over for the Hindi answer

in a previous example), we can thus obtain the UNL predicates for the answer

template which prove to be same for similar queries in different languages. Thus for

the two answer templates jala kao X Asaurixat banaata hO and X makes water

unsafe ,corresponding to the two similar queries in Hindi and English respectively ,

we obtain the same UNL Expressions -

agt(make( agt>thing, obj>thing, src>thing). @pre, X)

obj(make( agt>thing, obj>thing, src>thing).@pre , water(icl>liquid))

gol(make( agt>thing, obj>thing, src>thing),unsafe (aoj>thing))

This set of predicates is now searched for and matched with the UNL Document of

the corpus to give us the entities which are contenders for fitting in place of label X in

the Answer Template.

122

4.4 The Search Engine

Given the answer form of the question in UNL predicates, it has to be matched with

the UNL document ( the complete set of UNL predicates for the entire corpus) to see

if an answer can be provided. First, each sentence in the UNL Document (as

generated in section 2) is converted into a UNL graph, with two arguments as nodes,

connected by a link with the label of the relation. Next, we convert the answer

template present as UNL predicates into the UNL graph. Continuing with our

example question , jala kao kaOna Asaurixat banaata hO (What makes water

unsafe), the UNL predicates for the answer template described in the previous section

produce a graph like the following –

Figure2: The Graph for the template “X makes water unsafe”.

Finding an answer involves matching this Query UNL graph with a UNL sub graph

from the UNL Document, If one of the nodes in the query has a variable X then the

match returns the value of this variable. For “yes/no” question the matching of the

whole graph returns “yes” else a “no”. For such questions, there is no label X in the

answer template.

4.5 Conversion of the Answer from the inter-lingua back to

NL (De-conversion process )

The search returns a possible entity (key-word or a key-phrase) in UNL which can

serve as the answer ( and a replacement for X ) in the answer template. If the answer

returned is a word, then the de-conversion from the Universal Word to the NL word

can be done directly by finding the reverse mapping from the UW Dictionary. In case

of phrases as answers, a formal de-conversion process has to be followed. We adopted

two strategies to deal with phrasal de-conversion. The first one makes use of the

“scope” functionality in UNL description wherein a complete phrase is taken within a

“scope” and is said to form relations with other parts of the sentence as a single word.

For example, in the question “How is cholera caused” which we looked up in the

search engine the phrase “ingestion of contaminated water” is treated as a scope and

the predicates generated for the document are –

obj(cause(agt>thing,obj>thing),cholera(icl>disease))

agt(cause(agt>thing,obj>thing),SCOPE1)

label(SCOPE1, ingestion of contaminated water)

123

The search is thus made to return the complete SCOPE instead of the individual word

“contamination”. For each of the SCOPES then the UW Dictionary also has complete

phrasal translations entered manually. For example –

[saMk/imat

jala ko ga/hNa] {} “ingestion of contaminated water” (SCOPE) <HINDI>

The second strategy uses shallow de-conversion rules. These rules cover common

phrasal constructs, particularly the noun phrases. Some examples are –

mod(X,Y) = X|Y which means that a modifier relationship between X and Y would

be “the word Y followed by the word X” in natural language. This rule for example

works well for both English and Hindi. For example mod(water,contaminated) can is

de-converted to “contaminated water” and to “saMk/imat

jala” in English and Hindi

respectively, which are valid constructs for both languages.

During our search , we incorporate both the strategies though the first one is given

preference owing to the unavailability of robust UNL-Hindi deconverters. The answer

obtained from this phase can directly be inserted in place of the label X in the answer

template and reported. In case of multiple answers matching a given set of predicates,

all the results are reported. There is no word sense disambiguation or answer selection

applied in this process.

5 RESULTS

The present system has given results upto 60% accuracy to the questions which have

answers directly taken from the corpus, both for Hindi and English. Failure analysis

of also suggests that upto 50% of the failures in obtaining a satisfactory answer in the

language of the query were due to the inabilities of the system to de-convert complex

English phrases into query language. The conversion of the question to answer

template resulted in another 27% of the failures. Also, the short answer questions(one

word and two word) are answered much more successfully(upto 70% success rate)

than the phrasal ones and the performance keeps on decreasing with the complexity of

the answer. The analysis of the system was done on a database of questions given by

students in the project group and other volunteers.

6 CONCLUSION

Being our first attempt at developing a Question Answer module, the focus was more

on system implementation than on system tuning. This may be the first attempt

toward developing a cross-language Hindi/English Question Answer Module, and

though the system is not restricted in any form to any language, we have reported our

current implementations for English and Hindi. The module is being tested on short

stories for children and the results are encouraging.

The performance of the system, for any language, will critically depend on the

effectiveness of the En-conversion (form NL to UNL) and the De-conversion(for

UNL to NL) engines developed for that language. Although, we have been able to

achieve reasonable accuracy on the En-conversion front, by using a self developed

parser for the purpose, the de-conversion task is still far from good. This imposes a

heavy restriction on the complexity of the answers that can be extracted in different

languages. While the UNL structure is a First Order Predicate form, there are

124

remarkable differences with normal logical models. For one, UNL structures do not

provide for an implication connective, and also use the disjunction relation “or” rather

sparingly. A rigorous mapping to more traditional logical structures is needed for

more extensive UNL based logical inferencing. Efforts are on in this direction.

Also, the manual process of designing the pragmatic knowledge-base is expensive

– it needs to be seen if further synergies can be gained by unifying this effort with

parts of the UNL KB. Despite these shortcomings, we hope the present work will

provide a start towards this difficult yet important problem.

REFERENCES

Bittner, M. (1998) ‘Cross-Linguistic Semantics for Questions’, Linguistics and Philosophy, vol.

21, pp. 1-82.

Uchida, H., 2001. UNL Beyond Machine Translation. International Symposium on Language

in Cyberspace, Seoul, Korea Systems. ICEIS Press.

Ground Water and Drinking Water, EPA. [online]. Available from

<http://www.epa.gov/safewater/dwh/ contams.html > [Accessed 6 July 2003].

WHO:Fact Sheets [online]. Available from http://www.who.int/inf-fs/en/fact112.html

[Accessed 6 July, 2004].

Sharma, D., Vikram ,K., Mittal, M. R., Mukerjee, A., Raina, A., M. (2002) ‘Saarthaka-A

generalized HPSG parser for English and Hindi’, Recent Advances in Natural Language

Processing-Proceedings ICON-2002, Mumbai, India, pp. 18-21.

Martins, R. (2002) ‘UNL as a linguistic point of view’ ICUKL, Goa, India

Moldovan, D., Pasca, M., Harabagiu, S., Surdeanu, M. (2002) ‘Performance Issues and Error

Analysis in an Open-domain Question-Answering System.’ 40th Annual Meeting of the

Association for Computational Linguistics(ACL), Philadelphia, pp. 33- 40.

Moldovan, D., Clark, C., Harabagiu, S., and Maiorana, S., (2003) ‘COGEX: A Logic Prover for

Question Answering.’ Proceedings of HLT-NAACL Main Papers ,Edmonton, pp. 87-93.

125