be adapted to guess lexical categories of extraneous
words from context. However, in most of them this
would require a major modification of the parser.
Take for instance DCGs (Definite Clause Gram-
mars, (Pereira and Warren, 1980)), where lexical rules
would appear as exemplified by:
noun --> [borogove].
If the lexicon does not explicitly include the word
“borogove” among the nouns, the parser would simply
fail upon encountering it. One could admit unknown
nouns through the following rule:
noun --> [_].
But since this rule would indiscriminately accept
any word as a noun (and similar rules would have
to be included in order to treat possible extraneous
words in any other category), this approach would
mislead the parser into trying countless paths that are
doomed to fail, and might even generate wrong results.
In contrast, we can parse extraneous words
through Womb Grammar by anonymizing the category
and its features rather than the word itself, e.g.
word(Category,[Number,Gender],borogove), which
more accurately represents what we know and what
we don’t. The category and features will become efficiently
instantiated through constraint satisfaction,
taking into account all the properties that must be satisfied
by this word in interaction with its context.
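The idea of leaving the category and features open, as the unbound Prolog variables in word(Category,[Number,Gender],borogove) do, can be mimicked by representing an unknown word with the full set of candidate values, to be narrowed as contextual constraints apply. A minimal Python sketch (all names are illustrative, not the paper’s implementation):

```python
# Sketch: an extraneous word starts with every category and feature
# value still possible, mirroring an unbound variable in
# word(Category, [Number, Gender], borogove).

CATEGORIES = {"det", "adj", "noun", "verb"}
NUMBERS = {"sing", "plur"}

def unknown_word(form):
    """All candidate analyses are initially open."""
    return {"form": form,
            "category": set(CATEGORIES),
            "number": set(NUMBERS)}

def constrain(word, slot, allowed):
    """Intersect a slot with the values a contextual constraint allows."""
    word[slot] &= set(allowed)
    if not word[slot]:
        raise ValueError(f"no analysis left for {word['form']!r}")
    return word

w = unknown_word("borogove")
# Context: it heads a plural noun phrase, so it must be a plural noun.
constrain(w, "category", {"noun"})
constrain(w, "number", {"plur"})
print(w["category"], w["number"])   # {'noun'} {'plur'}
```

Each constraint only ever shrinks the candidate sets, so the order of application does not matter, much as in constraint satisfaction proper.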
Of course, it would be most interesting to derive
the meaning of the word that “does not belong”.
While Womb Grammars do not yet have
a complete way of treating semantics, the clues they
can provide regarding syntactic category can serve to
guide a subsequent semantic analysis, or to bypass
the need for a complete semantic analysis through the
concomitant use of ontologies relevant to domain-specific
uses of our parser. In general, we are not necessarily
interested in capturing the exact meaning of each unrecognised
word, but rather in inferring its relation to
known words. The problem can be cast as the (automatic)
extraction of a portion of the hypernym relation
involving the extraneous word, using the actual
document or additional sources as corpora (see (Clark
et al., 2012)).
For instance, in the poem “Jabberwocky”, by
Lewis Carroll, nonsense words are interspersed
within English text with correct syntax. Our target
lexicon, which we might call Wonderland Lexicon or
WL, can be to some extent reconstructed from the sur-
rounding English words and structure by modularly
applying the constraints for English. Thus, “boro-
goves” must be labelled as a noun in order not to
violate a noun phrase’s exigency for a head noun.
In other noun phrases, the extraneous words can be
recognised only as adjectives. This is the case for
“the manxome foe” and “his vorpal sword”, once
the following constraints are applied: adjectives must
precede nouns, a noun phrase can have only one
head noun, and determiners are unique within a noun
phrase. In the case of “the slithy toves”, where there
are two WL words, the constraint that the head noun
is obligatory implies that one of these two words is
a noun, and the noun must be “toves” rather than
“slithy” (which is identified as an adjective as in the
two previous examples) in order not to violate the
precedence constraint between nouns and adjectives.
In other cases we may not be able to unambiguously
determine the category, for instance the WL word
“frabjous” preceding the English word “day” may re-
main ambiguous no matter how we parse it, if it satis-
fies all the constraints either as a determiner or as an
adjective².
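The reasoning just traced for “the slithy toves” can be replayed mechanically: enumerate candidate labellings and keep those satisfying the three constraints cited above. A Python sketch (the lexicon and category inventory are assumed for illustration):

```python
from itertools import product

# Known English words have fixed categories; WL words are open.
LEXICON = {"the": {"det"}}
OPEN = {"det", "adj", "noun"}      # candidates for an unknown (WL) word

def consistent(cats):
    """The three NP constraints: one head noun, at most one
    determiner, adjectives precede the noun."""
    if cats.count("noun") != 1:
        return False
    if cats.count("det") > 1:
        return False
    noun_at = cats.index("noun")
    return not any(c == "adj" and i > noun_at
                   for i, c in enumerate(cats))

def label(phrase):
    """Return every constraint-satisfying labelling of the phrase."""
    domains = [LEXICON.get(w, OPEN) for w in phrase]
    return [dict(zip(phrase, cats))
            for cats in product(*domains) if consistent(cats)]

print(label(["the", "slithy", "toves"]))
# [{'the': 'det', 'slithy': 'adj', 'toves': 'noun'}]
```

As in the text, the unique surviving labelling makes “toves” the noun and “slithy” an adjective; phrases like “frabjous day” would instead return several labellings, reflecting the residual ambiguity.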
Two of the poem’s noun phrases (“the Jubjub
bird” and “the Tumtum tree”) provide ontological
as well as lexical information (under the reasonable
assumption that capitalised words must be proper
nouns, coupled with the fact that, as proper nouns,
these words do not violate any constraints). Our adaptation
of Womb Grammars includes a starting-point,
domain-dependent ontology (which could, of course,
initially be empty) that can be augmented with
such ontological information as the facts that Tumtums
are trees and Jubjubs are birds. Similarly, input
such as “Vrilligs are vampires” would result in additions
to the ontology as well as in lexical recognition.
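Such updates amount to asserting hypernym (is-a) facts as phrases are parsed. A hypothetical Python sketch of this growing ontology (the class and its facts are illustrative, not the paper’s actual implementation):

```python
# Sketch of a tiny is-a ontology grown from parsed noun phrases such
# as "the Jubjub bird" or copular input such as "Vrilligs are vampires".

class Ontology:
    def __init__(self):
        self.isa = {}                      # word -> set of hypernyms

    def add(self, word, hypernym):
        self.isa.setdefault(word, set()).add(hypernym)

    def hypernyms(self, word):
        """Transitive closure of the is-a relation for word."""
        seen, todo = set(), [word]
        while todo:
            for parent in self.isa.get(todo.pop(), ()):
                if parent not in seen:
                    seen.add(parent)
                    todo.append(parent)
        return seen

onto = Ontology()
onto.add("jubjub", "bird")      # from the NP "the Jubjub bird"
onto.add("tumtum", "tree")      # from the NP "the Tumtum tree"
onto.add("vrillig", "vampire")  # from "Vrilligs are vampires"
onto.add("vampire", "monster")  # an assumed pre-existing domain fact

print(onto.hypernyms("vrillig"))    # the set {'vampire', 'monster'}
```

Starting from a possibly empty domain ontology, each parsed phrase contributes one such edge, and queries over the transitive closure recover what is known about an extraneous word.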
Some input might even allow us to equate
some extraneous words with their English equivalents.
For instance, if instead of having in the same
poem the noun phrases “his vorpal sword” and “the
vorpal blade”, we’d encountered “his vorpal sword”
and “the cutting blade”, we could bet on approximate
synonymy between “vorpal” and “cutting”, on the basis
of our English ontology having established semantic
similarity between “sword” and “blade”.
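The inference just described can be sketched as: two adjectives observed to modify semantically similar head nouns become candidate synonyms. In the Python sketch below, the similarity pairs are assumed stand-ins for what the English ontology would provide:

```python
# Hypothetical sketch: propose adjective synonymy when an unknown
# adjective and a known one modify semantically similar head nouns.
SIMILAR = {frozenset({"sword", "blade"})}   # assumed ontology output

def similar(n1, n2):
    return n1 == n2 or frozenset({n1, n2}) in SIMILAR

def synonym_candidates(observations):
    """observations: (adjective, head_noun) pairs from parsed NPs."""
    out = set()
    for a1, h1 in observations:
        for a2, h2 in observations:
            if a1 != a2 and similar(h1, h2):
                out.add(frozenset({a1, a2}))
    return out

obs = [("vorpal", "sword"), ("cutting", "blade")]
print(synonym_candidates(obs))
```

On the example from the text this proposes the single pair {“vorpal”, “cutting”}; the proposals are of course only hypotheses, to be weighed against further evidence.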
Similarly, extraneous words that repeat might al-
low a domain-dependent ontology to help determine
their meaning. Taking once more the example of “his
vorpal sword” and “the vorpal blade”, by consulting
the ontology in addition to the constraints we can not only
determine that “vorpal” is an adjective, but also that
it probably refers to some quality of cutting objects.
It would be most interesting to carefully study under
which conditions such ontological inferences would
be warranted.
² Which precise constraints are defined for a given language
subset is left to the grammar designer; those in this
paper are meant to exemplify more than to prescribe.
ICAART 2015 - International Conference on Agents and Artificial Intelligence