OBTAINING E-R DIAGRAMS SEMI-AUTOMATICALLY FROM NATURAL LANGUAGE SPECIFICATIONS

Farid Meziane and Sunil Vadera

School of Computing, Science and Engineering

Salford University, Salford M5 4WT, UK.

Keywords:

Software engineering, Entity relationship models, Speciﬁcations, Natural language processing.

Abstract:

Since their inception, entity relationship models have played a central role in systems speciﬁcation, analysis

and development. They have become an important part of several development methodologies and standards

such as SSADM. Obtaining entity relationship models can, however, be a lengthy and time-consuming task for

all but the very smallest of speciﬁcations. This paper describes a semi-automatic approach for obtaining entity

relationship models from natural language speciﬁcations. The approach begins by using natural language

analysis techniques to translate sentences to a meaning representation language called logical form language.

The logical forms of the sentences are used as a basis for identifying the entities and relationships. Heuristics

are then used to suggest suitable degrees for the identiﬁed relationships. This paper describes and illustrates

the main phases of the approach and presents a summary of the results obtained when it is applied to a case

study.

1 INTRODUCTION

Since their inception, entity relationship models

(ERMs) have played a central role in systems speciﬁ-

cation, analysis and development. They have become

an important part of several development methodolo-

gies and standards such as SSADM (Ashworth and

Goodland, 1990). Obtaining ERMs can, however,

be a lengthy and time-consuming task for all but the

very smallest of speciﬁcations. This paper describes

a semi-automatic approach for obtaining ERMs from

natural language (NL) speciﬁcations.

The overall approach, summarised in Figure 1, is based on the view that nouns often denote entities and verbs often denote relationships. However, as we will see later, picking out just the nouns and verbs using string matching is not adequate for producing an ERM. We also need to identify the arguments and the degrees of the relationships. To enable this, the approach begins by using NL analysis techniques to translate sentences to a meaning representation language called logical form language (LFL). The logical forms (LFs) of the sentences are then used as a basis for identifying the entities and relationships. The quantifiers in the LFs are then used to suggest suitable degrees for the identified relationships. The paper is organised in a manner that follows the main phases of the approach: section 2 describes the translation to LFL; section 3 illustrates the identification of the entities and relationships; and section 4 shows how the degrees can be identified. The paper concludes with a summary of the results obtained when the approach is applied to a case study.

Figure 1: Identifying E-R models semi-automatically. (Flow diagram: Specification → sentences → Natural language analysis → logical forms → Selection of entities and relationships → partial E-R model → Identification of degrees → E-R model; supporting inputs are labelled Grammar and Quantifier-based selection criteria.)


Meziane F. and Vadera S. (2004).

OBTAINING E-R DIAGRAMS SEMI-AUTOMATICALLY FROM NATURAL LANGUAGE SPECIFICATIONS.

In Proceedings of the Sixth International Conference on Enterprise Information Systems, pages 638-642

DOI: 10.5220/0002606106380642

 SciTePress

2 TRANSLATING SENTENCES TO LFL

As mentioned above, each sentence of a speciﬁcation

is ﬁrst analysed and translated to a statement in LFL.

In this section we summarise the syntax of LFL and

the translation process. We refer the reader to (Mc-

Cord, 1990; Meziane, 1994) for further details.

In general, each sentence can be translated to an LF

statement of the form:

determiner(Base, Focus)

Typically, the base comes from the remainder of the

noun phrase (NP) in which the determiner appears,

and the focus comes from the sisters of the NP such

as verb phrases (VP), and other NP.

In LF, nouns, verbs, adjectives and pronouns are represented as follows, where we use the Prolog convention that variables begin with a capital letter:

Nouns Nouns are usually represented as one-place predicates (aircraft, aircraft(X)). Relational nouns take two arguments (mother, mother(X, Y)).

Verbs Depending on their category, verbs may be represented by predicates having zero, one, two or three arguments. Hence, the verbs snow, crash, write and give are represented by: snow, crash(X), write(X, Y) and give(X, Y, Z).

Adjectives We distinguish two categories of adjec-

tives, extensional adjectives and intensional adjec-

tives. An adjective is intensional if it cannot be dis-

sociated from the noun it modiﬁes. An extensional

adjective can be dissociated from the noun it mod-

iﬁes. The following two examples illustrate these

two situations respectively:

• The pilot uses a moving map display.

the(pilot(X),the(moving(map(display(Y))), use(X,Y))).

• A complex aircraft uses a radar.

ex(aircraft(X) & complex(X), ex(radar(Y), use(X,Y))).

Pronouns There is no general rule on how to inter-

pret pronouns. Basically they are supposed to be

replaced by the nouns they represent. However, re-

solving pronoun references is a very difﬁcult prob-

lem. Our current implementation omits this aspect

of NL understanding.
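The representation above can be made concrete with a small data structure. The following is a minimal Python sketch, not the authors' Prolog implementation: the class names `Var`, `Term` and `And`, and the helpers `t` and `show`, are hypothetical and introduced only for illustration. It builds the two adjective examples and prints them in the LF notation used in this section.

```python
from dataclasses import dataclass
from typing import Tuple, Union


@dataclass(frozen=True)
class Var:
    name: str  # Prolog-style variable, e.g. "X"


@dataclass(frozen=True)
class Term:
    functor: str                      # predicate or quantifier name
    args: Tuple["Node", ...] = ()     # empty for zero-argument verbs such as snow


@dataclass(frozen=True)
class And:
    left: "Node"                      # conjunction written with & in the LFs
    right: "Node"


Node = Union[Var, Term, And]


def t(functor: str, *args: Node) -> Term:
    """Hypothetical shorthand for building a Term."""
    return Term(functor, tuple(args))


def show(node: Node) -> str:
    """Render a node in the textual LF notation."""
    if isinstance(node, Var):
        return node.name
    if isinstance(node, And):
        return f"{show(node.left)} & {show(node.right)}"
    if not node.args:
        return node.functor
    return f"{node.functor}({','.join(show(a) for a in node.args)})"


X, Y = Var("X"), Var("Y")

# Intensional adjective: "moving" wraps the noun it modifies.
intensional = t("the", t("pilot", X),
                t("the", t("moving", t("map", t("display", Y))), t("use", X, Y)))

# Extensional adjective: "complex" is a separate conjunct on the same variable.
extensional = t("ex", And(t("aircraft", X), t("complex", X)),
                t("ex", t("radar", Y), t("use", X, Y)))

print(show(intensional))
print(show(extensional))
```

The nesting of `moving(map(display(Y)))` versus the conjunction `aircraft(X) & complex(X)` is what distinguishes the two adjective categories in this encoding.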

Note: in the LF examples above, ex denotes the usual existential quantifier.

The translation process takes English sentences and produces LFs in the above representation. It does this in two main phases. First, a syntax analysis is performed to produce all possible parses (syntax trees) of the sentence according to the defined grammar. Each syntax tree is then transformed into a unique LFL expression. This latter transformation forms the semantic interpretation of the English sentence and involves the use of logical operators to combine the different parts of the syntax tree to produce the desired LF (see (McCord, 1990) for details). This process has been implemented in Prolog and a suitable grammar has been developed. For example, the grammar rule used for a VP is:

vp(Infl,E,X) ==>

vhead(Infl,E,X,Slots):

postmods(Slots).

This rule deﬁnes a VP to be composed of a verb

head and a list of postverbal modiﬁers. The job of

vhead is to ﬁnd a verb with an inﬂection Infl,

subject marker X, verb type E and a list Slots of

postverbal modiﬁers which contains verb modiﬁers

such as objects, indirect objects and prepositional

phrases. The grammar also includes prepositions

as well as the other categories mentioned above.

Sentences may be ambiguous, and may therefore have several meanings. In such cases, the NL analyser produces alternative LFs and an analyst is required to select the intended meaning. Although this makes the process less automatic, it is helpful since it enables ambiguities to be detected.

3 E-R MODELS FROM LOGICAL FORMS

The ﬁrst task in identifying an ERM is to obtain a list

of entities in the speciﬁcation and the relationships

between them. There is no clear deﬁnition of what

constitutes an entity. In SSADM for example (Ash-

worth and Goodland, 1990), an entity is deﬁned as

something of importance to the system about which

information can be held. The same deﬁnition is also

used by Bowers (Bowers, 1988), who further suggests

that entities can be objects (person, car), events (birth,

scoring a goal), activities (production, playing) and

associations (marriage).

Grammatically speaking, the entity types in the above list

are related: they all belong to the same grammatical

category, that of nouns. Likewise,

a number of authors have reported that relations are

mainly described by verbs (Ashworth and Goodland,

1990; Gane and Sarson, 1979). We therefore base our

identiﬁcation process on the view that entities are de-

noted by nouns and relationships by verbs. However,

just scanning for nouns and verbs alone is not ade-

quate. There are three signiﬁcant problems that we

now illustrate with the following sentences:

1. A company maintains a description for each item of stock.

2. A computer-assisted flight planning system is used by a complex aircraft.

3. The system of a simple aircraft is considered to comprise the plan of the pilot.

Footnote: Handling prepositions is non-trivial; details are given in (Meziane, 1994).

The entities and relations can easily be picked out

in the ﬁrst sentence as the nouns: company, de-

scription, item, stock and verb maintains. How-

ever, we still need to ﬁnd what entities are related by

maintains and we don’t know the degree of the rela-

tions. In the second sentence, selection of nouns alone

as entities is inadequate since we need to identify the

compound noun computer-assisted ﬂight planning

system. The third sentence has two verbs, comprise

and consider. How can we identify that comprise is

the one that relates the entities and that consider is

only a subsidiary relation?

Fortunately, by ﬁrst translating sentences to LFL,

we are able to overcome these problems. So, for in-

stance, the above three sentences result in the follow-

ing LFs (where all and ex are the usual universal and

existential quantiﬁers):

1. all(item(X,stock), the(company(Y), ex(description(Z), for(X, maintain(Y,Z)))))

2. ex(aircraft(X) & complex(X), ex(computer-assisted(flight(planning(system(Y)))), use(X,Y)))

3. ex(aircraft(X) & simple(X), the(system(Y,X), the(pilot(Z), the(plan(T,Z), comprise(Y,T)))))

Given these LFs, the required information can be rel-

atively easy to extract:

1. The term maintain(Y,Z), in the first LF, gives the relationship between Y and Z, which themselves are qualified in the focus as the company and the description.

2. The compound noun computer-assisted flight planning system is easily identified from the term computer-assisted(flight(planning(system(Y)))).

3. In the third LF, the relation comprise is correctly identified by extracting the inner verb relationship comprise(Y,T) from the LF.

At this stage, it is worth emphasising that this process produces only an initial list of entities and relationships. The model may well be incomplete, since the informal description may itself be incomplete, and it may contain irrelevant entities and relationships.
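The extraction step described in this section can be sketched in Python. This is a hedged illustration under stated assumptions, not the authors' Prolog code: LFs are encoded as nested tuples `(functor, arg, ...)`, variables are strings starting with a capital letter, the conjunction is the functor `"&"`, and the function names `base_entity`, `innermost_relation` and `extract` are hypothetical. The traversal rule (take one entity from the Base of each quantifier, then descend to the innermost term of the final Focus) is our simplification of "extracting the inner verb relationship"; it does not collect atom arguments of relational nouns such as stock in item(X,stock).

```python
QUANTIFIERS = {"all", "ex", "the"}


def is_var(x):
    """Prolog convention: variables are names starting with a capital letter."""
    return isinstance(x, str) and x[:1].isupper()


def base_entity(base):
    """Return (entity_name, variable) for the Base of a quantified term."""
    if base[0] == "&":
        # noun & extensional adjective: the noun is the first conjunct
        return base_entity(base[1])
    names, term = [], base
    # Unwrap intensional modifiers, e.g. moving(map(display(Y))),
    # joining the functors into a compound noun.
    while isinstance(term, tuple):
        names.append(term[0])
        if is_var(term[1]):
            return " ".join(names), term[1]
        term = term[1]
    raise ValueError(f"no variable found in base {base!r}")


def innermost_relation(focus):
    """Descend through wrappers such as for(X, maintain(Y,Z))."""
    while True:
        nested = [a for a in focus[1:] if isinstance(a, tuple)]
        if not nested:
            return focus
        focus = nested[-1]


def extract(lf):
    """Return ({variable: entity}, (relation, (entity, ...))) for one LF."""
    entities = {}
    while lf[0] in QUANTIFIERS:
        _, base, focus = lf
        name, var = base_entity(base)
        entities[var] = name
        lf = focus
    rel = innermost_relation(lf)
    return entities, (rel[0], tuple(entities.get(a, a) for a in rel[1:]))


lf1 = ("all", ("item", "X", "stock"),
       ("the", ("company", "Y"),
        ("ex", ("description", "Z"), ("for", "X", ("maintain", "Y", "Z")))))
lf2 = ("ex", ("&", ("aircraft", "X"), ("complex", "X")),
       ("ex", ("computer-assisted", ("flight", ("planning", ("system", "Y")))),
        ("use", "X", "Y")))
lf3 = ("ex", ("&", ("aircraft", "X"), ("simple", "X")),
       ("the", ("system", "Y", "X"),
        ("the", ("pilot", "Z"),
         ("the", ("plan", "T", "Z"), ("comprise", "Y", "T")))))

for lf in (lf1, lf2, lf3):
    print(extract(lf))
```

Because the variables in the relation are looked up in the entity table, the output names the related entities directly, which is the information that string matching on nouns and verbs cannot provide.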

4 USING QUANTIFIERS TO DETERMINE THE DEGREES

This section shows how the degrees of relationships

can sometimes be identiﬁed from the quantiﬁers in the

LFs of sentences. This approach to identifying the de-

grees is therefore highly dependent on the process of

identifying the quantiﬁers in the English sentences.

Hence, section 4.1 looks at the process of identify-

ing the quantiﬁers in some detail. Section 4.2 then

describes how the quantiﬁers help to identify the de-

grees.

4.1 Identifying Implicit Quantiﬁers

The English language has two articles: the definite article “the” and the indefinite article “a” (“an”). It has been assumed for a long time that both articles can be interpreted as existential quantifiers. Some authors have shown that this is not always true, and that these articles can sometimes lead to the universal quantifier (Hess, 1985). In the following subsections we identify some cases where the quantifiers can be identified from the articles.

The Deﬁnite Article “the”

The definite article is often translated into the unique existential quantifier (i.e., there exists one and only one). It is, for instance, correct to assume that in the sentence:

The student passed the exam.

we are talking about a particular student who passed

a particular exam. However, in the sentence:

The students passed the exam.

we cannot assume the unique existence for the ﬁrst

deﬁnite article. McCord (McCord, 1990) also recog-

nises that interpreting the deﬁnite article only as the

unique existence is not adequate but does not suggest

any alternatives. In our approach we use the singular-

ity or plurality of the noun to determine if it should

be interpreted as the unique existence or normal exis-

tence. It is interpreted as the unique existence only if

the noun quantiﬁed is in the singular form.

The Indeﬁnite Article “a”

There seems to be general agreement that the use of

the indeﬁnite article is always a source of ambiguities

(Allen, 1987). The indeﬁnite article can sometimes be

translated to the existential quantiﬁer and sometimes

to the universal quantiﬁer. According to Hess (Hess,

1985) the most important way to determine the quan-

tiﬁcation of a sentence is through the choice of the

verb. For example, consider the following sentences:

1. A text editor makes modifications to a text file.

2. A text editor is making modiﬁcations to a text ﬁle.

3. A text editor made modiﬁcations to a text ﬁle.

4. A text editor has made modiﬁcations to a text ﬁle.

ICEIS 2004 - DATABASES AND INFORMATION SYSTEMS INTEGRATION


The present tense is used in example (1) to say that a

text editor makes modiﬁcations to a text ﬁle in gen-

eral. The main use of the present tense is to express

habitual actions. In examples (2) to (4) we say that

there is, or was, a case of a text editor making modi-

ﬁcations to a text ﬁle. Therefore, Hess suggested that

because the present tense is used in the ﬁrst sentence,

text editor must be universally quantiﬁed. Likewise,

because of the tenses used in the other sentences, text

editor must be existentially quantiﬁed in the remain-

ing sentences.

In some cases the future is preferred over the

present tense for general statements as in the follow-

ing example:

A man who loves a woman will stroke her.

Dynamic verbs, such as to stroke, seem to call for

the future tense, whereas static verbs such as to re-

spect seem to go better with the present tense. Hence,

Hess formulated the following rules:

• Rule 1:

The subject of a sentence is existentially quantiﬁed

if the VP is in the past tense, in the progressive as-

pect, or in the perfective aspect.

• Rule 2:

Otherwise the subject is universally quantiﬁed, in

particular if it is in the present tense or in the future

tense.

Once we have determined the quantiﬁcation of the

subject of the sentence, we have to do the same thing

to the other components of the sentence. Let us con-

sider the following examples:

1. A man who loves a woman is happy.

2. A man that loves a woman respects her.

Intuitively, we can see that woman should be existentially quantified in the first sentence and universally quantified in the second sentence. To observe the difference, let us consider the LFs of these sentences:

1. all(man(X),ex(woman(Y)&love(X,Y),happy(X)))

2. all(man(X),all(woman(Y)&love(X,Y),respect(X,Y)))

The main verb of the ﬁrst sentence is happy and does

not refer to the NP woman. In the second sentence

the main verb respects refers to the NP woman.

This is the reason why the NP “woman” should be ex-

istentially quantiﬁed in the ﬁrst sentence and univer-

sally quantiﬁed in the second. Hence, Hess suggested

a third rule which is:

• Rule 3:

In a restrictive NP those arguments that are referred

to by the main verb are universally quantiﬁed and

those that are not referred to by the main verb are

existentially quantiﬁed.
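Rule 3 amounts to a membership test on the arguments of the main verb. The following minimal Python sketch illustrates it; the function name `rule3_quantifiers` and the encoding of arguments as variable names are assumptions made for this example only.

```python
def rule3_quantifiers(np_vars, main_verb_args):
    """Rule 3: inside a restrictive NP, a variable that the main verb refers to
    is universally quantified ("all"); any other variable is existentially
    quantified ("ex")."""
    referred = set(main_verb_args)
    return {v: ("all" if v in referred else "ex") for v in np_vars}


# "A man who loves a woman is happy."      main predicate: happy(X)
sentence1 = rule3_quantifiers(np_vars=["Y"], main_verb_args=["X"])

# "A man that loves a woman respects her." main predicate: respect(X,Y)
sentence2 = rule3_quantifiers(np_vars=["Y"], main_verb_args=["X", "Y"])

print(sentence1, sentence2)
```

Here Y stands for woman, the argument introduced inside the restrictive NP; the subject X is quantified separately by Rules 1 and 2.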

This rule now enables the correct interpretation of

the above sentences. However, it does not hold for

non-restrictive NPs. In particular, when an NP appears

to the right of a verb, the kinds of sentences we

have encountered suggest that the indeﬁnite article

should be interpreted as an existential quantiﬁer. For

example in the sentence:

A complex aircraft uses a radar.

The second indeﬁnite article is interpreted as the

existential quantiﬁer and not as the universal quan-

tiﬁer. There are two exceptions to the above rules

which are analysed in the following cases:

• As an exception to rule 1, the past tense can

express a universally quantified assertion, as in the

following example:

A student read books when I was young.

This universal quantification is possible because

the main verb (read) requires a spatial or temporal

post modifier (when).

• As an exception to rule 1, the progressive aspect

can express universal quantiﬁcation as in:

John is always coming late.

This is only possible when the verb is modiﬁed by

expressions such as “always”, “in general”, “regu-

larly”.

To cover these exceptions, we can suggest the fol-

lowing fourth rule which takes precedence over rules

1 and 2.

• Rule 4:

1. The past tense can express a universally quanti-

ﬁed assertion if the main verb requires a spatial

or a temporal post modiﬁer.

2. The progressive aspect can express a universal

quantiﬁcation if the verb is modiﬁed by expres-

sions such as “always”, “in general” and “regu-

larly”.
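Rules 1, 2 and 4 can be read as a small decision procedure for the subject of a sentence introduced by the indefinite article. The Python sketch below is one possible reading, not the authors' implementation: the `VerbPhrase` fields are hypothetical features that a parser would have to supply, and Rule 4, which the text states only as "can express", is treated here as always applying when its condition holds.

```python
from dataclasses import dataclass


@dataclass
class VerbPhrase:
    tense: str                               # "present", "past" or "future"
    progressive: bool = False                # e.g. "is making"
    perfective: bool = False                 # e.g. "has made"
    needs_space_time_modifier: bool = False  # Rule 4.1 condition (assumed feature)
    has_generalising_adverb: bool = False    # Rule 4.2: "always", "in general", "regularly"


def subject_quantifier(vp: VerbPhrase) -> str:
    """Return "all" or "ex" for the subject, checking Rule 4 before Rules 1 and 2."""
    # Rule 4 takes precedence over Rules 1 and 2.
    if vp.tense == "past" and vp.needs_space_time_modifier:
        return "all"
    if vp.progressive and vp.has_generalising_adverb:
        return "all"
    # Rule 1: past tense, progressive aspect or perfective aspect.
    if vp.tense == "past" or vp.progressive or vp.perfective:
        return "ex"
    # Rule 2: otherwise (present or future tense).
    return "all"


examples = {
    "A text editor makes modifications to a text file.": VerbPhrase("present"),
    "A text editor is making modifications to a text file.": VerbPhrase("present", progressive=True),
    "A text editor made modifications to a text file.": VerbPhrase("past"),
    "A text editor has made modifications to a text file.": VerbPhrase("present", perfective=True),
    "A student read books when I was young.": VerbPhrase("past", needs_space_time_modifier=True),
    "John is always coming late.": VerbPhrase("present", progressive=True, has_generalising_adverb=True),
}

for sentence, vp in examples.items():
    print(subject_quantifier(vp), "-", sentence)
```

Placing the Rule 4 checks first is how the precedence stated in the text is expressed in this sketch.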

4.2 Identifying the Degrees from the Quantifiers

This section illustrates how we can make use of

the quantiﬁers identiﬁed in the previous section to

identify the degrees of some relations. Consider the

following examples and their LFs:

• A complex aircraft uses a radar.

all(aircraft(X) & complex(X), ex(radar(Y), use(X,Y)))


• The students passed the exam.

all(student(X), the(exam(Y), pass(X,Y)))

In the ﬁrst example, the ﬁrst entity in the relation is

quantiﬁed by the universal quantiﬁer and the second

by the existential quantiﬁer. Based on our current ex-

perience, and the examples encountered, usually only

one occurrence of the variable quantiﬁed by the exis-

tential quantiﬁer is involved in the relation. In such

cases the LF quantiﬁer “ex” is interpreted as a unique

existential quantifier. We are therefore in a case where

many occurrences of the ﬁrst variable are related to

one occurrence of the second variable. By deﬁnition,

this is a many-to-one relationship. In the second example,

the interpretation is much stronger since

we have a unique existence interpretation for the sec-

ond “the”. We have again a case of a many-to-one

relationship. Let us now consider the following set of

sentences and their LFs:

• The company maintains a description for each item

of stock.

all(item(X,stock), the(company(Y), ex(description(Z),

for(X,maintain(Y,Z)))))

• The student passed all exams.

the(student(X), all(exam(Y), pass(X,Y)))

The NP each item of stock, in the ﬁrst sentence,

suggests that we are talking about a particular stock

that contains many items. Therefore we have a one-to-many

relation between the entity “stock” and the

entity “item”. In general, sentences where the ﬁrst

entity is singular and quantiﬁed by the deﬁnite article

and when the second entity is quantiﬁed by the uni-

versal quantiﬁer deﬁne one-to-many relationships. A

typical example is the second sentence. Let us con-

sider now the following sentence and its LF:

• The student passed the exam.

the(student(X), the(exam(Y), pass(X,Y)))

The previous rule does not apply because the sec-

ond entity is itself singular and quantiﬁed by the

unique existential quantifier. As this example suggests,

we are talking about a particular student who passed

a particular exam. In this case, we infer a one-to-one

relationship between the entities.

These are the main cases where our approach can

help in identifying the degrees of the relations from

the LFs of the sentences. In other cases, when it is

difﬁcult to predict the degree of a relation, we let the

user determine it.
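The cases discussed in this section can be summarised as a lookup on the pair of quantifiers attached to the two entities of a relation. The following Python sketch is an illustration under assumptions: the function name `degree` is hypothetical, the quantifier names follow the LFs above, and returning `None` stands for "let the user determine it". It covers only the quantifier-pair cases, not the one-to-many relation inferred from the NP each item of stock.

```python
from typing import Optional


def degree(first_q: str, second_q: str) -> Optional[str]:
    """Suggest the degree of a relation from the quantifiers of its two entities.

    first_q / second_q are "all", "ex" or "the". Following the text, "ex" on
    the second entity is read as unique existence (one occurrence).
    """
    if first_q == "all" and second_q in ("ex", "the"):
        return "many-to-one"
    if first_q == "the" and second_q == "all":
        return "one-to-many"
    if first_q == "the" and second_q == "the":
        return "one-to-one"
    return None  # no heuristic applies: the analyst decides


print(degree("all", "ex"))   # A complex aircraft uses a radar.
print(degree("all", "the"))  # The students passed the exam.
print(degree("the", "all"))  # The student passed all exams.
print(degree("the", "the"))  # The student passed the exam.
print(degree("ex", "ex"))    # not covered by the heuristics
```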

5 CONCLUSION AND FUTURE WORK

This paper has presented a novel approach that can

help an analyst produce an initial ERM from speciﬁ-

cations written in NL. The approach makes use of NL

analysis techniques to translate sentences to LFs which

are then used as a basis for identifying the entities

and relationships. The quantiﬁers in the LFs also en-

able the identiﬁcation of the degrees of relationships

in some common cases.

The approach has been implemented in Prolog and

tested on some examples. To date, the most interest-

ing application has been to a speciﬁcation of a ﬂight

planning system that was written independently of our

work (Hepworth, 1988; Vadera and Meziane, 1994).

In that case study, the approach worked well in that:

• The majority of entities and relationships were cor-

rectly identiﬁed. The system identiﬁed 55 entities

of which only 1 was thought to be spurious and

none had been missed. It identiﬁed 52 relationships

of which were incorrect and none overlooked.

• Most of the degrees were correctly identiﬁed. The

degrees for 49 of the 52 identiﬁed relations were

correctly predicted.
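As a quick check of the proportions behind these figures, the short Python computation below uses only the counts reported above (55 entities with 1 spurious; 49 of 52 degrees correct). The variable names are ours, and no figure is derived for incorrect relationships.

```python
entities_found = 55
entities_spurious = 1
relations_found = 52
degrees_correct = 49

# Share of identified entities that were judged genuine.
entity_precision = (entities_found - entities_spurious) / entities_found

# Share of identified relations whose degree was predicted correctly.
degree_accuracy = degrees_correct / relations_found

print(f"entity precision: {entity_precision:.1%}")
print(f"degree accuracy:  {degree_accuracy:.1%}")
```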

Future research aims to develop the techniques so that

a wider range of sentences and more structured ob-

jects, like tables, can be handled by the NL process-

ing phase. This should enable a broader evaluation

of the approach on larger speciﬁcations. The results

obtained with our current implementation are encour-

aging and suggest that further research may lead to an

invaluable practical aid for producing ERMs.

REFERENCES

Allen, J. (1987). Natural language understanding. The Benjamin/Cummings Publishing Company, Inc.

Ashworth, C. and Goodland, M. (1990). SSADM: A practical approach. McGraw-Hill Book Company.

Bowers, D. (1988). From data to data base. Van Nostrand Reinhold (U.K.) Co. Ltd.

Gane, C. and Sarson, T. (1979). Structured System Analysis. Prentice-Hall Software Series.

Hepworth, B. (1988). An introduction to Z. Technical Report BAe-WIT-RP-GEN-SWE-152, Systems Computing Department, British Aerospace Ltd.

Hess, M. (1985). How does natural language quantify? In Second Conference of the European Chapter of the Association for Computational Linguistics, pages 8–15.

McCord, M. (1990). Natural language processing in Prolog. In Adrian, W., editor, A logical approach to expert systems and natural language processing Knowledge systems and PROLOG, pages 391–402. Addison-Wesley Publishing Company.

Meziane, F. (1994). From English to Formal Specifications. PhD thesis, University of Salford.

Vadera, S. and Meziane, F. (1994). From English To Formal Specifications. The Computer Journal, 37(9):753–763.
