Modeling

Software Speciﬁcations with Case Based

Reasoning

Nuno Seco

, Paulo Gomes

, and Francisco C. Pereira

Department of Computer Science, University College Dublin,

Dublin 4, Ireland

CISUC - Centro de Inform

atica e Sistemas da Universidade de Coimbra,

Departamento de Engenharia Inform

atica, Universidade de Coimbra,

3030 Coimbra, Portugal

Abstract. Helping software designers in their task implies the development of

tools with intelligent reasoning capabilities. One such capability is the integration

of Natural Language Processing (NLP) in Computer Aided Software Engineer-

ing (CASE) tools, thus improving the designer/tool interface. In this paper, we

present a Case Based Reasoning (CBR) approach that enables the generation of

Uniﬁed Modeling Language (UML) class diagrams from natural language text.

We describe the natural language translation module and provide an overview of

the tool in which it is integrated. Experimental results evaluating the retrieval and

adaptation mechanisms are also presented.

1 Introduction

Design is a complex task involving several types of reasoning mechanisms and knowl-

edge types [1]. The ﬁeld of software design is no exception, especially if we consider

software design as a way of computationally modeling part of the real world. CASE

tools were developed some decades ago to help designers model software systems.

Many of these tools are only graphical editors that graphically assist the user design-

ing the system. Given this fact, these tools are bereft of capabilities other than simple

editing skills, one can suggest that the inclusion of intelligent reasoning capabilities

is a goal that must be attained in order to relieve the designer from tedious and time

consuming tasks.

We are developing a CASE tool with several functions that extend beyond these

mundane editing skills. One such function is the ability to translate textual descriptions

of a proposed system into the graphical formalism used by the CASE tool. This pro-

cess implies the three following steps: morphological analysis, syntactic analysis, and

semantic analysis. The morphological analysis identiﬁes the lexical category of each

word. The syntactic analysis creates the parse tree of the text being analyzed. Finally,

semantic analysis associates meaning to text elements and converts them into the for-

malism used in our CASE tool. For the initial steps, there are well established methods

to perform these analysis stemming from the NLP community [2, 3]. The semantic anal-

ysis is domain dependent and requires thoroughly developed approaches to cope with

Seco N., Gomes P. and C. Pereira F. (2004).

Modeling Software Speciﬁcations with Case Based Reasoning.

In Proceedings of the 1st International Workshop on Natural Language Understanding and Cognitive Science, pages 135-144

DOI: 10.5220/0002670701350144

 SciTePress

the complexity of natural language. In this paper, we propose a new approach to se-

mantic analysis of texts and analyze means of converting them into a formal software

speciﬁcation language. The proposed approach for semantic analysis is based on the

CBR paradigm [4,5] and converts textual software descriptions into UML class dia-

grams

The immense number of syntactical and semantic relations which are possible in

natural language jeopardizes the development of a theoretical framework for transla-

tion. The lack of an aligned corpus containing software requirements and their respec-

tive software designs, precludes the possibility of using corpus based strategies, such

as Example Based Machine Translation (EBMT). The above reasons lead us to believe

that CBR is a suitable approach. Another advantage of a CBR approach, is that it al-

lows the system to evolve in time by learning new cases and by adapting itself to the

linguistic modeling preferences of its users. We have dubbed our NLP module NOESIS,

which is an acronym for Natural Language Oriented Engineering System for Interactive

Speciﬁcations.

The next section describes the CASE tool where NOESIS is integrated. The NOE-

SIS module is detailed in section 3 where we focus on the NLP steps of the translation

process, the case representation, retrieval and ﬁnally the adaptation mechanisms. Sec-

tion 4 presents experimental work. Finally, we bring our analysis to a conclusion and

offer some ideas for future work.

2 REBUILDER

The main goals of REBUILDER are to create and manage a repository of software

designs and to provide the software designer with a set of functions which are necessary

in promoting the reuse of previous design experiences.

Figure 1 illustrates the architecture of REBUILDER. It comprises of four main mod-

ules: the UML editor, the knowledge base manager, the knowledge base (KB), and the

CBR engine. It also depicts the two different user types: software designers and KB

administrators. Software designers use REBUILDER as a CASE tool and subsequently

reuse the software design knowledge it has previously stored. The KB administrator

keeps the KB updated and consistent. The UML editor serves as the intermediary be-

tween REBUILDER and the software designer while the KB manager is the interface

between the KB administrator and the system.

3 NOESIS

The aim of NOESIS is to assist the designer in the creation of the initial UML class

diagram. Consequently, the diagram may be passed along to the other reasoning mod-

ules capable of completing it. At the present stage, NOESIS focuses on the structural

requirements of the desired system salient in the software speciﬁcation documents. A

possible passage could be:

UML is the design language used by the CASE tool that we are developing (see [6])

136

Manager Client

UML

Editor

CBR

Engine

KB Manager Module

Knowledge Base

WordNet Server

WordNet

Case

Indexes

File Server

Design

Cases

Data Type

Taxonomy

Designer Client

UML

Editor

CBR

Engine

Software DesignersKB Administrator

Fig.1. REBUILDER’s Architecture.

A BankingCompany has many Customers. These Customers may have several

BankAccounts. A Customer can request a BankCard for each BankAccount.

These texts are subject to some peculiarities and simpliﬁcations [7], which some-

what eases the difﬁculty in understanding them. Of these peculiarities, explicitness is

probably one of the most signiﬁcant. It is useful in reducing to a minimum, the amount

of unexpressed information in the speciﬁcations, thus avoiding potential disputes over

what the software should or should not do.

NOESIS, similar to any other NLP system, is aware of the knowledge categories

identiﬁed in [2] which include Phonetics, Phonology, Morphology, Syntax, Semantics,

Pragmatics and Discourse. However, due to the speciﬁc nature of the problem, phonetic

and phonological knowledge is absent. We assume, for the moment, that the require-

ments are fed to NOESIS in the form of digitalized texts. The other omitted knowledge

category is pragmatic knowledge. As stated above, we assume that these texts are ex-

plicit. Therefore, implicit information is avoided, thus relieving the system of dealing

with pragmatic issues.

Figure 2 illustrates the main modules that are elements of the NLP engine of NOE-

SIS and the processes that each sentence undergoes. Sentences from the original text are

queued and posted individually to the NLP engine. Each word in the sentence is tagged

with a label that identiﬁes the grammatical class of that word. The tagged sentence is

then sent to a syntactic parser which derives the possible parse trees of the sentence.

One of these trees is then selected as the best derivation and is passed on to the case-

based semantic analyzer which outputs a meaningful representation (e.g. UML Class

Diagram). This representation is a diagram that corresponds to the sentence, yet it is

only a partial solution to the initial text. As a result, it is necessary to merge these par-

tial diagrams together before being presented as a candidate solution to the user. The use

of sentences as the basic unit of processing is arguable. Obviously, by breaking the text

into sentences we loose implicit paragraph-level or text-level relations, but on the other

hand, we are also able to make more detailed similarity judgments. Nevertheless, as

will be stated in section 3.2, during semantic analysis we capture these coarser grained

relations (text-level) by including a context and discourse evaluation component in our

similarity metric.

137

Sentence

POS Tagger

MORPH Analyzer

Tagged

Sentence

Syntactic Parser

Parse Tree

Case-Based

Semantic Analyzer

Partial

Diagram

NOESIS

Fig.2. NOESIS’s Architecture.

3.1 Morphological and Syntactic Analyzer

The ﬁrst phase of processing is Part-Of-Speech tagging. The tagger used in NOESIS

is a purely stochastic tagger called QTAG

. The tagset used by NOESIS is a simple

and reduced form of the original tagset. If the sentence is successfully tagged and all

nouns and verbs exist in the WordNet lexicon, then the tagged sentence is passed on

to the next analyzer. Otherwise, the sentence will undergo morphological processing in

order to identify inﬂected words. These inﬂections are removed from the sentence and

the tagger is asked to process the sentence again. If the sentence is still not correctly

tagged then the user may tag the problematic word(s) manually or rephrase the speciﬁ-

cation. It should be noted that the reduced tagset that we are using is not ﬁne grained,

so distinction between plural and singular forms, which would be an obvious indicator

of cardinality of the relations, is not possible at this point.

The objective of the syntactic analyzer is to discover all valid syntactic parse trees

for a given sentence. A very simple context free grammar for the English language is

used. The parsing algorithm used is known as the Earley algorithm [8]. After parsing,

the resulting derivation is passed on to the semantic analyzer which produces the corre-

sponding meaning representation.

3.2 Semantic Analyzer

The semantic analyzer receives a syntactic parse tree as its input and produces an UML

class diagram that partially models the requirement text. Our analyzer uses a CBR

mechanism to produce a plausible diagram.

The Case Base and Indexes

Previous requirements are stored and indexed as cases

in the case base. Cases in NOESIS are composed of a single parse tree (the leaves are

the words of the sentence), a corresponding class diagram and a mapping tuple that

establishes the coupling between nouns in the sentence and the entities in the diagram.

These cases are then indexed according to the verbs present in the sentence. Each verb

is attributed a synset

from WordNet, then an index representing the case is created and

attached to the corresponding synset node in WordNet. The use of verbs for indexing

is twofold; ﬁrstly verbs provide a relational and semantic framework for sentences [9],

therefore, the verb occupies a core position and no valid sentence may exist without a

More information can be obtained at http://www.clg.bham.ac.uk/tagger.html

A synset identiﬁes the intended meaning for a word.

138

verb. The second reason is concerned with performance reasons. There are many more

nouns than verbs in the English language [9], which means that searching in Word-

Net’s noun graph is much more demanding than searching in it’s verb graph. The most

relevant types of relations used to connect noun synsets are hypernym and meronym

relations. Regarding verb synsets, meronym relations do not exist. These were replaced

by entailment relations (speciﬁc verb clustering relations were also added) [9].

Attaching indexes to WordNet helps evaluate semantic similarity, but portrays noth-

ing regarding syntax which is also a very important facet of sentence interpretation.

In order to close this gap we have created a second structure that is also used for in-

dexation. This means that every index is simultaneously attached to these two different

structures. Figure 3illustrates how this syntactic indexation tree is organized. The root

of the tree has no syntactic meaning. It simply exists for the sake of simplifying the

algorithms that manipulate these data structures.

Root

S1

S2

S3

Fig.3. Syntactic index tree.

Imagine that the nodes s1 and s2 correspond to the parse trees in ﬁgure 4, these

trees could represent the syntactic derivation of the sentences ”Book that ﬂight.” and

”John ate the pizza.”, respectively. As can be seen, each node of the tree in ﬁgure 3

syntactically subsumes every other descending node. This kind of structure facilitates

the assessment of syntactic similarity. We assert that s1 subsumes s2 when every branch

of s1 is contained in s2. This indexation tree is dynamically rearranged when new cases

are added to the case base.

S1

VP

NP

Verb Det Noun

S2

VP

NP

Verb Det Noun

Noun

Fig.4. Example parse trees.

139

Retrieval Our retrieval algorithm comprises of two steps. Firstly, we look for relevant

candidates through the use of indexes and then select the most promising candidate that

maximizes our similarity metric.

The ﬁrst substep takes advantage of the implemented indexing scheme where both

WordNet and the syntactic index tree can be used. This ﬂexibility enabled the imple-

mentation of several algorithms that use both structures simultaneously or individually

(giving preference either to semantics or syntax).

Four retrieval algorithms were implemented which combined the use of these struc-

tures. The algorithms are all very similar to the one encountered in [6], except that all

relations between verb synsets are used during the search. The algorithm is based on

spreading activation, which expands the search to neighbor nodes until some condition

is attained and the search terminates. All algorithms commence by ﬁnding a list of entry

points into to the relevant structure(s). The verbs’ synset is used as the entry point into

WordNet. Regarding the syntactic index tree, the entry point is the node representing

the same syntactic derivation as our target sentence. If no such node exists then a list of

entry points is created which contains the nodes that directly subsume and are subsumed

by our target sentence. The four algorithms used by NOESIS are:

Semantic Retrieval — Only uses the WordNet verb graph.

Syntactic Retrieval — Only uses the syntactic index tree.

Semantic Filtering Retrieval — Uses the syntactic index tree but eliminates in-

dexes that have a semantic distance above some user deﬁned threshold.

4. Conjunctive Retrieval — Uses both structures simultaneously. Only intersecting

indexes that are reached during the search on both structures are considered valid

and may be returned.

Similarity Assessment The second substep ranks each of the indexes returned by the

retrieval algorithm. These indexes contain the parse tree for the case they represent

which is then used for similarity assessment. We use four different measures for ranking

indexes:

– Syntactic Similarity is given by:

× (

#inter sect(target, source)

#nodes(target)

#inter sect(tar get, source)

#nodes(source)

) (1)

where intersect computes a list of nodes that are syntactically common in both

parse trees and nodes returns a list of nodes contained in the tree. Since there is no

obvious reason to normalize the intersection in relation to either the source or the

target, we consider both with equal importance.

– Semantic Similarity is given by:

dist(n(target), n(source)) + dist(v(target), v(source))

2 × MSL

(2)

where dist computes the average semantic distance between noun synsets or verb

synsets. n and v return a list of the nouns or verbs contained in the parse tree.

MSL (Maximum Search Length) deﬁnes the maximum number of arcs that the

search may transverse. If the distance can not be computed the value of the MSL is

returned.

140

– Contextual Similarity is given by:

(

1 if sour ce belongs to a text from which a sentence was previously used

0 otherwise

(3)

– Discourse Similarity is given by:

(snum(target) − snum(source))

+ 1

(4)

where snum returns the absolute position of the sentence in the original text. Hence,

we try to match sentences that occur in similar positions in both texts, allowing us

to capture text-level relations.

These four formulas are normalized which will produce values between [0..1]. After

the computation of each of these components, a weighted sum is used to evaluate the

overall similarity between the target and the source. The indexes are then ranked and

the index that maximizes expression (6) is considered for reuse.

sim(tar get, source) = ω

· SyntacticSimilarity + ω

· SemanticSimilarity + (5)

· ContextSimilarity + ω

· DiscourseSimilarity

Adaptation

After all sentences of the target text have been mapped with a similar

source sentence the adaptation phase begins. Adaptation starts by loading the diagrams

speciﬁed by the selected indexes into memory. Then the names of the entities from the

source diagrams are replaced with the nouns encountered in the target sentence. This

replacement of names is possible because each case holds the mapping (see section

3.2) between the nouns in the sentence and the entities of the diagram. This mapping is

denoted by (6).

Υ : Source

sentence

−→ Source

diag ram

(6)

Having deﬁned the correspondence function between nouns and entities, we now

need a function that can map nouns from the target sentence with nouns apparent in

the source sentence. A naive mechanism is used to accomplish this task; basically the

isomorphism between nouns is established on the basis of their relative positions in

each sentence. This means that the ﬁrst noun in one sentence will be isomorphic with

the ﬁrst noun of the second sentence and so on, so forth. The correspondence is denoted

by (7).

Γ : T arget

sentence

−→ Source

sentence

(7)

Obtaining a diagram for the target sentence is now a matter of discovering the map-

pings between T arget

sentence

and Source

diag ram

. This is easily accomplished by ap-

plying (8) and replacing the names of the entities in the diagram with the nouns in the

target sentence.

141

Υ ◦ Γ : T arget

sentence

−→ Source

diagram

(8)

Applying the above process will yield a several diagrams; there will be as many

diagrams as target sentences. It frequently occurs that each diagram has entities with

the same name (some simple inﬂections may exist). These entities are merged resulting

in a single diagram that is presented to the user.

Retain The proposed diagram may be adapted by the user in the way he or she thinks

adequate. If many modiﬁcations are made on the suggested diagram, it may be due to

a lack of coverage of the case base. In other words, the case base fails to hold the nec-

essary knowledge to satisfactorily solve the problem. Confronted with such a situation,

it is reasonable for the user to submit the altered case to the case base. Then, it is up

to the KB administrator to decide if the submitted case should be indexed and made

accessible to the retrieval component.

4 Preliminary Experimental Studies

In this section, we present some results of preliminary experiments. The aim of these

experiments is to compare the output of NOESIS’s solution with that of a software

designer. We are vividly aware of the subjectivity involved in these experiments and

hold the view that the only reliable experiment to assess the proposed strategy, is one

where the tool is used in a real software production environment and is evaluated by its

users.

The results presented here are based on a case base comprising of 62 cases, which

corresponds to 12 different texts (an average of 4 sentences per text). A set of 22 prob-

lems (an average of 3 sentence per problem) were used for testing the overall accuracy

of the system.

For each problem text a corresponding UML class diagram was designed by the

authors. The same problems were given to NOESIS and the two solutions were com-

pared using a class diagram comparison metric yielding a number ranging from 0 to

1, where 1 represents an identical match and 0 represents two completely different di-

agrams. This metric is actually part of a larger evaluation component which computes

the similarity between two packages (each package may contain a diagram and several

other packages. A detailed explanation can be found in [10].

The four retrieval algorithms presented in section 3.2 along with different weighting

schemes were combined resulting in 60 distinct conﬁgurations. The weight combina-

tions used are evident in table 1. The Semantic Filtering Retrieval algorithm was used

with a predeﬁned threshold of 5 (see section 3.2). Each problem was solved using each

of the conﬁgurations presented in table 1.

Figure 5 purveys the average similarity of NOESIS’s diagrams in relation to the hu-

man generated solutions for every problem description in each of the 15 conﬁgurations.

In ﬁgure 5, SemR, SynR, SemFR and ConR correspond to Semantic Retrieval, Syntac-

tic Retrieval, Semantic Filtering Retrieval and Conjunctive Retrieval, respectively.

142

C01 C02 C03 C04 C05 C06 C07 C08 C09 C10 C11 C12 C13 C14 C15

1 0 0 0 0.5 0.5 0.5 0 0 0 0.33 0.33 0.33 0 0.25

0 1 0 0 0.5 0 0 0.5 0.5 0 0.33 0.33 0 0.33 0.25

0 0 1 0 0 0.5 0 0.5 0 0.5 0.33 0 0.33 0.33 0.25

0 0 0 1 0 0 0.5 0 0.5 0.5 0 0.33 0.33 0.33 0.25

Table 1. The ﬁfteen conﬁgurations used for experimental evaluation. The ω’s represent the dif-

ferent weights given to the overall similarity metric presented in section 3.2.

0.92

0.925

0.93

0.935

0.94

0.945

0.95

0.955

0.96

0.965

0.97

2 4 6 8 10 12 14

Average Similarity

Weight Configuration

Accuracy Results

SemR

SynR SemFR ConR

Fig.5. Average similarity between NOESIS’s solutions and human generated solutions.

These results show us that conﬁguration C03 (where only context is used for simi-

larity assessment) has inferior similarity for all four retrieval algorithms. The Semantic

Filtering Retrieval algorithm performs worse than all other algorithms. This may be a

result of the low threshold value used. This plot also indicates that the adaptation mech-

anism, independent of the retrieval algorithm, is able to adapt the retrieved cases to the

target situation.

5 Conclusions and Future Work

This paper presents a case-based reasoning system capable of translating textual soft-

ware descriptions into UML class diagrams. One advantage of using CBR to perform

semantic analysis of sentences is that CBR does not need a domain theory in which to

work. In software design, it is very difﬁcult to come up with a domain theory because

the designer is modeling an aspect (or a view) of the real world. It is impractical to cope

with all possible views. Another advantage of CBR is that it retains new knowledge in

143

the form of cases. This not only allows the system to evolve but also makes it possible

for the system to adapt itself to the designers’ modeling preferences.

The experimental results shown in section 4 are not conclusive but they encourage

further research using CBR as the main reasoning mechanism for translation of software

descriptions into UML class diagrams. This work has raised several issues that require

further research and improvement. One of these issues is the extension of the system to

deal with attributes and methods. The linguistic processing mechanism would also im-

prove if issues like, selecting the correct syntactic derivation, the resolution of anaphora

and the detection of compound nouns were solved internally by NOESIS. The use of a

complete tagset should also allow the system to make better similarity judgments.

References

1. Tong, C., Sriram, D.: Artiﬁcial Intelligence in Engineering Design. Volume I. Academic

Press (1992)

2. Daniel Jurafsky, J.H.M.: Speech and Language Processing. Prentice Hall (2000)

3. Manning, C., Sch

utze, H. In: Foundations of Statistical Natural Language Processing. The

MIT Press, Cambridge, US (1999)

4. Aamodt, A., Plaza, E.: Case–based reasoning: Foundational issues, methodological varia-

tions, and system approaches. AI Communications 7 (1994) 39–59

5. Kolodner, J.: Case-Based Reasoning. Morgan Kaufmann, San Mateo, California (1993)

Gomes, P., Pereira, F.C., Paiva, P., Seco, N., Carreiro, P., Ferreira, J.L., Bento, C.: Case

retrieval of software designs using wordnet. In Harmelen, F.v., ed.: European Conference on

Artiﬁcial Intelligence (ECAI’02), Lyon, France, IOS Press, Amsterdam (2002)

Ambriola, V., Gervasi, V.: Processing natural language requirements. In: Proc. of the 12th In-

ternational Conference on Automated Software Engineering, Los Alamitos, IEEE Computer

Society Press (1997) 36–45

Earley, J.: An efﬁcient context-free parsing algorithm. Communications of the ACM 13

(1970) 94–102

9. Miller, G., Beckwith, R., Fellbaum, C., Gross, D., Miller, K.J.: Introduction to wordnet: an

on-line lexical database. International Journal of Lexicography 3 (1990) 235 – 244

10. Gomes, P., Pereira, F.C., Paiva, P., Seco, N., Carreiro, P., Ferreira, J.L., Bento, C.: Exper-

iments on case-based retrieval of software designs. In Craw, S., Preece, A.D., eds.: 6th

European Conference on Case-Based Reasoning (ECCBR’02). Volume 2416., Aberdeen,

Scotland, UK, Springer (2002)

144