A Code Merger to Support Reverse Engineering Towards Model-driven

Software Development

Oliver Haase, Nikolaus Moll and Paul Zerr

HTWG Konstanz, University of Applied Sciences, Computer Science Department,

Braunegger Str. 55, 78464 Konstanz, Germany

Keywords:

Iterative Model-driven Migration, Reverse Engineering, Code Merging.

Abstract:

Model-driven engineering is a promising approach whose feasibility for commercial development is currently being validated. While most approaches discuss forward-engineering steps, little research has been done on model-driven software migration. More precisely, it is unclear how to transform (or reverse engineer) existing code into generated and hand-crafted artifacts. We present an iterative approach to this problem. Assuming some evolving high-level representations of a legacy software system, code generators may produce a second version of the system to an extent where hand-crafted code is still necessary for completion. In this report we present a code merger that completes the generated code by reusing the implementation of the legacy software system.

1 INTRODUCTION

Model-Driven Software Development (MDSD) is a

promising engineering approach, in particular when

multiple variants of a product family have to be de-

veloped and maintained. Various papers describe how

MDSD can improve software maintenance but also

software ﬂexibility, productivity and reliability, see,

e.g., (van Deursen and Klint, 1998; Kieburtz et al.,

1996; Ladd and Ramming, 1994; Krueger, 1992).

With MDSD, a software system is described by meta-models, models, generators, as well as hand-crafted code, i.e. those parts of the code that cannot be expressed on the model level, or for which doing so makes little sense. While most MDSD approaches focus on forward engineering, i.e. the development of new software, little research has been done on how to transform (or reverse engineer) existing code into the aforementioned MDSD artifacts.

Reverse engineering existing code towards MDSD

is not always the prudent thing to do. Many pro-

ductive software systems, and in particular those that

are comparably stable, well-designed and hence well

maintainable and extensible, are better left untouched.

Our main goal, in contrast, is to aid the reverse engi-

neering of software that is used for product families

and thus is adapted, conﬁgured, and user tailored in

various ways. Often, companies use reference sys-

tems that are adapted and conﬁgured for new cus-

tomers; just as often, this conﬁguration process has

grown over time and has become highly complicated.

Expressing the reference system in appropriate mod-

els and meta-models can substantially improve the

production of new variants; moreover, MDSD pro-

vides a signiﬁcantly higher degree of platform inde-

pendence for these product families. This is espe-

cially true if the high-level representation is expressed

in one or several domain-speciﬁc languages (DSLs).

The process of reverse engineering towards

MDSD must be iterative. A typical, complex soft-

ware system cannot be transformed into a combina-

tion of DSLs (meta-models), models, and generators

in a single step. Instead, the reverse engineer will

start small, extract individual features from only a few

classes, have only part of the code generated, and have

that generated code combined with the hand-crafted

legacy code. Step by step, the DSLs, models, and gen-

erators will grow, and the percentage of hand-crafted

code will decrease.

Please note that we focus our considerations on Java, one of the most widely used programming languages. The underlying principles are, however, applicable to other object-oriented or modular languages as well. Please also note that this paper is not about DSL design and feature extraction, but about the iterative merging of hand-crafted and generated code such that the resulting software provides the same functionality as the legacy code base. In a productive environment, operability of the system at any time in the iterative process is essential.

Haase O., Moll N. and Zerr P.
A Code Merger to Support Reverse Engineering Towards Model-driven Software Development.
DOI: 10.5220/0004309000830088
In Proceedings of the 1st International Conference on Model-Driven Engineering and Software Development (MODELSWARD-2013), pages 83-88
ISBN: 978-989-8565-42-6
© 2013 SCITEPRESS (Science and Technology Publications, Lda.)

The merging process is far from trivial. This

is partly owing to the iterative nature of the ap-

proach that leaves us with incomplete, generated ar-

tifacts, and partly because of the very nature of most

legacy systems that simply are not as regular and

well-behaved as a strictly forward engineered system

would be. Our approach provides solutions to both

of these problems, as we will show in the subsequent

sections.

2 ITERATIVE REVERSE ENGINEERING

As described in the introduction, we consider reverse

engineering towards MDSD an iterative process. This

process uses, creates, and modiﬁes several artifacts,

which exist in different versions, each of which cor-

responds to the respective iteration in the overall pro-

cess:

• S_i represents the initial source code for iteration i. This source code comprises both manual and generated code. The original legacy code, i.e. the S of the first iteration, can be considered a special case that contains only manual code.

• DSL_i is a domain-specific language that covers those features that are generable in iteration i.

• M_i is the model that describes those features of the software system that have been extracted onto the model level up until the i-th iteration of the process; of course, M_i must conform to DSL_i.

• GEN_i is the generator that accompanies DSL_i and that transforms M_i into generated code.

• G_i is the code generated by GEN_i.

In the following, we use the terms S_i and G_i to denote the code of a software system in its entirety, or only an individual class, depending on the context. We do this in order not to unnecessarily complicate our terminology. Each iterative cycle i consists of two phases that are described below and graphically shown in figure 1.

During this iterative process, the model level artifacts DSL_i, M_i, and GEN_i, as well as the proportion of generated code G_i, will grow successively. Complementarily, the amount of manual code will decrease with each step.

Figure 1: The two phases of the iterative reverse engineering process towards MDSD. (Diagram elements: DSL, GEN, Merger; arrow labels: "instance of", "replaces".)

• Phase 1: Feature Extraction

In each iteration cycle, the reverse engineer aims to advance the generated code by extracting additional features. Technically speaking, the engineer replaces a DSL, a generator, or a model by a newer version. Often, advancing one of these components also requires changes in the other ones, e.g. adding a new meta-model entity to a DSL requires the code generator to cover this new entity as well. The feature extraction is indicated by the dotted arrows in figure 1.

• Phase 2: Code Generation and Merging

The second phase is indicated by the solid arrows in figure 1. First, the generator, GEN_i, transforms the model, M_i, into code, G_i. Roughly speaking, G_i will be a subset of the software system S_i. In order to obtain a new version, S_{i+1}, of the software, a merger combines the generated code G_i with the existing codebase S_i. For that purpose, the merger complements the generated code with those manual code pieces from S_i that have not (yet) been generated. This new code version, S_{i+1}, then replaces S_i in the next iteration cycle. Typically, the proportion of generated code in S_{i+1} will be greater than in S_i.

The merging process is far from trivial, because in

real world systems, manually crafted code hardly

ever is as well-structured and regular as gener-

ated code. As a simple example consider the get

and set methods in a fairly large codebase: They

might follow the usual naming conventions for

90% of all attributes, but might be named differ-

ently for the remaining 10% (e.g. readX() in-

stead of getX()). With a naive reverse engineer-

ing and merging process, the generation of get-

ters and setters would not be possible at all due

to the rather few exceptions, because the genera-

tor would not be able to distinguish between the

two categories. Our merger, in contrast, can deal

with irregularities like these, as will be explained

MODELSWARD2013-InternationalConferenceonModel-DrivenEngineeringandSoftwareDevelopment

in detail in section 4.

Executing the iterative process described above allows the reverse engineer to pull more and more features of the software system onto the model level. As more and more generated code replaces manual code, each new revision S_i improves the maintainability of the software system, and eases the generation of new code variants.
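One way to picture a single iteration is as the composition of a generator and a merger. The following Java sketch is purely illustrative: it models a code version as a map from member names to their origin ("manual" or "generated"), and the names IterationSketch, generate, merge and generatedShare are hypothetical and not part of the tool presented in this paper.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class IterationSketch {
    /** GEN_i (toy version): turns a model, here a list of attribute names, into generated members G_i. */
    static Map<String, String> generate(List<String> model) {
        Map<String, String> generated = new LinkedHashMap<>();
        for (String attribute : model) {
            generated.put("get" + attribute, "generated");
        }
        return generated;
    }

    /** Merger (toy version): S_{i+1} keeps all members of S_i; generated members replace manual ones. */
    static Map<String, String> merge(Map<String, String> source, Map<String, String> generated) {
        Map<String, String> next = new LinkedHashMap<>(source);
        next.putAll(generated);
        return next;
    }

    /** Fraction of members of a code version that are generated. */
    static double generatedShare(Map<String, String> code) {
        long count = code.values().stream().filter("generated"::equals).count();
        return code.isEmpty() ? 0.0 : (double) count / code.size();
    }
}
```

Calling merge repeatedly with a growing model mirrors the expectation that the share of generated code increases from S_i to S_{i+1} while manual members that are not yet covered are carried over unchanged.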

As with any kind of code refactoring, it is im-

portant to support the reverse engineering process to-

wards MDSD by continuous unit testing. The reason

for this is that generated method implementations can

vary from the original code, as long as the methods

provide the same functionality. See section 4 for fur-

ther information.

It should have become obvious that the generated code will almost by definition be incomplete. To be able to adequately deal with this incompleteness in the merging step, we assume that for a given method in S_i, the generator either (i) does not generate the method at all, or (ii) generates only the declaration of the method using its method header, or (iii) generates the complete method definition, i.e. both header and body. There is no generation of partial method bodies. This assumption is, however, not a serious restriction, because each method that could be generated partly can be refactored and decomposed in a preparatory step. For fields, either only the declaration or the declaration together with an initializer is generated. Finally, inner classes are generated recursively within their containing classes.
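The preparatory decomposition mentioned above can be illustrated with a small, hypothetical example. The classes LegacyOrder and RefactoredOrder and the discount rule are invented for this sketch; they only show how a method that mixes regular and irregular logic could be split so that each resulting method is either generable as a whole or keeps a manual body.

```java
/** Before: regular and irregular logic share one method body, so it could only be generated partly. */
final class LegacyOrder {
    private final int quantity;
    private final int unitPriceCents;

    LegacyOrder(int quantity, int unitPriceCents) {
        this.quantity = quantity;
        this.unitPriceCents = unitPriceCents;
    }

    int totalCents() {
        int total = quantity * unitPriceCents;   // regular part
        if (quantity >= 10) {
            total -= total / 10;                 // irregular, hand-crafted discount
        }
        return total;
    }
}

/** After the refactoring: no method needs a partially generated body any more. */
final class RefactoredOrder {
    private final int quantity;
    private final int unitPriceCents;

    RefactoredOrder(int quantity, int unitPriceCents) {
        this.quantity = quantity;
        this.unitPriceCents = unitPriceCents;
    }

    /** Candidate for complete generation, case (iii). */
    int totalCents() {
        return applyDiscount(baseTotalCents());
    }

    /** Candidate for complete generation, case (iii). */
    int baseTotalCents() {
        return quantity * unitPriceCents;
    }

    /** Hook method: the header may be generated, case (ii), while the body stays manual. */
    int applyDiscount(int total) {
        return quantity >= 10 ? total - total / 10 : total;
    }
}
```

Both classes compute the same totals, which is the kind of equivalence that unit tests should confirm before and after such a refactoring.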

3 SEPARATING GENERATED AND MANUAL CODE

When forward engineering a software system with

MDSD, the partly generated code needs to be com-

pleted with hand-crafted code. One approach is to

manually add code directly into the generated source

code. However, this technique is widely consid-

ered problematic, mainly for two reasons: (1) The

hand-crafted regions of the source ﬁle must not be

overwritten when the artifact is regenerated, and (2)

putting generated code under version control is generally avoided. Therefore, most authors (e.g., Völter, 2009) recommend the strict separation of hand-crafted and generated code into different source files.

In the following, we briefly discuss the two separation techniques that are commonly advocated, and evaluate their applicability for model-driven reverse engineering.

Footnote: The method header referred to in section 2 consists of modifiers, generic type parameters, result type, method identifier, parameters and the thrown exceptions (Gosling et al., 2005, page 210).

1. Inheritance is probably the most widely-used

technique to separate generated and manual code.

The idea is to generate a base class which can be

extended by a concrete subclass implementation

that contains the manual code. For reverse engineering, this would mean decomposing an existing legacy class A into a generated base class and a hand-crafted subclass. This decomposition, however, leads to a variety of technical problems, one of which has to do with access rights: Assume a private field in a legacy class A that after reverse engineering can be generated into the new base class. To give the subclass access to this field, its access right must be modified from private to protected. This modification, however, might have undesired side-effects somewhere downstream in the potentially complex pre-existing type hierarchy.

2. Composition and delegation is another approach, in which a composed object delegates its tasks to an associated delegate object. With model-driven engineering, the composed object contains the generated code, whereas the delegate object contains the hand-crafted code. In (Walter and Haase, 2008), this technique is used to transform existing legacy code into a model-driven structure. This approach, however, leads to even more serious problems with access rights than inheritance, owing to the fact that object identity gets lost with composition and delegation, because each object in the legacy system gets split across two separate objects. As a consequence, each formerly private field must be made at least package-private if the other object needs access to it, too.
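The access-right problem described for inheritance in item 1 can be made concrete with a short Java sketch. All class names (LegacyAccount, AccountBase, Account, Intruder) are hypothetical, and the sketch assumes that all classes live in the same package.

```java
/** Legacy class: the field is private and can only change through deposit(). */
final class LegacyAccount {
    private int balance;

    void deposit(int amount) { balance += amount; }

    int balance() { return balance; }
}

/** Generated base class after decomposing the legacy class by inheritance. */
class AccountBase {
    protected int balance;   // formerly private; widened so that the subclass compiles

    int balance() { return balance; }
}

/** Hand-crafted subclass that holds the manual code. */
final class Account extends AccountBase {
    void deposit(int amount) { balance += amount; }
}

/** Side effect: code in the same package (or any other subclass) can now bypass deposit(). */
final class Intruder {
    static void tamper(AccountBase account) {
        account.balance = -1;
    }
}
```

In the legacy version, Intruder.tamper would not compile; after the decomposition it does, which is one example of the undesired side effects that widening a private field can have.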

In summary, these separation techniques are not

adequate for reverse engineering, because decompos-

ing existing legacy classes into new classes, be it by

subclassing or by composition, is a very disruptive

technique that can have unexpected and undesired ef-

fects on the correctness of the overall system. In

addition, on a semantic level, both approaches are

meant for other purposes than separating generated

from manual code which makes their application for

code separation questionable: (1) Inheritance is meant

to model is-a relationships; however, there is no such

relationship between the generated and the manual

code portion of a class. (2) Composition is meant to

model part-of relationships, which is also not the ap-

propriate relationship between the two code portions.

In contrast, manual and generated code are two equal

parts that together form the functionality of a class.

ACodeMergertoSupportReverseEngineeringTowardsModel-drivenSoftwareDevelopment

As a result, we consider code separation by neither inheritance nor composition appropriate for reverse engineering. These techniques might rather be suitable for model-driven forward engineering, if at all. In fact, there seems to be a recent trend not to split classes for code separation even for forward engineering, for the reasons discussed above. For reverse engineering, in any case, we have opted for the co-existence of generated and manual code in single source files, and for mitigating the potential drawbacks through a set of supporting techniques and tools:

• From a software developer’s point of view, the

distinction between generated and hand-crafted

code must be explicit. We use the standard annotation javax.annotation.Generated to mark the generated portions of the code.

• In each iteration, our code merger makes sure the

manual code is not overwritten unintentionally.

• We have developed an Eclipse editor that can gray

out and fold in generated code automatically, in

order to draw the developer’s attention to the man-

ual portions of the source code.

• The same editor protects the generated code

against modiﬁcation by making it read-only.

Of course, a developer can always remove a

@Generated annotation and then modify the for-

merly protected code portion. We ﬁnd it, however,

acceptable to protect developers against uninten-

tional modiﬁcations of generated code while leav-

ing open a back door for intentional changes.
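How a tool might recognize marked code can be sketched as follows. To keep the example self-contained and inspectable via reflection, it declares its own stand-in annotation named Generated with the element names value and comments; this stand-in, the Customer class, and the GeneratedCheck helper are assumptions made for illustration. Placing the "declaration" marker in the comments element is likewise an assumption; the actual tool may encode it differently.

```java
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.reflect.Method;

/** Stand-in marker annotation (hypothetical), retained at runtime so reflection can see it. */
@Retention(RetentionPolicy.RUNTIME)
@interface Generated {
    String[] value();
    String comments() default "";
}

final class Customer {
    private String name = "unknown";

    @Generated("de.htwgkn.mdre.gen")   // completely generated: header and body
    String getName() { return name; }

    @Generated(value = "de.htwgkn.mdre.gen", comments = "declaration")   // header generated, body manual
    String displayName() { return name.toUpperCase(); }

    String greeting() { return "Hello, " + name; }   // hand-crafted, no marker
}

final class GeneratedCheck {
    /** Returns true if the named no-argument method of the given class carries the marker. */
    static boolean isGenerated(Class<?> type, String methodName) {
        try {
            Method method = type.getDeclaredMethod(methodName);
            return method.isAnnotationPresent(Generated.class);
        } catch (NoSuchMethodException e) {
            throw new IllegalArgumentException("No such method: " + methodName, e);
        }
    }
}
```

An editor or merger could use a check of this kind to decide which regions to fold, gray out, or protect against accidental edits.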

One issue that remains open with mixed source files is version control, because putting generated code under version control is generally to be avoided. We believe, however, that (1) the associated effects, such as new versions that are only due to the (changed) generated part of a source file, are often overrated, and that (2) developing a version control system that understands the standard @Generated annotation and acts accordingly would be an interesting and worthwhile task for model-driven development in general.

4 MERGING PROCESS

As we have discussed so far, in each step of our iterative approach the partly generated code G_i has to be combined with the current version of the source code S_i. In this section, we describe this merging process.

Methods in G_i can be generated either completely or incompletely. An incomplete method consists of a method header only and is generated without a body. A common example of incomplete methods are hook methods. The source code S_i, in contrast, contains complete members only. These members can be (1) hand-crafted, (2) completely generated or (3) partly generated and partly hand-crafted.

A method in S_{i+1}, i.e. in the result of the merging of S_i and G_i, is

• completely generated, if the generated method in G_i is complete (consists of both a method header and body).

• partly generated and partly handcrafted, if the generated method in G_i is incomplete (has only a method header, but does not have a method body).

A field in S_{i+1} is

• completely generated, if the field is initialized in G_i, or the field is neither initialized in S_i nor in G_i.

• partly generated and partly handcrafted, if the field is initialized in S_i, but not initialized in G_i.
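The two case distinctions above can be written down as small predicates. This is a minimal sketch; the enum MemberStatus and the class StatusRules are hypothetical names introduced for illustration only.

```java
/** Status of a member in S_{i+1} after merging (illustrative names). */
enum MemberStatus { COMPLETELY_GENERATED, PARTLY_GENERATED }

final class StatusRules {
    /** A method is completely generated exactly when the method in G_i has a body. */
    static MemberStatus methodStatus(boolean generatedMethodHasBody) {
        return generatedMethodHasBody
                ? MemberStatus.COMPLETELY_GENERATED
                : MemberStatus.PARTLY_GENERATED;
    }

    /**
     * A field is partly generated only if S_i initializes it and G_i does not;
     * in all other cases it counts as completely generated.
     */
    static MemberStatus fieldStatus(boolean initializedInSource, boolean initializedInGenerated) {
        boolean partly = initializedInSource && !initializedInGenerated;
        return partly
                ? MemberStatus.PARTLY_GENERATED
                : MemberStatus.COMPLETELY_GENERATED;
    }
}
```

Note that a field initialized in both S_i and G_i falls under "initialized in G_i" and is therefore treated as completely generated in this sketch.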

Completely generated elements are marked by the

following annotation:

@Generated("de.htwgkn.mdre.gen")

Incompletely generated elements, on the other

hand, are marked by the following annotation:

@Generated("de.htwgkn.mdre.gen",

value="declaration")

Bringing S_i and G_i together is in fact a merge operation, which leads to a new source code S_{i+1}. Ideally, S_{i+1} contains more generated members than S_i.

The merging process is done per class, so all classes from G_i will be merged with their counterparts in S_i. Classes, in turn, are merged per member. For each class member in G_i, a corresponding member in S_i is searched. A corresponding member has the same name and, if it is a method, the same signature, too. There are, however, situations where no corresponding member in S_i can be found. Typical reasons are:

• The Semantically Corresponding Member in S_i has a Different Name. Such a misnamed member can occur because legacy code is typically not as regular as generated code. As a simple example, consider a legacy system where most getters follow the usual naming convention, getX(). However, for a specific field, y, the getter has been named readY(). For this attribute, the generated getter, getY(), is misnamed (with respect to the legacy code).

To avoid misnamed members in G_i, renaming the corresponding members in S_i before reverse engineering may be a solution. This works, however, only for internal APIs; in these cases a modern IDE can automatically refactor the method as well as all its users. For a method that is part of a public API, renaming is generally not an option.

• There is no Corresponding Member in S_i. The member that has been generated into G_i does not exist in S_i. Such a superfluous member can occur due to inconsistencies in the models and generators, caused by mistakes or oversimplifications during the reverse engineering step. As a very simple example, consider a class where most private fields have getters. With an oversimplified model, meta-model or generator, getters for all attributes will be generated. Such a superfluous member can, however, become useful when it comes to the forward engineering of future variants of the reference system.

Footnote: The method private void generated(); is an example of an incompletely generated method. Obviously, the code for such a method is syntactically incorrect, because non-abstract methods must have a body.

The merger is not able to classify such a situation on its own, so the engineer has to decide how to proceed with the current member from G_i. If it is a misnamed member, the engineer indicates the corresponding member in S_i. If the member is superfluous, the engineer has the option to copy it to S_{i+1} anyway, or ignore it.

A special situation arises when a whole class that does not exist in S_i is generated. This happens, e.g., when an evolved version of a generator generates a new utility class to factor out common tasks. Again, the engineer decides whether to copy the completely generated class, or to ignore it.
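The search for a corresponding member, including the fallback to the engineer, might be sketched as follows. The types MemberKey and CorrespondenceFinder are hypothetical, the signature is approximated by the list of parameter type names, and the engineer's decision is modeled as a callback that returns the mapped legacy member or nothing for a superfluous member.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Optional;
import java.util.function.Function;

/** Identifies a member by name and, for methods, by its parameter types (illustrative). */
final class MemberKey {
    final String name;
    final List<String> parameterTypes;

    MemberKey(String name, String... parameterTypes) {
        this.name = name;
        this.parameterTypes = Arrays.asList(parameterTypes);
    }

    boolean correspondsTo(MemberKey other) {
        return name.equals(other.name) && parameterTypes.equals(other.parameterTypes);
    }
}

final class CorrespondenceFinder {
    /**
     * Looks for the member of S_i that corresponds to a generated member. If none is found,
     * the engineer is asked: a present result means "misnamed, mapped to this legacy member",
     * an empty result means "superfluous".
     */
    static Optional<MemberKey> resolve(MemberKey generated,
                                       List<MemberKey> legacyMembers,
                                       Function<MemberKey, Optional<MemberKey>> askEngineer) {
        for (MemberKey legacy : legacyMembers) {
            if (generated.correspondsTo(legacy)) {
                return Optional.of(legacy);
            }
        }
        return askEngineer.apply(generated);
    }
}
```

For the getter example, a generated getY() finds no match by name, so the callback would return the legacy readY(); whether a superfluous member is copied or ignored is a separate decision that this sketch leaves out.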

4.1 Rules

These rules explain how S_i and G_i are merged to obtain the next iteration, S_{i+1}.

1. Methods:

(a) Equal Signature, Completely Generated. G_i becomes S_{i+1}, and is marked as completely generated.

(b) Equal Signature, Partly Generated. S_{i+1} is a combination of G_i's method header and S_i's body. It is annotated as partly generated.

(c) Misnamed. On the one hand, the legacy method's name must not be changed automatically (methods might be part of an API on which other projects depend), and on the other hand, it is generally not possible for a merger to adjust the generated code to use the legacy method. As a result, there will be two methods for the same task in S_{i+1}, one using the legacy name, the other using the generated name. To avoid code redundancy, the former legacy method will be a completely generated delegation method, which calls the newly added generated method.

Whether the other method is partly or completely generated depends on whether the generated method has a body or not. If it has no method body, the former body of the legacy method will be used and the method will be partly generated.

(d) Copied. A superfluous method, which is copied from G_i to S_{i+1}, will be annotated as completely generated.

2. Fields:

(a) Equal Name, Completely Generated. G_i becomes S_{i+1}, and is marked as completely generated.

(b) Equal Name, Partly Generated. S_{i+1} is a combination of G_i's declaration and S_i's initialization. It is annotated as partly generated.

(c) Misnamed. Legacy fields are mapped to the generated fields. The legacy names will be kept, and the names of the generated fields will be replaced by the legacy names in every completely generated method in S_{i+1}.

(d) Copied. A superfluous field, which is copied from G_i to S_{i+1}, will be annotated as completely generated.

3. Types:

(a) Inner Types. Inner types are merged recursively.

(b) Copied Types. All members of a superfluous class, which is copied from G_i to S_{i+1}, will be annotated as completely generated.

Legacy elements which correspond to misnamed

generated elements will be marked by the following

annotation:

@Generated("de.htwgkn.mdre.gen",

value="mappedTo:qualified.element.name")
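The method rules of section 4.1 can be approximated by the following sketch. It is not the MDREclipse implementation: the classes MethodDecl and MethodMergeRules are hypothetical, method bodies are plain strings, the marker strings "complete", "declaration" and "mappedTo:..." merely stand for the three annotation variants, and the delegation body assumes a parameterless method that returns a value.

```java
import java.util.Arrays;
import java.util.List;

/** Simplified method representation used only for this sketch. */
final class MethodDecl {
    final String name;
    final String body;     // null means "declaration only" (occurs only in G_i)
    final String marker;   // "" for hand-crafted code, otherwise the kind of generation

    MethodDecl(String name, String body, String marker) {
        this.name = name;
        this.body = body;
        this.marker = marker;
    }
}

final class MethodMergeRules {
    /** Rules for methods with equal signature: completely or partly generated. */
    static MethodDecl mergeEqualSignature(MethodDecl legacy, MethodDecl generated) {
        if (generated.body != null) {
            return new MethodDecl(generated.name, generated.body, "complete");
        }
        // Generated header combined with the legacy body.
        return new MethodDecl(generated.name, legacy.body, "declaration");
    }

    /** Misnamed method: keep the legacy name as a delegation method, add the generated name. */
    static List<MethodDecl> mergeMisnamed(MethodDecl legacy, MethodDecl generated) {
        MethodDecl target = mergeEqualSignature(legacy, generated);
        MethodDecl delegate = new MethodDecl(
                legacy.name,
                "return " + generated.name + "();",
                "mappedTo:" + generated.name);
        return Arrays.asList(delegate, target);
    }

    /** Superfluous method copied from G_i; assumed here to come with a body. */
    static MethodDecl copySuperfluous(MethodDecl generated) {
        return new MethodDecl(generated.name, generated.body, "complete");
    }
}
```

For the legacy getter readY() and a generated declaration getY() without a body, mergeMisnamed yields a delegating readY() and a partly generated getY() that reuses the legacy body.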

4.2 Implementation

The reverse engineering tools are implemented as an Eclipse bundle, MDREclipse. The plugin contains the merger, an editor extension, and support for auto folding. The merger requires two source directories that contain the legacy code S_i and the generated code G_i. In addition, the target directory for the merged code S_{i+1} has to be specified. If the target is not empty, the engineer is asked whether it should be cleared first. At the beginning of the merging process, all files from the source S_i are copied to the target. Then, the generated files are merged into S_{i+1}. If the merger does not find a corresponding member for a generated member, a dialog asks the engineer how to proceed. The merging progress is shown in a text console. The code parsing is based on the Java Development Tools (JDT) provided by Eclipse. JDT (Eclipse Foundation, 2010) allows the creation of abstract syntax trees (ASTs) from Java sources. These trees are traversed using the visitor pattern (Gamma et al., 1995).

The editor extension highlights generated code by

changing its background color. This color can be con-

ﬁgured in the plugin’s preferences. Also, the exten-

sion protects the generated code from most manual

changes. Typing, overwriting and deleting parts of

generated code will be prevented. All of these fea-

tures can be disabled in the preferences.

The last component of MDREclipse is the auto

folding of generated methods. To use this feature,

Eclipse’s default folding has to be disabled and the

MDREclipse folding activated. Generated methods

will then automatically be folded in when a Java

source ﬁle is opened.

5 CONCLUSIONS AND FUTURE WORK

This paper presents an iterative approach to transform

reference systems into MDSD artifacts. We outlined

our basic ideas in Section 2 and argued subsequently

why separating manual and generated code into dif-

ferent ﬁles is problematic for our approach. In Sec-

tion 4 we presented a code merger. We explained var-

ious merging situations and discussed under what cir-

cumstances merging cannot be done completely auto-

matically.

We do not claim that our approach solves all

problems in the ﬁeld of reverse engineering towards

MDSD. Code separation and code merging during an

iterative process is a rather technical aspect; feature extraction, on the other hand, is a semantic challenge that is very hard or impractical to solve at a generic level.

Please note that even though in each iteration

step we perform only equivalence preserving modi-

ﬁcations, the approach can very well be interspersed

with forward engineering steps, as long as the reverse

and forward engineering steps are performed sepa-

rately one after the other. During a forward engineer-

ing iteration, the engineer can modify the model level

artifacts, i.e. DSLs, models, and generators, as well

as the manually crafted code. The modiﬁed software

can then, in a next step, again be reverse engineered

towards a greater MDSD proportion.

Finally, it should be noted that the success of the

reverse engineering process strongly depends on the

quality of the code base. If the reference system

is well structured and coded, feature extraction be-

comes easier and the model level artifacts become

better structured as well.

ACKNOWLEDGEMENTS

This research has been funded by the German BMBF

(Ministry for Education and Research) under the um-

brella of the IngenieurNachwuchs program. In addi-

tion, we would like to thank our industrial partners,

Seitenbau GmbH and Sybit GmbH, for their support.

REFERENCES

Eclipse Foundation (accessed on July 26th 2010). Java De-

velopment Tools. http://www.eclipse.org/jdt/.

Gamma, E., Helm, R., Johnson, R., and Vlissides, J. (1995).

Design patterns: Elements of reusable object-oriented

software. Addison-Wesley Longman Publishing Co.,

Inc., Boston, MA, USA.

Gosling, J., Joy, B., Steele, G., and Bracha, G. (2005).

The Java Language Speciﬁcation (Third Edition).

Addison-Wesley.

Kieburtz, R., McKinney, L., Bell, J., Hook, J., Kotov, A.,

Lewis, J., Oliva, D., Sheard, T., Smith, I., and Walton,

L. (1996). A software engineering experiment in soft-

ware component generation. Software Engineering,

International Conference on, 0:542.

Krueger, C. W. (1992). Software reuse. ACM Comput.

Surv., 24(2):131–183.

Ladd, D. A. and Ramming, J. C. (1994). Two application languages in software production. In VHLLS'94: Proceedings of the USENIX 1994 Very High Level Languages Symposium, pages 10–10, Berkeley, CA, USA. USENIX Association.

van Deursen, A. and Klint, P. (1998). Little languages: lit-

tle maintenance. Journal of Software Maintenance,

10(2):75–92.

Völter, M. (2009). MD* best practices. Journal of Object

Technology, 8(6):79–102.

Walter, R. and Haase, O. (2008). How to make legacy code

MDSD-ready. In Second Workshop on MDSD Today.

MODELSWARD2013-InternationalConferenceonModel-DrivenEngineeringandSoftwareDevelopment