example, this technique has been proposed to
identify the program elements associated with the
visible features of the programs (Rajlich, Wilde
2002) (Eisenbarth, Koschke 2003).
In fact, these techniques try to partition the set of
source code statements and program elements into
subsets that will hopefully help to rebuild the
architecture of the system. The key problem is to
choose the relevant set of criteria (or similarity
metrics (Wiggert 1997)) with which the “natural”
boundaries of components can be found. In the
reverse-engineering literature, the similarity metrics
range from the interconnection strength of Rigi
(Müller, Orgun, Tilley, Uhl 1993) to the sophisticated
information-theoretic measures of Andritsos
(Andritsos, Tzerpos 2003) (Andritsos, Tzerpos 2005),
information retrieval techniques such as Latent
Semantic Indexing (Marcus 2004) (Kuhn, Ducasse,
Girba 2005) and the kinds of variables accessed, as
used in formal concept analysis (Siff, Reps 1999)
(Tonella 2001). Based on such a similarity metric, a
clustering algorithm then decides which elements
should be part of the same cluster (Mitchell 2003).
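To make the idea concrete, the following is a minimal sketch of similarity-based clustering. It is not taken from any of the cited tools: the Jaccard coefficient, the single-linkage merging rule, the threshold value and all routine and variable names are assumptions chosen for illustration only.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity of two feature sets (1.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def cluster(features, threshold):
    """Single-linkage clustering: elements whose similarity reaches the
    threshold end up in the same cluster (union-find over all pairs)."""
    parent = {name: name for name in features}

    def find(x):
        # Follow parent links to the cluster representative.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in combinations(features, 2):
        if jaccard(features[a], features[b]) >= threshold:
            parent[find(a)] = find(b)

    groups = {}
    for name in features:
        groups.setdefault(find(name), set()).add(name)
    return sorted(sorted(group) for group in groups.values())


# Hypothetical routines described by the variables they access.
features = {
    "open_account": {"customer", "account"},
    "close_account": {"customer", "account", "balance"},
    "print_report": {"printer", "layout"},
    "format_page": {"layout", "printer", "font"},
}
clusters = cluster(features, threshold=0.5)
print(clusters)
```

With these assumed inputs, the two account routines fall into one cluster and the two printing routines into another. Changing the similarity metric or the threshold changes the recovered boundaries, which is exactly why the choice of criteria is the key problem.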
On the other hand, Gold proposed a concept
assignment technique based on a knowledge base of
programming concepts and syntactic “indicators”
(Gold 2000). The indicators are then searched for in the
source code using neural network techniques and,
when found, the associated concept is linked to the
corresponding code. However, he did not use his
technique with a knowledge base of domain
(business) concepts. In contrast with these
techniques, our approach is “business-function-driven”,
i.e. we cluster the software elements according to
the business tasks and functions they support. The
domain modelling discipline of our
reverse-engineering method presents some similarity
with the work of Gall et al. (Gall, Klosch,
Mittermeier 1996) (Gall, Weidl 1999). These
authors tried to build an object-oriented
representation of a procedural legacy system by
building two object models. First, with the help of a
domain expert, they build an abstract object model
from the specification of the legacy system. Second,
they reconstruct an object model of the source code,
starting from the recovered entity-relationship model
to which they append dynamic services. Finally,
they try to match both object models to produce the
final object-oriented model of the procedural system.
The authors report that one of the main difficulties is
the assignment of the dynamic features to the
recovered objects (what they call the “ambiguous
service candidates”). In contrast, our approach does
not try to transform the legacy system into some
object-oriented form. The robustness diagram we
build is simply a way to document the software
roles. Our work bears some resemblance to the work
of Eisenbarth and Koschke (Eisenbarth, Koschke
2003), who used Formal Concept Analysis. However,
the main differences are:
1. The scenarios we use have a strong business-related
meaning rather than being built only to
exhibit some features. They represent full
use-cases.
2. The software clusters we build are interpretable
in the business model. We group the software
elements according to their roles in the
implementation of business functions.
3. We analyse the full execution trace of a real
use-case to recover the architecture of the
system.
4. The elements we cluster are modules or classes
identified in the visible high-level structure of
the code.
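As a rough illustration of points 3 and 4, the sketch below groups modules by the set of use-cases whose execution trace mentions them. It is a simplification written for this discussion and not the clustering procedure of the method itself: the trace format (one module name per recorded call) and every use-case and module name are hypothetical.

```python
from collections import defaultdict


def modules_by_use_cases(traces):
    """Group modules by the set of use-cases whose trace mentions them."""
    seen_in = defaultdict(set)
    for use_case, events in traces.items():
        for module in events:
            seen_in[module].add(use_case)

    groups = defaultdict(set)
    for module, use_cases in seen_in.items():
        groups[frozenset(use_cases)].add(module)
    return dict(groups)


# Hypothetical traces: the sequence of modules entered during each use-case.
traces = {
    "register_customer": ["ui_form", "customer_ctrl", "customer_db", "ui_form"],
    "place_order": ["ui_form", "order_ctrl", "customer_db", "order_db"],
}
groups = modules_by_use_cases(traces)
for use_cases, modules in groups.items():
    print(sorted(use_cases), "->", sorted(modules))
```

Modules that appear in a single use-case are candidates for roles specific to that business function, while modules shared by several use-cases are more likely to play supporting roles.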
Finally, it is worth noting that the use-cases play, in
our work, the same role as the test cases in the
execution slicing approach of Wong et al. (Wong,
Gokhale, Horgan, Trivedi 1999). However, in our
work, the “test cases” are not arbitrary but represent
actual use-cases of the system.
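The set arithmetic behind execution slicing can be sketched as follows, under the assumption that each test case (here, each use-case) is reduced to the set of modules it executes. The function name, the slice contents and the module names are hypothetical and serve only to show the principle.

```python
def feature_specific(slices, invoking, excluding):
    """Modules run by at least one invoking test and by no excluding test."""
    with_feature = set().union(*(slices[t] for t in invoking))
    without_feature = set().union(*(slices[t] for t in excluding))
    return with_feature - without_feature


# Hypothetical execution slices: modules executed by each use-case.
slices = {
    "create_order": {"ui", "order_ctrl", "order_db", "logger"},
    "cancel_order": {"ui", "order_ctrl", "order_db", "refund", "logger"},
    "list_orders": {"ui", "order_db", "logger"},
}
unique = feature_specific(
    slices,
    invoking=["cancel_order"],
    excluding=["create_order", "list_orders"],
)
print(unique)
```

In this assumed example only the `refund` module is specific to order cancellation; every other module is also executed by use-cases that do not involve that function.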
6 CONCLUSION
The reverse-engineering process we present in this
article rests on the Unified Process from which we
borrowed some activities and artefacts. The
techniques are based on the actual behaviour of the
code in real business situations. Consequently, the
architecture we recover is independent of the
number of maintenance changes made to the code.
Moreover, it can cope with situations such as dynamic
calls, which are tricky to analyse using static
techniques. We reverse-engineered all the use-cases of the
system and found that the modules involved in all of
them were almost the same. Finally, this experiment
suggests that the technique is scalable and
able to deal with industrial-size software.
As a next step in this research, we are developing a
semi-automatic robustness diagram mapper that
takes a robustness diagram and a trace file as input
and produces a possible match as output. This
system uses a heuristic-based search engine coupled
to a truth maintenance system.
ROLE-BASED CLUSTERING OF SOFTWARE MODULES - An Industrial Size Experiment