technique such as Latent Semantic Indexing (Marcus,
2004) or the kind of variables accessed in formal
concept analysis (Siff, Reps, 1999; Tonella, 2001).
Then, based on such a similarity metric, an algorithm
decides which elements should be part of the same
cluster (Mitchell, 2003). In their work, Xiao and
Tzerpos compared several clustering algorithms based
on dynamic dependencies (Xiao, Tzerpos, 2005). In
particular, they focused on clustering based on the global frequency of calls between classes. However, this approach does not discriminate between situations where the calls happen at different locations in the trace.
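To make this limitation concrete, consider a small Python sketch (the trace and all names are hypothetical): two classes can have identical global call frequencies while their calls occur in different regions of the trace, so a purely global count cannot tell them apart.

```python
from collections import Counter

# Hypothetical trace: the i-th entry is the class receiving the i-th call.
trace = ["A", "A", "A", "A", "B", "C", "B", "C", "B", "C", "B", "C"]

# Global frequencies: classes A and B look identical (4 calls each).
global_freq = Counter(trace)
print(global_freq["A"], global_freq["B"])  # 4 4

# Splitting the trace into 3 equal segments reveals the difference:
# A's calls are concentrated at the start, B's are spread out.
n_segments = 3
seg_len = len(trace) // n_segments
per_segment = {
    cls: [trace[i * seg_len:(i + 1) * seg_len].count(cls)
          for i in range(n_segments)]
    for cls in global_freq
}
print(per_segment["A"])  # [4, 0, 0]
print(per_segment["B"])  # [0, 2, 2]
```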
This contrasts with our approach, which takes the location of the calls in the trace into account. Very
few authors have worked on sampling or
segmentation techniques for trace analysis. One pioneering work is that of Chan et al. (2003), who visualize long sequences of low-level Java execution traces in the AVID system (including memory events and call-stack events). Their approach, however, is quite different from ours: it selectively picks information from the source (the call stack, for example) to limit
the quantity of information to process. The problem of processing very large execution traces is now beginning to be addressed in the literature. For example,
Zaidman and Demeyer proposed to manage the
volume of the trace by searching for common global
frequency patterns (Zaidman, Demeyer, 2004). In
fact, they analyzed consecutive samples of the trace to
identify recurring patterns of events having the same
global frequencies. In other words, they search locally for events with similar global frequency. This is quite different from our approach, which analyzes class distribution throughout the trace. Another technique is
to restrict the set of classes included in the trace, as in the work of Meyer and Wendehals (2005). In fact,
their trace generator takes as input a list of classes,
interfaces and methods that are to be monitored
during the execution of the program under analysis.
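Such selective monitoring amounts to filtering raw events against an allow-list. The following Python sketch is our own illustration of the idea (the event names and the `monitored_classes` set are hypothetical, not taken from their tool):

```python
# Hypothetical raw events: (class, method) pairs emitted during execution.
events = [
    ("OrderService", "place"),
    ("Logger", "debug"),
    ("OrderService", "validate"),
    ("Cache", "get"),
]

# Only events whose class is in the allow-list end up in the trace.
monitored_classes = {"OrderService"}
filtered_trace = [(c, m) for (c, m) in events if c in monitored_classes]

print(filtered_trace)  # [('OrderService', 'place'), ('OrderService', 'validate')]
```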
Similarly, the tool developed by Vasconcelos et al. (2005) allows the selection of the packages and classes
to be monitored for trace collection. In this work, the
trace is sliced by use-case scenario and message depth level, and it is possible to study the trace per
slice and depth level. Another technique, developed by Hamou-Lhadj (2005), uses text summarization algorithms that take an execution trace as input and return a summary of its main contents as output.
Sartipi and Safyallah (2006) use a pattern search and discovery tool to separate the patterns in the trace that correspond to common features from those that correspond to specific features. Although the
literature is abundant in clustering and architecture
recovery techniques exploiting execution traces, most
of the approaches are analytical. The few papers dealing with statistical approaches are still very rudimentary.
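To illustrate the statistical alternative, the following Python sketch shows a class-distribution analysis in the spirit of our approach: the trace is cut into segments, each class yields a per-segment time series that is smoothed with a moving average, and a normalized measure compares two series. The helper names, the segment and window sizes, and the choice of Pearson correlation as the coupling measure are illustrative assumptions, not our exact formulas.

```python
from math import sqrt

def per_segment_counts(trace, cls, n_segments):
    """Occurrences of cls in each of n_segments equal slices of the trace."""
    seg_len = len(trace) // n_segments
    return [trace[i * seg_len:(i + 1) * seg_len].count(cls)
            for i in range(n_segments)]

def moving_average(series, window):
    """Smooth a per-segment time series with a simple moving average."""
    return [sum(series[i:i + window]) / window
            for i in range(len(series) - window + 1)]

def coupling(xs, ys):
    """Pearson correlation of two smoothed series -- one possible
    normalized coupling measure (an illustrative choice)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sqrt(sum((x - mx) ** 2 for x in xs))
    vy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (vx * vy) if vx and vy else 0.0

# Classes A and B always appear in the same segments, C elsewhere.
trace = ["A", "B", "A", "B", "C", "C", "A", "B", "A", "B", "C", "C"]
a = moving_average(per_segment_counts(trace, "A", 6), 2)
b = moving_average(per_segment_counts(trace, "B", 6), 2)
c = moving_average(per_segment_counts(trace, "C", 6), 2)
print(coupling(a, b))  # close to 1: A and B rise and fall together
print(coupling(a, c))  # negative: A and C appear in different segments
```

Because the analysis touches only per-segment counts, its cost grows with the number of segments rather than with the number of individual events, which is the source of the scalability discussed below.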
6 CONCLUSIONS
The technique we presented in this paper is aimed at
computing the dynamic coupling between classes or
modules in legacy systems. The metric is based on the segmentation of the execution trace, which lets us compute a time series of data for each class occurring in the trace. The time series are then
filtered using the moving average technique,
borrowed from technical analysis in finance. The
normalized result can then be used to compute the
coupling between any two classes. This represents
the key contribution of our work: none of the published papers has tried segmenting the trace to get another “perspective” on the trace-analysis problem. Our coupling metric is used to identify
clusters of strongly coupled classes that represent the
functional components of the software. This metric
has proven to be largely insensitive to the number of
segments. In fact, in the experiments we conducted, we observed that changing the number of segments in the trace did not change the order of the pairs of classes in the list ranked by increasing coupling value; only the coupling values changed slightly, which is understandable since the number of occurrences of each class in a segment may differ when the segment size is changed. Our technique
has a major advantage over analytical techniques (i.e., those that analyze the individual events in the trace): it is very scalable. In fact, we were able to analyze traces of up to 7 million events without trouble. So far, no technique based on an analytical approach to component identification has been shown to cope with such an amount of data. Again, our method succeeds because it processes the trace using statistical rather than analytical techniques. Clustering classes based
on their involvement in use-case implementation is
only the first step towards recovering the architecture of a legacy system. The next step is to recover the connections between the components, again using the coupling between the contained classes. The coupling metric can then lead not only to the identification of functional components but also to the identification of the connections between them. Therefore, our trace-analysis technique lets us reconstruct the complete functional architecture of the software. However, the detailed explanation of the architecture reconstruction
ICSOFT 2009 - 4th International Conference on Software and Data Technologies