4 EXPERIMENTAL SETUP
We collected data from a local GSM operator company. The data contain one project and its three versions. The project is implemented in Java and corresponds to a middleware application. We extracted 26 static code attributes, including Halstead metrics, McCabe's cyclomatic complexity and lines of code, from the project and its versions with our metric parser Prest (Turhan, Oral and Bener, 2007), which is written in Java. The class information of all project versions is listed in Table 1.
Table 1: Attribute and class information of the project.
In order to estimate the classes that should be refactored during each version upgrade, we assumed that a class was refactored if its cyclomatic complexity, essential complexity or total number of operands decreased relative to the beginning of the project. Since we did not know the effect of individual metrics on the refactoring decision, we assumed that the complexity metrics drove the refactoring decisions during the version upgrades. We normalized the data in the Trcll1 datasets, since it is a complex project at the application layer that was refactored and changed frequently during development. After collecting the refactoring data, we applied Weighted Naïve Bayes for automatic prediction of candidate classes.
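A minimal sketch of this labeling rule is given below. The column names, the per-class data layout and the min-max normalization are illustrative assumptions rather than the exact Prest output or our implementation.

```python
import pandas as pd

# Complexity attributes assumed to drive the refactoring decision
# (column names are illustrative, not the exact Prest attribute names).
METRICS = ["cyclomatic_complexity", "essential_complexity", "total_operands"]

def label_refactored(baseline: pd.DataFrame, version: pd.DataFrame) -> pd.Series:
    """Mark a class as refactored if any complexity metric decreased
    with respect to the first (baseline) version of the project."""
    merged = version.merge(baseline, on="class_name", suffixes=("", "_base"))
    refactored = pd.Series(False, index=merged.index)
    for m in METRICS:
        refactored |= merged[m] < merged[m + "_base"]
    return refactored

def normalize(X: pd.DataFrame) -> pd.DataFrame:
    """Min-max normalization of the attribute values (the exact
    normalization scheme is an assumption made for illustration)."""
    return (X - X.min()) / (X.max() - X.min())
```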
We designed the experiments to achieve a high degree of internal validity by carefully studying the effect of the independent variables on the dependent variable in a controlled manner (Mitchell and Jolley, 2001). In order to carry out statistically valid experiments, datasets should be prepared carefully. A common technique is to work with two data sets, namely training and test sets, instead of the entire data. Generally, these sets are constructed by randomly dividing the whole data into two parts. A problem arises from the nature of random selection: a single sampling is not guaranteed to be a good representation of the real data. To cope with this problem, k-fold cross validation is used: the data is divided into k equal portions and the training process is repeated k times, each time with k-1 folds used as training data and the remaining fold used as test data (Turhan and Bener, 2007). We chose k as 10 in our experiments and repeated the whole process 10 times with shuffled data. Moreover, since both the training and test data should be good representations of the real data, the ratio of refactored to not-refactored samples should be preserved. We therefore used stratified sampling: when dividing the data into 10 folds, we made sure that each fold preserves the refactored/not-refactored ratio.
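A minimal sketch of this sampling scheme follows, using scikit-learn's RepeatedStratifiedKFold on numpy arrays and a plain Gaussian Naïve Bayes as a stand-in for the Weighted Naïve Bayes learner; function names and defaults are illustrative, not our exact implementation.

```python
import numpy as np
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.naive_bayes import GaussianNB  # stand-in for Weighted Naive Bayes

def run_experiment(X, y, n_splits=10, n_repeats=10, seed=42):
    """10-fold stratified cross validation repeated 10 times on shuffled data.
    Stratification preserves the refactored/not-refactored ratio in every fold."""
    cv = RepeatedStratifiedKFold(n_splits=n_splits, n_repeats=n_repeats,
                                 random_state=seed)
    outcomes = []
    for train_idx, test_idx in cv.split(X, y):
        model = GaussianNB()  # placeholder learner
        model.fit(X[train_idx], y[train_idx])
        outcomes.append((y[test_idx], model.predict(X[test_idx])))
    return outcomes  # one (actual, predicted) pair per fold
```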
5 RESULTS
Table 2: Confusion Matrix.
We evaluated the accuracy of our predictor with the probability of detection (pd = A/(A+C)) and probability of false alarm (pf = B/(B+D)) measures (Menzies, Greenwald and Frank, 2007). Pd is the proportion of actually refactored classes that are detected as refactored, and pf is the proportion of not-refactored classes that are incorrectly detected as refactored. Higher pd values and lower pf values indicate a more accurate predictor. The confusion matrix used for calculating pd and pf is shown in Table 2.
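For illustration, pd and pf can be computed from a fold's predictions as follows; this is a sketch matching the definitions above, with variable names chosen for clarity.

```python
import numpy as np

def pd_pf(actual, predicted):
    """Compute pd = A/(A+C) and pf = B/(B+D): A = correctly detected
    refactored classes, C = missed refactored classes, B = false alarms,
    D = correctly ignored not-refactored classes."""
    actual = np.asarray(actual, dtype=bool)
    predicted = np.asarray(predicted, dtype=bool)
    a = np.sum(predicted & actual)    # true positives
    c = np.sum(~predicted & actual)   # false negatives
    b = np.sum(predicted & ~actual)   # false positives
    d = np.sum(~predicted & ~actual)  # true negatives
    return a / (a + c), b / (b + d)   # (pd, pf)
```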
The results of our experiments show that in the first, unstable version of the software our predictor detects the classes that need to be refactored with 63% accuracy (Table 3). In the second version, which can be considered the first stable version, the predictor's performance increases to 90%. We can conclude that the learning performance improves as we move to more stable versions and learn more about the complexity of the code. Our results also show that learning complexity-related information about the code, i.e. weight assignment, considerably improves the learning performance of the predictor, as evidenced by an average IG+WNB pd of 82% versus an average NB pd of 76%. We also observe that false alarm rates decrease with our proposed learner as we move to later versions (from pf: 16% to pf: 11%). Low pf rates spare software architects the manual analysis of classes that do not need to be refactored. In the three versions of a complex code base such as the Trcll1 project, we can predict 82% of the refactored classes with 13% manual inspection effort on average. Our concern for external validity is the use of a limited number of datasets. We used one complex project and its three versions. To