PREDICTING DEFECTS IN A LARGE TELECOMMUNICATION
SYSTEM
Gözde Koçak, Burak Turhan and Ayşe Bener
Department of Computer Engineering, Boğaziçi University, 34342 Bebek, Istanbul, Turkey
Keywords: Software testing, Defect Prediction, Call Graph, Empirical Analysis.
Abstract: In a large software system, knowing which files are most likely to be fault-prone is valuable information for
project managers. However, our experience shows that it is difficult to collect and analyze fine-grained test
defects in a large and complex software system. On the other hand, previous research has shown that
companies can safely use cross-company data with nearest neighbor sampling to predict their defects when
they are unable to collect local data. In this study we analyzed 25 projects of a large telecommunication
system. To predict the defect proneness of modules we learned from NASA MDP data. We used static call
graph based ranking (CGBR) as well as nearest neighbor sampling for constructing method-level defect
predictors. Our results suggest that, for the analyzed projects, at least 70% of the defects can be detected by
inspecting only i) 6% of the code using a Naïve Bayes model, or ii) 3% of the code using the CGBR framework.
1 INTRODUCTION
Software testing is one of the most critical and costly
phases in software development. Project managers
need to know “when to stop testing?” and “which
parts of the code to test?”. The answers to these
questions would directly affect defect rates and
product quality as well as resource allocation and cost.
As the size and complexity of software increases,
manual inspection of software becomes a harder
task. In this context, defect predictors have been
effective secondary tools to help test teams locate
potential defects accurately (Menzies et al., 2007a).
In this paper, we share our experience for
building defect predictors in a large
telecommunication system and present our initial
results. We have been working with the largest GSM
operator (~70% market share) in Turkey, Turkcell,
to improve code quality and to predict defects before
the testing phase. Turkcell is a global company
whose stock is traded on the NYSE and which operates in
Turkey, Azerbaijan, Kazakhstan, Georgia, Northern
Cyprus and Ukraine with a customer base of 53.4
million. The underlying system is a standard 3-tier
architecture with presentation, application and data
layers. Our analysis focuses on the presentation and
application layers. However, the content in these
layers cannot be separated as distinct projects. We
were able to identify 25 critical components, which
we will refer to as projects throughout this paper.
We used a defect prediction model that is based
on static code attributes. Some researchers have
argued that the information content of the static code
attributes is very limited (Fenton, 1999). However,
static code attributes are easy to collect and interpret, and
many recent studies have successfully used them to
build defect predictors (Menzies et al., 2007a,
2007b; Turhan and Bener, 2007, 2008). Furthermore,
the information content of these attributes can be
increased, for example by using call graphs (Kocak et al., 2008).
Kocak et al. show that integrating call graph
information into defect predictors decreases their false
positive rates while preserving their detection rates.
The collection of static code metrics and call
graphs can be easily carried out using automated
tools (Menzies et al., 2007; Turhan and Bener, 2008).
However, matching these measurements to software
components is the most critical factor for building
defect predictors. Unfortunately, in our case, it was
not possible to match past defects with the software
components at the desired granularity, the module level,
by which we mean the smallest unit of functionality
(i.e. Java methods, C functions). Previous research on
such large systems uses either component- or file-level
code churn metrics to predict defects (Nagappan and
Ball, 2006; Zimmermann and Nagappan, 2006;
Ostrand and Weyuker, 2002; Ostrand et al. 2004;
Ostrand et al. 2005; Bell et al. 2006; Ostrand et al.
2007). The reason is that the file level is the smallest
granularity that could be achieved. However,
defect predictors become more precise as the
measurements are gathered from smaller units
(Ostrand et al. 2007).
Therefore, we decided to use module-level cross-company
data to predict defects for Turkcell projects
(Menzies et al., 2007b). Specifically, we have used
module-level defect information from NASA MDP
projects to train defect predictors and then obtained
predictions for Turkcell projects. Previous research
has shown that cross-company data gives stable
results and that nearest neighbor sampling
further improves prediction performance
when cross-company data is used
(Menzies et al., 2007; Turhan and Bener, 2008). Our
experimental results with cross-company data on
Turkcell projects estimate that we can detect 70% of
the defects with a 6% LOC inspection effort.
In order to decrease false alarm rates, we
included the CGBR framework in our analysis based
on our previous research (Kocak et al., 2008). Using
the CGBR framework improved our estimates further:
the LOC inspection effort decreased from 6% to 3%.
The rest of the paper is organized as follows: in
section 2 we briefly review the related literature; in
section 3 we explain the project data; section 4
explains our rule-based analysis; the learning-based
model analysis is discussed in section 5; the last
section gives conclusions and future directions.
2 RELATED WORK
Ostrand and Weyuker have been performing similar
research for AT&T and they also report that it is
hard to conduct an empirical study in large systems
due to the difficulty of finding the relevant personnel
and the high cost of collecting and analyzing data
(Ostrand and Weyuker, 2002).
Fenton and Ohlsson presented results of an
empirical study on two versions of a large-scale
industrial software system, which showed that the
distribution of faults and failures in a software
system can be modeled by the Pareto principle (Fenton
and Ohlsson, 2000). They claimed that neither size
nor complexity explains the number of faults in a
software system. Other researchers found interesting
results showing that small modules are more fault-prone
than larger ones (Koru and Liu, 2005; Koru
and Liu, Dec. 2005; Malaiya and Denton, 2000;
Zhang 2008). Our results will also show evidence in
favor of this fact.
As mentioned, Ostrand, Weyuker and Bell
predicted fault-prone files of the large software
system at AT&T by using a negative binomial
regression model (Ostrand and Weyuker, 2002;
Ostrand et al. 2004; Ostrand et al. 2005; Bell et al.
2006; Ostrand et al. 2007). They report that their
model can detect 20% of the files that contain 80%
of all faults. Similarly, Nagappan, Ball and
Zimmermann analyzed several Microsoft software
components using static code and code churn
metrics to predict post-release defects. They
observed that different systems could be best
characterized by different sets of metrics (Nagappan
and Ball, 2006; Zimmermann and Nagappan, 2006).
Our work differs to a large extent from previous
work. Ostrand, Weyuker and Bell carried out the
work most similar to this research; they used
file-level measurements as the basic component.
However, we prefer using modules, since modules
provide finer granularity. They collected data
from various releases of their projects and predict post-release
defects, whereas we have data from a single
release of 25 projects and try to predict pre-release
defects.
3 DATA
In this research we analyzed 25 ‘Trcll’ (Turkcell) projects. All
projects are implemented in Java and we have
gathered 29 static code metrics from each. In total,
there are approximately 48,000 modules spanning
763,000 lines of code.
Figure 1: NASA datasets used in this study.
We used cross-company data from the NASA MDP
projects, which are available online in the PROMISE repository
(Boetticher et al., 2007; NASA). Figure 1 shows the
characteristics of NASA projects. Each NASA
dataset has 22 static code attributes. In our analysis,
we have used only the common attributes (there are
17 of them) that are available in both data sources.
4 DATA ANALYSIS
4.1 Average-case Analysis
Figure 2 shows the average values of 17 static code
metrics collected from the 25 telecom datasets. It
also shows the recommended intervals based on
statistics from NASA MDP projects, when
applicable. Cells marked with gray color correspond
to metrics that are out of the recommended intervals.
There are two clear observations in Figure 2:
developers do not write comments throughout the
source code, and the low numbers of operands and
operators indicate small, modular methods.
Figure 2: Average-case analysis of the Turkcell datasets.
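As a rough sketch of this average-case check (in Python), the comparison can be expressed as follows; the metric names and intervals below are illustrative placeholders, not the actual NASA MDP thresholds behind Figure 2:

import pandas as pd

# Illustrative recommended intervals; the real thresholds come from
# NASA MDP statistics and are not reproduced here.
RECOMMENDED = {
    "loc": (10, 100),
    "cyclomatic_complexity": (1, 10),
    "comment_ratio": (0.2, 0.4),
    "halstead_volume": (20, 1000),
}

def out_of_range_averages(dataset: pd.DataFrame) -> dict:
    """Report, per metric, whether the dataset average falls outside its
    recommended interval (the gray cells of Figure 2)."""
    return {m: not (lo <= dataset[m].mean() <= hi)
            for m, (lo, hi) in RECOMMENDED.items()}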
4.2 Rule-based Analysis
Based on the recommended intervals in Figure 2, we
have defined a simple rule for each metric. A rule
fires if a module’s metric is not in the specified
interval, indicating that the module should be inspected
manually. Figure 3 shows the 17 basic rules and
their corresponding metrics, along with 2 derived rules.
The first derived rule, Rule 18, defines a disjunction
over the 17 basic rules; that is, Rule 18 fires if any
basic rule fires. Note that the gray-colored rules in
Figure 3 fire so frequently that they cause Rule 18 to fire
all the time. The reason is that the recommended intervals
for the comment and Halstead metrics do
not fit Turkcell’s code characteristics.
In order to overcome this problem we have
defined Rule 19, which fires if any basic rule other than
the Halstead-related ones fires. This reduces the firing frequency of
the disjunction rule. However, Rule 19 states that
6,484 modules (14%) corresponding to 341,655 LOC
(45%) should be inspected in order to detect
potential defects.
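A minimal sketch of these rules is given below (Python); the metric names and intervals are again illustrative placeholders standing in for the 17 recommended ranges of Figure 2, and the reading of Rule 19 as “Rule 18 without the Halstead-related rules” is our interpretation:

# Hypothetical subset of the 17 metric intervals of Figure 2.
INTERVALS = {
    "loc": (10, 100),
    "cyclomatic_complexity": (1, 10),
    "comment_ratio": (0.2, 0.4),
    "halstead_volume": (20, 1000),
    "halstead_difficulty": (5, 50),
}
HALSTEAD = {"halstead_volume", "halstead_difficulty"}

def basic_rules(module: dict) -> dict:
    """Rule i fires when metric i lies outside its recommended interval."""
    return {m: not (lo <= module[m] <= hi) for m, (lo, hi) in INTERVALS.items()}

def rule_18(module: dict) -> bool:
    """Disjunction of all basic rules: flag the module if any rule fires."""
    return any(basic_rules(module).values())

def rule_19(module: dict) -> bool:
    """Same disjunction, but ignoring the Halstead-related rules."""
    return any(fired for m, fired in basic_rules(module).items() if m not in HALSTEAD)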
Inspecting 45% of the total LOC is impractical.
On the other hand, the learning-based model will be
shown to be far more effective. We have designed
two types of analysis using the learning-based
model. Analysis #1 uses a cross-company
predictor with k-Nearest Neighbor sampling to
predict fault-prone modules. Analysis #2
incorporates the CGBR framework into the static code
attributes and then applies the model of Analysis #1.
Figure 3: Rule-based analysis.
5 ANALYSIS
5.1 Analysis #1
In this analysis we used the Naïve Bayes data miner,
which achieves significantly better results than many
other mining algorithms for defect prediction
(Menzies et al., 2007a). We selected a random 90%
subset of the cross-company NASA data to train the
model. From this subset, we selected the projects
that are most similar to Trcll in terms of
Euclidean distance in the 17-dimensional metric
space. The nearest neighbors in the random subset
are used to train a predictor, which then makes
predictions on the Trcll data. We repeated this
procedure 20 times and raised a flag for modules
that are estimated as defective in at least 10 trials.
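The procedure can be sketched as follows (Python with scikit-learn); the number of neighbors k and the module-level nearest-neighbor sampling are our assumptions for illustration, since the text above only states that similar data are chosen by Euclidean distance:

import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import NearestNeighbors

def predict_trcll(nasa_X, nasa_y, trcll_X, trials=20, k=10, vote_threshold=10, seed=0):
    """Cross-company prediction with nearest-neighbor sampling and voting."""
    rng = np.random.default_rng(seed)
    votes = np.zeros(len(trcll_X), dtype=int)
    for _ in range(trials):
        # Random 90% subset of the cross-company (NASA MDP) training data.
        idx = rng.choice(len(nasa_X), size=int(0.9 * len(nasa_X)), replace=False)
        X_sub, y_sub = nasa_X[idx], nasa_y[idx]
        # Keep only the training rows nearest to the Trcll data in the
        # 17-dimensional metric space (Euclidean distance).
        nn = NearestNeighbors(n_neighbors=k).fit(X_sub)
        nearest = np.unique(nn.kneighbors(trcll_X, return_distance=False))
        # Train Naive Bayes on the sampled neighbors and predict on Trcll.
        votes += GaussianNB().fit(X_sub[nearest], y_sub[nearest]).predict(trcll_X).astype(int)
    # Flag modules predicted defective in at least 10 of the 20 trials.
    return votes >= vote_threshold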
Figure 4 shows the results from the first analysis.
The estimated defect rate is 15%, which is consistent
with the rule-based model’s estimation. However,
there is a major difference between the two models
in terms of their practical implications. For the rule-based
model, the estimated defective LOC corresponds
to 45% of the whole code while the module-level defect
rate is 14%; for the learning-based
model, the estimated defective LOC corresponds to only
6% of the code while the module-level defect rate is
still estimated as 15%.
This significant difference occurs because the
rule-based model makes decisions based on individual
metrics and has a bias towards larger and more complex
modules. On the other hand, the learning-based
model combines the ‘signals’ from all metrics and
estimates that defects are located in smaller
modules.
Figure 4: Analysis #1 results.
5.2 Analysis #2
We argue that module interactions, rather than the
properties of modules assessed individually, play an important
role in determining the complexity of the overall
system. Therefore, in previous research (Kocak et al., 2008),
a model was proposed to investigate module
interactions with static call graphs. Kocak et al.
proposed the call graph based ranking (CGBR)
framework, which is applicable to any defect prediction
model based on static code metrics.
To implement the CGBR framework we built the
call graph as an N×N matrix, where N is
the number of modules. In this matrix, each row records
whether a module calls the other modules, and each
column records how many times a module is
called by other modules. Inspired by web page
ranking methods, we treated each caller-to-callee
relation in the call graph as a hyperlink from one web
page to another. We then assigned equal initial ranks
(i.e. 1) to all modules and iteratively calculated
module ranks using the PageRank algorithm.
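A compact sketch of this computation is shown below (Python); the damping factor and iteration count are conventional PageRank defaults rather than values taken from our earlier work:

import numpy as np

def cgbr_ranks(call_matrix, damping=0.85, iterations=50):
    """PageRank-style power iteration over the static call graph.
    call_matrix[i, j] > 0 means module i calls module j."""
    A = np.asarray(call_matrix, dtype=float)
    n = A.shape[0]
    out_calls = A.sum(axis=1)
    # Row-stochastic transition matrix; modules that call nothing spread
    # their rank uniformly (the usual dangling-node fix).
    T = np.where(out_calls[:, None] > 0,
                 A / np.maximum(out_calls[:, None], 1e-12),
                 1.0 / n)
    ranks = np.ones(n)  # equal initial rank of 1 for every module, as in the text
    for _ in range(iterations):
        ranks = (1 - damping) + damping * (T.T @ ranks)
    return ranks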
In this study we analyzed the static call graph
matrices for only 22 projects, since the other 3
projects were so large that their call graph analysis
was not completed at the time of writing this paper,
due to high memory requirements.
Figure 5: Analysis #2 results.
In Analysis #2, we calculated the CGBR values,
quantized them into 10 bins and assigned each bin a
weight between 0.1 and 1 according to its
complexity level. Then we adjusted the static
code attributes by multiplying each row in the data
table with the corresponding weight, before training
our model as in Analysis #1.
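A minimal sketch of this adjustment (Python); the equal-frequency quantization and the linear bin-to-weight mapping are our assumptions of what “according to its complexity level” amounts to:

import numpy as np
import pandas as pd

def adjust_with_cgbr(metrics: pd.DataFrame, cgbr: np.ndarray) -> pd.DataFrame:
    """Quantize CGBR values into 10 bins, map the bins to weights 0.1 .. 1.0,
    and scale each module's row of static code attributes by its weight."""
    bins = pd.qcut(cgbr, q=10, labels=False, duplicates="drop")  # bins 0..9
    weights = 0.1 * (np.asarray(bins) + 1)                       # 0.1 .. 1.0
    return metrics.mul(weights, axis=0)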
Figure 5 shows the results of Analysis #2. In
order to catch 70% of the defects, the second model
proposes to inspect only 3% of the code.
6 CONCLUSIONS
In this study we investigate how to predict fault-
prone modules in a large software system. We have
performed an average case analysis for the 25
projects. This analysis shows that the software
modules were written using a relatively low number of
operands and operators to increase modularity and to
decrease maintenance effort. However, we have also
observed that the code base was poorly commented,
which makes maintenance a difficult task.
Our initial data analysis revealed that a simple
rule-based model based on recommended standards
on static code attributes estimates a defect rate of
15% and requires 45% of the code to be inspected.
This is an impractical outcome considering the scope
of the system. Thus, we have constructed learning-based
defect predictors and performed further
analysis. We have used cross-company NASA data
to learn defect predictors, due to the lack of local
module-level defect data.
The first analysis confirms that the average
defect rate of all projects is 15%. While the simple
rule-based model requires inspection of 45% of the
code, the learning-based model suggests that we
need to inspect only 6% of the code. This stems from
the fact that the rule-based model has a bias towards
larger and more complex modules, whereas the learning-based
model predicts that smaller modules contain
most of the defects.
Our second analysis employed data
adjusted with the CGBR framework, which improved the
estimates further and suggested that 70% of the
defects could be detected by inspecting only 3% of
the code.
Our future work consists of collecting local
module-level defect data to be able to build within-company
predictors for this large telecommunication
system. We also plan to use file-level code churn
metrics in order to predict production defects
between successive versions of the software.
ACKNOWLEDGEMENTS
This research is supported by Boğaziçi University
research fund under grant number BAP 06HA104,
the Turkish Scientific Research Council
(TUBITAK) under grant number EEEAG 108E014
and Turkcell A.Ş.
REFERENCES
Bell, R.M., Ostrand, T.J., Weyuker, E.J., July 2006.
Looking for Bugs in All the Right Places. Proc.
ACM/International Symposium on Software Testing
and Analysis (ISSTA2006), Portland, Maine, pp. 61-
71.
Boetticher, G., Menzies, T., Ostrand, T., 2007. PROMISE
Repository of empirical software engineering data
http://promisedata.org/repository, West Virginia
University, Department of Computer Science.
Fenton, N.E., Neil, M., 1999. A Critique of Software Defect
Prediction Models. IEEE Trans. on Software
Engineering, Vol 25, pp. 675-689.
Fenton, N.E., Ohlsson, N., Aug 2000. Quantitative
Analysis of Faults and Failures in a Complex Software
System. IEEE Trans. on Software Engineering, Vol
26, No 8, pp.797-814.
Kocak, G., Turhan, B., Bener, A., 2008. Software Defect
Prediction Using Call Graph Based Ranking (CGBR)
Framework, to appear in Proceedings of
EUROMICRO SPPI (2008), Parma, Italy.
Koru, A. G., Liu, H., 2005. An Investigation of the Effect
of Module Size on Defect Prediction Using Static
Measures. Proceeding of PROMISE 2005, St. Louis,
Missouri, pp. 1-6.
Koru, A. G., Liu, H., Nov.-Dec. 2005. Building effective
defect-prediction models in practice Software, IEEE,
vol. 22, Issue 6, pp. 23 – 29.
Malaiya, Y. K., Denton, J., 2000. Module Size
Distribution and Defect Density, ISSRE 2000, pp. 62-
71.
Menzies, T., Greenwald, J., Frank, A., 2007. Data Mining
Static Code Attributes to Learn Defect Predictors,
IEEE Transactions on Software Engineering, 33, no.1,
2-13.
Menzies, T., Turhan, B., Bener, A., Distefano, J., 2007.
“Cross- vs within-company defect prediction studies”,
Technical report, Computer Science, West Virginia
University.
NASA, “WVU IV&V facility metrics data program.”
[Online]. Available: http://mdp.ivv.nasa.gov
Ostrand, T.J., Weyuker., E.J., July 2002. The Distribution
of Faults in a Large Industrial Software System. Proc.
ACM/International Symposium on Software Testing
and Analysis (ISSTA2002), Rome, Italy, pp. 55-64.
Ostrand, T.J., Weyuker, E.J., Bell, R.M., July 2004.
Where the Bugs Are. Proc. ACM/International
Symposium on Software Testing and Analysis
(ISSTA2004), Boston, MA.
Ostrand, T.J., Weyuker, E.J., Bell, R.M., April 2005.
Predicting the Location and Number of Faults in Large
Software Systems. IEEE Trans. on Software
Engineering, Vol 31, No 4.
Ostrand, T.J., Weyuker, E.J., Bell, R.M., July 2007.
Automating Algorithms for the Identification of Fault-
Prone Files. Proc. ACM/International Symposium on
Software Testing and Analysis (ISSTA07), London,
England.
Turhan, B., Bener, A., 2008. Data Sampling for Cross
Company Defect Predictors, Technical Report,
Computer Engineering, Bogazici University.
Turhan, B., Bener, A., 2007. A Multivariate Analysis of Static
Code Attributes for Defect Prediction. Proc. Seventh
International Conference on Quality Software
(QSIC '07), pp. 231-237.
Nagappan, N., Ball, T., 2006. Explaining Failures Using
Software Dependences and Churn Metrics. Technical
Report, Microsoft Research.
Zhang, H., 2008. On the Distribution of Software Faults.
IEEE Trans. on Software Engineering, Vol 34, No 2,
pp. 301-302.
Zimmermann, T., Nagappan, N., 2006. Predicting Subsystem
Failures Using Dependency Graph Complexities.
Technical Report, Microsoft Research.