ON THE PREDICTABILITY OF SOFTWARE EFFORTS
USING MACHINE LEARNING TECHNIQUES
Wen Zhang, Ye Yang and Qing Wang
Laboratory for Internet Software Technologies, Institute of Software, Chinese Academy of Sciences, Beijing 100190, China
Keywords:
Software effort prediction, K-medoids, BPNN, Data imputation.
Abstract:
This paper investigates the predictability of software effort using machine learning techniques. We employed
unsupervised learning, namely k-medoids clustering with different similarity measures, to extract natural clusters of
projects from software effort data sets, and supervised learning, namely the J48 decision tree, back propagation neural
network (BPNN) and naïve Bayes, to classify the software projects. We also investigate the impact of imputing
missing values of projects on the performance of both unsupervised and supervised learning techniques.
Experiments on the ISBSG and CSBSG data sets demonstrate that unsupervised learning with k-medoids clustering
produces poor performance in software effort prediction, and that the Kulzinsky coefficient performs best among
the examined measures of project similarity. Supervised learning techniques produce superior performance, and
among the three supervised learning techniques, BPNN performs best. Missing data imputation improves the
performance of both unsupervised and supervised learning techniques.
1 INTRODUCTION
Software effort estimation refers to estimating the human effort
needed to develop a software artifact (Finnie et al.,
1997). Overestimating software effort may lead to a
tight development schedule and faults remaining in the
system after delivery, whereas underestimating effort
may lead to delayed delivery and complaints from customers.
The importance of software development effort estimation
has motivated the construction of models to predict it as
accurately as possible.
Current software effort prediction techniques can
be categorized into four types: empirical, regression,
theory-based, and machine learning techniques
(Pendharkar et al., 2005). Machine Learning (ML)
techniques, such as artificial neural networks (ANN),
decision trees, and naïve Bayes, learn patterns (knowledge)
from historical project data and use these patterns for
effort prediction. Recent studies (Pendharkar et al., 2005)
(Jorgensen, 2004) provide detailed reviews of different
studies on predicting software development effort.
The primary concern of this paper is using machine
learning techniques to predict software effort.
Although COCOMO provides a viable solution to effort
estimation by building an analytic model, machine learning
techniques such as naïve Bayes and artificial neural networks
offer alternative approaches by making use of knowledge
learned from historical projects. Although machine learning
techniques may not be the best solution for effort estimation,
we believe they can at least be used by project managers to
complement other models. Especially in an intensely competitive
software market, accurate estimation of software development
effort has a decisive effect on the success of a software project.
Consequently, effort estimation using different techniques,
and further risk assessment of budget overrun, are necessary
for a trustworthy software project (Yang et al., 2009).
The basic idea of using machine learning techniques for
effort prediction is that a historical data set contains many
projects, each described by features whose values characterize
the project, and that projects with similar feature values
should require approximately similar effort. The task of the
machine learning methods is to learn the inherent patterns of
feature values and their relations with project effort, which
can then be used to predict the effort of new projects.
The rest of this paper is organized as follows.
Section 2 introduces the evaluation measures of software
effort prediction. Section 3 describes the data
sets used in this paper and the preprocessing for the
experiments. Section 4 reports the clustering of software
projects in the data sets for effort prediction. Section 5
reports the classification of software projects by their
efforts using machine learning classifiers. Section 6
presents related work and Section 7 concludes the paper.
2 EVALUATION MEASURES
In software engineering, the deviation of predicted ef-
fort to real effort is used to measure the accuracy of
effort estimators, such as MMRE (Magnitude of Rel-
ative Error), PRED(x) (Prediction within x) and AR
(Absolute Residual) (Korte and Port, 2008). In ma-
chine learning, the performance of classification is of-
ten evaluated by accuracy and, F-measure (Steinbach
et al., 2000) is usually used to evaluate the perfor-
mance of unsupervised learning. Essentially, the eval-
uation measures of effort predictors in software engi-
neering and those in machine learning do not conflict.
In this paper, we adopted the measures from machine
learning to evaluate the performances of effort predic-
tors.
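For reference, the software engineering measures named above are standard; writing e_i for the actual effort and ê_i for the predicted effort of project D_i (in the notation of the next subsection), they are commonly defined as follows (a standard formulation, not quoted from the cited sources):

\[
\mathrm{AR}_i = |\hat{e}_i - e_i|, \qquad
\mathrm{MRE}_i = \frac{|\hat{e}_i - e_i|}{e_i}, \qquad
\mathrm{MMRE} = \frac{1}{m}\sum_{i=1}^{m} \mathrm{MRE}_i, \qquad
\mathrm{PRED}(x) = \frac{1}{m}\bigl|\{\, i : \mathrm{MRE}_i \le x \,\}\bigr|.
\]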
2.1 Accuracy
Assume that D = (D_1, ..., D_i, ..., D_m) is a collection
of software projects, where each D_i is a historical project
described by n attributes X_j (1 ≤ j ≤ n); that is,
D_i = (x_i1, ..., x_ij, ..., x_in)^T, where x_ij is the value
of attribute X_j on D_i. h_i denotes the effort label of
project D_i.
To evaluate the performance of a classifier in effort
prediction, the whole data set is divided into two subsets:
one is used for training the classifier and the other for
testing. That is, D = (D_train | D_test) =
(D_1, ..., D_k | D_{k+1}, ..., D_m)^T, where k is the
predefined number of projects in the training set and m is
the total number of projects in D. For instance, in 10-fold
cross validation, k is 0.9m and the remaining 0.1m projects
are used for testing the trained model. h_i is known for the
training set but remains unknown for the test set. By machine
learning on the training set, a classifier denoted as M is
produced. If we define a Boolean function F as in Equation 1,
then the performance of M is evaluated by the accuracy in
Equation 2.
\[
F(M(D_j), h_j) =
\begin{cases}
1, & \text{if } M(D_j) = h_j;\\
0, & \text{otherwise}
\end{cases}
\tag{1}
\]

\[
\text{accuracy} = \frac{1}{m - k} \sum_{k < j \le m} F(M(D_j), h_j)
\tag{2}
\]
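As a minimal sketch of Equations 1 and 2, the following Python function computes accuracy from the predicted and true effort labels of the test projects; the variable names are illustrative and not part of the original experimental setup.

import numpy as np

def accuracy(predicted_labels, true_labels):
    """Equation 2: the mean of the indicator F of Equation 1 over the test set."""
    predicted = np.asarray(predicted_labels)
    true = np.asarray(true_labels)
    F = (predicted == true).astype(int)  # Equation 1: 1 if M(D_j) == h_j, else 0
    return F.mean()                      # (1 / (m - k)) * sum_{k < j <= m} F(M(D_j), h_j)

For example, accuracy(["high", "low"], ["high", "medium"]) returns 0.5.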
2.2 F-measure
In classification, Y was partitioned into l clusters and
l is a predefined number of clusters in the data set.
That is, Y = c
1
, ..., c
l
, c
i
= {D
i,1
, ..., D
i,|c
i
|
} (1 i l)
and c
i
c
j
= ϕ. F-measure [14] is employed to evalu-
ate the performance of unsupervised learning (cluster-
ing). The formula of F-measure is depicted as Equa-
tions 6 with the supports of Equations 3, 4, and 5.
\[
P(i, j) = \frac{n_{i,j}}{n_j} \tag{3}
\]

\[
R(i, j) = \frac{n_{i,j}}{n_i} \tag{4}
\]

\[
F(i, j) = \frac{2 \times P(i, j) \times R(i, j)}{P(i, j) + R(i, j)} \tag{5}
\]

\[
F\text{-measure} = \sum_i \frac{n_i}{n} \max_j F(i, j) \tag{6}
\]
Here, n_i is the number of software projects with effort
label h_i, n_j is the cardinality of cluster c_j, and n_{i,j}
is the number of software projects with effort label h_i in
cluster c_j; n is the total number of software projects in Y.
P(i, j) is the proportion of projects in cluster c_j that have
effort label h_i; R(i, j) is the proportion of projects with
effort label h_i that fall in cluster c_j; F(i, j) is the
F-measure of cluster c_j with respect to the projects with
effort label h_i. In general, the larger the F-measure, the
better the clustering result.
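The sketch below computes the clustering F-measure of Equations 3 to 6 from a list of effort labels and a parallel list of cluster assignments; it illustrates the definition rather than reproducing the evaluation code used in the paper.

from collections import Counter

def clustering_f_measure(labels, clusters):
    """Equations 3-6: precision P(i, j), recall R(i, j), pairwise F(i, j),
    and the frequency-weighted maximum of F(i, j) over clusters."""
    n = len(labels)
    n_i = Counter(labels)                  # projects per effort label h_i
    n_j = Counter(clusters)                # projects per cluster c_j
    n_ij = Counter(zip(labels, clusters))  # projects with label h_i inside cluster c_j

    total = 0.0
    for i, count_i in n_i.items():
        best_f = 0.0
        for j, count_j in n_j.items():
            nij = n_ij.get((i, j), 0)
            if nij == 0:
                continue
            p = nij / count_j              # Equation 3
            r = nij / count_i              # Equation 4
            f = 2 * p * r / (p + r)        # Equation 5
            best_f = max(best_f, f)
        total += (count_i / n) * best_f    # Equation 6
    return total

For example, clustering_f_measure(["low", "low", "high", "high"], [0, 0, 1, 0]) returns roughly 0.73.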
3 THE DATA SETS
We employed two data sets to investigate the predictability
of software effort using machine learning techniques. One is
the ISBSG (International Software Benchmarking Standards Group)
data set (http://www.isbsg.org) and the other is the CSBSG
(Chinese Software Benchmarking Standard Group) data set
(He et al., 2008).
3.1 ISBSG Data Set
The ISBSG data set contains 1238 projects from insurance,
government and other sectors in 20 different countries, and
each project is described by 70 attributes. To make the data
set suitable for the experiments, we conducted three kinds of
preprocessing: data pruning, data discretization and adding
dummy variables.
Table 1: The used attributes from ISBSG data set.

Branch | Description | Type
Sizing technique | Count Approach | Nominal
Sizing technique | Adjusted Functional Points | Continuous
Schedule | Project Activity Scope | Nominal
Quality | Minor Defects | Continuous
Quality | Major Defects | Continuous
Quality | Extreme Defects | Continuous
Grouping Attributes | Development Type | Nominal
Grouping Attributes | Organization Type | Nominal
Grouping Attributes | Business Area Type | Nominal
Grouping Attributes | Application Type | Nominal
Architecture | Architecture | Nominal
Documents Techniques | Development Techniques | Nominal
Project Attributes | Development Platform | Nominal
Project Attributes | Language Type | Nominal
Project Attributes | Primary Programming Language | Nominal
Project Attributes | 1st Hardware | Nominal
Project Attributes | 1st Operating System | Nominal
Project Attributes | 1st Data Base System | Nominal
Project Attributes | CASE Tool Used | Nominal
Project Attributes | Used Methodology | Nominal
Product Attributes | Intended Market | Nominal
We pruned the ISBSG data set to 249 projects with 22 attributes,
using the criterion that each project must have at least 2/3 of
its attribute values observed and, for each attribute, its values
must be observed on at least 2/3 of the projects. We adopted this
criterion for data selection because too many missing values would
deteriorate the performance of most machine learning techniques,
making a convincing evaluation of software effort prediction
impossible. Among the 22 attributes, 18 are nominal and 4 are
continuous. Table 1 lists the attributes used from the ISBSG
data set.
Data discretization is used to transform the continuous attributes
into discrete variables. The values of each continuous attribute
are partitioned into 3 unique partitions: too many partitions of
an attribute's values would cause data redundancy, whereas too few
partitions may not capture the distinctions among the values of
continuous attributes.
For each nominal attribute, dummy variables are added according
to its unique values so that all variables have binary values.
As a result, all projects are described by 99 Boolean variables
with 0-1 and missing values. Only some machine learning techniques
can handle mixed nominal and continuous data, but most can handle
Boolean values. In the preprocessing, missing values are denoted
as "-1" and kept for all projects on the corresponding variables.
Table 2 shows the value distribution of the variables of the
ISBSG projects after preprocessing. Most values of the variables
are zeros, due to the conversion of discrete attributes into
binary variables.
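A sketch of this preprocessing is given below, assuming the raw projects are loaded into a pandas DataFrame; the column lists, the equal-width 3-bin discretization and the handling of missing values are illustrative assumptions rather than the exact procedure applied to the ISBSG data.

import pandas as pd

def to_boolean_variables(df, nominal_cols, continuous_cols):
    """Discretize continuous attributes into 3 partitions, expand every attribute
    into 0-1 dummy variables, and mark variables of missing attributes as -1."""
    parts = []
    for col in continuous_cols:
        # Equal-width binning into 3 partitions (an assumption; the paper only
        # states that 3 unique partitions are used). NaN stays NaN here.
        binned = pd.cut(df[col], bins=3, labels=["low", "mid", "high"])
        parts.append(pd.get_dummies(binned, prefix=col))
    for col in nominal_cols:
        parts.append(pd.get_dummies(df[col], prefix=col))
    boolean = pd.concat(parts, axis=1).astype(float)

    # A missing source attribute yields all-zero dummies; re-mark those entries
    # as -1 so that a later imputation step (e.g. MINI) can recognize them.
    for col in continuous_cols + nominal_cols:
        missing = df[col].isna()
        dummy_cols = [c for c in boolean.columns if c.startswith(col + "_")]
        boolean.loc[missing, dummy_cols] = -1
    return boolean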
Table 2: The value distribution of variables of projects in ISBSG data set.

Value | Proportion
1 | 20% – 50%
0 | 20% – 60%
-1 | 5% – 33%
Table 3: The effort classes categorized in ISBSG data set.

Class No | Number of projects | Label
1 | 64 | Low
2 | 85 | Medium
3 | 100 | High
Finally, the software effort of the 249 selected projects in
the ISBSG data set was categorized into 3 classes. Projects
with a "normalized work effort" of more than 6,000 person-hours
were categorized into the class labeled "high", projects with
a "normalized work effort" between 2,000 and 6,000 person-hours
as "medium", and projects with a "normalized work effort" of
less than 2,000 person-hours as "low". Table 3 lists the effort
distribution of the selected projects in the ISBSG data set.
Table 4: The used attributes from CSBSG data set.

Branch | Description | Type
Basic information of projects | Count Approach | Nominal
Basic information of projects | City of development | Nominal
Basic information of projects | Business area | Nominal
Basic information of projects | Development type | Nominal
Basic information of projects | Application type | Nominal
Basic information of projects | Development Platform | Nominal
Basic information of projects | IDE | Nominal
Basic information of projects | Programming Language | Nominal
Basic information of projects | Operation System | Nominal
Basic information of projects | Database | Nominal
Basic information of projects | Target Market | Nominal
Basic information of projects | Architecture | Nominal
Basic information of projects | Maximum Number of Concurrent Users | Nominal
Basic information of projects | Life-cycle model | Nominal
Basic information of projects | CASE Tool | Nominal
Size | Added lines of code | Continuous
Size | Revised lines of code | Continuous
Size | Reused lines of code | Continuous
Size | Number of team members in inception phase | Continuous
Size | Number of team members in requirement phase | Continuous
Size | Number of team members in design phase | Continuous
Size | Number of team members in coding phase | Continuous
Size | Number of team members in testing phase | Continuous
Schedule | Time limit in planning | Continuous
Quality | Predicted number of defects in requirements phase | Continuous
Quality | Predicted number of defects in design phase | Continuous
Quality | Predicted number of defects in testing phase | Continuous
Quality | Number of defects within one month after delivery | Continuous
Other | Number of requirement changes in requirement phase | Continuous
Other | Number of requirement changes in design phase | Continuous
Other | Number of requirement changes in coding phase | Continuous
Other | Number of requirement changes in testing phase | Continuous
3.2 CSBSG Data Set
The CSBSG data set contains 1103 projects from the Chinese
software industry. It was created in 2006 with the mission of
promoting Chinese standards of software productivity. The CSBSG
projects were collected from 140 organizations in 15 regions
across China by the Chinese association of the software industry.
Each CSBSG project is described by 179 attributes. The same data
preprocessing as used on the ISBSG data set is applied to the
CSBSG data set. In data pruning, 104 projects and 32 attributes
(15 nominal attributes and 17 continuous attributes) are extracted
from the CSBSG data set. Table 4 lists the attributes used in the
CSBSG data set.
In data discretization, the values of each continuous attribute
are partitioned into 3 unique classes. Dummy variables are added
to transform the nominal attributes into Boolean variables. As a
result, 261 Boolean variables are produced to describe the 104
projects, with missing values denoted as "-1". The value
distribution of the variables of the CSBSG projects is shown in
Table 5; we can see that the CSBSG data set has more missing
values than the ISBSG data set.
Table 5: The value distribution of variables for describing projects in CSBSG data set.

Value | Proportion
1 | 15% – 40%
0 | 20% – 60%
-1 | 10% – 33%
Finally, the projects in the CSBSG data set were categorized
into 3 classes according to their real efforts. Projects with
a "normalized work effort" of more than
5,000 person-hours were categorized into the class labeled
"high", projects with a "normalized work effort" between 2,000
and 5,000 person-hours as "medium", and projects with a
"normalized work effort" of less than 2,000 person-hours as
"low". Table 6 lists the effort distribution of the selected
projects in the CSBSG data set.
Table 6: The effort classes categorized in CSBSG data set.

Class No | Number of projects | Label
1 | 27 | Low
2 | 31 | Medium
3 | 46 | High
4 PREDICTING SOFTWARE
PROJECTS WITH
UNSUPERVISED LEARNING
4.1 k-medoids Clustering and Similarity
Measures
Generally, researchers in software engineering hold the
assumption that projects with similar characteristics, such as
the number of function points, the application domain and the
programming language, are expected to require approximately
equivalent effort (or at least to fall into the same effort
class). From the standpoint of machine learning, clustering
software projects on the basis of a random subset of attributes
can capture information about the unobserved attributes (Krupka
and Tishby, 2008). If we regard effort as an attribute that also
characterizes the software projects in the data set, then software
effort can be deduced by clustering the projects using the other
attributes.
To validate this assumption, k-medoids clustering (Theodoridis
and Koutroumbas, 2006) is adopted to cluster the projects, and
three similarity measures are used to measure the similarity of
the Boolean vectors that represent the software projects.
k-medoids evolved from k-means (Theodoridis and Koutroumbas,
2006); the difference is that k-medoids takes an existing element
of a cluster as the cluster center, whereas k-means takes the
mean vector of the elements in a cluster. We adopt k-medoids
rather than k-means because the mean of Boolean vectors lacks an
interpretable meaning in practice, whereas the medoid corresponds
to a real project.
The typical k-medoids clustering is implemented by the
partitioning around medoids (PAM) algorithm, as depicted in
Algorithm 1. The computational complexity and the convergence of
the PAM algorithm are discussed in (Theodoridis and Koutroumbas,
2006).

Table 7: Three similarity measures used in k-medoids clustering.

Measure | Similarity | Range
Dice | A / (2A + B + C) | [0, 1/2]
Jaccard | A / (A + B + C) | [0, 1]
Kulzinsky | A / (B + C) | [0, +∞)
Algorithm 1. The k-medoids clustering implemented by the PAM algorithm.
Input:
  k, the number of clusters
  m Boolean vectors
Output:
  k clusters partitioned from the m Boolean vectors.
Procedure:
1. Initialize: randomly select k of the m Boolean vectors as the medoids.
2. Associate each Boolean vector with the closest medoid under the
   predefined similarity measure.
3. For each medoid d
4.   For each non-medoid Boolean vector b
5.     Swap d and b and compute the total cost of the configuration
6.   End for
7. End for
8. Select the configuration with the lowest cost.
9. Repeat steps 2 to 8 until there is no change in the medoids.
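The following Python sketch mirrors the PAM procedure above with a pluggable similarity function (assignment to the most similar medoid, greedy swaps kept only when they improve the total similarity); it is an illustrative implementation, not the code used in the experiments.

import random

def pam(vectors, k, similarity, max_iter=100, seed=0):
    """Partitioning Around Medoids: swap medoids with non-medoids as long as the
    total similarity of vectors to their closest medoid improves (equivalently,
    as long as the total cost decreases)."""
    rng = random.Random(seed)
    medoids = rng.sample(range(len(vectors)), k)    # Step 1: random initial medoids

    def assign(meds):
        # Step 2: associate each vector with its most similar medoid.
        return [max(meds, key=lambda m: similarity(v, vectors[m])) for v in vectors]

    def total_similarity(meds):
        return sum(similarity(v, vectors[m]) for v, m in zip(vectors, assign(meds)))

    best = total_similarity(medoids)
    for _ in range(max_iter):
        improved = False
        for position in range(k):                    # Steps 3-7: try every medoid/non-medoid swap
            for b in range(len(vectors)):
                if b in medoids:
                    continue
                candidate = medoids[:position] + [b] + medoids[position + 1:]
                score = total_similarity(candidate)
                if score > best:                     # Step 8: keep the best configuration
                    best, medoids, improved = score, candidate, True
        if not improved:                             # Step 9: stop when the medoids are stable
            break
    return assign(medoids), medoids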
The three adopted similarity measures are the Dice, Jaccard and
Kulzinsky coefficients for binary vectors (Gan et al., 2007).
Assuming that D_i and D_j are two n-dimensional Boolean vectors
and s_pq(D_i, D_j) is the number of entries whose values are p in
D_i and q in D_j, we define A, B, C and D in Equation 7.

\[
A = s_{11}(D_i, D_j), \quad B = s_{01}(D_i, D_j), \quad
C = s_{10}(D_i, D_j), \quad D = s_{00}(D_i, D_j)
\tag{7}
\]
The similarity measures defined by the Dice, Jaccard and
Kulzinsky coefficients are listed in Table 7. We regard D, the
number of characteristics present in neither D_i nor D_j, as an
unimportant factor when measuring the similarity of two projects,
because the proportion of zero values among the variables is very
large in both the ISBSG and CSBSG data sets.
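The match counts of Equation 7 and the three coefficients of Table 7 can be computed as in the sketch below, and these functions can be plugged into the PAM sketch above as its similarity argument; the vectors are assumed to contain only 0/1 values (missing entries already replaced by zeros or imputed).

def match_counts(u, v):
    """A, B, C, D of Equation 7 for two equal-length 0/1 vectors."""
    A = sum(1 for x, y in zip(u, v) if x == 1 and y == 1)
    B = sum(1 for x, y in zip(u, v) if x == 0 and y == 1)
    C = sum(1 for x, y in zip(u, v) if x == 1 and y == 0)
    D = sum(1 for x, y in zip(u, v) if x == 0 and y == 0)
    return A, B, C, D

def dice(u, v):
    A, B, C, _ = match_counts(u, v)
    return A / (2 * A + B + C) if (A + B + C) else 0.0

def jaccard(u, v):
    A, B, C, _ = match_counts(u, v)
    return A / (A + B + C) if (A + B + C) else 0.0

def kulzinsky(u, v):
    A, B, C, _ = match_counts(u, v)
    return A / (B + C) if (B + C) else float("inf")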
4.2 Clustering Results
PAM algorithm is used to cluster the software projects
in ISBSG and CSBSG data sets. The number of clus-
ters is predefined as the number of classes. That is, the
Table 8: k-medoids clustering on ISBSG data set.

Similarity Measure | F-measure without imputation | F-measure with imputation
Dice Coefficient | 0.3520 | 0.3937
Jaccard Coefficient | 0.3824 | 0.4371
Kulzinsky Coefficient | 0.4091 | 0.4624
Without imputation, missing values in the Boolean vectors are
treated as zeros. We also employed the MINI imputation technique
(Song and Shepperd, 2007) to impute the missing values before
clustering. Because the clustering results are unstable due to
the initial selection of cluster centers in the PAM algorithm,
we repeated each experiment 10 times and used the ensemble
clustering proposed by Zhou and Tang (Zhou and Tang, 2006) to
produce the final clusters. Table 8 shows the performance of
k-medoids clustering on the ISBSG data set using the PAM
algorithm with the three similarity measures, with and without
imputation.
We can see from Table 8 that the Kulzinsky coefficient has the
best performance among the three similarity measures and that
the Jaccard coefficient performs better than the Dice coefficient.
In k-medoids clustering without (with) imputation, the Kulzinsky
coefficient increases the F-measure by 16.29% (17.45%) and the
Jaccard coefficient by 8.6% (5.78%), using the Dice coefficient
as the baseline. This outcome indicates that the number of common
positive entries, A in Equation 7, is more important than the
other counts when measuring the similarity of software projects
for effort prediction by clustering. Imputation significantly
improves the quality of the clustering results, which validates
the effectiveness of imputing the missing values of projects
represented by Boolean vectors in k-medoids clustering.
To take a detailed look at the clustering results, Table 9 shows
the projects in the produced clusters across the classes of
Table 3. These clusters were produced by k-medoids clustering
using the Kulzinsky coefficient with imputation (i.e. an F-measure
of 0.4624). We can see that k-medoids clustering has not produced
high-quality clusters on the ISBSG data set, and the results are
not good enough to be acceptable in the real practice of software
effort prediction. For instance, cluster 2 mixes projects of
classes 1 and 2, and most projects of one class are scattered
over more than one cluster, such as the projects of classes 2
and 3.
Table 10 shows the performance of the PAM algorithm on the CSBSG
data set, and Table 11 shows the distribution of projects in
clusters across classes.
Table 9: Clustering result using Kulzinsky coefficient with imputation on ISBSG data set.

Cluster | Class 1 | Class 2 | Class 3 | Total
Cluster 1 | 26 | 23 | 28 | 77
Cluster 2 | 32 | 40 | 27 | 99
Cluster 3 | 6 | 22 | 45 | 73
Total | 64 | 85 | 100 | 249
Similarly, k-medoids clustering has not produced a favorable
performance on the CSBSG data set. In fact, the performance of
k-medoids clustering on the CSBSG data set is worse than on the
ISBSG data set: without (with) imputation, the average F-measure
over the three coefficients on the CSBSG data set is lower by
7.5% (3.7%), using the average on the ISBSG data set as the
baseline. We explain this outcome by the fact that the CSBSG data
set has fewer ones and more missing values (denoted as "-1") in
its Boolean vectors than the ISBSG data set, as can be seen in
Tables 2 and 5. Based on this analysis, the predictability of
software effort using unsupervised learning is not at a level
acceptable to the software industry.
5 CLASSIFICATION OF
SOFTWARE PROJECTS
5.1 Supervised Learning Techniques
The supervised learning techniques employed are those commonly
used in effort prediction: the J48 decision tree, BPNN and naïve
Bayes. The J48 decision tree classifier works as follows. To
classify the effort of a software project, it first builds a
decision tree from the values of the variables in the training
data set. Given a set of Boolean vectors (the training set), it
identifies the variable with the largest information gain
(Quinlan, 1993). Among the possible values of this variable, if
there is any value for which there is no ambiguity, that is, for
which all projects taking this value have the same effort label,
the branch is terminated and that effort label is assigned to
the terminal node.
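J48 is Weka's implementation of C4.5; as a rough stand-in, the sketch below trains an entropy-based decision tree on the Boolean project vectors with scikit-learn. It is not the tool actually used in the experiments, and X_train, y_train, X_test are hypothetical placeholders for the preprocessed data.

from sklearn.tree import DecisionTreeClassifier

def train_effort_tree(X_train, y_train):
    """Fit an entropy-based decision tree (an approximation of C4.5/J48) on the
    Boolean project vectors X_train and their effort labels y_train."""
    tree = DecisionTreeClassifier(criterion="entropy",  # information-gain-style splits
                                  min_samples_leaf=2, random_state=0)
    return tree.fit(X_train, y_train)

# Usage (hypothetical data): train_effort_tree(X_train, y_train).predict(X_test)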
The back propagation neural network (BPNN) (Rumelhart et al.,
1986) is also used to classify the software projects in both the
ISBSG and CSBSG data sets.
Table 10: k-medoids clustering on CSBSG data set.

Similarity Measure | F-measure without imputation | F-measure with imputation
Dice Coefficient | 0.3403 | 0.3772
Jaccard Coefficient | 0.3881 | 0.4114
Kulzinsky Coefficient | 0.4065 | 0.4560
Figure 1: BPNN with 5 nodes in hidden layer and 3 nodes
in output layer.
BPNN performs two sweeps of the network: first a forward sweep
from the input layer to the output layer, and then a backward
sweep from the output layer to the input layer. The backward
sweep is similar to the forward sweep, except that error values
are propagated back through the network to determine how the
weights of the neurons are to be changed during training. The
objective of training is to find a set of network weights that
yields a prediction model with minimum error.
Table 11: Clustering result using Kulzinsky coefficient with imputation on CSBSG data set.

Cluster | Class 1 | Class 2 | Class 3 | Total
Cluster 1 | 9 | 11 | 16 | 36
Cluster 2 | 10 | 10 | 17 | 37
Cluster 3 | 8 | 10 | 13 | 31
Total | 27 | 31 | 46 | 104
A three-layer, fully connected feed-forward network consisting of
an input layer, a hidden layer and an output layer is adopted in
the experiments. The tan-sigmoid transfer function is used in the
hidden layer with 5 nodes and the pure linear function in the
output layer with 3 nodes. The BPNN network is designed as shown
in Figure 1.
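A rough Python analogue of this architecture (one hidden layer of 5 tan-sigmoid units trained by back propagation) is sketched below with scikit-learn; it approximates rather than reproduces the Matlab configuration used in the paper, and the data arrays are hypothetical placeholders.

from sklearn.neural_network import MLPClassifier

def train_bpnn(X_train, y_train):
    """One hidden layer with 5 tanh units, trained by back propagation; the
    three-class output layer is handled internally by the classifier."""
    bpnn = MLPClassifier(hidden_layer_sizes=(5,), activation="tanh",
                         solver="sgd", learning_rate_init=0.01,
                         max_iter=2000, random_state=0)
    return bpnn.fit(X_train, y_train)

# Usage (hypothetical data): train_bpnn(X_train, y_train).predict(X_test)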
Naïve Bayes (Duda et al., 2003) is a well-known probabilistic
classifier in machine learning. It is based on Bayes' theorem of
posterior probability and assumes that the effect of an attribute
value on a given class is independent of the values of the other
attributes. This class-conditional independence assumption
simplifies the computation involved in building the classifier,
which is why the resulting classifier is called "naïve". Compared
to other traditional prediction models, naïve Bayes provides tools
for risk estimation and allows decision makers to combine
historical data with subjective expert estimates (Pendharkar
et al., 2005).
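For 0-1 project vectors, a Bernoulli naive Bayes model is a natural choice; the sketch below is illustrative, assumes missing values have already been imputed to 0/1, and is not the Weka implementation used in the paper.

from sklearn.naive_bayes import BernoulliNB

def train_naive_bayes(X_train, y_train):
    """Each Boolean variable is modelled as an independent Bernoulli feature
    conditioned on the effort class."""
    return BernoulliNB().fit(X_train, y_train)

# Usage (hypothetical data): posterior probability per effort class
# train_naive_bayes(X_train, y_train).predict_proba(X_test)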
The J48 decision tree and naïve Bayes are implemented using Weka
(Waikato Environment for Knowledge Analysis)
(http://www.cs.waikato.ac.nz/ml/weka/), and BPNN is implemented
using the Matlab neural network toolbox
(http://www.mathworks.com/products/neural-net/). The MINI
algorithm (Song and Shepperd, 2007) is used to impute the missing
values of the Boolean vectors where necessary.
Our experiments were carried out with 10-fold cross-validation.
For each experiment, we divided the whole data set (ISBSG or
CSBSG) into 10 subsets; 9 of the 10 subsets are used for training
and the remaining subset for testing. We repeat the experiment 10
times and measure the performance of the prediction model by the
average accuracy over the 10 repetitions.
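The 10-fold protocol can be written compactly as below (a sketch using scikit-learn utilities; `classifier` stands for any of the three models above, and X, y are the preprocessed data as NumPy arrays).

import numpy as np
from sklearn.base import clone
from sklearn.model_selection import StratifiedKFold

def ten_fold_accuracy(classifier, X, y, seed=0):
    """Train on 9 folds, test on the held-out fold, and average the 10 accuracies."""
    folds = StratifiedKFold(n_splits=10, shuffle=True, random_state=seed)
    scores = []
    for train_idx, test_idx in folds.split(X, y):
        model = clone(classifier).fit(X[train_idx], y[train_idx])
        scores.append((model.predict(X[test_idx]) == y[test_idx]).mean())
    return np.mean(scores), np.std(scores)  # average accuracy and its deviation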
5.2 Classification Result
Table 12 shows the performance of the three classifiers in
classifying the projects of the ISBSG data set. On average, BPNN
outperforms the other classifiers in classifying the software
projects by effort, and the J48 decision tree performs better
than naïve Bayes. Using the performance of naïve Bayes as the
baseline, BPNN increases the average accuracy by 16.25% (11.71%)
and the J48 decision tree by 5.6% (2.5%) without (with)
imputation.
We attribute this outcome to BPNN having the best capacity to
eliminate noise and peculiarities, because it uses the backward
sweep to adjust the weights of the neurons and thereby reduce the
error of the predictive model. However, the performance of BPNN
is not as robust as that of the other classifiers (we observe
this from its standard deviation). The adoption of
cross-validation may reduce the overfitting of BPNN to some
extent, but it cannot eliminate this drawback entirely. The J48
decision tree classifies the projects using learned decision
rules. Due to the use of information gain (Quinlan, 1993), the
variables with more discriminative power are selected by J48 in
the earlier branches when constructing the decision rules and
thus the noise and peculiarities contained in the variables
with less discriminative power will be ignored automatically
(especially in tree pruning).
Naïve Bayes has the worst performance among the three classifiers
in classifying software efforts. We explain this by the fact that
the variables used to describe the projects may not be independent
of each other. Moreover, naïve Bayes treats all variables as
having equal weight in the prediction model: the conditional
probabilities of all variables contribute equally when predicting
the label of an incoming project. In fact, however, some variables
of the projects have more discriminative power than others in
determining the project effort. Noise and peculiarities are often
contained in the variables with little discriminative power, and
those variables should be given less importance in the prediction
model. We conjecture that this is also the cause of the poor
performance of k-medoids in clustering the projects. As in
k-medoids clustering, the MINI technique significantly improves
the performance of project classification by imputing the missing
values of the Boolean vectors.
Table 13 shows the performance of the three classifiers on the
CSBSG data set. Conclusions similar to those for the ISBSG data
set can be drawn for the CSBSG data set. However, the performance
of the three classifiers on the CSBSG data set is worse than on
the ISBSG data set: the average of the overall accuracies of the
three techniques without (with) imputation on the CSBSG data set
is lower by 6.95% (6.66%), using the corresponding average on the
ISBSG data set as the baseline. We again attribute this outcome
to the lower quality of the CSBSG data set compared with the
ISBSG data set.
We can see from Tables 12 and 13 that, on both the ISBSG and CSBSG
data sets, none of the three supervised learning techniques
produces a favorable classification of software efforts from the
project attributes. The best performance, produced by BPNN, is an
accuracy of around 60%. An accuracy of 60% is of little use for
software effort prediction in most cases because it means that,
with probability 0.4, the prediction falls outside the correct
effort class. Combined with the effort prediction results from
unsupervised learning, we conclude that the predictability of
software effort using supervised learning techniques is not
acceptable to the software industry either.
6 RELATED WORK
Srinivasan and Fisher (Srinivasan and Fisher, 1995) used a
decision tree and BPNN to estimate software development effort.
The COCOMO data set with 63 historical projects was used as the
training data, and the Kemerer data set with 15 projects was used
as the test data. They reported that the decision tree and BPNN
are competitive with the traditional COCOMO estimator. However,
they pointed out that the performance of machine learning
techniques is very sensitive to the data on which they are
trained. (Finnie et al., 1997) compared three estimation
techniques, BPNN, case-based reasoning and regression models,
using Function Points as the measure of system size. They
reported that neither case-based reasoning nor regression models
were favorable for estimating software effort due to the
considerable noise in the data set. BPNN appears capable of
providing adequate estimation performance (with an MRE of 35%),
but its performance depends largely on the quality of the
training data as well as the suitability of the test data to the
trained model. A large amount of uncertainty is inherent in the
performance of all three methods. In both (Finnie et al., 1997)
and (Srinivasan and Fisher, 1995), a serious problem confronting
effort estimation using machine learning techniques is the large
uncertainty in the robustness of these techniques. That is, the
model sensitivity and data-dependent nature of machine learning
techniques hinder their adoption in industrial practice for
effort prediction. These works, as well as (Prietula et al.,
1996), motivate this study to investigate the effectiveness of a
variety of machine learning techniques on two different data sets.
(Park and Baek, 2008) conducted an empirical validation of a
neural network model for software effort estimation. The data set
used in their experiments was collected from a Korean IT company
and includes 148 IT projects. They compared expert judgment,
regression models and BPNN with different input variables for
software effort estimation. They reported that a neural network
using Function Points and 6 other variables (project length,
usage level of a system development methodology, amount of high-,
middle- and low-level manpower, and percentage of outsourcing) as
input outperforms the other estimation methods. However, even for
the best performance, the average MRE is nearly 60% with a
standard deviation of more than 30%. This result makes it very
hard for the method proposed in their work to be satisfactorily
adopted in practice. For this reason, a validation of machine
learning methods is necessary to shed light on the advancement of
software effort estimation. This point also motivates us to
investigate the effectiveness of machine learning techniques for
software effort estimation and the predictability of software
effort using machine learning techniques.
(Shukla, 2000) proposed a neuro-genetic approach to predict
software development effort, in which
Table 12: Classification of software project efforts on ISBSG data set (average accuracy ± standard deviation).

Classifier | Without imputation | With imputation
J48 decision tree | 0.5706 ± 0.1118 | 0.5917 ± 0.1205
BPNN | 0.6281 ± 0.1672 | 0.6448 ± 0.1517
naïve Bayes | 0.5403 ± 0.1123 | 0.5772 ± 0.1030
Table 13: Classification of software project efforts on CSBSG data set (average accuracy ± standard deviation).

Classifier | Without imputation | With imputation
J48 decision tree | 0.4988 ± 0.1103 | 0.5341 ± 0.1322
BPNN | 0.1650 ± 0.1650 | 0.6132 ± 0.1501
naïve Bayes | 0.5331 ± 0.1221 | 0.5585 ± 0.0910
the neural network is employed to construct the prediction model
and a genetic algorithm is used to optimize the weights between
the nodes of the input layer and the nodes of the output layer.
They used the same data sets as Srinivasan and Fisher (Srinivasan
and Fisher, 1995) and reported that the neuro-genetic approach
outperforms both the decision tree and BPNN. However, they also
reported that local minima and overfitting deteriorate the
performance of the proposed method in some cases, even making it
a poorer predictor than a traditional estimator such as COCOMO
(Boehm, 1981). The focus of our study is not to propose a novel
approach to software effort estimation but to extensively review
the usefulness of machine learning techniques for software effort
estimation, that is, to what extent typical machine learning
techniques can accurately estimate the effort of a given project
using historical data.
7 CONCLUDING REMARKS
In this paper, we conducted a series of experiments to
investigate the predictability of software effort using machine
learning techniques. On the ISBSG and CSBSG data sets,
unsupervised learning in the form of k-medoids clustering is used
to cluster the software projects with respect to effort, and
supervised learning in the form of the J48 decision tree, BPNN
and naïve Bayes is used to classify the projects. Our assumption
for this investigation is that the effort of a software project
can be deduced from the values of its other attributes in the
historical data, and that projects with similar values on the
attributes other than effort will also require approximately
equivalent effort.

The experimental results demonstrate that neither unsupervised
nor supervised learning techniques provide a favorable model for
software effort prediction. Despite this, the Kulzinsky
coefficient produced the best performance among the similarity
measures for unsupervised learning, BPNN produced the best
performance among the examined supervised learning techniques,
and the MINI imputation improved data quality and thereby
improved effort prediction significantly.
ACKNOWLEDGEMENTS
This work is supported by the National Natural Science Foundation
of China under Grant Nos. 60873072, 61073044, and 60903050; the
National Science and Technology Major Project; the National Basic
Research Program under Grant No. 2007CB310802; and the Scientific
Research Foundation for the Returned Overseas Chinese Scholars,
State Education Ministry.
REFERENCES
Boehm, B. (1981). Software Engineering Economics. Prentice-Hall,
New Jersey, USA.
Duda, R., Hart, P., and Stork, D. (2003). Pattern Classifica-
tion. John Wiley & Sons, 2nd edition.
Finnie, G., Wittig, G., and Desharnais, J. (1997). A com-
parison of software effort estimation techniques: Us-
ing function points with neural networks, case-based
reasoning and regression models. Journal of Systems
and Software, 39:281–289.
Gan, G., Ma, C., and Wu, J. (2007). Data Clustering: Theory,
Algorithms, and Applications. ASA-SIAM Series on Statistics and
Applied Probability, page 78.
He, M., Li, M., Wang, Q., Yang, Y., and Ye, K. (2008). An
investigation of software development productivity in China. In
Proceedings of the International Conference on Software Process,
pages 381–394.
Jorgensen, M. (2004). A review of studies on expert es-
timation of software development effort. Journal of
Systems and Software, 70:37–60.
Korte, M. and Port, D. (2008). Confidence in software cost
estimation results based on MMRE and PRED. In Proceedings of
PROMISE'08, pages 63–70.
Krupka, E. and Tishby, N. (2008). Generalization from observed to
unobserved features by clustering. Journal of Machine Learning
Research, 83:339–370.
Park, H. and Baek, S. (2008). An empirical validation of a neural
network model for software effort estimation. Expert Systems with
Applications, 35:929–937.
Pendharkar, P., Subramanian, G., and Roger, J. (2005). A
probabilistic model for predicting software development effort.
IEEE Transactions on Software Engineering, 31(7):615–624.
Prietula, M., Vicinanza, S., and Mukhopadhyay, T. (1996).
Software-effort estimation with a case-based reasoner. Journal of
Experimental & Theoretical Artificial Intelligence, 8:341–363.
Quinlan, J. (1993). C4.5: Programs for Machine Learning. Morgan
Kaufmann Publishers.
Rumelhart, D., Hinton, G., and Williams, R. (1986). Learning
internal representations by error propagation. In Parallel
Distributed Processing: Explorations in the Microstructure of
Cognition, pages 318–362.
Shukla, K. (2000). Neuro-genetic prediction of software
development effort. Information and Software Tech-
nology, 42:701–713.
Song, Q. and Shepperd, M. (2007). A new imputation
method for small software project data sets. Journal
of Systems and Software, 80:51–62.
Srinivasan, K. and Fisher, D. (1995). Machine learning
approaches to estimating software development ef-
fort. IEEE Transactions on Software Engineering,
21(2):126–137.
Steinbach, M., Karypis, G., and Kumar, V. (2000). A com-
parison of document clustering techniques. In KDD-
2000 Workshop on Text Mining, pages 109–110.
Theodoridis, S. and Koutroumbas, K. (2006). Pattern
Recognition. Elsevier, 3rd edition.
Yang, Y., Wang, Q., and Li, M. (2009). Process trustworthiness as
a capability indicator for measuring and improving software
trustworthiness. In Proceedings of ICSP 2009, pages 389–401.
Zhou, Z. and Tang, W. (2006). Clusterer ensemble.
Knowledge-Based Systems, 19:77–83.