A Novel Approach to Quantifying the Influence

of Software Process on Project Performance

Jia-kuan Ma (1,2), Xiao-fan Tong (1,2), Ya-sha Wang (1,2) and Gang Li (3,4)

(1) Key Laboratory of High Confidence Software Technologies, Ministry of Education, Beijing, China
(2) Software Institute, School of Electronics Engineering and Computer Science, Peking University, Beijing, China
(3) Shandong Computer Science Center, Jinan, China
(4) Shandong Provincial Key Laboratory of Computer Network, Jinan, China

Abstract. Determining the appropriate process to be used is a key ingredient of project management. To this end, understanding the influence of activities on project performance can facilitate project management. However, quantifying such a relationship via the traditional Multiple Linear Regression method tends to be challenging, because the number of independent variables (activities in the software process) is usually larger than the size of the dataset. To address this problem, we propose a novel approach in this paper. By combining the Dantzig selector with the Ordinary Least Squares (OLS) regression method, our approach can derive the regression model in such challenging situations, which sets the theoretical stage for studying the quantitative influence of software process on project performance.

1 Introduction

Inherent in every software project there is a process (whether known or unknown,

whether good or bad, and whether stable or erratic) [1]. Nowadays, it has been widely

accepted that for a given software project, the employed process can have significant

influence on project performance (e.g. schedule, budget, quality of deliverables) in

general [2-3]. Specifically, [4] found evidence, in a sample of 61 organizations, that

higher CMM process maturity is associated with better project performance.

Therefore, determining the appropriate process to be used is a key ingredient of project management. However, the conclusion that software process can influence project performance does not tell us the details of this relationship. For instance:

• Among the various activities in a given process, are certain activities more likely than others to influence the final project performance? The answer to this question can help the project manager focus on the more crucial activities.

• Which activities have a significant positive impact on the final project performance, and which have a significant negative impact? Answers to these questions can help the project manager better customize the applied process, that is, include positive-impact activities while excluding negative-impact ones.

Ma J., Tong X., Wang Y. and Li G.
A Novel Approach to Quantifying the Influence of Software Process on Project Performance.
DOI: 10.5220/0003584000440050
In Proceedings of the 1st International Workshop on Evidential Assessment of Software Technologies (EAST-2011), pages 44-50
ISBN: 978-989-8425-58-4
© 2011 SCITEPRESS (Science and Technology Publications, Lda.)

Theoretically, answering these questions calls for establishing a quantitative relationship between software process and project performance. In particular, taking each activity in a given process as an independent variable and project performance as the dependent variable, we need to quantify the influence of the independent variables on the dependent variable.

In this light, the most commonly employed approach is the Multiple Linear Regression model, which can be expressed as follows:

$$Y_i = \beta_0 + \beta_1 X_{i1} + \dots + \beta_p X_{ip} + \varepsilon_i, \quad i = 1, 2, \dots, n$$

where

$Y_i$ = the ith observation of the dependent variable Y. In our context, $Y_i$ denotes the project performance of a given project (labeled i) in the dataset.

$X_{ij}$ = the ith observation of the independent variable $X_j$ (j = 1, 2, …, p). In our context, $X_{ij}$ denotes an activity (labeled j) in the process of a given project (labeled i) in the dataset.

$\beta_0$ = the intercept of the equation; $\varepsilon_i$ = the error term.

$\beta_1, \beta_2, \dots, \beta_p$ = the slope coefficients for each of the independent variables. In our context, $\beta_j$ denotes the association between an activity $X_j$ and the final project performance Y, that is, the expected change in Y per unit change in $X_j$ with the other activities held fixed. It is estimated from the dataset.

n = the number of observations. In our context, n denotes the number of projects we collected, namely the size of the dataset.

In general, Multiple Linear Regression estimates $\beta_0, \dots, \beta_p$ through the Ordinary Least Squares (OLS) criterion, which minimizes the sum of squared residuals:

$$\min_{\beta_0, \dots, \beta_p} \sum_{i=1}^{n} \left( Y_i - \beta_0 - \beta_1 X_{i1} - \dots - \beta_p X_{ip} \right)^2$$

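When the number of observations (n) is larger than the number of independent variables (p), namely n > p, and the independent variables are not perfectly collinear, the criterion above is equivalent to the p+1 first-order conditions obtained by setting the partial derivative with respect to each coefficient to zero. Solving these equations yields the estimates of $\beta_0$ to $\beta_p$, and therefore formulates the Multiple Linear Regression model.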
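As a minimal sketch of the n > p case, the snippet below solves the p+1 first-order conditions (the normal equations) with NumPy. The function name `ols_fit` and the synthetic data are assumptions made for illustration; they are not part of the original study.

```python
import numpy as np

def ols_fit(X, y):
    """Return (beta_0, ..., beta_p) minimizing the sum of squared residuals."""
    n, p = X.shape
    if n <= p:
        raise ValueError("OLS needs more observations than variables (n > p)")
    A = np.column_stack([np.ones(n), X])      # prepend the intercept column
    # The p+1 first-order conditions, written as A^T A beta = A^T y
    return np.linalg.solve(A.T @ A, A.T @ y)

# Synthetic data, for illustration only: 3 activities, 20 projects
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
true_beta = np.array([1.0, 2.0, 0.0, -1.5])   # intercept first
y = true_beta[0] + X @ true_beta[1:]          # noise-free, so the fit is exact
beta_hat = ols_fit(X, y)
```

With noise-free data the recovered coefficients match `true_beta` up to floating-point error; with real project data the estimates would carry sampling error.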

Unfortunately, this is commonly not the case when conducting such studies in practice. On the one hand, a software process usually embodies a host of activities. For instance, the international standard IEEE Std 12207:2008 [5], which serves as a major process framework, contains 123 activities in total. As a result, when taking the activities in a software process as independent variables, the number of independent variables (p) is usually large. On the other hand, collecting software process and project related data is costly and challenging [6], which limits the size of the feasible dataset. Consequently, the number of observations is usually small compared to the number of activities. Under such a condition, the OLS criterion no longer yields a unique solution and is therefore no longer applicable.
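The non-uniqueness can be seen in a small numerical sketch. The sizes (5 projects, 8 activities) and the random data below are illustrative assumptions only: with n < p, two different coefficient vectors reproduce the observations equally well, so the least-squares criterion cannot choose between them.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 5, 8                                   # fewer projects than activities
X = rng.normal(size=(n, p))
y = rng.normal(size=n)

# The rank of X (and hence of the p x p matrix X^T X) is at most n < p,
# so the normal equations are singular.
rank = np.linalg.matrix_rank(X)

# Two different coefficient vectors that fit the data equally well:
beta_a = np.linalg.pinv(X) @ y                # minimum-norm least-squares solution
null_dir = np.linalg.svd(X)[2][-1]            # a direction with X @ null_dir ~ 0
beta_b = beta_a + 3.0 * null_dir
```

Both `beta_a` and `beta_b` give the same fitted values, which is why an additional selection principle is needed when p exceeds n.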

To address this problem, we propose in this paper a new approach to formulating the quantitative relationship. Our approach first performs variable selection based on the Dantzig selector. Then, with the n > p condition satisfied, we iteratively apply OLS regression following the Backward Elimination method. In this way, the quantitative relationship between software process and project performance can be derived from a relatively small dataset. Meanwhile, the theoretical properties established for the Dantzig selector support the accuracy and reliability of the result.

The rest of this paper is organized as follows. Section 2 presents a brief introduction to the Dantzig selector. Section 3 proposes our novel approach of combining the Dantzig selector with traditional Multiple Linear Regression. Finally, we conclude and discuss future work in Section 4.

2 The Dantzig Selector

In many statistical applications, the number of independent variables p is larger than the number of observations n. Suppose we have the following linear regression model:

$$y = X\beta + \varepsilon \qquad (1)$$

where $\beta = (\beta_1, \beta_2, \dots, \beta_p)^T \in \mathbb{R}^p$ is the vector of regression coefficients, and the $\varepsilon_i$'s are i.i.d. $N(0, \sigma^2)$. X is an n × p data matrix with possibly fewer rows than columns, i.e. n < p. $X = (x_1, x_2, \dots, x_n)^T$, where $x_i = (x_{i1}, x_{i2}, \dots, x_{ip})^T$, i = 1, 2, ..., n, are the predictor variables. Besides, $y = (y_1, y_2, \dots, y_n)^T \in \mathbb{R}^n$ is the vector of observations.

When n < p, the OLS criterion cannot provide a unique solution. To deal with this challenge, Candes and Tao proposed a new approach, namely the Dantzig selector [7], which can generate a sparse estimate of β. Sparsity means that many coefficients in the result are exactly 0. In this sense, the Dantzig selector provides a reliable method for variable selection.

Specifically, the Dantzig selector is the solution to the following $\ell_1$-regularization problem (2):

$$\min_{\tilde\beta \in \mathbb{R}^p} \|\tilde\beta\|_{\ell_1} \quad \text{s.t.} \quad \|X^* r\|_{\ell_\infty} \le (1 + t^{-1}) \sqrt{2 \log p} \cdot \sigma \qquad (2)$$

where r is the residual vector $y - X\tilde\beta$, $X^*$ is the transpose of X, and t is a positive scalar. Candes and Tao show that if X obeys a uniform uncertainty principle (with unit-normed columns) and if the true parameter vector is sufficiently sparse (which here roughly guarantees that the model is identifiable), then the following result (3) holds with very large probability.

$$\|\hat\beta - \beta\|_{\ell_2}^2 \le C^2 \cdot 2 \log p \cdot \left( \sigma^2 + \sum_{i} \min(\beta_i^2, \sigma^2) \right) \qquad (3)$$

To further estimate β from noisy data, for some $\lambda_p > 0$, consider solving the following convex program,

$$\min_{\tilde\beta \in \mathbb{R}^p} \|\tilde\beta\|_{\ell_1} \quad \text{s.t.} \quad \|X^* r\|_{\ell_\infty} := \sup_{1 \le i \le p} |(X^* r)_i| \le \lambda_p \cdot \sigma \qquad (4)$$

where $r = y - X\tilde\beta$. In other words, the estimator $\hat\beta$ has minimum complexity (as measured by the $\ell_1$-norm) among all objects that are consistent with the data. The estimator (4) is called the Dantzig selector.

Program (4) is convex and can easily be recast as a linear program (5),

$$\min \sum_{i} u_i \quad \text{s.t.} \quad -u \le \tilde\beta \le u \ \text{ and } \ -\lambda_p \sigma \mathbf{1} \le X^* (y - X\tilde\beta) \le \lambda_p \sigma \mathbf{1} \qquad (5)$$

where the optimization variables are $u, \tilde\beta \in \mathbb{R}^p$ and $\mathbf{1}$ is a p-dimensional vector of ones. Accordingly, this estimation procedure is computationally tractable. Candes and Tao also proved that the Dantzig selector is surprisingly accurate.
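The linear-program form lends itself to a short sketch using SciPy's general-purpose `linprog` solver. This is an illustrative implementation under stated assumptions, not the l1-magic code referenced in this paper: the function name `dantzig_selector`, the stacking of the variables as (beta, u), the synthetic data and the choice of the bound `lam_sigma` are all hypothetical.

```python
import numpy as np
from scipy.optimize import linprog

def dantzig_selector(X, y, lam_sigma):
    """Solve min ||b||_1 s.t. ||X^T (y - X b)||_inf <= lam_sigma as an LP."""
    n, p = X.shape
    G, c = X.T @ X, X.T @ y
    I, Z = np.eye(p), np.zeros((p, p))
    cost = np.concatenate([np.zeros(p), np.ones(p)])   # minimize sum(u)
    A_ub = np.block([[I, -I],      #  b - u <= 0
                     [-I, -I],     # -b - u <= 0
                     [-G, Z],      #  X^T (y - X b) <= lam_sigma
                     [G, Z]])      # -X^T (y - X b) <= lam_sigma
    b_ub = np.concatenate([np.zeros(2 * p), lam_sigma - c, lam_sigma + c])
    bounds = [(None, None)] * p + [(0, None)] * p      # b free, u >= 0
    res = linprog(cost, A_ub=A_ub, b_ub=b_ub, bounds=bounds, method="highs")
    if not res.success:
        raise RuntimeError(res.message)
    return res.x[:p]

# Synthetic example (assumed data): 15 observations, 40 variables, 2 active
rng = np.random.default_rng(0)
n, p, sigma = 15, 40, 0.05
X = rng.normal(size=(n, p))
X /= np.linalg.norm(X, axis=0)                         # unit-normed columns
beta = np.zeros(p)
beta[[3, 17]] = [2.0, -1.5]
y = X @ beta + sigma * rng.normal(size=n)
lam_sigma = np.sqrt(2 * np.log(p)) * sigma             # one possible bound
beta_hat = dantzig_selector(X, y, lam_sigma)
```

The returned vector satisfies the correlation constraint by construction; how many of its entries are exactly zero depends on the data and on the chosen bound.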

Since its introduction, the Dantzig selector has drawn considerable attention, and much useful follow-up work has appeared, such as a generalized Dantzig selector [8] and the DASSO method [9] for computing the Dantzig selector. The challenge of analyzing data sets whose sample size n is smaller than the number of variables p arises in many other fields, ranging from health sciences to economics. For instance, in disease classification using microarray gene expression data [10], the number of arrays is usually in the order of tens, while the total number of gene expression profiles is often in the tens of thousands. The Dantzig selector has been applied as an effective solution to various such problems.

3 Our Approach

Our approach consists of two steps. First, we leverage the Dantzig selector to select the independent variables most strongly related to the dependent variable from the original large pool of candidates. After the selection, the number of picked variables drops below the number of samples. Then, we apply OLS regression iteratively to derive the final regression model, as prescribed by the Backward Elimination method.

3.1 Variable Selection via the Dantzig Selector

As a variable selection method, the Dantzig selector itself does not designate the number of variables to pick. To keep the selection result accurate and reliable, we use a statistical method called cross-validation to determine the best number of variables to pick. It has been shown in [11] that the cross-validated choice of the penalty parameter is consistent for model selection under general conditions.

Specifically, we use fivefold cross-validation to make the decision. The cross-validation procedure is as follows.

(a) Standardize the data; denote the full sample set by T; divide it randomly and equally into 5 parts to get the subsets $T_v$, v = 1, 2, 3, 4, 5.

(b) Define the fivefold cross-validation training set as $T - T_v$, and the test set as $T_v$.

(c) Apply the Dantzig selector on the training data set $T - T_v$ to select J parameters, denoted as a set $S_J^{(v)}$; do this repeatedly, increasing J from 1 to k, a certain positive integer, to get the sets $S_1^{(v)}, S_2^{(v)}, \dots, S_k^{(v)}$.

(d) Let $PE_v(J)$ be the prediction error when $S_J^{(v)}$ is applied to the test data set $T_v$, and form the estimate $PE(J) = \frac{1}{5} \sum_{v=1}^{5} PE_v(J)$.

(e) Find the $\hat J$ that minimizes PE(J); our selected model is $S_{\hat J}$.

An implementation of the Dantzig Selector is available at: http://www.acm.caltech.edu/l1magic/

This process is called fivefold cross-validation because the full sample set is divided into five parts. Note that this fivefold cross-validation is not the same as estimating the prediction error of the fixed models $S_1, S_2, \dots, S_k$ and then choosing the one with the smallest prediction error. The latter procedure is described in [12], and can lead to inconsistent model selection unless the cross-validation test set $T_v$ grows at an appropriate asymptotic rate.
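The fivefold procedure can be sketched as below. To keep the block short and self-contained, the selection step uses a simple correlation ranking as a stand-in; it is NOT the Dantzig selector, and any selector with the same `(X, y, J)` signature could be plugged in. The function names, the use of mean squared error as the prediction error, and the synthetic 25 × 84 data are all assumptions.

```python
import numpy as np

def top_j_by_correlation(X, y, J):
    """Stand-in selector (not the Dantzig selector): the J columns of X
    with the largest absolute sample correlation with y."""
    Xc = X - X.mean(axis=0)
    score = np.abs(Xc.T @ (y - y.mean())) / np.linalg.norm(Xc, axis=0)
    return np.argsort(score)[::-1][:J]

def choose_J(X, y, k, select=top_j_by_correlation, folds=5, seed=0):
    """Return (J_hat, PE), where PE[J-1] is the fold-averaged prediction error."""
    X = (X - X.mean(axis=0)) / X.std(axis=0)           # standardize the data
    y = y - y.mean()
    n = len(y)
    parts = np.array_split(np.random.default_rng(seed).permutation(n), folds)
    pe = np.zeros(k)
    for test in parts:                                 # each part is a test set once
        train = np.setdiff1d(np.arange(n), test)
        for J in range(1, k + 1):                      # select J variables on training data
            S = select(X[train], y[train], J)
            A = np.column_stack([np.ones(len(train)), X[train][:, S]])
            coef, *_ = np.linalg.lstsq(A, y[train], rcond=None)
            pred = coef[0] + X[test][:, S] @ coef[1:]
            pe[J - 1] += np.mean((y[test] - pred) ** 2) / folds
    return int(np.argmin(pe)) + 1, pe                  # J with the smallest averaged error

# Synthetic data sized like a small study: 25 projects, 84 activities
rng = np.random.default_rng(0)
X = rng.normal(size=(25, 84))
y = 3 * X[:, 0] - 2 * X[:, 5] + 0.1 * rng.normal(size=25)
J_hat, pe = choose_J(X, y, k=6)
```

Because the variables are re-selected inside every fold, the prediction error is attached to the model size J rather than to one fixed set of variables, which is the distinction drawn in the paragraph above.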

It should be emphasized that one pass through steps (a) to (e) constitutes a single fivefold cross-validation. To strengthen the stability of the result, we can repeat this cross-validation many times and choose the $S_{\hat J}$ that is selected consistently across runs.

3.2 OLS Regression Afterwards

After variable selection, we have picked $\hat J$ significant variables, with $\hat J < n$. Suppose that the significance level we set for the model is s. We can now apply regression analysis to derive the needed model.

We first run OLS, regressing y on the selected independent variables $S_{\hat J}$. Then, with the first-round regression result in hand, we examine the p-value of each independent variable. For a given independent variable $X_j$, its p-value tells us the smallest significance level at which the null hypothesis $H_0: \beta_j = 0$ would be rejected given the observed value of the t statistic. Therefore, a large p-value indicates that the corresponding parameter is not significant in the model. Such insignificant parameters should be filtered out of the final model.

Following the Backward Elimination method, we implement the above analysis as follows. Find the largest p-value in the OLS regression result and check whether it is larger than the significance level s. If so, remove the corresponding variable and run the OLS regression again on the remaining variables. Repeat until no p-value is larger than s.
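A minimal sketch of this loop follows, assuming the usual two-sided t-test p-values from an OLS fit with an intercept. The function names, the default level s = 0.05 and the synthetic data are illustrative assumptions rather than the settings used in this study.

```python
import numpy as np
from scipy import stats

def ols_p_values(X, y):
    """OLS with intercept; return (coefficients, two-sided t-test p-values)."""
    n = len(y)
    A = np.column_stack([np.ones(n), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    dof = n - A.shape[1]                       # residual degrees of freedom
    resid = y - A @ coef
    sigma2 = resid @ resid / dof               # unbiased error-variance estimate
    se = np.sqrt(sigma2 * np.diag(np.linalg.inv(A.T @ A)))
    return coef, 2 * stats.t.sf(np.abs(coef / se), dof)

def backward_elimination(X, y, s=0.05):
    """Drop the variable with the largest p-value until none exceeds s."""
    kept = list(range(X.shape[1]))
    while kept:
        coef, pvals = ols_p_values(X[:, kept], y)
        worst = int(np.argmax(pvals[1:]))      # skip the intercept
        if pvals[1:][worst] <= s:
            return kept, coef                  # every remaining variable is significant
        kept.pop(worst)
    return kept, np.array([y.mean()])          # nothing survived: intercept only

# Synthetic data: only columns 0 and 3 truly influence y
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 5))
y = 1.0 + 2.0 * X[:, 0] - 3.0 * X[:, 3] + 0.1 * rng.normal(size=40)
kept, coef = backward_elimination(X, y)
```

In `coef`, entry 0 is the intercept and entry `1 + kept.index(j)` is the coefficient of original column j, so its sign and magnitude can be read off directly.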

We now obtain a regression model in which the independent variables are the activities that have a significant impact on the dependent variable, namely project performance. Each coefficient denotes the impact of the corresponding activity on project performance, including its direction (+ for positive impact, - for negative impact) and relative strength (indicated by the absolute value).

4 Conclusions and Future Work

Understanding the influence of certain activities on project performance can facilitate project management. However, the fact that the number of independent variables (activities in the software process) is usually larger than the size of the dataset creates a barrier to quantitative analysis. The approach we propose in this paper provides a theoretical basis for solving this problem.

We are currently applying this approach to study the influence of the acquirer's participation process on project performance. The investigated acquirer participation process contains 84 activities. In collaboration with several companies in Shandong province, China, we have collected data on 25 projects. The preliminary result (depicted in Fig. 1 and Fig. 2) shows that our approach works well in this setting of 84 independent variables and 25 samples. Further evaluation and improvement of our approach are still under discussion.

Fig. 1. The final regression result. Note that the adjusted R-squared, which indicates the explanatory power of the model, reaches nearly 96.8%.

Fig. 2. The result of the test for normal distribution of the residuals. The plot's axes are labeled Normal F[(uhat-m)/s] and Empirical P[i] = i/(N+1), both ranging from 0.00 to 1.00. As we can see, the points representing the residuals lie close to the 45-degree line, which suggests that the residuals approximately follow a normal distribution.
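For readers who want to reproduce this kind of check, the coordinates of a normal probability plot can be computed as sketched below, following the axis labels of Fig. 2: Empirical P[i] = i/(N+1) against Normal F[(uhat-m)/s], where m and s are the sample mean and standard deviation of the residuals. The function name and the residual values are hypothetical; they are not the data behind the figure.

```python
from statistics import NormalDist, mean, stdev

def normal_pp_points(residuals):
    """Return (empirical, theoretical) coordinates of a normal probability plot."""
    m, s = mean(residuals), stdev(residuals)
    ordered = sorted(residuals)
    N = len(ordered)
    empirical = [(i + 1) / (N + 1) for i in range(N)]               # P[i] = i/(N+1)
    theoretical = [NormalDist().cdf((u - m) / s) for u in ordered]  # F[(uhat-m)/s]
    return empirical, theoretical

# Illustrative residuals only; these are not the paper's data.
resid = [-1.2, -0.7, -0.3, -0.1, 0.0, 0.2, 0.4, 0.6, 1.1]
emp, theo = normal_pp_points(resid)
max_gap = max(abs(e - t) for e, t in zip(emp, theo))   # distance from the 45-degree line
```

Points with a small `max_gap` lie close to the 45-degree line, which is the visual criterion described in the caption.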

References

1. Jonathan E. Cook, Alexander L. Wolf, Automating process discovery through event-data analysis. ICSE '95: Proceedings of the 17th international conference on Software engineering (1995), pp. 73-82.

2. Humphrey, W. S., Managing the Software Process. 1989: Addison-Wesley. Software

Process: A Roadmap.

3. A. Rai, H. Al-Hindi, The effects of development process modeling and task uncertainty on development quality performance, Information & Management 37, 2000, pp. 335–346.

4. J. D. Herbsleb, D. R. Goldenson, A systematic survey of CMM experience and results, Proceedings of ICSE 18, 1996, pp. 323–330.

5. IEEE Std 12207:2008. 2008. Systems and software engineering-Software life cycle

processes Information technology-Software life cycle processes.

6. Zhihao Chen, Daniel Port, Yue Chen, Barry Boehm, Evolving an Experience Base for

Software Process Research, International Workshop on Software Process. 2005.

7. Emmanuel Candes, Terence Tao. The Dantzig selector: statistical estimation when p is

much larger than n. The Annals of Statistics, 2007,35(6):2313-2351.

8. Gareth M. James, Peter Radchenko. 2009. A generalized Dantzig selector with shrinkage

tuning. Biometrika, 2009,96(2):323–337.

9. Gareth M. James, Peter Radchenko, Jinchi Lv. DASSO: connections between the Dantzig selector and Lasso. Journal of the Royal Statistical Society 2009, 71(1):127-142.

10. R. Tibshirani, T. Hastie, B. Narasimhan, G. Chu. Class prediction by nearest shrunken

centroids, with applications to DNA microarrays. Statistical Science, 2003, 18(1):104-117.

11. Bunea, F., Tsybakov, A. B. and Wegkamp, M. H. Aggregation for Gaussian regression.

Ann. Statist. 35 1674–1697. MR2351101. 2007.

12. Shao, J. Linear model selection by cross-validation. Journal of the American Statistical Association 88 486–494. 1993.