customize the applied process. That is, the aim is to include positive-impact activities while
excluding negative-impact activities.
Theoretically, answering these questions calls for establishing a quantitative relationship
between software process and project performance. In particular, taking each activity
in a given process as an independent variable and project performance as the dependent
variable, we need to quantify the influence of the independent variables on the dependent
variable.
In this light, the most commonly employed approach is the Multiple Linear Regression
model, which can be expressed as follows:
$$Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \cdots + \beta_p X_{pi} + \varepsilon_i, \qquad i = 1, 2, \ldots, n$$
where
$Y_i$ = the $i$th observation of the dependent variable $Y$. In our context, $Y_i$ denotes the
project performance of a given project (labeled $i$) in the dataset.
$X_{ji}$ = the $i$th observation of the independent variable $X_j$ ($j = 1, 2, \ldots, p$). In our context,
$X_{ji}$ denotes an activity (labeled $j$) in the process of a given project (labeled $i$) in the
dataset.
$\beta_0$ = the intercept of the equation; $\varepsilon_i$ = the error term.
$\beta_1, \beta_2, \ldots, \beta_p$ = the slope coefficients for each of the independent variables. In our
context, $\beta_j$ denotes the correlation between an activity $X_j$ and the final project
performance $Y$. Such a correlation is derived from the dataset.
$n$ = the number of observations. In our context, $n$ denotes the number of projects we
collected, namely the size of the dataset.
In general, Multiple Linear Regression estimates $\beta_0, \ldots, \beta_p$ through the Ordinary Least
Squares (OLS) criterion, which minimizes the sum of squared residuals:
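$$\min_{\beta_0, \ldots, \beta_p} \sum_{i=1}^{n} \left( Y_i - \beta_0 - \beta_1 X_{1i} - \beta_2 X_{2i} - \cdots - \beta_p X_{pi} \right)^2$$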
When the number of observations ($n$) is larger than the number of independent
variables ($p$), namely $n > p$, the criterion above is equivalent to the $p+1$ first-order
conditions, obtained by setting the partial derivative with respect to each coefficient to
zero. Solving these equations yields the estimates of $\beta_0$ to $\beta_p$, and therefore
formulates the Multiple Linear Regression model.
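As an illustration of solving the p+1 first-order conditions, the following is a minimal sketch in plain Python, not the implementation used in this study. The function names (`solve_linear_system`, `fit_ols`) and the toy data are hypothetical, and exact rational arithmetic is used only to keep the example free of rounding error; in practice a numerical library would normally be used.

```python
from fractions import Fraction


def solve_linear_system(A, b):
    """Solve A x = b by Gauss-Jordan elimination with exact rationals."""
    m = len(A)
    # Augmented matrix [A | b]; Fraction entries avoid rounding error.
    M = [[Fraction(v) for v in row] + [Fraction(rhs)] for row, rhs in zip(A, b)]
    for col in range(m):
        pivot = next((r for r in range(col, m) if M[r][col] != 0), None)
        if pivot is None:
            raise ValueError("singular system: no unique solution")
        M[col], M[pivot] = M[pivot], M[col]
        M[col] = [v / M[col][col] for v in M[col]]  # scale pivot row to 1
        for r in range(m):
            if r != col and M[r][col] != 0:
                factor = M[r][col]
                M[r] = [a - factor * c for a, c in zip(M[r], M[col])]
    return [row[-1] for row in M]


def fit_ols(X, y):
    """Return [beta_0, ..., beta_p] from the p + 1 first-order conditions.

    X holds n rows (projects), each with p activity values;
    y holds the n project-performance values.
    """
    n, p = len(X), len(X[0])
    if n <= p:
        raise ValueError("OLS requires n > p")
    D = [[1] + list(row) for row in X]  # design matrix with intercept column
    k = p + 1
    # Normal equations: (D^T D) beta = D^T y
    DtD = [[sum(d[r] * d[c] for d in D) for c in range(k)] for r in range(k)]
    Dty = [sum(d[r] * yi for d, yi in zip(D, y)) for r in range(k)]
    return solve_linear_system(DtD, Dty)


# Hypothetical toy data: n = 5 projects, p = 2 activities, generated from
# Y = 1 + 2*X1 + 3*X2 without noise, so the coefficients are recovered exactly.
X = [[0, 0], [1, 0], [0, 1], [1, 1], [2, 1]]
y = [1 + 2 * x1 + 3 * x2 for x1, x2 in X]
beta = fit_ols(X, y)
print([float(b) for b in beta])  # prints [1.0, 2.0, 3.0]
```

Because the toy data contain no error term, the estimates coincide with the generating coefficients; with real project data the residuals would be nonzero and the estimates only approximate.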
Unfortunately, this condition commonly does not hold when conducting such studies in
practice. On the one hand, a software process usually embodies a host of activities. For
instance, the international standard IEEE Std 12207:2008 [5], which serves as a major
process framework, contains 123 activities in total. As a result, when taking activities
in a software process as independent variables, the number of independent variables
(p) is usually large. On the other hand, collecting software process and project-related
data is costly and challenging [6], which limits the size of a feasible dataset.
Consequently, the number of observations is usually small relative to the number of
activities. Under such a condition, the OLS criterion is no longer applicable, because
the first-order conditions no longer determine a unique set of estimates.
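The breakdown of OLS when observations are scarce can be checked on a small example. The sketch below is illustrative only: the helper names (`matrix_rank`, `normal_matrix`) are hypothetical, and the data assume, purely for illustration, that each activity is coded as 1 (performed) or 0 (skipped). It builds the matrix of the first-order conditions for n = 3 projects and p = 4 activities and shows that its rank is below its size, so the system has no unique solution.

```python
from fractions import Fraction


def matrix_rank(A):
    """Rank of A by Gaussian elimination over exact rationals."""
    M = [[Fraction(v) for v in row] for row in A]
    rows, cols = len(M), len(M[0])
    rank = 0
    for col in range(cols):
        pivot = next((r for r in range(rank, rows) if M[r][col] != 0), None)
        if pivot is None:
            continue  # no pivot in this column
        M[rank], M[pivot] = M[pivot], M[rank]
        for r in range(rank + 1, rows):
            factor = M[r][col] / M[rank][col]
            M[r] = [a - factor * c for a, c in zip(M[r], M[rank])]
        rank += 1
        if rank == rows:
            break
    return rank


def normal_matrix(X):
    """Build D^T D, where D is X with an intercept column prepended."""
    D = [[1] + list(row) for row in X]
    k = len(D[0])
    return [[sum(d[r] * d[c] for d in D) for c in range(k)] for r in range(k)]


# Hypothetical data: n = 3 projects, p = 4 activities (assumed 0/1 coding).
X = [[1, 0, 1, 0],
     [0, 1, 1, 0],
     [1, 1, 0, 1]]
G = normal_matrix(X)  # (p + 1) x (p + 1) = 5 x 5
print(len(G), matrix_rank(G))  # prints 5 3
```

The matrix has five rows but rank three, which can never exceed the number of projects; the p+1 equations therefore admit infinitely many solutions.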
To address this problem, we propose in this paper a new approach to formulating
the quantitative relationship. Our approach first performs variable selection based on
the Dantzig selector. Then, with the n > p condition satisfied, we iteratively apply
OLS regression following the Backward Elimination method. In this way, the
quantitative relationship between software process and project performance can be derived