Converted into actionable knowledge, data has acquired unprecedented economic value, as witnessed by the success of companies like Google, Facebook, and Twitter. Big Data has also revolutionized software engineering research and its scientific methodologies, leading to a paradigm shift away from data-scarce, static, coarse-grained, and simple studies towards data-rich, dynamic, high-resolution, and complex observations and simulations.
However, Big Data has not yet been fully exploited in estimation model building, validation, and exploitation. One reason might be that the relevant data is distributed among different stakeholders in an organization; e.g., relevant impact factors of development time are task complexity (estimated by the development team), team competence and availability (assessed by the personnel lead), and process and environment maturity (evaluated by a CTO). Another reason might be that the relevant data becomes available at different points in time. For instance, the aforementioned estimation factors are available before the project starts, while the actual ground truth only becomes available afterward. To complicate matters further, data points can be generated in different organizations and, hence, are hard to collect and exploit jointly.
As a consequence, the currently suggested estimation models exhibit many shortcomings, despite all the benefits of cloud and Big Data technologies. Estimation models are built from data collected years earlier; measured against today's software development practices, they are outdated. Consequently, they emphasize factors that are nowadays irrelevant and neglect factors of high contemporary importance. Their model constants are often outdated too and cannot be adapted to today's technologies without significant effort. There is no automatic, continuous learning and improvement.
The models are incomplete, e.g., they estimate the size but not the effort, or they estimate the effort based on the size but do not estimate the size itself. They are also incompatible with each other, i.e., they require similar, though not identical, factors. As an example, development environment and product factors are regarded as impacting both the estimated system size and the development effort, which gives these factors too much weight overall.
Developing new estimation models does not solve the aforementioned shortcomings if the process of their development does not change. We claim that estimation model development faces some inherent problems. New models cannot be tested without significant cost in time and effort. The errors and, hence, the suitability of these models cannot be calculated automatically and in a universal way. Also, some estimation factors, e.g., parameters of a project, partially overlap or depend on each other, i.e., they correlate. If the correlation is strong, they are redundant and therefore superfluous. To date, this can only be recognized and optimized with manual effort, if it is done at all.
To train a new model, more data points have to be collected first. Given that there are often not enough data points available, or that they do not exist in a uniform structure because no uniform process exists for collecting the data, we face a so-called "cold start problem". This requires a lot of effort for data collection and model estimation before any value is gained. However, even imprecise objective estimation models have value compared to, and complementing, subjective assessments.
1.3 Research Aims
This research contributes an approach for testing and improving estimation models that map, e.g., cost drivers to costs. The approach shall be agnostic with respect to the specific domain and help test and improve models mapping any indicator of an outcome to the actual outcome.
Additionally, the following requirements shall be
met: The approach is data-driven and improves with
newly available data points. In such an approach, new
model ideas should be easily trained and tested with
old data points. New models should coexist and com-
pete with previously defined models. The approach
supports the calculation and comparison of the accu-
racy of all competing models based on all data points
and adjusts the parameters of each model to minimize
potential errors. The approach should also be under-
standable by human experts.
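One possible reading of the accuracy-comparison requirement is sketched below: competing models are ranked on the same data points using the mean magnitude of relative error (MMRE), a common accuracy metric in effort estimation. The two models, the data points, and the choice of MMRE are illustrative assumptions, not the method proposed here:

```python
# Sketch: rank competing estimation models by mean magnitude of
# relative error (MMRE) over shared data points. Models and data
# are hypothetical.

def mmre(model, data):
    """Mean magnitude of relative error over (input, actual) pairs."""
    return sum(abs(model(x) - y) / y for x, y in data) / len(data)

# Hypothetical (size, actual effort) observations.
data = [(10, 52), (20, 98), (30, 160), (40, 205)]

# Two competing model ideas over the same factor.
models = {
    "linear":    lambda size: 5.0 * size,
    "power_law": lambda size: 6.5 * size ** 0.95,
}

# Lower MMRE means higher accuracy; the best model comes first.
ranking = sorted(models, key=lambda name: mmre(models[name], data))
```

Because every model is scored on all available data points with the same metric, newly defined models can immediately coexist and compete with previously defined ones, and the ranking updates as data points arrive.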
The goal of this work is not to develop one or more concrete, accurate estimation models. Because of the cold start problem, and because technologies nowadays change ever faster, which makes continuous learning of the coefficients more important, our approach should help improve models and increase their accuracy over time. Moreover, it should not matter what a model looks like, i.e., what class of functions it exploits, nor what its domain is. Instead, the approach shall be universal. If sufficient initial data points are available, arbitrary models can be tested, trained, iteratively improved, and eventually used.
The following research questions guided our
work:
RQ1 How can we define a continuously improving
approach where new models are easy to imple-
ment, validate, and adjust?
ICSOFT 2019 - 14th International Conference on Software Technologies