ities), and for possibly enriching these traces with
derived/aggregated data. In this way, a high-quality
process-oriented view of bug histories can be obtained
and analyzed with existing (or novel) process min-
ing methods, in order to eventually build a predictive
model capable of estimating, at run time, the remaining
fix time of a bug. The approach has been implemented
in a system prototype, offering an integrated
and extensible set of data-transformation and predic-
tive learning tools.
By virtue of its generality and flexibility, the pro-
posed approach can be applied profitably to a variety
of real-life bug repositories, while allowing the ana-
lyst to customize the discovery of a fix-time model
to the specific data schema and business rules of the
repository under analysis. Moreover, as the approach
only assumes that each log event represents a modi-
fication to a case attribute, it can be easily extended
to analyze the logs of other loosely structured process
management systems (e.g., issue-tracking systems
or data-centric transactional systems).
The rest of the paper is structured as follows. Sec-
tion 2 summarizes some relevant related works, and
the main points of novelty of our proposal. After in-
troducing a few basic concepts in Section 3, we illus-
trate, in Section 4, our core log-abstraction methods.
The overall discovery approach and the implemented
system are presented in Sections 5 and 6, respectively.
We then discuss a series of tests in Section 7, and draw
some concluding remarks in Section 8.
2 RELATED WORK
Previous approaches to the forecasting of bug fix
times mainly rely on the application of classical learn-
ing methods, devised for analysing propositional data
labelled with a discrete or numerical target. In par-
ticular, linear regressors and random-forest classi-
fiers were trained in (Anbalagan and Vouk, 2009) and
in (Marks et al., 2011), respectively, in order to pre-
dict bug lifetimes, using different bug attributes as
input variables. Various standard classification algorithms
were instead exploited in (Panjer, 2007) for
the same purpose. Decision trees were also exploited
in (Giger et al., 2010) to estimate how promptly a new
bug report will receive attention. Moreover, a stan-
dard linear regression method was used in (Hooimei-
jer and Weimer, 2007) to predict whether a bug report
will be triaged within a given amount of time.
As mentioned above, none of these approaches explored
the possibility of refining such a preliminary estimate
subsequently, as the bug undergoes different
treatments and modifications. The only (partial)
exception is the work in (Panjer, 2007), where
some information gathered after the creation of a bug
is used as well, but only for the special case of unconfirmed
bugs, and only up to the moment of their acceptance.
By contrast, we want to exploit the rich
amount of log data stored for the bugs (across their
entire life), in order to build a history-aware predic-
tion model, providing accurate run-time forecasts for
the remaining fix time of new (unfinished) bug cases.
Predicting processing times is the goal of an
emerging research stream in the field of Process Min-
ing, which specifically addresses the induction of
state-aware performance models out of historical log
traces. In particular, the discovery of an annotated
finite-state model (AFSM) was proposed in (van der
Aalst et al., 2011), where the states correspond
to abstract representations of log traces, and store
processing-time estimates. This learning approach
was combined in (Folino et al., 2012; Folino et al.,
2013) with a predictive clustering scheme, where the
initial data values of each log trace are used as de-
scriptive features for the clustering, and its associated
processing times as target features. By reusing ex-
isting induction methods, each discovered cluster is
then equipped with a distinct prediction model — pre-
cisely, an AFSM in (Folino et al., 2012), and classical
regression models in (Folino et al., 2013).
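The idea behind such an annotated finite-state model can be illustrated with a minimal sketch. It is not the algorithm of (van der Aalst et al., 2011): the abstraction used here (the last `horizon` activities of a trace prefix), the averaging of remaining times, and all function names are assumptions made only for illustration.

```python
from collections import defaultdict
from statistics import mean


def abstract_state(activities, horizon=2):
    """Abstract a trace prefix as the tuple of its last `horizon` activities.

    This particular abstraction is an assumption of the sketch.
    """
    return tuple(activities[-horizon:])


def learn_afsm(traces, horizon=2):
    """Learn a mapping from abstract states to average remaining times.

    `traces` is a list of completed traces, each a time-ordered list of
    (activity, timestamp) pairs.
    """
    samples = defaultdict(list)
    for trace in traces:
        end_time = trace[-1][1]
        activities = [activity for activity, _ in trace]
        for i, (_, timestamp) in enumerate(trace):
            state = abstract_state(activities[: i + 1], horizon)
            # Annotate the state with the time still needed to finish.
            samples[state].append(end_time - timestamp)
    return {state: mean(values) for state, values in samples.items()}


def predict_remaining(model, partial_trace, horizon=2, default=None):
    """Estimate the remaining time of an unfinished trace.

    Returns `default` when the abstract state was never seen in training.
    """
    activities = [activity for activity, _ in partial_trace]
    return model.get(abstract_state(activities, horizon), default)


if __name__ == "__main__":
    history = [
        [("new", 0), ("assigned", 2), ("resolved", 10)],
        [("new", 0), ("assigned", 4), ("resolved", 8)],
    ]
    model = learn_afsm(history)
    print(predict_remaining(model, [("new", 0), ("assigned", 3)]))
```

Each state stores one estimate, so the forecast for a running case is revised every time a new event moves the case to a different state.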
Unfortunately, these Process Mining techniques
rely on a process-oriented representation of system
logs, where each event refers to a well-specified task;
conversely, common bug tracking systems just regis-
ter bug attribute updates, with no link to resolution
tasks. To overcome this limitation, we try to help
the analyst extract high-level activities out of bug his-
tory records, by providing her/him with a collection
of data transformation methods, tailored to fine-grained
attribute-update records, like those stored in bug logs.
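To make the mismatch concrete, the following sketch turns raw attribute-update records into a trace of activity labels. It is only a hypothetical illustration and does not reproduce the transformation methods proposed in this paper: the record layout, the attribute names, and the labelling rule are all assumptions.

```python
from typing import NamedTuple


class UpdateEvent(NamedTuple):
    """One attribute-update record of a bug history (hypothetical layout)."""
    bug_id: int
    timestamp: float
    attribute: str
    old_value: str
    new_value: str


# Attributes whose new value is kept in the activity label (an assumption).
VALUE_SENSITIVE = {"status", "resolution"}


def to_activity(event):
    """Map an attribute update to a coarse activity label."""
    if event.attribute in VALUE_SENSITIVE:
        return f"{event.attribute}:={event.new_value}"
    return f"{event.attribute} changed"


def to_trace(events):
    """Build a time-ordered trace of (activity, timestamp) pairs."""
    ordered = sorted(events, key=lambda e: e.timestamp)
    return [(to_activity(e), e.timestamp) for e in ordered]


if __name__ == "__main__":
    events = [
        UpdateEvent(1, 5.0, "status", "NEW", "ASSIGNED"),
        UpdateEvent(1, 2.0, "priority", "P3", "P1"),
    ]
    print(to_trace(events))
```

Keeping the new value only for a few attributes is one way to limit the number of distinct activity labels; which attributes deserve this treatment depends on the repository under analysis.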
The capability of derived data to improve fix-
time predictions was pointed out in (Bhattacharya and
Neamtiu, 2011), where a few summary statistics and
derived properties were computed for certain Bugzilla
repositories, in a pre-processing phase. We attempt
to generalize such an approach, by devising an extensible
set of data transformation and data aggregation/abstraction
mechanisms, which allow the analyst to extract and
evaluate such derived features for a generic bug log.
3 PRELIMINARIES
In order to make the discourse concrete, let us fo-
cus on the structure of a bug repository developed
with Bugzilla (http://www.bugzilla.org), a general-purpose
bug-tracking platform, devoted to support
ICEIS 2014 - 16th International Conference on Enterprise Information Systems