phase. By continuously observing human problem-resolving capabilities (e.g., in case of system errors, unexpected system behavior, or changing forms), RPA tools can adapt and handle non-standard cases (Aalst et al., 2018). Moreover, process mining can also be used to continuously improve the orchestration of work between systems, robots, and people.
In (Geyer-Klingeberg et al., 2018) it is shown how Celonis aims to support organizations throughout the whole lifecycle of RPA initiatives. Three steps are identified: (1) assessing RPA potential using process mining (e.g., identifying processes that are scalable, repetitive, and standardized), (2) developing RPA applications (e.g., supporting training and comparison between humans and robots), and (3) safeguarding RPA benefits (e.g., identifying concept drift and compliance checking). The “automation rate” can be added as a performance indicator to quantify RPA initiatives.
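To make such a KPI concrete, consider the following minimal sketch of an automation-rate computation over event data. This is not Celonis functionality; the attribute names and the rule that an event counts as “automated” when its resource carries a bot prefix are illustrative assumptions.

# Sketch: computing an "automation rate" KPI from event data.
# Assumptions (not from the paper): events carry a "resource"
# attribute, and software robots are marked with a "robot_" prefix.

def automation_rate(events):
    """Fraction of events executed by software robots."""
    if not events:
        return 0.0
    automated = sum(1 for e in events if e["resource"].startswith("robot_"))
    return automated / len(events)

events = [
    {"activity": "CreatePO", "resource": "robot_17"},
    {"activity": "SendPO",   "resource": "robot_17"},
    {"activity": "Payment",  "resource": "mary"},
]
print(automation_rate(events))  # 2/3, i.e., approx. 0.67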
In (Leno et al., 2020) the term Robotic Process Mining (RPM) is introduced to refer to “a class of techniques and tools to analyze data collected during the execution of user-driven tasks in order to support the identification and assessment of candidate routines for automation and the discovery of routine specifications that can be executed by RPA bots”. The authors propose a framework and RPM pipeline combining RPA and process mining, and identify challenges related to recording, filtering, segmentation, simplification, identification, discovery, and compilation.
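The segmentation challenge can be illustrated with a small sketch: a flat stream of recorded user interactions is cut into candidate routine instances. The delimiter-based heuristic (cutting whenever the hypothetical marker action “OpenForm” occurs) is an assumption for illustration only, not the technique proposed by the authors.

# Sketch: segmenting a recorded user-interaction stream into
# candidate routine instances. The delimiter action "OpenForm"
# is a hypothetical start-of-routine marker.

def segment(stream, delimiter="OpenForm"):
    routines, current = [], []
    for action in stream:
        if action == delimiter and current:
            routines.append(current)  # close the previous routine
            current = []
        current.append(action)
    if current:
        routines.append(current)
    return routines

stream = ["OpenForm", "CopyCell", "PasteField", "Submit",
          "OpenForm", "CopyCell", "PasteField", "Submit"]
print(segment(stream))
# [['OpenForm', 'CopyCell', 'PasteField', 'Submit'],
#  ['OpenForm', 'CopyCell', 'PasteField', 'Submit']]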
Several vendors (e.g., Celonis, myInvenio, NikaRPA, UiPath) recently adopted the term Task Mining (TM) to refer to process mining based on user-interaction data (complementing business data). These user-interaction data are collected using task recorders (similar to spyware monitoring specific applications) and OCR technology to create textual data sets. Often, screenshots are taken to contextualize the actions taken by the user. Natural Language Processing (NLP) and data mining techniques (e.g., clustering) are used to enrich the event data. The challenge is to match user-interaction data based on identifiers, usernames, keywords, and labels, and to connect the different data sources. Note that the usage of task mining is not limited to automation initiatives. It can also be used to analyze compliance and performance problems (e.g., decisions taken without looking at the underlying information). Screenshots can also be used to interpret and contextualize deviating behavior. For example, such an analysis can reveal time-consuming workarounds due to system failures.
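The matching challenge mentioned above can be made concrete with a small correlation sketch: user-interaction events are linked to business cases via identifiers found in recorded text. The field names and the identifier pattern are illustrative assumptions, not a vendor API.

# Sketch: correlating user-interaction events with business cases
# via shared identifiers. Field names ("text", "action") and the
# "QR<digits>" identifier pattern are illustrative assumptions.

import re

business_cases = {"QR5753", "QR5754"}

ui_events = [
    {"user": "mary", "action": "PasteField", "text": "order QR5753"},
    {"user": "mary", "action": "Click",      "text": "Save"},
]

def match_case(event, cases):
    """Return the first known case id mentioned in the event text."""
    for token in re.findall(r"QR\d+", event["text"]):
        if token in cases:
            return token
    return None  # no identifier found: the event stays unmatched

for e in ui_events:
    print(e["action"], "->", match_case(e, business_cases))
# PasteField -> QR5753
# Click -> None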
4 DEFINING VARIABILITY
The Pareto principle (Pareto, 1896) can be observed in many domains, e.g., the distribution of wealth, failure rates, and file sizes. As shown in Figure 1, this phenomenon can also be observed in process mining. Often, a small percentage of activities accounts for most of the events, and a small percentage of trace variants accounts for most of the cases. When present, the Pareto distribution can be exploited to discover process models describing mainstream behavior. However, for larger processes with more activities and longer traces, the Pareto distribution may no longer be present. For example, it may be that most traces are unique. In such cases, one needs to abstract or remove activities in the log to obtain a Pareto distribution and separate mainstream from exceptional behavior.
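A minimal sketch of this check: count the trace variants of a log and determine which fraction of variants is needed to cover a given share of the cases. The 80% threshold below is just the classical Pareto reading, not a value from the paper.

# Sketch: checking whether a log's trace variants follow a
# Pareto-like distribution (few variants cover most cases).

from collections import Counter

def variant_coverage(traces, case_share=0.8):
    """Fraction of variants needed to cover `case_share` of all cases."""
    counts = sorted(Counter(traces).values(), reverse=True)
    total, covered = sum(counts), 0
    for i, c in enumerate(counts, start=1):
        covered += c
        if covered >= case_share * total:
            return i / len(counts)
    return 1.0

log = [("a", "b", "c")] * 80 + [("a", "c", "b")] * 15 + [("b",)] * 5
print(variant_coverage(log))  # 1/3: one of three variants covers 80% of cases

A value close to 1 signals the opposite situation discussed above: most traces are (nearly) unique and no Pareto distribution is present.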
The goal of this section is to discuss the notion of variability in process mining. To keep things simple, we focus on control-flow only. Formally, events can have any number of attributes and also refer to properties of the case, resources, costs, etc. In the context of RPA, events can also be enriched with screenshots, text fragments, form actions, etc. These attributes will make any case unique. However, even when all cases are unique, we would still like to quantify variability. Therefore, the principles discussed below are generic and also apply to other attributes.
As motivated above, we only consider activity labels and the ordering of events within cases. Consider again the simplified event log fragment in Table 1. In our initial setting, we only consider the activity column. The case id column is only used to correlate events, and the timestamp column is only used to order events. All other columns are ignored. This leads to the following standard definition.
Definition 1 (Traces). A is the universe of activities. A trace t ∈ A∗ is a sequence of activities. T = A∗ is the universe of traces.
Trace t = ⟨CreatePO, SendPO, RecOrder, RecInv, Payment⟩ ∈ T refers to 5 events belonging to the same case (case QR5753 in Table 1). An event log is a collection of cases, each represented by a trace.
Definition 2 (Event Log). L = B(T) is the universe of event logs. An event log L ∈ L is a finite multiset of observed traces.
An event log is a multiset of traces. Event log L = [⟨CreatePO, SendPO, RecOrder, RecInv, Payment⟩⁵, ⟨CreatePO, Cancel⟩², ⟨SendPO, RecInv, RecOrder, Payment⟩³] refers to 10 cases (i.e., |L| = 10). In the remainder, we use single letters for activities to ensure a compact representation.
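In code, Definitions 1 and 2 map naturally onto tuples and multisets. The following sketch uses Python's Counter as the multiset B(T); the representation choice is ours, not part of the definitions.

# Sketch: traces as tuples of activity labels (Definition 1) and
# an event log as a multiset of traces (Definition 2), using
# collections.Counter as the multiset B(T).

from collections import Counter

L = Counter({
    ("CreatePO", "SendPO", "RecOrder", "RecInv", "Payment"): 5,
    ("CreatePO", "Cancel"): 2,
    ("SendPO", "RecInv", "RecOrder", "Payment"): 3,
})

print(sum(L.values()))  # |L| = 10 cases
print(len(L))           # 3 distinct trace variants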