should go into the model, how superficially unrelated
data dispersed in the texts should be combined, and
how to obtain a formal model of a process.
The CREWS model (Achour, 1998) is often used
to formulate a conceptual description of a process.
The model specifies two kinds of building blocks, ob-
jects (agents or resources) and actions (atomic actions
or action flows). In a text objects are generally repre-
sented by noun phrases, and actions by verb phrases.
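As an illustration of this correspondence, the mapping from phrase types to CREWS building blocks can be sketched as follows. The class names, the helper function, and the example phrases are invented for illustration; a real system would obtain the phrases from a chunker or parser rather than from hand-written lists.

```python
from dataclasses import dataclass

@dataclass
class CrewsObject:      # CREWS object: an agent or a resource
    phrase: str

@dataclass
class CrewsAction:      # CREWS action: an atomic action or an action flow
    phrase: str

def to_building_blocks(noun_phrases, verb_phrases):
    """Map noun phrases to objects and verb phrases to actions."""
    return ([CrewsObject(np) for np in noun_phrases]
            + [CrewsAction(vp) for vp in verb_phrases])

# Invented example phrases, standing in for chunker output.
blocks = to_building_blocks(["the clerk", "the invoice"],
                            ["checks", "approves"])
```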
Extraction of entities and relations is a standard topic in natural language processing, so it is relatively straightforward to combine the two technologies to match a process with a text that describes it. In
the R-BPD framework prototype (Ghose et al., 2007),
constituents of a process model are extracted with
a combination of template matching, part-of-speech
tagging, and phrase chunking. Other systems take ad-
vantage of more sophisticated natural language pro-
cessing techniques. The use case analysis engine by
Sinha et al. (2008) implements a pipeline that in-
volves lexical processing, parsing, dictionary-based
concepts annotation, anaphora resolution, context an-
notation, and building of process models. A similar
method is implemented by Gonçalves et al. (2009,
2011). Both rely on problem-oriented grammars for
shallow parsing. Friedrich et al. (2011) use the wide-coverage Stanford parser (Manning, 2003) and utilize
WordNet (Miller, 1995) and FrameNet (Baker et al.,
1998) databases to obtain information about semantic
relations, including synonymy. Ackermann and Volz
(2013) describe a prototype system based on recursively defined templates that extracts domain models from text. They adopt feature structures
to model both the templates and the results of natu-
ral language analysis, and apply a variant of the unification algorithm to match the two. This approach makes it possible to map text fragments to elements of an intended domain model and to interactively modify templates based on user feedback. In general, their system is pattern-based and uses unification techniques to compare syntactic tree structures.
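The core operation can be sketched as unification over feature structures represented as nested dictionaries. This is only in the spirit of Ackermann and Volz (2013); their actual representation and algorithm differ in detail, and the example features below are invented.

```python
def unify(a, b):
    """Return the most general structure subsuming both, or None on clash."""
    if a == b:
        return a
    if isinstance(a, dict) and isinstance(b, dict):
        out = dict(a)
        for key, value in b.items():
            if key in out:
                merged = unify(out[key], value)
                if merged is None:
                    return None      # feature clash: unification fails
                out[key] = merged
            else:
                out[key] = value     # feature present in only one structure
        return out
    return None                      # differing atomic values clash

# An invented template and an invented analysis result.
template = {"cat": "NP", "role": "agent"}
analysis = {"cat": "NP", "head": "clerk"}
merged = unify(template, analysis)   # {'cat': 'NP', 'role': 'agent', 'head': 'clerk'}
```

A failed unification, such as matching an `NP` template against a `VP` analysis, returns `None`, which is how a template is rejected for a given text fragment.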
3 LIMITATIONS
It should be obvious from the above that the ability of such systems to cope with the intricacies of natural language is limited by the tools used to produce representations of the entities and events that comprise a process. Common approaches to extraction of a process
cess model from a document rely on a typical text
processing pipeline, which includes text normaliza-
tion, segmentation, tokenization, morphological anal-
ysis, named entity recognition, and syntactic parsing.
Some systems additionally perform semantic parsing
to produce trees labeled with semantic roles of the
constituents. To handle pronouns, an anaphora resolution step may be added to the pipeline. All systems process each sentence independently.
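The pipeline shape described above can be sketched as a chain of stages that each enrich a shared document record. The stages below are trivial stand-ins invented for illustration; real systems plug in an actual tokenizer, tagger, parser, and so on at each position.

```python
def tokenize(doc):
    # Stand-in for real segmentation/tokenization.
    doc["tokens"] = doc["text"].split()
    return doc

def tag(doc):
    # Stand-in for morphological analysis; a real system uses a POS tagger.
    doc["tags"] = [("NOUN" if t[0].isupper() else "X") for t in doc["tokens"]]
    return doc

def run_pipeline(text, stages):
    """Apply each stage in order to a shared document dict."""
    doc = {"text": text}
    for stage in stages:
        doc = stage(doc)
    return doc

doc = run_pipeline("The Clerk checks the Invoice", [tokenize, tag])
```

Because each call to `run_pipeline` sees one text in isolation, the sketch also makes the sentence-by-sentence limitation visible: no state is carried from one document to the next.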
A serious limitation of current approaches is their
inability to consistently handle multiple descriptions
of the same process. Process descriptions supplied by
users contain errors due to the lack of proof-reading
and tend to describe only those parts of processes that
their authors are aware of or interested in. Documents
by different authors can be inconsistent with respect
to their vocabulary. As a result, these systems are gen-
erally designed to extract a model of a single process
from a single document. Currently, processing multiple sources requires generating a model for each document and then merging the results manually, a step that should be automated. A shared vocabulary is even more important in an enterprise environment, where other knowledge-based systems are
available. The process extraction system should be
able to align extracted entities with existing ontolo-
gies.
To create a formal model of a process, a set of
hand-crafted rules is used by each system. This ap-
proach is perfectly valid for building models with a
known structure. However, facts that do not fit
into the structure are ignored, making such systems
less convenient for exploratory analysis, and restrict-
ing the number of texts that can be processed. Further,
a known limitation of systems that use explicit rules
is the difficulty of rule modification. It is desirable to
have simpler rules that could be learned from corpora,
and to be able to combine them with higher-level pro-
cess analysis algorithms.
In sum, to overcome the limitations of the current
approaches it should be made possible to automati-
cally construct entity and event representations that
could integrate easily into existing natural language processing pipelines, support easy alignment with external knowledge bases and ontologies, be expressive enough to allow flexible data integration across sentence boundaries, and be formally compatible with
existing frameworks of process extraction. To achieve
this we propose a method of semantic unification for
entity resolution and data reconciliation based on fea-
ture structures. To build a process model, we adopt
the methods of process mining (van der Aalst, 2011),
where the text is translated into a sequence of ordered
events, and a process mining algorithm is invoked to
infer a model that fits this sequence.
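As a minimal illustration of this last step, an ordered event sequence can be turned into a directly-follows relation, which many process mining algorithms (see van der Aalst, 2011) take as their starting point. The traces below are invented examples, not output of the proposed system.

```python
from collections import defaultdict

def directly_follows(traces):
    """Count how often activity a is immediately followed by activity b."""
    df = defaultdict(int)
    for trace in traces:
        for a, b in zip(trace, trace[1:]):
            df[(a, b)] += 1
    return dict(df)

# Invented event sequences, as might be extracted from two texts.
traces = [
    ["receive order", "check stock", "ship goods"],
    ["receive order", "check stock", "cancel order"],
]
relation = directly_follows(traces)
```

A discovery algorithm then generalizes this relation into a model, e.g. recognizing that `ship goods` and `cancel order` are alternative branches after `check stock`.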
Process Extraction from Texts using Semantic Unification