process activities, i.e. P-Data = V_in ∪ V_on, may not
be maintained in the process data store in its
entirety, since only process data items v ∈ P-Data
for which either Source(v) = {Internal} or
Destination(v) = {Internal} will be maintained.
Ideally, each data item belonging to this subset of
internal data, and especially those data items that also
belong to the reference and operational data sets (i.e.
R_in ∪ O_in ∪ R_on ∪ O_on), should be profiled
against the corporate master data. We refer to
this subclass of data as Quality Sensitive Data
(QSD).
Profiling essentially entails mapping all items
of QSD to corporate master data items. Profiling
will thus not only enable the organization to
identify quality sensitive data relevant to the
process, but also help ensure that corporate
master data repositories do not neglect such
process-critical data. The profiling task identifies
a need for tool support within the BP/DQM
component that enables this mapping to be
undertaken.
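As an illustration, the profiling step could be sketched as follows. This is a hypothetical sketch only: the item names, the dictionary-based stores and the profile_qsd helper are assumptions of this example, not part of the proposed BP/DQM.

```python
# Hypothetical sketch of QSD profiling: select internal reference and
# operational data items and map them against corporate master data.
# All names and data structures here are illustrative assumptions.

# Process data items tagged with the Source(v) classification and a
# reference/operational class, following the discussion above.
process_data = {
    "cust_name":  {"source": "Internal", "class": "reference"},
    "order_qty":  {"source": "External", "class": "operational"},
    "panel_code": {"source": "Internal", "class": "reference"},
    "site_ref":   {"source": "Internal", "class": "operational"},
}

# Corporate master data repository, here simply a set of known item keys.
master_data = {"cust_name", "panel_code", "product_id"}

def profile_qsd(process_data, master_data):
    """Select quality sensitive data (QSD) and split it into items that
    map onto the master data and items the master data is missing."""
    qsd = {name for name, tags in process_data.items()
           if tags["source"] == "Internal"
           and tags["class"] in ("reference", "operational")}
    return qsd & master_data, qsd - master_data

mapped, unmapped = profile_qsd(process_data, master_data)
# mapped   -> QSD items already covered by the master data
# unmapped -> process-critical items the master data repository lacks
```

The unmapped set is precisely what makes profiling useful: it surfaces process-critical data of which the master data repository would otherwise remain negligent.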
2 - Data Linking - Establishing the link between
the master data and the various enterprise data sources is
a critical step in the overall data quality protocol.
The ability to consolidate master data from disparate
systems within the enterprise into a centralized
repository is already available to some extent, see
e.g. SAP NetWeaver Master Data Management.
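A minimal sketch of such consolidation is given below, assuming dictionary-based source systems and a shared record identifier; both are assumptions of this illustration, not of any particular product.

```python
def consolidate(sources):
    """Merge master data records from disparate source systems into one
    centralized repository, keyed by a shared record identifier.
    On conflicting fields the first system seen wins, and every
    contributing system is recorded so the link back to the original
    enterprise data source is preserved."""
    repository = {}
    for system, records in sources.items():
        for record_id, fields in records.items():
            entry = repository.setdefault(record_id, {"systems": []})
            entry["systems"].append(system)
            for field, value in fields.items():
                entry.setdefault(field, value)  # keep first-seen value
    return repository

# Two hypothetical enterprise systems holding overlapping customer data.
sources = {
    "crm": {"C1": {"name": "Acme Pty Ltd", "city": "Brisbane"}},
    "erp": {"C1": {"name": "ACME", "abn": "12 345 678 901"},
            "C2": {"name": "Widget Co"}},
}
repo = consolidate(sources)
```

Recording the contributing systems per record is what establishes the link back from the centralized repository to the enterprise data sources.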
3 - Data Refresh - Potentially, any data item
read from the process data store may be stale,
although the serious impact of reading stale data is
likely limited to QSD. Thus the read operation on the
process data store initiated by the process enactment
system (see Figure 1) must trigger a refresh from
the corporate master data through BP/DQM.
Profiling and linking of QSD reduce this
problem to a simple search, read and upload. In the
absence of the profiling/linking steps, the difficulty
of finding the latest version of the QSD within
enterprise application databases is rather evident.
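The refresh behaviour of the read operation can be sketched as below. The two dictionary stores and the qsd_keys set are assumptions of this illustration, standing in for the process data store, the corporate master data and the profiled QSD respectively.

```python
def read_item(name, process_store, master_store, qsd_keys):
    """Read a data item from the process data store. For profiled QSD,
    first refresh the (potentially stale) value from the corporate
    master data, mimicking the BP/DQM-mediated read."""
    if name in qsd_keys and name in master_store:
        process_store[name] = master_store[name]  # refresh before read
    return process_store[name]

process_store = {"panel_code": "HDWP33", "order_qty": 40}  # possibly stale
master_store  = {"panel_code": "High Density Wide Panel 33"}
qsd_keys = {"panel_code"}  # result of the earlier profiling step

value = read_item("panel_code", process_store, master_store, qsd_keys)
qty   = read_item("order_qty", process_store, master_store, qsd_keys)
```

Note that only QSD triggers the refresh; non-sensitive items such as order_qty are served directly from the process data store.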
4 - Data Unification - Unification is arguably
the most difficult part of this protocol. A
write/update of any item in QSD can potentially
introduce inconsistency. As a result, the BP/DQM must trigger
a semantic lookup in the corporate master data in
order to provide a unified view of process data.
Such a lookup must determine whether a
particular data value is represented differently
in the corporate master data, and if so, what the
preferred value is. For example, “High Density WP
33'”, “Wide Panel 33' HD” and “HDWP33” may all
represent the same entity. Research results on text
similarity (Gravano et al., 2003) may be used in this
regard to a limited extent. Building synonym listings
(Koudas et al., 2004) as part of the corporate master data
may assist further, but in most cases human
intervention may be required to determine the semantic
equivalence of two data values.
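The semantic lookup can be approximated in two stages, as sketched below: an exact lookup in a synonym listing, followed by a text-similarity fallback. The difflib-based similarity is only a stand-in for text-join techniques in the style of Gravano et al. (2003), and the synonym table, preferred values and threshold are all assumptions of this illustration.

```python
import difflib

def unify(value, synonyms, preferred, threshold=0.6):
    """Resolve a data value to its preferred master data representation.
    Returns None when no confident match is found, i.e. the case that
    requires human intervention."""
    if value in synonyms:                       # stage 1: synonym listing
        return synonyms[value]
    # stage 2: text-similarity fallback over the preferred values
    best = max(preferred, key=lambda p: difflib.SequenceMatcher(
        None, value.lower(), p.lower()).ratio())
    score = difflib.SequenceMatcher(None, value.lower(), best.lower()).ratio()
    return best if score >= threshold else None

synonyms  = {"HDWP33": "High Density Wide Panel 33"}
preferred = ["High Density Wide Panel 33", "Standard Panel 20"]

a = unify("HDWP33", synonyms, preferred)           # exact synonym hit
b = unify("Wide Panel 33 HD", synonyms, preferred) # similarity match
c = unify("Blue Widget", synonyms, preferred)      # no confident match
```

The None outcome is deliberate: rather than guessing, the lookup defers to human intervention, in line with the observation above that automated similarity helps only to a limited extent.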
5 CONCLUSIONS
The induction of data quality protocols should take
place within business process management systems,
as business processes typically provide the first point
of contact for enterprise applications through which
enterprise data is created and maintained. We have
undertaken a detailed analysis of process relevant
data, outlining its properties as well as typical errors.
The enhanced understanding of process data through
this analysis has led to the development of an
extended BPMS reference architecture that proposes
an additional component, namely the BP/Data
Quality Monitor (BP/DQM).
The scope of this paper covers a basic discussion
on BP/DQM functionality. The proposed protocol
needs to be developed at a finer level in order to
fully demonstrate the capability (and limitations) of
the proposed BP/DQM. In particular, the semantic
lookup required for data unification (addressing the
problem of inconsistent data), holds many
challenges. This aspect of the problem is the current
focus of our work.
REFERENCES
Butler Group, 2006. Data Quality and Integrity -
Ensuring Compliance and Best Use for Organizational
Data Assets. Feb 2006.
Gravano, L., Ipeirotis, P. G., Koudas, N., Srivastava, D.,
2003. Text joins for data cleansing and integration in
an RDBMS. In Proceedings of the 19th International
Conference on Data Engineering, IEEE Computer
Society, 2003.
Koudas, N., Marathe, A., Srivastava, D., 2004. Flexible
string matching against large databases in practice. In
Proceedings of the Thirtieth International Conference
on Very Large Data Bases, Morgan Kaufmann, 2004.
Leymann, F., Roller, D., 2000. Production Workflow:
Concepts and Techniques. Prentice-Hall of
Australia Pty. Limited, Sydney.
Rahm, E., Bernstein, P. A., 2001. A survey of approaches to
automatic schema matching. The VLDB Journal 10(4),
334-350.
Redman, T., 1996. Data Quality for the Information
Age. Artech House, 1996.
Sadiq, S., Orlowska, M., Sadiq, W., Foulger, C., 2004.
Data Flow and Validation in Workflow
Modelling. In The Fifteenth Australasian Database
Conference, Dunedin, New Zealand, January 18-22,
2004.
ICEIS 2007 - International Conference on Enterprise Information Systems