Unobtrusive Integration of Data Quality in Interactive Explorative Data Analysis
Michael Behringer (https://orcid.org/0000-0002-0410-5307), Pascal Hirmer (https://orcid.org/0000-0002-2656-0095), Alejandro Villanueva (https://orcid.org/0000-0002-9311-5573), Jannis Rapp and Bernhard Mitschang (https://orcid.org/0000-0003-0809-9159)
Institute of Parallel and Distributed Systems, University of Stuttgart, Universitätsstr. 38, D-70569 Stuttgart, Germany
{firstname.lastname}@ipvs.uni-stuttgart.de
Keywords:
Data Quality, Explorative Data Analysis, Human-in-the-Loop, Data Mashups.
Abstract:
The volume of data to be analyzed has increased tremendously in recent years. To extract knowledge from
this data, domain experts gain new insights using graphical analysis tools for explorative analyses. Here, the
reliability and trustworthiness of an explorative analysis are determined by the quality of the underlying data.
Existing approaches require a manual inspection to ensure data quality. This inspection is frequently neglected,
partly because domain experts often lack the necessary technical knowledge. Moreover, they might need many
different tools for this purpose. In this paper, we present a novel interactive approach to integrate data quality
into explorative data analysis in an unobtrusive manner. Our approach efficiently combines the strengths of
different experts, which is currently not supported by state-of-the-art tools, thereby allowing domain-specific
adaptation. We implemented a fully working prototype to demonstrate the ability of our approach to support
domain experts in explorative data analysis.
1 INTRODUCTION
Nowadays, more data are generated than ever before
in history (Reinsel et al., 2018), and data are the foun-
dation of almost all business processes and strategic
decisions (Grover and Kar, 2017). Oftentimes, at the
beginning of the analysis, the exact methodology is
still unclear. In this case, one speaks of an explorative analysis,
in which it must first be decided which data sources
to use, which data cleaning steps to conduct, and so
on (Polyzotis et al., 2018). Here, data analysis pro-
cesses, such as the KDD process (Fayyad et al., 1996)
or CRISP-DM (Shearer, 2000), provide guidance on
how to proceed with the analysis and are structured in
a highly iterative manner, i.e., with a high number of
feedback loops to incorporate new findings and con-
tinuously improve the analysis.
In many cases, however, the exploratory anal-
ysis is not performed by Data Scientists with
in-depth technical knowledge but by domain ex-
perts (Behringer et al., 2017). For this purpose, there
are, for instance, the approaches Self-Service Busi-
ness Intelligence (Alpar and Schulz, 2016) or Visual
Analytics (Thomas and Cook, 2005). However, the
former follows predefined analysis paths while the
latter only solves specific challenges (Keim et al.,
2010; Stodder, 2015). To provide more freedom for
domain experts, graphical data analysis tools are of-
ten used. With the help of these tools, data sources
are graphically connected with operators (e.g., con-
ducting data preprocessing) in an intuitive way, thus,
specifying the analysis workflow. However, a signif-
icant challenge here is to assess the underlying data
quality since the validity of the analysis results can-
not be guaranteed if the data quality is insufficient. In
common data analysis tools, the data quality can be
evaluated, but doing so is neither intuitive nor achievable
without manual effort.
Figure 1: Manual evaluation of data quality in state-of-the-art approaches.
Figure 1 shows such an analysis workflow en-
riched by data quality inspection. As can be seen,
an additional operator "Statistics" has to be added in
each step of the analysis to evaluate the data quality.
Two major problems arise with this approach:
(1) the statistics displayed here are not customized to
the data but generic, and, thus, their interpretation requires
appropriate knowledge, and (2) a domain expert tends
to be convinced of their own analysis, so this review
is often neglected.
Accordingly, it is essential that data quality is constantly
monitored without requiring explicit attention from the
domain expert and that, in case of critically low data
quality, the domain expert is made aware of quality
issues and is able to react to them.
Figure 2: Possible integration of data quality indicators in graphical data analysis tools.
A possible approach is illustrated in Figure 2,
showing how data quality can be integrated in an un-
obtrusive manner. The main contributions of this pa-
per are: (i) a set of requirements for integrating data
quality into interactive exploratory analysis, (ii) the
results of a comprehensive literature review concern-
ing the implementations of these requirements in dif-
ferent tools available on the market, (iii) our novel
approach for unobtrusive integration of data quality
into graphical data analysis tools and how different
experts can contribute their respective strengths in this
context, and (iv) a prototypical implementation of our
approach to illustrate what such a tool could look like
for domain experts.
The remainder of this paper is structured as fol-
lows: In Section 2, we define requirements that are
necessary for the appropriate integration of data qual-
ity into the interactive analysis process. Next, we
evaluate related data analysis tools according to our
requirements in Section 3. In Section 4, we introduce
our novel approach for unobtrusive integration of data
quality into interactive data analysis. Then, we show a
brief overview of our implemented prototype in Sec-
tion 5 and discuss how this prototype fulfills the re-
quirements in Section 6. We present related work in
Section 7. Finally, we conclude this work and show
future work in Section 8.
2 REQUIREMENTS
In this section, we define several requirements that
are necessary to enable domain experts to be aware
of data quality at all times during exploratory analy-
sis and react to it when needed. In general, there are five
requirements for user-centric data analysis processes
that should be fulfilled for an optimal involvement
of a domain expert in the analysis (Behringer et al.,
2017). In this paper, we adapt these requirements to
the specific circumstances related to data quality:
(R1) Integration into Entire Data Analysis Process.
For a successful analysis, a domain expert must be
informed about the data quality at all times. Further-
more, it is indispensable for assessing the performed
analysis steps to show their impact on the data qual-
ity. This is the prerequisite to understand the quality
of the underlying data, to improve it if necessary, and
finally to evaluate the reliability of the analysis.
(R2) Feedback at Different Levels of Detail. An
important criterion for the involvement of domain ex-
perts with regard to interactive data analysis is to
avoid information overload. Thus, information about
the current data quality has to be presented according
to the respective context, i.e., less detailed informa-
tion considering the entire analysis process and more
details on single analysis steps or on request. These
levels of detail should be separated into different data
quality dimensions, e.g., completeness or timeliness.
(R3) Involvement of Several User Roles. It can-
not be assumed that a domain expert has comprehen-
sive knowledge of all available data sources. Thus,
metadata related to data quality must already be pre-
annotated, e.g., by domain experts working within the
respective domain or technical experts residing in the
IT department. Hence, several user roles can con-
tribute the respective strengths to the analysis process
and support self-service analysis, leading to a clear
separation of concerns.
(R4) Automated Background Monitoring. Data
quality monitoring should require as little attention
from domain experts as possible. Instead, it should
take place primarily in the background. Then, if the
data quality decreases below a threshold, a domain
expert can be alerted, which removes the need to check
data quality manually while still ensuring appropriate monitoring.
(R5) Assisted Solving of Identified Issues. If the
data quality is insufficient, the reasons behind this
have to be communicated to the domain expert in
a comprehensible manner. Therefore, it is tremen-
dously important that suggestions are made to over-
come these deficiencies and to support the domain ex-
pert in this analysis process.
Table 1: Coverage of the identified requirements in various tools.
The tools are grouped by phase (Data Wrangling, Data Analysis, Hybrid) and rated on a four-level scale: none or poor support [0%, 25%], medium support (25%, 50%], good support (50%, 75%], and high support (75%, 100%].
R1: Integration into the entire data analysis process
R2: Feedback at different levels of detail
R3: Involvement of several user roles
R4: Automated background monitoring
R5: Assisted solving of identified issues
3 EVALUATION OF RELATED
TOOLS AND FOUNDATIONS
In this section, we evaluate related tools according to
our requirements and we introduce foundations.
3.1 Evaluation of Related Tools
In order to emphasize the importance of our approach,
we analyzed and compared established tools in regard
to their consideration of data quality aspects. With re-
gard to the requirements introduced, we evaluated the
leading tools based on the respective Magic Quad-
rants of Gartner in the area of data wrangling/data
quality (Gartner Inc., 2021b) and data analysis (Gart-
ner Inc., 2021a). We also added tools that support
both data wrangling and data analysis. These tools
comprise RapidMiner, KNIME, as well as Tableau
(https://tableau.com) including Data Prep, and cover all steps of the data
analysis process. We evaluated these tools with re-
spect to our introduced requirements. An overview of
this comparison is shown in Table 1.
Regarding our first requirement R1, the integra-
tion into the entire data analysis process, we conclude
that data wrangling and data analysis tools offer very
limited support while hybrid tools offer good support.
However, in hybrid approaches, data quality needs to
be manually integrated into the data analysis process-
ing. In each step of the analysis workflow, domain ex-
perts need to define which data quality metric should
be evaluated and insert specific steps to do so. In more
complex processes, however, this can become a very
tedious task. Hence, a full integration without a large
amount of user interaction would be desirable.
The second requirement, the feedback at different
levels of detail, is supported very well by data wran-
gling tools, while data analysis and hybrid tools offer
nearly no support. In data analysis tools, user feed-
back regarding data quality is given in a very limited
scope, for instance, only on column level and not in
separate dimensions. In hybrid tools, it is necessary to
insert manual steps into the process which can provide
some kind of feedback regarding data quality. How-
ever, this also means that feedback is only contained
in these inserted steps and not integrated into the over-
all user interface of the tools.
Regarding the third requirement, the involvement
of different user groups, we did not find any higher
support in the analyzed data analysis and hybrid tools.
With regard to data wrangling tools, it should be noted
that these are usually utilized by a different expert
than for the subsequent data analysis. This in turn
means that although different user roles work together
as part of the entire data analysis process, the data
quality is predetermined in this case, irrespective of
the analysis. Hence, there is a gap regarding R3.
The fourth requirement specifies that a calculation
in the background is desirable, without interrupting
domain experts while they specify data analysis pro-
cesses. The analyzed tools do not offer sufficient sup-
port for this. While data wrangling tools have a different
focus, data analysis tools usually do not provide inte-
grated solutions or lack the required metadata to cal-
culate data quality. Hybrid approaches require insert-
ing data quality assessment steps which need to be
actively triggered. Hence, no sufficient back-
ground calculation is possible in these tools.
Finally, regarding requirement five, the interac-
tive correction, good support is provided by data
wrangling tools. Here, interaction is possible; how-
ever, it is solely focused on the data level and not
on a contextual level regarding the entire data analy-
sis process. Considering metadata as well would be
beneficial. For data analysis and hybrid tools, we
observed that most tools offer some functionality to
correct data quality. Regarding analysis tools, some
offer limited functionality to interactively remove du-
plicates or fill in missing values; however, the
tools differ greatly in their capabilities. In hybrid
tools, this interaction has to be manually configured
by adding an additional step to the data analysis pro-
cess which requires deeper knowledge.
3.2 Foundations
As our paper aims to measure the underlying data
quality, we need to specify metrics that allow assess-
ing whether data quality can be considered as low or
high. In the following, metrics assembled from a literature
review are introduced to measure established
data quality dimensions.
3.2.1 Accuracy/Correctness
Calculating how syntactically or semantically cor-
rect data is, is a very difficult task. We either re-
quire correct reference data or plausibility rules for
this (Loshin, 2010), e.g., a birth date is not in the fu-
ture. Hence, the following approaches assume that
incorrect data can be identified:
Ratio of incorrect values: The easiest way to calcu-
late correctness is to compare the number of correct
values to the total number of values (Azeroual et al.,
2018; Serhani et al., 2016; Juddoo, 2015).
Distance function: A more sophisticated way to cal-
culate correctness is to use distance functions, which
take into consideration the degree to which data is correct or in-
correct by calculating the similarity of the data to correct reference values.
To this end, a distance function is chosen depending
on the type of data; for string values, for instance, the Leven-
shtein distance (Levenshtein, 1966) or the Hamming dis-
tance (Hamming, 1950) can be used.
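To illustrate both metrics, the following Python sketch computes the ratio of plausible values and a normalized Levenshtein-based correctness score; the plausibility rule (a birth date must not lie in the future) stems from the example above, while the concrete example values are assumptions.

```python
# Minimal sketch of the two accuracy metrics; example data is illustrative.
import pandas as pd

def accuracy_ratio(values: pd.Series, is_plausible) -> float:
    """Ratio of plausible (correct) values to all checked values."""
    checked = values.dropna()
    return 1.0 if checked.empty else float(checked.map(is_plausible).mean())

def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def distance_based_correctness(value: str, reference: str) -> float:
    """Normalized similarity in [0, 1]: 1.0 means the value matches the reference."""
    return 1.0 - levenshtein(value, reference) / max(len(value), len(reference), 1)

# Plausibility rule from the text: a birth date must not lie in the future.
birth_dates = pd.Series(pd.to_datetime(["1980-05-01", "2042-01-01", "1999-12-31"]))
print(accuracy_ratio(birth_dates, lambda d: d <= pd.Timestamp.today()))  # ~0.67
print(distance_based_correctness("Stuttgart", "Sttutgart"))              # ~0.78
```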
3.2.2 Consistency/Integrity
This dimension specifies how consistent the set of
data is, i.e., whether a data set has contradictions
within itself. So-called consistency rules are used to
define what is considered as consistent. An example
of a consistency rule is that a zip code has to match
the respective city (Azeroual et al., 2018; Batini and
Scannapieco, 2016). We can use the following met-
rics to measure consistency:
Ratio of consistent and inconsistent data: The easi-
est way to calculate this metric is once again to compare
the number of inconsistent data entries to the total
number of data entries checked for consistency (Lee
et al., 2006).
Weighted sum: A more accurate approach is to use
a weighted sum, which considers the consequences
of violating or fulfilling consistency rules. This ap-
proach is described in detail in (Alpar and Winkelsträter, 2014; Hipp et al., 2007).
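The following sketch illustrates how both metrics could be computed on a tabular dataset; the concrete rules, column names, and weights are hypothetical examples and not prescribed by our approach.

```python
# Minimal sketch of the two consistency metrics for a pandas DataFrame.
import pandas as pd

# A consistency rule maps a record (row) to True (consistent) or False (violation).
rules = {
    # Rule from the text: the zip code has to match the respective city.
    "zip_matches_city": lambda r: r["city"] != "Stuttgart" or str(r["zip"]).startswith("70"),
    "rpm_requires_machine": lambda r: r["type"] != "L2750" or pd.isna(r["rpm"]),
}
# Weights reflect how severe a violation of each rule is considered to be.
weights = {"zip_matches_city": 0.7, "rpm_requires_machine": 0.3}

def consistency_ratio(df: pd.DataFrame) -> float:
    """Share of records that satisfy all consistency rules."""
    consistent = df.apply(lambda r: all(rule(r) for rule in rules.values()), axis=1)
    return float(consistent.mean())

def weighted_consistency(df: pd.DataFrame) -> float:
    """Weighted sum: each rule contributes its satisfaction rate times its weight."""
    return sum(w * float(df.apply(rules[name], axis=1).mean()) for name, w in weights.items())
```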
3.2.3 Completeness
The dimension completeness measures how com-
plete the data is or, in other words, whether data is miss-
ing. Here, one must decide whether only the values
contained within the dataset are considered (closed-world
assumption) or also correct values that are missing from
the dataset (open-world assumption) (Ba-
tini and Scannapieco, 2016). For instance, based on
this decision, a dataset with 30 states of the U.S. could
be seen either as complete or as incomplete if the remaining
20 states are considered as well. We can calculate this
dimension with the following metrics:
Missing value ratio: Here, the ratio of present data to
missing data is calculated to measure complete-
ness (Batini and Scannapieco, 2016; Scannapieco
et al., 2005).
NULL tuple ratio: Instead of focusing on features, an-
other approach is to compare the ratio of tuples con-
taining at least one NULL value to tuples contain-
ing no NULL values (Blake and Mangiameli, 2009).
Note that multiple NULL values within a single tuple
are not counted separately.
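Both metrics can be computed directly on a tabular dataset, as the following sketch under the closed-world assumption shows; the example columns and values are illustrative.

```python
# Minimal sketch of the two completeness metrics; missing cells are NaN/None.
import pandas as pd

def missing_value_ratio(df: pd.DataFrame) -> float:
    """Share of non-missing cells: 1.0 means the dataset is fully complete."""
    return 1.0 - float(df.isna().to_numpy().mean())

def null_tuple_ratio(df: pd.DataFrame) -> float:
    """Share of tuples without any NULL value; several NULLs in one tuple count once."""
    return 1.0 - float(df.isna().any(axis=1).mean())

df = pd.DataFrame({"state": ["BW", None, "BY"], "zip": ["70569", "80331", None]})
print(missing_value_ratio(df))  # ~0.67: 4 of 6 cells are present
print(null_tuple_ratio(df))     # ~0.33: only 1 of 3 tuples has no NULL value
```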
3.2.4 Timeliness
The dimension timeliness measures the probability
that data being processed at a certain point in time
still reflects the reality and, hence, is not outdated.
This dimension may change over time as it is strongly
dependent on the use case (Wang and Strong, 1996).
Probabilistic approach: In the probabilistic approach,
it is assumed that timeliness decreases exponentially.
To calculate this dimension, it is necessary to define
how fast the timeliness of data decreases in a spec-
ified amount of time, e.g., the timeliness dimension
decreases by 10% each month (Heinrich and Klier,
2011).
Time-limited approach: Another means to measure
this dimension is the time-limited approach. Here,
it is assumed that data can be considered invalid at
a fixed point in time. This approach then calculates
the decay of data quality with regard to timeliness from
data creation to the point in time of invalidity (Ballou
et al., 1998).
Hybrid approach: Finally, in the hybrid approach,
timeliness is calculated using either the time-limited or
the probabilistic approach depending on the current use
case. The probabilistic approach is used if it is not
possible to define a fixed time limit, i.e., it cannot be
foreseen when data become invalid. Otherwise, the
time-limited approach is used (Even and Shankara-
narayanan, 2005).
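The following sketch contrasts the probabilistic and the time-limited calculation; the decline rate, the modeled linear decay, and the example dates are assumed values for illustration.

```python
# Minimal sketch of the timeliness metrics; parameters are illustrative.
import math
from datetime import datetime

def timeliness_probabilistic(age_in_months: float, decline_per_month: float = 0.1) -> float:
    """Exponential decay: timeliness drops by a defined rate (e.g., 10%) per month of age."""
    return math.exp(-decline_per_month * age_in_months)

def timeliness_time_limited(created: datetime, invalid_from: datetime, now: datetime) -> float:
    """Linear decay from data creation until a fixed point of invalidity, 0 afterwards."""
    total = (invalid_from - created).total_seconds()
    remaining = (invalid_from - now).total_seconds()
    return max(0.0, min(1.0, remaining / total)) if total > 0 else 0.0

print(timeliness_probabilistic(age_in_months=5))  # ~0.61
print(timeliness_time_limited(datetime(2022, 1, 1), datetime(2023, 1, 1), datetime(2022, 10, 1)))  # ~0.25
```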
4 UNOBTRUSIVE INTEGRATION
OF DATA QUALITY IN
INTERACTIVE DATA
ANALYSIS
In this section, we present our novel approach to over-
come the aforementioned limitations of existing ap-
proaches and, thus, to enable domain experts to eas-
ily maintain an overview of data quality without re-
quiring additional effort while conducting an explo-
rative analysis. To achieve this, (a) different user
roles need to collaborate and contribute their respec-
tive strengths, (b) generic data quality metrics need to
be defined, (c) depending on the analysis context, the
data quality metrics have to be adapted, and (d) issues
in data quality need to be identified and solutions have
to be recommended.
Figure 3 shows an overview of our approach. As
described above, a popular approach to involve do-
main experts in exploratory analysis is the use of
graphical data analysis tools. These provide a range
of data sources and allow domain experts without
deeper technical knowledge to combine these data
sources and specify transformations in a graphical
manner. To keep the necessary effort for domain ex-
perts at a minimum, these data sources should be pro-
vided in advance by an IT expert. In our approach,
this is done in the preparation phase (Figure 3, 1).
Here, an IT expert first adds a new data source to the
repository (Figure 3, 1a), e.g., by specifying a connec-
tion to the respective database. In a subsequent step,
the IT expert has to create a domain-agnostic ground
truth for this data source (Figure 3, 1b). This ground
truth comprises various data quality metrics that must
hold in order to consider the data as qualitative. For
instance, if we consider the data quality dimension
completeness under the closed-world assumption for
one data feature, the IT expert has to specify which
values have to be included in the data, e.g., the 50
states of the USA are defined and each differing value
or missing value would decrease the quality. Once
this ground truth is specified for each data feature, it is
stored as a data quality artifact in the repository. At
this point, the IT expert's task ends for the time being.
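To give an impression of what such a data quality artifact could contain, the following listing shows a possible, simplified structure; all keys, dimension parameters, and thresholds are illustrative assumptions rather than a fixed schema of our approach.

```python
# Hypothetical, simplified shape of a data quality artifact stored in the repository.
quality_artifact = {
    "data_source": "us_customers",          # reference to the registered data source
    "features": {
        "state": {
            # Closed-world completeness: only the 50 US state codes are valid values.
            "completeness": {"allowed_values": ["AL", "AK", "AZ", "AR", "CA", "..."]},
            "accuracy": {"pattern": r"^[A-Z]{2}$"},   # plausibility rule for the value format
        },
        "last_update": {
            "timeliness": {"approach": "probabilistic", "decline_per_month": 0.1},
        },
    },
    # Domain-agnostic thresholds below which the domain expert is alerted later on.
    "thresholds": {"completeness": 0.9, "accuracy": 0.95, "timeliness": 0.5},
}
```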
This phase is decoupled from the explorative anal-
ysis of domain experts to facilitate flexibility without
the need to reach out to an IT expert. To support do-
main experts in their exploratory analysis with regard
to data quality, we describe our approach based on
graphical data analysis tools. For a domain expert,
the first phase is the specification phase (Figure 3, 2),
in which the analysis workflow is created.
First, the required data sources are selected (Fig-
ure 3, 2a). Since a data quality artifact is provided
in the repository for all available data sources in this
phase, the data quality can be determined by means
of this artifact (Figure 3, 2b). These calculated data
quality metrics are then displayed to the domain ex-
pert and allow for a direct assessment (Figure 3, 3a)
of whether this data source is qualitatively sufficient or,
if it is not, where possible problems may be located,
e.g., whether there are data completeness concerns. Subse-
quently, it can be decided to either change the data
source(s) (Figure 3, 2a) or to proceed with the speci-
fication of the analysis workflow (Figure 3, 2c), e.g.,
preprocessing or data mining transformations.
If the latter is chosen, the workflow is exe-
cuted (Figure 3, 2d), which is necessary because these
transformations affect the data and, thus, influence the
data quality. In both cases, the data quality is calcu-
lated again based on the data quality artifacts (Fig-
ure 3, 2b) and displayed for review (Figure 3, 3a). Up
to this point, all data quality metrics are calculated
based on the data quality artifact defined by the IT
expert in the preparation phase and are, therefore, in-
dependent of the context of the domain expert’s anal-
ysis. Although this is a significant advance over state-
of-the-art approaches without automatic data quality
monitoring, it is still insufficient in many cases. For
instance, it is possible that the data timeliness dimen-
sion has been defined in advance in such a way that
the data must be up-to-date on a daily basis, but for
the current analysis, historical data is required.
In this scenario, the definition of data quality by
means of the data quality artifact is no longer suitable
and has to be adapted to the intended analysis. This
is supported by our approach in the adaptation phase
(Figure 3, 4), where the domain-specific quality met-
rics are defined (Figure 3, 4a). This can be a wide va-
riety of different measures, which is why our architec-
ture is generic and extensible in this respect. Possible
adjustments include, for instance, adding or remov-
ing quality dimensions, adjusting thresholds, or even
weighting the different dimensions according to their
importance. With each adjustment, the now domain-
specific data quality is immediately calculated (Fig-
ure 3, 4b) and again visualized for assessment by the
domain expert (Figure 3, 3b).
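As a simple illustration of such an adaptation, the following sketch re-weights the data quality dimensions for an analysis of historical data; the dimension scores and weights are assumed example values.

```python
# Minimal sketch of a domain-specific, weighted overall data quality score.
scores = {"accuracy": 0.93, "completeness": 0.79, "consistency": 0.86, "timeliness": 0.61}

def overall_quality(scores: dict, weights: dict) -> float:
    """Weighted average over the selected data quality dimensions."""
    total = sum(weights.values())
    return sum(scores[dim] * w for dim, w in weights.items()) / total

# Historical data is analyzed, so timeliness is ignored for this analysis context.
domain_specific_weights = {"accuracy": 0.4, "completeness": 0.3, "consistency": 0.3, "timeliness": 0.0}
print(overall_quality(scores, domain_specific_weights))  # ~0.87 instead of ~0.80 unweighted
```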
By this stage of the process, data quality metrics
have been predefined by an IT expert and/or adapted
to the context by a domain expert. However, if the
calculated data quality is still insufficient, or if a more
reliable analysis should be performed, the domain ex-
pert is required to focus on the data itself and use the
improvement phase (Figure 3, 5) to enrich or clean the
data until a sufficient data quality is achieved.
Figure 3: Different phases to unobtrusively integrate data quality in explorative analysis (Preparation Phase, Specification Phase, Adaptation Phase, Improvement Phase, and Integration Phase, connected via a feedback loop).
Therefore, our approach uses data quality artifacts
to automatically identify common data quality prob-
lems (Figure 3, 5a) and indicate these problems to the
domain expert. At the same time, possible solutions
are suggested, such as how to deal with null values
or duplicates. A domain expert can now accept these
suggestions (Figure 3, 5b) or take more in-depth ac-
tions, e.g., specify further rules for data imputation.
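The following sketch indicates how such issues and one-click suggestions could be derived from a feature's profile; the thresholds, the duplicate check, and the wording of the proposed solutions are assumptions for illustration.

```python
# Minimal sketch of deriving issues and countermeasure suggestions for one feature.
import pandas as pd

def suggest_fixes(df: pd.DataFrame, feature: str) -> list:
    """Detect common data quality problems for a feature and propose countermeasures."""
    suggestions = []
    missing = float(df[feature].isna().mean())
    if missing > 0.1:
        suggestions.append({
            "issue": f"{missing:.0%} of the values in '{feature}' are missing",
            "solutions": ["ignore empty rows", "remove affected rows", "impute with most frequent value"],
        })
    if feature.endswith("_id") and df[feature].dropna().duplicated().any():
        suggestions.append({
            "issue": f"duplicate identifiers in '{feature}'",
            "solutions": ["remove duplicate rows"],
        })
    return suggestions
```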
Next, there is a further loop consisting of the
(re)calculation of data quality metrics (Figure 3, 5c)
and the follow-up assessment of whether the data
quality is now sufficient (Figure 3, 3c). However, this
process is not straightforward. Instead, it can require
several iterations depending on the complexity and
suitability of the data. Consequently, a domain expert
must be able to switch between the phases, e.g., if it
turns out during the improvement phase (Figure 3, 5)
that the calculation of the domain-specific data qual-
ity metrics is still insufficient and the metric calcu-
lation needs to be refined. For this reason, our ap-
proach includes a feedback loop (Figure 3, 3) to en-
able the domain expert to leverage knowledge from
completed phases and use it to make adjustments in
previous phases.
Once these phases are completed, the domain ex-
pert’s analysis is finished and new insights have been
gained. Due to the prior integration of an IT expert, it
is expected that the effort for domain experts remains
manageable. Nevertheless, it is also possible that the
data quality artifacts are no longer up-to-date, e.g., be-
cause the focus of the necessary analyses has shifted
since the data quality artifact has been created or the
data itself has changed over time. For this reason,
our approach comprises one more phase, the Integra-
tion Phase (Figure 3, 6), in which feedback is pro-
vided back to the IT department (Figure 3, 6a), e.g.,
domain-specific adjustments made in the adaptation
phase (Figure 3, 4) or data cleaning transformations
performed in the improvement phase (Figure 3, 5).
An IT expert may later adapt and refine the initial data
quality artifacts or perform additional data cleaning in
advance, provided that this makes sense for the ma-
jority of the analyses. Furthermore, additional data
sources can be requested if no suitable data sources
are available in the repository (Figure 3, 6b). Either
way, the domain expert’s tasks end at this point, and
the possible additional activity of an IT expert is again
decoupled in terms of time.
Figure 4: Detail view in the user interface of our approach (overall quality score, per-dimension scores, a data sample, and per-feature scores, value distributions, statistical indicators, and recommendations with one-click solutions).
5 PROTOTYPE
We have implemented the presented concepts in a
fully functional prototype to provide more insight into
what such a system would look like for domain ex-
perts. This was accomplished by first extending a data
mashup tool to provide an overview of the overall data
quality during the workflow specification (cf. Fig-
ure 2). By clicking on one of the quality indicators for
the overall data quality, a small overlay is displayed,
which lists the respective data quality dimensions sep-
arately and offers the possibility to display more de-
tails to the domain expert.
If the domain expert decides to proceed with more
details, the user interface depicted in Figure 4 will be
opened. In the upper part of this user interface, the
achieved overall data quality is shown first and
is color-coded (Figure 4, a). Thus, it can always be
seen at first glance whether the desired data quality
has already been achieved. Next to it, the overall
data quality is broken down into its dimensions, for the data
source in this figure: accuracy, complete-
ness, consistency, and timeliness (Figure 4, b).
Now, a domain expert can quickly identify which di-
mensions are critical at the moment. According to the
presented approach, these dimensions are either the
domain-agnostic data quality metrics defined by the
IT expert or the domain-specific, refined settings of
the domain expert. Furthermore, within the top sec-
tion of the dashboard, a small sample of the dataset is
visualized in order to give the domain expert a better
feeling about the data characteristics (Figure 4, c). In
many situations, however, an assessment at the data
source level is not sufficient. For instance, it is not
yet possible to tell whether the value of the timeliness
dimension is caused by individual features with very
low data quality or whether all features have medium
data quality instead.
For this reason, the data quality is also calculated
at feature level (Figure 4, d) and visualized in color-
coded form. Here, once again, it is possible to see at
a glance which features exhibit critical data quality,
both overall and again subdivided into the individual
data quality dimensions (Figure 4, e).
Below this overview, the most common values are
indicated for each feature as value distribution (Fig-
ure 4, f), as well as additional statistical key indicators
(Figure 4, g), e.g., the number of null or unique val-
ues, as well as minimum, maximum, or the average.
The respective indicators depend on the data type of
the feature, i.e., the average is not calculated for cate-
gorical data.
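The following sketch indicates how these type-dependent indicators could be computed per feature; which indicators are included for which data type is an assumption based on the description above.

```python
# Minimal sketch of the type-dependent feature statistics shown in the detail view.
import pandas as pd
from pandas.api.types import is_numeric_dtype

def feature_overview(series: pd.Series) -> dict:
    """Key indicators for one feature; numeric statistics only for numerical data."""
    stats = {
        "null_values": int(series.isna().sum()),
        "unique_values": int(series.nunique()),
        "top_values": series.value_counts().head(5).to_dict(),  # value distribution
    }
    if is_numeric_dtype(series):
        stats.update(minimum=float(series.min()), maximum=float(series.max()),
                     mean=float(series.mean()), median=float(series.median()))
    return stats
```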
In order to increase the data quality, identified is-
sues and potential countermeasures are suggested be-
low these metrics (Figure 4, h). For instance, for
the feature date, one finding is that many records are
older than the specified policy. In this case, our proto-
type suggests, among other things, that this outdated
data should be removed from the dataset resulting in
a higher quality dataset according to the requirement.
Here, it is up to the domain expert to judge whether
this would corrupt the analysis results. An alternative
would be the second suggested action that the policy
for the timeliness dimension is adjusted, i.e., older
data is also considered as qualitative. For the same
feature, an additional problem with the data in this
figure is that the date format does not match the spec-
ified format. For this, a transformation to the correct
date format is suggested, which the domain expert can
trigger with just a single click.
A further example can be seen when considering
the feature RPM (rotations per minute). Here, ei-
ther the IT expert or the domain expert has defined
a rule according to which the RPMs for a specific
machine type must be within a given range. If indi-
vidual records violate this rule, this is either an in-
dication of underlying issues with the machines or
anomalies in the data, which should, therefore, not
be considered for the current analysis. As a possible
solution in this scenario, our prototype suggests that
either these entries are discarded or the corresponding
rule is adapted. Any of these adjustments is im-
mediately applied to the data in the background. The
user interface is updated after the calculation has been
completed.
Alternatively, the detail view can be closed at any
time to proceed with the analysis. Here, the system re-
turns to the workflow specification overview (cf. Fig-
ure 2) and the displayed overall data quality value is
temporarily replaced with an unobtrusive warning un-
til the calculation of the adjustments has been com-
pleted. To enable a wide range of possible adjust-
ments, our prototype is built based on a generic ar-
chitecture and allows to be extended with additional
functionalities to increase data quality in an explo-
rative analysis.
6 DISCUSSION
In order to evaluate the introduced approach to inte-
grate data quality assessment unobtrusively into the
data analysis process, we use the five requirements
defined earlier in Section 2.
(R1) Integration into the Entire Data Analysis Pro-
cess. The first requirement specified that it is essen-
tial to integrate data quality assessment into the en-
tire data analysis process, i.e., a domain expert must
be informed about the data quality at all times. By
adding a data quality assessment through calculation
of domain-agnostic metrics into each step of the data
analysis process and by allowing to adjust process
steps through feedback loops, we can fulfill the first
requirement R1. In our approach, domain experts are
always informed about possible data quality issues.
(R2) Feedback at Different Levels of Detail. The
second requirement was to avoid an information over-
load of domain experts, i.e., to show less de-
tailed information for the entire analysis pro-
cess and more details on individual analysis steps or
upon specific user request. In our approach, we real-
ized this by providing direct user feedback first in
an aggregated overview and then allowing domain
experts to request more details on demand in each
step. Thus, we follow the popular Information Seek-
ing Mantra (Shneiderman, 1996).
(R3) Involvement of Several User Roles. The third
requirement was the consideration of multiple user
roles. While domain experts have knowledge about
the specific goals of the analysis process and know
the meaning of the data very well, they usually lack
technical knowledge and IT expertise, e.g., to inte-
grate new data sources into the process. Thus, we
clearly distinguish the roles of domain experts and IT
experts and keep the communication over-
head between these roles as minimal as possible, since
the steps of these user roles can be clearly separated,
i.e., a clear separation of concerns.
(R4) Automated Background Monitoring. The
fourth requirement was that the data quality assess-
ment should be done in the background without being
obtrusive for domain experts in workflow specifica-
tion. As can be seen in Figure 3, our approach fulfills
this requirement by calculating data quality metrics
fully in the background without blocking workflow
specification for domain experts. Once the assess-
ment is finished, the results are shown to the domain
experts in the frontend.
(R5) Assisted Solving of Identified Issues. Finally,
requirement five was that users should be assisted
in improving data quality by applying different mea-
sures. This is realized in our approach in two ways:
(a) by identifying typical data quality issues when
they arise and proposing possible solutions that can
be accepted with one mouse click by a domain expert,
and (b) by showing the impact of operations, e.g., data
preprocessing tasks, on the resulting data quality for
more complex data quality issues.
In conclusion, our approach fulfills the specified
requirements. In contrast to existing approaches, the
reliability of the data analyses can be assessed better
since the domain expert is informed about the data
quality in the process at all times.
7 RELATED WORK
Many different process models have been published
to describe the methodology for data analysis. The
two most popular representatives are the KDD pro-
cess (Fayyad et al., 1996) and CRISP-DM (Shearer,
2000), which schematically represent the various
phases required to transform raw data into knowl-
edge in a structured way. The implementation of
these processes is generally performed by technical
experts and based on domain-specific knowledge of
domain experts. As a consequence, however, these
predefined analyses tend to become a black box and
changes can rarely be made independently by a do-
main expert (Behringer et al., 2017). Therefore, spe-
cific domain knowledge cannot be considered during
the analysis process, and this way of working is only suitable for well-
understood analyses.
Approaches to integrate domain experts are: (i)
Visual Analytics (Thomas and Cook, 2005), which
extends common visualization through repetitive
analysis steps, and (ii) Self-Service Business Intelli-
gence (Alpar and Schulz, 2016), which is designed to
enable self-directed analysis. Thereby, Visual Ana-
lytics is mainly focused on solving a specific prob-
lem (Keim et al., 2010), while Self-Service Busi-
ness Intelligence offers a generic analysis approach
but follows predefined analysis paths (Stodder, 2015),
e.g., choosing the features to use, and lacks ad-
vanced data mining tasks.
Another approach is based on graphical data anal-
ysis tools (Daniel and Matera, 2014), e.g., KNIME
(https://knime.com) or RapidMiner
(https://rapidminer.com). Here, a domain expert specifies an
analysis workflow step-by-step starting with the se-
lection of data sources, continuing with data prepara-
tion, and finally data mining and reporting.
An important issue when it comes to data pro-
cessing and analysis is data quality. Data quality
is oftentimes reduced to accuracy of data, i.e., ty-
pos or incorrect data (Firmani et al., 2016). How-
ever, data quality should be considered in numer-
ous dimensions. Yet, this understanding is not uni-
form as a literature review shows, e.g., the recom-
mended dimensions differ between standards (e.g.,
DIN EN ISO 9001:2015), practice (Askham et al.,
2013) and scientific literature (Azeroual et al., 2018;
Batini and Scannapieco, 2016; Firmani et al., 2016;
Wang and Strong, 1996). Nevertheless, there is
a certain consensus with regard to dimensions that
occur more frequently. These include, in particu-
lar, accuracy/correctness, completeness and consis-
tency (Askham et al., 2013; Azeroual et al., 2018;
Batini and Scannapieco, 2016; Firmani et al., 2016;
Wang and Strong, 1996), timeliness (Askham et al.,
2013; Azeroual et al., 2018; Wang and Strong, 1996)
as well as trustworthiness/credibility/reputation (Ba-
tini and Scannapieco, 2016; Firmani et al., 2016;
Wang and Strong, 1996). In many cases these dimen-
sions are summarized under generic terms.
According to Wang and Strong (Wang and Strong,
1996), these are: (i) intrinsic quality, which by defini-
tion is present in the data, e.g., accuracy, objectivity,
or trustworthiness of the origin, and (ii) contextual quality,
which depends on the task at hand, e.g., timeliness, relevancy, or complete-
ness. Furthermore, data quality can also be evaluated
with regard to availability and security (accessibility
data quality) or in terms of interpretability (represen-
tational data quality). In particular, for interactive
data analysis, the first two categories, i.e., intrinsic
and contextual data quality, are to be considered.
8 SUMMARY AND CONCLUSION
In this paper, we present a new approach to assess
data quality during explorative analysis in an unobtru-
sive interactive manner. In a first step, we conducted
a comprehensive literature review to identify short-
comings of existing tools in regard to detecting data
quality issues in the analysis process. By doing so,
we compared different data wrangling, data analysis,
and hybrid tools based on a set of requirements. Af-
ter identifying the shortcomings of the existing tools,
we introduced an approach that shows how data qual-
ity can be assessed during the entire life cycle of a
data analysis process while keeping the domain ex-
pert in the loop. The goal is to integrate domain-
agnostic and, optionally, domain-specific data quality
metrics into each step of the data analysis process.
This approach helps by considering data qual-
ity early, i.e., during the specification of the analy-
sis workflow, which enables data quality by design.
Hence, a domain expert recognizes early on whether the qual-
ity of data sources is high enough or whether the data
sources need to be extended or replaced. Detecting
such issues in an early phase of process creation re-
duces costs and efforts of the entire process. Further-
more, the process contains several feedback loops in
each step, which allows returning to an earlier step in
the life cycle in case a data quality issue is detected.
This could, e.g., lead to an adaptation or enrichment
of the used data sources.
Overall, our introduced approach offers a continu-
ous and unobtrusive integration of data quality assess-
ment in the entire life cycle of a data analysis process,
easily understandable by domain experts. We eval-
uated our results through the identified requirements
and a prototypical implementation showing its appli-
cability. In future work, we plan to conduct extensive
user studies to evaluate our concept and prototype.
REFERENCES
Alpar, P. and Schulz, M. (2016). Self-Service Business In-
telligence. Business & Information Systems Engineer-
ing, 58(2):151–155.
Alpar, P. and Winkelsträter, S. (2014). Assessment of data
quality in accounting data with association rules. Ex-
pert Systems with Applications, 41(5):2259–2268.
Askham, N. et al. (2013). The Six Primary Dimensions
for Data Quality Assessment – Defining Data Quality
Dimensions. Technical report.
Azeroual, O. et al. (2018). Data measurement in research in-
formation systems: metrics for the evaluation of data
quality. Scientometrics, 115(3):1271–1290.
Ballou, D. et al. (1998). Modeling Information Manufactur-
ing Systems to Determine Information Product Qual-
ity. Management Science, 44(4):462–484.
Batini, C. and Scannapieco, M. (2016). Data and Informa-
tion Quality. Dimensions, Principles and Techniques.
Springer, Cham.
Behringer, M. et al. (2017). Towards Interactive Data Pro-
cessing and Analytics - Putting the Human in the
Center of the Loop. In Proceedings of ICEIS 2017.
SCITEPRESS - Science and Technology Publications.
Blake, R. and Mangiameli, P. (2009). Evaluating the Se-
mantic and Representational Consistency of Intercon-
nected Structured and Unstructured Data. In Proceed-
ings of AMCIS 2009.
Daniel, F. and Matera, M. (2014). Mashups Concepts,
Models and Architectures. Springer.
Even, A. and Shankaranarayanan, G. (2005). Value-Driven
Data Quality Assessment. In Proc. of ICIQ 2005.
Fayyad, U. et al. (1996). The KDD Process for Extracting
Useful Knowledge from Volumes of Data. Communi-
cations of the ACM, 39(11):27–34.
Firmani, D. et al. (2016). On the Meaningfulness of ”Big
Data Quality” (Invited Paper). Data Sci. Eng.
Gartner Inc. (2021a). Magic Quadrant for Analytics and
Business Intelligence Platforms. Technical report.
Gartner Inc. (2021b). Magic Quadrant for Data Quality So-
lutions. Technical report.
Grover, P. and Kar, A. K. (2017). Big Data Analytics: A Re-
view on Theoretical Contributions and Tools Used in
Literature. Global Journal of Flexible Systems Man-
agement, 18(3):203–229.
Hamming, R. W. (1950). Error detecting and error cor-
recting codes. The Bell System Technical Journal,
29(2):147–160.
Heinrich, B. and Klier, M. (2011). Assessing data cur-
rency—a probabilistic approach. Journal of Informa-
tion Science, 37(1):86–100.
Hipp, J. et al. (2007). Rule-based measurement of data qual-
ity in nominal data. In ICIQ, pages 364–378.
Juddoo, S. (2015). Overview of data quality challenges in
the context of big data. In Proceedings of the ICCCS
2015, pages 1–9.
Keim, D. A. et al. (2010). Visual Analytics. In Mastering
The Information Age, pages 7–18. Eurographics As-
sociation, Goslar.
Lee, Y. W. et al. (2006). Journey to data quality. The MIT
Press.
Levenshtein, V. I. (1966). Binary Codes Capable of Cor-
recting Deletions, Insertions and Reversals. Soviet
Physics Doklady, 10:707.
Loshin, D. (2010). The Practitioner’s Guide to Data Qual-
ity Improvement. Elsevier.
Polyzotis, N. et al. (2018). Data Lifecycle Challenges
in Production Machine Learning. ACM SIGMOD
Record, 47(2):17–28.
Reinsel, D., Gantz, J., and Rydning, J. (2018). Data Age
2025: The Digitization of the World. Technical report.
Scannapieco, M. et al. (2005). Data quality at a glance.
Datenbank-Spektrum, 14:6–14.
Serhani, M. A. et al. (2016). An Hybrid Approach to Qual-
ity Evaluation across Big Data Value Chain. In Proc.
of Big Data Congress, pages 418–425.
Shearer, C. (2000). The CRISP-DM model: the new
blueprint for data mining. Journal of Data Warehous-
ing, 5(4):13–22.
Shneiderman, B. (1996). The Eyes Have It: A Task by
Data Type Taxonomy for Information Visualizations.
In Symposium on Visual Languages, pages 336–343.
IEEE Comput. Soc. Press.
Stodder, D. (2015). Visual Analytics for Making Smarter
Decisions Faster. Technical report.
Thomas, J. J. and Cook, K. A. (2005). Illuminating the
Path. The Research and Development Agenda for Vi-
sual Analytics. National Visualization and Analytics
Center.
Wang, R. Y. and Strong, D. M. (1996). Beyond Accuracy:
What Data Quality Means to Data Consumers. Jour-
nal of Management Information Systems, 12(4):5–33.