Data quality issues frequently arise in the data integration stage, when ETL tools extract data from sources and migrate and load it into data repositories. Hence, data quality is an important aspect of the data integration process (Kimball and Caserta, 2011). Data quality has become critical to the success of organisations: numerous business initiatives have been delayed or even cancelled, citing poor-quality data as the main reason. Previous research has indicated that understanding the effects of data quality is critical to organisational success (Ge et al., 2011). High-quality data provides the foundation for data integration.
Most early data quality frameworks considered all data quality dimensions to be equally important (Knight and Burn, 2005). More recently, Fehrenbacher and Helfert (2012) argued that it is necessary to prioritise certain data quality dimensions for data management. However, as far as we know, no work has yet prioritised data quality dimensions in ETL, and there is limited research guiding data quality management in the data integration process.
Therefore, in this paper we aim to identify which data quality dimensions are crucial to data integration, and to derive guidelines for proactive data quality management in data integration. The contributions of this paper are twofold. First, we found that typical data quality problems exist in the data integration process; we have specified those problems and related them to different data quality dimensions. It emerges that certain data quality dimensions need to be further refined, and that additional dimensions concerning operational sequence and data uniqueness should be used in data quality management in ETL. Second, in order to proactively manage data quality in data integration, we have derived a set of data quality guidelines that can be used to avoid data quality pitfalls and problems when integrating data using the TPC-DI Benchmark.
The remainder of the paper is organised as follows. Section 2 reviews related work on data quality and data integration. Section 3 describes the research methodology used to conduct our research. Section 4 lists the data quality problems in the data integration process, and Section 5 classifies those problems into different data quality dimensions. Section 6 describes the guidelines we derived for data quality management in data integration. Finally, Section 7 concludes the paper and outlines future research.
2 RELATED WORK
In order to manage data quality, Wang (1998)
proposed the Total Data Quality Management
(TDQM) model to deliver high quality information
products. This model consists of four continuous
phases: define, measure, analyse and improve, in
which the measurement phase is critical, because
one cannot manage information quality without
having measured it effectively and meaningfully
(Batini and Scannapieco, 2016). In order to measure data quality, data quality dimensions must first be determined. To this end, Wang and Strong (1996) used an exploratory factor analysis to derive 15 data quality dimensions, which have been widely accepted in subsequent data quality research. Based on these 15 dimensions, data quality assessment has been applied in different domains such as healthcare (Warwicka et al., 2015), supply chain management (Ge and Helfert, 2013), and smart city applications (Helfert and Ge, 2016).
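To illustrate what measuring data quality dimensions can look like in practice, the following minimal sketch computes two commonly cited dimensions, completeness and uniqueness, as simple ratios over a small hypothetical record set (the data and function names are our own illustrative choices, not taken from the cited works):

```python
# Illustrative sketch: measuring two data quality dimensions
# (completeness and uniqueness) on a small, hypothetical record set.

records = [
    {"id": 1, "email": "a@example.com", "city": "Dublin"},
    {"id": 2, "email": None,            "city": "Cork"},
    {"id": 2, "email": "b@example.com", "city": None},
]

def completeness(rows, field):
    """Fraction of rows whose value for `field` is present (not None)."""
    return sum(r[field] is not None for r in rows) / len(rows)

def uniqueness(rows, key):
    """Fraction of rows carrying a distinct value of `key`."""
    return len({r[key] for r in rows}) / len(rows)

print(round(completeness(records, "email"), 2))  # 0.67
print(round(uniqueness(records, "id"), 2))       # 0.67
```

Scores below 1.0 flag the dimension as degraded, which is the kind of effective and meaningful measurement the TDQM measure phase calls for.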
Among these application domains, DI or ETL systems have emerged as an important field requiring data quality management. As Doan et al. (2012) state, the goal of a data integration system is to reduce the effort users need in order to acquire high-quality answers from it. They also defined data warehousing in terms of two tasks: (1) implementing the centralised database schema and physical design, and (2) defining a batch of ETL operations. Hence, the DI or ETL system is the groundwork of data warehousing, providing synthesised, consistent and accurate data.
The ETL system manages several procedures, specifically: (1) correcting or removing erroneous and missing data, (2) providing documented measures of confidence in the data, (3) safeguarding the captured flow of transaction data, (4) calibrating and integrating data from multiple sources so they can be leveraged together, and (5) structuring data to be usable by end-user tools (Kimball and Caserta, 2011). An ETL system does not merely extract data from source systems; it also acts as a combination of traffic police and garages for the motorway of data flows in the data warehousing architecture.
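A minimal sketch of procedures (1) and (4) from the list above may make them concrete: removing records with missing mandatory fields and integrating two hypothetical sources while dropping duplicates (the sources, field names, and pipeline functions here are illustrative assumptions, not part of any cited system):

```python
# Minimal ETL sketch: (1) remove records with missing data and
# (4) integrate multiple hypothetical sources, deduplicating by id.

source_a = [{"id": 1, "name": "Ann"}, {"id": 2, "name": None}]
source_b = [{"id": 1, "name": "Ann"}, {"id": 3, "name": "Bob"}]

def extract(*sources):
    """Yield rows from each source in turn."""
    for source in sources:
        yield from source

def transform(rows):
    """Drop rows with missing mandatory fields and duplicate ids."""
    seen = set()
    for row in rows:
        if row["name"] is None:   # (1) remove record with missing data
            continue
        if row["id"] in seen:     # (4) deduplicate across sources
            continue
        seen.add(row["id"])
        yield row

# "Load": materialise the cleaned, integrated rows.
warehouse = list(transform(extract(source_a, source_b)))
print(warehouse)  # [{'id': 1, 'name': 'Ann'}, {'id': 3, 'name': 'Bob'}]
```

Real ETL tools perform these steps at scale with documented confidence measures, but the shape of the pipeline is the same.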
Due to the importance of data quality
management in ETL systems, previous research has
been conducted to study the data quality problems in
ETL systems. Singh and Singh (2010) attempted to tabulate all possible data quality issues appearing in the data warehousing process (the data source, data integration and data profiling, data staging, ETL and database schema). In this research, a total of 117 data quality problems were demonstrated in