IMPROVING DATA QUALITY IN DATA WAREHOUSING
APPLICATIONS
Lin Li, Taoxin Peng and Jessie Kennedy
Edinburgh Napier University, 10 Colinton Road, Edinburgh, EH10 5DT, U.K.
Keywords: Data Quality, Data Quality Dimension, Data Quality Rules, Data Warehouses.
Abstract: There is a growing awareness that high quality of data is a key to today’s business success and dirty data
that exits within data sources is one of the reasons that cause poor data quality. To ensure high quality,
enterprises need to have a process, methodologies and resources to monitor and analyze the quality of data,
methodologies for preventing and/or detecting and repairing dirty data. However in practice, detecting and
cleaning all the dirty data that exists in all data sources is quite expensive and unrealistic. The cost of
cleaning dirty data needs to be considered for most of enterprises. Therefore conflicts may arise if an
organization intends to clean their data warehouses in that how do they select the most important data to
clean based on their business requirements. In this paper, business rules are used to classify dirty data types
based on data quality dimensions. The proposed method will be able to help to solve this problem by
allowing users to select the appropriate group of dirty data types based on the priority of their business
requirements. It also provides guidelines for measuring the data quality with respect to different data quality
dimensions and also will be helpful for the development of data cleaning tools.
1 INTRODUCTION
A great number of data warehousing applications
have been developed in order to derive useful
information from these large quantities of data.
However, investigations show that many of such
applications fail to work successfully and one of the
reasons is due to the dirty data. Due to the ‘garbage
in, garbage out’ principle, dirty data will distort
information obtained from it (Mong, 2000).
Nevertheless, research shows that many enterprises
do not pay adequate attention to the existence of
dirty data and have not applied useful methodologies
to ensure high quality data for their applications.
One of the reasons is a lack of appreciation of the
types and extent of dirty data (Kim, 2002).
Therefore, in order to improve the data quality, it is
necessary to understand the wide variety of dirty
data that may exist within the data source as well as
how to deal with them. This has already been
realized by some research works already (Rahm and
Do, 2000, Müller and Freytag, 2003, Kim, Choi,
Hong, Kim and Lee, 2003, Oliveira, Rogriques,
Henriques and Galhardas, 2005). However, in
practice, cleaning all data is unrealistic and simply
not cost-effective when taking into account the
needs of a business enterprise. The problem then
becomes how to make such a selection. In this paper,
this problem is referred to as the Dirty Data
Selection (DDS) problem. This paper presents a
novel method of classifying dirty data types from a
data quality dimension angle, embedded with
business rules, which has not previously been
considered in the literature. The proposed method
will help to solve this problem by allowing users to
select the appropriate group of dirty data types to
deal with based on the priority of their business
requirements.
The rest of the paper is structured as follows: in
section 2, data quality, data quality dimensions and
data quality rules that are used for the proposed
method are discussed. Dirty data types which is used
for the classification is presented in section 3. The
proposed method is given in section 4. An example
of using the method to deal with the DDS problem is
demonstrated in section 5. Finally, the paper is
concluded and future work is discussed in section 6.
379
Li L., Peng T. and Kennedy J. (2010).
IMPROVING DATA QUALITY IN DATA WAREHOUSING APPLICATIONS.
In Proceedings of the 12th International Conference on Enterprise Information Systems - Databases and Information Systems Integration, pages
379-382
DOI: 10.5220/0002903903790382
Copyright
c
SciTePress
2 DATA QUALITY, DATA
QUALITY DIMENSIONS AND
DATA QUALITY RULES
2.1 Data Quality
From the literature, data quality can be defined as
“fitness for use”, i.e., the ability of data to meet the
user’s requirement. The nature of this definition
directly implies that the concept of data quality is
relative. For example, an analysis of the financial
position of a company may require data in units of
thousands of pounds while an auditor requires
precision to the pence, i.e., it is the business policy
or business rules that determine whether or not the
data is of quality.
2.2 Data Quality Dimensions
According to Wang and Strong (Wang and Strong,
1996), the data quality dimension is a set of data
quality attributes, which represents a single aspect or
construct of data quality. These dimensions
represent the measurement of data quality from
different angles and classify the measurement of
data quality into different categories. Amongst the
data quality dimensions considered by researchers,
the following four dimensions accuracy,
completeness, consistency and currentness have
been considered to be the dimensions of data quality
involving data values (Fox, Levitin, Redman, 1994).
In this paper, these four dimensions will be used for
the proposed classification of dirty data.
2.3 Data Quality Rules
According to Adelman et al, data quality rules can
be categorized into four groups namely business
entity rules, business attribute rules, data
dependency rules, and data validity rules (Adelman,
Moss and Abai, 2005). Among the four categories,
data validity rules (R1.1~R6.2) govern the quality of
data values. Since the quality dimensions considered
in this paper are all data value related, only rules in
the data validity category will be considered for the
proposed method. It is noticed that data uniqueness
rules are associated with the data validity category.
Rules R5.1 and R5.2 evaluate a special data quality
problem which is caused by duplicate records.
Because of the popularity, complexity and difficulty
of this problem, it has attracted a large number of
researchers (Elmagarmid, Ipeirotis and VeryKios,
2007). Therefore, apart from the four data quality
dimensions, an extra data quality dimension
“Uniqueness” is introduced for dealing with
duplicate records exclusively in the proposed
method.
According to David Loshin, it is the assertion
embedded within the business polices that
determines the quality of data (Loshin, 2006).
Business policies can be transferred into a set of data
quality rules, each of which can be categorized
within the proposed data quality dimensions. In the
mean time, these rules can be used to measure the
occurrence of data flaws. In this paper, dirty data is
defined as these data flaws that break any of the data
quality rules. Since these rules are embedded within
each of the data quality dimensions, a relationship
between data quality dimensions and dirty data is
built. The proposed method is formed based on this
idea.
3 DIRTY DATA TYPES
A taxonomy of dirty data provides a better
understanding of data quality problems. There are
several taxonomies/classifications of dirty data
existing in the literature (Rahm and Do, 2000,
Müller and Freytag, 2003, Kim et al, 2003, Oliveira
et al, 2005). Within these works, Oliveira et al
produced a very complete taxonomy which has
identified 35 distinct dirty data types (DT.1~DT.35).
Since Oliveira et al’s taxonomy is the most complete
one existing in the literature, in next section, the
proposed method will use the 35 data quality
problems collected in their work for the mapping.
4 THE PROPOSED METHOD
Table 1: Data quality dimensions and data quality rules.
Data quality dimensions Rule No.
Accuracy R2.1~ R2.5, R3.1,
R4.1~R4.5
Completeness R1.2, R1.4
Currentness R3.2
Consistency R5.5, R6.1, R6.2
Uniqueness R5.1, R5.2
Having discussed data quality, data quality
dimensions and data quality rules in section 2
together with dirty data set generated based on
Oliveira et al’s work, a new classification of the
dirty data types is introduced beginning with a
mapping of data quality rules with data quality
ICEIS 2010 - 12th International Conference on Enterprise Information Systems
380
dimensions. Table 1 shows the result of the
mapping.
In order to classify dirty data types into data
quality dimensions, after mapping data quality rules
into data quality dimensions, a mapping from dirty
data types to data quality rules is required. The result
of this mapping is presented in table 2.
Table 2: Data quality rules and dirty data types.
Rule No. Dirty data type No.
R1.1 N/A
R1.2 DT.21,
R1.3 N/A
R1.4 DT.1, DT.15
R2.1 DT.4
R2.2 DT.5
R2.3 DT.11, DT.14, DT.17, DT.20,
DT.26, DT.35
R2.4 N/A
R2.5 DT.19, DT.34
R3.1 DT.16, DT.24, DT.25
R3.2 DT.3, DT.22
R4.1 DT.8
R4.2 DT.2
R4.3 DT.9
R4.4 DT.7
R4.5 DT.6
R5.1 DT.18, DT.33
R5.2 DT.12
R5.3 N/A
R5.4 N/A
R5.5 DT.10, DT.13,
R6.1 DT.23, DT.27, DT.31,
R6.2 DT.28, DT.29,DT.30, DT.32
The result of Table 2 provides immediate help for
the proposed classification of dirty data. Combining
the result from table 1 and 2, the classification of
dirty data based on data quality dimensions is
achieved in table 3.
Table 3: The classification of dirty data.
Data quality dimension Dirty data type
Accuracy
DT.2, DT.4~DT.9,
DT.11, DT.14, DT.16,
DT.17, DT.19, DT.20,
DT.23~DT.26, DT.34,
DT.35
Completeness DT.1, DT.15, DT.21
Currentness DT.3, DT.22
Consistency
DT.10, DT.13, DT.23,
DT.27~DT.32
Uniqueness DT.12, DT.18, DT.33
A method for dealing with dirty data based on
the classification in table 3 is described as follows.
Create an order of the five dimensions according
to the business priority policy;
Identify data quality problems;
Map the data types identified in 2) into the
dimensions against the classification table;
Decide dimensions to be selected based on the
budget;
Select appropriate algorithms, which can be used
to detect dirty data types associated with dimensions
identified in 3).
Execute the algorithms.
5 AN EXAMPLE
As an example, let’s consider an online banking
system used by a bank. Customers from the bank
could obtain all related banking information via this
system. Since the data in the system is
comprehensive, it is very likely that dirty data may
exist, such as misspelt data (DT.6), Wrong data
value range (DT.5), duplicate records (DT.18,
DT.33), data entered into a wrong field (DT.7),
different formats/patterns for the same attribute
(DT.23, DT.27), missing data within a record
(DT.1), late updated data (DT.3, DT.22) etc. In this
example, suppose cleaning all of the dirty data for
this bank is unrealistic. The problem that the bank
has to face is how to select a group of types of dirty
data to deal with, based on their business priority
policy, which is actually a DDS problem. According
to the bank’s priority policy, firstly the bank needs
to make sure that data maintained in the system is
accurate enough and up to date to provide correct
information. Therefore, the currentness dimension
and accuracy dimension are much more urgent than
others. The proposed method provides a systematic
approach to cope with the problem.
According to table 3, dirty data existing in the
system has been found within all of the five data
quality dimensions. It is easy to select which of
these dirty data types cause accuracy and currentness
related problems: DT.3, DT.5, DT.6, DT.7 and
DT.22, which need to be dealt with first. Therefore,
the data cleaning algorithms or methods designed for
these dirty data types should be firstly applied.
6 CONCLUSIONS AND THE
FUTURE WORK
In this paper, a novel method for dealing with
IMPROVING DATA QUALITY IN DATA WAREHOUSING APPLICATIONS
381
dirty data based on the five data quality dimensions
is presented. We have shown how the new method
builds on and improves existing work on dirty data
types and applies them to five data quality
dimensions. The resulting method can be used by
business to help to solve data quality problems,
especially the Dirty Data Selection problem and
prioritise the expensive process of data cleaning to
maximally benefit their organisations.
Future work will involve the development of a
taxonomy from a dimension angle, further more a
data cleaning tool to deal with dirty data types based
on the proposed method. However, the challenge
remains that how to organize the sequence to deal
with the dirty data types that are identified as well as
selecting suitable methods/algorithms according to
different problem domains.
REFERENCES
Adelman, S., Moss, L., Abai, M. (2005). Data Strategy.
Addison-Wesley Professional.
Elmagarmid, A. K., Ipeirotis, P. G., VeryKios, V. S.
(2007). Duplicate Record Detection: A Survey. .
IEEE Trans. on Knowl. and Data Eng. 19, 1-16.
Fox, C., Levitin, A., Redman, T. (1994). The notion of
data and its quality of dimensions. Information
Processing & Management., vol. 30, no. 1. pp. 9-19
Kim, W., Choi, B., Hong, E. Y., Kim, S. K., Lee, D.
(2003). A taxonomy of dirty data. Data Mining and
Knowledge Discovery, 7,81-99.
Kim, W. (2002). On three major holes in Data
Warehousing Today. Journal of Object Technology,
Vol.1, No.4.
Loshin, D. (2006). Monitoring Data Quality Performance
Using Data Quality Metrics. Retrived January 10,
2010, from http://www.it.ojp.gov/documents/Informa
tica_Whitepaper_Monitoring_DQ_Using_Metrics.pdf
Mong, L. (2000). IntelliClean: A knowledge-based
intelligent data cleaner. Proceedings of the ACM
SIGKDD, Boston, USA.
Müller, H., Freytag, J. C. (2003). Problems, Methods, and
Challenges in Comprehensive Data Cleansing. Tech.
Rep. HUB-1B-164
Oliveira, P., Rodrigues, F. T., Henriques, P., Galhardas, H.
(2005). A Taxonomy of Data Quality Problems.
Second International Workshop on Data and
Information Quality (in conjunction with CAISE'05),
Porto, Portugal.
Rahm, E., Do, H. (2000). Data Cleaning: Problems and
Current Approaches. IEEE Bulletin of the Technical
Committee on Data Engineering. vol.23, 41, No.2.
Wang, R. Y., Strong, D. M. (1996). Beyond Accuracy:
What Data Quality Means to Data Consumers. Journal
of Management Information Systems, 12, 4.
ICEIS 2010 - 12th International Conference on Enterprise Information Systems
382