2 DATA QUALITY, DATA QUALITY DIMENSIONS AND DATA QUALITY RULES
2.1 Data Quality
In the literature, data quality is commonly defined as
“fitness for use”, i.e., the ability of data to meet the
user’s requirements. This definition implies that data
quality is relative. For example, an analysis of a
company’s financial position may only require figures
in units of thousands of pounds, while an auditor
requires precision to the pence. It is therefore the
business policy or business rules that determine
whether or not the data is of sufficient quality.
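The relativity of "fitness for use" can be illustrated with a minimal Python sketch. The names `Figure`, `resolution` and `fit_for_use` are hypothetical and introduced only for illustration; the sketch assumes that each value carries the smallest unit to which it was recorded, which the text above does not specify.

```python
from dataclasses import dataclass
from decimal import Decimal


@dataclass(frozen=True)
class Figure:
    """A monetary value together with the unit it was recorded to (assumed metadata)."""
    amount: Decimal      # value in pounds
    resolution: Decimal  # smallest recorded unit, in pounds


def fit_for_use(figure: Figure, required_resolution: Decimal) -> bool:
    """The figure is fit if it was recorded at least as finely as the user requires."""
    return figure.resolution <= required_resolution


# Hypothetical requirements of the two users mentioned in the text.
ANALYST_RESOLUTION = Decimal("1000")  # thousands of pounds
AUDITOR_RESOLUTION = Decimal("0.01")  # pence

# A turnover figure that was rounded to thousands of pounds when recorded.
turnover = Figure(amount=Decimal("1235000"), resolution=Decimal("1000"))

analyst_ok = fit_for_use(turnover, ANALYST_RESOLUTION)
auditor_ok = fit_for_use(turnover, AUDITOR_RESOLUTION)
```

The same stored value satisfies the analyst's requirement but not the auditor's, so its quality depends on the rule applied rather than on the value alone.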
2.2 Data Quality Dimensions
According to Wang and Strong (Wang and Strong,
1996), a data quality dimension is a set of data
quality attributes that represents a single aspect or
construct of data quality. These dimensions measure
data quality from different angles and classify the
measurements into different categories. Among the
data quality dimensions considered by researchers,
four dimensions, namely accuracy, completeness,
consistency and currentness, have been identified as
the dimensions of data quality involving data values
(Fox, Levitin and Redman, 1994). In this paper, these
four dimensions are used for the proposed
classification of dirty data.
2.3 Data Quality Rules
According to Adelman et al., data quality rules can
be categorized into four groups, namely business
entity rules, business attribute rules, data
dependency rules, and data validity rules (Adelman,
Moss and Abai, 2005). Among the four categories,
data validity rules (R1.1~R6.2) govern the quality of
data values. Since the quality dimensions considered
in this paper are all related to data values, only rules
in the data validity category are considered for the
proposed method. Note that data uniqueness rules
belong to the data validity category. Rules R5.1 and
R5.2 address a special data quality problem caused
by duplicate records. Because of the prevalence,
complexity and difficulty of this problem, it has
attracted a large number of researchers (Elmagarmid,
Ipeirotis and Verykios, 2007). Therefore, in addition
to the four data quality dimensions, an extra
dimension, “Uniqueness”, is introduced in the
proposed method to deal exclusively with duplicate
records.
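To make the duplicate-record problem concrete, the following Python sketch flags records whose key fields coincide after simple normalisation. It is not the authors' method, and the wording of rules R5.1 and R5.2 is not given in this section; `normalise`, `find_duplicates` and the sample records are hypothetical. Only exact matches after lower-casing and whitespace clean-up are detected, whereas practical duplicate detection typically also needs approximate matching.

```python
from collections import defaultdict


def normalise(record: dict, key_fields: tuple) -> tuple:
    """Lower-case each key field and collapse runs of whitespace."""
    return tuple(
        " ".join(str(record.get(field, "")).lower().split())
        for field in key_fields
    )


def find_duplicates(records: list, key_fields: tuple) -> list:
    """Return groups of record indices that share the same normalised key."""
    groups = defaultdict(list)
    for index, record in enumerate(records):
        groups[normalise(record, key_fields)].append(index)
    return [indices for indices in groups.values() if len(indices) > 1]


# Hypothetical customer records; the first two describe the same entity.
customers = [
    {"name": "Ann Lee", "city": "Leeds"},
    {"name": "ann  lee ", "city": "LEEDS"},
    {"name": "Bob Ray", "city": "York"},
]

duplicate_groups = find_duplicates(customers, ("name", "city"))
```

Each returned group lists the positions of records that would violate a uniqueness rule defined on the chosen key fields.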
According to David Loshin, it is the assertions
embedded within the business policies that
determine the quality of data (Loshin, 2006).
Business policies can be translated into a set of data
quality rules, each of which can be categorized
under one of the proposed data quality dimensions.
These rules can in turn be used to measure the
occurrence of data flaws. In this paper, dirty data is
defined as those data flaws that break any of the data
quality rules. Since each rule belongs to one of the
data quality dimensions, a relationship between data
quality dimensions and dirty data is established. The
proposed method is based on this idea.
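The idea that dirty data is whatever breaks a rule, and that each rule belongs to a dimension, can be sketched as follows. The `Rule` class, the rule identifiers `X1` and `X2` and the two checks are hypothetical examples chosen for illustration; they are not the data validity rules of Adelman et al., whose definitions are not reproduced in this section.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Rule:
    rule_id: str                    # hypothetical identifier
    dimension: str                  # data quality dimension the rule belongs to
    check: Callable[[dict], bool]   # returns True when the record satisfies the rule


# Two illustrative rules derived from assumed business policies.
RULES = [
    Rule("X1", "Completeness", lambda r: r.get("email") not in (None, "")),
    Rule("X2", "Accuracy",
         lambda r: isinstance(r.get("age"), int) and 0 <= r["age"] <= 130),
]


def violations(record: dict, rules: list) -> list:
    """List (dimension, rule_id) for every rule the record breaks."""
    return [(rule.dimension, rule.rule_id)
            for rule in rules if not rule.check(record)]


def is_dirty(record: dict, rules: list) -> bool:
    """A record is dirty if it breaks at least one rule."""
    return bool(violations(record, rules))


clean_record = {"email": "a@example.org", "age": 41}
dirty_record = {"email": "", "age": 212}

clean_result = violations(clean_record, RULES)
dirty_result = violations(dirty_record, RULES)
```

Because every violation is reported together with its dimension, counting violations per dimension gives a simple measure of how often each kind of data flaw occurs.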
3 DIRTY DATA TYPES
A taxonomy of dirty data provides a better
understanding of data quality problems. Several
taxonomies/classifications of dirty data exist in the
literature (Rahm and Do, 2000; Müller and Freytag,
2003; Kim et al., 2003; Oliveira et al., 2005). Among
these works, Oliveira et al. produced a very complete
taxonomy that identifies 35 distinct dirty data types
(DT.1~DT.35). Since Oliveira et al.’s taxonomy is the
most complete one in the literature, the proposed
method in the next section uses the 35 data quality
problems collected in their work for the mapping.
4 THE PROPOSED METHOD
Table 1: Data quality dimensions and data quality rules.
Data quality dimension    Rule No.
Accuracy                  R2.1~R2.5, R3.1, R4.1~R4.5
Completeness              R1.2, R1.4
Currentness               R3.2
Consistency               R5.5, R6.1, R6.2
Uniqueness                R5.1, R5.2
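As a rough illustration, Table 1 can be encoded as a lookup structure so that the dimension of any listed rule can be retrieved. The names `TABLE_1`, `expand` and `RULE_TO_DIMENSION` are hypothetical, and the sketch assumes that a range such as R2.1~R2.5 denotes the consecutively numbered rules R2.1, R2.2, ..., R2.5.

```python
# Table 1 transcribed as written; "~" marks a range of rule numbers.
TABLE_1 = {
    "Accuracy": ["R2.1~R2.5", "R3.1", "R4.1~R4.5"],
    "Completeness": ["R1.2", "R1.4"],
    "Currentness": ["R3.2"],
    "Consistency": ["R5.5", "R6.1", "R6.2"],
    "Uniqueness": ["R5.1", "R5.2"],
}


def expand(entry: str) -> list:
    """Expand 'R2.1~R2.5' into individual rule numbers (assumes consecutive numbering)."""
    if "~" not in entry:
        return [entry]
    start, end = entry.split("~")
    group, first = start[1:].split(".")
    end_group, last = end[1:].split(".")
    if group != end_group:
        raise ValueError(f"range spans different rule groups: {entry}")
    return [f"R{group}.{n}" for n in range(int(first), int(last) + 1)]


# Reverse lookup: rule number -> data quality dimension.
RULE_TO_DIMENSION = {
    rule: dimension
    for dimension, entries in TABLE_1.items()
    for entry in entries
    for rule in expand(entry)
}
```

With this structure, `RULE_TO_DIMENSION["R5.1"]` yields the dimension under which a violation of rule R5.1 would be counted.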
Having discussed data quality, data quality
dimensions and data quality rules in Section 2,
together with the dirty data set generated based on
Oliveira et al.’s work, a new classification of the
dirty data types is introduced beginning with a
mapping of data quality rules with data quality
ICEIS 2010 - 12th International Conference on Enterprise Information Systems