2.2 Proposed Improvements to Data
Quality Problem Taxonomy
While validating the existing DQ problem taxonomy
(Oliveira P., et al., 2005), we noticed that some
improvements to the taxonomy are necessary.
2.2.1 Identification and Correction of Data
Quality Problems
First of all, for future development it is important
to understand which DQ problem classes have instances
that can be identified and corrected automatically
and which will require partial or full manual
assistance from data operators.
Table 3 summarizes our initial estimates of how
many DQ problem classes can be processed
automatically and how many would require partial or
complete manual assistance.
Table 3: Detection and correction of DQ problem class
instances.

Detection and correction     Number of DQ problem   Number of DQ problem
method                       classes that can be    classes that can be
                             detected               corrected
---------------------------------------------------------------------------
Completely automatically     19                     7
Partially manually (with     4                      8
data operator assistance)
Only manually                12                     20
Total                        35                     35
As expected, detecting a DQ problem is in general
much easier than correcting it. For example, it is
easy to check whether all mandatory data fields have
values, but if a particular field does not, it is in
most cases impossible to guess the missing value
without data operator assistance.
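The detection half of this example can be sketched in a few lines. The sketch below is illustrative only: it assumes records are held as key-value mappings, and the function name, the field names, and the rule that blank strings count as missing are our assumptions, not part of the taxonomy.

```python
from typing import Any, List, Mapping, Sequence


def missing_mandatory_fields(
    record: Mapping[str, Any], mandatory: Sequence[str]
) -> List[str]:
    """Return the names of mandatory fields that have no usable value."""
    missing = []
    for name in mandatory:
        value = record.get(name)
        # Assumption: absent keys, None, and blank strings all count as missing.
        if value is None or (isinstance(value, str) and not value.strip()):
            missing.append(name)
    return missing


# Hypothetical record describing a painting.
record = {"title": "Untitled", "artist": "", "year": None}
print(missing_mandatory_fields(record, ["title", "artist", "year"]))
# prints ['artist', 'year']
```

The check reports which fields are empty, but nothing in the record indicates what the artist or year should be, which is why correction still needs a data operator.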
Only one DQ problem class was identified where
correcting the error is considerably easier than
identifying it. It might be very hard to detect, just
by looking at the data, that two data sources use
different measurement units. However, once it is
established that data source X uses one measurement
unit and data source Y another, values can be easily
transformed using simple arithmetic. For example, a
comparatively simple correspondence exists between
the metric and imperial measurement systems, where
conversion between those systems is done by simple
multiplication.
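As a minimal sketch of the correction step, assume it has already been established (manually) that data source X records lengths in centimetres and data source Y in inches. The variable names and sample values are hypothetical. The factor 2.54 cm per inch is the standard definition of the international inch.

```python
from typing import List

CM_PER_INCH = 2.54  # exact by definition of the international inch


def inches_to_cm(values: List[float]) -> List[float]:
    """Convert lengths from inches to centimetres by simple multiplication."""
    return [v * CM_PER_INCH for v in values]


# Hypothetical measurements from the two sources.
source_x_cm = [10.0, 25.4]   # assumed to be in centimetres already
source_y_in = [4.0, 10.0]    # assumed to be in inches

# After conversion, all values share one measurement unit.
unified = source_x_cm + inches_to_cm(source_y_in)
```

The conversion itself is a single multiplication. The hard part, deciding that source Y uses inches at all, is not something this code can do.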
2.2.2 Modification of the DQ Problem Class
Structure
By analysing how well data operators could
distinguish between different DQ problem classes from
the original taxonomy (Oliveira P., et al., 2005), we
identified the following two DQ problem classes for
which no meaningful data error example could be
provided to separate them: Inadequate value to the
attribute context and Value items beyond the
attribute context. We therefore suggest merging these
two DQ problem classes.
The first class describes cases where a value is
entered into the wrong data field, while the second
describes cases where a data field contains a complex
value, parts of which would more appropriately have
been entered in other data fields. These two errors
represent just a slightly more general case of
Redundancy errors.
A new DQ problem class that does not appear in the
original taxonomy can be proposed: Factual errors.
Such errors may appear in data fields that contain
natural language values and consequently may contain
factual information. An example of this DQ problem
would be a data field containing the value “The
painting is located in the capital of France –
London,” where the statement that London is the
capital of France is clearly a factual error.
The original DQ problem taxonomy considered only
situations where individual errors occur. However, in
real-life scenarios a combination of two or more
different DQ problems might be simultaneously present
even in a single data field.
In fact, a Misspelling error can cause almost every
other kind of DQ problem as well. For example, if a
person’s birth year is misspelled as “19743” instead
of “1974”, this will be both a Misspelling error and
an Interval violation error. Other DQ problem
combinations may exist, such as Syntax error/Set
violation and Set violation/Outdated value.
The fact that DQ problem combinations may exist
requires establishing a certain order in which DQ
problems are identified, in order to minimize the
number of suspected data errors. For example,
correcting misspelling errors first will typically
also resolve other suspected DQ problems
automatically.
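One possible reading of this ordering requirement is a sequence of checks that stops at the first problem found, so that a single root cause is not reported several times. The sketch below illustrates this for a birth year field. The check functions, the four-digit format rule, and the valid range 1900 to 2015 are illustrative assumptions. A Misspelling check is omitted because it would need a reference dictionary or a data operator.

```python
from typing import Callable, List, Optional, Tuple


def is_missing(value: Optional[str]) -> bool:
    return value is None or value.strip() == ""


def has_syntax_error(value: Optional[str]) -> bool:
    # Assumed format for a birth year: exactly four decimal digits.
    return value is not None and not (len(value) == 4 and value.isdecimal())


def violates_interval(value: Optional[str]) -> bool:
    # Assumed valid range; only meaningful when the value is numeric.
    return (
        value is not None
        and value.isdecimal()
        and not (1900 <= int(value) <= 2015)
    )


# Checks listed in the order in which they are applied.
ORDERED_CHECKS: List[Tuple[str, Callable[[Optional[str]], bool]]] = [
    ("Missing value", is_missing),
    ("Syntax error", has_syntax_error),
    ("Interval violation", violates_interval),
]


def all_problems(value: Optional[str]) -> List[str]:
    """Run every check independently and report each one that fires."""
    return [name for name, check in ORDERED_CHECKS if check(value)]


def first_problem(value: Optional[str]) -> Optional[str]:
    """Apply the checks in order and report only the first one that fires."""
    for name, check in ORDERED_CHECKS:
        if check(value):
            return name
    return None


print(all_problems("19743"))   # prints ['Syntax error', 'Interval violation']
print(first_problem("19743"))  # prints Syntax error
print(first_problem("1974"))   # prints None
```

Run independently, the checks raise two suspected errors for the single mistyped value “19743”; applied in order, only one is reported.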
The following order can be proposed in which DQ
problems from the category “an attribute value of a
single tuple” should be processed:
Missing value;
Misspelling error;
Syntax error;
Interval violation;
ICEIS 2015 - 17th International Conference on Enterprise Information Systems