the DW design quality. Section 5 presents an
experimental evaluation demonstrating the
effectiveness of the proposed indexes in supporting the
DW design process. Finally, Section 6 concludes the
paper by discussing ongoing and future work.
2 RELATED WORK
In the literature, several researchers have focused on data quality in operational systems, and a number of different definitions and methodologies have been proposed, each characterized by different quality metrics. Although Wang and Strong (1996a) and Redman (1996) proposed a wide range of metrics that have become the reference models for data quality in operational systems, most works in the literature refer only to a limited subset of them (e.g., accuracy, completeness, consistency and timeliness).
Literature reviews, e.g., (Wang et al., 1995), have highlighted that there is no general agreement on data quality metrics; for example, timeliness has been defined by some researchers in terms of whether the data are out of date (Ballou and Pazer, 1985), while other researchers use the same term to denote the availability of output on time (Kriebel, 1978; Scannapieco et al., 2004; Karr et al., 2006). Moreover, some of the proposed metrics, called subjective metrics (Wang and Strong, 1996a), e.g., interpretability and ease of understanding, require an evaluation by the final users through questionnaires and/or interviews, and are therefore more suitable for qualitative evaluations than for quantitative ones.
Other researchers have focused on proposing automatic methods for conceptual schema development and evaluation. Some of the proposed approaches, e.g., (Phipps and Davis, 2002), also allow user input to be exploited to refine the obtained result.
An alternative category of approaches employs statistical techniques for assessing data quality. For example, the analysis of data distributions can provide useful information on data quality. In this context, an interesting work is presented in (Karr et al., 2006), where a statistical approach is evaluated on two real DBs.
A different category of techniques for assessing data quality concerns Cooperative Information Systems (CISs). In this context, the DaQuinCIS project (Scannapieco et al., 2004) proposed a methodology for quality measurement and improvement in CISs. The methodology is primarily based on the premise that CISs are characterized by high data replication, i.e., different copies of the same data are stored by different organizations. From a data quality perspective, this feature offers the opportunity to evaluate and improve data quality by comparing the different copies.
With respect to the above solutions, we aim at proposing a semantics-independent methodology that measures objective features of the data in order to derive information useful both for supporting the selection of DW measures and dimensions and for evaluating the final quality of the DW design choices. For this purpose, we have defined a set of metrics measuring different statistical and syntactical characteristics of the data. It is important to highlight that our goal is not to propose an alternative technique for the DW design process, but to present a methodology that, coupled with other types of solutions, e.g., (Golfarelli et al., 1998), can effectively drive the DW design choices. For example, it can be used to guide attribute selection in the case of alternative choices (i.e., redundant information).
3 PROPOSED INDEXES
Considering the whole set of definitions and metrics that have been proposed in the literature for assessing the data quality of an operational DB, we identified relevance and value added, proposed by Wang and Strong (1996a), as the most appropriate concepts for our analysis. Indeed, we are interested in identifying the attributes of a given DB that store relevant information and that could add value to decision processes. For example, an attribute characterized by null values does not provide any added value from the data analysis point of view: it does not enhance the informative content of the DW or the quality of the derived decisions.
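As a minimal illustrative sketch (not the implementation used in the paper), the following Python fragment estimates such a null-value ratio for each attribute of a table through an SQLite connection; the connection, table and attribute names are hypothetical:

import sqlite3

def null_ratio_per_attribute(conn, table):
    # Fraction of NULL values for each attribute of `table`.
    # An attribute whose ratio is close to 1.0 carries little
    # informative content and is a poor candidate as a DW
    # measure or dimension.
    columns = [d[0] for d in
               conn.execute("SELECT * FROM %s LIMIT 0" % table).description]
    total = conn.execute("SELECT COUNT(*) FROM %s" % table).fetchone()[0]
    ratios = {}
    for col in columns:
        non_null = conn.execute(
            "SELECT COUNT(%s) FROM %s" % (col, table)).fetchone()[0]
        ratios[col] = 1.0 - non_null / total if total else 1.0
    return ratios

# Example usage (hypothetical operational DB and table):
# conn = sqlite3.connect("operational.db")
# null_ratio_per_attribute(conn, "orders")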
Although the selection of DB tables and attributes is primarily guided by semantic considerations, the designer can greatly benefit from the availability of syntactical and statistical information. For example, in the presence of alternative choices, the designer can select the attribute characterized by the most desirable features; conversely, the designer can revise a design choice upon discovering that the selected attribute is characterized by undesirable features.
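As an example of how such information could guide the choice between redundant candidates, the sketch below (again with hypothetical table and attribute names, and not taken from the paper) profiles an attribute by its null-value and distinct-value ratios, so that the candidate with the more desirable features can be preferred:

import sqlite3

def attribute_profile(conn, table, column):
    # Simple statistical/syntactical profile of a single attribute:
    # fraction of NULL values and fraction of distinct values
    # among the non-NULL ones.
    total = conn.execute("SELECT COUNT(*) FROM %s" % table).fetchone()[0]
    non_null = conn.execute(
        "SELECT COUNT(%s) FROM %s" % (column, table)).fetchone()[0]
    distinct = conn.execute(
        "SELECT COUNT(DISTINCT %s) FROM %s" % (column, table)).fetchone()[0]
    return {"null_ratio": 1.0 - non_null / total if total else 1.0,
            "distinct_ratio": distinct / non_null if non_null else 0.0}

# Comparing two redundant candidates for a customer-location dimension:
# conn = sqlite3.connect("operational.db")
# attribute_profile(conn, "customers", "country_code")
# attribute_profile(conn, "customers", "country_name")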
For evaluation purposes, we identified a set of
indexes referring to the following types of DB
elements: