if data provided by D meets the data requirements.
However, this approach is not suitable when dealing
with large amounts of data, i.e., a large number of
datasets and large volumes of data within each dataset.
In this paper, we propose an approach to
completeness assessment of linked datasets, which is
suitable for large amounts of data. The proposed
approach uses information extracted from Q to
evaluate both schema and data completeness, and it
does not require the execution of time-consuming
queries. To provide a more detailed evaluation, we
propose two distinct types of data completeness:
literal and instance completeness.
To evaluate our approach, we implemented a tool
and performed some experiments. The evaluation
shows that our approach produces results similar to
those of a conventional approach. By a conventional
approach we mean one in which the user submits
queries to each dataset individually (by means of its
endpoint) and analyzes whether that dataset meets
his/her data requirements. This is usually a hard and
time-consuming task.
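As an illustration of the conventional approach just described, the following Python sketch checks each dataset separately, issuing one SPARQL ASK query per required property. It is not taken from the paper: the function names, the coverage ratio and the stubbed query executor are assumptions made for this example, and a real executor would send each query to the dataset's endpoint.

```python
from typing import Callable, Dict, List


def build_ask_query(property_iri: str) -> str:
    """Return a SPARQL ASK query that is true if any triple uses the property."""
    return f"ASK {{ ?s <{property_iri}> ?o }}"


def conventional_check(
    endpoints: List[str],
    required_properties: List[str],
    run_ask: Callable[[str, str], bool],
) -> Dict[str, float]:
    """Query every endpoint once per required property (hypothetical helper).

    Returns, for each endpoint, the fraction of required properties for which
    the ASK query succeeded. The fraction is an illustrative score only, not
    the completeness measure proposed in the paper.
    """
    report: Dict[str, float] = {}
    for endpoint in endpoints:
        hits = sum(
            1
            for prop in required_properties
            if run_ask(endpoint, build_ask_query(prop))
        )
        report[endpoint] = (
            hits / len(required_properties) if required_properties else 0.0
        )
    return report


# Stub data standing in for real endpoints (assumed for this example).
FAKE_DATA = {
    "http://example.org/sparql/a": {"http://xmlns.com/foaf/0.1/name"},
    "http://example.org/sparql/b": set(),
}


def fake_run_ask(endpoint: str, query: str) -> bool:
    """Answer an ASK query from FAKE_DATA instead of contacting an endpoint."""
    return any(f"<{prop}>" in query for prop in FAKE_DATA[endpoint])


if __name__ == "__main__":
    required = [
        "http://xmlns.com/foaf/0.1/name",
        "http://xmlns.com/foaf/0.1/mbox",
    ]
    print(conventional_check(list(FAKE_DATA), required, fake_run_ask))
```

In this sketch the number of queries is the product of the number of datasets and the number of required properties, which is why the conventional approach becomes costly as either grows.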
The remainder of this paper is organized as
follows: Section 2 introduces information quality
concepts; Section 3 presents our approach for linked
datasets completeness assessment; Section 4
describes some experiments performed to evaluate
our proposal; Section 5 discusses related work, and
Section 6 points out some conclusions and indicates
future work.
2 IQ AND THE WEB OF DATA
There has been an exponential growth in the
availability of linked datasets on the Web and in the
development of applications for querying and
consuming these data. As a result, Information
Quality (IQ) is increasingly a necessity rather than
an optional requirement.
The notion of IQ has emerged over the past
years and has attracted steadily increasing interest. IQ is
based on a set of dimensions or criteria. The role of
each one is to assess and measure a specific quality
aspect (Wang and Strong, 1996). In general, IQ
researchers assume that there are some shared norms
of quality, or quality expectations, and ways of
measuring the extent of meeting those norms and
expectations. For our purposes, we use the general
definition of IQ – ‘fitness for use’ – that
encompasses different aspects of quality (Wang and
Strong, 1996; Zaveri et al., 2012).
It is important to distinguish the two concepts of
Data Quality and Information Quality. IQ is a term
to describe the quality of any element or content of
information systems (Wang and Strong, 1996), not
only the data. IQ assurance is the certainty that
particular information meets some quality
requirements. This leads us to adopt a service-based
perspective of quality, which focuses on the
information consumer’s response to his/her task-based
interactions with the information system. The
use of the term information rather than data implies
that the use and delivery of the data must be
considered in any quality judgments, i.e., the quality
of delivered data represents its value to information
consumers (Price and Shanksa, 2005). Thus, we use
the definition of Information Quality as a set of
criteria to indicate the overall quality degree
associated with the information in the system
(Pipino et al., 2002).
One of the best-known classifications of quality
dimensions is presented by Wang and Strong
(1996). They empirically identified fifteen IQ
criteria from the perspective of a set of users. Their
empirical approach analyzed information collected
from the users to determine the characteristics of
useful data for their tasks. The criteria were grouped
into four broad information quality classes: intrinsic,
contextual, representational, and accessibility.
Intrinsic data quality denotes the quality of the data
itself. Contextual data quality emphasizes that data
quality must be considered within the context of the
task at hand, i.e., data must be relevant, timely,
complete and appropriate in terms of amount.
Representational data quality relates to the format
and the meaning of data. Accessibility concerns
whether data are available or obtainable for the
user.
Regarding linked datasets, Zaveri et al. (2012)
compiled a list of data quality criteria applicable to
Linked Data quality assessment. To this end, they
gathered and compared some existing approaches
and grouped them under a common classification
scheme. As a result, they identified a core set of 26
different data quality dimensions, which were
grouped into six classes: contextual, intrinsic,
accessibility, representational, trust and dataset
dynamicity. They argue that these groups are not
strictly disjoint but can partially overlap. Also, the
dimensions are not independent of each other:
correlations exist among dimensions in the same
group or between groups. The contextual, intrinsic,
representational and accessibility dimensions are
defined similarly to those in the work of Wang and Strong