A COMPLETENESS-AWARE DATA QUALITY PROCESSING

APPROACH FOR WEB QUERIES

Sandra de F. Mendes Sampaio

School of Computer Science, University of Manchester, Oxford Road, Manchester, U.K.

Pedro R. Falcone Sampaio

Manchester Business School, University of Manchester, Manchester, U.K.

Keywords: Data Quality, Internet Query Systems, Completeness, Query Processing.

Abstract: Internet Query Systems (IQS) are information systems used to query the World Wide Web by finding data

sources relevant to a given query and retrieving data from the identified data sources. They differ from

traditional database management systems in that data to be processed need to be found by a search engine,

fetched from remote data sources and processed taking into account issues such as the unpredictability of

access and transfer rates, infinite streams of data, and the ability to produce partial results. Despite the

powerful query functionality provided by internet query systems when compared to traditional search

engines, their uptake has been slow partly due to the difficulty of assessing and filtering low quality data

resulting from internet queries. In this paper we investigate how an internet query system can be extended to

support data quality aware query processing. In particular, we illustrate the metadata support, XML-based

data quality measurement method, algebraic query processing operators, and query plan structures of a

query processing framework aimed at helping users to identify, assess, and filter out data regarded as of low

completeness data quality for the intended use.

1 INTRODUCTION

There has been an exponential growth in the

availability of data on the web and in the usage of

systems and tools for querying and retrieving web

data. Despite the considerable advances in search

engines and other internet technologies for

dynamically combining, integrating and collating

web data, supporting a DBMS-like data management

approach across multiple web data sources is still an

elusive goal. To buck this trend, internet query

systems − IQS (Naughton, J., DeWitt, D., Maier, D.,

et al, 2001) are being developed to enable DBMS-

like query processing and data management over

multiple web data sources, shielding the user from

complexities such as information heterogeneity,

unpredictability of data source response rates, and

distributed query execution.

The comprehensive query processing approach

supported by IQS allows users to query a global

information system without being aware of the sites

structure, query languages, and semantics of the data

repositories that store the relevant data for a given

query (Naughton, J., DeWitt, D., Maier, D., et al,

2001). Despite the significant amount of work in the

development of the data integration and distributed

query processing capabilities, internet query systems

still suffer from inadequate data quality control

mechanisms to address the management of quality of

the data retrieved and processed by the IQS. Typical

examples of data quality issues (Olson, J., 2003) that

need to be addressed when supporting quality aware

query processing over multiple web data sources are:

• Accuracy of data: Data can have errors or

inconsistencies in its representation. For example,

the data values “St. Louis” and “Saint Louis” may

not be matched in a join operation between data

sources despite referring to the same address

instance, due to the different representation formats.

• Completeness of data: a data source is regarded

as complete if all information requirements are

modelled and stored in the database. For instance,

data sources fed by online forms with poor data

quality checks and optional fields often give rise to

234

de F. Mendes Sampaio S. and R. Falcone Sampaio P. (2008).

A COMPLETENESS-AWARE DATA QUALITY PROCESSING APPROACH FOR WEB QUERIES.

In Proceedings of the Third International Conference on Software and Data Technologies - ISDM/ABF, pages 234-239

DOI: 10.5220/0001894802340239

 SciTePress

data with low completeness due to several attributes

with null values.

• Timeliness of data: Data can be of poor quality

when it is not timely for the intended use.

Internet query systems support the mediator-

based approach to quality management (Wiederhold,

G., 1992). The mediator-based approach is

applicable in situations where users need to

formulate complex queries encompassing multiple

web data sources for which there is no control over

the data available in a data source and the

infrastructure supporting data source site query

processing (e.g., querying e-Science data sources

[http://www.rcuk.ac.uk/escience]). In the mediator-

based approach, the speed of the internet limits

transmission of relevant data, and users cannot

reconcile and cleanse all necessary data items prior

to query formulation as in data warehouse-based

quality-based integration approaches (Helfert, M.,

and E. von, Maur, 2001), therefore needing dynamic

strategies for managing accuracy, completeness, and

timeliness data quality issues.

In this paper we investigate how an internet

query system can be extended to support a dynamic

data quality aware query processing framework. In

particular, we illustrate the completeness assessment

method, metadata support, algebraic query

processing operators, and query plan structures of a

query processing framework aimed at helping users

to identify, assess, and filter out data regarded as of

low completeness quality for the intended use. The

remainder of the paper is structured as follows.

Section 2 discusses key contributions of the

proposed approach. Section 3 presents how

completeness measures can be associated with XML

data. Section 4 presents the method used to annotate

XML data with completeness information. Section 5

describes the quality aware algebraic query

processing framework. Section 6 provides an

example illustrating how completeness is assessed

during query processing. Section 7 describes related

work. Section 8 summarizes the work.

2 DISCUSSION

The main contribution of this paper is to

demonstrate how existing ideas in the arena of query

processing and quality of information can be

combined and applied in the context of Internet data

processing, to analyse, evaluate and possibly filter

data of unacceptable Completeness quality. The

ideas being reused are: query engine extension, and

data annotation with quality information. The

approach taken in this work suggests Internet data to

be annotated with quality information prior to query

processing, to allow the quality information to be

taken into consideration during query optimisation

and execution. The data annotation can be done

automatically by the mediator system, or by data

providers, if these are cooperative data sources.

In addition to annotating data with quality

information, the query engine of the mediator

system requires extension to enable quality

information to be taken as input, processed along its

corresponding data, and have an impact on the

produced results, which should reflect what the user

requested in his/her query. The proposed framework

is tested over a simple, extensible, and robust

Internet Query System (Naughton, J., DeWitt, D.,

Maier, D., et al, 2001), which receives, as input,

Internet data in XML format and generates, as

output, XML data representing results from

submitted user queries. The query engine is

implemented in a Database Management System

fashion, i.e., as a set of algebraic operators designed

to work together to process XML data. More

specifically, this work shows how the query engine

algebra can be extended with an operator

encapsulating capabilities to deal with information

about the Completeness quality of the data being

processed. The approach of query engine extension

has a number of shortcomings, such as the ones

described as follows: It may increase the complexity

of design and implementation of the engine.

Therefore, modularity and encapsulation are

important features to be taken into consideration

when designing a quality-aware query engine for

any system, always enforcing the idea that data

quality related functionalities must be all

encapsulated within “data quality” operators; it may

be intrusive in the point of view of traditional query

processing, as the addition of new

operators/functions into a pre-existing engine can

demand the creation of new optimisation rules and

heuristics, as well as modification of pre-existing

operators. We believe that a careful query engine

design and implementation can avoid problems of

intrusiveness, as it has been shown in previous work

describing extension of engines to deal with

parallelism in query execution (Graefe, G., 1996); it

may be inflexible or difficult in allowing the user to

build and incorporate into the system his own

definition of quality based on the task he/she has at

hand. Flexibility in allowing user input can be

achieved by providing user interfaces to take user

input or feedback on quality of data.

A COMPLETENESS-AWARE DATA QUALITY PROCESSING APPROACH FOR WEB QUERIES

235

3 MEASURING COMPLETENESS

Completeness is a context-dependent data quality

dimension that refers to “the extent to which data are

of sufficient breadth, depth and scope for the task at

hand” (Wang, R., and S. E., Madnick, 1989). In the

context of a database model, two types of

completeness dimensions are considered: model

completeness and data completeness. Model

completeness refers to the measure of how

appropriate the schema of the database is for a

particular application. Data completeness refers to

the measurable errors of omission observed between

the database and its schema, checking, for example,

if a database contains all entities/attributes specified

in the schema.

Completeness issues arising in database

applications may have several causes, for example,

discrepancies between the intent for information

querying and the collected data, partial capture of

data semantics during data modelling, and the loss of

data resulting from data exchange. Potential

approaches to address completeness issues include

removing entities with missing values from the

database; replacing missing values with default

values, and completing missing values with data

from other sources. Irrespective of the approach

taken to deal with poor data completeness, it is

crucial that database users formulating queries

across multiple data sources are able to judge if a

particular query result is “fit” for its purpose, by

measuring the level of completeness of the result.

4 ANNOTATING XML DATA

FOR COMPLETENESS

To enable quality aware query processing, data

should be annotated with quality information (Wang,

R. Y., Reddy, M. P., and Kon, H. B., 1995).

Annotations describing simple data completeness

information can be done automatically by the

mediator system, as streams of data from remote

data sources are input. The information should

specify the number of tag elements and element

values missing from an XML document, relative to

the expected numbers for the document to be

considered complete. This information can be

obtained by simply counting the numbers, while

parsing and analysing the structure of the document

against its schema description or DTD.

<name>Audi Dealers</name>

<car><model>A6 Avant</model>

</dealer>

<name>Fiat Dealers</name>

<car><model>Cinquecento</model>

<car><model>Multipla</model>

</dealer>

<name>Renauld Dealers</name>

</dealer>

<elem><name>price</name>

</missingElements>

<elem><name>price</name>

</missingValues>

</completeness></dataQuality>

</carDealerInformation>

Figure 4.1: Example of XML document with annotations.

The annotated numbers for missing tag elements and

element values, as well as expected numbers,

represent quality factors that will be taken into

account during query execution. These quality

factors are added into an XML document as (sub)

elements and (sub) element values associated with

other elements specified in the schema or DTD of

the original XML document. An example is

illustrated in Figure 4.1, showing an XML document

describing information about car dealers. Note that

information to be used during query processing to

calculate the completeness of

carDealerInformation

is attached to the original document, following an

initial parsing of the document and its schema. In

this example, both the price element and its value for

car

Idea are missing. It is also missing the price

value for car

Multipla.

5 QUALITY AWARE QUERY

PROCESSING

Figure 5.1 illustrates the quality aware query

processing framework proposed in this work, which

can be implemented as an extension to Internet

Query Systems.

ICSOFT 2008 - International Conference on Software and Data Technologies

236

QUERY

QUERY PROCESSING

QUERY

OPTIMISATION

QUERY

EXECUTION

Query Engine

Query

Search Engine

Query

DATA SOURCE

SELECTION

DATA

FETCHING AND ANNOTATION

Query

esults

ata

Figure 5.1: Query Processing and Data Search in an IQS.

In the case of Niagara, an input XML-based query

expression is initially transformed into two sub-

queries, a search engine query, and a query engine

query. The first is used within the search engine to

select data sources that are relevant to the query.

Once data sources are selected, the process of

fetching data takes place, and streams of data start

flowing from the data sources to the site of the

Internet Query System for data annotation and query

execution. The second sub-query is optimised and

ultimately mapped into a quality aware query

execution plan that contains a special-purpose

operator addressing the annotated completeness

information.

A description of the query engine algebra used to

execute the query plan is detailed in Table 5.1. The

Completeness Algebra is an XML algebra extended

with an operator that encapsulates the capability of

measuring completeness quality of XML data based

on completeness factors annotated on the data. This

operator is called Completeness and it encapsulates

functions for measuring, inserting, and propagating

completeness information in XML data, provided

the data has completeness factors associated with it.

Table 5.1: Completeness algebra.

Logical Operators Description

Scan(inputData) Builds a data structure for each data unit

and passes each structure to the next

operator.

Select(input,pred) Applies a predicate (pred) over the input

and either discards or retains the input

depending on whether pred evaluates to

false or true.

Project(input,listElem) Discards from the input all the elements

that are not specified in listElem.

Join(inputLeft,

inputRight,pred)

Concatenates both inputs, retaining all

their elements, applies a predicate over

the result.

Completeness(input) Updates completeness info at every step

of execution, measures the final

completeness score, attaches the

measure to query results, and displays

the results to the user.

6 QUERY EXAMPLE

Consider the XML document described in Figure

4.1, and the example XML-QL query described in

Figure 6.1, which retrieves the

model and price of

each car offered by Fiat Dealers. Following the input

to the Niagara System, the Query Optimiser

generates the query plan from the query expression

in Figure 6.1, illustrated in Figure 6.2. Note that,

following each operator, there is a Completeness

operator updating the completeness information at

each step of query execution. The query results are

shown in Figure 6.3. Note that only the model and

price for each

Fiat Dealers’ cars appear in the

results, as specified in the query. Therefore, the

measures for model completeness and data

completeness, performed by the Completeness

operator at the root of the plan, are calculated

considering only these two elements. The formulas

used to calculate MC and DC are illustrated in

Figure 6.4. They were derived from the ideas

discussed in (Pipino, L.L., Lee, Y.W. and Wang,

R.Y., 2002), which suggest that a metric to calculate

the completeness score for a relational database can

be formulated using simple ratio. In the simple ratio

method, if the number of relations and attributes that

are missing from the database is divided by the total

number of relations and attributes defined in the

database schema, and the result of that is subtracted

by 1, then what is obtained is a number in a

continuous scale between 0 and 1, that represents the

model completeness score for the database relative

to its schema. To measure data completeness of a

relational database the same method applies, but the

ration in this case should be between the number of

missing attribute instances and the expected number

of attribute instances. Within the continuous scale, 1

represents the highest completeness measure and is

appropriate for data complying with the most strict

completeness threshold, and 0 represents the lowest

model completeness measure, appropriate for data

that are unacceptable from the model completeness

perspective.In Figure 6.2, there is a sequence of 2

pairs (Scan ,Completeness) operators, omitted for

space limitations. Each of the Scan operators in the

sequence unnests a level of nested elements, by

attaching a copy of each unnested element (and its

sub-elements) to the input tuple. For example, the

first Scan unnests the

<dealer> elements, which are

sub-elements to

<carDealerInformation>. The

second Scan unnests the

<name> and <car> elements,

which are sub-elements to

<dealer>. The Construct

operator is the physical counterpart to the Project

A COMPLETENESS-AWARE DATA QUALITY PROCESSING APPROACH FOR WEB QUERIES

237

operator described in Table 5.1. It projects elements

and builds a structure to hold query results.

Table 6.1 illustrates the functionality of the

Completeness operator at each execution step of the

example query. The first Completeness operator (the

one following the first Scan operator) receives the

original Completeness information from the data

sources. Then it creates a copy of the information,

attaches it to the extended layer of elements

unnested by Scan, and updates the information. The

updated information relates to the unnested

elements.

WHERE <carDealerInformation>

</></></>

IN "*" conform_to "file: completeness.dtd",$v14 =

"Fiat Dealers"

CONSTRUCT

Figure 6.1: Example query.

The second and third Completeness operators

behave in a similar way, copying and updating the

input Completeness information according to the

changes made by the previous operator. The last

Completeness operator follows Construct. It updates

the number of elements and values projected by

Construct, and, also, calculates the measures of MC

and DC for the input document relative to the

example query.

Completeness

Construct

Completeness

Select

…

Completeness

Scan

Figure 6.2: Query plan for example query.

<model>Cinquecento</>

</>

<model>Multipla</>

</>

Figure 6.3: Example query results.

MC = 1–[(num of missing elements) / (num of

elements)]

DC = 1–[(num of missing values) / (num of values)]

Figure 6.4: Formulas used within the Completeness

operator, to calculate model completeness (MC) and data

completeness (DC).

Table 6.1: Behaviour of Completeness Operators.

Information updated by

Completeness Operator

numberElements 21

numberMissingElements 1

nameMissingElem <price>

numberTimes 1

numberValues 14

numberMissingValues 2

nameElemMissingValue <price>

numberTimes 2

Information updated by

Last Completeness Operator

numberElements 6

numberMissingElem 2

nameMissingElem <price>

numberTimes 1

nameMissingElem <model>

numberTimes 1

numberValues 6

numberMissingValues 3

nameElemMissingValue <price>

numberTimes 2

nameElemMissingValue <model>

numberTimes 1

MC 0.66

DC 0.50

7 RELATED WORK

In (Mecella, M., Scannapieco, M., Virgillito, A.,

Baldoni, R., Catarci, T., and Batini, C., 2003) an

approach for data quality management in

Cooperative Information Systems is described. The

architecture has as its main component a Data

Quality Broker, which performs data requests on all

cooperating systems on behalf of a requesting

system. The system, however, does not adopt an

algebraic query processing framework and is not

built on top of a mainstream IQS. In (Naumann, F.,

Lesser, U., and Freytag, J., 1999), data quality is

incorporated into schema integration by answering a

global query using only queries that are classified as

high quality and executable by a subset of the data

sources. This is done by assigning quality scores to

ICSOFT 2008 - International Conference on Software and Data Technologies

238

queries based on previous knowledge about the data

to be queried, considering quality dimensions such

as completeness, timeliness and accuracy. The

described approach, however, does not use XML as

the canonical data model and does not address

physical algebraic query plan implementation issues.

8 CONCLUSIONS AND FUTURE

WORK

With the ubiquitous growth, availability, and usage

of data on the web, addressing data quality

requirements in connection with web queries is

emerging as a key priority for database research

(Gertz, M., Ozsu, T., Saake, G., and Sattler, K.,

2003). There are two established approaches for

addressing data quality issues relating to web data:

data warehouse-based, where relevant data is

reconciled, cleansed and warehoused prior to

querying; and mediator-based where quality metrics

and thresholds relating to cooperative web data

sources are evaluated “on the fly” at query

processing and execution time. In this paper we

illustrate the query processing extensions being

engineered into the Niagara internet query system to

support mediator-based quality aware query

processing for the completeness data quality

dimension. We are also addressing the timeliness

dimension (Sampaio, S. F. M., Dong, C., and

Sampaio, P. R. F, 2005) and extending SQL with

data quality constructs to express data quality

requirements (Dong, C., Sampaio, S. F. M., and

Sampaio, P. R. F., 2006). The data quality aware

query processing extensions encompass metadata

support, an XML-based data quality measurement

method, algebraic query processing operators, and

query plan structures of a query processing

framework aimed at helping users to identify, assess,

and filter out data regarded as of low completeness

data quality for the intended use. As future plans we

intend to incorporate accuracy data quality support

into the framework and benchmark the quality/cost

query optimiser in connection with a health care

application (Dong, C., Sampaio, S. F. M., and

Sampaio, P. R. F., 2005).

REFERENCES

Naughton, J., DeWitt, D., Maier, D., et al, 2001. The

Niagara Internet Query System. IEEE Data Eng. Bull.

24(2), 27-33.

Olson, J., 2003. Data Quality: the Accuracy Dimension,

Morgan Kauffmann. 1st edition.

http://www.rcuk.ac.uk/escience. The UK e-Science

Programme.

Wiederhold, G., 1992. Mediators in the Architecture of

Future Information Systems. IEEE Computer 25(3).

Helfert, M., and E. von, Maur, 2001. A Strategy for

Managing Data Quality in Data Warehouse Systems.

In Proc. of Information Quality, 62-76.

Wang, R., and S. E., Madnick, 1989. The Inter-Database

Instance Identification Problem in Integrating

Autonomous Systems. Proc. of ICDE, 46-55.

Wang, R. Y., Reddy, M. P., and Kon, H. B., 1995. Toward

Quality Data: An Attribute-Based Approach. Decision

Support Systems, 13(3-4), 349-372.

Sampaio, S. F. M., Dong, C., and Sampaio, P. R. F, 2005.

Incorporating the Timeliness Quality Dimension in

Internet Query Systems. WISE 2005 Workshops,

LNCS 3807, 53-62.

Dong, C., Sampaio, S. F. M., and Sampaio, P. R. F., 2006.

Expressing and Processing Timeliness Quality Aware

Queries: The DQ2L Approach. International

Workshop on Quality of Information Systems, ER

2006 Workshops, LNCS 4231, 382-392.

Naumann, F., Lesser, U., and Freytag, J., 1999. Quality-

driven Integration of Heterogeneous Information

Systems. In Proc. of the 25th VLDB, 447-458.

Mecella, M., Scannapieco, Et. Al.. The DaQuinCIS

Broker: Querying Data and Their Quality in

Cooperative Information Systems. LNCS 2800.

Dong, C., Sampaio, S. F. M., and Sampaio, P. R. F., 2005.

Building a Data Quality Aware Internet Query System

for Health Care Applications. In Proceedings of IRMA

Conference - Databases Track, San Diego, USA.

Graefe, G., 1996. Iterators, Schedulers, and Distributed-

memory Parallelism. In Software, Practice and

Experience, 26(4), 427-452.

Gertz, M., Ozsu, T., Saake, G., and Sattler, K., 2003. Data

Quality on the Web. Germany, Dagstuhl Seminar.

Pipino, L.L., Lee, Y.W. and Wang, R.Y., 2002. Data

Quality Assessment. CACM(45),4 (virtual extension).

A COMPLETENESS-AWARE DATA QUALITY PROCESSING APPROACH FOR WEB QUERIES

239