QTrail-DB: A Query Processing Engine for Imperfect Databases with

Evolving Qualities

Maha Asiri and Mohamed Y. Eltabakh

Computer Science Department, Worcester Polytechnic Institute (WPI), MA, U.S.A.

Keywords:

Imperfect Database, Data’s Quality, Quality Propagation, Query Optimization.

Abstract:

Imperfect databases are very common in many applications due to various reasons ranging from data-entry

errors, transmission errors, and wrong instruments’ readings, to faulty experimental setups leading to incorrect

results. The management and query processing of imperfect databases is a very challenging problem that requires

incorporating the data’s qualities within the database engine. Even more challenging, the qualities are not

static and may evolve over time. Unfortunately, most of the state-of-the-art techniques deal with the data quality

problem as an ofﬂine task. In this paper, we propose the “QTrail-DB” system that introduces a new quality

model based on the new concept of “Quality Trails”, which captures the evolution of the data’s qualities

over time. QTrail-DB extends the relational data model to incorporate the quality trails within the database

system. We propose a new query algebra, called “QTrail Algebra”, that enables transparent propagation and

derivations of the data’s qualities within a query pipeline. QTrail-DB is developed within PostgreSQL and

experimentally evaluated using real-world datasets to demonstrate its efﬁciency and practicality.

1 INTRODUCTION

In most modern applications it is almost a fact that

the working databases may not be perfect and may

contain low-quality data records (Batini and Scanna-

pieco, 2006; Rahm and Do, 2016). The presence of

such low-quality data is due to many reasons includ-

ing missing or wrong values, redundant information,

human errors, or network transmission errors. A sci-

ence survey has revealed that 80.3% of the partici-

pant research and scientiﬁc groups have admitted that

their working databases contain records of low qual-

ity, which puts their analysis and explorations at risk

(Twombly, 2011). Moreover, a recent IBM report

found that poor data quality costs the US economy around $3 trillion per year. This includes

direct costs as well as indirect costs (IBM, 2021).

Even more challenging, the qualities of the data

tuples are typically not static; they may change over

time (evolve) depending on various events taking

place in the database. The emerging scientiﬁc ap-

plications are excellent examples in which tracking

and maintaining the data’s qualities is of utmost im-

portance. For example, Figure 1 illustrates a possible

sequence of operations that may take place in biolog-

ical databases. First, a data tuple r (e.g., a gene tuple)

can be imported from an external source to the local

[Figure 1: Database tuples with Evolving Qualities over Time. The figure shows a sequence of operations applied to tuple r in DB along the time dimension: (1) importing tuple r from Source S, where r’s initial quality depends on S’s credentials; (2) a scientist inserting a comment in DB indicating a possible wrong value in r; (3) performing a comparison with other repositories to validate the data, where tuple r did not match the repository and r’s quality is decreased further; (4) a scientist updating tuple r and fixing the error; (5) a scientific article related to and supporting r’s content being added to DB, after which r’s quality increases.]

database. At that time, r would be assigned an ini-

tial quality score depending on the source’s credibil-

ity. Then, a scientist may insert a comment highlight-

ing a possible error in the tuple (e.g., the gene’s start

position does not seem correct), based on which r’s

Asiri, M. and Eltabakh, M.

QTrail-DB: A Query Processing Engine for Imperfect Databases with Evolving Qualities.

DOI: 10.5220/0012081200003541

In Proceedings of the 12th International Conference on Data Science, Technology and Applications (DATA 2023), pages 295-302

ISBN: 978-989-758-664-4; ISSN: 2184-285X

 2023 by SCITEPRESS – Science and Technology Publications, Lda. Under CC license (CC BY-NC-ND 4.0)


quality should be decreased. After a while, a veriﬁca-

tion step that compares the local data with an external

repository may conﬁrm that r contains an incorrect

value, which will further decrease r’s quality. Subse-

quent actions in the database may either increase or

decrease r’s quality over time, e.g., Steps 4 and 5 in

Figure 1, which are an update operation on r (e.g.,

correcting the gene’s start position), and the addition

of a scientiﬁc article matching r’s new content, re-

spectively, should both enhance r’s quality. In gen-

eral, each tuple in the database may have its quality

changing over time based on different operations tak-

ing place in the database.

In such imperfect databases with dynamic and

evolving qualities over time, the standard query pro-

cessing that treats all tuples the same while ignoring

their qualities is indeed a very limited approach. For

example, several interesting and challenging ques-

tions may arise beyond the standard data querying,

which include:

1. What was the quality of tuple r before the last re-

vision?

2. Why did r’s quality drop drastically at time t,

and what did we do to ﬁx that?

3. Given my complex query, e.g., involving selection, joins, grouping and aggregation, and set operators, what is the quality of each output tuple? Can I

trust the results and build further analysis on them

or not?

Certainly, supporting these types of questions is of

critical importance to end-users and high-level appli-

cations. It warrants the need for fundamental changes

in the underlying DBMS. In this paper, we propose

the “QTrail-DB” system, an advanced query pro-

cessing engine for imperfect databases with evolv-

ing qualities. We identify two major tasks to be ad-

dressed, which are:

Task 1−Systematic Modeling of Evolving Quali-

ties: With the large scale of modern databases, even a

very small percentage of low-quality data may trans-

late to a very large number of low-quality records.

This makes managing the data’s qualities manually a very challenging and time-consuming process. Therefore, the underlying database engine

must be able to capture and model the data qualities

in a systematic way, and also keep track of their evo-

lution over time (Refer to Questions 1 & 2).

Task 2−Quality Propagation and Assessment of

Query Results: Collecting and generating data of various degrees of quality is a continuous process, possibly interleaved with offline efforts to verify and fix the imperfect tuples. Therefore, it is unavoidable to query the data while it contains tuples of different qualities. Each tuple r in the output results should have an inferred and derived quality based on the input tuples that contributed to r’s computation (Refer to Question 3).

QTrail-DB proposes a full integration of the data’s

qualities into all layers of a DBMS. This integra-

tion includes introducing a new quality model that

captures the evolving qualities of each data tuple

over time, called a “Quality Trail” and proposing

a new relational algebra, called “QTrail Algebra”,

that enables seamless and transparent propagation

and derivations of the data’s qualities within a query

pipeline.

The key contributions of this paper are summa-

rized as follows:

• Proposing the “QTrail-DB” system that treats

data’s qualities as an integral component within

relational databases. In contrast to existing re-

lated work, QTrail-DB is the ﬁrst to quantify

and model the data’s qualities, and fully integrate

them within the data processing cycle. (Section 2)

• Introducing a new quality model based on the new

concept of “quality trails” that captures the evolv-

ing quality history of each data tuple over time

and a new query algebra, called “QTrail Algebra”

that extends the semantics of the standard query

operators to manipulate and propagate the quality

trails. (Sections 3 and 4)

• Developing the QTrail-DB prototype system

within the PostgreSQL engine, and evaluating its

performance using real-world biological datasets.

(Sections 5 and 6)

2 RELATED WORK

Due to its critical importance, data quality has been

extensively studied in the literature. The research threads most related to our work are the following.

Cleaning and Repairing Techniques: A main thread

of research is on data cleaning, repairing, and cleans-

ing, where potential low-quality data records are iden-

tiﬁed, and then ﬁxed. The underlying techniques in

these systems vary significantly, ranging from fully-automated heuristics-based techniques, comparisons with external sources and repositories, and rule-driven

techniques, to human-in-the-loop mechanisms. With

the variety of algorithms and techniques for data

cleaning, several extensible and generic frameworks

have been proposed to integrate these algorithms,

e.g., (Dallachiesa et al., 2013). The common theme

in all of these systems is that they all work in total

isolation from query processing.

DATA 2023 - 12th International Conference on Data Science, Technology and Applications


Quality Assessment Techniques: On the other hand,

very little attention is given to quality assessment

at query time. It has been addressed in the con-

text of mining operations, sensor data, and relational

databases. The core of these techniques is based

on statistical assumptions about the underlying data.

And then, each technique studies its domain-speciﬁc

operations and how they affect the statistical mea-

sures.

A major limitation in these systems is that the assumed statistics may not be available in many applications. For example, the work in (Ballou et al., 2006), which is the most related to QTrail-DB, assumes that the probability of error in each column in the database is known in advance, which is not the case in many applications. And even if this knowledge is available, it is coarse-grained knowledge over an entire column and is not tied to specific tuples.

Uncertain and Probabilistic Databases: Another

big area of research is focusing on uncertain and

probabilistic databases (Galindo et al., 2006; Widom,

2005a). In these systems, a given data value can be

uncertain, and hence it is represented by a possible

set of values, a probability distribution function over

a given range, or a probability of actual presence. In

uncertain databases, the query engine is extended to

operate on these uncertain values and tuples, and en-

force correct semantics (called “possible worlds”).

Although uncertainty is related to data qualities in

some sense, these systems are fundamentally differ-

ent from QTrail-DB since the notion of “quality” is

not part of these systems.

3 QTrail-DB DATA & QUALITY

MODELS

QTrail-DB has an extended data model, where each

data tuple carries a “quality trail” encoding the evolv-

ing quality of this tuple. More formally, for a given

relation R having n data attributes, each data tuple $r \in R$ has the schema $r = \langle v_1, \ldots, v_n, Q_r \rangle$, where $v_1, \ldots, v_n$ are the data values of r, and $Q_r$ is r’s quality trail. $Q_r$ is a vector of the form $Q_r = \langle q_1, \ldots, q_m \rangle$, where each point $q_i$ is a quality transition defined as follows.

Deﬁnition 3.1 (Quality Transition). A quality tran-

sition represents a change in a tuple’s quality and

it consists of a 4-ary vector ⟨score, timestamp, trig-

geringEvent, statistics⟩, where “score” is a quality

score ranging between 1 (the lowest quality) and

MaxQuality (the highest quality), “timestamp” is the

time at which the score becomes applicable, “trig-

geringEvent” is a text ﬁeld describing the event that

[Figure 2: Example of r’s Quality Trail Corresponding To Operations in Figure 1. The trail plots the quality level over time, with one transition per event: E1 (insertion event), E2 (new comment record indicating a wrong value), E3 (failed comparison with external repository), E4 (update event correcting the wrong value), and E5 (new article added supporting the tuple’s content). Auxiliary information, e.g., the triggering event and other statistics, is attached to each transition.]

triggered this quality transition, and “statistics” ﬁeld

contains various statistics that will be maintained

and updated during query processing. Only “score”,

and “timestamp” are mandatory ﬁelds, while “trig-

geringEvent”, and “statistics” are optional ﬁelds.

Since r’s quality is evolving over time, the length of the vector $Q_r$ is also increasing over time by the addition of new transitions (Refer to Figure 2). The quality trail is formally defined as follows.

Definition 3.2 (Quality Trail). A quality trail of a given tuple $r \in R$ is denoted as $Q_r$ and is represented as a vector of quality transitions. The transitions in $Q_r$ are chronologically ordered, i.e., for all i, $Q_r[i].timestamp < Q_r[i+1].timestamp$. Moreover, the quality transitions have a stepwise changing pattern, i.e., $Q_r[i]$ is the valid transition over the time period $[Q_r[i].timestamp, Q_r[i+1].timestamp)$.

Referring to the data tuple r from Figure 1, its cor-

responding quality trail is depicted in Figure 2. With

each of the actions highlighted in Figure 1, r’s quality

trail will change (evolve) from the L.H.S (the inser-

tion time) to the R.H.S (the current time). Each point

in the quality trail is a quality transition. For example,

at time $t_i$, a new quality transition is added to the trail consisting of: $\langle 4, t_i, \text{“updating a wrong value”}, \{\ldots\} \rangle$. This transition remains valid (the most recent one) until time $t_{i+1}$ when a new transition is added. The statistics field and its usage will be discussed in more detail in Section 4.
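Definitions 3.1 and 3.2 can be illustrated with a small in-memory sketch. This is only an illustration under stated assumptions, not the QTrail-DB storage layer: the class and method names (`QualityTransition`, `QualityTrail`, `add_transition`, `active_at`) are hypothetical, `MAX_QUALITY = 5` is an assumed value for MaxQuality, timestamps are plain numbers, and all scores in the usage example except the 4 of the update event are made up.

```python
from bisect import bisect_right
from dataclasses import dataclass, field
from typing import Optional

MAX_QUALITY = 5  # assumption: the text only names this bound "MaxQuality"


@dataclass
class QualityTransition:
    score: int                               # mandatory
    timestamp: float                         # mandatory
    triggering_event: Optional[str] = None   # optional
    statistics: Optional[dict] = None        # optional

    def __post_init__(self):
        if not 1 <= self.score <= MAX_QUALITY:
            raise ValueError("score must lie between 1 and MaxQuality")


@dataclass
class QualityTrail:
    transitions: list = field(default_factory=list)

    def add_transition(self, tr: QualityTransition) -> None:
        # Definition 3.2: timestamps are strictly increasing.
        if self.transitions and tr.timestamp <= self.transitions[-1].timestamp:
            raise ValueError("transitions must be chronologically ordered")
        self.transitions.append(tr)

    def active_at(self, t: float) -> Optional[QualityTransition]:
        # Stepwise pattern: transition i is valid on [ts_i, ts_{i+1}).
        idx = bisect_right([x.timestamp for x in self.transitions], t) - 1
        return self.transitions[idx] if idx >= 0 else None


# Usage: a trail loosely following the five events of Figure 1.
trail = QualityTrail()
trail.add_transition(QualityTransition(3, 1.0, "insertion"))
trail.add_transition(QualityTransition(2, 2.0, "comment indicating a wrong value"))
trail.add_transition(QualityTransition(1, 3.0, "failed comparison"))
trail.add_transition(QualityTransition(4, 4.0, "updating a wrong value"))
trail.add_transition(QualityTransition(5, 5.0, "supporting article added"))
print(trail.active_at(4.5).score)  # 4
```

A lookup such as `active_at` is one way to answer a question like "what was the quality of tuple r before the last revision?", since the stepwise pattern makes the valid transition at any past time well defined.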

4 QUALITY PROPAGATION AND

ASSESSMENT OF QUERY

RESULTS

In this section, we present the extended query pro-

cessing engine of QTrail-DB for propagating the qual-

ity trails within a query plan. We propose a new SQL


Merge Operator Ω(Q_1, …, Q_n)
Output:
    - Quality trail Q_out    // Initially has no transitions
1.  Position a sweep line L at the left-most transition in Q_1, …, Q_n
2.  While (more transitions are available) Do
3.      Move L right to the next transition at time t
4.      S = {s_1, s_2, …, s_n}    // Set of active transitions at t from Q_1, …, Q_n
5.      s_out    // A new output transition
6.      s_out.score = Min(S.s_i.score), 1 ≤ i ≤ n
7.      s_out.timestamp = t
8.      s_out.statistics = StatsCombine(S.s_i.statistics), 1 ≤ i ≤ n
9.      s_out.triggeringEvent = Null
10.     Q_out.addTransition(s_out)
11. End While
12. Return Q_out

Order Operator χ(Q_1, …, Q_n)
Output:
    - sortedList: Sorted list of quality trails (Initially empty)
- Call function SortFn(Q_1, …, Q_n)
- Return sortedList    // highest quality inserted first

Function SortFn(S: Set of quality trails)
1.  Sort S descending based on the most-recent transition’s score values
2.  Divide S into groups having the same score value (still descending)
3.  For (each group g in S (in the sorted order)) Loop
4.      If (size of g = 1) Then
5.          - Output the quality trail in g to sortedList
6.      Else
7.          For (each quality trail Q_i in g having no more transitions) Loop
8.              - Output Q_i to sortedList & remove from g
9.          End For
10.         If (more quality trails exist in g) Then
11.             - Trim the current active transition from each quality trail
12.             - Recursively call SortFn(g’s quality trails)
13.         End If
14. End For

Figure 3: Pseudocode of the Merge Ω Operator.

algebra, called “QTrail Algebra”, in which the stan-

dard query operators have been extended to seam-

lessly manipulate the quality trails associated with

each tuple. In this section, we assume the quality

trails have been created and maintained (The focus of

Section 5), and thus we will focus now on the query-

time propagation.
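Besides the Merge operator, Figure 3 also lists an Order operator χ that sorts quality trails with the highest quality first. Its recursive SortFn can be sketched in Python as below. This is a hedged reading of the pseudocode, not the QTrail-DB implementation: the names `order_trails` and `sort_fn` are hypothetical, each trail is reduced to its list of scores ordered from oldest to newest, and "having no more transitions" is assumed to mean that the trail has no earlier transition left once the current one is trimmed.

```python
from itertools import groupby


def order_trails(trails: dict[str, list[int]]) -> list[str]:
    """Return trail names sorted with the highest quality first."""

    def sort_fn(items: list[tuple[str, list[int]]]) -> list[str]:
        result = []
        # Step 1: sort descending on the most-recent transition's score.
        items = sorted(items, key=lambda it: it[1][-1], reverse=True)
        # Steps 2-3: visit groups of equal score in the sorted order.
        for _, grp in groupby(items, key=lambda it: it[1][-1]):
            g = list(grp)
            if len(g) == 1:
                result.append(g[0][0])
                continue
            # Steps 7-8: trails with no further transitions are output first.
            result.extend(name for name, scores in g if len(scores) == 1)
            # Steps 10-12: trim the active transition and recurse on the rest.
            rest = [(name, scores[:-1]) for name, scores in g if len(scores) > 1]
            if rest:
                result.extend(sort_fn(rest))
        return result

    return sort_fn([(name, list(scores)) for name, scores in trails.items()])


# Usage with made-up trails: a, b, and c tie on their current score of 5.
example = {"a": [3, 5], "b": [4, 5], "c": [5], "d": [2, 4]}
print(order_trails(example))  # ['c', 'b', 'a', 'd']
```

In this reading, ties on the current score are broken by looking one transition further back in each trail, which is why `b` (previously 4) precedes `a` (previously 3).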

Fortunately, in the provenance literature, the propagation semantics of the tuples’ lineage under the different operators is a well-studied problem. Specifically, we use the same semantics as in the Trio system (Widom, 2005b). Therefore, after each algebraic transformation, we can track the input tuples contributing to a specific output tuple without re-inventing the wheel. Yet, the unsolved challenge is how to translate this knowledge into derivations over the quality trails. In the following, we study the semantics of deriving the quality trail of each output record from its contributing input records.

Selection Operator ($\sigma_p(R)$): The operator applies data-based selection predicates p over relation R, and reports the qualifying tuples. Predicates p reference only the data values $v_1, \ldots, v_n$ within the tuples. The extension to the selection operator is straightforward since the content of the qualifying tuples does not change, and thus the output quality trails remain unchanged. The algebraic expression is: $\sigma_p(R) = \{ r = \langle v_1, \ldots, v_n, Q_r \rangle \in R \mid p(r) = \text{True} \}$

Projection Operator $\pi_{a_1, \ldots, a_k}(R)$: In QTrail-DB, the quality trails are at the tuple level, and not tied to specific attribute(s) within the tuple. Therefore, the projection operator will not change the quality of its input tuples. That is: $\pi_{a_1, \ldots, a_k}(R) = \{ r' = \langle a_1, \ldots, a_k, Q_r \rangle \} \ \forall\, r \in R$.
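The selection and projection semantics above amount to passing each tuple's quality trail through untouched. A minimal Python sketch of that behavior follows; the `Row` layout, the function names `select` and `project`, and the sample trails are assumptions made for illustration only.

```python
from typing import Any, Callable

# Assumed layout: a row is (values, trail), where `values` maps attribute
# names to data values and `trail` is an opaque quality-trail object.
Row = tuple[dict[str, Any], Any]


def select(relation: list[Row], p: Callable[[dict[str, Any]], bool]) -> list[Row]:
    # The predicate sees only the data values; the trail is passed through as-is.
    return [(values, trail) for values, trail in relation if p(values)]


def project(relation: list[Row], attrs: list[str]) -> list[Row]:
    # Trails are tuple-level, so dropping attributes leaves them unchanged.
    return [({a: values[a] for a in attrs}, trail) for values, trail in relation]


# Usage with made-up trails, written as lists of (timestamp, score) pairs.
trail_a = [(1.0, 4), (2.0, 3)]
trail_b = [(1.5, 5)]
genes = [
    ({"ID": "JW0335", "Name": "lacZ"}, trail_a),
    ({"ID": "JW4778", "Name": "cyaA"}, trail_b),
]
result = project(select(genes, lambda v: v["Name"] == "lacZ"), ["ID"])
print(result)  # [({'ID': 'JW0335'}, [(1.0, 4), (2.0, 3)])]
```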

Merge Operator ($\Omega(Q_1, Q_2, \ldots)$): Several of the relational operators, e.g., join, grouping, and aggregation, among others, involve merging multiple input tuples together to form one output tuple. Therefore, the corresponding input quality trails may also need to be merged and combined together. To perform this merge operation over quality trails, we introduce the new Merge operator $\Omega(Q_1, Q_2, \ldots)$. This operator is not a physical operator; instead, it is a logical operator that executes within other physical operators, e.g., join, grouping, and duplicate elimination.

The Merge operator’s logic is presented in Figure 3, and its functionality is illustrated using the example in Figure 4. Assume combining three tuples $r_1$, $r_2$, and $r_3$ having quality trails $Q_1$, $Q_2$, and $Q_3$, respectively. All quality trails are typically aligned from the R.H.S (which is the query time), i.e., each quality trail must have a valid transition at the query time. However, the trails are not necessarily aligned from the L.H.S since the data tuples may be inserted into the database at different times (See Figure 4). The quality trail of the output tuple, $Q_{out}$, is derived using a sweep line algorithm over the input quality trails starting from left to right and jumping over the transition points as illustrated in Figure 4 (Lines 1-3 in Figure 3). The basic idea behind the algorithm is that the quality of the output tuple at any given point in time t should be the lowest among the qualities of the contributing tuples at time t.
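The sweep-line idea can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions rather than the operator as implemented inside PostgreSQL: `merge_trails` is a hypothetical name, a trail is a list of `(timestamp, score)` pairs sorted by timestamp, the input trails are made up (they are not the trails of Figure 4), and the `min`/`max`/`sum`/`count` dictionary merely stands in for StatsCombine, whose exact definition is not given in this section.

```python
def merge_trails(trails: list[list[tuple[float, int]]]) -> list[tuple[float, int, dict]]:
    """Merge quality trails; each output entry is (timestamp, score, stats)."""
    # Sweep-line stops: every distinct transition time, visited left to right.
    times = sorted({ts for trail in trails for ts, _ in trail})
    merged = []
    for t in times:
        # Active score of a trail at t: its latest transition with ts <= t.
        # Trails whose first transition is after t do not participate yet.
        active = []
        for trail in trails:
            started = [score for ts, score in trail if ts <= t]
            if started:
                active.append(started[-1])
        stats = {"min": min(active), "max": max(active),
                 "sum": sum(active), "count": len(active)}
        # The output quality is the lowest among the contributing qualities.
        merged.append((t, min(active), stats))
    return merged


# Usage with three made-up trails that start at different times.
q1 = [(1, 4), (3, 2), (7, 4)]
q2 = [(2, 3), (5, 5)]
q3 = [(4, 5), (6, 1)]
merged = merge_trails([q1, q2, q3])
print([score for _, score, _ in merged])  # [4, 3, 2, 2, 2, 1, 1]
```

At time 4 in this example the active scores are 2, 3, and 5, so the output transition carries a score of 2 while the statistics retain the maximum of 5 and an average of 10/3, which a plain minimum would hide.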

Referring to the example in Figure 4, the sweep line starts at Position 1, where only one input trail exists and has a quality level of 4-star, which will be produced in the output. The line then jumps to Position 2, where a second trail starts participating with a quality level of 3-star, and hence a 3-star transition will be added to $Q_{out}$. The sweep line keeps moving to the subsequent positions, and at each position, it calculates the lowest quality score among the input participants to be the output’s quality score at this position (Lines 5-7 in Figure 3). For example, referring to the example in Figure 4(a), at a time t where the contributing input qualities from $Q_1$, $Q_2$, and $Q_3$ are 2-star, 3-star, and 5-star, the corresponding quality transition on $Q_{out}$ will have a 2-star score.

Although the quality scores of $Q_{out}$ reflect only the lowest score among the input values, the statistics field associated with each quality transition is intended to provide deeper insights on the other values contributing to the score. Initially, the statistics associated with each quality transition, e.g., Min, Max, Avg, are set to the transition’s score value as illustrated in Figure 3. Then, as the transitions get merged, new statistics are computed and attached to the new quality transition. For example, the sweep line at Position 6 encounters 4-star, 2-star, and 1-star transitions along with their initial statistics. Notice that, for one of the input trails, the active transition at Position 6 is still the 2-star transition


[Figure 4 (image residue): a time axis ending at the query time, with the sweep line of the Merge operator moving from left to right over Positions 1 to 8.]


