Algorithms for Computing Inequality Joins

Brahma Dathan

and Stefan Trausan Matu

Information and Computer Sciences, Metropolitan State University, St. Paul, MN 55106, U.S.A.

Computer Science and Engineering, Bucharest Polytechnic University, Bucharest, Romania

Keywords: Inequality Joins, Query Processing, Algorithms, Lower Bounds.

Abstract: Although the problem of joins has been known ever since the concept of relational databases was introduced,

much of the research in that area has addressed the question of equijoins. In this paper, we look at the problem

of inequality joins, which compares attributes using operators other than equality. We develop an algorithm

for computing inequality joins on two relations with comparisons on two pairs of attributes and then extend

the work to queries involving more than two comparisons. Our work also derives a lower bound for inequality

joins on two relations and show that the two-comparisons algorithm is optimal.

1 INTRODUCTION

Database joins is an old problem (Codd, 1970,

Graefe, 1993, Mishra, 1992) with much of the

research on the topic addressing the problem of

equijoins. Inequality joins (again an old problem,

(DeWitt, 1991, Klug, 1988)), where attributes are

compared for inequality are less frequently used than

equijoins in traditional database systems, but find

utility in temporal databases and other applications

such as database cleaning (Wang, 2013, Cao, 2012,

Enderle, 2004, Khayyat, 2015), and sophisticated

problems even in conventional applications.

For example, consider a company that sells a large

number of products. By one estimate, a well-known

online retailer offers close to half a billion different

products in the United States alone. Among the

various factors that play a role in the profits of a

company is the amount of storage space required by

a product. Often, the size of a specific product is not

directly proportional to the profit on that item. For

example, jewelry items are often small, but they

typically generate more profit per unit volume than

some bulkier items such as plastic chairs.

Assume that products that vary widely in their

features are implemented as separate relations.

(Jewelry and chairs might be two such products.) As

an example, suppose

and

are two separate

relations representing two categories of products.

Among the numerous attributes, assume that the

following are identically named in the two relations:

key, the primary key of the relation; vol, the amount

of space occupied by one unit of the product for

storage; profit, the average profit for one unit of the

product; and unitsSold, the number of units of the

product sold per year.

An example of relation

is given in Table 1.

Table 2 is an instance of

The company wants to minimize the cost of

storage while maximizing profits. In connection with

this, suppose it is considering increasing the

inventory of some products in

relative to the

inventory of some products in

. For this, it may wish

to check for all pairs of tuples

and

whether

d.vol

is more than

c.vol

and

c.profit

is more

than

d.profit

; if the condition is true, and if space is

an issue, the company might decide to stock a larger

quantity of c at the expense of storage for

. It is thus

worthwhile to execute the following query (call it

STORAGE).

select c.key, d.key from C c, D d where

d.vol > c.vol and c.profit > d.profit;

Executing the query on the sample relations yields

the following pairs of keys: (

c1, d2

), (

c1, d5

), (

c1,

), (

c2, d1

), (

c2, d2

), (

c2, d6

), (

c2, d7

), (

c3, d1

(

c3, d2

), (

c3, d3

), (

c3, d4

), (

c3, d5

), (

c3, d6

), (

c3,

), (

c4, d2

), (

c5, d2

), (

c7, d2

The company may want to do more analysis: it

might decide that comparing the unitsSold field

should also be a factor in making a decision. It may

then decide to execute a query such as the following.

select c.key, d.key from C c, D d where

d.vol > c.vol and c.profit > d.profit

and c.unitsSold > d.unitsSold

Dathan, B. and Matu, S.

Algorithms for Computing Inequality Joins.

DOI: 10.5220/0006826803570364

In Proceedings of the 7th International Conference on Data Science, Technology and Applications (DATA 2018), pages 357-364

ISBN: 978-989-758-318-6

357

The result of this query is (

c1, d7

), (

c2, d7

), (

c3,

), (

c3, d3

), (

c3, d4

), (

c3, d7

Table 1: Relation C.

key

vol

profit

unitsSold

Table 2: Relation D.

key

vol

profit

unitsSold

Despite such applications, the problem of

inequality joins remains an insufficiently researched

area. A recent paper (Khayyat, 2017) makes quite a

compelling case for more efficient inequality join

algorithms. This inefficiency is not surprising

because surprising because commercial database

systems solve the problem using a nested loop

that examines every pair of tuples.

Assume that

and

are relations with both

relations containing fields named k,

and

. The

inequality join query, which we denote by

involving these two fields

and

is the following,

where 



and 



are relational operators (<, >, <=, and

>=).

select s.k, t.k from S s, T t where

s.A 



 t.A AND s.B 



t.B

In this work, we present an algorithm to solve

inequality join queries. Our paper makes the

following contributions: 1) It first develops an

algorithm to solve the inequality join problem

involving two comparisons. 2) It provides an

extension to the algorithm in (1) to an arbitrary

number of comparisons. 3) It derives a lower bound

for the problem of inequality joins of the form given

in Q1. The result applies to algorithms that employ

comparisons to determine which tuple pairs belong to

the result. 4) It shows that the algorithm in (1) above

is optimal.

The rest of the paper is organized as follows. In

the next section, we illustrate our algorithm with an

example. For exposition purposes, we describe a good

part of the algorithm using query STORAGE. In Sect.

3, we formally describe the algorithm and extend it to

multiple comparisons. In Sect. 4, we derive a lower

bound for the problem for two comparisons. An

examination of related work is done in Sect. 5.

Section 6 concludes the paper.

2 ALGORITHM CONCEPTS

In this section, we introduce the algorithm using the

query STORAGE.

2.1 Page Setup

We rename the smaller relation

and the larger

relation

. If the two relations are equal in size, we

may name either one of them

and the other

. The

attributes are named

(for key),

(the attribute used

in the first comparison), and

(the attribute in the

second comparison). The where clause is

then changed so that all inequality comparisons are

of the form s.X <rel_op> t.X.

For the rest of this section, we illustrate the

algorithm using the query STORAGE.

is the

smaller relation, so it is renamed

and

becomes

The attributes

key

vol

, and

profit

are renamed

k, A

and

, respectively. For uniformity, we refer to the

key values of

s1, s2

, etc. instead of

c1, c2

, and so

on. Similarly, the key values of

are referred to as

t1,

, etc. The query becomes

select s.k, t.k from S s, T t where

s.A < t.A and s.B > t.B

We will use the letter

to denote an arbitrary

tuple and

to denote an arbitrary

tuple.

2.2 Regions

We now introduce the concept of regions. Clearly,

the query result depends on the relative order of the

and

tuples based on attribute

(as well as

), that

is, which

tuples have an

attribute value less than

the

attribute value of the

tuples. Consider sorting

the union of

and

on attribute

in the ascending

order. We get the sequence

t8, s3, t3, s2, t4, s7, s5,

t1, t6, s1, s4, t5, t7, t2, s6.

Tuple

has an

attribute value less than the

attribute values of all tuples other than

, so it could

be paired with all

tuples except

as far as attribute

DATA 2018 - 7th International Conference on Data Science, Technology and Applications

358

is concerned. It is this idea that we formalize into

the notion of regions.

Since no

tuple has an

value less than

t8.A

and

tuple has an

value greater than

s6.A

, tuples

and

cannot be part of the query result and are

ignored. We can divide tuples other than

and

into sub-sequences with a non-empty sequence of

tuples followed by a non-empty sequence of

tuples.

The sub-sequences are

























 and









Each of these four sub-sequences is called a

region and are numbered 0 through 3. We call the

numbers region numbers. Every tuple also has a

region number, which is that of the region the tuple is

in. For example,

and

have the

region number

of 0. Similarly,

and

have the region number 1,

, and

have the region number 2, and

, and

have the region number 3.

For attribute

, the comparison is

s.B > t.B

, so we

sort the union of

and

in the descending order on

. We get 5 sub-sequences and the tuples are assigned

region numbers 0 through 4:































and







(Tuples

and

are omitted and in the case of a tie,

we place the

tuple before the

tuple because the

comparison is a less than.)

Every

and

tuple thus has two region numbers,

one for each of the two sorted sequences. We call

them

region number and

region number. Note

that

and

do not get region numbers and are no

longer considered in further stages of the algorithm.

Although the concepts of regions and region

numbers were described using the two sorted

sequences of   for the sake of efficiency, we

compute the region numbers using a different

approach, which we describe next.

2.2.1 Computing the Region Numbers

To compute the

region numbers, we first sort

attribute

to get the table

as shown in Fig. 1. Each

cell contains, or will contain, the key, the

attribute

value, a flag, and the region number. We create a

second list

SUA

with one entry for each unique value

of attribute

. The entries in

SUA

point to entries in

as shown in the figure.

For each

tuple, we determine its insertion point

SUA

. For example,

t4.A

is 16, so its insertion point

is between

and

(because we have

s2.A <

t4.A

< s7.A

); we set the flag in

true

to note that there

is a

tuple

with

t.A

between

s2.A

and

s7.A

. Tuple

stores a reference to

. Similarly, we find

insertion points for all

tuples. Note that

’s flag is

false

because there is no

tuple

such that

s7.A

t.A <

s5.A

. Also,

does not have an insertion point

because of its key value. There is no

tuple

with

t.A

> s6.A

. Both

and

are therefore ignored.

We next compute the

region number of the

tuples. The region number is incremented whenever

we encounter an

tuple s with the flag set to true. The

region number of a

tuple is the

region number

of the

tuple it points to. For example,

points to

, so

’s

region number is the same as

’s

region number.

true

false

true

false

true

false

SUA

T tuples

Figure 1: Region Number Computation.

For attribute B, since the relational operator is

for the second comparison, we sort

in the

descending order of the

attribute values. The

and

region numbers are given in the Fig. 2.

2.2.2 Computing the Query Result

If an attribute has exactly one region, that attribute

need not be considered in computing the query result,

so we assume the more general and non-trivial case

where both attributes have at least two regions.

Let 



denote the region number of tuple

relation

. Similar notations apply for the

region

numbers and

tuples. A pair (

s, t

) is in the query

result if and only if 







and 







To compute the result, we need to access the

tuples efficiently. We can access the

tuples in the

region order using

and

, but the

tuples are a

problem because they are not sorted. So we employ

two separate collections called region ordered tables

to access the

tuples in the region order, but not

necessarily in the sorted order of

Figure 2: A and B region numbers.

Algorithms for Computing Inequality Joins

359

The table is implemented as a balanced tree, with

one leaf node per region. The leaves are stored in a

linked list, and each leaf points to a linked list of

tuples that have the region number of that leaf node.

The first step in computing the query result is to

create an

region ordered table for the

tuples. This

table shown in Fig. 3. The leaves hold region numbers

and are shown as circles. We also create an empty

region ordered table. This will contain the

tuples

and will be ordered on the

region numbers.

Figure 3: Data Structure Holding All the T tuples.

We then enter a loop starting with the highest

numbered

region and work our way down to 0. For

each region number

, we add to the

region ordered

table those

tuples with 



 

The highest

region is 3. We pick from the

region ordered table all the

tuples with

region

number of 3. These are

, and

. They are stored

in the

region ordered table. The result is shown in

Fig. 4. The

tuples with

region equal to 3 are

and

. Since 



  and

, and

all have





  they are all paired with

. Since 



 ,

is paired with only

because 



 .

We proceed to

region 2. T tuples

and

have

2 for their

region number and are added to the table.

The

tuples with 



  are

and

. 



 

is paired with

. 



  so

is also paired

with

Figure 4: Table After Adding T Tuples with 



= 3.

Moving down to

region 1, we add

to the

table. We pair

(



 ) with

, and

because they all have a

region number that is greater

than or equal to 



. Finally, for

region 0, we add

. Since 



 

is paired with every tuple in

the table.

3 FORMAL DESCRIPTIONS

In this section, we state the region computation and

the final query result computation algorithms more

formally and describe our extension to multiple

comparisons.

3.1 The Two-Comparisons Algorithm

The algorithm for computing the

region numbers is

shown in Fig. 5. The

region numbers are computed

similarly. The complexities for the steps are indicated

as we progress through the steps.

Lines 1-4 sort

on attribute

in the ascending or

descending order (complexity:   and

lines 5-6 initialize the flag (complexity: . Steps

9-14 create the table

SUA

(complexity: .

Lines 15-41 compute the insertion points for the

tuples. The loop 23-37 is the standard binary search

algorithm augmented to determine the insertion point.

With the loop starting in line 19, the complexity of

lines 15-41 is  .

1) if rel_op is < or rel_op is <=

2) SA = S sorted ascending on S.A

3) else

4) SA = S sorted descending on S.A

5) for s in SA

6) s.fA = false

7) SUA = new array of indices to SA

8) uniques = 0

9) for 0 <= i < m - 1

10) if SA[i].A != SA[i + 1].A

11) SUA[uniques].index = i

12) uniques = uniques + 1

13) SUA[uniques] = SA[m -- 1]

14) uniques = uniques + 1

15) if rel_op is < or rel_op is >

16) offset = -1

17) else

18) offset = 0

19) for t in T

20) low = 0

21) high = uniques - 1

22) found = false

23) while low <= high and not found

24) mid = (low + high)/2

25) if SA[SUA[mid]].A == t.A

26) found = true

27) else if rel_op is < or <=

28) if SA[SUA[mid]].A < value

29) low = mid + 1

Figure 5: Region Computation.

Internal nodes

DATA 2018 - 7th International Conference on Data Science, Technology and Applications

360

30) else

31) high = mid - 1

32) else

33) if SA[SUA[mid]].A > value

33) low = mid + 1

34) else

35) high = mid - 1

36) if found

37) ins_pt = mid - offset

36) else

37) ins_pt = high

37) if ins_pt == -1

38) remove t from T

39) else

40) SA[SUA[ins_pt]].fA = true

41) t.spos = SUA[ins_pt]

42) r = 0

43) for s in sorted order SA

44) s.rA = r

45) if s.fA

46) r++

47) for t in T

48) t.rA = t.spos.rA

Figure 5: Region Computation (Cont.).

The region numbers are computed in lines 42-48.

The complexity is 







The final result is computed

by the algorithm in Fig. 6. Since there are at most

regions, TA can be created in time  .The

iterator can be constructed in  time. The

complexity of lines 5-14 is   , where e

is the number of tuples in the result. Clearly, we have

Theorem 1. The algorithm complexity is

  .

Theorem 2. The algorithm computes the inequality

join correctly.

Proof. Omitted due to lack of space.

1 



: region ordered table





: region ordered table

3 for each t in T





.add(r, t)

5 for r from 



down to 0

6 itA = 



.iterator(r)

7 while itA.hasNext()

8 t = itA.next()

9 



.add(t.



, t)

10 for each s with s.



== r

11 itB = 



.iterator(s.



)

12 while itB.hasNext()

13 t = itB.next()

14 output the pair (s, t)

Figure 6: Computing the Final Result.

3.2 Multiple Comparisons Algorithm

We start off as in the two-comparisons algorithm by

renaming the relations and fields in the obvious

manner. Suppose there are comparisons. Let us

denote the attributes involved in these comparisons

by 











 The algorithm maintains  copies of

the  table, 











 with 



sorted on field





    It computes the region numbers for

the  regions, one for each field 



     It

constructs  region ordered tables each storing all the

tuples of the  table. Let us denote these by













 



is ordered on region     

It constructs  region ordered tables, each initially

empty. Let us denote these by 











 



is ordered on region     

The algorithm in Fig. 5 in its entirety and lines 1-

4 of the algorithm in Fig. 6 are employed to

implement all of the above functionalities. The

difference between this extension and the two-

comparisons algorithm is in the way the final query

result is computed. The extension employs a greedy

approach. For this, we make the observation that in

the two-comparisons algorithm, we perform far fewer

comparisons at the beginning when we deal with the

higher region numbers. With a uniform distribution

of the tuples over all regions, we would expect the

number of

tuples in region 



to be 





and the

number of

tuples to be 





, where 





is the

number of regions for field 



. Therefore, the number

of pairs of

and

tuples we need to consider for the

highest region number for 



is at most 















tuplesLeft = m, the number of S tuples

while tuplesLeft > 0

select A, a field (see discussion)

r = largest-numbered unprocessed

region number in A;

execute processField(A, r)

Figure 7: Handling Multiple Fields.

processField(A, r):





: region ordered table for field X





: region ordered table for field X

for each t in 







add(r, t)

B = any field A with 



 

itA = 



.iterator(r)

while itA.hasNext()

t = itA.next()





.add(t.



, t)

for each s with s.



== r and

s.processed == false

itB = 



.iterator(s.



)

while itB.hasNext()

t = itB.next()

for every field g A or B

if 







exit the outer for loop

output (s, t);

s.processed = true

tuplesLeft--

Figure 8: Computing the Final Result.

Algorithms for Computing Inequality Joins

361

The above observation is exploited in the

algorithm shown in Fig. 7. We select a field  for

which we judge the number of comparisons to be the

smallest. For each field 



, assume the following: the

largest unprocessed region number is 



 the number

tuples in region 



is 



; the number of tuples in

the region ordered table 



is 



; and the number of

tuples in region 



is 



. Then we select field  for

which the expression













is the smallest.

The tuple pairs are computed by the algorithm in

Fig. 8. After we retrieve a

tuple

from the region

ordered table 



that matches an

tuple

for region

number based on , we check if 



 



for all

other fields

as well. As the region numbers

decrease, more and more

tuples enter the region

ordered tables, but we also flag more

tuples at

higher numbered regions when we encounter fewer

tuples. This way, the total number of pairs to be

considered does not increase sharply.

4 A LOWER BOUND

In this section, we derive a lower bound for the

number of operations needed to enumerate all pairs of

tuples that satisfy an inequality join query of the form

given in

(Sect. 1). We consider only algorithms

that employ comparisons to determine the query

result. We will assume that the number of tuples in 

and  are  and , respectively, and that   . For

simplicity, the two relations  and  are assumed to

have identically named attributes.

Any deterministic algorithm to compute the query

result needs to determine the tuple pairs in the result

and enumerate them. If there are

tuple pairs in the

result, any algorithm must take  steps to

enumerate them.

Consider two relations

and

such that for any

pair of tuples 



and 



in   , either

























is true or 























true; that is, all pairs of tuples are totally ordered on

the pair of comparisons. Clearly, this is a weaker

problem than the one we addressed in Sec. 2 and 3.

With a total order, to determine the query result, we

need to determine which of the possible total orders

of the

and

tuples is present in the input with one

restriction: the relative order of the

tuples

themselves or the relative order of the

tuples

themselves do not play a role in the output.

4.1 Number of Input Sequences

The tuples of

could be distributed into

non-empty

sets, where     ; these subsets are denoted by









 The number of ways in which a set of

elements can be divided into

non-empty subsets is

given by Stirling number of the second kind (Graham,

1988) and is denoted by 







 The

tuples of

could be distributed in-between (and/or before and

after) these sets in four different ways.

1. Distribute the n tuples between the pairs of S

tuples. There are   such locations. The

tuples

are distributed in 







ways. Although their

relative order within a set should not be counted, the

relative order of each subset 



with respect to the

subsets of

that occupy the   slots is important

and should be counted. Thus, there are 









possible sequences of the tuples. The T tuples

can be distributed in    ways into the  

slots and the permutation of these   subsets

must also be counted. This gives us 



  



 





 sequences. Thus, the total number of sequences

is











  



  



.

2. In addition to using up all    locations

between the

tuples, we could attach an additional

non-empty set of T tuples before 



, (which is the first

subset of

). This gives rise to

non-empty sets of

tuples. The number of sequences is

















.

3. This is similar to item 2 above. We could have

non-empty sets of

tuples by putting a non-empty set

tuples after 



. The number of sequences is again



















4. This case applies only if there are enough

tuples to go around. That happens if    or  

. We form   subsets by putting a non-empty

set of

tuples before 



and after 



. The number of

sequences is 











  



  



.

Therefore, unless     , the number of

sequences with k non-empty subsets of S tuples is

obtained by adding up the four terms given above to

get the following expression.













  



  





 

















 











  



  





If     , the number of sequences is

DATA 2018 - 7th International Conference on Data Science, Technology and Applications

362













  



  





 

















To get a lower bound we can simply use the fact

that



















 





















But







 





 because we can divide the m

tuples into

subsets by having   tuples in one

subset and putting the remaining

tuples into   

non-empty subsets, giving at least 





 distributions.

The number of input sequences is thus no less than



















(Griffith, 2010).

The algorithm must pick the correct input

sequence from the 



possible inputs. We can easily

employ an oracle argument as in the case of the

derivation for the lower bound for sorting and see that

in the worst case, a comparison may reduce the

number of possible sequences by half. Therefore, the

number of comparisons needed is at least.

In addition, if the result contains

tuple pairs, the

algorithm must spend  time to enumerate

them. This gives us the following result.

Theorem 3. Let be any inequality join query

processing algorithm that uses comparisons to

determine the tuple pairs in the result and enumerate

them. The minimum number of steps for to

complete its execution when processing a query on

two relations  and  with cardinalities m and n

respectively, where    is   

where e is the number of tuple pairs in the result.

From Theorem 2 and Theorem 3, we have

Theorem 4. The two-comparisons algorithm is

optimal.

5 RELATED WORK

Classes of joins other than equijoins that have

received less attention but have their own applications

include inequality joins (Klug, 1988, Chandra, 1977,

DeWitt, 1991) and similarity joins (Silva, 2012). The

MapReduce framework has been used to compute

joins. A work using this framework to compute

inequality joins is by Okcan (Okcan, 2011). There are

other important implementations of equijoins and

similarity joins using the MapReduce framework

including works by Blanas et al, (Blanas, 2010) Silva

and Reed (Silva, 2012), Vernica, et al (Vernica,

2010), and Afrati and Ullman (Afrati, 2010).

A significant work by Khayyat et al (Khayyat,

2017) essentially addresses the same problem we took

up here. Their resort to sorting both  and , whereas

we sort only the smaller relation. A difference

between the works is that their algorithm takes

   time, while our approach takes

  time. Our algorithm is optimal for a

pair of relations on two pairs of fields.

Inequality joins have found applications in areas

such as XML query processing to perform

containment joins (Wang 2003). A containment join

between a set of ancestor nodes (denoted as

) and a

set of descendant nodes (denoted as

) is to find all

pairs of







    , such that a is an

ancestor of d. A solution to inequality joins can be

applied to help process these queries.

Inequality joins can be applied to address

questions in temporal databases (Cao, 2012, Enderle,

2004) and have a role in database cleaning

(Chaudhuri, 2006, Khayyat, 2015).

6 CONCLUSIONS

In this paper, we looked at the problem of inequality

joins, an important class of joins that has received less

attention than equijoins. We derived a lower bound

for the problem of inequality joins of two relations

and came up with an optimal algorithm that solved

the problem for two comparisons. We showed how to

extend the approach to more than two comparisons.

We plan to investigate how the multiple

comparisons algorithm could be parallelized. The

approach seems to support a high degree of

concurrency because it processes tuples by region.

REFERENCES

Afrati, F.N., Ullman, J.D., 2010. Optimizing Joins in a

Map-reduce Environment, 13th International

Conference on Extending Database Technology.

Blanas, S., Patel, J.M., Ercegovac, V., Rao, J., Shekita, E.J.,

Tian, Y., 2010. A Comparison of Join Algorithms for

Log Processing in MapReduce, ACM SIGMOD

International Conference on Management of Data.

Cao, Y., Zhou, Y., Chan, C., Tan, K., 2012. On Optimizing

Relational Self-joins, 15th International Conference on

Extending Database Technology.

Chandra, A.K., Merlin, P.M., 1977. Optimal

Implementation of Conjunctive Queries in Relational

Data Bases, Proceedings of the Ninth Annual ACM

Symposium on Theory of Computing STOC '77.

Algorithms for Computing Inequality Joins

363

Chaudhuri, S., Ganti, V., Kaushik, R., 2006. A Primitive

Operator for Similarity Joins in Data Cleaning,

Proceedings of the Seventh International Conference

on Data Engineering.

Codd, E. F., 1970. A Relational Model of Data for Large

Shared Data Banks, CACM, Vol. 13, No. 6, pp 377-387.

DeWitt, D.J., Naughton, J.F., Schneider, D.A., 1991. An

Evaluation of Non-Equijoin Algorithms, 17th

International Conference on Very Large Data Bases.

Enderle, J., Hampel, M., Seidl, T., 2004. Joining Interval

Data in Relational Databases, ACM SIGMOD

International Conference on Management of Data.

Graefe, G., 1993, Query Evaluation Techniques for Large

Databases, ACM Comput. Surv., Vol. 25, No. 2.

Graham, R.L., Knuth, D., Patashnik, O., 1988. Computer

Mathematics, Addison-Wesley.

Griffith, M., Mezo, I., 2010. A Generalization of Stirling

Numbers of the Second Kind via a Special Multiset,

Journal of Integer Sequences, Vol. 13.

Khayyat, Z., Ilyas, I.F., Jindal, A., Madden, S., Ouzzani,

M., Papotti, P., Quiane-Ruiz, J., Tang, N., Yin, S., 2015.

BigDansing: A System for Big Data Cleansing, ACM

SIGMOD International Conference on Management of

Data.

Khayyat, Z., Lucia, W., Singh, M., Ouzzani, M., Papotti,

Paolo, Quiane-Ruiz, J., Tang, N., Kalnis, P., 2017. Fast

and Scalable Inequality Joins, The VLDB Journal.

Klug, A., 1988. On Conjunctive Queries Containing

Inequalities, J. ACM, Vol. 35, No. 1.

Mishra, P., Eich, M. H., 1992. Join Processing in

Relational Databases, ACM Comput. Surv, Vol. 24, No.

Okcan, A., Riedewald, M., 2011. Processing Theta-joins

Using MapReduce, ACM SIGMOD International

Conference on Management of Data.

Silva, Y.N., Reed, J.M., 2012. Exploiting MapReduce-

based Similarity Joins, ACM SIGMOD International

Conference on Management of Data.

Vernica, R., Carey, M.J., Li, C., 2010. Efficient Parallel

Set-similarity Joins Using MapReduce, ACM SIGMOD

International Conference on Management of Data.

Wang, Y., Metwally, A., Parthasarathy, S, 2013. Scalable

All-pairs Similarity Search in Metric Spaces, 19th ACM

SIGKDD International Conference on Knowledge

Discovery and Data Mining.

Wang, W., Jiang, H., Lu, H., Yu, J., 2003. Containment Join

Size Estimation: Models and Methods, ACM SIGMOD

International Conference on Management of Data.

DATA 2018 - 7th International Conference on Data Science, Technology and Applications

364