To get a lower bound we can simply use the fact that […]. But […]
because we can divide the m tuples into k subsets by having […]
tuples in one subset and putting the remaining k tuples into
non-empty subsets, giving at least […] distributions.
The number of input sequences is thus no less than […]
(Griffith, 2010).
The algorithm must pick the correct input sequence from the
possible inputs. We can employ an oracle argument, as in the
derivation of the lower bound for sorting: in the worst case,
a comparison eliminates at most half of the remaining possible
sequences. Therefore, the number of comparisons needed is at
least the base-2 logarithm of the number of possible inputs.
In addition, if the result contains e tuple pairs, the
algorithm must spend time to enumerate them. This gives us
the following result.
Theorem 3. Consider any inequality join query processing
algorithm that uses comparisons to determine the tuple pairs
in the result and enumerate them. The minimum number of steps
for the algorithm to complete its execution when processing a
query on two relations with cardinalities m and n
respectively, where […], is […],
where e is the number of tuple pairs in the result.
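The counting argument behind this bound can be illustrated numerically. The following is a minimal sketch and not part of the original derivation: the function names `min_comparisons` and `min_steps` are hypothetical, and the sorting case (n! possible orderings) is used only as the familiar analogue mentioned in the oracle argument. It assumes that every comparison has two outcomes, so an adversary can always keep at least half of the candidate inputs alive, and that each reported pair costs one further step.

```python
from math import factorial


def min_comparisons(num_inputs: int) -> int:
    """Smallest c with 2**c >= num_inputs, i.e. ceil(log2(num_inputs))."""
    if num_inputs < 1:
        raise ValueError("need at least one possible input")
    # (N - 1).bit_length() equals ceil(log2(N)) for every integer N >= 1.
    return (num_inputs - 1).bit_length()


def min_steps(num_inputs: int, e: int) -> int:
    """Comparisons to identify the input plus one step per reported pair.

    Assumption: the two costs are simply added, which is meant to match
    a lower bound only up to constant factors.
    """
    return min_comparisons(num_inputs) + e


# Sorting analogue: 4 keys admit 4! = 24 orderings.
sorting_bound = min_comparisons(factorial(4))
```

Plugging in the actual count of possible input sequences for the join problem, in place of the factorial, would give the comparison term of the bound; the term e accounts for enumerating the output.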
From Theorem 2 and Theorem 3, we have
Theorem 4. The two-comparisons algorithm is
optimal.
5 RELATED WORK
Classes of joins other than equijoins that have
received less attention but have their own applications
include inequality joins (Klug, 1988; Chandra, 1977;
DeWitt, 1991) and similarity joins (Silva, 2012). The
MapReduce framework has been used to compute
joins. One work that uses this framework to compute
inequality joins is by Okcan (Okcan, 2011). Other
important implementations of equijoins and
similarity joins using the MapReduce framework
include works by Blanas et al. (Blanas, 2010), Silva
and Reed (Silva, 2012), Vernica et al. (Vernica,
2010), and Afrati and Ullman (Afrati, 2010).
A significant work by Khayyat et al. (Khayyat,
2017) addresses essentially the same problem we took
up here. Their approach resorts to sorting both relations,
whereas we sort only the smaller relation. A further
difference between the works is that their algorithm takes
[…] time, while our approach takes […] time. Our algorithm
is optimal for joining a pair of relations on two pairs
of fields.
Inequality joins have found applications in areas
such as XML query processing, where they are used to perform
containment joins (Wang, 2003). A containment join
between a set of ancestor nodes (denoted as A) and a
set of descendant nodes (denoted as D) finds all
pairs (a, d), with a in A and d in D, such that a is an
ancestor of d. A solution to inequality joins can be
applied to help process these queries.
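To make the connection concrete, a containment join can be written as a join on two inequalities. The sketch below is only an illustration under assumptions not stated in this paper: each node is assumed to carry a (start, end) region label such that a is an ancestor of d exactly when a.start < d.start and d.end < a.end, and the names `Node` and `containment_join` are hypothetical. It uses a plain nested loop for clarity, not the two-comparisons algorithm.

```python
from typing import List, NamedTuple, Tuple


class Node(NamedTuple):
    name: str
    start: int  # assumed position of the element's opening tag
    end: int    # assumed position of the element's closing tag


def containment_join(ancestors: List[Node],
                     descendants: List[Node]) -> List[Tuple[str, str]]:
    """Return (a, d) name pairs where a's region strictly contains d's."""
    return [
        (a.name, d.name)
        for a in ancestors
        for d in descendants
        # Two inequality predicates, one per pair of fields.
        if a.start < d.start and d.end < a.end
    ]


A = [Node("book", 1, 10), Node("chapter", 2, 7)]
D = [Node("title", 3, 4), Node("author", 8, 9)]
pairs = containment_join(A, D)
```

Here the join condition consists of two comparisons on two pairs of fields, which is the query shape the two-comparisons algorithm targets; the nested loop serves only as a quadratic reference implementation.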
Inequality joins can be applied to address
questions in temporal databases (Cao, 2012; Enderle,
2004) and have a role in database cleaning
(Chaudhuri, 2006; Khayyat, 2015).
6 CONCLUSIONS
In this paper, we looked at the problem of inequality
joins, an important class of joins that has received less
attention than equijoins. We derived a lower bound
for the problem of inequality joins of two relations
and came up with an optimal algorithm that solved
the problem for two comparisons. We showed how to
extend the approach to more than two comparisons.
We plan to investigate how the multiple
comparisons algorithm could be parallelized. The
approach seems to support a high degree of
concurrency because it processes tuples by region.
REFERENCES
Afrati, F.N., Ullman, J.D., 2010. Optimizing Joins in a
Map-reduce Environment, 13th International
Conference on Extending Database Technology.
Blanas, S., Patel, J.M., Ercegovac, V., Rao, J., Shekita, E.J.,
Tian, Y., 2010. A Comparison of Join Algorithms for
Log Processing in MapReduce, ACM SIGMOD
International Conference on Management of Data.
Cao, Y., Zhou, Y., Chan, C., Tan, K., 2012. On Optimizing
Relational Self-joins, 15th International Conference on
Extending Database Technology.
Chandra, A.K., Merlin, P.M., 1977. Optimal
Implementation of Conjunctive Queries in Relational
Data Bases, Proceedings of the Ninth Annual ACM
Symposium on Theory of Computing STOC '77.