Keyword Query:
Part ‘United States’
Conceptual Query:
select Part.Name
where Nation.Name = ‘United States’
SQL Query:
select P.name
from part P, nation N, lineitem LI,
orders O, customer C
where N.name = ‘United States’
and P.partkey = LI.partkey
and O.custkey = C.custkey
and C.nationkey = N.nationkey
and LI.orderkey = O.orderkey
Figure 1: Equivalent Keyword, Conceptual, and Inferred
SQL Queries on TPC-H
relational schema), approaches to recognize and deal
with ambiguity are necessary to make query inference
valuable. Two specific forms of ambiguity that may
lead to multiple query interpretations are addressed.
In addition, schemas may have multiple sets of joins
that are equivalent in their semantic meaning. By
identifying and reducing these duplicate join paths to
a single core path, ambiguity is reduced.
The AutoJoin inference engine efficiently ex-
tends the capabilities of previous inference ap-
proaches, while maintaining independence from par-
ticular query languages or interfaces. The individual
contributions of AutoJoin are:
• An algorithm called EMO that efficiently con-
structs all maximal sets of lossless joins in a
schema. EMO significantly outperforms previous
approaches that fail on large schemas.
• A method for reducing the number of ambiguous
queries by detecting and removing semantically
equivalent interpretations.
• Efficient algorithms for generating query interpre-
tations at query execution time.
• An extension of the lossless join approach to gen-
erate queries with a lossy join.
• A performance study that demonstrates the ap-
proach is scalable.
The rest of this paper is organized as follows. Sec-
tion 2 provides background on query inference strate-
gies. Section 3 presents the AutoJoin inference en-
gine along with the join graph structure. Section 4 de-
scribes the overall strategy for precomputing the loss-
less join trees by the EMO algorithm. Efficient join
determination algorithms are presented in Section 5.
A performance study in Section 6 shows that query
inference can be performed with minimal overhead
even for large schemas. The paper then closes with
future work and conclusions.
2 BACKGROUND
State of the art database interfaces require query in-
ference, as users should not be required to know the
schema and structure of the database queried. Key-
word searches along with natural language query-
ing require an efficient, scalable, and general strat-
egy to discover joins for query execution . An ideal
query inference engine would automatically apply to
existing relational schemas without administrator in-
tervention, quickly pre-compute the necessary data
structures to minimize overhead during query execu-
tion, and return a ranked list of query interpretations
based on specifications from the query interface.
The Universal Relation provided the first interface
which required query inference. The lossless join
property related to functional dependencies provided
the mechanism to determine the joins required to
complete the query (Maier and Ullman, 1983). If
more than one lossless interpretation exists, the in-
ferred query results in the union of all the unique loss-
less queries. In another approach, the query with the
lowest cost (Wald and Sorenson, 1984) is selected as
the inferred query. The cost function first identifies
a lossless interpretation. If lossless joins do not ex-
ist for the query, lossy joins are permitted. These two
approaches both infer a single query, while new inter-
faces require a ranked list of interpretations.
The pursuit of simpler interfaces for relational data-
bases has led to varying methods of join determina-
tion. Each keyword search interface has developed
their own algorithm to address the challenges of find-
ing the k-lowest cost query interpretations for a set of
keywords. Discover (Hristidis and Papakonstantinou,
2002) grows all ways from a relation containing one
of the keywords with a limit on the number of joins
permitted between relations. This inefficiently gener-
ates extra graphs that do not contain all the keywords.
DBXplorer (Agrawal et al., 2002) identifies the lo-
cation of keywords through an efficient symbol table
and then infers the joins required for query interpreta-
tions by generating spanning trees. Neither approach
maintains the lossless property in their cost functions
and all computations occur at query time.
Conceptual query languages aim to hide the com-
plexity of the schema from users by mapping con-
cepts familiar to the user to the relational model.
Again, these languages require the identification of
joins to complete the query. CQL (Owei and Navathe,
2001) uses a shortest path algorithm to find the mini-
mum join paths between specified concepts resulting
ICEIS 2005 - HUMAN-COMPUTER INTERACTION
32