Automata Theory based Approach to the Join Ordering

Problem in Relational Database Systems

Miguel Rodríguez

, Daladier Jabba

, Elias Niño

, Carlos Ardila

and Yi-Cheng Tu

Department of Systems Engineering, Universidad del Norte, KM5 via Puerto Colombia, Barranquilla, Colombia

Department of Computer Science and Engineering, University of South Florida, Tampa, U.S.A.

Keywords: Automata Theory, Query Optimization, Join Ordering Problem.

Abstract: The join query optimization problem has been widely addressed in relational database management systems

(RDBMS). The problem consists of finding a join order that minimizes the time required to execute a query.

Many strategies have been implemented to solve this problem including deterministic algorithms,

randomized algorithms, meta-heuristic algorithms and hybrid approaches. Such methodologies deeply

depend on the correct configuration of various input parameters. In this paper, a meta-heuristic approach

based on the automata theory will be adapted to solve the join-ordering problem. The proposed method

requires a single input parameter that facilitates its usage respect to those previously described in the

literature. The algorithm was embedded into PostgreSQL and compared with the genetic competitor using

the most resent TPC-DS benchmark. The proposed method is supported by experimental results achieving

up to 30% faster response time than GEQO in different queries.

1 INTRODUCTION

Since the early days of RDBMS the problem of

finding a join order to minimize the execution time

of a query has been approached. (Chaudhuri, 1998)

defined large join queries as relational algebra

queries with N join operations involving N+1

relations when N is greater or equals to 10.

Consecutively the Large Join Query Optimization

Problem (LJQOP) was formally addressed as finding

a Query Execution Plan (QEP) with a minimum cost

for a large join query.

The LJQOP have been widely addressed and

many methods have been developed to solve it.

Randomized algorithms such as iterative

improvement and simulated annealing, evolutionary

algorithms such as genetic algorithms and Meta

heuristics such as ant colony optimization are some

common strategies used in the solution of the

problem.

The solution space of a LJQOP consists of all

query trees that answers the query. There are three

types of query trees that can result from the solution

space: left deep, bushy and right deep. An extended

discussion about types of query trees is given in

(Ioannidis and Kang, 1991).

The construction of a LJQOP solution space is

theoretically possible for a small number of

relations. When N increases substantially, finding

the optimal join order is considered an NP-hard

problem and thus deterministic algorithms cannot

find a solution easily.

Systems holding workloads from applications

such as decision support systems and business

intelligence require the ability of joining more than

10 relations easily. In this paper, a Meta heuristic

approach based on the automata theory that has been

effectively used in the solution of the Traveling

Salesman Problem (TSP) will be presented and its

application in the solution of the join ordering

problem will be discussed. Finally the automata

based query optimizer proposed in this work will be

tested using the most recent decision support

benchmark TPC-DS.

The remaining parts of this work will be

distributed as follows. Previous work on solving the

LJQOP is discussed in section two. The proposed

methodology will be explained in section three. The

experimental design and setup used to test the

algorithm is going to be exposed in section four. A

discussion about the results obtained by the

algorithm and a comparison analysis between the

proposed method and the PostgreSQL genetic

optimizer module is showed in section five. Finally

257

Rodríguez M., Jabba D., Niño E., Ardila C. and Tu Y..

Automata Theory based Approach to the Join Ordering Problem in Relational Database Systems.

DOI: 10.5220/0004433802570265

In Proceedings of the 2nd International Conference on Data Technologies and Applications (DATA-2013), pages 257-265

ISBN: 978-989-8565-67-9

 2013 SCITEPRESS (Science and Technology Publications, Lda.)

conclusions and future work are presented.

2 RELATED WORK

The join ordering problem has been approached in

different ways among the years. A literature review

presented in (Steinbrunn et al., 1997) provides

detailed information on different approaches to the

solution of the problem and classifies them in four

groups. The first one corresponds to deterministic

algorithms such as dynamic programming and

minimum selectivity algorithm. The second group,

randomized algorithms, includes simulated

annealing, iterative improvement, two-phase

optimization and random sampling. The third group

consists of genetic algorithms, which encode the

solutions and then uses selection, crossover and

mutation algorithms. Finally the fourth group is

compound of hybrid methods.

Three of the most popular approximate solutions

to the join-ordering problem are simulated

annealing, genetic algorithms and ant colony

optimization.

2.1 Simulated Annealing

The annealing process in physics consists of

obtaining low energy states of a solid element being

heated. Simulated Annealing takes advantage of the

Metropolis algorithm used to study equilibrium

properties in the microscopically analysis of solids.

Specifically the Metropolis algorithm generates a

sequence of states for a solid object. Given an

element in state i with energy E



a new element in

state j is produced, if the difference between

energies is below cero, the new state is automatically

accepted; otherwise its acceptance will depend on

certain probability based on the temperature the

system is exposed to and a physic constant known as

Boltzmann constant k



. Similarly, the simulated

annealing algorithm constructs solutions to

combinatorial problems linking solution-generation

alternatives and an acceptance criterion. The states

of the system can be matched to solutions of the

combinatorial problem, and in the same way the cost

function of the optimization problem can be seen as

the energy cost of the annealing system. Therefore

the simulated annealing algorithm starts exposing

the system to high temperatures and thus accepting

solutions that do not improve previous solutions. By

terms of a cooling factor, the temperature starts

lowering until it reaches zero where solutions that do

not improve its parents are not accepted.

The calculation of the acceptance probability of the

simulated annealing algorithm is adopted from the

Metropolis algorithm and corresponds to the

following equation.







1, 

































,













(2.1)

Different implementations of the simulated

annealing algorithm have been used to solve the join

ordering problem using different cooling schemas,

initial solutions, and solution generation

mechanisms.

The implementation in (Ioannidis and Wong,

1987) proposed the use of simulated annealing to

solve the recursive query optimization problem. The

initial state 



was chosen using semi-naïve

evaluation methods and the initial temperature 



was chosen as twice the cost of the initial state. The

termination criterion of the algorithm is composed of

two parts: the temperature must be below 1 and the

final state must remain the same for four consecutive

stages. The generation mechanism is based on a

transition probability matrix :







→0,1

where each neighbor of the current state has the

same probability to be chosen as the next state.





,′













|





∈









0



(2.2)

Finally authors suggest the use of two different

cooling schedules in their implementation. They

propose the use of the following equation to control

the temperature of the system.













(2.3)

The function  returns values between 0 and 1.

The first strategy proposed consists of keeping  a

constant value of 0.95 and the second one consists of

modifying the value of  according to Table 1.

Table 1: Factor to reduce temperature.





/ 

2 0.80

4 0.85

8 0.90

∞

0.95

A second approach to query optimization by

simulated annealing is proposed in (Swami and

Gupta, 1988) where two implementations of

simulated annealing are compared to several other

algorithms including perturbation walk, Quasi-

random sampling, local optimization and iterative

improvement. The proposed simulated annealing

DATA2013-2ndInternationalConferenceonDataManagementTechnologiesandApplications

258

implementation uses an interesting generation

mechanism that combines two different strategies.

The first strategy is swapping which consists of

selecting two positions in the vector and

interchanges its values and the second strategy is 3-

cycle, which consists of randomly selecting 3

elements of the actual state and shift them one

position to the right in a circle. In order to select

which strategy is used to generate the new solution

at a given iteration, variable ∈0,1 that

represents the frequency of swap selection and thus

1 that represents 3-cycle selection is used.

2.2 Genetic Algorithms

The author of (Holland, 1992) explains that “Most

organisms evolve by means of two primary process:

natural selection and sexual reproduction. The first

determines which members of the population survive

to reproduce, and the second ensures mixing and

recombination among the genes of their offspring”.

Genetic algorithms use the same procedure to seek

an optimal solution to an optimization problem by

selecting the most fitted solutions of the problem

and combining them to create new generations.

A genetic algorithm heavily depends on the

performance of its three basic operations: selection,

recombination and mutation.

In general the selection scheme describes how to

extract individuals from the current generation to

create new elements to be evaluated in the next

generation by creating a mating pool. Consequently

the elements selected in the present generation must

be “good” enough to be the parents of the new

generation.

The crossover operation is what makes genetic

algorithms different from other randomized

methods. Based on the natural reproduction process

where parts of the genes of both parents are

combined to form new individuals, the crossover

function uses two individuals selected from the

mating pool and combines them to create new

individuals. Several methods to crossover have been

designed and some of the most popular are one-point

crossover, two-point crossover, cycle crossover and

uniform crossover.

The mutation operation adds an additional

change to the new individuals of the population to

prevent the generation of uniform populations and

getting trapped in local optima. The mutation

operation in binary encoded genetic algorithms can

be easily implemented by selecting a random bit

from the encoded string and change its value by

using the negation operation. Even though the

mutation operation is essential for the genetic

algorithm to work properly, it must be used

carefully.

Genetic algorithms have also been used to solve

the query optimization problem as alternative to

randomized algorithms. The genetic algorithm

implemented by the authors of (Bennett et al., 1991)

is the first known genetic algorithm used to

approach the query optimization problem. The

authors adapted a genetic algorithm used to solve the

assembly line balancing problem focusing on

finding an appropriate encoding schema and

crossover operation to solve the query optimization

problem. The author of (Muntes-Mulero et al., 2006)

proposed the Carquinyoli Genetic Optimizer (CGO)

which uses a tree fashioned codification for the

algorithm to represent solutions, the crossover

operation randomly selects two members of the

current population and examines each tree’s

operations and stores them in a list, then a sub tree

from each parent is selected and an offspring is

generated by combining a sub tree from one parent

and the ordered list of operations from the other, the

same procedure is applied to the other sub tree and

operation list. Five different mutation strategies were

used, swap, change scan, change join, join

reordering and random sub tree. The selection

strategy used by GCO is a simple elitist algorithm.

Finally the commercial database system

PostgreSQL is equipped with GEQO, a genetic

optimizer that activates when the number of tables

involved in a query exceeds 10. GEQO is based on

the steady state genetic algorithm GENITOR that

presents two main differences compared to

traditional genetic algorithms, the explicit use of

ranking and the genotype reproduction in an

individual basis.

2.3 Ant Colony Optimization

The optimization method based on ant colonies

consists of three procedures: ConstructAntsSolution,

UpdatePheromones and DeamonActions. The first

method manages the construction of solutions by

single ants using pheromone trails and heuristic

information. The second method uses the solution

constructed by the ant and update pheromone trail

accordingly increasing the amount of pheromones or

reducing the amount of pheromones due

evaporation. The third method includes actions that

cannot be performed by single ants like local

optimization procedures or additional pheromone

increases.

Lately new algorithms to solve the query

AutomataTheorybasedApproachtotheJoinOrderingProbleminRelationalDatabaseSystems

259

optimization problem have been proposed based on

the ACO theory. Different approaches to query

optimization using ant algorithms have been

developed and also combined with genetic

algorithms to increase the accuracy of the solutions

found. In this section different approaches to query

optimization assembled on ant colony optimization

algorithms are studied. Specifically three different

approaches will me mentioned: the first known ACO

algorithm to be applied to the query optimization

problem (LI et al., 2008), a best-worst ACO

combined with a genetic algorithm to solve the

query optimization problem (Zhou et al., 2009) and

finally another type of combination between ACO

and GA to approach the join ordering problem is

presented in (Kadkhodaei and Mahmoudi, 2011).

3 DSQO: DETERMINISTIC

SWAPPING QUERY

OPTIMIZATION

The method proposed as a novel algorithm to solve

the traveling sales man problem in (Niño et al.,

2010) takes advantage of the automata theory to

construct a path to find a global optimal solution to

the problem. Specifically, a special type of

deterministic finite automaton is constructed to

model the solution space of the combinatorial

problem and a transition function is designed to

allow the navigation around neighbor answers. The

exchange deterministic algorithm (EDA) was used

to browse the structure to rapidly converge to an

optimal solution.

3.1 Query Optimization based

on the Automata Theory

A deterministic finite automaton of swapping,

DFAS, is a kind of DFA that allows the modeling of

the set of feasible solutions of combinatorial

problems where the order of the elements is relevant

and no repetitions are permitted. A DFAS is

formally defined in (NIÑO and ARDILA, 2009) as a

7-tuple.

,Σ,,



,,







(3.1)

Where  represents the set of all feasible

solutions to the problem, Σ is the input alphabet and

represents the set of all possible exchanges between

two elements of the answer. The author proved that

the number of elements of Σ is given by the

following equation



∗1

(3.2)

:Σ→Q, is the transition function and

takes the node 



∈ and swap the elements in the

positions indicated by the element of the alphabet,





is the initial state and is given by an initial

solution to the problem,  is the set of final states,





is the input vector containing the initial order of

elements corresponding to the state 



,  is the

objective function of the combinatorial problem that

evaluates the given order 



in the node 



For instance the following example is given to

understand the construction of a DFAS. Given the

objective function of a combinatorial optimization

problem and an input vector.











0.3



0.2



0.1



(3.3)







1,2,3

(3.4)

The alphabet contains six elements,

consecutively by the application of the definition.

Σ



1,2





1,3



,2,3



(3.5)

The transition function is constructed by labeling

the state 



after the input vector 





; the swap

operation is applied to 



for every element of the

alphabet and every new vector 





constitutes a new

state 



that is included in the DFAS; the process is

repeated until every node 



has been evaluated.

Table 2 shows the transition function for the given

example

Table 2: DFAS transition function example.







1,2









2,1,3

















1,3









3,2,1

















2,3









1,3,2

















1,2





1,2,3















1,3









3,1,2

















2,3









2,3,1

















1,2





2,3,1















1,3









1,2,3

















2,3









3,1,2

















1,2





3,1,2















1,3









2,3,1

















2,3









1,2,3

















1,2





1,2,3















1,3









2,1,3

















2,3









3,1,2

















1,2





1,2,3















1,3









2,1,3

















2,3









3,1,2











DATA2013-2ndInternationalConferenceonDataManagementTechnologiesandApplications

260

3.2 The Exchange Deterministic

Algorithm (EDA)

EDA is a simple algorithm proposed in (Niño et al.,

2010) that describes a strategy to browse a DFAS

structure to find global optimal solutions to

combinatorial problems that allows finding the state





that contains the global optimal solution of the

problem in polynomial time by only exploring the

necessary states minimizing the use of computer

memory.

Taking into consideration the characteristics

mentioned above the following algorithm was

proposed. Where  is the current state, 





is the

vector associated to the current state and 





 is

the value of evaluating the current state´s vector in

the objective function.

Step 1. 



Step 2. 





 and θ

Step 3. ∀



∈Σ, evaluate 







,





 and if





, let 



,





 and make

θ



Step 4. If θ,  is a global optimum.

Otherwise 



,



and loop back to

step 2.

It is easy to observe that the proposed search

strategy does not require the construction of the

complete DFAS structure at once. Instead, it only

constructs the required areas of the solution space as

the neighbors are chosen following the objective

function improvement. This strategy is expected to

save computer memory because it compares states

one by one and rapidly discards portions of the

solution space that do not improve the objective

function.

3.3 DSQO: Deterministic Swapping

Query Optimization

A DFAS structure can be constructed to represent

the solution space of the join ordering problem

because in fact, it is a combinatorial problem where

the order of the elements is relevant and no

repetitions are allowed.

Following the definition, a DFAS to solve the

query optimization problem is the following 7-tuple.

,Σ,,



,,











(3.6)

Where  represents the set of all possible join

orders, Σ represents all possible exchanges between

two tables in the left deep join query three,  is the

transition function from one query plan to another

with a symbol of Σ, 



, is a random element of 

selected as the initial state of the automaton, , is the

same set as  because every plan in  represents a

solution, 



, is the vector that contains the order in





and 



, is the objective function that estimates

the cost of executing a given 



plan.

The solution space that a DFAS can represent is

reduced to all possible left deep trees and thus no

bushy tree strategy can be directly explored by this

method. To illustrate how to construct a DFAS

modeling the join-ordering problem, Figure 1 shows

the transition diagram of the DFAS corresponding to

the following query with three tables to join.

SELECT *

FROM tab1, tab2, tab3

WHERE tab1.fkt2 = tab2.pk AND

tab2.fkt3 = tab3.pk

Figure 1: DFAS transition diagram from the example

query.

The transition diagram shows how the solution

space of the join ordering problem is represented.

Each node of the graph contains a vector with a join

ordering in the left deep strategy of the following

form:



1,2,3



→1⋈2⋈3

(3.7)

Where the first two elements from left to right

are joined first, and then the intermediate table is

joined with the next element in the vector, creating

another intermediate table. The process is repeated

until there are no more elements to join and the last

intermediate table contains the expected result.

EDA was proposed as a method to navigate the

DFAS structure without building the complete

solution space, by the exploration of the

neighborhood of a given state and an objective

function improvement rule. Even though EDA is

capable of finding global optimal solutions to

combinatorial problems effectively, it was mainly

designed to find optimal solutions to the traveling

salesman problem. Despite the similarities between

the TSP and the join-ordering problem, it is

necessary to adjust the algorithm to perform as well

AutomataTheorybasedApproachtotheJoinOrderingProbleminRelationalDatabaseSystems

261

in the solution of the query optimization problem,

which is the targeted problem of this work.

The main reason EDA is not efficient in the

solution of the join-ordering problem is that the

objective function used in the optimization

procedure is yet an estimate of the real cost of using

a specific join order. Therefore, minimal

improvements towards a better solution may not be

worth the effort of a new iteration of the algorithm.

Another important reason EDA is not effective when

applied to the query optimization problem is the lack

of representation of the wider solution space, which

includes bushy query trees. A slower convergence of

the algorithm is caused because it encounters a

reasonable number of solutions with Cartesian

products in the optimization process of the left-deep

only solution space.

In order to improve the convergence speed of the

query optimization based on automata theory

module, the DSQO algorithm was designed. There

were two main improvements made to the original

EDA algorithm: an objective function improvement

criterion was added in order to avoid unnecessary

optimization efforts and a heuristic was included to

transform Cartesian product left-deep plans into

feasible bushy tree query plans. The heuristic used

was taken from the genetic implementation in

PostgreSQL.

The objective function improvement criterion

added to the algorithm is used to stop the

optimization process when the significance of the

new optimum found is not relevant in the solution of

the problem. An input parameter ∈0,1, which

can be seen as a threshold value, is required by

DSQO to evaluate if the new solution found in the

iteration improves the current solution by a certain

percentage given by  using the following equation.









(3.8)

Where 



is the new solution and  is the current

best solution found. The  parameter can be

considered as an attempt to provide the DBA with a

tool to adjust the accuracy of the optimization

process based on how deep to look into the search

space.

The second strategy included in DSQO is the use

of a heuristic to avoid Cartesian products in the left-

deep solution space and thus expanding it. The

heuristic is taken from the PostgreSQL genetic

algorithm implementation and works as follows. The

given order 



is evaluated from left to right. At first

every relation is seen as a new branch in the query

tree and at the beginning, the first relation

constitutes the only branch of the tree. Then, every

new branch is tried to merge with an existing one or

otherwise it is added as a single relation branch to

the query tree. Finally, any existing branch in the

tree is forced merge to construct a complete query

tree.

Comparing DSQO with other meta-heuristic

methods applied in query optimization such as

simulated annealing, genetic algorithms and ant

colony optimization, the proposed method presents a

notorious advantage: it only requires a single input

parameter. As mentioned before, even though well

known meta-heuristics have been successfully

applied to the query optimization problem,

configuring the right input parameters is a delicate

task. With only one parameter to configure the

DSQO takes an enormous advantage against its

competitors.

4 EXPERIMENTAL SETUP

AND DESIGN

The implementation of the DSQO module was built

and tested on a Mac mini server running Mac OS X

10.6 operating system. It was equipped with a 2.66

GHz Intel 2 Duo processor, 4GB of DDR3 memory

@1067 MHz and a 500GB − 7200 SATA hard disk.

During the execution of tests the machine was

not connected to a network to avoid sharing

resources with network related tasks, also all queries

were run locally in the machine to avoid network-

related delay times while measuring response times.

Finally, the machine was setup to stop any other

resource demanding service with the purpose of

allowing the higher amount of resources to the

database system.

The query optimization module based on the

automata theory proposed in this work was

implemented using the C programming language

and compiled using GCC for Mac OSX. It was

integrated into the code of the PostgreSQL 9.1

database system within its query optimization

module along with GEQO.

The test scenarios were planned based on TPC-

DS (POESS, NAMBIAR and WALRATH, 2007), a

decision support benchmark that supports a retail

product supplier system. The benchmark consists of

a database schema with 25 tables, a data population

tool, test queries, and a data maintenance model.

The data population tool provided was used to

produce a database instance using a 1GB scale factor

and 100 test queries were produced as the

DATA2013-2ndInternationalConferenceonDataManagementTechnologiesandApplications

262

specification of the benchmark indicates. However,

the proposed queries included in the benchmark

usually evaluate less than 10 joins with few

exceptions.

With the intention of testing the DSQO module

with large join queries two scenarios were designed.

The first scenario consists of four queries, one for

each fact table in the schema, that joins each

referenced dimension resulting in queries with 8, 9,

13 and 15 joins, which result still small to test the

module but are particularly interesting because they

retrieve high amounts of data. The following 8-join

query was constructed using the Store Sales fact

table.

SELECT

date_dim.d_date,

time_dim.t_time,

customer.c_first_name,

customer.c_last_name,

item.i_product_name,

cd.cd_gender,

store.s_store_name,

customer_address.ca_county,

hd.hd_dep_count

FROM

public.store_sales,

public.date_dim,

public.time_dim,

public.item,

public.customer,

public.customer_demographics cd,

public.household_demographics hd,

public.customer_address,

public.store

WHERE

store_sales.ss_sold_date_sk =

date_dim.d_date_sk AND

store_sales.ss_sold_time_sk =

time_dim.t_time_sk AND

store_sales.ss_item_sk =

item.i_item_sk AND

store_sales.ss_customer_sk =

customer.c_customer_sk AND

store_sales.ss_cdemo_sk =

cd.cd_demo_sk AND

store_sales.ss_hdemo_sk =

hd.hd_demo_sk AND

store_sales.ss_addr_sk =

customer_address.ca_address_sk AND

store_sales.ss_store_sk =

store.s_store_sk;

To overcome the absence of large join queries, 5

more queries were designed taking advantage of the

snowflake design of the database schema. The Web

Returns fact table was used and its different

dimension tables where joined to obtain queries with

15, 20, 25 and 30 join operations.

Finally, each query was run 10 times to obtain an

average of the execution time and optimization

time. The PostgreSQL GEQO module, the only

commercial generic query optimizer, was setup with

its standard parameters and the DSQO module was

tested using different values for  specifically 0.02,

0.05, 0.10 and 0.15.

5 COMPARATIVE ANALYSIS

This section will be divided into two subsections

comparing the results obtained for each scenario

independently. The first subsection shows the results

of the star schema experiments and the second

subsection displays the results of the snowflake

schema experiments. Figures present the average

response time and optimization time of each query

using the GEQO implementation, included in the

PostgreSQL 9.1.2 distribution, and the optimization

strategy proposed in this work. The quality of the

solution found by each algorithm was measured in

terms of plan cost and is also showed using figures

that facilitate the analysis.

5.1 Snowflake Scenario

Query execution time results of the tests under the

snowflake scenario are shown and also compared to

the results obtained by GEQO in figure 2. The graph

shows that with small number of join operations

both optimizers perform similarly, but when the

number of join operations increases DSQO tend to

perform better.

A maximum improvement of 13.72% was

achieved by DSQO over GEQO solving the 30-join

query. Also, the DSQO presented almost a 20%

improvement in the plan cost for the same query

while, smaller queries obtained similar plan costs

with GEQO presenting minimal differences not

exceeding 0.5%.

Figure 2: Snowflake query execution time comparison.

AutomataTheorybasedApproachtotheJoinOrderingProbleminRelationalDatabaseSystems

263

In general, optimization time was considerably

smaller for DSQO for up to 25 joins. On the other

hand, in the 30 join case the optimization time was

larger, taking almost 50% more time. The overall

query execution time for the 30-join query was

smaller due to the quality of the plan that was found,

specifically 17.45% better. Figure 3 shows a

comparison graph between DSQO with different

values of  and GEQO.

Figure 3: Snowflake query optimization time comparison.

5.2 Star Scenario

The star scenario tests results are shown in figure 4.

Query execution times of the proposed algorithm

and GEQO are shown for each query in this

workload. As the results obtained by DSQO with

values of  of 0.05, 0.10 and 0.15 presented minimal

differences, only results with  values of 0.02 and

0.05 are shown.

The graph indicates that the Store_sales fact

table presented the least improvement, which is

explained by the fact that it is the smallest table. On

the other hand, the highest improvement was

obtained optimizing the Web_returns fact table

where a 30% improvement was achieved. The

improvement obtained by DSQO in the quality of

the plan was not substantial and the highest

improvement obtained was 15%.

Similarly as in the query execution time test,

results with minimal differences were found

between values of  of 0.10 and 0.15, thus the results

presented will omit tests with 0.15.

In general, optimization time spent by DSQO

was smaller than GEQO, which is explained by the

fact that the maximum amount of joins included in

the workload was 15. Figure 5 shows the

convergence speed of both algorithms. Even though

the proposed algorithm is faster, the quality of the

answer found by GEQO is slightly better.

Figure 4: Star query execution time comparison.

Figure 5: Star query optimization time comparison.

6 CONCLUSIONS AND FUTURE

WORK

A new algorithm based on the automata theory was

introduced in this work to find global optimal

solutions to the join ordering problem in relational

database systems. Internal analyses were performed

to the EDA algorithm to understand its performance

in the solution of the query optimization problem,

which derived in the design of the DSQO algorithm.

Different types of metrics such as query execution

time, optimization time and plan cost were measured

to evaluate the efficiency of the proposed query

optimizer. DSQO was tested and empirically

compared to the genetic algorithm implementation

GEQO in PostgreSQL, which is the only

commercial Meta heuristic optimizer available.

Results show that the utilization of the proposed

methodology to solve queries in star and snowflake

environments decreases up to 30% the response time

of the database system.

Even thought the proposed methodology

outperformed the only existing commercial Meta

DATA2013-2ndInternationalConferenceonDataManagementTechnologiesandApplications

264

heuristic optimizer, the proposed methodology can

be further improved by combining it with other

heuristics to obtain better initial solutions. Another

future implementation should experiment with

different operations other than swapping that may

offer the reduction of the exploration of the solution

space. Finally, the optimization methodology should

be tested under different databases and types of

queries, such as those supporting scientific data.

ACKNOWLEDGEMENTS

The research described in this paper was funded by

Colciencias and Universidad del Norte, Barranquilla,

through research grant approved in the Joven

Investigador Proposal 2011.

Yi-Cheng Tu is supported by Grant IIS-1117699

from the US National Science Foundation (NSF).

REFERENCES

Bennett, K., Ferris, M. C. and Ioannidis, Y. E., 1991. A

genetic algorithm for database query optimization.

Computer Sciences Department, University of

Wisconsin, Center for Parallel Optimization.

Chaudhuri, S., 1998. An overview of query optimization

in relational systems, Proceedings of the seventeenth

ACM SIGACT-SIGMOD-SIGART symposium on

Principles of database systems 1998, ACM, pp. 34-43.

Holland, J. H., 1992. Genetic algorithms. Scientific

American, 267(1), pp. 66-72.

Ioannidis, Y. E. and Wong, E., 1987. Query optimization

by simulated annealing. ACM.

Ioannidis, Y. E. and Kang, Y. C., 1991. Left-deep vs.

bushy trees: an analysis of strategy spaces and its

implications for query optimization, Proceedings of

the 1991 ACM SIGMOD international conference on

Management of data 1991, ACM, pp. 168-177.

Kadkhodaei, H. and Mahmoudi, F., 2011. A combination

method for join ordering problem in relational

databases using genetic algorithm and ant colony,

Granular Computing (GrC), 2011 IEEE International

Conference on 2011, IEEE, pp. 312-317.

Li, N., Liu, Y., Dong, Y. and Gu, J., 2008. Application of

Ant Colony Optimization Algorithm to Multi-Join

Query Optimization. Advances in Computation and

Intelligence, pp. 189-197.

Muntes-Mulero, V., Zuzarte, C. and Markl, V., 2006. An

inside analysis of a genetic-programming based

optimizer, Database Engineering and Applications

Symposium, 2006. IDEAS'06. 10th International 2006,

IEEE, pp. 249-255.

Niño, E. D. and Ardila, C. J., 2009. Algorithm based on

finite automata for obtaining global optimum

combinatorial problems. Ingeniería y Desarrollo, (25),

pp. 99-114.

Niño, E. D., Ardila, C. J., Jabba, D. and Donoso, Y., 2010.

A novel Algorithm based on Deterministic Finite

Automaton for solving the mono-objective Symmetric

Traveling Salesman Problem. International Journal of

Artificial Intelligence, 5(A10), pp. 101-108.

Poess, M., Nambiar, R. O. and Walrath, D., 2007. Why

you should run TPC-DS: a workload analysis,

Proceedings of the 33rd international conference on

Very large data bases 2007, VLDB Endowment, pp.

1138-1149.

Steinbrunn, M., Moerkotte, G. and Kemper, A., 1997.

Heuristic and randomized optimization for the join

ordering problem. the VLDB Journal, 6(3), pp. 191-

208.

Swami, A. and Gupta, A., 1988. Optimization of large join

queries. ACM.

Zhou, Y., Wan, W. and Liu, J., 2009. Multi-joint query

optimization of database based on the integration of

best-worst Ant Algorithm and Genetic Algorithm,

Wireless Mobile and Computing (CCWMC 2009), IET

International Communication Conference on 2009,

IET, pp. 543-546.

AutomataTheorybasedApproachtotheJoinOrderingProbleminRelationalDatabaseSystems

265