Text-to-SQL Meets the Real-World
Eduardo R. Nascimento 1,4, Grettel M. García 1, Lucas Feijó 1, Wendy Z. Victorio 1,4,
Yenier T. Izquierdo 1, Aiko R. de Oliveira 4, Gustavo M. C. Coelho 1, Melissa Lemos 1,4,
Robinson L. S. Garcia 2, Luiz A. P. Paes Leme 3 and Marco A. Casanova 1,4
1 Instituto Tecgraf, PUC-Rio, Rio de Janeiro, 22451-900, RJ, Brazil
2 Petrobras, Rio de Janeiro, 20031-912, RJ, Brazil
3 Instituto de Computação, UFF, Niterói, 24210-310, RJ, Brazil
4 Departamento de Informática, PUC-Rio, Rio de Janeiro, 22451-900, RJ, Brazil
Keywords:
Text-to-SQL, GPT, Large Language Models, Industrial Databases.
Abstract:
Text-to-SQL refers to the task defined as “given a relational database D and a natural language sentence S that describes a question on D, generate an SQL query Q over D that expresses S”. Numerous tools have addressed this task with relative success over well-known benchmarks. Recently, several LLM-based text-to-SQL tools, that is, text-to-SQL tools that leverage Large Language Models (LLMs), emerged and outperformed previous approaches. When adopted for industrial-size databases, with a large number of tables, columns, and foreign keys, the performance of LLM-based text-to-SQL tools is, however, significantly lower than that reported for the benchmarks. This paper therefore investigates how a selected set of LLM-based text-to-SQL tools perform over two challenging databases: an openly available database, Mondial, and a proprietary industrial database. The paper also proposes a new LLM-based text-to-SQL tool that combines features from tools that performed well over the Spider and BIRD benchmarks. Then, the paper describes how the selected tools and the proposed tool, running under GPT-3.5 and GPT-4, perform over the Mondial and the industrial databases, using a suite of 100 carefully defined natural language questions that are closely related to those observed in practice. It concludes with a discussion of the results obtained.
1 INTRODUCTION
Text-to-SQL (Katsogiannis-Meimarakis and Koutrika, 2023), also referred to as NL2SQL (Kim et al., 2020), is the task defined as “given a relational database D and a natural language (NL) sentence S that describes a question on D, generate an SQL query Q over D that expresses S”. Numerous tools have addressed this task with relative success (Affolter et al., 2019; Katsogiannis-Meimarakis and
Koutrika, 2023; Kim et al., 2020) over well-known benchmarks, such as Spider, the Yale Semantic Parsing and Text-to-SQL Challenge (Yu et al., 2018), and BIRD, the BIg Bench for LaRge-scale Database Grounded Text-to-SQL Evaluation (Li et al., 2023). Recently, several LLM-based text-to-SQL tools, that is, text-to-SQL tools that leverage Large Language Models (LLMs), emerged and outperformed previous approaches on both benchmarks.
However, when run over industrial-size, real-world databases, the performance of some LLM-based text-to-SQL tools is significantly lower than that reported in the Spider and BIRD Leaderboards. One of the reasons is that such databases have large schemas, whereas these benchmarks feature a large number of databases whose schemas are quite small. Indeed, considering just the number of tables, of the nearly 200 databases used for training in Spider, the largest one has 25 tables (and it is the only one with more than 20 tables) and, of the 20 used for testing, the largest one has 11 tables. The BIRD benchmark tries
to remedy this situation, but, of the 73 databases used for training, BIRD has only two with more than 25 tables and, of the 11 used for development, the largest one has 13 tables.
This paper, therefore, addresses the problem of as-
sessing if LLM-based text-to-SQL tools can be used
over relational databases with large schemas, with
several hundred objects, counting tables, columns,
and foreign keys.
To investigate this problem, the paper abandons
the strategy of using many databases with small-
to-medium schemas to test LLM-based text-to-SQL
tools and concentrates on two challenging databases:
a proprietary industrial database, IndDB (the name
was anonymized), and an openly available database,
Mondial.
IndDB is an oil and gas industrial asset integrity management database, in production at an industrial company. It features a relational schema with 27 tables, 585 columns, and 30 foreign keys (some of which are multi-column); the largest table has 81 columns. Thus, IndDB has over 640 schema objects.
IndDB features a database keyword search tool,
DANKE (Izquierdo et al., 2021), which compiles
SQL queries from keyword queries, enhanced with at-
tribute filtering and limited forms of aggregation.
However, since IndDB is proprietary, the experiments cannot be replicated. Thus, the paper also considers Mondial, an openly available database storing geographic data, with a total of 47,699 instances; the relational schema has 46 tables, with a total of 184 columns and 49 foreign keys (again, some of which are multi-column), for a total of nearly 280 objects. Mondial was chosen precisely because it has a large schema, which is challenging for text-to-SQL, and is openly available, which permits replicating the results. A version of Mondial, with 34 tables, is also part of the BIRD benchmark.
The IndDB and the Mondial databases are each
accompanied by a suite of 100 NL questions that are
closely related to those observed in practice.
The paper proceeds to outline nine text-to-SQL strategies, listed in Table 2. The strategies include DANKE; a strategy that combines DANKE with keyword extraction from an NL question (Nascimento et al., 2023); SQLCoder (https://huggingface.co/defog/sqlcoder-34b-alpha), based on an LLM fine-tuned for text-to-SQL; three LangChain-based text-to-SQL strategies (https://python.langchain.com/docs/use_cases/qa_structured/sql); “C3 + ChatGPT + Zero-Shot” (Dong et al., 2023); “DIN-SQL + GPT-4” (Pourreza and Rafiei, 2023); and “C3+DIN”, a new strategy that combines components of C3 and DIN-SQL.
Then, the paper analyses how the text-to-SQL strategies perform over the Mondial and the IndDB databases, using the suites of 100 NL questions. Except for DANKE and SQLCoder, all strategies were tested with GPT-3.5 Turbo and GPT-4.
The paper concludes with a discussion of the results obtained. Briefly, using Mondial, the top-5 strategies with respect to overall accuracy used GPT-4, and C3 had the best overall accuracy, 0.78. SQLCoder had a limited overall accuracy of 0.35. Using IndDB, DANKE had the best overall accuracy, closely followed by “Manual prompt chain + DANKE” with GPT-4. All other text-to-SQL strategies had a very low overall accuracy. In general, the limited performance of text-to-SQL strategies over IndDB and Mondial, whose large schemas are typical of real-world industrial databases, calls for different text-to-SQL approaches, as discussed in the conclusions.
The paper is organized as follows. Section 2 cov-
ers related work. Section 3 describes the IndDB and
the Mondial databases. Section 4 summarizes the
strategies tested. Section 5 details the experiments.
Section 6 contains the conclusions.
2 RELATED WORK
2.1 Text-to-SQL Datasets
The Spider – Yale Semantic Parsing and Text-to-SQL Challenge (Yu et al., 2018) offers datasets for training and testing text-to-SQL tools. Spider features nearly 200 databases covering 138 different domains from three sources: 70 complex databases from different college database courses, SQL tutorial Web sites, online CSV files, and textbook examples; 40 databases from DatabaseAnswers (http://www.databaseanswers.org/); and 90 databases based on WikiSQL, with about 500 tables in about 90 domains. For each database, Spider lists 20-50 hand-written NL questions and their SQL translations; the SQL queries cover all the major SQL components.
Spider proposes three evaluation metrics: compo-
nent matching checks whether the components of the
prediction and the ground truth SQL queries match
exactly; exact matching measures whether the pre-
dicted SQL query as a whole is equivalent to the
ground truth SQL query; execution accuracy requires
that the predicted SQL query select a list of gold val-
ues and fill them into the right slots.
One may criticize Spider for having many
databases with very small schemas. In the number of
tables, the largest five are: baseball_1, with 25 tables; cre_Drama_Workshop_Groups, with 18 tables; and cre_Theme_park, imdb, and sakila_1, with 16 tables each. About half of the databases have schemas with five tables or less. Therefore, the results reported in the leaderboard are highly biased towards databases with small schemas and do not reflect real-world databases.
BIRD, the BIg Bench for LaRge-scale Database Grounded Text-to-SQL Evaluation (Li et al., 2023), is a large-scale cross-domain text-to-SQL benchmark in English. The dataset contains 12,751 text-to-SQL data pairs and 95 databases with a total size of 33.4 GB across 37 domains. BIRD tries to bridge the gap between text-to-SQL research and real-world applications by exploring three additional challenges: dealing with large and messy database values, external knowledge inference, and optimizing SQL execution efficiency. However, as mentioned in the introduction, BIRD still has only two databases, out of the 73 in the training dataset, with more than 30 tables.
WikiSQL (Zhong et al., 2017) has 80,654 NL sen-
tences and SQL annotations of 24,241 tables. Each
query in WikiSQL is limited to the same table and
does not contain complex operations such as sorting
and grouping.
SQL-Eval (Ping, 2023) is a framework (available at https://github.com/defog-ai/sql-eval) that evaluates the correctness of text-to-SQL strategies, created during the development of SQLCoder.
Finally, the sql-create-context dataset (https://huggingface.co/datasets/b-mc2/sql-create-context) was built for the text-to-SQL task. It contains 78,577 examples of NL queries, SQL CREATE TABLE statements, and SQL queries answering the questions. The CREATE TABLE statements provide context for the LLMs without requiring actual rows of data.
2.2 Text-to-SQL Tools
LangChain (https://python.langchain.com) is a generic framework that offers several predefined strategies to build and run SQL queries from NL prompts. Section 4.3 reviews LangChain in more detail.
The Spider Web site (https://yale-lily.github.io/spider) publishes a leaderboard with the best-performing text-to-SQL tools. At the time of this writing, in terms of Execution with Values, the top 6 tools were (in descending order): “MiniSeek” (no reference available at the time of writing); “DAIL-SQL + GPT-4 + Self-Consistency” and “DAIL-SQL
+ GPT-4”, both reported in (Gao et al., 2023); “DPG-
SQL + GPT-4 + Self-Correction” (no reference avail-
able at the time of writing); “DIN-SQL + GPT-4”
(Pourreza and Rafiei, 2023); “Hindsight Chain of
Thought with GPT-4” (no reference available at the
time of writing); and “C3 + ChatGPT + Zero-Shot”
(Dong et al., 2023).
The BIRD Web site (https://bird-bench.github.io) also publishes a leaderboard
with the best-performing tools. At the time of this
writing, in terms of Execution Accuracy, the top 5
tools were: “SFT CodeS-15B” and “SFT CodeS-7B”
(no reference available at the time of writing); “DAIL-
SQL + GPT-4”; “DIN-SQL + GPT-4”; and GPT-4.
Section 4 presents the details of “DIN-SQL +
GPT-4” and “C3 + ChatGPT + Zero-Shot”. These
tools were selected since their code was available
when the experiments reported in this paper were de-
signed.
Other text-to-SQL tools include the Defog SQLCoder, outlined in Section 4.2. The “60 Top AI Text To SQL Bot Tools” Web site (https://topai.tools/s/Text-to-SQL-bot) lists other AI tools for text-to-SQL.
Finally, DB-GPT-Hub (https://github.com/eosphoros-ai/DB-GPT-Hub) is a project that explores how to use LLMs for text-to-SQL. The project covers data collection, data preprocessing, model selection and building, fine-tuning of weights (including LLaMA-2), and the evaluation of several LLMs fine-tuned for text-to-SQL.
3 BENCHMARK DATASETS
ADOPTED
A benchmark dataset for the text-to-SQL task is a pair B = (D, {(L_i, G_i) | i = 1,...,n}), where D is a database and, for i = 1,...,n, L_i is an NL question over D and G_i is the ground truth SQL query over D that translates L_i.
is meant exclusively for testing text-to-SQL tools; it
is not designed for training such tools. Section 5.2
describes the procedure adopted to evaluate text-to-
SQL tools.
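To make the definition concrete, a benchmark dataset can be represented simply as a database connection plus a list of (L_i, G_i) pairs. The sketch below is illustrative only; the connection string and the sample Mondial-style pair are assumptions, not part of the benchmark released with the paper.

```python
# Illustrative representation of a benchmark dataset B = (D, {(L_i, G_i)}).
# The connection string and the example pair are invented for illustration.
from dataclasses import dataclass

@dataclass
class BenchmarkDataset:
    database_uri: str                 # D, e.g., an Oracle connection string
    pairs: list[tuple[str, str]]      # [(L_i, G_i)]: NL question and ground truth SQL

mondial_benchmark = BenchmarkDataset(
    database_uri="oracle+oracledb://user:password@host:1521/?service_name=mondial",
    pairs=[
        ("How many cities does Botswana have?",
         "SELECT COUNT(*) FROM City c JOIN Country co ON c.Country = co.Code "
         "WHERE co.Name = 'Botswana'"),
    ],
)
```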
The primary benchmark dataset adopted in this
paper is IndDB, with a set of 100 NL questions and
their translations to SQL.
IndDB is an oil and gas industrial asset integrity
management database in production at an industrial
company. It features a relational schema with 27 ta-
bles, 585 columns, and 30 foreign keys (some multi-
column); the largest table has 81 columns.
The questions are classified as simple, medium, and complex, which correspond to the easy, medium, and hard classes used in the Spider benchmark (extra-hard questions were not considered). As in the Spider benchmark, the difficulty is based on the number of SQL constructs, so that queries that contain more SQL constructs (GROUP BY, ORDER BY, INTERSECT, nested subqueries, column selections, and aggregators) are considered harder. The list of questions contains 33 simple, 34 medium, and 33 complex questions.
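As a concrete illustration of this kind of construct-counting criterion, the sketch below classifies a query string by counting the constructs listed above. The cut-offs are illustrative assumptions and are not the exact rules used by Spider or by the authors.

```python
import re

# Hypothetical difficulty heuristic based on counting SQL constructs.
# The cut-offs (0-1 simple, 2-3 medium, 4+ complex) are illustrative assumptions.
CONSTRUCTS = [r"\bGROUP\s+BY\b", r"\bORDER\s+BY\b", r"\bINTERSECT\b",
              r"\(\s*SELECT\b",                      # nested subqueries
              r"\b(COUNT|SUM|AVG|MIN|MAX)\s*\("]     # aggregators

def classify(sql: str) -> str:
    score = sum(len(re.findall(p, sql, flags=re.IGNORECASE)) for p in CONSTRUCTS)
    score += sql.upper().split("FROM")[0].count(",")  # additional selected columns
    if score <= 1:
        return "simple"
    return "medium" if score <= 3 else "complex"

# Example: classify("SELECT COUNT(*) FROM City GROUP BY Country") returns "medium".
```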
However, since IndDB is proprietary and is not openly available, the experiments cannot be replicated. Thus, the paper also considers a version of Mondial (https://www.dbis.informatik.uni-goettingen.de/Mondial/), with a set of 100 NL questions and their translations to SQL, divided into 34 simple, 33 medium, and 33 complex questions. Mondial stores geographic data and is openly available. It has a total of 47,699 instances; the relational schema (whose referential dependencies diagram can be found at https://www.dbis.informatik.uni-goettingen.de/Mondial/mondial-abh.pdf) has 46 tables, with a total of 184 columns and 49 foreign keys, some of which are multi-column.
Finally, Table 1 shows basic statistics of the sets
of queries, where “#cols” refers to the number of
columns of the target clause and the other columns
refer to the number of joins, filters, and aggrega-
tions that occur anywhere in the query, including any
nested query. Note that, on average, the medium and
complex queries for IndDB have roughly twice the
number of joins and filters as the medium and com-
plex queries for Mondial.
4 STRATEGIES TESTED
This section outlines the strategies tested, summa-
rized in Table 2.
4.1 DANKE-Based Strategies
DANKE (Izquierdo et al., 2021) is an automatic,
schema-based tool that supports keyword query pro-
cessing for both the relational and RDF environments
(Izquierdo et al., 2018). DANKE is in production at
an oil company as an interface for several databases,
including IndDB.
DANKE supports expressions that contain keywords along with filters and certain forms of aggregation embedded among the keywords, as well as some syntactic sugar to enhance readability. An example of
an expression that DANKE accepts is “What is the total number of cities in Botswana?”, which corresponds to the keyword query “total cities Botswana”. Note that DANKE recognizes and processes the aggregation “total cities” and the restriction “Country name = Botswana”, and ignores the phrase “What is”, the article “the”, and the preposition “of”, treated as syntactic sugar.
The core engine of DANKE first constructs a Steiner tree that covers a set of nodes (relation schemes or RDF classes) whose instances match the largest set of keywords (García et al., 2017). It then compiles the keyword-based query into an SQL (or SPARQL) query that includes conditions that represent keyword matches and joins. Without such joins, an answer would be a disconnected set of tuples (or nodes of the RDF graph), which hardly makes sense.
The “DANKE” strategy listed in Table 2 passes
the NL query to DANKE.
The “Manual prompt chain + DANKE” strategy
(Nascimento et al., 2023) uses an LLM to extract key-
words from an NL question through a sequence of 4
steps, each with a specific prompt: (1) identify the rel-
evant entities; (2) identify the relevant properties; (3)
generate keyword query candidates; and (4) select the
best query Q. Then, Q is passed to DANKE, which
processes Q as explained above. The experiments
tested this strategy with GPT-3.5-turbo and GPT-4.
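A minimal sketch of such a four-step prompt chain is shown below. The prompt wording, helper names, and accumulation of intermediate answers are assumptions, since the paper does not publish the exact prompts.

```python
# Illustrative sketch of the "Manual prompt chain + DANKE" keyword extraction.
# Prompts and helper names are assumptions, not the paper's actual prompts.
# Requires the openai package (>= 1.0) and an OPENAI_API_KEY in the environment.
from openai import OpenAI

client = OpenAI()

STEPS = [
    "List the entities (tables or concepts) relevant to the question below.",
    "Given the question and the entities, list the relevant properties (columns).",
    "Using the entities and properties, propose candidate keyword queries.",
    "Select the single best keyword query and return only that query.",
]

def extract_keyword_query(question: str, model: str = "gpt-4") -> str:
    """Run the four prompting steps, feeding each answer into the next prompt."""
    context = f"Question: {question}"
    answer = ""
    for step in STEPS:
        response = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": f"{context}\n\n{step}"}],
            temperature=0,
        )
        answer = response.choices[0].message.content
        context += f"\n\n{step}\n{answer}"   # accumulate the chain of answers
    return answer  # e.g., "total cities Botswana", then passed to DANKE

# keyword_query = extract_keyword_query("What is the total number of cities in Botswana?")
```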
4.2 SQLCoder
SQLCoder (https://huggingface.co/defog/sqlcoder-34b-alpha) is a specialized text-to-SQL model, open-sourced under the Apache-2 license. The latest model, sqlcoder-34b-alpha, features 34B parameters and was fine-tuned from a base CodeLlama model on more than 20,000 human-curated questions, classified as in Spider and based on ten different schemas. The term SQLCoder will refer to both the model and the tool based on the model.
The training dataset consisted of prompt-
completion pairs, encompassing several schemas
with varying difficulty levels, whereas the evaluation
dataset featured questions from novel schemas. The
use of complex schemas, with 4-20 tables, challenged
the model. The fine-tuning process occurred in two
stages: the base model was first refined using easy
and medium questions and then further fine-tuned on
hard and extra-hard questions to yield SQLCoder.
The accuracy of SQLCoder on Defog’s dataset
is promising: defog-sqlcoder-34b achieved 84.0%,
whereas gpt4-turbo-2023-11-09 achieved 82.5%.
Table 1: Basic statistics of the sets of queries.
Table 2: Summary of the strategies tested.
1) “DANKE”
The database keyword search tool that is the current interface for IndDB.
2) “Manual prompt chain + DANKE”
The LLM extracts keywords from the NL question and passes them to the keyword search tool (DANKE);
tested with GPT-3.5-turbo and GPT-4.
3) “SQLCoder”
Defog’s tool for text-to-SQL, using a special-purpose LLM.
4) “SQLQueryChain (LangChain)”
The LangChain SQLQueryChain processes NL questions into SQL;
tested with GPT-3.5-turbo-16k and GPT-4.
5) “SQLDatabaseSequentialChain (LangChain)”
The LangChain SQLDatabaseSequentialChain processes NL questions into SQL;
tested with GPT-3.5-turbo-16k and GPT-4.
6) “SQL Database Agent (LangChain)”
The LangChain SQL Database Agent processes NL questions into SQL;
tested with GPT-3.5-turbo-16k and GPT-4.
7) “DIN-SQL”
An implementation of “DIN-SQL” that allows the use of different LLMs;
tested with GPT-3.5-turbo-16k and GPT-4.
8) “C3”
An implementation of “C3” that allows the use of different LLMs;
tested with GPT-3.5-turbo and GPT-4.
9) “C3+DIN”
The C3+DIN tool introduced in Section 4.4;
tested with GPT-3.5-turbo-16k and GPT-4.
4.3 LangChain-Based Strategies
LangChain (https://docs.langchain.com) is a framework that helps develop LLM applications. LangChain offers three predefined chains for text-to-SQL (https://python.langchain.com) and is compatible with MySQL, PostgreSQL, Oracle SQL, Databricks, SQLite, and other DBMSs.
Very briefly, SQLQueryChain extracts meta-
data from the database automatically, creates a
prompt, and passes this metadata to the LLM. This
chain greatly simplifies creating prompts to access
databases. SQLDatabaseSequentialChain, based on
the query, determines which tables to use and, based
on those tables, calls the SQLQueryChain. This helps
when the number of tables in the database is large.
Finally, SQLAgent provides a more flexible way of
interacting with databases. It answers NL questions
based on the database schema, as well as on the
database content, and recovers from errors by running
a generated SQL query, catching the traceback and re-
generating it correctly.
In addition to passing the schema in the prompt,
these chains for SQL make it possible to provide sam-
ple data that can help an LLM build correct queries
when the data format is not apparent. Sample rows
are added to the prompt after the column information
for each corresponding table.
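The snippet below is a minimal sketch of how such a chain can be invoked. Recent LangChain releases expose this functionality through the create_sql_query_chain helper, which plays the role of SQLQueryChain; the module paths, Oracle dialect, and connection string are assumptions and vary across versions and deployments.

```python
# Minimal sketch of running an NL question through a LangChain text-to-SQL chain.
# Module paths and the connection string are assumptions.
from langchain_community.utilities import SQLDatabase
from langchain_openai import ChatOpenAI
from langchain.chains import create_sql_query_chain

# Connect to the target database; LangChain reads the schema via SQLAlchemy and
# can append 2 sample rows per table to the schema description.
db = SQLDatabase.from_uri(
    "oracle+oracledb://user:password@host:1521/?service_name=mondial",
    sample_rows_in_table_info=2,
)

llm = ChatOpenAI(model="gpt-4", temperature=0)
chain = create_sql_query_chain(llm, db)  # builds the schema-aware prompt internally

sql = chain.invoke({"question": "What is the total number of cities in Botswana?"})
print(sql)          # generated SQL
print(db.run(sql))  # execute it against the database
```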
The experiments tested all LangChain-based
strategies with GPT-3.5-turbo-16k and GPT-4.
4.4 C3 and DIN-Based Strategies
4.4.1 C3
“C3 + ChatGPT + Zero-Shot” (Dong et al., 2023)
(or briefly C3) is a prompt-based strategy, originally
defined for ChatGPT, that uses only approximately
1,000 tokens per query and achieves a better perfor-
mance than fine-tuning-based methods. C3 has three
key components: Clear Prompting (CP); Calibration
with Hints (CH); Consistent Output (CO).
Clear Prompting addresses two problems: (1) the
size of the database schema may exceed the prompt
limit; (2) a prompt with too many tables and columns
may confuse ChatGPT. Clear Prompting therefore recalls the tables and columns that are relevant to the NL question and includes only their schema information in the prompt.
Calibration with Hints avoids errors caused by
certain biases inherent in ChatGPT: to select columns
that are relevant to the question but not required; to
use LEFT JOIN, OR, and IN incorrectly. Calibration
with Hints then instructs ChatGPT to follow two “de-
bias” hints: (1) select only the necessary columns; (2)
avoid misusing SQL constructs.
Consistent Output tries to avoid the problem that
the output of ChatGPT is unstable due to the inherent
randomness of LLMs. The proposed solution sam-
ples multiple reasoning paths to generate several SQL
queries, executes the SQL queries on the database and
collects the execution outcomes, and uses a voting
mechanism on the results to identify the most con-
sistent SQL.
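A minimal sketch of this voting step is shown below; generate_sql() and run_query() are hypothetical placeholders, and the exact sampling and voting details of C3 may differ.

```python
# Illustrative sketch of C3-style Consistent Output: sample several SQL
# candidates, execute each, and vote on the execution results.
# generate_sql() and run_query() are hypothetical placeholders.
from collections import Counter

def most_consistent_sql(question: str, n_samples: int = 10):
    """Return the candidate SQL whose execution result is the most frequent."""
    candidates = []
    for _ in range(n_samples):
        sql = generate_sql(question)        # one sampled reasoning path (temperature > 0)
        try:
            result = run_query(sql)         # execute against the database
        except Exception:
            continue                        # discard candidates that fail to execute
        candidates.append((sql, tuple(map(tuple, result))))  # hashable result

    if not candidates:
        return None
    # Vote: pick the execution result shared by the largest number of candidates.
    counts = Counter(result for _, result in candidates)
    winning_result, _ = counts.most_common(1)[0]
    return next(sql for sql, result in candidates if result == winning_result)
```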
At the time of writing, C3 was the sixth strategy
listed in the Spider Leaderboard, achieving 82.3% in
terms of execution accuracy on the test set. It outper-
formed state-of-the-art fine-tuning-based approaches
in execution accuracy on the test set, while using only
approximately 1,000 tokens per query.
The experiments tested C3 with GPT-3.5-turbo
and GPT-4.
4.4.2 DIN
“DIN-SQL + GPT-4” (Pourreza and Rafiei, 2023)
(or briefly DIN) uses only prompting techniques and
decomposes the text-to-SQL task into four steps:
schema linking; query classification and decomposi-
tion; SQL generation; and self-correction.
Schema Linking includes ten randomly selected samples from the training set of the Spider dataset and follows the chain-of-thought template. The prompt begins with “Let’s think step by step”; for the column names mentioned in the question, the corresponding columns and their tables are selected from the schema; possible entities and cell values are also extracted from the question.
Classification and Decomposition classifies each
query into: easy – single-table queries that can be an-
swered without joins or nesting; non-nested – queries
that require joins but no sub-queries; nested – queries
that require joins, sub-queries, and set operations.
SQL Generation depends on the query classifica-
tion. A simple few-shot prompting with no intermedi-
ate steps is adequate for easy queries. For non-nested
complex queries, it uses NatSQL as an intermediate
representation, removes operators JOIN ON, FROM,
GROUP BY, and set operators, and merges the HAV-
ING and WHERE clauses. Briefly, for nested com-
plex queries, it breaks down the problem into multi-
ple steps; the prompt for this class is designed in a
way that the LLM should first solve the sub-queries
and then use them to generate the final answer.
Finally, Self-Correction addresses the problem that the generated SQL queries can sometimes have missing or redundant keywords, such as DESC, DISTINCT, and aggregation functions. To solve this problem, the self-correction step instructs the LLM to correct these minor mistakes in a zero-shot setting, where only the buggy code is passed to the LLM, which is asked to fix the bugs.
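The sketch below illustrates one possible zero-shot self-correction call of this kind; the prompt wording is an assumption, not DIN-SQL’s published prompt.

```python
# Illustrative sketch of a DIN-style zero-shot self-correction step: the buggy
# SQL is passed back to the LLM with the schema and a fix-it instruction.
# The prompt wording is an assumption.
from openai import OpenAI

client = OpenAI()

SELF_CORRECTION_PROMPT = (
    "For the given database schema and SQL query, fix any minor mistakes, such as "
    "missing or redundant DESC, DISTINCT, or aggregation functions. "
    "Return only the corrected SQL query.\n\nSchema:\n{schema}\n\nSQL:\n{sql}"
)

def self_correct(schema: str, sql: str, model: str = "gpt-4") -> str:
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user",
                   "content": SELF_CORRECTION_PROMPT.format(schema=schema, sql=sql)}],
        temperature=0,
    )
    return response.choices[0].message.content.strip()
```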
When released, “DIN-SQL + GPT-4” was the
top-performing tool listed in the Spider Leaderboard,
achieving 85.3% in terms of execution accuracy.
The experiments tested DIN with GPT-3.5-turbo-
16k and GPT-4.
4.4.3 C3+DIN
“C3+DIN” combines some of the C3 and DIN strate-
gies and decomposes the text-to-SQL task into six
steps: database schema description; clear prompt-
ing and schema linking; classification and decompo-
sition; calibration with hints; SQL generation; and
self-correction.
Database schema description uses LangChain’s SQLDatabase module to access database metadata. This metadata is then processed to produce a representation of the schema in an automated way.
Clear prompting and schema linking takes advan-
tage of DIN’s chain of thought for schema linking,
but uses C3’s clear prompting to pass only the ta-
bles and columns relevant to the query. Hence, “C3
+ DIN” uses fewer tokens to represent the schema in
the prompt.
Classification and decomposition uses the DIN’s
classification strategy, which is important for SQL
generation, since it uses different prompts for each
class. Additionally, the prompts help detect tables that
should be joined and identify nested queries by detecting sub-queries contained in the main query.
Calibration with hints uses C3’s calibration,
which incorporates prior knowledge of the LLM. Be-
fore the SQL generation step, “C3 + DIN” provides
hints to help the model generate SQL queries that
align more closely with the desired output.
SQL generation is based on DIN’s chain of thought (Wei et al., 2023), where a few chain-of-thought demonstrations are provided as examples. This significantly improves the ability of LLMs
to perform complex reasoning. Furthermore, DIN’s
SQL Generation applies different few-shot prompts
for each query class. Thus, “C3 + DIN” adopts this
approach to generate SQL code.
Self-correction incorporates DIN’s self-correction
mechanism, which seeks to correct the SQL code in a
simple way by passing some hints to the model along
with the database schema. Indeed, LLMs have hallu-
cination problems, that is, they can generate text that
does not make sense (Yu et al., 2023).
The experiments tested “C3+DIN” with GPT-3.5-
turbo-16k and GPT-4.
5 EXPERIMENTS
5.1 Configurations
The experiments tested the strategies summarized in
Table 2 for the 100 NL questions and their translations
to SQL over the Mondial and IndDB databases, stored
in Oracle. The foreign keys of Mondial were used, but
not those of IndDB, which were not available for the
experiments.
Except for DANKE and SQLCoder, the experiments ran each strategy with two LLMs: GPT-3.5-turbo (or GPT-3.5-turbo-16k) and GPT-4. Since both schemas are fairly large, some experiments had to use GPT-3.5-turbo-16k, which allows a 16k-token context. The experiments used the (paid) OpenAI API.
SQLCoder used the sqlcoder-34b-alpha model, with 34B parameters. For the experiments, owing to constraints inherent in the model, the Mondial DDL was translated into PostgreSQL syntax with the help of GPT-4, under human supervision. The output then comprised SQL queries formulated in PostgreSQL syntax, which were subsequently transcribed into Oracle syntax by GPT-4, again under human supervision.
Additional experiments, not reported here, used
other LLMs. The “Manual prompt chain + DANKE”
strategy was run with GPT-3.5-turbo-1106, an-
nounced on November 6th, 2023, whereas the strate-
gies described in Sections 4.3 and 4.4 were run with
LLaMA-2-chat, with 70B parameters. However, none
of these experiments resulted in good accuracy.
5.2 Evaluation Procedure
Let B = (D, {(L_i, G_i) | i = 1,...,n}) be a benchmark dataset. Recall that L_i is an NL question and G_i is the corresponding ground truth SQL query. Let P_i be the SQL query predicted by a text-to-SQL strategy for L_i. Let PT_i and GT_i be the tables that P_i and G_i return when executed over D, called the predicted and the ground truth tables.

Intuitively, P_i is correct if PT_i and GT_i are similar. The notion of similarity adopted requires neither that PT_i and GT_i have the same columns, nor that they have the same rows. This allows for some mismatch between PT_i and GT_i. The following procedure captures this intuition:
1. Compute GT_i and PT_i over D.
2. For each column of GT_i, compute the most similar column of PT_i, respecting a minimum column similarity threshold tc. This step induces a partial matching M from columns of GT_i to columns of PT_i.
3. If the fraction of the number of columns of GT_i that match some column of PT_i is below a given threshold tn, P_i is considered incorrect.
4. The adjusted ground truth table AGT_i is constructed by dropping all columns of GT_i that do not match any column of PT_i, and the adjusted predicted table APT_i is constructed by dropping all columns of PT_i that are not matched and permuting the remaining columns so that PC_k is the k-th column of APT_i iff GC_k, the k-th column of AGT_i, is such that M(GC_k) = PC_k.
5. Finally, AGT_i and APT_i are compared. If their similarity is above a given threshold tq, then P_i is correct; otherwise P_i is incorrect.
In Step 1, GT_i may be pre-computed to avoid re-executing G_i over D for each experiment.

In Step 2, the similarity between two table columns was measured as their Jaccard coefficient (recall that table columns are sets). The threshold tc was set to 0.50.

In Step 3, the threshold tn was set to 0.80, that is, 0.80 of the number of columns of GT_i must match some column of PT_i. Note that setting tn = 0.80 forces all columns of GT_i to match some column of PT_i if GT_i has four or fewer columns (indeed, 4 × 0.80 = 3.20, which rounds up to 4, that is, GT_i must have all four columns matching some column of PT_i, and likewise for a smaller number of columns).
Now, from column “#cols” of Table 1, observe that all queries for Mondial have a result with at most four columns. Hence, setting tn = 0.80 implies that all columns of GT_i must match a column of PT_i. However, the same is not valid for IndDB. Indeed, from column “#cols” of Table 1, observe that some queries for IndDB have a result with up to 19 columns. Hence, for such queries, some columns of GT_i may not match any column of PT_i.

In Step 4, the new tables AGT_i and APT_i will have the same number of columns and the matched columns will appear in the same order.

In Step 5, the similarity of AGT_i and APT_i was computed as their Jaccard coefficient (recall that tables are sets of tuples), and the threshold tq was set to 0.95. Thus, AGT_i and APT_i need not have the same rows but, intuitively, P_i will be incorrect if APT_i contains only a small subset of the rows in AGT_i, or APT_i contains many rows not in AGT_i.

Finally, the accuracy of a given text-to-SQL strategy over the benchmark B is the number of correct predicted SQL queries divided by the total number of predicted queries, as usual.
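One possible Python reading of Steps 1-5 is sketched below. The thresholds follow the values given above, but the table representation (lists of row tuples) and helper names are assumptions rather than code released with the paper.

```python
# Illustrative implementation of the evaluation procedure (Steps 1-5), assuming
# a query result is given as a list of rows (tuples).
from itertools import zip_longest

TC, TN, TQ = 0.50, 0.80, 0.95  # column, column-coverage, and table thresholds

def jaccard(a: set, b: set) -> float:
    return len(a & b) / len(a | b) if a or b else 1.0

def columns(table):
    """Turn a list of rows into a list of column value sets."""
    return [set(col) for col in zip_longest(*table)] if table else []

def is_correct(predicted_rows, ground_truth_rows) -> bool:
    pt, gt = columns(predicted_rows), columns(ground_truth_rows)

    # Step 2: match each ground truth column to its most similar predicted column.
    matching = {}  # index in gt -> index in pt
    for gi, gcol in enumerate(gt):
        best = max(range(len(pt)), key=lambda pi: jaccard(gcol, pt[pi]), default=None)
        if best is not None and jaccard(gcol, pt[best]) >= TC:
            matching[gi] = best

    # Step 3: enough ground truth columns must be matched.
    if not gt or len(matching) / len(gt) < TN:
        return False

    # Step 4: build the adjusted tables, keeping only matched columns, aligned.
    agt = {tuple(row[gi] for gi in sorted(matching)) for row in ground_truth_rows}
    apt = {tuple(row[matching[gi]] for gi in sorted(matching)) for row in predicted_rows}

    # Step 5: compare the adjusted tables as sets of tuples.
    return jaccard(agt, apt) >= TQ
```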
5.3 Results
Tables 4 and 5, at the end of the paper, show the results for the Mondial and the IndDB databases. Columns under Accuracy indicate the accuracy results for the simple, medium, and complex queries, as well as the overall accuracy; columns Input Tokens and Output Tokens respectively show the number of tokens passed as input to and received as output from the model; column Estim. Cost indicates the estimated cost in US Dollars; and column Exec. Time displays the total time to compute the 100 queries, which naturally depends on the hardware and software setup and should be used only to compare the strategies.
5.3.1 Results for Mondial
Accuracy. The top-5 strategies with respect to overall
accuracy used GPT-4. C3 had the best overall accu-
racy of 0.78. Then, SQLQueryChain with samples,
DIN, and C3+DIN had the same overall accuracy
of 0.70. Lastly, SQLQueryChain, without samples,
achieved 0.69. Among the LangChain-based strate-
gies, those that passed the entire schema and samples
in the prompt had superior overall accuracy. SQL-
Coder had a limited overall accuracy of 0.35.
As for the query types, C3 with GPT-4 had the best
accuracy for complex queries, 0.71; SQLQueryChain
with samples had the best accuracy for medium
queries, 0.85; and C3+DIN with GPT-4 had the best
accuracy for simple queries, 0.91.
Strategy Details. Experiments with LangChain were
divided into two groups: (1) passing the NL ques-
tion and the schema; and (2) passing the NL question,
the schema, and two sample rows from each table.
The second group resulted in larger prompts, requir-
ing GPT-3.5-turbo-16k in some cases, while GPT-4
handled large prompts seamlessly. SQLDatabaseSe-
quentialChain and SQLAgent had minimal cost due to
the smaller prompts obtained by filtering the schemas
for the relevant tables. However, GPT-3.5-turbo and
GPT-4 misidentified crucial tables, leading to incor-
rect SQL queries. SQLAgent became lost or hallu-
cinated using GPT-4. In fact, SQLAgent is not fully
compatible with GPT-4. Also, SQLAgent had a poor
performance with GPT-3.5-turbo.
DIN with GPT-3.5-turbo had the largest number
of input tokens, followed closely by DIN with GPT-
4, both with over 1.4 million input tokens. Indeed, DIN
generates large prompts since it passes the complete
database schema and uses a few examples to indicate
how the LLM should reason and generate SQL code
in each stage, except for the self-correction stage.
C3’s Consistent Output generated many output
tokens since it produces ten answers in each clear-
prompting stage (table recall and column recall). Fur-
thermore, at the time of writing, the output token price
of GPT-4 ($0.06/1K sampled tokens) is higher than
the input token price ($0.03/1K prompt tokens). Thus,
although C3 with GPT-4 had the best overall accuracy, it
generated 426,937 output tokens with an overall cost
of $30.23.
Table 3 shows the error analysis of C3 with GPT-
4. When compared with that of C3 for the Spider
benchmark (Dong et al., 2023), it indicates that the
errors resulting from schema linking and joins are
exacerbated in the experiments with Mondial, which
would be expected, given that the Mondial schema is
far more complex than the majority of the datasets in
the Spider benchmark.
C3+DIN with GPT-4 had the same overall accu-
racy as DIN, but lower than C3 with GPT-4. However,
its cost and execution times were the highest among
all experiments. It inherited the problems of C3 and
DIN, such as a high table and column recall time and
large prompt sizes, but generated fewer input tokens
than DIN, as it did not use all DIN modules.
5.3.2 Results for IndDB
Accuracy. “DANKE” had the best overall accu-
racy, followed by “Manual prompt chain + DANKE”
with GPT-4, but both were low. All other strategies
achieved a very low overall accuracy. Of these strate-
gies, C3+DIN with GPT-4 had the best overall accuracy, 0.13, but at a high cost, and SQLQueryChain with GPT-4 had the second-best overall accuracy, 0.12, but with a much lower cost.

Table 3: Error analysis of the C3 with GPT-4 experiment using the Mondial database.
Error Type | Schema Linking | Joins | Nested Query | Invalid Query | Misc
Percentage | 53.3 | 30.0 | 3.3 | 6.7 | 6.7
Strategy Details. SQLQueryChain with GPT-4 and
using samples could not be run since it required more
tokens than the allowed 8,192 for GPT-4. DIN with
GPT-4 generated the largest number of input tokens
since the IndDB schema has fewer tables than Mon-
dial, but the tables have more columns. SQLCoder
was not tested over IndDB for privacy reasons.
Analysis of the Predicted SQL Queries. The low
accuracy results call for a careful comparison of the
predicted and the ground truth SQL queries. Rather
than going through each of the 22 strategies (the num-
ber of lines of Table 5), the analysis concentrated on
LangChain SQLQueryChain with GPT-4, which had the second-best accuracy among the text-to-SQL strategies, at a low cost.
The analysis revealed some problems with the ex-
periments with IndDB: (1) some of the NL questions
in the ground truth were incomplete or ambiguous; (2)
certain ground truth SQL queries were very specific
translations of the NL question, either because they
had columns in the target clause that were specific in-
terpretations of terms in the NL question, or because
the where clause contained conditions that were again
specific interpretations of filters in the NL question;
(3) the relational schema had table and column names
that did not match terms the user typically adopted
for the concepts (and therefore were used to formu-
late the NL questions); (4) the relational schema is
relatively large, which led to LLM hallucination; (5)
many NL questions led to SQL queries with up to 5
joins to gather all the required data.
The ground truth was then adjusted to mini-
mize Problems (1) and (2), and the experiment with
LangChain SQLQueryChain with GPT-4 was re-
executed. The accuracies thereby obtained were in-
deed much better than those reported in Table 5 for
this strategy. For example, the accuracy of simple NL
questions improved from 0.27 to 0.48, which reflects
the adjustments of the ground truth.
However, an accuracy of 0.48 for simple NL ques-
tions over IndDB is still much lower than the accu-
racy of 0.82 that this strategy obtained for simple NL
questions over Mondial, reported in Table 4. Since
the Mondial relational schema uses familiar table and
column names, this observation suggests that problem
(3) may be blamed for the lower accuracy of IndDB.
The traditional approach to avoid problem (3) is
to create views over the relational schema and care-
fully select view and column names that better match
user terms. However, since IndDB is a production database, this approach faced implementation restrictions and could not be tested in the experiments.
Problems (4) and (5) were expected since the ex-
periments aimed at stressing the text-to-SQL strate-
gies with a large schema and complex questions.
6 CONCLUSIONS
The experiments with Mondial indicated that C3 with
GPT-4 had the best overall accuracy. However, C3 in-
curred higher costs, longer runtime, and more output
tokens. SQLQueryChain with GPT-4 proved more effective when token cost and execution time are taken into account.
In general, passing the entire schema to the LLM
achieved better results than filtering the schema and
passing just a few tables, but this approach is limited
by the number of tokens the LLM allows, especially
for complex schemas, such as that of Mondial.
Regarding IndDB, DANKE had the best overall
accuracy, closely followed by “Manual prompt chain
+ DANKE” with GPT-4, albeit both results were low.
This suggests that the NL questions defined for In-
dDB were quite challenging. All other strategies had
an overall accuracy of less than 15%, which is very
low. They correctly processed only those NL ques-
tions over IndDB that mapped to single-table SQL queries.
C3+DIN did not have good performance. It in-
herited the problems of C3 and DIN, such as a high
table and column recall time and large prompt sizes,
but generated fewer input tokens than DIN, as it did
not use all DIN modules.
The specific goal of this paper, to use an LLM-
based text-to-SQL tool to construct a Natural Lan-
guage Interface for IndDB as an alternative to
DANKE, therefore, failed for the list of text-to-SQL
strategies tested, even though this list includes some
strategies that ranked high in familiar leaderboards,
and a model fine-tuned for the text-to-SQL task with reportedly very good performance. In general, the
poor performance of text-to-SQL strategies for the
large, complex databases typical of real-world indus-
trial applications calls for different alternatives and
this is the take-home lesson of this paper.
Future work will consider three alternatives for
text-to-SQL, which are not mutually exclusive.
The first alternative will explore a combination of
a question-and-answer interface and database views.
As hinted at the end of Section 5, the view defini-
tions will: (1) create a vocabulary that better matches
user terms; (2) predefine frequently required joins. A
user session will start with a step-by-step NL ques-
tion specification to cope with any mismatch between
the user terms and the view vocabulary, and to disam-
biguate term usage. After this step, the interface will
select a few views to apply a text-to-SQL strategy.
The second alternative will consider fine-tuning a
locally stored LLM specifically for a given database
with a large schema. The training dataset can be quite
laborious to create, but GPT-4 may come in handy to
augment the training set from a seed set of NL ques-
tions and their SQL translations.
Suppose that a reasonable set P of pairs (S′, Q′), consisting of an NL question S′ and its SQL translation Q′, can indeed be generated for a given database D. A third alternative would be to sample P to construct a set E of pairs (S′, Q′) such that S and S′ are similar, for a given NL question S. Then, a text-to-SQL strategy would pass E to the LLM to help translate S, as already used in a limited form in the LangChain strategies with samples.
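A minimal sketch of this third alternative is given below; the embedding model, similarity measure, and helper names are assumptions used only for illustration.

```python
# Illustrative sketch of the third alternative: retrieve the pairs (S', Q') whose
# questions are most similar to the incoming question S, and pass them to the LLM
# as few-shot examples. The embedding model choice is an assumption.
import numpy as np
from openai import OpenAI

client = OpenAI()

def embed(texts):
    resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([item.embedding for item in resp.data])

def select_examples(question, pairs, k=3):
    """pairs is a list of (nl_question, sql_query) tuples generated for the database."""
    question_vec = embed([question])[0]
    pair_vecs = embed([s for s, _ in pairs])
    # Cosine similarity between the new question and each stored question.
    sims = pair_vecs @ question_vec / (
        np.linalg.norm(pair_vecs, axis=1) * np.linalg.norm(question_vec))
    top = np.argsort(-sims)[:k]
    return [pairs[i] for i in top]  # E: few-shot examples prepended to the prompt
```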
ACKNOWLEDGEMENTS
This work was partly funded by FAPERJ un-
der grant E-26/202.818/2017; by CAPES under
grants 88881.310592-2018/01, 88881.134081/2016-
01, and 88882.164913/2010-01; by CNPq under grant
302303/2017-0; and by Petrobras.
REFERENCES
Affolter, K., Stockinger, K., and Bernstein, A. (2019). A
comparative survey of recent natural language inter-
faces for databases. The VLDB Journal, 28:793–819.
Dong, X., Zhang, C., Ge, Y., Mao, Y., Gao, Y., Chen, L., Lin, J., and Lou, D. (2023). C3: Zero-shot text-to-sql with chatgpt. arXiv preprint arXiv:2307.07306.
Gao, D., Wang, H., Li, Y., Sun, X., Qian, Y., Ding, B.,
and Zhou, J. (2023). Text-to-sql empowered by large
language models: A benchmark evaluation. arXiv
preprint arXiv:2308.15363.
García, G. M., Izquierdo, Y. T., Menendez, E., Dartayre, F., and Casanova, M. A. (2017). Rdf keyword-based query technology meets a real-world dataset. In Proceedings of the International Conference on Extending Database Technology, pages 656–667. OpenProceedings.org.
Izquierdo, Y. T., Garcia, G. M., Lemos, M., Novello,
A., Novelli, B., Damasceno, C., Leme, L. A. P. P.,
and Casanova, M. A. (2021). A platform for key-
word search and its application for covid-19 pandemic
data. Journal of Information and Data Management,
12(5):656–667.
Izquierdo, Y. T., García, G. M., Menendez, E. S., Casanova, M. A., Dartayre, F., and Levy, C. H. (2018). Quiow: a keyword-based query processing tool for rdf datasets and relational databases. In International Conference on Database and Expert Systems Applications (DEXA), pages 259–269. Springer.
Katsogiannis-Meimarakis, G. and Koutrika, G. (2023). A
survey on deep learning approaches for text-to-sql.
The VLDB Journal, 32(4):905–936.
Kim, H., So, B.-H., Han, W.-S., and Lee, H. (2020). Natural
language to sql: Where are we today? Proc. VLDB
Endow., 13(10):1737–1750.
Li, J. et al. (2023). Can llm already serve as a database in-
terface? a big bench for large-scale database grounded
text-to-sqls. arXiv preprint arXiv:2305.03111.
Nascimento, E. R., Garcia, G. M., Victorio, W. Z., Lemos,
M., Izquierdo, Y. T., Garcia, R. L., Leme, L. A. P.,
and Casanova, M. A. (2023). A family of natural lan-
guage interfaces for databases based on chatgpt and
langchain (short paper). In Companion Proceedings
of the 42nd International Conference on Conceptual
Modeling: Posters and Demos co-located with ER
2023, Lisbon, Portugal, November 06-09, 2023, vol-
ume 3618 of CEUR Workshop Proceedings.
Ping, W. J. (2023). Open-sourcing sqleval: our framework
for evaluating llm-generated sql.
Pourreza, M. and Rafiei, D. (2023). Din-sql: Decomposed
in-context learning of text-to-sql with self-correction.
arXiv preprint arXiv:2304.11015.
Wei, J., Wang, X., Schuurmans, D., Bosma, M., Ichter, B.,
Xia, F., Chi, E., Le, Q., and Zhou, D. (2023). Chain-
of-thought prompting elicits reasoning in large lan-
guage models. arXiv preprint arXiv:2201.11903.
Yu, T. et al. (2018). Spider: A large-scale human-labeled
dataset for complex and cross-domain semantic pars-
ing and text-to-sql task. In Proceedings of the 2018
Conference on Empirical Methods in Natural Lan-
guage Processing, pages 3911–3921, Brussels, Bel-
gium.
Yu, X., Cheng, H., Liu, X., Roth, D., and Gao, J. (2023).
Automatic hallucination assessment for aligned large
language models via transferable adversarial attacks.
arXiv preprint arXiv:2310.12516.
Zhong, V., Xiong, C., and Socher, R. (2017). Seq2sql: Gen-
erating structured queries from natural language using
reinforcement learning. CoRR, abs/1709.00103.
APPENDIX
Table 4: Results for Mondial.
Table 5: Results for IndDB.