Authors:
Eduardo R. Nascimento 1; Yenier T. Izquierdo 1; Grettel García 1; Gustavo Coelho 1; Lucas Feijó 1; Melissa Lemos 1; Luiz Leme 2 and Marco Casanova 3, 1
Affiliations:
1 Instituto Tecgraf, PUC-Rio, Rio de Janeiro, 22451-900, RJ, Brazil
2 Instituto de Computação, UFF, Niterói, 24210-310, RJ, Brazil
3 Departamento de Informática, PUC-Rio, Rio de Janeiro, 22451-900, RJ, Brazil
Keyword(s):
Text-to-SQL, GPT, Large Language Models, Relational Databases.
Abstract:
The leaderboards of familiar benchmarks indicate that the best text-to-SQL tools are based on Large Language Models (LLMs). However, when applied to real-world databases, the performance of LLM-based text-to-SQL tools is significantly lower than that reported for these benchmarks. A closer analysis reveals that one of the problems is that the relational schema is an inappropriate specification of the database from the point of view of the LLM. In other words, the target user of the database specification is the LLM rather than a database programmer. This paper therefore argues that the text-to-SQL task can be significantly facilitated by providing a database specification based on two elements: LLM-friendly views, which are close to the language of the users’ questions and eliminate frequently used joins, and LLM-friendly data descriptions of the database values. The paper first introduces a proof-of-concept implementation of three sets of LLM-friendly views over a relational database, whose design is inspired by a proprietary relational database, and a set of 100 Natural Language (NL) questions that mimic users’ questions. The paper then tests a text-to-SQL prompt strategy implemented with LangChain, using GPT-3.5 and GPT-4, over the sets of LLM-friendly views, with data samples serving as the LLM-friendly data descriptions. The results suggest that specifying LLM-friendly views and using data samples, although not too difficult to implement over a real-world relational database, are sufficient to improve the accuracy of the prompt strategy considerably. The paper concludes by discussing the results obtained and suggesting further approaches to simplify the text-to-SQL task.
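As a minimal sketch of the idea, and not the paper's implementation, the following Python example defines one view over two tables with terse names, so that a join is hidden behind user-facing column names, and then assembles a prompt from the view's columns plus a few sample rows. The tables `eqp` and `inst`, the view `equipment`, the helper `build_prompt`, and the prompt layout are all hypothetical, and SQLite stands in for the proprietary database. The call to the LLM itself is omitted.

```python
import sqlite3


def build_prompt(conn, view, question, n_samples=2):
    """Describe a view by its column names and a few sample rows (hypothetical layout)."""
    cur = conn.execute(f"SELECT * FROM {view} ORDER BY 1 LIMIT {n_samples}")
    columns = [d[0] for d in cur.description]
    rows = cur.fetchall()
    lines = [f"View {view}({', '.join(columns)})", "Sample rows:"]
    lines += [str(row) for row in rows]
    lines += [f"Question: {question}", "SQL:"]
    return "\n".join(lines)


conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Hypothetical base tables with terse, programmer-oriented names.
CREATE TABLE inst (inst_id INTEGER PRIMARY KEY, inst_nm TEXT);
CREATE TABLE eqp (eqp_id INTEGER PRIMARY KEY, eqp_nm TEXT, inst_id INTEGER);
INSERT INTO inst VALUES (1, 'Platform A'), (2, 'Platform B');
INSERT INTO eqp VALUES (10, 'Pump P-101', 1), (11, 'Valve V-7', 2);

-- Hypothetical view: pre-computes the join and renames columns
-- to terms closer to the users' vocabulary.
CREATE VIEW equipment AS
SELECT e.eqp_id AS equipment_id,
       e.eqp_nm AS equipment_name,
       i.inst_nm AS installation_name
FROM eqp e JOIN inst i ON e.inst_id = i.inst_id;
""")

prompt = build_prompt(conn, "equipment",
                      "Which equipment is installed on Platform A?")
print(prompt)
```

Against such a view, the question can be answered with a single-table query such as `SELECT equipment_name FROM equipment WHERE installation_name = 'Platform A'`, with no join left for the LLM to infer.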