
5 CONCLUSIONS
This paper argued that, when improving the performance of a text-to-SQL strategy, the LLM should be viewed as the target user of the database specification.
From the point of view of metadata, this position simply calls for a database specification that defines a vocabulary close to that of the NL questions to be submitted for translation to SQL. Such a specification can be easily implemented with familiar views. As for the data, this position requires
creating a set of constructs that try to capture the data
semantics. This can be far more complex and would
require knowledge of the LLM API capabilities if one
wants to go beyond providing data samples. Fortunately, LLMs are “few-shot learners”, that is, they
can learn to perform a new language task from only a
few examples (Brown et al., 2020). Thus, providing a
few data samples per table helps.
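As an illustration of this point, a prompt that combines an LLM-friendly view definition with a few data samples might be assembled as below. This is only a sketch: the view, its columns, and the sample rows are hypothetical, not taken from the paper's dataset.

```python
# Sketch: assemble a text-to-SQL prompt from one LLM-friendly view
# definition plus a few data samples. All names, columns, and rows
# below are hypothetical, not taken from the paper's dataset.

VIEW_DDL = """\
CREATE VIEW well_production AS
SELECT w.well_name, f.field_name, p.oil_volume_m3, p.production_date
FROM well w
JOIN field f ON w.field_id = f.field_id
JOIN production p ON p.well_id = w.well_id;"""

SAMPLE_ROWS = [
    ("Well-7", "Campos", 1250.5, "2023-04-01"),
    ("Well-9", "Santos", 980.0, "2023-04-02"),
]

def build_prompt(question: str) -> str:
    """Combine the view definition, a few sample rows, and the NL question."""
    samples = "\n".join(str(row) for row in SAMPLE_ROWS)
    return (
        "Translate the question into SQL over the view below.\n\n"
        f"{VIEW_DDL}\n\nSample rows:\n{samples}\n\nQuestion: {question}\nSQL:"
    )

print(build_prompt("What was the total oil volume produced in the Campos field?"))
```

Note that the view exposes business-level column names (well_name, field_name) and hides the joins, so the prompt's vocabulary stays close to that of the NL question.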
To help convince the reader of the soundness of
the position, the paper introduced a test dataset, with
three sets of LLM-friendly views of increasing com-
plexity, and 100 NL questions and their translations to
SQL. Using the benchmark dataset, the experiments
suggested that there is a dramatic increase in accuracy
when one moves from prompting the LLM with the
relational schema to prompting the LLM with LLM-
friendly views and data samples, as argued in the pa-
per.
Views also help reduce the SQL query complex-
ity by including additional columns with pre-defined
joins. However, the larger the view, the more tokens
its definition consumes, and LLMs typically limit
the number of tokens in the prompt. Also, the LLM
may get lost when the views have many columns.
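The token-budget concern can be sketched as a simple selection step over candidate view definitions. The chars-per-token estimate and the view strings below are illustrative assumptions only; a real system would use the target model's own tokenizer.

```python
# Sketch: decide which view definitions fit a prompt's token budget.
# The ~4 characters/token rule of thumb is a crude stand-in for a
# model-specific tokenizer; views and budget are illustrative only.

def estimate_tokens(text: str) -> int:
    """Rough token estimate: about 4 characters per token of English/SQL."""
    return max(1, len(text) // 4)

def select_views(view_ddls: list[str], budget: int) -> list[str]:
    """Greedily keep view definitions while the estimated total fits."""
    kept, used = [], 0
    for ddl in view_ddls:
        cost = estimate_tokens(ddl)
        if used + cost <= budget:
            kept.append(ddl)
            used += cost
    return kept

views = [
    "CREATE VIEW v1 AS SELECT a, b FROM t1;",  # narrow view, cheap
    # A wide view with 100 columns, expensive in tokens:
    "CREATE VIEW v2 AS SELECT "
    + ", ".join(f"c{i}" for i in range(100))
    + " FROM t2;",
]
print([estimate_tokens(v) for v in views])
print(len(select_views(views, budget=50)))
```

Under a tight budget, the wide view is dropped, which mirrors the trade-off above: larger views consume more of the prompt and may also confuse the LLM.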
Finally, there is room for further improvement.
For example, the LLM-friendly views used in the
experiments were created by inspecting the database
documentation and by mining a log of user questions.
Although this process was tedious, it was not too
difficult. Nevertheless, further work will focus on a tool that automatically
creates views on the fly, depending on the NL ques-
tion submitted, along the lines of the tool described in
(Nascimento et al., 2023).
ACKNOWLEDGEMENTS
This work was partly funded by FAPERJ un-
der grant E-26/202.818/2017; by CAPES under
grants 88881.310592-2018/01, 88881.134081/2016-
01, and 88882.164913/2010-01; by CNPq under grant
302303/2017-0; and by Petrobras.
REFERENCES
Affolter, K., Stockinger, K., and Bernstein, A. (2019). A
comparative survey of recent natural language inter-
faces for databases. The VLDB Journal, 28.
Brown, T. B. et al. (2020). Language models
are few-shot learners. In Proc. Advances in
Neural Information Processing Systems 33.
doi:10.48550/arXiv.2005.14165.
Gan, Y., Chen, X., Huang, Q., Purver, M., Woodward, J. R.,
Xie, J., and Huang, P. (2021a). Towards robustness
of text-to-SQL models against synonym substitution.
CoRR, abs/2106.01065.
Gan, Y., Chen, X., and Purver, M. (2021b). Exploring
underexplored limitations of cross-domain text-to-SQL
generalization. In Conference on Empirical Methods
in Natural Language Processing.
Katsogiannis-Meimarakis, G. and Koutrika, G. (2023). A
survey on deep learning approaches for text-to-SQL.
The VLDB Journal, 32(4):905–936.
Kim, H., So, B.-H., Han, W.-S., and Lee, H. (2020). Natural
language to SQL: Where are we today? Proc. VLDB
Endow., 13(10):1737–1750.
Lewis, P., Perez, E., Piktus, A., Petroni, F., Karpukhin,
V., Goyal, N., Küttler, H., Lewis, M., Yih, W.-t.,
Rocktäschel, T., Riedel, S., and Kiela, D. (2020).
Retrieval-augmented generation for knowledge-intensive
NLP tasks. In Advances in Neural Information Processing
Systems, volume 33, pages 9459–9474.
Li, J. et al. (2023). Can LLM already serve as a database
interface? A big bench for large-scale database grounded
text-to-SQLs. arXiv preprint arXiv:2305.03111.
Nascimento, E. R., Garcia, G. M., Feijó, L., Victorio, W. Z.,
Lemos, M., Izquierdo, Y. T., Garcia, R. L., Leme, L.
A. P., and Casanova, M. A. (2024). Text-to-SQL meets
the real-world. In Proc. 26th Int. Conf. on Enterprise
Info. Sys.
Nascimento, E. R., Garcia, G. M., Victorio, W. Z., Lemos,
M., Izquierdo, Y. T., Garcia, R. L., Leme, L. A. P.,
and Casanova, M. A. (2023). A family of natural language
interfaces for databases based on ChatGPT and
LangChain. In Proc. 42nd Int. Conf. on Conceptual
Modeling – Posters&Demos, Lisbon, Portugal.
Yu, T. et al. (2018). Spider: A large-scale human-labeled
dataset for complex and cross-domain semantic pars-
ing and text-to-SQL task. In Proc. 2018 Conference on
Empirical Methods in Natural Language Processing,
pages 3911–3921.
ICEIS 2024 - 26th International Conference on Enterprise Information Systems