
ments. As pointed out before, LoRA performed best in the zero-shot setting. We assume that a model such as the LoRA variant, which has already been fine-tuned with in-context examples, might not benefit from them further; in fact, additional examples could even degrade output by biasing the generation process. Beyond few-shot prompting, rule-based post-processing could further reduce prediction errors. Such strategies might include syntax checking, entity dictionary lookups, filtering out generations in unwanted languages, and removing natural-language output that is not SPARQL.
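To make the rule-based post-processing concrete, the following is a minimal Python sketch of such a filter. It is illustrative only and not part of the evaluated pipeline: rdflib's SPARQL parser serves as the syntax check, and `entity_dict`, a mapping from surface labels to knowledge-graph IRIs, is a hypothetical input.

```python
# Minimal sketch of rule-based post-processing for generated SPARQL.
# Assumes rdflib is installed; entity_dict is a hypothetical label-to-IRI map.
import re

from rdflib.plugins.sparql import prepareQuery  # rdflib's SPARQL parser


def postprocess(generated: str, entity_dict: dict[str, str]) -> str | None:
    # 1. Strip surrounding natural-language text: keep everything from the
    #    first SPARQL keyword onward. Trailing prose, if present, will fail
    #    the syntax check below and cause the output to be discarded.
    match = re.search(r"\b(PREFIX|SELECT|ASK|CONSTRUCT|DESCRIBE)\b.*",
                      generated, re.IGNORECASE | re.DOTALL)
    if match is None:
        return None  # no SPARQL-like content at all
    query = match.group(0).strip()

    # 2. Entity dictionary: replace quoted surface labels with their
    #    canonical IRIs, a simple fix for models that emit labels where
    #    the knowledge graph expects identifiers.
    for label, iri in entity_dict.items():
        query = query.replace(f'"{label}"', f"<{iri}>")

    # 3. Syntax check: discard anything the parser rejects.
    try:
        prepareQuery(query)
    except Exception:
        return None
    return query
```

A `None` return would signal that the query should be rejected or regenerated; a production system could layer further rules, such as language-identification filters, on top of this sketch.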
Finally, it is important to acknowledge that our study has certain limitations. We have concentrated on semantic parsing of queries from dialogues, although we recognize the importance of exploring other tasks, such as extracting triples or constructing subgraphs in different graph languages. We also suggest extending our foundational evaluation through additional human assessments and a wider array of recently published models, especially those trained on program code or structured data documents. Moreover, the SPICE dataset is limited to English. Since the pre-training corpora of LLMs consist primarily of English text, the models likely perform better where entities and relations correspond to meaningful English words. Consequently, LLMs can be expected to perform worse on benchmarks in morphologically richer languages.
5 CONCLUSION
We compared LLMs on conversational semantic parsing. Our findings indicate that even smaller, fine-tuned 7B LLMs achieve reasonable performance in generating SPARQL queries from dialogues, although their outputs are not always syntactically valid and do not always yield the correct result. We also discussed model-specific differences and common errors that can be mitigated through few-shot prompting and fine-tuning. In future work, we intend to investigate the applicability of our findings to other query languages. Further, we plan to conduct user evaluations of deployed LLM-based CQA systems for practical search scenarios.

ACKNOWLEDGEMENTS

This work has been supported by the German Federal Ministry of Education and Research grant 01IS17049.
REFERENCES
Aliannejadi, M., Azzopardi, L., Zamani, H., Kanoulas, E., Thomas, P., and Craswell, N. (2021). Analysing mixed initiatives and search strategies during conversational search. In Proc. of the 30th CIKM, pages 16–26, New York, NY, USA. ACM.
Chiang, W.-L., Li, Z., Lin, Z., Sheng, Y., Wu, Z., Zhang, H., Zheng, L., Zhuang, S., Zhuang, Y., Gonzalez, J. E., Stoica, I., and Xing, E. P. (2023). Vicuna: An open-source chatbot impressing GPT-4 with 90% ChatGPT quality. LMSYS Org Blog.
Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K.
(2019). BERT: Pre-training of deep bidirectional
transformers for language understanding. In Proc.
of the 2019 NAACL: Human Language Technologies,
pages 4171–4186, Minneapolis, Minnesota. ACL.
Gu, Y., Kase, S., Vanni, M., Sadler, B., Liang, P., Yan, X., and Su, Y. (2021). Beyond i.i.d.: Three levels of generalization for question answering on knowledge bases. In Proc. of the Web Conference 2021, WWW '21, pages 3477–3488, New York, NY, USA. ACM.
Hu, E. J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., and Chen, W. (2022). LoRA: Low-rank adaptation of large language models. In International Conference on Learning Representations.
Ji, Z., Lee, N., Frieske, R., Yu, T., Su, D., Xu, Y., Ishii, E.,
Bang, Y. J., Madotto, A., and Fung, P. (2023). Survey
of hallucination in natural language generation. ACM Computing Surveys, 55(12).
Kacupaj, E., Plepi, J., Singh, K., Thakkar, H., Lehmann, J.,
and Maleshkova, M. (2021). Conversational question
answering over knowledge graphs with transformer
and graph attention networks. In Proc. of the 16th
EACL, pages 850–862, Online. ACL.
Li, Z., Qu, L., and Haffari, G. (2020). Context dependent
semantic parsing: A survey. In Proc. of the 28th In-
ternational Conference on Computational Linguistics,
pages 2509–2521, Barcelona, Spain (Online). Interna-
tional Committee on Computational Linguistics.
Liu, P., Yuan, W., Fu, J., Jiang, Z., Hayashi, H., and Neubig,
G. (2023). Pre-train, prompt, and predict: A system-
atic survey of prompting methods in natural language
processing. ACM Computing Surveys, 55(9):1–35.
OpenAI (2022). ChatGPT: Optimizing language models for dialogue. OpenAI.
Perez-Beltrachini, L., Jain, P., Monti, E., and Lapata, M.
(2023). Semantic parsing for conversational question
answering over knowledge graphs. In Proc. of the 17th
EACL, pages 2507–2522, Dubrovnik, Croatia. ACL.
Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., and
Sutskever, I. (2019). Language models are unsuper-
vised multitask learners. OpenAI.
Saha, A., Pahuja, V., Khapra, M. M., Sankaranarayanan, K.,
and Chandar, S. (2018). Complex sequential question
answering: Towards learning to converse over linked
question answer pairs with a knowledge graph. In
Proc. of the Thirty-Second AAAI Conference on Ar-
tificial Intelligence. AAAI Press.
Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al. (2023). LLaMA: Open and efficient foundation language models. arXiv:2302.13971.
Wang, B., Shin, R., Liu, X., Polozov, O., and Richardson,
M. (2020). RAT-SQL: Relation-aware schema encod-
ing and linking for text-to-SQL parsers. In Proc. of
the 58th Ann. Meeting of the ACL, pages 7567–7578.
ACL.