study.
The main limitation of this study is that the analyzed records have a standardized format. Thus, the observed results may differ significantly in scenarios where the input text is less rigidly structured. However, the overall architecture described in this paper remains valid, although further research is needed on the instructions to be provided as input to the model.
5 CONCLUSIONS AND FUTURE WORK
This paper explored the application of LLMs, specifically GPT-3.5 Turbo and GPT-4, to the extraction of named entities from repetitive texts. The investigation aimed to assess the effectiveness of these models in handling such structured texts by defining different types of instructions with an increasing level of detail. The experiments showed that all the tested LLMs reach a total ratio greater than 0.75. In all cases, costs should also be considered when choosing the best model.
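As an illustration of what "increasing level of detail" can mean in practice, the sketch below defines three hypothetical instruction variants for the same extraction task. The wording and the column names are assumptions made for this example and do not reproduce the exact prompts evaluated in the study.

# Three hypothetical instruction variants of increasing detail for the same
# extraction task. Wording and column names are illustrative only.
INSTRUCTIONS = {
    "basic": (
        "Extract the named entities from the text below and return them as CSV."
    ),
    "with_schema": (
        "Extract the named entities from the text below and return them as CSV "
        "with the columns: surname, name, birth_place, birth_date."
    ),
    "with_schema_and_example": (
        "Extract the named entities from the text below and return them as CSV "
        "with the columns: surname, name, birth_place, birth_date. "
        "Example row: Rossi,Mario,Pisa,12/03/1890. "
        "Return one row per record and no additional text."
    ),
}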
This paper has investigated two specific models, GPT-3.5 Turbo and GPT-4. As new models are released continuously, future work could compare them, in terms of both performance and cost, with models released by other providers, such as Google and Meta.
Future work could also apply the best-performing scenario at scale and implement an LLM-based app that receives a repetitive text and an output example as input and returns the formatted CSV text as output. In addition, instruction optimization could be investigated, together with a more detailed analysis of which model to use based on the task requirements.
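A minimal sketch of such an app is given below, assuming the OpenAI Python SDK. The model name, prompt wording, and CSV schema are illustrative assumptions rather than the exact configuration tested in this study.

from openai import OpenAI

client = OpenAI()  # assumes the OPENAI_API_KEY environment variable is set

def extract_to_csv(repetitive_text: str, output_example: str,
                   model: str = "gpt-3.5-turbo") -> str:
    """Send the repetitive text and one example CSV row to the model and
    return the CSV text produced by the model."""
    prompt = (
        "Extract the named entities from the following records and return "
        "them as CSV, one row per record, using the same columns as the "
        "example.\n\n"
        f"Example output row:\n{output_example}\n\n"
        f"Records:\n{repetitive_text}"
    )
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # deterministic output is preferable for extraction
    )
    return response.choices[0].message.content

For example, calling extract_to_csv with a block of records and the hypothetical header row "surname,name,birth_place,birth_date" would be expected to return one CSV row per record.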