
Lengyel, G., Lample, G., Saulnier, L., et al. (2023). Mistral 7B. Technical report, Mistral AI.
Jiang, A. Q., Sablayrolles, A., Roux, A., Mensch, A., Savary, B., Bamford, C., Chaplot, D. S., de las Casas, D., Hanna, E. B., Bressand, F., et al. (2024). Mixtral of experts. Technical report, Mistral AI.
Kasneci, E., Seßler, K., Küchemann, S., Bannert, M., Dementieva, D., Fischer, F., Gasser, U., Groh, G., Günnemann, S., and Hüllermeier, E. (2024). Can AI grade your essays? Educational Assessment and Artificial Intelligence Review. Forthcoming.
Ke, Z. and Ng, V. (2019). Automated essay scoring: A survey of the state of the art. In Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence, IJCAI-19, pages 6300–6308. International Joint Conferences on Artificial Intelligence Organization.
Koo, T. K. and Li, M. Y. (2016). A guideline of selecting and reporting intraclass correlation coefficients for reliability research. Journal of Chiropractic Medicine, 15(2):155–163.
Koutník, J., Greff, K., Gomez, F., and Schmidhuber, J. (2014). Handwriting recognition with large multidimensional long short-term memory recurrent neural networks. pages 2193–2201. Curran Associates, Inc.
LeCun, Y., Bottou, L., Bengio, Y., and Haffner, P. (1998). Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324.
Mizumoto, A. and Eguchi, M. (2023). Exploring the potential of using an AI language model for automated essay scoring. Research Methods in Applied Linguistics, 2(2):100050.
Muñoz, S. A. S., Gayoso, G. G., Huambo, A. C., Tapia, R. D. C., Incaluque, J. L., Aguila, O. E. P., Cajamarca, J. C. R., Acevedo, J. E. R., Rivera, H. V. H., and Arias-González, J. L. (2023). Examining the impacts of ChatGPT on student motivation and engagement. Social Space, 23(1):1–27.
Naismith, B., Mulcaire, P., and Burstein, J. (2023). Automated evaluation of written discourse coherence using GPT-4. In Kochmar, E., Burstein, J., Horbach, A., Laarmann-Quante, R., Madnani, N., Tack, A., Yaneva, V., Yuan, Z., and Zesch, T., editors, Proceedings of the 18th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2023), pages 394–403, Toronto, Canada. Association for Computational Linguistics.
Neto, A. F. S., Bezerra, B. L. D., Araújo, S. S., Souza, W. M. A. S., Alves, K. F., Oliveira, M. F., Lins, S. V. S., Hazin, H. J. F., Rocha, P. H. V., and Toselli, A. H. (2024a). Bressay: A Brazilian Portuguese dataset for offline handwritten text recognition. In Document Analysis and Recognition - ICDAR 2024: 18th International Conference, Athens, Greece, August 30–September 4, 2024, Proceedings, Part II, pages 315–333, Berlin, Heidelberg. Springer-Verlag.
Neto, A. F. S., Bezerra, B. L. D., Araújo, S. S., Souza, W. M. A. S., Alves, K. F., Oliveira, M. F., Lins, S. V. S., Hazin, H. J. F., Rocha, P. H. V., and Toselli, A. H. (2024b). Bressay: A Brazilian Portuguese dataset for offline handwritten text recognition. In 18th International Conference on Document Analysis and Recognition (ICDAR), Athens, Greece. Springer.
Nguyen, H. A., Stec, H., Hou, X., Di, S., and McLaren, B. M. (2023). Evaluating ChatGPT’s decimal skills and feedback generation in a digital learning game. In Viberg, O., Jivet, I., Muñoz-Merino, P. J., Perifanou, M., and Papathoma, T., editors, European Conference on Technology Enhanced Learning, pages 278–293, Cham. Springer Nature Switzerland.
OpenAI (2024a). GPT-4 technical report.
OpenAI (2024b). OpenAI o1 system card. Technical report, OpenAI.
Ramesh, D. and Sanampudi, S. K. (2022). An automated essay scoring systems: A systematic literature review. Artificial Intelligence Review, 55(3):2495–2527.
Rice, S. V. (1999). Optical Character Recognition: An Illustrated Guide to the Frontier. Springer, Boston, MA.
Sawatzki, J., Schlippe, T., and Benner-Wickner, M. (2021). Deep learning techniques for automatic short answer grading: Predicting scores for English and German answers. In Cheng, E. C. K., Koul, R. B., Wang, T., and Yu, X., editors, International Conference on Artificial Intelligence in Education Technology, pages 65–75, Singapore. Springer Nature Singapore.
Seßler, K., Xiang, T., Bogenrieder, L., and Kasneci, E. (2023). Peer: Empowering writing with large language models. In Viberg, O., Jivet, I., Muñoz-Merino, P. J., Perifanou, M., and Papathoma, T., editors, Responsive and Sustainable Educational Futures, pages 755–761, Cham. Springer Nature Switzerland.
Stahl, M., Biermann, L., Nehring, A., and Wachsmuth, H. (2024a). Exploring LLM prompting strategies for joint essay scoring and feedback generation. arXiv preprint arXiv:2404.15845 [cs.CL].
Stahl, M., Biermann, L., Nehring, A., and Wachsmuth, H. (2024b). Exploring LLM prompting strategies for joint essay scoring and feedback generation. arXiv preprint arXiv:2404.15845.
Sung, C., Dhamecha, T. I., and Mukhi, N. (2019). Improving short answer grading using transformer-based pre-training. In Isotani, S., Millán, E., Ogan, A., Hastings, P., McLaren, B., and Luckin, R., editors, Artificial Intelligence in Education, pages 469–481, Cham. Springer International Publishing.
Uto, M., Xie, Y., and Ueno, M. (2020). Neural automated essay scoring incorporating handcrafted features. In Scott, D., Bel, N., and Zong, C., editors, Proceedings of the 28th International Conference on Computational Linguistics, pages 6077–6088, Barcelona, Spain (Online). International Committee on Computational Linguistics.
Wang, Z. and Li, Y. (2020). Handwritten text recognition: Benchmarking of current state-of-the-art. arXiv:2003.12294.
Xue, J., Tang, X., and Zheng, L. (2021). A hierarchical BERT-based transfer learning approach for multi-dimensional essay scoring. IEEE Access, 9:125403–125415.
Analysis of the Effectiveness of LLMs in Handwritten Essay Recognition and Assessment