dition, MCQs that ask to complete a sentence or fill
in a blank appear to be handled much more successfully
(87.1%) than other types of questions (60.1%).
Therefore, GPT models’ capabilities seem limited
when it comes to handling MCQs
about computer code requiring reasoning beyond
mere completion (56.6%).
While our study of GPT models’ performance on
diverse types of MCQs yielded numerous valuable insights,
it is subject to many limitations and leaves
much room for improvement. Hence, we suggest several
directions for future work: (i) further analyze the
effects of prompt-tuning and/or (ii) iterative
prompt-construction; (iii) examine the performance of GPT
models on other domains, e.g., competitive mathe-
matics; (iv) develop a systematic framework to com-
prehensively assess the capabilities and limitations
of GPT models; and (v) study possibilities of effec-
tive integration of GPT-based tools, e.g., ChatGPT or
Copilot, into programming education.