In addition, MCQs that ask to complete a sentence or fill in a blank appear to be handled much more successfully (87.1%) than other types of questions (60.1%). GPT models’ capabilities therefore seem limited when it comes to handling MCQs about computer code that require reasoning beyond mere completion (56.6%).
While our study of GPT models’ performance on diverse types of MCQs yielded numerous valuable insights, it is subject to several limitations and leaves much room for improvement. Hence, we suggest several directions for future work: (i) further analyze the effects of prompt-tuning and (ii) of iterative prompt construction; (iii) examine the performance of GPT models in other domains, e.g., competitive mathematics; (iv) develop a systematic framework to comprehensively assess the capabilities and limitations of GPT models; and (v) study possibilities for effective integration of GPT-based tools, e.g., ChatGPT or Copilot, into programming education.
REFERENCES
Desai, A. and Deo, A. (2022). Introducing Amazon CodeWhisperer, the ML-powered coding companion. AWS Machine Learning Blog. June 24, 2022. https://aws.amazon.com/blogs/machine-learning/introducing-amazon-codewhisperer-the-ml-powered-coding-companion/.
Beck, K. (2000). Extreme programming explained: Embrace change. Addison-Wesley Professional.
Becker, B. A., Denny, P., Finnie-Ansley, J., Luxton-Reilly, A., Prather, J., and Santos, E. A. (2022). Programming is hard – or at least it used to be: Educational opportunities and challenges of AI code generation. arXiv preprint arXiv:2212.01020.
Biderman, S. R. and Raff, E. (2022). Fooling MOSS detection with pretrained language models. In Proceedings of the 31st ACM International Conference on Information & Knowledge Management.
Bommarito, J., Bommarito, M., Katz, D. M., and Katz,
J. (2023). GPT as knowledge worker: A zero-shot
evaluation of (AI) CPA capabilities. arXiv preprint
arXiv:2301.04408.
Bommarito II, M. and Katz, D. M. (2022). GPT takes the
bar exam. arXiv preprint arXiv:2212.14402.
Bowman, E. (2023). A college student created an app that can tell whether AI wrote an essay. NPR Technology. January 9, 2023. https://www.npr.org/2023/01/09/1147549845.
Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D.,
Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G.,
Askell, A., et al. (2020). Language models are few-
shot learners. Advances in Neural Information Processing Systems, 33:1877–1901.
Chen, M., Tworek, J., Jun, H., Yuan, Q., Pinto, H. P. d. O.,
Kaplan, J., Edwards, H., Burda, Y., Joseph, N., Brock-
man, G., Ray, A., Puri, R., Krueger, G., Petrov, M.,
Khlaaf, H., Sastry, G., Mishkin, P., Chan, B., Gray, S.,
Ryder, N., Pavlov, M., Power, A., Kaiser, L., Bavar-
ian, M., Winter, C., Tillet, P., Such, F. P., Cummings,
D., Plappert, M., Chantzis, F., Barnes, E., Herbert-
Voss, A., Guss, W. H., Nichol, A., Paino, A., Tezak,
N., Tang, J., Babuschkin, I., Balaji, S., Jain, S., Saun-
ders, W., Hesse, C., Carr, A. N., Leike, J., Achiam,
J., Misra, V., Morikawa, E., Radford, A., Knight,
M., Brundage, M., Murati, M., Mayer, K., Welin-
der, P., McGrew, B., Amodei, D., McCandlish, S.,
Sutskever, I., and Zaremba, W. (2021). Evaluating
large language models trained on code. arXiv preprint
arXiv:2107.03374.
Denny, P., Kumar, V., and Giacaman, N. (2022). Conversing with Copilot: Exploring prompt engineering for solving CS1 problems using natural language. arXiv preprint arXiv:2210.15157.
Drori, I. and Verma, N. (2021). Solving linear algebra by
program synthesis. arXiv preprint arXiv:2111.08171.
Détienne, F. and Bott, F. (2002). Software design – cognitive aspects. Springer Verlag.
Elsen-Rooney, M. (2023). NYC education department
blocks ChatGPT on school devices, networks. Chalk-
beat New York. January 3, 2023.
Finnie-Ansley, J., Denny, P., Becker, B. A., Luxton-Reilly,
A., and Prather, J. (2022). The robots are coming:
Exploring the implications of OpenAI Codex on in-
troductory programming. In Australasian Computing
Education Conference, ACE ’22, pages 10–19, New
York, NY, USA. Association for Computing Machin-
ery.
Gilson, A., Safranek, C. W., Huang, T., Socrates, V., Chi, L. S., Taylor, R. A., and Chartash, D. (2022). How well does ChatGPT do when taking the medical licensing exams? The implications of large language models for medical education and knowledge assessment. medRxiv preprint. https://doi.org/10.1101/2022.12.23.22283901.
Hendrycks, D., Burns, C., Basart, S., Zou, A., Mazeika, M.,
Song, D., and Steinhardt, J. (2020). Measuring mas-
sive multitask language understanding. arXiv preprint arXiv:2009.03300.
Huang, K. (2023). Alarmed by A.I. chatbots, universities
start revamping how they teach. New York Times. Jan-
uary 16, 2023.
Karmakar, A., Prenner, J. A., D’Ambros, M., and Robbes,
R. (2022). Codex hacks HackerRank: Memorization issues and a framework for code synthesis evaluation. arXiv preprint arXiv:2212.02684.
Knuth, D. E. (1984). Literate programming. The Computer Journal, 27(2):97–111.
Kung, T. H., Cheatham, M., Medinilla, A., Sil-
los, C., De Leon, L., Elepano, C., Madriaga,
M., Aggabao, R., Diaz-Candido, G., Maningo,
J., et al. (2022). Performance of ChatGPT on
USMLE: Potential for AI-assisted medical education
using large language models. medRxiv preprint.
https://doi.org/10.1101/2022.12.19.22283643.
Large Language Models (GPT) Struggle to Answer Multiple-Choice Questions About Code