Large Language Models (GPT) Struggle to Answer Multiple-Choice
Questions About Code
Jaromir Savelka, Arav Agarwal, Christopher Bogart and Majd Sakr
School of Computer Science, Carnegie Mellon University, Pittsburgh, PA, U.S.A.
ORCIDs: https://orcid.org/0000-0002-3674-5456 (Savelka), https://orcid.org/0000-0001-9848-1663 (Agarwal), https://orcid.org/0000-0001-8581-115X (Bogart), https://orcid.org/0000-0001-5150-8259 (Sakr)
Keywords:
Multiple-Choice Question Answering, MCQ, Introductory and Intermediate Programming, Code Analysis,
Generative Pre-Trained Transformers, GPT, Python Course, Programming Knowledge Assessment, ChatGPT,
Codex, GitHub Copilot, AlphaCode.
Abstract:
We analyzed the effectiveness of three generative pre-trained transformer (GPT) models in answering multiple-
choice question (MCQ) assessments, often involving short snippets of code, from introductory and interme-
diate programming courses at the postsecondary level. This emerging technology stirs countless discussions
of its potential uses (e.g., exercise generation, code explanation) as well as misuses in programming educa-
tion (e.g., cheating). However, the capabilities of GPT models, and their limitations in reasoning about and/or
analyzing code in educational settings, have been under-explored. We evaluated several of OpenAI’s GPT models
on formative and summative MCQ assessments from three Python courses (530 questions). We found that
MCQs containing code snippets are not answered as successfully as those that only contain natural language.
While questions that require filling in a blank in the code or completing a natural language statement about the
snippet are handled rather successfully, MCQs that require analysis and/or reasoning about the code (e.g., what
is true/false about the snippet, or what is its output) appear to be the most challenging. These findings can be
leveraged by educators to adapt their instructional practices and assessments in programming courses, so that
GPT becomes a valuable assistant for a learner as opposed to a source of confusion and/or potential hindrance
in the learning process.
1 INTRODUCTION
This paper analyzes the effectiveness of gener-
ative pre-trained transformers (GPT), specifically
the text-davinci-* models, in handling multiple-choice
question (MCQ) assessments, often involving small
snippets of code, from introductory and intermedi-
ate programming courses. We manually collected
a sizeable data set of 530 MCQs from three exist-
ing Python courses. Using a combination of sim-
ple pattern matching and manual curation, we orga-
nized the questions into meaningful categories ac-
cording to their type (e.g., true/false questions, or
questions asking about an output of the provided code
snippet). We analyzed the performance of the GPT
models across the categories to determine if ques-
tions of a certain type are handled more successfully
than questions of other types. We also benchmark the
older InstructGPT text-davinci-001 model against
the more recent GPT-3.5 text-davinci-002 and
text-davinci-003 models to gauge the rate of im-
provement that has been achieved over the past sev-
eral years.
There has been a burst of public attention to GPT
models’ potential impact on education as the result of
the recent release of OpenAI’s ChatGPT.1 For exam-
ple, the tool has been blocked by New York City pub-
lic schools (Elsen-Rooney, 2023) because it may en-
able student plagiarism and provide inappropriate or
incorrect content. Universities have also been react-
ing, adjusting assignments (Huang, 2023) and seek-
ing out tools like GPTZero that detect text generated
by AI tools (Bowman, 2023). OpenAI has released a
similar tool itself. However, the reliability of these
tools has not been thoroughly tested.
Programming instructors as well as CS educators
in general have been sensitized to this development
even earlier. Large language models, such as GPT,
can generate computer program code (i.e., perform
computer program synthesis) with a high degree of
success. They can also explain computer program
code in natural language terms. Recently, a number
1 ChatGPT. https://chat.openai.com/ [Accessed 2023-01-26]
of computer program code generation tools have been
released. Among these, the most prominent ones are
OpenAI’s Codex (Chen et al., 2021), DeepMind’s Al-
phaCode (Li et al., 2022), and Amazon’s CodeWhis-
perer (Ankur and Atul, 2022). GitHub’s Copilot2 (a
version of Codex) conveniently integrates with IDEs,
such as Visual Studio Code, and hence has attracted
much attention. Microsoft dubs Copilot as “Your
AI pair programmer” (a reference to pair program-
ming (Beck, 2000; McDowell et al., 2002)). Since
it is available for free to students and educators, it is
inevitable that learners will use it to complete their
course assignments and assessments. Similarly, there
are no technical or cost barriers to using ChatGPT,
which can be, among many other things, leveraged
to generate answers to MCQs.
To investigate how GPT models handle the MCQ
assessments of various types in a programming ed-
ucation context, we analyzed the following research
questions:
RQ1: Is there a difference between how successfully the
GPT models handle questions that contain only
natural language and those that also contain snip-
pets of computer code?
RQ2: Are there particular types of MCQs that are more
challenging for the GPT models compared to
other types of MCQs?
By carrying out this work, we provide the follow-
ing contributions to the CS education research com-
munity. To the best of our knowledge, this is the first
comprehensive study that:
Evaluates the performance of GPT models on
MCQ-style assessments that involve code snip-
pets, across different types of such questions.
Lays a systematic foundation for discussions
about suitable uses of GPT models in program-
ming classes by providing a quantitative analysis of
the models’ capabilities and limitations in handling
computer code.
2 MOTIVATING EXAMPLE
Consider the Python script below, which asks a user to
input a value which is expected to be a number. The
entered value of type str is cast to an int and di-
vided by the length of the raw input (str). Note
that the code defends against the possibility of a
2 GitHub Copilot: Your AI pair programmer. Available at: https://github.com/features/copilot [Accessed 2023-01-20]
ZeroDivisionError, which cannot really occur, as
explained below. However, this likely confuses GPT
models when answering questions about this snippet.
try:
    value = input("Enter a value: ")
    print(int(value) / len(value))
except ZeroDivisionError:
    print("Very bad input...")
If a user enters 22, then the output of the script would
be 11.0 (i.e., 22 / 2). As shown in Figure 1, if one pro-
vides ChatGPT (one of the state-of-the-art GPT-3.5
models) with the code snippet and asks, “what would
be the output if the user enters 0,” (letting ChatGPT
choose from “A. 0.0” or “B. Very bad input...”), the
provided answer is “B. Very bad input...”. Of course,
this is an incorrect answer because the length of the
string "0" is 1 and, hence, the output is 0.0 (as shown
in Figure 1).
A human learner making this error would likely
be suspected of having several crucial misconcep-
tions. Firstly, selecting the “B. Very bad input...”
option would be somewhat more understandable if
the value variable were not placed within the len()
function call. In that case, one could assume that the
learner simply failed to recognize that the output of
the input() function call is a str and assumed it
was an int instead. However, applying the len()
function to an int would result in a TypeError be-
ing raised. Hence, the only input that could theoreti-
cally raise a ZeroDivisionError would be an empty
string. However, even that input would not result in
that particular error because it would fail on an at-
tempt to cast the value variable to int (ValueError)
that would occur prior to the division. Overall, a hu-
man learner selecting the “B. Very bad input...” an-
swer over the correct “A. 0.0” would clearly demon-
strate a lack of understanding of the workings of this
simple snippet of code.
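To make the above reasoning concrete, the short trace below (ours, not part of the course material) exercises the snippet’s logic on the three inputs discussed: "22", "0", and the empty string.

# Illustrative trace of the behaviors described above; the extra
# ValueError handler is added only so the trace can show the error.
for raw in ["22", "0", ""]:
    try:
        print(int(raw) / len(raw))
    except ZeroDivisionError:
        print("Very bad input...")
    except ValueError as exc:
        print("ValueError before the division:", exc)
# Output:
# 11.0
# 0.0   <- the correct answer for the input "0"
# ValueError before the division: invalid literal for int() with base 10: ''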
Figure 1 shows the output of ChatGPT when
asked to explain the code snippet line by line. In-
terestingly, the explanation is correct, including the
line where the division takes place. With respect to
the statement on that line, it declares that: “[it] takes
the input value and first converts it to an integer us-
ing the int() function, then divides it by the length
of the input value using the len() function.” Fur-
thermore, Figure 1 also shows the output of ChatGPT
when asked to generate Python code with the same
functionality as the provided code snippet. From the
natural language description, ChatGPT generates cor-
rect Python code with the specified behavior.
In this example, a GPT model is capable of cor-
rectly explaining the behavior (execution) of a com-
puter program on a local level (i.e., line by line). It
Figure 1: The upper-left screenshot depicts a conversation with ChatGPT when asked to explain a code snippet line by line.
It correctly explains the behavior (1). The lower-right shows a conversation with ChatGPT when asked to generate the code
snippet with the same behavior. The generated code is correct (2). The upper-right screenshot depicts a conversation with
ChatGPT when asked a straightforward MCQ about a code it can correctly explain line by line as well as correctly generate.
The answer is wrong (3)—compare the actual output of the code snippet which is shown in the lower-left corner (4).
is equally capable of generating the computer pro-
gram from a natural language description. Yet, it fails
spectacularly in answering simple questions about the
very same program. This is quite likely in stark con-
trast with a typical human learner. A learner capable
of independently writing the program from the natu-
ral language description as well as correctly explain-
ing its execution line by line, would quite likely be in
a position to answer such questions with ease.
3 RELATED WORK
In prior work, we evaluated the capability of a
GPT model (text-davinci-003), to pass a diverse
set of assessment instruments, including MCQs, in
the realistic context of full-fledged programming
courses (Savelka et al., 2023). We found that the
current GPT models are not capable of passing the
full spectrum of assessments typically involved in
a Python programming course (below 70% on even
entry-level modules); but a straightforward applica-
tion of these models could enable a learner to obtain a
non-trivial portion of the overall available score (over
55%) in introductory and intermediate courses alike.
We observed that an important limitation of the GPT
models is their apparent struggle with activities that
require chains of reasoning steps, and that there ap-
peared to be a difference in success rate between
MCQs that contain a code snippet and those that do
not (Savelka et al., 2023). In this paper, we further
explore this phenomenon, focusing on discovery of
more fine-grained properties of MCQs that are chal-
lenging for the GPT models to handle.
To the best of our knowledge, there is no other
study of GPT’s performance on MCQs from the pro-
gramming domain. There is work evaluating the per-
formance on MCQ data sets from other domains; in
many cases the tool does better than random chance;
sometimes even well enough to pass a test. For ex-
ample, Robinson et al. apply InstructGPT (Ouyang
et al., 2022) and Codex to OpenBookQA (Mihaylov
et al., 2018), StoryCloze (Mostafazadeh et al., 2016),
and RACE-m (Lai et al., 2017) data sets which fo-
cus on multi-hop reasoning, recall, and reading com-
prehension, reporting 77.4-89.2% accuracy (Robin-
son et al., 2022). In some cases, GPT can gener-
ate code when applied to programming assignments
in higher education courses. Drori and Verma used
Codex to write Python programs to solve 60 compu-
tational linear algebra MCQs, reporting 100% accu-
racy (Drori and Verma, 2021). Others have used GPT
models to solve various MCQ-based exams, includ-
ing the United States Medical Licensing Examination
(USMLE), with accuracy around 50% (Kung et al.,
2022; Gilson et al., 2022; Liévin et al., 2022), the
Multistate Bar Examination (MBE) (Bommarito II
and Katz, 2022), and the American Institute of Certi-
fied Public Accountants’ (AICPA) Regulation (REG)
exam (Bommarito et al., 2023).
Although programming-related MCQs have not
been studied directly, some researchers in adjacent
fields have studied reasoning about similarly formal
topics. Although GPT can often answer questions
about systems and rules, it is especially challenged by
tasks that involve applying them and reasoning about
their implications in novel examples. Hendrycks et
al. created a data set that includes a wide variety of
MCQs across STEM, humanities and arts, with GPT-
3 performing at levels above 50% for subjects such as
marketing and foreign policy, but below 30% for top-
ics like formal logic (Hendrycks et al., 2020). They
found that the model performed particularly poorly
in quantitative subjects. For example, in Elementary
Mathematics they note that while GPT can answer questions
about arithmetic order of operations (e.g. that mul-
tiplications are performed before additions), it can-
not correctly answer questions that require applying
this concept. They also note that GPT performance
is not necessarily correlated with how advanced the
topic is for humans, doing better at College Mathe-
matics than Elementary Mathematics. Finally, they
noted that GPT does poorly on tests of legal and moral
reasoning (Hendrycks et al., 2020).
Lu et al. studied GPT models’ performance on a
large data set consisting of 21,208 MCQs on topics
in natural science, social science, and language (Lu
et al., 2022). They prompted the models to produce
an explanation along with their answers and reported a
1-3% improvement in accuracy (74.04%). We do not
adopt this approach in this work; since it appears quite
promising and readily applicable in the context of
programming MCQs, we leave it for future work.
There is a growing body of related work on GPT
models’ capabilities in solving programming tasks
by generating code. Finnie-Ansley et al. evaluated
Codex on 23 programming tasks used as summative
assessments in a CS1 programming course (Finnie-
Ansley et al., 2022). Denny et al. focused on the
effects of prompt engineering when applying Copi-
lot to a set of 166 exercises from the publicly avail-
able CodeCheck repository (Denny et al., 2022). Out-
side of the educational context, there have been stud-
ies exploring GPT’s capabilities on competitive and
interview programming tasks. Chen et al. released
the HumanEval data set where Codex achieved 28.8%
success rate on the first attempt and 72.3% when al-
lowed 100 attempts (Chen et al., 2021). Li et al. report
DeepMind’s AlphaCode performance on Codeforces
competitions,3 achieving a 54.3% ranking amongst
5,000 participants (Li et al., 2022). Karmakar et al.
reported a 96% pass rate for Codex on a data set of
115 programming problems from HackerRank4 (Kar-
makar et al., 2022). Nguyen and Nadi reported Copi-
lot’s effectiveness on LeetCode5 problems, achieving
42% accuracy (Nguyen and Nadi, 2022).
Program code does more than control com-
puter execution; it also, some argue primarily, serves
as communication among developers (Knuth, 1984).
Since GPT is a text prediction model trained on code
in the context of human discussions about it, the
model’s representation of code is likely to capture
code’s design intent more strongly than code’s formal
properties. For example, work from multiple stud-
ies suggests that models that interpret code depend
heavily on function names and input variables (Mo-
hammadkhani et al., 2022; Yang et al., 2022). Al-
though models like GPT are not trained to simulate
code execution, they can in many cases generate code
based on a natural language description of the code’s
intent. Researchers have reported varying success at
generating code in response to programming assign-
ments, ranging from Codex’s 100% success gener-
ating Python computational linear algebra programs
(Drori and Verma, 2021), to 78.3% on some CS1 pro-
gramming problems (Finnie-Ansley et al., 2022), to
79% on the CodeCheck6 repository of Python pro-
gramming problems (Denny et al., 2022).
Researchers have identified distinct cognitive pro-
cesses involved in programming. Characterizing the
kinds of learning necessary to teach programming,
Robins et al. claim for example that the knowledge
of how programming constructs work is cognitively
different from the strategy or plan for how to build a
program; and that programming comprehension and
generation are distinct mental processes that must be
taught. Programming skill is a blend of related cog-
nitive processes; it is not surprising that a generative
3 Codeforces. Available at: https://codeforces.com/contests [Accessed 2023-01-22]
4 HackerRank. Available at: https://www.hackerrank.com/ [Accessed 2023-01-22]
5 LeetCode. Available at: https://leetcode.com/ [Accessed 2023-01-22]
6 CodeCheck: Python Exercises. Available at: https://horstmann.com/codecheck/python-questions.html [Accessed 2022-01-22]
model would not mimic all these processes equally
well (Robins et al., 2003).
GPT’s ability to answer questions intended as ed-
ucational assessments naturally raises the question
of its use for cheating. Biderman and Raff noted
that GPT solutions can evade plagiarism detection
by code similarity tools such as MOSS (Biderman
and Raff, 2022). On the other hand, Wermelinger
notes that while Copilot-generated solutions can typ-
ically pass some tests, they do not pass enough to
get a passing grade on a typical assignment; he con-
cludes that Copilot can be a useful springboard to-
wards solving CS1 problems, but outside of very
common stereotyped beginners’ exercises, a substantial
contribution from the learner is still required (Wermelinger,
2023). Becker et al. include a broader discussion of
the opportunities and challenges posed by code gen-
erating tools (Becker et al., 2022).
4 DATA SET
We manually collected MCQ assessment exercises
from three Python programming courses. Python Es-
sentials - Part 1 (Basics)7 (PE1) aims to guide a
learner from a state of complete programming illit-
eracy to a level of programming knowledge which al-
lows them to design, write, debug, and run programs
encoded in the Python language. The course consists
of four content units and one completion (summary)
test. The units include (i) introduction to Python and
computer programming, (ii) data types, variables, ba-
sic I/O, operations and basic operators, (iii) boolean
values, conditional loops, lists, logical and bitwise op-
erators, and (iv) functions, tuples, dictionaries, data
processing and exceptions.
Python Essentials - Part 2 (Intermediate) (PE2)8
is focused on more advanced aspects of Python pro-
gramming, including modules, packages, exceptions,
file processing, and object-oriented programming. Simi-
larly to PE1, the course is organized into four content
units and one completion (summary) test. The course
units are (i) modules, packages, and pip, (ii) strings,
string and list methods, and exceptions, (iii) object-
oriented programming, and (iv) miscellaneous.
Finally, Practical Programming with Python9
7 OpenEDG: Python Essentials - Part 1 (Basics). Available at: https://edube.org/study/pe1 [Accessed 2023-01-15]
8 OpenEDG: Python Essentials - Part 2 (Intermediate). Available at: https://edube.org/study/pe2 [Accessed 2023-01-15]
9 Sail(): Social and Interactive Learning Platform. Available at: https://sailplatform.org/courses. [Accessed 2023-03-03]
Table 1: Descriptive statistics of the created data set. Each
row provides information about the MCQs each of the
courses employs. The columns report the distribution of
MCQs with and without code in each course.

Course     Units (topics)   MCQ (plain)   MCQ (+code)   Course Overall
PE1        4                53            96            149
PE2        4                65            83            148
PPP        8                89            144           233
Overall    16               207           323           530
(PPP) emphasizes hands-on experience with funda-
mental Python constructs and exposure to software
development tools, practices, and real-world appli-
cations. The course consists of eight units which
include (i) Python basics and introduction to func-
tions, (ii) control flow, strings, input and output, (iii)
Python data structures, (iv) object-oriented program-
ming, (v) software development, (vi) data manipula-
tion, (vii) web scraping and office document process-
ing, and (viii) data analysis.
In PE1 and PE2, formative assessments are called
quizzes while summative assessments are called tests.
The tests determine if learners pass the courses
whereas quizzes are meant as practice. The MCQs
often include small snippets of code for learners to
reason about. From the two courses, we collected
297 questions (179 have code snippets). PPP uses
MCQ-style inline activities as formative assessment
and tests as summative assessment. From this course,
we collected 233 MCQs (144 with code snippets). Ta-
ble 1 has additional details.
We used simple pattern matching combined with
manual curation as the second step to organize the
MCQs into several categories. The first distinction
was made between MCQs with code and MCQs with
no code. For an MCQ to be considered as with code,
one of the following two conditions had to be true:
Within the body of the question there had to be at
least one line fully dedicated to computer code.
The choices were computer code expressions.
Inline mentions of names of functions or variables
were not sufficient for an MCQ to be considered
with code.
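A sketch of such a heuristic is shown below for illustration; the regular expression is a simplified stand-in rather than the exact set of patterns used during curation.

import re

# Simplified, illustrative pattern for a line "fully dedicated to code";
# it is a stand-in, not the exact set of patterns used for curation.
CODE_LINE = re.compile(r"^\s*(def |class |for |while |if |try:|print\(|\w+\s*=[^=])")

def is_with_code(question_body: str, choices: list[str]) -> bool:
    # Rule 1: at least one line of the question body is dedicated to code.
    if any(CODE_LINE.match(line) for line in question_body.splitlines()):
        return True
    # Rule 2: the answer choices themselves are code expressions.
    return bool(choices) and all(CODE_LINE.match(c) for c in choices)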
The second distinction was made along the fol-
lowing lines, focusing on the overall syntax of what
the question writer asks the student to do:
True/False
The learner is asked to assess the truthfulness of a
single statement. For example:
Developers that write code individually are
not expected to apply code standards.
A. True
B. False
Figure 2: Distribution of MCQs into categories. Note that
the MCQs asking about the output of a code snippet as well
as MCQs focused on filling-in the blanks in a snippet are not
present in the MCQs with no code. This is to be expected
given the nature of those questions. The MCQs with code
are largely dominated by questions that ask about the output
of a code snippet and by questions of the Other type.
Otherwise, the distribution is relatively uniform.
Evaluate the following expression and deter-
mine whether it is True or False.
2 + 2 != 2 * 2
A. True
B. False
Identify True/False Statement
The learner is asked to pick one or more answer
choices that are either true or false. Note that this
is different from the True/False questions (previ-
ous category). For example:
Which of the following statements is false?
A. The pandas module provides some CSV-
related methods.
B. Python has a built-in XML package with
several modules for XML parsing.
C. JSON data format has syntax to represent
all Python data structure types.
D. Python has a built-in csv module con-
taining methods for reading and writing into
CSV files.
Take a look at the snippet and choose one of
the following statements which is true:
nums = []
vals = nums[:]
vals.append(1)
A. nums is longer than vals
B. vals is longer than nums
C. nums and vals are of the same length
Finish Statement
The learner is asked to complete a statement. For
example:
The ‘**‘ operator:
A. performs duplicated multiplication
B. does not exist
C. performs exponentiation
Right-sided binding means that the following
expression:
1 ** 2 ** 3
will be evaluated:
A. from right to left
B. in random order
C. from left to right
Output
The learner is asked to identify the choice that cor-
responds to the output of a given snippet of code.
This category is applicable only to questions with
code. For example:
What is the output of the following snippet if
the user enters two lines containing 2 and 4
respectively?
x = int(input())
y = int(input())
print(x + y)
A. 2
B. 24
C. 6
What is the output of the following snippet?
my_list_1 = [1, 2, 3]
my_list_2 = []
for v in my_list_1:
    my_list_2.insert(0, v)
print(my_list_2)
A. [1, 2, 3]
B. [1, 1, 1]
C. [3, 3, 3]
D. [3, 2, 1]
Fill-in Blanks
The learner is asked to fill in a code snippet by se-
lecting the appropriate choice as an answer. This
category is applicable only to questions with code.
For example:
Fill in the blank of the is_negative func-
tion definition shown below, so that the func-
tion returns True when the argument pro-
vided to num is a negative number and returns
False otherwise.
def is_negative(num):
    return _________________
A. not (num > 0)
B. num > 0
C. num <= 0
D. num < 0
The following code snippet should open
the myfile file and assign the lines to the
all_lines variable. Which of the options
below should be used to fill in the blanks?
with __________________________
    all_lines = file.readlines()
A. open("myfile", 'r') as file:
B. "myfile" in open as file:
C. with open "myfile" as file:
Other
Any MCQ that does not fall into any of the above
categories. For example:
How many times will the code snippet below
print ‘X‘?
for i in range(1, 7):
    for j in range(2, 6):
        print('X')
A. 24
B. 28
C. 35
Notice that the above example is closely related
to the questions asking for the output of the snip-
pet. However, there is a subtle difference since
this question does not directly ask what the output
is.
Given the piece of code presented in the
code snippet below, what is the value of
palindromes[1]?
palindromes = ['pop', 'noon', 'madam']
A. 'pop'
B. 'noon'
C. 'p'
D. 'madam'
E. 'o'
Figure 2 shows the distribution of the MCQs into the
individual categories. The MCQs asking about the
output of a code snippet as well as MCQs focused on
filling-in the blanks in a snippet are not present in the
MCQs with no code. This is to be expected given the
nature of those questions. The MCQs with code are
largely dominated by questions that ask about the out-
put of a code snippet and by questions of the Other
type. Otherwise, the distribution is relatively uniform.
The fill-in questions are rare. The distribution of the
no code questions is close to uniform.
5 MODELS
The original GPT model (Radford et al., 2018) is a 12-
layer decoder-only transformer (Vaswani et al., 2017)
with masked self-attention heads. Its core capabil-
ity is fine-tuning on a downstream task. The GPT-2
model (Radford et al., 2019) largely follows the de-
tails of the original GPT model with a few modifica-
tions, such as layer normalization moved to the input
of each sub-block, additional layer-normalization af-
ter the first self-attention block, and a modified initial-
ization. Compared to the original model it displays
remarkable multi-task learning capabilities (Radford
et al., 2019). The next generation of GPT mod-
els (Brown et al., 2020) uses almost the same archi-
tecture as GPT-2. The only difference is that it uses
alternating dense and locally banded sparse attention
patterns in the layers of the transformer. The main fo-
cus of Brown et al. was to study the dependence of
performance on model size, for which eight differently
sized models were trained (from 125 million to 175
billion parameters). The largest of the models is com-
monly referred to as GPT-3. The interesting property
of these models is that they appear to be very strong
zero- and few-shot learners. This ability appears to
improve with the increasing size of the model (Brown
et al., 2020).
We are primarily interested in the performance of
text-davinci-003, one of the most advanced GPT
models offered by OpenAI. The text-davinci-003
model builds on top of previous text-davinci-002,
which in turn is based on code-davinci-002 (fo-
cused on code-completion tasks). To gauge the rate
of improvement over the past several years, we
compare the performance of text-davinci-003 to
text-davinci-002 as well as to the previous gen-
eration’s InstructGPT model (text-davinci-001).
10
Recently, OpenAI has also released gpt-3.5-turbo
which reportedly matches the performance of the
text-davinci-003 at a tenth of the cost.
We set the temperature to 0.0, which cor-
responds to no randomness. The higher the
temperature, the more creative the output, but it can
also be less factual. We set max_tokens to 500 (a
token roughly corresponds to a word). This parame-
ter controls the maximum length of the output. We set
top_p to 1, as is recommended when temperature is
set to 0.0. This parameter is related to temperature
and also influences creativeness of the output. We set
frequency_penalty to 0, which allows repetition by
ensuring no penalty is applied to repetitions. Finally,
we set presence_penalty to 0, ensuring no penalty
is applied to tokens appearing multiple times in the
output.
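For concreteness, these settings correspond to a request of the following form, assuming the legacy (pre-1.0) openai Python package and its Completion endpoint used for the text-davinci-* models:

import openai

# Sketch of a completion request with the decoding parameters above.
mcq_prompt = "..."  # the filled-in MCQ prompt template (see Section 6)
response = openai.Completion.create(
    model="text-davinci-003",
    prompt=mcq_prompt,
    temperature=0.0,        # no randomness
    max_tokens=500,         # upper bound on the length of the completion
    top_p=1,                # recommended value when temperature is 0.0
    frequency_penalty=0,    # no penalty on repeated tokens
    presence_penalty=0,     # no penalty on tokens already present
)
answer_text = response["choices"][0]["text"].strip()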
10 OpenAI: Model index for researchers. Available at: https://beta.openai.com/docs/model-index-for-researchers/instructgpt-models [Accessed 2023-01-15]
6 EXPERIMENTAL DESIGN
To test the performance of the three text-davinci-*
models, we submit MCQs one by one using the
openai Python library,11 which is a wrapper for
OpenAI’s REST API. We embed each question in the
prompt template shown in Figure 3. The text of the
prompt’s preamble is inspired by OpenAI’s QA ex-
ample.12 The {{question}} token is replaced with the
question text. The {{choices}} token is replaced with
the candidate answers where each one is placed on
a single line preceded by a capital letter. Each model
returns one or more of the choices as the prompt com-
pletion, which is then compared to the reference an-
swer. For PE1 and PE2, we count partially correct an-
swers as incorrect, following the course creators’ as-
sessment guidelines. In PPP, there is always exactly
one correct answer.
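A minimal sketch of the prompt construction is shown below; the helper name is ours, added for illustration, and the preamble text is the one shown in Figure 3.

PREAMBLE = (
    "I am a highly intelligent bot that can easily handle answering "
    "multiple-choice questions on introductory Python topics. Given a "
    "question and choices I can always pick the right ones.\n\n"
)

def build_prompt(question: str, choices: list[str]) -> str:
    # Each candidate answer is placed on its own line, preceded by a
    # capital letter, exactly as in the template of Figure 3.
    lettered = "\n".join(f"{chr(ord('A') + i)}. {c}" for i, c in enumerate(choices))
    return f"{PREAMBLE}Question: {question}\n\nChoices:\n{lettered}\n\nThe correct answer:"

# Example use:
# print(build_prompt("What is the output of print(2 + 2)?", ["2", "4", "22"]))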
As the baseline, we use a simple model that se-
lects the answer with the highest Jaccard similarity to
the question. In case of a tie the longest answer is se-
lected. Jaccard similarity is one of the simplest mea-
sures of text similarity. Hence, it is an ideal candidate
for a baseline as it allows us to detect what ratio of the
questions within their respective categories could be
solved by employing this simple, yet sensible, heuristic.
Such MCQs likely pose very little challenge for GPT
models.
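A sketch of this baseline follows; the tokenization is assumed to be simple lower-cased whitespace splitting, which is an illustrative choice rather than a specification.

def jaccard(a: set[str], b: set[str]) -> float:
    # Jaccard similarity: size of the intersection over size of the union.
    return len(a & b) / len(a | b) if a | b else 0.0

def baseline_answer(question: str, choices: list[str]) -> str:
    q_tokens = set(question.lower().split())
    scored = [(jaccard(q_tokens, set(c.lower().split())), len(c), c) for c in choices]
    # The most similar choice wins; ties are broken in favor of the longest answer.
    return max(scored)[2]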
We report the proportions of the correct an-
swers (i.e., the accuracy) for each model per MCQ
category. We specifically focus on the differences
in performance of the text-davinci-003 model on
MCQs that contain code snippets (with code) com-
pared to MCQs that do not (no code). We are also
interested in the difference between the performance
on completion-based MCQs (Finish Statement and
Fill-in Blanks) compared to the rest. This is because
these question types are not too far off from the pre-
training objective and, hence, the expectation is that
the models’ performance should be higher on these
types. To test statistical significance, we use a simple
two-independent-proportions test, i.e., a statistical
hypothesis test used to determine whether two propor-
tions differ from each other.
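As an illustration, the sketch below implements the standard pooled two-proportion z-test (assumed here to be the variant used) and applies it to the no-code versus with-code totals of text-davinci-003 from Table 2.

from math import sqrt, erfc

def two_proportion_z_test(x1: int, n1: int, x2: int, n2: int) -> tuple[float, float]:
    # Pooled two-proportion z-test; returns the z statistic and the
    # two-sided p-value.
    p1, p2, pooled = x1 / n1, x2 / n2, (x1 + x2) / (n1 + n2)
    se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    return z, erfc(abs(z) / sqrt(2))

# No-code 134/172 vs. with-code 213/358 for text-davinci-003 (Table 2):
z, p = two_proportion_z_test(134, 172, 213, 358)
print(round(z, 2), p)  # z is roughly 4.2 and p is on the order of 1e-5, i.e., p < 0.0001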
7 RESULTS
Table 2 reports the results of our experiments. Firstly,
as one would expect, all three GPT models clearly
11 GitHub: OpenAI Python Library. Available at: https://github.com/openai/openai-python [Accessed 2023-01-16]
12 OpenAI: Q&A. Available at: https://platform.openai.com/examples/default-qa [Accessed 2023-03-04]
I am a highly intelligent bot that can easily
handle answering multiple-choice questions on
introductory Python topics. Given a question
and choices I can always pick the right ones.
Question: {{question}}
Choices:
{{choices}}
The correct answer:
Figure 3: MCQ Prompt Template. The text of the
preamble (1) is inspired by OpenAI’s QA example. The
{{question}} token (2) is replaced with the question text.
The {{choices}} token (3) is replaced with the candidate
answers where each one is placed on a single line preceded
by a capital letter.
outperform the simple Jaccard similarity baseline.
The text-davinci-003 model appears to perform
the best (65.5% overall) with a small margin over
the text-davinci-002 (64.5% overall). The perfor-
mance of the text-davinci-001 appears to be much
lower compared to the other two models. This is to
be expected. While the text-davinci-002 is a di-
rect predecessor of the text-davinci-003 (hence,
the small difference), the text-davinci-001 is quite
removed from the two. The major breakthrough
in OpenAI GPT-3’s capabilities in handling com-
puter code was Codex (code-davinci-002) (Chen
et al., 2021) which is the direct predecessor of
text-davinci-002.13
There appears to be a clear difference between the
performance of the most capable text-davinci-003
on the MCQs that contain code snippets (59.5% over-
all) compared to those that do not (77.9% over-
all). This difference is statistically significant (p <
0.0001). This is to be expected as the combination of
code and natural language likely constitutes (on aver-
age) more complex input than natural language alone.
Additionally, it is quite possible that in our particular
context the questions with code are (on average) more
difficult than questions with no code.
There also appears to be a clear difference be-
tween the performance of text-davinci-003 on
the completion-oriented MCQs (87.1%) and the rest
(60.1%). This difference is statistically significant
(p < 0.0001). Since GPT models are primarily fo-
cused on prompt completion, be it text or computer
code, this finding is also as expected.
13 OpenAI: Model index for researchers. Available at: https://beta.openai.com/docs/model-index-for-researchers/instructgpt-models [Accessed 2023-01-15]
Table 2: Results of the experiments. The Jaccard column reports the performance of the baseline. The text-davinci-001,
text-davinci-002, and text-davinci-003 columns report the performance of the different GPT-3 models. Results of the No Code
and With Code sections are summarized in the Total rows. The Overall row at the bottom reports the average performance of
the models across all the types of MCQs.

Question Type                      Jaccard           text-davinci-001   text-davinci-002   text-davinci-003
No Code
  True/False                       11/25 (44.0%)     13/25 (52.0%)      19/25 (76.0%)      20/25 (80.0%)
  Identify True/False Statement    8/44 (18.2%)      12/44 (27.3%)      22/44 (50.0%)      27/44 (61.4%)
  Finish Statement                 12/53 (22.6%)     40/53 (75.5%)      46/53 (86.8%)      48/53 (90.6%)
  Other                            9/47 (19.1%)      27/50 (53.2%)      43/50 (86.0%)      39/50 (74.0%)
  Total                            40/172 (23.2%)    92/172 (53.5%)     130/172 (75.6%)    134/172 (77.9%)
With Code
  True/False                       9/23 (39.1%)      12/23 (52.2%)      10/23 (43.5%)      10/23 (43.5%)
  Identify True/False Statement    4/26 (15.4%)      10/26 (38.5%)      15/26 (57.7%)      11/26 (42.3%)
  Output                           22/110 (20.0%)    28/110 (25.4%)     58/110 (52.7%)     53/110 (48.2%)
  Fill-in                          2/13 (15.4%)      5/13 (38.5%)       10/13 (76.9%)      11/13 (84.6%)
  Finish Statement                 8/27 (29.6%)      10/27 (37.0%)      22/27 (81.5%)      22/27 (81.5%)
  Other                            39/159 (24.5%)    42/159 (26.4%)     97/159 (61.1%)     106/159 (66.7%)
  Total                            84/358 (23.4%)    107/358 (29.9%)    212/358 (59.2%)    213/358 (59.5%)
Overall                            124/530 (23.4%)   199/530 (37.5%)    342/530 (64.5%)    347/530 (65.5%)
8 DISCUSSION
Our experimental results suggest that there, indeed,
is a difference between how successfully the GPT
models handle questions that contain only natural lan-
guage and those that also contain snippets of com-
puter code (RQ1). Tentatively, we can conclude that
inclusion of a code snippet within an MCQ makes the
question more challenging for GPT models to handle.
This conclusion is supported by universally lower per-
formance on MCQs with code across all the subtypes,
i.e., True/False, Identify True/False Statement, Finish
Statement, and Other. The root cause for this discrep-
ancy is likely one or more of the following: (i) GPT
models are somewhat more limited with respect to
handling computer programs compared to natural lan-
guage; (ii) GPT models struggle with the combina-
tion of different types of expressions (i.e., natural lan-
guage and code); and/or (iii) the questions with code
snippets are inherently more difficult.
While the greater difficulty of the questions with
code might certainly be a factor, it appears that the
GPT models sometimes struggle to answer questions
with code that one might judge as simple. For exam-
ple, consider the following MCQ:
The following statement:
assert var == 0
A. is erroneous
B. will stop the program when var != 0
C. has no effect
D. will stop the program when var == 0
The answer of text-davinci-003 to this question
was “D. will stop the program when var == 0”. Hence,
it appears there are certain limitations in the capabil-
ities of the GPT models to answer questions about
code. This is somewhat surprising if one considers the
well-documented capabilities of the models when it
comes to generation or explanation of computer pro-
grams.
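For reference, the semantics in question can be checked directly with the snippet below (ours, added for illustration): the assertion raises AssertionError, stopping the program, exactly when var != 0, which is why B is the correct choice.

var = 0
assert var == 0      # condition holds: execution continues silently
var = 1
try:
    assert var == 0  # condition fails: AssertionError is raised
except AssertionError:
    print("AssertionError: the program would stop here because var != 0")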
The results also show that certain types of MCQs
are more challenging than others for the GPT mod-
els (RQ2). The questions that involve generation of
natural language and/or code appear to be handled
with much more success than other types of ques-
tions. This is to be expected as GPT models are pri-
marily focused on prompt completion. On the other
hand, it leads to somewhat paradoxical situations such
as the one illustrated in the motivating example (Sec-
tion 2). The models are capable of generating code
based on a natural language description, as well as
generating natural language explaining execution of
the code line-by-line. Yet, somehow these capabili-
ties do not seem to extend to the realm of answering
pointed, specific questions about the code (often quite
simple ones).
We hypothesize that the above-described para-
dox might be related to the phenomenon described
by Détienne and Bott (2002). They point out that
program code serves two simultaneous purposes: it
is both a narrative description of a programmer’s in-
tent, and an artifact that controls computer execution.
Accordingly, human programmers maintain, and syn-
chronize, at least two kinds of mental models of code:
a functional model that captures the purpose the pro-
gram is supposed to serve in the world, and a structural
model that allows mental simulation of data and con-
trol flow.
Since GPT models are trained on large corpora
that include texts in natural language as well as pro-
gram code with comments and documentation, they
may acquire robust representations of the functional
relationship between code and the intent it expresses.
The training corpora likely do not contain code with
outputs or trace logs of its execution. Thus, models
may lack the required data to build a representation
of a structural model of code’s function. This is not
to say that including the mentioned resources into the
training corpora would necessarily result in the acqui-
sition of such a model. This is because an effective
use of the model may require the ability to simulate
execution of the code, or parts of it. The current large
language models, including GPT, do not have this ca-
pability. Note that there is an active area of research
in augmenting large language models with reasoning
skills and providing them with the ability to use tools,
such as the Python interpreter (Mialon et al., 2023).
Arguably, building up these connected mental
models of code’s purpose and operation should be a
key part of what CS education teaches. The particular
limitations of GPT models provide a useful lens into
what kind of mental model we are evaluating in typ-
ical higher education programming assessments. It
may be that True/False and Identify True/False State-
ment MCQs more likely require mental simulation of
the code execution. An experiment to validate our hy-
pothesis might be to classify MCQs according to their
focus on (a) predicting actual behavior, or (b) infer-
ring intent, and measure if and how the GPT models’
performance correlates with this classification.
There are ongoing debates as to the changes the
emergence of GPT-based tools such as ChatGPT or
GitHub Copilot will inflict on the software develop-
ment profession as well as programming education.
Firstly, it appears inevitable that the tools will become
an integral and accepted part of the software devel-
opment process. Therefore, future programmers will
likely need to write less code. On the other hand, they
will need to be able to validate the auto-generated
code, spot deficiencies, and correct them efficiently.
Hence, programming education might need to de-
prioritize teaching learners how to write code and start
emphasizing skills such as requirements formulation,
debugging, trade-off analysis, and critical thinking.
Finally, the GPT-based tools present numerous op-
portunities to improve current instructional and as-
sessment practices in programming classes. Our ex-
periments suggest that GPT models are capable of
explaining code in plain and easily understandable
terms. Similarly, they are capable of generating and
completing program code. A judicious use of these
capabilities might result in numerous novel tools and
instructional approaches for novices and advanced
learners alike. However, there are also potential
threats. An improper or misinformed use of the tools
may result in an academic integrity violation (AIV)
incident (i.e., cheating). Similarly, over-reliance on
GPT-based tools may hinder rather than improve the
learning process.
9 CONCLUSIONS AND FUTURE
WORK
We evaluated text-davinci-* GPT models on a
sizeable set of 530 MCQs, many of which con-
tained code snippets, from three Python program-
ming courses. The overall accuracy of the most ca-
pable text-davinci-003 model was measured at
65.5% (compared to the 23.4% Jaccard similarity
baseline). While such performance is impressive
there appear to be some noticeable limitations. First
of all, it appears that the MCQs containing code snip-
pets were somewhat more challenging (59.5%) for
the model than those with no code (77.9%). In ad-
dition, MCQs that ask the learner to complete a sentence or fill
in a blank appear to be handled much more suc-
cessfully (87.1%) compared to other types of ques-
tions (60.1%). Therefore, GPT models’ capabili-
ties seem limited when it comes to handling MCQs
about computer code requiring reasoning beyond
mere completion (56.6%).
While our study of GPT models’ performance on
diverse types of MCQs yielded numerous valuable in-
sights, it is subject to countless limitations and leaves
much room for improvement. Hence, we suggest sev-
eral directions for future work: (i) further analyze the
effects of prompt-tuning and/or (ii) iterative prompt-
construction; (iii) examine the performance of GPT
models on other domains, e.g., competitive mathe-
matics; (iv) develop a systematic framework to com-
prehensively assess the capabilities and limitations
of GPT models; and (v) study possibilities of effec-
tive integration of GPT-based tools, e.g., ChatGPT or
Copilot, into programming education.
REFERENCES
Ankur, D. and Atul, D. (2022). Introducing Amazon
CodeWhisperer, the ML-powered coding com-
panion. AWS Machine Learning Blog. June
24, 2022. https://aws.amazon.com/blogs/machine-
learning/introducing-amazon-codewhisperer-the-ml-
powered-coding-companion/.
Beck, K. (2000). Extreme programming explained: em-
brace change. Addison-Wesley professional.
Becker, B. A., Denny, P., Finnie-Ansley, J., Luxton-Reilly,
A., Prather, J., and Santos, E. A. (2022). Programming
is hard–or at least it used to be: Educational oppor-
tunities and challenges of ai code generation. arXiv
preprint arXiv:abs/2212.01020.
Biderman, S. R. and Raff, E. (2022). Fooling moss detec-
tion with pretrained language models. Proceedings of
the 31st ACM International Conference on Informa-
tion & Knowledge Management.
Bommarito, J., Bommarito, M., Katz, D. M., and Katz,
J. (2023). GPT as knowledge worker: A zero-shot
evaluation of (AI) CPA capabilities. arXiv preprint
arXiv:abs/2301.04408.
Bommarito II, M. and Katz, D. M. (2022). GPT takes the
bar exam. arXiv preprint arXiv:abs/2212.14402.
Bowman, E. (2023). A college student created
an app that can tell whether ai wrote an es-
say. NPR Technology. January 9, 2023.
https://www.npr.org/2023/01/09/1147549845.
Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D.,
Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G.,
Askell, A., et al. (2020). Language models are few-
shot learners. Advances in neural information pro-
cessing systems, 33:1877–1901.
Chen, M., Tworek, J., Jun, H., Yuan, Q., Pinto, H. P. d. O.,
Kaplan, J., Edwards, H., Burda, Y., Joseph, N., Brock-
man, G., Ray, A., Puri, R., Krueger, G., Petrov, M.,
Khlaaf, H., Sastry, G., Mishkin, P., Chan, B., Gray, S.,
Ryder, N., Pavlov, M., Power, A., Kaiser, L., Bavar-
ian, M., Winter, C., Tillet, P., Such, F. P., Cummings,
D., Plappert, M., Chantzis, F., Barnes, E., Herbert-
Voss, A., Guss, W. H., Nichol, A., Paino, A., Tezak,
N., Tang, J., Babuschkin, I., Balaji, S., Jain, S., Saun-
ders, W., Hesse, C., Carr, A. N., Leike, J., Achiam,
J., Misra, V., Morikawa, E., Radford, A., Knight,
M., Brundage, M., Murati, M., Mayer, K., Welin-
der, P., McGrew, B., Amodei, D., McCandlish, S.,
Sutskever, I., and Zaremba, W. (2021). Evaluating
large language models trained on code. arXiv preprint
arXiv:abs/2107.03374.
Denny, P., Kumar, V., and Giacaman, N. (2022). Convers-
ing with Copilot: Exploring prompt engineering for
solving cs1 problems using natural language. arXiv
preprint arXiv:abs/2210.15157.
Drori, I. and Verma, N. (2021). Solving linear algebra by
program synthesis. arXiv preprint arXiv:2111.08171.
Détienne, F. and Bott, F. (2002). Software design–cognitive
aspects. Springer Verlag.
Elsen-Rooney, M. (2023). NYC education department
blocks ChatGPT on school devices, networks. Chalk-
beat New York. January 3, 2023.
Finnie-Ansley, J., Denny, P., Becker, B. A., Luxton-Reilly,
A., and Prather, J. (2022). The robots are coming:
Exploring the implications of OpenAI Codex on in-
troductory programming. In Australasian Computing
Education Conference, ACE ’22, page 10–19, New
York, NY, USA. Association for Computing Machin-
ery.
Gilson, A., Safranek, C. W., Huang, T., Socrates,
V., Chi, L. S., Taylor, R. A., and Chartash, D.
(2022). How well does chatgpt do when tak-
ing the medical licensing exams? the implica-
tions of large language models for medical edu-
cation and knowledge assessment. In medRxiv.
https://doi.org/10.1101/2022.12.23.22283901.
Hendrycks, D., Burns, C., Basart, S., Zou, A., Mazeika, M.,
Song, D., and Steinhardt, J. (2020). Measuring mas-
sive multitask language understanding. arXiv preprint
arXiv:abs/2009.03300.
Huang, K. (2023). Alarmed by A.I. chatbots, universities
start revamping how they teach. New York Times. Jan-
uary 16, 2023.
Karmakar, A., Prenner, J. A., D’Ambros, M., and Robbes,
R. (2022). Codex hacks HackerRank: Memorization
issues and a framework for code synthesis evaluation.
ArXiv, abs/2212.02684.
Knuth, D. E. (1984). Literate programming. The computer
journal, 27(2):97–111.
Kung, T. H., Cheatham, M., Medinilla, A., Sil-
los, C., De Leon, L., Elepano, C., Madriaga,
M., Aggabao, R., Diaz-Candido, G., Maningo,
J., et al. (2022). Performance of ChatGPT on
USMLE: Potential for ai-assisted medical education
using large language models. medRxiv preprint.
https://doi.org/10.1101/2022.12.19.22283643.
Lai, G., Xie, Q., Liu, H., Yang, Y., and Hovy, E.
(2017). Race: Large-scale reading comprehen-
sion dataset from examinations. arXiv preprint
arXiv:abs/1704.04683.
Li, Y., Choi, D., Chung, J., Kushman, N., Schrittwieser,
J., Leblond, R., Eccles, T., Keeling, J., Gimeno,
F., Lago, A. D., Hubert, T., Choy, P., de Mas-
son d’Autume, C., Babuschkin, I., Chen, X., Huang,
P.-S., Welbl, J., Gowal, S., Cherepanov, A., Molloy,
J., Mankowitz, D. J., Robson, E. S., Kohli, P., de Fre-
itas, N., Kavukcuoglu, K., and Vinyals, O. (2022).
Competition-level code generation with AlphaCode.
Science, 378(6624):1092–1097.
Liévin, V., Hother, C. E., and Winther, O. (2022). Can
large language models reason about medical ques-
tions? ArXiv preprint arXiv:abs/2207.08143.
Lu, P., Mishra, S., Xia, T., Qiu, L., Chang, K.-W., Zhu,
S.-C., Tafjord, O., Clark, P., and Kalyan, A. (2022).
Learn to explain: Multimodal reasoning via thought
chains for science question answering.
McDowell, C., Werner, L., Bullock, H., and Fernald, J.
(2002). The effects of pair-programming on perfor-
mance in an introductory programming course. In
Proceedings of the 33rd SIGCSE technical symposium
on Computer science education, pages 38–42.
Mialon, G., Dessì, R., Lomeli, M., Nalmpantis, C., Pa-
sunuru, R., Raileanu, R., Rozière, B., Schick, T.,
Dwivedi-Yu, J., Celikyilmaz, A., et al. (2023). Aug-
mented language models: a survey. arXiv preprint
arXiv:2302.07842.
Mihaylov, T., Clark, P., Khot, T., and Sabharwal, A. (2018).
Can a suit of armor conduct electricity? A new dataset
for open book question answering. arXiv preprint
arXiv:abs/1809.02789.
Mohammadkhani, A. H., Tantithamthavorn, C. K., and
Hemmati, H. (2022). Explainable AI for pre-trained
code models: What do they learn? When they do not
work? ArXiv preprint, arXiv:abs/2211.12821.
Mostafazadeh, N., Chambers, N., He, X., Parikh, D., Batra,
D., Vanderwende, L., Kohli, P., and Allen, J. (2016).
A corpus and cloze evaluation for deeper understand-
ing of commonsense stories. In Proceedings of the
2016 Conference of the North American Chapter of
the Association for Computational Linguistics: Hu-
man Language Technologies, pages 839–849.
Nguyen, N. and Nadi, S. (2022). An empirical evalua-
tion of GitHub Copilot’s code suggestions. In 2022
IEEE/ACM 19th International Conference on Mining
Software Repositories (MSR), pages 1–5.
Ouyang, L., Wu, J., Jiang, X., Almeida, D., Wainwright,
C. L., Mishkin, P., Zhang, C., Agarwal, S., Slama,
K., Ray, A., et al. (2022). Training language mod-
els to follow instructions with human feedback. arXiv
preprint arXiv:abs/2203.02155.
Radford, A., Narasimhan, K., Salimans, T., and Sutskever,
I. (2018). Improving language understanding by gen-
erative pre-training.
Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., and
Sutskever, I. (2019). Language models are unsuper-
vised multitask learners.
Robins, A., Rountree, J., and Rountree, N. (2003). Learning
and Teaching Programming: A Review and Discus-
sion. Computer Science Education, 13(2):137–172.
Robinson, J., Rytting, C. M., and Wingate, D. (2022).
Leveraging large language models for multi-
ple choice question answering. arXiv preprint
arXiv:abs/2210.12353.
Savelka, J., Agarwal, A., Bogart, C., Song, Y., and Sakr,
M. (2023). Can generative pre-trained transformers
(gpt) pass assessments in higher education program-
ming courses? In Proceedings of the 28th Annual
ACM Conference on Innovation and Technology in
Computer Science Education.
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones,
L., Gomez, A. N., Kaiser, Ł., and Polosukhin, I.
(2017). Attention is all you need. Advances in neural
information processing systems, 30.
Wermelinger, M. (2023). Using GitHub Copilot to solve
simple programming problems. Proceedings of the
54th ACM Technical Symposium on Computing Sci-
ence Education (SIGCSE).
Yang, G., Zhou, Y., Yang, W., Yue, T., Chen, X., and Chen,
T. (2022). How important are good method names in
neural code generation? A model robustness perspec-
tive. ArXiv, abs/2211.15844.