How Proficient Is Generative AI in an Introductory Object-Oriented
Programming Course?
Marina Lepp and Joosep Kaimre
Institute of Computer Science, University of Tartu, Narva mnt 18, Tartu, Estonia
Keywords: Artificial Intelligence (AI), Object-Oriented Programming, Programming Education, ChatGPT, Copilot,
Novice Programmers.
Abstract: In 2022, the release of ChatGPT marked a significant breakthrough in Artificial Intelligence (AI) chatbot
usage, particularly impacting computer science education. AI chatbots can now generate code snippets, but
their proficiency in solving various tasks remains debated. This study examines how well AI-based chatbots,
including ChatGPT and Microsoft Copilot, perform in solving tasks in the "Object-Oriented Programming"
course. Both tools were tested on multiple programming tasks and exam questions, with their results compared
to those of students. Currently, ChatGPT-3.5 performs below the average student, while Copilot is on par.
The chatbots performed better on introductory topics, though their performance varied as task difficulty
increased. They also fared better on longer programming test tasks than on shorter exam tasks. Common errors included failing to provide all possible solutions and misinterpreting implied requirements. Despite these
challenges, both AI tools are capable of passing the course. These findings offer valuable insights for
programming instructors by highlighting the strengths and weaknesses of AI chatbots, helping guide potential
improvements in computer science education.
1 INTRODUCTION
The year 2022 marked a breakthrough for Artificial
Intelligence, especially with the public release of
ChatGPT. This event brought widespread attention to
AI assistants and chatbots, sparking concerns about
AI replacing humans and rendering certain jobs
obsolete. One field that is notably affected is
computer science and its education (Denny et al.,
2024).
The rise of AI has led to a surge of research on its
proficiency in university courses (Denny et al., 2024;
Finnie-Ansley et al., 2022; Finnie-Ansley et al., 2023;
Richards et al., 2024), its use as a tool (Denny et al.,
2023; Jukiewicz, 2024; Kiesler et al., 2023), and its
impact on student learning (Sun et al., 2024; Yilmaz
& Yilmaz, 2023). There is currently no consensus on
whether AI outperforms the average student, with
some studies suggesting it surpasses students (Finnie-
Ansley et al., 2022; Finnie-Ansley et al., 2023;
Richards et al., 2024), while others indicate that AI
can pass courses but falls short compared to students
(Bordt & Luxburg, 2023; Shoufan, 2023).
Specifically, in courses on object-oriented
programming, there is limited research on how AI
chatbots compare to students and whether there are
particular areas where AI struggles. Additionally,
there is little research on how well AI assistants
handle input in languages other than English (for
example, Estonian) and whether this affects their
performance.
The main goal of this study is to analyze the
proficiency of various AI chatbots in an introductory
object-oriented programming course and to compare
their performance to that of students enrolled in the
course. In this paper, the terms "AI assistants",
"chatbots" and "AI chatbots" are used
interchangeably to refer to AI-based models that
students can interact with. To achieve this goal, the
paper will address the following research questions:
1. How do different chatbots perform in the "Object-Oriented Programming" course compared to students?
2. What are the common mistakes made by AI chatbots while solving the tasks from the "Object-Oriented Programming" course?
The paper opens with a summary of existing
research on the proficiency of AI chatbots in
computer science courses. Section 3 provides an
overview of the "Object-Oriented Programming"
course and outlines the evaluation methods used for
the AI chatbots in this course. Section 4 presents the
results, while Section 5 discusses those results in
detail.
2 BACKGROUND
Several studies have evaluated the effectiveness of
various AI approaches for solving tasks across
different fields, including economics (Geerling et al.,
2023), law (Bommarito & Katz, 2022), and medical
studies (Lee, 2023). However, this study primarily
focuses on computer science and programming
courses. The majority of previous studies about
computer science have concentrated on ChatGPT
(e.g. Richards et al., 2024; Sun et al., 2024; Yilmaz &
Yilmaz, 2023), although some have explored other
tools like GitHub Copilot (e.g. Finnie-Ansley et al.,
2022).
Several studies have shown that AI tools often
perform within the top quartile of the respective class
in introductory programming courses (e.g. Finnie-
Ansley et al., 2022; Richards et al., 2024). AI excels
in tasks covering basic topics but shows greater
variability in more complex areas like data structures
and algorithms (Finnie-Ansley et al., 2023). Bordt
and Luxburg (2023) confirmed this variability,
finding that while ChatGPT-3.5 could pass a data
structures course, it performed below the average
student, whereas ChatGPT-4 performed comparably
to students. Shoufan (2023) reported similar
outcomes, with ChatGPT achieving a passable grade
but still trailing behind students. Similarly, it was
found that ChatGPT could pass undergraduate
courses but struggled at the postgraduate level
(Richards et al., 2024). It outperformed students on
introductory topics but was outperformed on more
advanced tasks. In contrast, in another study, no
significant difference was discovered between AI's
performance in introductory and intermediate tasks,
with AI sometimes excelling in intermediate-level
assessments (Savelka et al., 2023). Overall, AI
assistants tend to outperform students at the
introductory level but struggle more with expert-level
tasks. The line between introductory and intermediate
tasks is not always clear, and while AI may surpass
students in beginner courses, it lags behind on more
challenging tasks, though still capable of passing.
In terms of topic-specific performance, AI
performed best in algorithms and data structures,
followed by operating systems, machine learning, and
database management systems (Joshi et al., 2024).
Beyond university courses, AI chatbots have also
been tested in competitive programming challenges
like Leetcode (Kadir et al., 2024). ChatGPT
surpassed the average human acceptance rate across
nearly all categories and difficulty levels (easy,
medium, hard), with the exception of tasks involving
bit manipulation. However, these challenges are not
directly comparable to university courses, as
participants may be less motivated to perform at their
best since the results do not impact their GPA.
ChatGPT’s ability to generate solutions from non-
English (Czech) input has been evaluated in
information security courses (Malinka et al., 2023).
ChatGPT successfully passed all four courses in which it was assessed. AI outperformed the students'
average in one course, while students outperformed
AI in the other three. ChatGPT often outperformed
students in full-text exams requiring written answers
or solutions. However, students generally performed
better in tasks involving projects, essay writing, and
coding small snippets.
Research on AI performance in object-oriented
programming courses taught in Java, particularly
when comparing AI assistants to student results, is
limited. It was found that while ChatGPT can handle
simpler tasks, it struggles with more complex ones
(Bucaioni et al., 2024; Ouh et al., 2023; Wang et al.,
2024). However, it often provides partial solutions
that give students a useful starting point. Another
issue noted is that when tasks involve data presented
as UML diagrams or API documentation, chatbots
have difficulty parsing the full input. This challenge
is not unique to Java or object-oriented programming
but highlights the broader limitations of AI in
handling non-text inputs. Camara et al. (2023)
support this, reporting that ChatGPT struggles to
reliably generate UML diagrams and often
encounters syntax issues. Cipriano and Alves (2024)
further confirmed AI's difficulties with object-
oriented tasks, finding that AI-generated code
frequently contained compilation errors, required
multiple prompts to complete all necessary classes
and functions, and failed some unit tests used for task
evaluation. In addition, code generated by ChatGPT
is prone to several quality issues, such as runtime
errors, incorrect outputs, and challenges with
maintainability (Liu et al., 2024). The limited
research comparing AI assistants to students makes it
difficult to determine whether AI faces similar
challenges or if it outperforms students in some areas
while underperforming in others.
Overall, AI has advanced to a level where it can
successfully pass university courses, but it has yet to
consistently outperform the average student. This
may be due to the significant variation in AI's
proficiency across different tasks. While AI can
easily handle small code snippets and exam
questions, it struggles with larger, more complex
projects, programs, and essays. For essays, AI often
encounters issues with proper citations and may
generate false references (Malinka et al., 2023).
Whether AI exceeds the students’ average in a course
often depends on the grading scheme and the weight
of different tasks in the final grade. Another factor
affecting AI performance is the presentation of task
descriptions. AI can also pass non-English courses,
but more research is needed to determine if this
capability extends beyond Czech to other languages,
such as Estonian. Research on AI in Java-based
object-oriented programming courses is limited, but
the general trend aligns with other fields: AI excels at
simpler, smaller tasks but struggles as complexity
increases. The lack of studies comparing AI to
students in Java courses makes it difficult to pinpoint
where AI's performance differs from that of students
or whether they make similar mistakes.
3 RESEARCH METHOD
3.1 Course Details
Object-oriented programming (OOP) is a widely used
paradigm in contemporary programming. It revolves
around the concept of objects, which are instances of
classes that encapsulate both data and the behaviors
associated with that data (Gabbrielli & Martini,
2023). At the University of Tartu, this paradigm is
taught to first-year students in the course “Object-
Oriented Programming”, using the Java programming
language. The data, tasks, and results from this course
were utilized to compare the proficiency of AI
chatbots in object-oriented programming tasks
against that of novice programmers.
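As a brief illustration of this idea (a minimal sketch, not taken from the course materials), a Java class can bundle a piece of data with the operations permitted on it:

public class BankAccount {
    // Data is hidden behind the class boundary (encapsulation).
    private double balance;

    public BankAccount(double initialBalance) {
        this.balance = initialBalance;
    }

    // Behaviour associated with the data: the balance can only change
    // through the methods the class itself provides.
    public void deposit(double amount) {
        if (amount > 0) {
            balance += amount;
        }
    }

    public double getBalance() {
        return balance;
    }
}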
The “Object-Oriented Programming” course is primarily taught to first-year Computer Science
primarily taught to first-year Computer Science
students as a mandatory part of their curriculum.
However, it is also selected as an elective by many
other students, making it one of the largest courses at
the University of Tartu, with annual enrollment
ranging from 270 to 330 students. To enroll in the
"Object-Oriented Programming" course, students
must first complete an introductory course in Python
(Leping et al., 2009). Since students are not required
to have prior experience with object-oriented
programming or Java, most enter the course as
complete beginners in both the language and the
subject matter.
The course lasts for 16 weeks and consists of
weekly lectures, homework assignments, and practice
seminars, employing a flipped-classroom approach
(Lepp & Tõnisson, 2015). Each week, students are
required to watch lecture videos, complete a short
quiz, tackle homework related to the weekly topic,
and attend practice seminars to enhance their
understanding. Throughout the course, they must take
two major tests and complete two group assignments,
culminating in a final exam. The final grade primarily
depends on the exam (33 points out of 102) and the
tests (worth 16 points each). Points earned in practice
seminars are based on attendance alone. Most
homework assignments feature automated tests that
provide immediate feedback, allowing students to
resubmit their work until they achieve the highest
score. Group projects are open-ended tasks with
specific requirements, granting students the freedom
to choose their topics and implementations, which
can lead to less uniform assessments since graders
also take the students' skill levels into account. As a
result, our analysis will focus on comparing AI and
student performance on the exam and tests, as these
assessments have clear and consistent grading
criteria, enabling more accurate comparisons. Since
these two components significantly influence the
final grade (almost 64% in total), this analysis will
offer insights into AI's proficiency relative to
students, including the potential for academic
dishonesty and the extent to which AI could assist
students.
The two tests are scheduled for weeks 7 and 13,
each covering the topics taught in the preceding
weeks. Students are provided with a program outline
listing the required classes, the methods each class
should include, any interfaces the classes must
implement or properties to inherit from a superclass,
and the expected main workflow of the program.
Students have 105 minutes to complete the program,
during which they are allowed to use any materials
except for communicating with others or seeking help
from an AI chatbot. The beginning of the sample test
task (translated into English) can be seen in Figure 1.
During the test, the students can use automatic
tests to verify that all required classes and methods
are present. However, these tests do not assess the
internal logic of the program. It is the student's
responsibility to test and debug their code to ensure it
aligns with the provided task outline.
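To illustrate what such a structural check can and cannot verify, the following hedged Java sketch (class and method names are invented for illustration and do not come from the actual course tests) confirms via reflection that a class and a method exist, without ever executing the method's logic:

import java.lang.reflect.Method;

public class PresenceCheck {
    public static void main(String[] args) {
        try {
            // Hypothetical names: the real course tests use their own classes.
            Class<?> cls = Class.forName("Library");
            Method method = cls.getDeclaredMethod("addBook", String.class);
            System.out.println("Required method found: " + method.getName());
            // The method is never invoked, so its internal logic stays untested.
        } catch (ClassNotFoundException | NoSuchMethodException e) {
            System.out.println("Required class or method is missing: " + e.getMessage());
        }
    }
}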
Figure 1: The beginning of the sample task for test 1.
Figure 2: Examples of the short exam tasks.
The exam is taken at the end of the course through
a computer-based test on Moodle. Students are
allowed to use course materials, review code
examples and documentation, and search on Google,
but they are prohibited from communicating with
others, using AI assistants, opening IDEs, compiling
code, or running it. The exam lasts 60 minutes and
covers all topics taught in the course. The first
question is a declaration of honesty, worth 1 point,
and the final question is a longer, open-ended task
worth 6 points. Between these are 13 other questions,
worth 2 points each, presented in a random order.
These questions include multiple-choice, single-
choice, fill-in-the-blank, and matching types, with the
specific set of options varying depending on the
question. Students cannot revisit previous questions
once they have moved on to the next. Examples of the
short question (translated into English) can be seen in
Figure 2.
The longer, open-ended question requires
students to explain and justify their answers. There
are two types of this question: one involves filling in
the gaps of a code snippet and providing all possible
answers along with reasoning, while the other
presents a code snippet with errors, asking students to
evaluate statements about what is wrong with the
code and justify their decisions. Examples of both
types of longer tasks (translated into English) are
shown in Figure 3.
Figure 3: Examples of the two long tasks.
3.2 Procedure and Data Analysis
ChatGPT-3.5 and Microsoft Copilot were selected for
testing. ChatGPT-3.5 was chosen due to its free
availability, unlike ChatGPT-4, which requires a paid
subscription. Since students are more likely to use the
free version, it was assumed they would opt for
ChatGPT-3.5. Microsoft Copilot was included
because students at the University of Tartu have free
access to its paid version through their university
email, making it a likely choice for students.
To evaluate how well ChatGPT and Copilot could
handle tasks in the introductory object-oriented
programming course, they were provided with the full
text of the tests and tasks, and their outputs were
graded accordingly. While ChatGPT processed the
full text with no issues, Copilot had a 4000-character
limit, requiring some tasks to be split into multiple
queries. No additional adjustments were made, which
also meant that the AI chatbots received the tasks in
Estonian.
To assess how effectively the AI tools completed
the tasks, the standard grading scheme of the course
was used. Common errors were documented to
identify recurring issues and potential problem areas.
Each AI chatbot was given three different versions of
the tests to gather more data points. For the exam,
which consists of a range of questions from a Moodle
question bank, the AI assistants were tested on ten
questions from each set to establish a reliable average
performance for each topic. The same set of questions
was presented to both AI assistants.
4 RESULTS
4.1 Programming Test 1
Test 1 covers key topics such as Java classes, objects,
Strings, files, lists, polymorphism, interfaces, and
abstract, super- and subclasses. The test is worth 16
points, with a passing score set at 12 out of 16.
ChatGPT and Copilot outputs were evaluated. On
average, ChatGPT achieved a score of 14.65, while
Copilot scored 15.85 out of 16, both exceeding the
required minimum score. A total of 285 students took
this test, with the average score being 14.61 points. In
comparison, both ChatGPT and Copilot performed
better, with their averages being higher. However, the
difference between ChatGPT and the students was
relatively small, while Copilot showed a more
significant advantage. When looking at individual
tests, ChatGPT outperformed the students’ average
on two of them. However, the students’ average was
lowered by non-compiling solutions, which were
automatically scored as 0. Copilot, on the other hand,
outperformed the students on all three tests.
Figure 4 shows the boxplot comparing the results
of the students and the AI chatbots for test 1. To
improve clarity, outliers were excluded from the
visualization since non-compiling solutions
automatically received a score of zero. The AI chatbot
results are marked with logo pictures, and the
students’ average is represented as a cross on the
boxplot. When examining the quartile distribution,
ChatGPT's scores were below the lower quartile for
T1.2 and T1.3 and only slightly above for T1.1,
suggesting that ChatGPT underperforms compared to
the average student on more lengthy and complex
tasks. In contrast, Copilot performed better, scoring
above the upper quartile for T1.2 and between the
median and upper quartile for T1.1 and T1.3,
indicating that it outperforms the average student.
Figure 4: Students’ and AI chatbots’ results in test 1.
ChatGPT consistently made two mistakes (see
Table 1): it failed to specify the required encoding for
files, and it defined logic in superclass methods that were supposed to be abstract instead of leaving them abstract and overriding them in the subclasses.
Copilot, however, had no recurring errors. It only
missed specifying file encoding once and added null
values to a toString method in just one of the three
tests.
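To make these two recurring errors concrete, the following Java sketch (a minimal illustration with invented names, not the actual test solution) shows the expected pattern: the file is read with an explicitly specified charset (UTF-8 here stands in for whatever encoding the task prescribed), and the shared method is declared abstract in the superclass so that each subclass must override it:

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

abstract class Vehicle {
    // Declared abstract instead of defining placeholder logic in the superclass,
    // so every subclass is forced to provide its own implementation.
    abstract String description();
}

class Car extends Vehicle {
    @Override
    String description() {
        return "car";
    }
}

public class EncodingExample {
    public static void main(String[] args) throws IOException {
        // The charset is passed explicitly instead of relying on the platform default.
        List<String> lines = Files.readAllLines(Path.of("input.txt"), StandardCharsets.UTF_8);
        lines.forEach(System.out::println);
        System.out.println(new Car().description());
    }
}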
4.2 Programming Test 2
Test 2 covers streams, exception handling, and data
structures, in addition to the topics covered in test 1.
The test was worth 16 points and unlike test 1, there
was no minimum score required to pass. ChatGPT
and Copilot were given the problem descriptions of
three test variants, and their results were graded.
Table 1: ChatGPT and Copilot mistakes in test 1.

T1.1
  ChatGPT mistakes: 1. Did not use the required encoding for files. 2. A method that needed to be abstract was defined.
  Copilot mistakes: Displayed null values in the toString method.

T1.2
  ChatGPT mistakes: 1. Did not use the required encoding for files. 2. A method that needed to be abstract was defined. 3. Used wrong method names. 4. A mistake in application logic.
  Copilot mistakes: Made no mistakes.

T1.3
  ChatGPT mistakes: 1. Did not use the required encoding for files. 2. Did not read data from a file.
  Copilot mistakes: Did not use the required encoding for files.
On average, ChatGPT achieved a score of 12.13,
while Copilot scored 14.87 points out of 16. The
students’ average for these tests was 13.39 points.
ChatGPT scored lower than the students’ average,
whereas Copilot scored higher. However, when
analyzing the individual tests, ChatGPT
outperformed students in T2.3 but scored lower in
T2.1 and T2.2. Copilot, on the other hand,
outperformed students in all test variants.
Figure 5 illustrates the performance of the
students and the AI chatbots in test 2. Outliers, such
as non-compiling solutions that received a zero score,
were excluded for clarity. In terms of quartile
distribution, ChatGPT's scores were below the lower
quartile for T2.1 and T2.2 and fell between the lower
quartile and the median for T2.3. In contrast,
Copilot’s scores were above the median for T2.1 and
T2.3 and just below the median for T2.2. These
findings confirm that ChatGPT underperformed
compared to the average student, while Copilot
consistently outperformed the students’ average
across the tests.
ChatGPT exhibited several recurring mistakes
(Table 2), with the most common being that methods
were marked as public instead of private. This
requirement was outlined in the test, as methods were
not supposed to be accessible outside the class, which
likely impacted the outcome. However, a similar
problem with private methods was also mentioned
previously (Wang et al., 2024). The test variants
differed in that two required the use of queues, while
one required the use of maps. ChatGPT made more
mistakes in the tests involving queues (T2.1, T2.2),
particularly with reading user input, while it
performed better in the test that involved maps (T2.3).
Another recurring issue was sorting: the tests required
a non-decreasing order, but ChatGPT sorted in the
opposite direction.
Figure 5: Students’ and AI chatbots’ results in test 2.
Table 2: ChatGPT and Copilot mistakes in test 2.

T2.1
  ChatGPT mistakes: 1. Methods were not private. 2. The logic for asking user input did not work correctly. 3. Did not use user input. 4. Multiple mistakes in application logic.
  Copilot mistakes: 1. Did not generate get and set methods.

T2.2
  ChatGPT mistakes: 1. Methods were not private. 2. The logic for asking user input did not work correctly. 3. Did not use user input. 4. Sorted in the wrong direction. 5. Did not generate some functionalities.
  Copilot mistakes: 1. Did not generate some get methods. 2. Sorted in the wrong direction.

T2.3
  ChatGPT mistakes: 1. Methods were not private. 2. Sorted in the wrong direction. 3. Small mistake in reading input.
  Copilot mistakes: 1. Did not generate some get methods. 2. A method was not private.
Copilot had a recurring issue in all three test
variants where it failed to generate some required
getter methods and found workarounds instead. Other
mistakes were more unique to individual test variants.
Like ChatGPT, Copilot struggled with the sorting
direction and setting a method as private, but these
issues occurred only in one test variant, unlike the
recurring mistakes seen with ChatGPT. Additionally,
Copilot did not show any noticeable performance
differences between the queue-based tests (T2.1,
T2.2) and the map-based test (T2.3).
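The two requirements that both chatbots missed at least once can be illustrated with a short hedged Java sketch (names and data are invented, not the actual test solution): a helper method declared private so it cannot be called from outside the class, and a list sorted in non-decreasing (ascending) order:

import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class SortingExample {
    private final List<Integer> scores = new ArrayList<>(List.of(3, 1, 2, 2));

    // Private because the task states the method must not be callable outside the class.
    private void sortScores() {
        // Natural ordering gives the required non-decreasing order;
        // Comparator.reverseOrder() would sort in the opposite direction.
        scores.sort(Comparator.naturalOrder());
    }

    public List<Integer> getScores() {
        sortScores();
        return scores;
    }

    public static void main(String[] args) {
        System.out.println(new SortingExample().getScores()); // prints [1, 2, 2, 3]
    }
}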
4.3 Exam
The AI assistants performed similarly on questions
related to objects and classes (see Appendix).
However, a noticeable difference emerged in the questions focusing on strings, files, and lists
(Q3). Both struggled with various string methods, but
ChatGPT also encountered issues with
Collections.sort and list indexing. The most
significant discrepancy, which contributed to the
larger gap in points, was that ChatGPT struggled to
accurately handle string values stored in variables,
often confusing values between variables or
introducing new ones. This issue did not occur in the
Copilot’s responses.
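A minimal sketch of the behaviour ChatGPT misjudged (values invented for illustration): Collections.sort has return type void and rearranges the given list in place rather than returning a new sorted list:

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class SortInPlace {
    public static void main(String[] args) {
        List<String> words = new ArrayList<>(List.of("pear", "apple", "kiwi"));
        // Collections.sort mutates the existing list and returns nothing.
        Collections.sort(words);
        System.out.println(words); // prints [apple, kiwi, pear]
    }
}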
Copilot consistently outperformed ChatGPT on
topics related to interfaces, sub-, super- and abstract
classes (see Appendix Q4-Q9)—even though they
made similar mistakes, Copilot did so less frequently.
Both assistants struggled with class hierarchy,
specifically confusing the order of method
declaration searches and how a superclass’s
constructor is called during instance creation.
Another shared mistake was assuming that an abstract
class must implement an interface's methods. A
unique error for Copilot was attempting to define an
abstract method in an abstract class without using the
keyword "abstract". ChatGPT, on the other hand, had
more distinct errors, such as confusing when to use
"extends" versus "implements," misunderstanding
when access modifiers are required, and mixing up
what is permissible in classes versus abstract classes.
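Both misconceptions can be demonstrated with a short hedged Java sketch (class names invented for illustration): creating a subclass instance always runs a superclass constructor first, and an abstract class may implement an interface without providing bodies for the interface's methods:

interface Describable {
    String describe();
}

// An abstract class may leave the interface method unimplemented;
// only its first concrete subclass has to implement it.
abstract class Animal implements Describable {
    Animal() {
        System.out.println("Animal constructor runs first");
    }
}

class Dog extends Animal {
    Dog() {
        // Even without an explicit super() call, the superclass's no-argument
        // constructor executes before this constructor body.
        System.out.println("Dog constructor runs second");
    }

    @Override
    public String describe() {
        return "a dog";
    }
}

public class HierarchyExample {
    public static void main(String[] args) {
        Describable d = new Dog(); // prints the two constructor messages in order
        System.out.println(d.describe());
    }
}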
The graphics questions could not be analyzed
since they included a picture of a JavaFX program,
which could not be input into the AI assistants. For
the events topic (Q11), all issues were due to logical
and string comparison errors, rather than a
misunderstanding of how events and changes
function. Overall, Copilot and ChatGPT made similar
mistakes, but the main difference in their scores came
from the fact that ChatGPT made these errors more
frequently than Copilot.
The only topic where ChatGPT outperformed
Copilot was exception handling (Q13), which is
notable since Copilot outscored ChatGPT in all other
areas. Both AI assistants struggled with print
statements, often ignoring them, and incorrectly
assumed that if an exception is thrown in a catch
block, it would be caught automatically, which is not
the case. Another shared error concerned data streams (Q12): both assumed that readUTF/writeUTF could be used interchangeably with readInt/writeInt, which is not possible. Additionally, ChatGPT had
more issues understanding file lengths. Both AIs
struggled with the stack data structure (Q14),
consistently failing on this topic. As they never answered a stack question correctly, this likely indicates a lack of sufficient training data. ChatGPT
also showed difficulties with queues and sets, a
recurring issue that also appeared in the results of test
2. This suggests that Copilot has a better grasp of less
common data structures compared to ChatGPT.
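Two of these shared errors can be checked directly with a hedged Java sketch (values invented for illustration): an exception thrown inside a catch block is not handled again by the same try/catch, and a value written with writeInt cannot be read back with readUTF:

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;

public class SharedErrorsDemo {
    public static void main(String[] args) throws IOException {
        // 1. An exception thrown inside a catch block propagates out of that
        //    try statement; the same catch clause does not handle it again.
        try {
            try {
                throw new IllegalArgumentException("original problem");
            } catch (IllegalArgumentException e) {
                throw new IllegalStateException("thrown from the catch block");
            }
        } catch (IllegalStateException e) {
            System.out.println("Caught only by an enclosing handler: " + e.getMessage());
        }

        // 2. Data stream methods are not interchangeable: writeInt stores raw
        //    bytes, and readUTF interprets the first two of them as a string
        //    length, so the original number is not recovered.
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buffer)) {
            out.writeInt(42);
        }
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(buffer.toByteArray()))) {
            String misread = in.readUTF(); // yields an empty string here, not 42
            System.out.println("readUTF produced: '" + misread + "'");
        }
    }
}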
The long tasks (Q15) required a comprehensive
understanding of the topics from the entire course,
leading to a variety of mistakes, many of which were
similar to errors made in earlier questions, such as
issues with superclass constructor calls and string-to-
number conversions. A recurring problem was that
when the task required providing all possible
solutions, the AI assistants almost always gave only
one correct answer instead of multiple possibilities.
For example, they would provide only "public" as the
correct keyword, even though other keywords were
also valid, or they would use only an abstract class or
the superclass for instance creation. The tasks also
required explanations for the proposed solutions,
where the AI assistants performed well if their initial
answer was correct. As with other tasks, Copilot
outperformed ChatGPT in this area.
Figure 6: Students’ and AI chatbots’ results in the exam.
In terms of average performance, Copilot
achieved a higher average score (27.13 points) than
the students (26.90, SD=3.62), while ChatGPT had a
lower average (23.82). The exam results for the
students and the AI chatbots are shown as a boxplot
in Figure 6. Although Copilot had a higher average
score than the students, it fell below the median,
suggesting that more than half of the students
performed better. ChatGPT fared even worse, with an
average score below the bottom quartile, indicating
that 75% of the students outperformed it. Overall, this
suggests that half of the students have a better
understanding of the topics than the AI chatbots.
5 DISCUSSION
5.1 Comparison of AI vs. Students
One of the goals of the study was to compare chatbot
performance with that of students in the "Object-
Oriented Programming" course. In test 1, both AI
chatbots outperformed the students’ average, but this
was mainly because non-compiling student solutions
received automatic zeros, lowering the average.
Quartile analysis showed ChatGPT at or below the bottom quartile, meaning roughly 75%
of students outperformed it. Copilot, however,
performed better, scoring above the median in all
three tests and reaching the upper quartile in one.
These results align with Bordt and Luxburg's (2023)
findings, where ChatGPT-3.5 passed a course but
underperformed compared to students, while
ChatGPT-4 performed similarly to them. As Copilot
is based on GPT-4, the results are consistent with this
observation. However, other studies (Finnie-Ansley
et al., 2022; Finnie-Ansley et al., 2023; Richards et
al., 2024) have found that AI tools like ChatGPT and
Codex ranked above the upper quartile in
introductory CS courses, possibly due to the tasks
being in English, whereas this study used Estonian
tasks. Similar language-related performance variance
was observed in Czech information security courses
(Malinka et al., 2023), where students often
outperformed AI.
In test 2, the AI results differed from the students’
average: ChatGPT performed worse, while Copilot
did better. ChatGPT was in the bottom 25% for two
test variants and in the bottom half for the third.
Copilot scored above the median for two variants but
below it for one. This continues the trend from test 1
where ChatGPT-3.5 passed but underperformed
compared to students, while Copilot performed closer
to the students’ median. However, both chatbots
performed worse overall in test 2 compared to test 1.
These results align with studies (Finnie-Ansley et al.,
2023; Richards et al., 2024), showing that AI tools do
better on simpler tasks but struggle with more
complex ones, even though other research (Savelka et
al., 2023) has found no such difference. The exam
results confirmed this pattern: ChatGPT scored lower
than the students’ average while Copilot
outperformed it, but both AIs still fell below the
median student. A key issue was the AIs' inability to
parse one question presented as an image, costing
them 2 points. This reflects previous challenges with
interpreting UML diagrams (Cámara et al., 2023; Ouh
et al., 2023).
An interesting observation is that the AI
performed better on the tests than on the exam. The
tests involved creating programs with multiple
classes and functions from longer textual
descriptions, while the exam focused on shorter
questions requiring decisions about smaller code
snippets. One might expect more errors in longer
texts, yet the AI struggled more with the shorter
questions. This aligns with previous research
(Malinka et al., 2023), which found that students
excelled at solving small code snippets compared to
AI, but the AI’s ability to handle more complex tasks
from detailed descriptions stands out as noteworthy.
5.2 Common AI Mistakes
The second research question examined common
mistakes made by AI chatbots in solving course tasks.
The mistakes by ChatGPT and Copilot were
documented to identify patterns. In test 1, the most
frequent error, present in all ChatGPT solutions and
once in Copilot’s, was failing to specify file encoding.
The task explicitly stated this requirement, since the input files use a particular encoding and were expected to be read with it. Copilot also
displayed null values in a toString method, and
ChatGPT had issues reading files and used incorrect
method names. Another recurring error for ChatGPT
was failing to make a superclass method abstract.
Instead, it defined logic in it, which worked but did
not follow the task's specification. Test 2 produced more errors. Both chatbots failed to make certain methods private; the requirement was phrased in Estonian as the methods not being callable outside the class, which may have confused the AI assistants, since they did not make similar mistakes in test 1. Both also sorted data incorrectly,
possibly due to confusion over the term "non-
decreasing," which was in Estonian. Copilot
struggled with generating get and set methods, while
ChatGPT had difficulties with Queue data structures,
misunderstanding how the task should be handled.
Interestingly, while ChatGPT and Copilot made
similar mistakes on the same tests, no repeated errors
were observed between tests 1 and 2. This may be due
to the different topics each test covered, which
reduced the likelihood of similar mistakes. Although
ChatGPT and Copilot failed to generate some
required get and set methods, they did not encounter
compilation issues, which have been highlighted by
previous research (Cipriano & Alves, 2024).
The exam covered various course topics,
requiring the AI assistants to tackle a wide range of
tasks. A frequent issue, which was not related to the
content but to the question format, was that the AI
often failed to choose answers from the multiple-
choice options provided, usually selecting just one
correct answer instead of listing all. Both ChatGPT
and Copilot struggled with String comparison,
making case-sensitive errors and using incorrect
methods. ChatGPT also had trouble understanding
variable values and made mistakes related to lists,
including issues with indexes, reference-based
handling, and sorting. Complex topics like interfaces,
abstract classes, and class hierarchies posed
challenges. ChatGPT consistently made errors with
abstract methods, and both AI tools struggled with
method implementations and understanding how
superclass constructors are called. They also had
difficulty using the keywords "abstract," "extends,"
"implements," and understanding access modifiers in
interfaces and abstract classes. These issues suggest
that when faced with less common or edge-case
problems, AI assistants are more prone to mistakes.
The graphics questions revealed that AI assistants
could not parse image data, similar to issues noted in
prior research on UML diagrams (Cámara et al.,
2023; Ouh et al., 2023). For events, the AIs generally
understood event logic but made recurring mistakes
in String and logical comparisons, often confusing
variable values. In data streams, both ChatGPT and
Copilot struggled with how methods like readInt and
writeUTF interact, as well as with input file size. Both
AIs also failed to account for some print statements
in exception handling and incorrectly assumed that
exceptions thrown in a catch block would be caught.
Data structures were another weak area, with both
always making errors on stack-related tasks, and
ChatGPT having recurring problems with Queues,
similar to the issues in test 2. The final exam question,
requiring knowledge across all course topics, showed
familiar errors—trouble with interfaces, abstract
classes, and class hierarchies, along with incorrect use
of access modifiers. Additionally, the AIs tended to
propose only one solution when multiple were
required. However, when their answers were correct,
their explanations were solid, consistent with
previous research where AI outperformed students in
explaining answers (Malinka et al., 2023).
6 CONCLUSIONS
This paper provided an overview of ChatGPT and
Copilot's proficiency in an introductory object-
oriented programming course, comparing their
performance to that of students. The findings offer
valuable insights for instructors, highlighting how
these AI tools stack up against students and
identifying common mistakes made by the chatbots.
This information can guide improvements to computer science courses and education.
A limitation of this study is its focus on a single course in a single year, as AI chatbots have only recently become widely available. Future research
across more courses and years would be beneficial.
While this research primarily explored AI
proficiency, another promising area of study is how
AI tools can assist lecturers with teaching and grading
automation. Additionally, as AI assistants evolve and
improve through updates, research into how their
proficiency changes over time would be an intriguing
direction for further exploration.
ACKNOWLEDGEMENTS
This work was supported by the Estonian Research
Council grant "Developing human-centric digital
solutions" (TEM-TA120).
REFERENCES
Bommarito, M., & Katz, D. M. (2022). GPT takes the bar
exam. https://doi.org/10.48550/ARXIV.2212.14402
Bordt, S., & Luxburg, U. (2023). ChatGPT participates in
a computer science exam. https://doi.org/10.48550/
arXiv.2303.09461
Bucaioni, A., Ekedahl, H., Helander, V., & Nguyen, P. T.
(2024). Programming with ChatGPT: How far can we
go? Machine Learning with Applications, 15, 100526.
https://doi.org/10.1016/j.mlwa.2024.100526
Cámara, J., Troya, J., Burgueño, L., & Vallecillo, A.
(2023). On the assessment of generative AI in modeling
tasks: An experience report with ChatGPT and UML.
Software and Systems Modeling (SoSyM), 22(3), 781–
793. https://doi.org/10.1007/s10270-023-01105-5
Cipriano, B. P., & Alves, P. (2024). LLMs still can't avoid
instanceof: An investigation into GPT-3.5, GPT-4 and
Bard's capacity to handle object-oriented programming
assignments. In Proceedings of the 46th International
Conference on Software Engineering: Software
Engineering Education and Training (pp. 162–169).
ACM. https://doi.org/10.1145/3639474.3640052
Denny, P., Khosravi, H., Hellas, A., Leinonen, J., & Sarsa,
S. (2023). Can we trust AI-generated educational
content? Comparative analysis of human and AI-
generated learning resources. https://doi.org/10.48550/
arXiv.2306.10509
Denny, P., Prather, J., Becker, B. A., Finnie-Ansley, J.,
Hellas, A., Leinonen, J., Luxton-Reilly, A., Reeves, B.
N., Santos, E. A., & Sarsa, S. (2024). Computing
education in the era of generative AI. Communications
of the ACM, 67(2), 56–67. https://doi.org/10.
1145/3624720
Finnie-Ansley, J., Denny, P., Becker, B., Luxton-Reilly, A.,
& Prather, J. (2022). The robots are coming: Exploring
the implications of OpenAI Codex on introductory
programming. In Proceedings of the 24th Australasian
Computing Education Conference (pp. 10–19).
https://doi.org/10.1145/3511861.3511863
Finnie-Ansley, J., Denny, P., Luxton-Reilly, A., Santos, E.,
Prather, J., & Becker, B. (2023). My AI wants to know
if this will be on the exam: Testing OpenAI’s Codex on
CS2 programming exercises. In Proceedings of the 25th
Australasian Computing Education Conference (pp.
97–104). https://doi.org/10.1145/3576123.3576134
Gabbrielli, M., & Martini, S. (2023). Object-oriented
paradigm. In Programming languages: Principles and
paradigms (2nd ed.). Springer Nature.
Geerling, W., Mateer, G. D., Wooten, J., & Damodaran, N.
(2023). ChatGPT has mastered the principles of
economics: Now what? SSRN Electronic Journal.
https://doi.org/10.2139/ssrn.4356034
Jukiewicz, M. (2024). The future of grading programming
assignments in education: The role of ChatGPT in
automating the assessment and feedback process.
Thinking Skills and Creativity, 52. https://doi.org/10.
1016/j.tsc.2024.101522
Joshi, I., Budhiraja, R., Dev, H., Kadia, J., Ataullah, M.,
Mitra, S., Akolekar, H., & Kumar, D. (2024). ChatGPT
in the classroom: An analysis of its strengths and
weaknesses for solving undergraduate computer
science questions. In Proceedings of the Technical
Symposium on Computer Science Education (pp. 625–
631). https://doi.org/10.1145/3626252.3630803
Kadir, M., Rahman, T., Barman, S., & Al-Amin, M. (2024).
Exploring the competency of ChatGPT in solving
competitive programming challenges. International
Journal of Advanced Trends in Computer Science and
Engineering, 13(1), 13–23. https://doi.org/10.30534/
ijatcse/2024/031312024
Kiesler, N., Lohr, D., & Keuning, H. (2023). Exploring the
potential of large language models to generate
formative programming feedback. In 2023 IEEE
Frontiers in Education Conference (FIE) (pp. 1–5).
https://doi.org/10.1109/FIE58773.2023.10343457
Lee, H. (2023). The rise of ChatGPT: Exploring its
potential in medical education. Anatomical Sciences
Education. https://doi.org/10.1002/ase.2270
Leping, V., Lepp, M., Niitsoo, M., Tõnisson, E., Vene, V.,
& Villems, A. (2009). Python prevails. In Proceedings
of the International Conference on Computer Systems
and Technologies (pp. IV.17-1 - IV.17-5).
Lepp, M., & Tõnisson, E. (2015). Integrating flipped
classroom approach and work in pairs into workshops
in programming course. In International Conference on
Frontiers in Education: Computer Science and
Computer Engineering (pp. 220-226).
Liu, Y., Le-Cong, T., Widyasari, R., Tantithamthavorn, C.,
Li, L., Le, X.-B. D., & Lo, D. (2024). Refining
ChatGPT-generated code: Characterizing and
mitigating code quality issues. ACM Transactions on
Software Engineering and Methodology, 33(5), Article
116. https://doi.org/10.1145/3643674
Malinka, K., Peresíni, M., Firc, A., Hujnák, O., & Janus, F.
(2023). On the educational impact of ChatGPT: Is
artificial intelligence ready to obtain a university
degree? In Proceedings of the 2023 Conference on
Innovation and Technology in Computer Science
Education (pp. 47–53). https://doi.org/10.1145/
3587102.3588827
Ouh, E. L., Gan, B. K. S., Jin Shim, K., & Wlodkowski, S.
(2023). ChatGPT, can you generate solutions for my
coding exercises? An evaluation on its effectiveness in
an undergraduate Java programming course. In
Proceedings of the 2023 Conference on Innovation and
Technology in Computer Science Education (pp. 54–
60). https://doi.org/10.1145/3587102.3588794
Richards, M., Waugh, K., Slaymaker, M., Petre, M.,
Woodthorpe, J., & Gooch, D. (2024). Bob or Bot:
Exploring ChatGPT's answers to university computer
science assessment. ACM Transactions on Computing
Education, 24(5), 1–32. https://doi.org/10.
1145/3633287
Savelka, J., Agarwal, A., Bogart, C., Song, Y., & Sakr, M.
(2023). Can generative pre-trained transformers (GPT)
pass assessments in higher education programming
courses? In Proceedings of the 2023 Conference on
Innovation and Technology in Computer Science
Education (pp. 117–123). https://doi.org/10.1145/
3587102.3588792
Shoufan, A. (2023). Can students without prior knowledge
use ChatGPT to answer test questions? An empirical
study. ACM Transactions on Computing Education,
23(45), 1–29. https://doi.org/10.1145/3628162
Sun, D., Boudouaia, A., Zhu, C., & Li, Y. (2024). Would
ChatGPT-facilitated programming mode impact
college students’ programming behaviors,
performances, and perceptions? An empirical study.
International Journal of Educational Technology in
Higher Education, 21(14). https://doi.org/10.1186/
s41239-024-00446-5
Wang, S., Ding, L., Shen, L., Luo, Y., Du, B., & Tao, D.
(2024). OOP: Object-oriented programming evaluation
benchmark for large language models. In Findings of
the Association for Computational Linguistics: ACL
2024 (pp. 13619–13639). Bangkok, Thailand.
Yilmaz, R., & Yilmaz, F. G. K. (2023). The effect of
generative artificial intelligence (AI)-based tool use on
students' computational thinking skills, programming
self-efficacy and motivation. Computers and
Education: Artificial Intelligence, 4. https://doi.org/10.
1016/j.caeai.2023.100147.
APPENDIX
Table: AI assistants' results in exam questions.

Q2 (Objects, Classes)
  ChatGPT: Avg=1.9, SD=0.32, min=0, max=2
    - Did not choose answers from the possible answer list given
  Copilot: Avg=1.95, SD=0.16, min=1, max=2
    - Did not follow the order of arguments of a method

Q3 (Strings, Files, Lists)
  ChatGPT: Avg=0.995, SD=0.74, min=0, max=2
    - Comparing substring indexes
    - Problems with uppercase and lowercase comparison
    - Assumed that Collections.sort returned, not changed, the existing list
    - Made mistakes with lists related to them being reference-based
    - Did not comprehend which String value was saved in the variable
  Copilot: Avg=1.7, SD=0.63, min=0, max=2
    - Problems with uppercase and lowercase comparison
    - Problems with the String contains method

Q4 (Interfaces)
  ChatGPT: Avg=1.563, SD=0.37, min=1, max=2
    - Stated that an abstract class needs to implement interface methods
    - Did not treat an empty method body as an implementation of a method
    - Used extends for interfaces
    - Stated that interface methods need access keywords
    - Used abstract methods in non-abstract classes
  Copilot: Avg=1.847, SD=0.63, min=1.6, max=2
    - Stated that an abstract class needs to implement interface methods

Q5
  ChatGPT: Avg=1.4, SD=0.78, min=0, max=2
    - Used keyword class when methods are not implemented
    - Stated that abstract cannot be used in interfaces
    - Stated that the access modifier has to be specified in an interface
    - Sorted in the wrong direction with Comparable
  Copilot: Avg=1.82, SD=0.38, min=1, max=2
    - Defined abstract methods without using the keyword abstract in an abstract class

Q6 (Class hierarchy)
  ChatGPT: Avg=1.64, SD=0.39, min=1, max=2
    - Failed to realize a superclass's constructor with no arguments is always called when creating an instance of a subclass
    - Failed to realize that when a subclass calls a superclass constructor with arguments, the constructor with no arguments is not called
  Copilot: Avg=1.904, SD=0.22, min=1.33, max=2

Q7
  ChatGPT: Avg=1.733, SD=0.64, min=0, max=2
    - Method declarations are searched for starting from subclasses, not from superclasses
  Copilot: Avg=1.8, SD=0.63, min=0, max=2

Q8 (Abstract classes)
  ChatGPT: Avg=1.516, SD=0.14, min=0.66, max=2
    - Stated that abstract classes cannot have realized methods
    - Did not add the keyword abstract to abstract methods
    - Stated that abstract classes cannot have abstract subclasses
    - Used extends with interfaces
    - Did not implement all abstract methods in a non-abstract subclass
  Copilot: Avg=1.68, SD=0.35, min=1, max=2
    - Stated that an abstract class needs to implement all superclass methods
    - Used extends with interfaces
    - Stated that a method can be overridden only when the superclass is abstract

Q9
  ChatGPT: Avg=1.663, SD=0.26, min=1.33, max=2
    - Assumed that interfaces cannot contain variables
    - Assumed that abstract classes cannot contain only non-abstract methods
  Copilot: Avg=1.826, SD=0.29, min=1.33, max=2
    - Stated that abstract classes cannot have realized methods
    - Implemented methods in interfaces
    - Assumed that interfaces cannot contain variables

Q10 (Graphics)
  Could not be analyzed (the questions contained pictures)
Q11 (Events)
  ChatGPT: Avg=1.45, SD=0.50, min=0.5, max=2
    - Made mistakes when String comparison and methods were used
    - Confused < and <=
    - Sometimes confused different variables
  Copilot: Avg=1.75, SD=0.35, min=1, max=2
    - Made mistakes when String comparison and methods were used
    - Confused != and ==
    - Sometimes confused different variables
    - Confusion with list indexes

Q12 (Streams)
  ChatGPT: Avg=1.799, SD=0.34, min=1.14, max=2
    - Did not understand when reading the input file had reached the end of the file
    - readUTF cannot comprehend input written with the method writeInt
    - Had a problem understanding how long the input file is
  Copilot: Avg=1.869, SD=0.29, min=1.14, max=2
    - readUTF cannot comprehend input written with the method writeInt

Q13 (Exception handling)
  ChatGPT: Avg=1.857, SD=0.12, min=1.75, max=2
    - Did not notice a print statement
    - An exception thrown in a catch block was not caught
  Copilot: Avg=1.732, SD=0.62, min=0, max=2

Q14 (Data structures)
  ChatGPT: Avg=1, SD=1.05, min=0, max=2
    - Did not understand how the Stack data structure works
    - Had problems with Queue element removal, and did not understand it worked as FIFO
    - Had problems when a set was given the same element multiple times
  Copilot: Avg=1.6, SD=0.84, min=0, max=2
    - Did not understand how the Stack data structure works

Q15 (Question with explanations)
  ChatGPT: Avg=4.3, SD=1.06, min=3, max=5.5
    - Did not add the Comparable interface when necessary
    - Did not add access modifiers
    - Had a problem with String to Integer and Double conversions
    - Did not use interfaces
    - Assumed that method signatures must contain throws NumberFormatException
    - Called non-static methods directly in a static context
    - Did not mention creating subclass instances
  Copilot: Avg=4.65, SD=0.66, min=3.5, max=5.5
    - Did not use interfaces or abstract classes, only class
    - Did not mention creating subclass instances both with subclass and superclass types
    - Stated that protected methods cannot be called from other classes
    - Stated that a class's main method cannot contain instances of said class
    - Did not add access modifiers
    - Did not add a call to the superclass constructor in the subclass