
tion to well-established good practices of programming. For this evaluation, we use Pylint, a widely used tool for the Python programming language that checks a module against coding standards and flags certain types of code smells.
Examining ChatGPT for Potential Memoriza-
tion of Training Data: One of the fears surrounding
LLMs is that they might memorize (potentially pri-
vate) data that appears in the training set. We design an experiment to gauge how susceptible ChatGPT might be to this problem. While our experiment cannot definitively confirm whether memorization has occurred, it can still offer useful perspective to end-users deciding whether to allow their data to be used in a training set.
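As a purely illustrative sketch (not necessarily the exact protocol of our experiment), one simple way to probe for memorization is to ask the model to continue a known text and measure how close its continuation is to the original:

import difflib

def overlap_ratio(model_output, reference_text):
    # Similarity between the model's continuation and the original text;
    # values close to 1.0 suggest near-verbatim recall of the source.
    return difflib.SequenceMatcher(None, model_output, reference_text).ratio()

# Hypothetical usage: give the model the first half of a known problem
# statement and compare its continuation with the real second half.
# score = overlap_ratio(chatgpt_continuation, official_second_half)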
Assessing the Level of “wrongness” of Wrong
ChatGPT Solutions: When ChatGPT generates a
wrong solution to a problem, it is still insightful to
gauge the extent to which this solution is wrong. For
example, a marginally wrong solution might be fix-
able through minor tweaks. To study this phenomenon, we evaluated how many test cases were passed by ChatGPT-generated programs that executed successfully yet failed to solve the problem at hand.
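As a rough illustration (a hypothetical sketch, not part of our evaluation pipeline), this degree of wrongness can be summarized by the fraction of LeetCode test cases that such a solution fails:

def wrongness(passed, total):
    # Fraction of test cases failed by a solution that ran without
    # runtime errors: 0.0 means fully correct, 1.0 means fully wrong.
    # 'passed' and 'total' are read from LeetCode's submission report.
    if total <= 0:
        raise ValueError("a problem must have at least one test case")
    return 1.0 - passed / total

A score close to 0.0 corresponds to a marginally wrong solution that may be fixable with minor tweaks.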
2 RELATED RESEARCH
Table 1 shows a summary of how our paper differs
from a series of recent works that are closest to our
work. We organize our differences from these works into five categories that we discuss in detail below.
(1) Size of the experiment: Our research analyzes
ChatGPT’s coding abilities using 2,792 coding challenges, the largest number to date, compared with the previous high of 264 in related studies (Bubeck et al., 2023). This scale is crucial for statistical robustness, allowing a thorough examination of the model’s diverse strengths and weaknesses.
(2) Variety of coding tasks used in experiment:
In terms of coding tasks, our study significantly ex-
tends the scope of previous research by Bubeck et
al. and Tian et al., which focused on data struc-
tures, algorithms, and specific LeetCode problems.
While Bubeck et al. used 264 questions, including 100 from LeetCode, without detailing their topics, and Tian et al. concentrated on arrays, hash tables, sorting, and string-related problems, our study covers
more ground. We investigate five algorithmic areas:
dynamic programming, greedy algorithms, depth-first
search, divide and conquer, and topological sort. Ad-
ditionally, we examine data structures, including pri-
ority queues, arrays, hash tables, stacks, and binary
search trees, along with string-related problems.
(3) Coding quality evaluation: While most related
studies mainly focus on the accuracy of coding solu-
tions from language models, our approach additionally assesses both GPT-3 and GPT-4 using a wide range
of code quality metrics. This includes coding con-
ventions, error intricacies, warnings, and refactoring
needs. The research in (Feng et al., 2023) also exam-
ines coding errors using Flake8, but its primary aim is
different, focusing on social media analysis to under-
stand ChatGPT’s code generation and overall usage.
(4) Language models under evaluation: Our study
examines both GPT-4 and its predecessor, GPT-3.5,
comparing their performance on identical coding
challenges. Earlier works like (Noever and McKee,
2023), (Biswas, 2023), and (Tian et al., 2023) focused
on GPT-3, contrasting it with older models like Codex
and CodeGen, but did not include GPT-4, which was
not available then. Our research is similar to (Bubeck
et al., 2023), which also evaluates GPT-4’s coding
performance, but we differ in experiment size, the va-
riety of computing problems addressed, and our in-
clusion of coding quality assessments.
(5) Training set memorization and assessment of
wrong solutions: Finally, the evaluation of ChatGPT’s
memorization behavior and the assessment of the ex-
tent of “wrongness” of ChatGPT’s wrong solutions
(recall Section 1) are novelties in our work that have
not been studied by any of the previous works on
ChatGPT.
3 DATA COLLECTION EXPERIMENTS
Tools Used in Our Experiments: Our ChatGPT
evaluations primarily utilized two tools: LeetCode
and Pylint. LeetCode is an online platform that offers
a vast array of coding challenges and interview prepa-
ration materials across various difficulty levels and
topics, supporting numerous programming languages.
Interview questions at major tech companies such as Google, Amazon, and Microsoft are often drawn directly from LeetCode. LeetCode’s built-in
compiler not only assesses user-submitted code but
also benchmarks it against other submissions using a
comprehensive set of test cases. To evaluate the code
quality of the programming solutions generated by
ChatGPT, we used the Pylint Python library. Pylint conducts static analysis, checking Python code for adherence to coding standards and style guidelines, syntax errors, unused code, and opportunities for refactoring.
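As an illustration of how such checks can be gathered for each generated solution, the minimal sketch below relies only on Pylint’s standard JSON output and its convention/refactor/warning/error/fatal message categories; the file name used is hypothetical:

import json
import subprocess
from collections import Counter

def pylint_summary(path):
    # Run Pylint on one generated solution and tally its message
    # categories (convention, refactor, warning, error, fatal).
    result = subprocess.run(
        ["pylint", "--output-format=json", path],
        capture_output=True, text=True,
    )
    messages = json.loads(result.stdout or "[]")
    return Counter(msg["type"] for msg in messages)

# Example with a hypothetical file name:
# print(pylint_summary("two_sum_gpt4.py"))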
Data Collection Process: We manually input
each LeetCode coding challenge into ChatGPT, al-