Evaluating the Performance of LLM-Generated Code for
ChatGPT-4 and AutoGen Along with Top-Rated Human Solutions
Ashraf Elnashar, Max Moundas, Douglas C. Schmidt, Jesse Spencer-Smith and Jules White
Department of Computer Science, Vanderbilt University, Nashville, Tennessee, U.S.A.
Keywords:
Large Language Models (LLMs), Automated Code Generation, ChatGPT-4 vs. AutoGen Performance,
Software Development Efficiency, Stack Overflow Solution Analysis, Computer Science Education, Prompt
Engineering in AI Code, Quality Assessment, Runtime Performance Benchmarking, Dynamic Testing
Environments.
Abstract:
In the domain of software development, making informed decisions about the utilization of large language
models (LLMs) requires a thorough examination of their advantages, disadvantages, and associated risks.
This paper provides several contributions to such analyses. It first conducts a comparative analysis, pitting the
best-performing code solutions selected from a pool of 100 generated by ChatGPT-4 against the highest-rated
human-produced code on Stack Overflow. Our findings reveal that, across a spectrum of problems we exam-
ined, choosing from ChatGPT-4’s top 100 solutions proves competitive with or superior to the best human
solutions on Stack Overflow.
We next delve into the AutoGen framework, which harnesses multiple LLM-based agents that collaborate to
tackle tasks. We employ prompt engineering to dynamically generate test cases for 50 common computer sci-
ence problems, both evaluating the solution robustness of AutoGen vs ChatGPT-4 and showcasing AutoGen’s
effectiveness in challenging tasks and ChatGPT-4’s proficiency in basic scenarios. Our findings demonstrate
the suitability of generative AI tools in computer science education and underscore the subtleties of their problem-solving capabilities and their potential impact on the evolution of educational technology and pedagogical practices.
1 INTRODUCTION
Emerging Trends, Challenges, and Research Foci.
Large language models (LLMs) (Bommasani et al.,
2021), such as ChatGPT (Bang et al., 2023) and Copi-
lot (git, ), have the ability to generate complex code
to meet a set of natural language requirements (Car-
leton et al., 2022). Software developers can use LLMs
to generate human descriptions of desired functional-
ity or requirements, as well as synthesize code in a
variety of languages ranging from Python to Java to
Clojure. These tools are currently being integrated
into popular Integrated Development Environments
(IDEs), such as IntelliJ (Krochmalski, 2014) and Vi-
sual Studio.
LLMs are now easily accessible through the In-
ternet and within IDEs, and developers are increas-
ingly leveraging them to guide many programming
tasks. In many cases, the questions and code samples
to which developers apply these LLMs are the same
questions and code samples they previously would
have sought help on via discussion forums. For ex-
ample, Stack Overflow (stackoverflow.com) is a pop-
ular online forum where developers ask questions and
obtain guidance on code samples.
There has been significant discussion and re-
search (git, ; Asare et al., 2022; Pearce et al., 2022)
on applying LLMs to generate code with respect to
the quality of the code from a security and defect per-
spective. First-generation LLM-based tools often produced poor-quality code due to their tendency to "hallucinate" convincing text or code that appeared correct but was fundamentally flawed. In addi-
tion, LLMs trained on human-produced code in open-
source projects often had vulnerabilities or eschewed
best practices. Much discussion on the code quality
generated by LLMs has therefore focused on func-
tional correctness and security.
Although using LLMs before fully comprehend-
ing their capabilities and limitations is risky, there are
also clear productivity benefits for developers in cer-
tain areas. For example, LLMs can help to automate
repetitive, tedious, or boring coding tasks and per-
form these tasks faster—and often better—than devel-
opers (De Vito et al., 2023). This productivity boost is
particularly apparent when coding tasks involve APIs
or algorithms that developers are unfamiliar with and
thus require study to master before performing the
tasks. When these APIs and algorithms are included
in an LLM’s training set it often generates code for
them swiftly and accurately.
In addition, a key benefit related to code performance is the ability to employ LLMs via prompting and prompt engineering to generate many different potential solutions and then automatically benchmark them to identify the fastest solution(s). A prompt is the natural
language input to an LLM (Liu et al., 2023). Prompt
engineering is an emerging discipline that structures
interactions with LLM-based computational systems
to solve complex problems via natural language inter-
faces (Giray, 2023).
This paper expands upon our prior work (Elnashar
et al., ) that compared the runtime performance of
code produced by humans vs. code generated by
ChatGPT-3.5. We first replicate our earlier experi-
ments replacing ChatGPT-3.5 with ChatGPT-4 (Es-
pejel et al., 2023), which is a more advanced ver-
sion of the GPT model. As shown below, ChatGPT-4
demonstrates a marked improvement in understand-
ing complex problem statements and generating more
efficient code due to its enhanced training data and
refined algorithms, which interpret prompts more ac-
curately and increase generated code efficiency.
We next conduct a comparative analysis of Auto-
Gen (Porsdam Mann et al., 2023) and ChatGPT-4, re-
vealing notable differences in their success rates and
error handling capabilities. In particular, our results
reveal that ChatGPT-4’s solutions present a 9.8% fail-
ure rate and a 90.2% pass rate, whereas AutoGen’s
solutions have a 15.6% failure rate and an 84.4% pass
rate. Moreover, we apply visual tools for clarity and
present insights into the potential educational appli-
cations of each approach.
Paper Organization. The remainder of this paper is
organized as follows: Section 2 summarizes the open
research questions we address and outlines our techni-
cal approach; Section 3 explains our testbed environ-
ment configuration and analyzes results from exper-
iments that compare the top Stack Overflow coding
solutions against solutions generated by ChatGPT-4;
Section 4 examines the effectiveness of the AutoGen
approach in generating programming solutions and
compares its performance with ChatGPT-4; Section 5
compares our work with related research; and Sec-
tion 6 presents the lessons learned from our study and
outlines future work.
2 SUMMARY OF OPEN
RESEARCH QUESTIONS AND
TECHNICAL APPROACH
This section summarizes the open research questions
we address in this paper and outlines our technical
approach for each question.
Q1: How does the most efficient LLM-generated code from GPT-3.5 Turbo and GPT-4 compare with the top human-produced code in terms of
runtime performance? Section 3 investigates the
runtime performance of code generated by both GPT-
3.5 Turbo and GPT-4, contrasting it with human-
produced code. Our analysis includes a comparison
of human-written Stack Overflow solutions to those
generated by ChatGPT-4 and GPT-3.5 Turbo using di-
verse prompting strategies. We focus on the efficiency
of the fastest solutions from both LLMs compared
to the best human answers, representing a real-world
scenario where developers might seek the most effi-
cient solution through iterative LLM querying. This
investigation provides a foundational understanding
of LLMs’ utility in practical coding applications.
Q2: What is the range and reliability of coding
solutions generated by GPT-3.5 Turbo and GPT-
4, compared to a diverse set of human-produced
code, in terms of runtime efficiency and practical
application? Section 3 expands the scope of our anal-
ysis beyond optimal solutions, examining the runtime
efficiency of the most common, as well as the best
and worst, LLM-generated code. This study offers
a comprehensive view of the coding efficiency that
GPT-3.5 Turbo and GPT-4 can achieve, benchmarked
against human solutions. By analyzing a range of
LLM-generated solutions, we provide insights into
the variability and reliability of LLMs as coding as-
sistants.
Q3: Against which human-produced solutions
should LLM outputs from GPT-3.5 Turbo and
GPT-4 be benchmarked, and what represents the
average developer’s capability? Section 3 tack-
les the challenge of setting appropriate benchmarks
for LLM-generated code by selecting a representative
sample of human solutions for comparison. This anal-
ysis helps determine where GPT-3.5 Turbo and GPT-
4 stand in relation to average developer skill levels.
The chosen benchmarks range from highly optimized
to average human solutions, offering a balanced per-
spective on LLMs’ capabilities.
Q4: How does AutoGen, with its systematic
and structured LLM prompting, compare with
the more flexible and generalized approach of
ChatGPT-4 in terms of efficiency, accuracy, and
adaptability in code generation? Section 4 expands
upon the experiments in Section 3 to assess whether
AutoGen’s structured prompting leads to more ef-
ficient and/or accurate code outputs compared to
ChatGPT-4. We apply both AutoGen and ChatGPT-4
to evaluate these LLMs’ capabilities in comprehend-
ing and producing Python code, by presenting them
with a sequence of increasingly complex problems.
Each generated solution underwent thorough testing
for both correctness and efficiency, thus highlighting
the LLMs’ flexibility and accuracy in code genera-
tion.
When addressing these questions, we consider
various factors, such as the stochastic nature of LLMs,
that may yield different outputs for the same prompt.
We also consider the variance in human-provided
coding solutions in terms of quality and efficiency.
The comparison between AutoGen and ChatGPT-4
further extends this investigation by analyzing the im-
pact of different technical approaches on the quality
of the generated code.
Our prior work (White et al., 2023) shows how
prompt wording influences the quality of LLM out-
put. We therefore focus on how prompt wording in-
fluences the quality of generated code. In particular,
we investigate if varying the wording causes LLMs to
generate faster code more consistently.
3 COMPARING STACK
OVERFLOW AND
ChatGPT-4-GENERATED
SOLUTIONS
This section analyzes the results from our comparison
of top human-provided Stack Overflow coding solu-
tions and the corresponding ChatGPT-4-generated so-
lutions.
3.1 Experiment Configuration
This section explains the configuration of our testbed
environment and analyzes the results from experi-
ments that compare the top Stack Overflow coding
solutions against solutions generated by ChatGPT-4.
3.1.1 Overview of Our Approach
Our analysis was conducted on code samples written
in Python since (1) it is relatively easy to extract and
experiment with stand-alone code samples in Python
compared to other languages, (2) ChatGPT-4 appears
to generate more correct code in Python vs. less pop-
ular languages (such as Clojure), and (3) Python is a
popular language in domains like Data Science where
developers often have more familiarity and comfort
with LLMs.
Our problem set was manually curated from Stack
Overflow by browsing questions related to Python.
We searched for questions pertaining to categories,
such as “array questions” since these questions are
readily tested for performance at increasing input
sizes. We then analyzed each question and its can-
didate solutions to select question/solution pairs that
could be isolated and inserted into our test harness.
We avoided questions that relied heavily on third-
party libraries to minimize complexity, such as ver-
sion discrepancies and dependency issues. These
complexities can obscure the assessment of the core
algorithmic efficiency of the code (a potential threat
to validity, as discussed in Section 3.3). Instead, we
focused on solutions built on core libraries and capa-
bilities within Python itself.
Wherever possible, we selected the top-voted so-
lution as the comparison. In some cases, multiple
languages were present in the solutions and we se-
lected the first Python solution, mimicking developers
looking for the first solution in their target language.
These decisions and related methodological consider-
ations are discussed further in Section 3.3.4.
For each selected question, we extracted the ques-
tion’s title posted on Stack Overflow and used it
as a prompt for ChatGPT-4, leveraging OpenAI’s
ChatGPT-4 API for this process. This API allowed us
to automate sending prompts and receiving code re-
sponses, thereby facilitating a consistent and efficient
analysis of the model’s code generation capabilities.
This decision meant that ChatGPT-4 was not provided
the full information in the question, which may hand-
icap it in providing better performing solutions. Our
rationale for only using question titles as prompts for
ChatGPT-4 both reflects common real-world scenar-
ios faced by developers and assesses its ability to gen-
erate solutions based on limited information.
The original Stack Overflow posts, human-
produced solutions, and ChatGPT-4-generated code
solutions—along with our entire set of ques-
tions and generated answers—can be accessed
in our Github repository at github.com/elnashara/
CodePerformanceComparison. We encourage read-
ers to replicate our results and submit issues and pull
requests for possible improvements.
We measured the runtime performance of each
code sample using Python’s timeit package. Code
samples were provided with small, medium, and large
inputs. These inputs were progressively increased in
size to show the effects of scaling on the generated
code. What constituted small, medium, and large was
problem-specific, as shown in Section 3.2 below. For
each input size, we generated 100 random inputs of
the given size to run tests on. In addition, for each
input, we tested the given code 100 times on the input
using the Python timeit package.
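To make this measurement procedure concrete, the following minimal sketch shows how a candidate solution could be timed with Python's timeit package over 100 random inputs per size, with 100 repetitions per input. The solution placeholder and the random-input generator are our own illustrative assumptions, not the actual test harness used in the study.

```python
import random
import timeit

def solution(arr):
    """Placeholder for one candidate (human- or LLM-generated) solution."""
    return sorted(arr)

def benchmark(func, input_size, num_inputs=100, runs_per_input=100):
    """Average runtime of func over num_inputs random arrays of length input_size."""
    per_input_averages = []
    for _ in range(num_inputs):
        arr = [random.randint(0, input_size) for _ in range(input_size)]
        # timeit returns the total time for runs_per_input executions.
        total = timeit.timeit(lambda: func(arr), number=runs_per_input)
        per_input_averages.append(total / runs_per_input)
    return sum(per_input_averages) / len(per_input_averages)

for size in (1_000, 10_000, 100_000):  # small, medium, and large inputs
    print(size, benchmark(solution, size))
```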
3.1.2 Overview of the Coding Problems
A total of 7 problems from Stack Overflow, all per-
taining to array operations, were selected for our anal-
ysis. These problems encompass a range of array-
related challenges, including PA1: identifying miss-
ing number(s) in an unsorted array, PA2: detecting a
duplicate number in an array that is not sorted, PA3:
finding the indices of the k smallest numbers in an
unsorted array, PA4: counting pairs of elements in an
array with a given sum, PA5: finding duplicates in an array, PA6: removing array duplicates, and PA7: im-
plementing the Quicksort algorithm.
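To convey the flavor of these problems, the following pair of solutions for PA1 (finding a single missing number in an unsorted array containing the values 1 to n with one omission) contrasts a quadratic scan with a linear-time arithmetic approach. It is our own illustrative sketch rather than a solution drawn from the benchmarked corpus, but it exemplifies the runtime differences the benchmarks are designed to expose.

```python
def find_missing_naive(arr, n):
    """O(n^2): test each candidate value with a linear membership check."""
    for candidate in range(1, n + 1):
        if candidate not in arr:
            return candidate
    return None

def find_missing_sum(arr, n):
    """O(n): compare the expected sum of 1..n with the actual sum."""
    return n * (n + 1) // 2 - sum(arr)

assert find_missing_naive([3, 1, 5, 4], 5) == 2
assert find_missing_sum([3, 1, 5, 4], 5) == 2
```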
3.1.3 Prompting Strategies
In this experiment we applied various prompting
strategies to generate Python code with ChatGPT-4,
including
1. Naive approach, which used only the title from
Stack Overflow as the prompt, e.g., ”How to count
the frequency of the elements in an unordered ar-
ray”,
2. Ask for speed approach, which added a require-
ment for speed at the end of the prompt, e.g.,
”How to count the frequency of the elements in
an unordered array, where the implementation
should be fast”,
3. Ask for speed at scale approach, which pro-
vided more detailed information about how the
code should be optimized for speed as the size of
the array grows, e.g., ”How to count the frequency
of the elements in an unordered array, where the
implementation should be fast as the size of the
array grows”,
4. Ask for the most optimal time complexity,
which prioritized achieving the most optimal time
complexity, e.g., ”How to count the frequency of
the elements in an unordered array, where im-
plementation should have the most optimal time
complexity possible”, and
5. Ask for the chain-of-thought (Zhang et al., 2022), which generated coherent text by providing a series of related prompts, e.g., "Please explain your chain of thought to create a solution to the problem: How to count the frequency of the elements in an unordered array. First, explain your chain of thought. Next, provide a step-by-step description of the algorithm with the best possible time complexity to solve the task. Finally, describe how to implement the algorithm step-by-step in the fastest possible way."
ChatGPT-4 was prompted 100 times with each prompt per coding problem, yielding up to 100 different coding solutions per prompt. (In practice, fewer than 100 unique coding solutions were sometimes produced since ChatGPT-4 often generated logically equivalent programs.) We tested the performance of all ChatGPT-4-generated code, however, and did not remove duplicate solutions. If two different prompts had identical solutions, we benchmarked each and left the results with the expectation that 100 timing runs on 100 different inputs would average out any negligible differences in performance.
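The following minimal sketch illustrates this collection step, assuming the openai Python client (v1-style chat completions API); the model name, prompt variants, and loop structure are illustrative rather than the exact scripts used in the study, and the chain-of-thought variant is omitted for brevity.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

TITLE = "How to count the frequency of the elements in an unordered array"
PROMPTS = {
    "naive": TITLE,
    "speed": TITLE + ", where the implementation should be fast",
    "speed_at_scale": TITLE + ", where the implementation should be fast as the size of the array grows",
    "optimal_complexity": TITLE + ", where implementation should have the most optimal time complexity possible",
}

def generate_solutions(prompt, n=100, model="gpt-4"):
    """Query the model n times with the same prompt and keep every reply."""
    replies = []
    for _ in range(n):
        response = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
        )
        replies.append(response.choices[0].message.content)
    return replies  # duplicates are intentionally kept, as in the study

all_solutions = {name: generate_solutions(p) for name, p in PROMPTS.items()}
```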
3.2 Analysis of Experiment Results
The results of our experiment that evaluated the per-
formance of code provided by Stack Overflow and
generated by ChatGPT-4 100 times for all seven cod-
ing problems with three different input sizes—small
(1,000), medium (10,000), and large (100,000)—are
shown in Figures 1, 2 and 3. Figure 4 shows the minimum average performance across all input sizes.
Figure 1: Number of Solutions within X% of the Best Runtime (Input Size 1,000).
Figure 2: Number of Solutions within X% of the Best Runtime (Input Size 10,000).
Figure 3: Number of Solutions within X% of the Best Runtime (Input Size 100,000).
These figures show the number of problems for each prompt where the best of the 100 solutions generated by each prompt was within 1%, 5%, etc. of the best solution found across all prompts and the human solution. For each problem, a total of up to 601 solutions were benchmarked (6 prompts * 100 solutions per prompt + 1 human solution).
The best performing solution was used as the "Best Runtime" solution in the figures against which other solutions were compared. Figures 1, 2, 3 and 4 collectively demonstrate how ChatGPT-4 selected the
best-performing solution out of 100 attempts when
employing chain-of-thought reasoning in response to
prompts. These solutions were competitive with—
and in many cases surpassed—the human-provided
solutions from Stack Overflow. This finding is signif-
icant as it underscores the potential of LLMs in gener-
ating efficient solutions when prompted with a struc-
tured approach that includes chain-of-thought reason-
ing.
Figure 4: Number of Solutions within X% of the Best Runtime (All Input Sizes).
The human solution was the fastest solution for
only one of the problems, specifically the ”P2 Find
Duplicate Number,” as depicted in Figure 5. We used
the title of the question as the input to ChatGPT-4, so all code samples were generated with respect to the title of the Stack Overflow post. Since we directly
translated the titles into prompts for ChatGPT-4, how-
ever, there may have been additional contextual in-
formation in the question that ChatGPT-4 could have
used to further improve its solution, as discussed in
Section 3.3.2.
Our results also demonstrate a significant im-
provement in performance when using ChatGPT-4
compared to its predecessor, GPT-3.5 Turbo. This ad-
vancement in LLMs underscores the progressive en-
hancements in AI-driven coding solutions.
Figure 5: Comparison of Average Execution Time for Different Prompts in P2 Find Duplicate Numbers.
Despite this progress, the human-crafted solution still outperformed both GPT-4 and GPT-3.5 Turbo for problem
P2. This finding suggests that while LLMs are be-
coming increasingly competent in generating code,
there remains an edge that human experience and in-
tuition can provide, particularly in certain complex or
nuanced tasks.
Conversely, when evaluating the ”P1 Find Miss-
ing Number” problem, a distinct change in the hier-
archy of solution efficiency was evident. As shown
in Figure 6, the human solution was surprisingly the
least efficient in terms of execution time, which high-
lights scenarios where LLMs may exceed human per-
formance. Interestingly, when structured prompt en-
gineering is applied—especially the chain-of-thought
method—GPT-3.5 Turbo’s capability to devise effec-
tive code solutions improves significantly.
In general, however, the most pronounced en-
hancement is seen with GPT-4, which out-performs
both human solutions and GPT-3.5 Turbo when
equipped with the same structured prompting tech-
niques. This finding signifies the remarkable ad-
vancements in LLMs, especially in the realm of in-
tricate problem-solving. The findings presented in
Figure 6 confirm the superior performance of GPT-
4 in optimizing code execution time and setting a new
threshold in AI-assisted coding (which will likely be
surpassed with subsequent releases of ChatGPT).
Figure 6: Comparison of Average Execution Time for Different Prompts in P1 Find Missing Number.
The contrasting results—with humans prevailing
in one case, yet falling behind in another—provide
insight into the multifaceted nature of coding solu-
tions within the current LLM landscape. Our research
suggests that while LLMs like ChatGPT-4 can out-
strip human coders in certain instances, the creativity
and specialized skill of human programmers continue
to be invaluable assets in complex scenarios. This dy-
namic highlights the promising potential of a syner-
gistic approach, wherein human expertise is enhanced
by the efficiency and evolving capabilities of LLMs,
to elevate the process of developing coding solutions.
3.3 Threats to Validity
Threats to the validity of our experiment results are
discussed below.
3.3.1 Sample Size
Although the results presented in Section 3.2 are
promising, they are based on a relatively small sam-
ple size since our study considered a total of seven
computer science (CS) problems, each subjected to
100 testing iterations. While this number of prob-
lems and iterations was sufficient to demonstrate ini-
tial trends, it does not capture the performance char-
acteristics and potential edge cases encountered in
larger datasets. More work on a larger sample size
is therefore needed to increase the robustness of our
findings.
In general, the software engineering and LLM
communities will benefit from a large-scale set of
benchmarks that associate (1) code needs (expressed
as natural language requirements), questions, specifi-
cations, and rules with (2) highly optimized human
code, as well as associated benchmarks and inter-
faces. These communities can then apply the bench-
marks to measure and validate LLM coding perfor-
mance over time to ensure research is headed in the
right direction regarding the development and use of
generative AI tools.
3.3.2 Prompt Construction
The construction of prompts posed an additional
threat to validity because it relied solely on the ti-
tles of Stack Overflow questions. In particular, in-
corporating no additional details from question bod-
ies prevented ChatGPT-4 from utilizing further code
to inform its responses. We did not want ChatGPT-
4 completing/improving fundamentally flawed code.
However, this prompt design choice risked depriving
ChatGPT-4 of information it could have used to gen-
erate better solutions.
3.3.3 Problem Scope
Another risk area was the variety of coding problems
we analyzed. The problems were relatively narrow
in scope and data structure type. A wider range of
problem types is thus needed to ensure hidden risks
regarding specific problem structures do not occur.
There may be classes of problems that trigger poorly performing hallucinations or code structures we are
not aware of yet. This risk is particularly problematic
when attempting to generalize our results.
3.3.4 Selection Bias
Another threat to validity was the inherent question
and code sample selection bias in our study. These
questions and answers were selected manually to fo-
cus on problems and code samples that could be tested
and benchmarked readily. We may therefore have in-
appropriately influenced the problem types selected
and not chosen samples representative of what devel-
opers would ask in certain domains.
4 ChatGPT-4 vs. AutoGen: A
COMPARATIVE STUDY IN
PROGRAMMING
AUTOMATION
Computer science and its application domains evolve
continuously, requiring more efficient and reliable
automated systems capable of solving complex
problems. This section systematically compares
ChatGPT-4 and AutoGen, which are two generative
AI-based systems that enable automated problem-
solving. Our comparison evaluates the capability of
ChatGPT-4 and AutoGen to (1) generate accurate so-
lutions for a set of predefined computer science prob-
lems and (2) successfully pass rigorous tests designed
to validate the correctness of these solutions.
ChatGPT-4 was developed as part of OpenAI’s
GPT series and is adept at a wide range of natural
language tasks, catering to diverse users from various
domains. Its flexibility and interactivity make it suit-
able for general inquiries, creative writing, and edu-
cational support. In contrast, AutoGen excels in au-
tomated code generation through structured and sys-
tematic prompting methods that harness predefined
patterns and algorithms to craft solutions optimized
for accuracy, performance, and readability.
4.1 Problem Statement
AutoGen and ChatGPT-4 both support automated
problem-solving and algorithm generation. Little re-
search has been conducted, however, to determine
their efficiency and accuracy in producing viable so-
lutions under varying conditions and constraints, es-
pecially when the tests themselves are dynamically
generated as part of the problem-solving process. Ad-
dressing this knowledge gap raises a critical question
(question Q4 in Section 2): How reliable are AutoGen
and ChatGPT-4 when faced with dynamically chang-
ing success criteria, particularly when these criteria
are crafted through prompt engineering to match the
problem's specific nature?
The study presented in this section aims to fill the
current gap regarding the adaptability and precision
of AutoGen and ChatGPT-4 in such fluid testing en-
vironments. The absence of predefined tests means
the evaluation of these systems must account for their
ability to interpret problem statements, generate cor-
responding tests, and produce solutions that satisfy
these tests. What is needed, therefore, is a method
that assesses the quality of the generated solutions, as
well as the appropriateness and thoroughness of the
spontaneously created tests.
By addressing these challenges, we provide a nu-
anced understanding of the capabilities of ChatGPT-4
and AutoGen. We also explore the extent to which
these systems can autonomously generate both prob-
lems and their corresponding tests, which is becoming
common in continuous integration pipelines and auto-
mated software development processes (Arachchi and
Perera, 2018). The results of our comparative analysis
evaluate the potential of these LLM-driven systems to
contribute to and enhance the field of automated soft-
ware testing and development.
4.2 Dataset Overview and Analysis
The dataset under consideration comprises a collec-
tion of 50 computer science problems, each character-
ized by a unique sequence number, a difficulty level
(Category), a ProblemType, and a detailed problem
statement. These problems are classified into various
categories, reflecting different areas of computer sci-
ence, such as algorithm design, data structures, and
computational theory. The problems are categorized
by difficulty levels, ranging from easy to more chal-
lenging problems.
This dataset includes a broad spectrum of test
cases for each problem, ensuring a comprehensive
evaluation of skills from basic functionality to intri-
cate scenarios. For example, test cases for ’Calculat-
ing the average of an array of numbers’ vary in ar-
ray sizes and types, while ’Graph traversal’ problems
test diverse graph structures. This method, akin to our
previous study on arrays in Section 3, showcases the
range of topics in the dataset, from fundamental algo-
rithms like ”Binary Search” to advanced techniques
like "Depth-First Search."
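For concreteness, a hypothetical record from such a dataset, together with a few test cases of the kind generated for the 'Calculating the average of an array of numbers' problem, might look like the following; the field names and values are illustrative assumptions rather than the exact schema used in the study.

```python
# One illustrative record (field names are assumptions, not the authors' schema).
problem = {
    "SequenceNumber": 7,
    "Category": "easy",            # difficulty level: easy, medium, or hard
    "ProblemType": "Array operations",
    "ProblemStatement": "Calculate the average of an array of numbers.",
}

# Dynamically generated test cases vary array sizes, element types, and edge cases.
test_cases = [
    {"input": [1, 2, 3, 4], "expected": 2.5},
    {"input": [10], "expected": 10.0},
    {"input": [-1.5, 1.5], "expected": 0.0},
    {"input": [], "expected": None},   # edge case: empty array
]
```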
Figure 7: Problem Types Distribution.
The analysis of the distribution of computer science problems by type uncovers the wide range of topics encompassed within the dataset. The pie chart shown in Figure 7 depicts the percentage of problems in each type, providing a visual representation of
which areas are emphasized more heavily. This dis-
tribution is crucial for understanding the breadth and
focus areas of the dataset.
Figure 8: Distribution of Problems by Difficulty Level.
The pie chart shown in Figure 8 presents the dis-
tribution of problems across different difficulty levels
(i.e., easy, medium, and hard) within the dataset. This
chart visualizes the proportion of problems in each
category, thereby elucidating the distribution pattern.
It accentuates the prevalence of specific categories
and offers insights into the relative emphasis placed
on each difficulty level in our dataset.
4.3 Methodology and Experiment
Design
Our experiment design covers the evolving landscape
of automated problem-solving and algorithm gen-
eration, focusing on the capabilities of ChatGPT-4
and AutoGen. Central to our study is the uniform
prompting strategy employed, which is pivotal in har-
nessing the capabilities of ChatGPT-4 and AutoGen.
This strategy applies a consistently structured prompt
crafted to convey problem requirements and context
uniformly to both AI models. This prompt facili-
tates a direct comparison of ChatGPT-4 and AutoGen
in terms of problem-solving efficiency, accuracy, and
adaptability.
By employing this single, standardized prompt
across all tests, our study compares and contrasts
the performance of these two systems in a controlled
and comparable manner. Given the dynamic nature
of our problem-solving environment—where tests are
not static but generated in response to each unique
problem—our study evaluates the efficiency and ac-
curacy of these systems under these varying condi-
tions.
4.3.1 Problem-Solving and Test Generation
Approach
Our approach is anchored in prompt engineer-
ing (Chen et al., 2023), which guides LLMs to in-
terpret problem statements and generate correspond-
ing solutions and tests. We give both ChatGPT-4 and
AutoGen the same structured prompt shown in Fig-
ure 9, which provides the foundation for both systems
to understand and approach the problem. This prompt
was crafted to outline the problem statement, solu-
tion development requirements, script necessities, test
case execution and preparation, and execution pro-
cess. Our approach enables a fair comparison be-
tween ChatGPT-4 and AutoGen, ensuring the focus
remains on the ability of these systems to generate so-
lutions, as well as create relevant and comprehensive
test cases.
Figure 9: Structured Prompt Example for LLM-Based Solution Generation in CS Problems.
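The exact harness used to drive AutoGen is not shown here; assuming the multi-agent AutoGen framework distributed as the pyautogen Python package, a minimal sketch of dispatching a structured prompt like the one in Figure 9 to an assistant/executor agent pair might look as follows. The prompt text and configuration values are illustrative assumptions, not the study's actual setup.

```python
import os
import autogen

llm_config = {
    "config_list": [{"model": "gpt-4", "api_key": os.environ["OPENAI_API_KEY"]}],
}

# The assistant proposes code and test cases; the user proxy executes them locally.
assistant = autogen.AssistantAgent("assistant", llm_config=llm_config)
user_proxy = autogen.UserProxyAgent(
    "user_proxy",
    human_input_mode="NEVER",  # fully automated, no human in the loop
    code_execution_config={"work_dir": "workspace", "use_docker": False},
)

structured_prompt = (
    "Problem statement: Calculate the average of an array of numbers.\n"
    "Develop a Python solution, generate test cases that cover edge cases, "
    "execute the tests, and report which test cases pass or fail."
)

user_proxy.initiate_chat(assistant, message=structured_prompt)
```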
4.3.2 Evaluating ChatGPT-4 and AutoGen
The evaluation of ChatGPT-4 and AutoGen involved
multiple layers. First, we assessed these systems’
ability to interpret problem statements accurately and
generate viable solutions. Second, we examined
the appropriateness and thoroughness of the sponta-
neously created test cases. These test cases were vi-
tal to our evaluation process since they represented
the dynamic criteria against which the generated so-
lutions were measured.
Our assessment compared the solutions and tests
generated by each system under identical problem
conditions. This comparative analysis evaluated the
adaptability, precision, and reliability of ChatGPT-4
and AutoGen in a fluid testing environment where
both the problems and their corresponding tests were
generated autonomously.
This study provided a nuanced understanding of
the capabilities of ChatGPT-4 and AutoGen in auto-
mated problem-solving and test generation. Our work
is particularly pertinent in contexts like continuous
integration pipelines and automated software devel-
opment processes, where the ability to autonomously
generate and test solutions is vital. The findings of our
study provide insight into the potential role of LLM-
based systems in enhancing automated software test-
ing and development.
4.4 Analysis of ChatGPT-4 Experiment
Results
The experiment conducted using ChatGPT-4’s solu-
tion generation capabilities provided a comprehen-
sive view of its performance across a range of com-
puter science problems. To ensure a fair and accu-
rate comparison, the same set of 50 distinct problems,
along with identical prompts, was utilized for both ChatGPT-4 and AutoGen in the tests. Figure 10 provides valuable insights into the effectiveness of the generated solutions.
Figure 10: ChatGPT-4 - Pass Rate of Solutions.
4.4.1 Overall Success Rate
ChatGPT-4’s overall success rate was approximately
90.2%, as shown in Figure 11. This success rate in-
dicates ChatGPT-4’s capability in accurately solving
a broad spectrum of computational tasks. The high
percentage of correctly solved problems demonstrates
the effectiveness of its generated solutions in various contexts.
Figure 11: ChatGPT-4 - Pass Rate of Solutions.
4.4.2 Error Analysis
Distinct patterns emerged when examining ChatGPT-
4’s failed cases, highlighting areas where it faced
challenges. The most frequent error encountered was
related to "Invalid input. Please provide valid numeric values," followed by issues like "max() arg is an empty sequence" and "division by zero." These errors indicate that while ChatGPT-4 was proficient in many
areas, there were specific scenarios where improve-
ments were needed, particularly involving input vali-
dation and handling exceptional cases.
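As a hedged illustration (our own sketch, not a solution produced by either system), the kind of input validation that would avoid the failures above for the average-of-an-array problem might look like this:

```python
def average(values):
    """Return the mean of a list of numbers after validating the input."""
    if not values:
        raise ValueError("Invalid input: the array must not be empty.")
    if not all(isinstance(v, (int, float)) for v in values):
        raise ValueError("Invalid input. Please provide valid numeric values.")
    return sum(values) / len(values)  # len(values) > 0, so no division by zero
```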
4.4.3 Problem Difficulty vs. Success Rate
An interesting aspect of ChatGPT-4’s behavior is the
correlation between problem difficulty and success
rate. Surprisingly, ’medium’ difficulty problems had
a higher success rate (around 93.48%) compared to
’easy’ (87.50%) and ’hard’ (91.11%) difficulties, as
shown in Figure 12.
Figure 12: ChatGPT-4 - Problem Difficulty vs. Success Rate.
This finding suggests a potential discrepancy in the perceived versus actual complexity of the problems or a higher adaptability of the system in solving medium-complexity tasks.
4.4.4 Problem Type Analysis
ChatGPT-4’s success rate also varied significantly
across different problem types. Types such as ”Binary
Search” and ”Sorting algorithms” demonstrated a no-
tably high success rate (over 90%), whereas ”Graph
traversal” and ”Calculating the average of an array of
numbers” exhibited lower success rates. This vari-
ation highlighted ChatGPT-4’s strengths and weak-
nesses in different computational domains and offered
insights for targeted improvements in specific areas of
problem-solving.
4.4.5 Insights and Future Directions
Overall, our analysis of ChatGPT-4’s experiment re-
sults reveals that it was highly effective in solving
a wide range of computer science problems. How-
ever, the insights gained from the error analysis and
the variation in success rates across problem types
and difficulties suggest areas for further enhancement.
Improving input validation, error handling, and adapt-
ing strategies for specific problem types could yield
even higher success rates and more robust problem-
solving for ChatGPT-4. These findings help inform
future developments to refine the solution generation
capabilities of ChatGPT-4.
4.5 Analysis of AutoGen Experiment
Results
The experiment conducted on the auto-generation
system for computer science problems provided a
wealth of data, allowing an in-depth analysis of its
performance. The dataset comprises results from tests
conducted on 50 different computer science problems
shown in Figure 13, where each test was evaluated
across multiple parameters.
Figure 13: Number of Pass and Fail Test Cases for Each
Problem Using AutoGen.
4.5.1 Overall Success Rate
AutoGen achieved an overall success rate of approximately 84.35%, as shown in Figure 14.
Figure 14: AutoGen - Pass Rate of Solutions.
This high percentage indicates that the auto-generated solutions correctly solve the majority of the problems. It reflects AutoGen's proficiency in handling a range of computational tasks and its effectiveness in producing accurate solutions.
4.5.2 Problem Difficulty vs. Success Rate
Understanding the relationship between problem dif-
ficulty and success rate is crucial to assess the ef-
fectiveness of solution generation methods. The bar
chart shown in Figure 15 visualizes the success rates of solutions across the different problem difficulties in our dataset, with 'easy', 'medium', and 'hard' problems represented by individual bars.
Figure 15: AutoGen - Problem Difficulty vs. Success Rate.
The height of each bar signifies the
percentage of successful solutions within that specific
difficulty category. This visualization enables an intu-
itive comparison of success rates across different lev-
els of problem complexity.
AutoGen’s approach, characterized by structured
LLM prompting, is highly effective for complex prob-
lems, which may account for the lower success rates
in ’easy’ problems. Its design seems tailored for in-
tricate scenarios requiring deep analysis, leading to
better performance in ’medium’ and ’hard’ problems.
This insight helps explain AutoGen’s proficiency with
complex issues and its less effective handling of sim-
pler tasks.
Contrary to expectations, Figure 15 reveals that
’easy’ problems have the lowest success rate, suggest-
ing a mismatch between perceived simplicity and ac-
tual solution effectiveness. Conversely, as we move
towards ’medium’ and ’hard’ problems, there is a no-
ticeable increase in success rates, which implies that
more complex problems might be better suited to the
solution generation and testing processes, leading to
higher success rates. The quantification of success in
percentages adds precision to our analysis, enabling
a more accurate evaluation of solution performance
across different problem types.
4.5.3 Failed Cases Analysis
Two distinct patterns were identified in our analysis
of failed cases, shedding light on specific challenges
faced by AutoGen. One issue occurred in problems
dealing with the calculation of the average of an
array of numbers, where it struggled with handling
'NoneType' values. In particular, AutoGen yielded errors indicating that a floating-point conversion expected a string or a real number but encountered 'NoneType' instead.
Another area of difficulty was observed in sort-
ing algorithms. In this area AutoGen faced challenges
due to string data type limitations, as shown by errors
indicating that a string does not support item assign-
ment. This finding indicated potential issues in Auto-
Gen’s approach to implementing or understanding the
intricacies of sorting strings.
These insights suggest that while AutoGen was
largely successful, it can be improved in certain ar-
eas. In particular, its handling of edge cases and spe-
cific data types requires attention. These patterns can
guide future enhancements to AutoGen for better ac-
curacy and robustness in solution generation.
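For illustration only, the two failure patterns above could be avoided with small defensive changes such as the following sketch (our own, not AutoGen output): filtering out None values before averaging, and converting an immutable string to a list before sorting it in place.

```python
def safe_average(values):
    """Average only the numeric entries, ignoring None values."""
    numbers = [v for v in values if v is not None]
    if not numbers:
        return None
    return sum(numbers) / len(numbers)

def sort_string(text):
    """Strings do not support item assignment, so sort a list copy instead."""
    chars = list(text)
    chars.sort()  # any in-place sorting routine now works on the mutable list
    return "".join(chars)

assert safe_average([1, None, 3]) == 2.0
assert sort_string("dcba") == "abcd"
```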
4.6 Comparative Analysis of
ChatGPT-4 and AutoGen
Experiment Results
Conducting a detailed comparative analysis between
the ChatGPT-4 and AutoGen experiment results re-
vealed several key distinctions and similarities. It also
offered insightful perspectives on the performance
and application of each system. Our analysis be-
gins by examining the overall success rates of both
systems, as shown in Figure 16.
Figure 16: Success Rate Comparison.
This figure shows that ChatGPT-4 achieved a higher success rate (approximately 90.2%), indicating its effectiveness in generating correct solutions for the given programming problems. In con-
trast, AutoGen demonstrated a somewhat lower suc-
cess rate (84.35%), though this still represents a substantial majority. This finding suggests that while AutoGen is
largely reliable, it may encounter more challenges or
inconsistencies in generating correct solutions.
With respect to error analysis, Figure 17 shows
the differences become more pronounced.
Figure 17: Error Rate Comparison.
Of the AutoGen tests that did not pass, only two instances
recorded specific exceptions. Most of the errors
(39 out of 41) lacked detailed exception information,
which implied a range of underlying issues, from
logic errors to unhandled exceptions in the code.
Conversely, ChatGPT-4 had a lower overall er-
ror rate, with 16 instances of failed tests. Notably,
ChatGPT-4 documented specific exceptions in one
out of these 16 errors. This result offers better insight
into the nature of the issues, which included input val-
idation errors and undefined variables.
We also analyzed the complexity of problems
and the handling of solutions by both systems. Al-
though tasked with similar problems (primarily ba-
sic arithmetic operations), ChatGPT-4’s solutions ex-
hibited capabilities for handling more complex sce-
narios, such as error handling and input validation.
Conversely, AutoGen showed a higher error rate and
its solutions lacked this complexity in error handling
within the sample data.
Summarizing our comparative analysis, both Au-
toGen and ChatGPT-4 exhibit distinct strengths and
limitations in programming solution generation. Au-
toGen’s slightly lower success rate suggests it is most
effective for educational use and basic programming
tests. Despite ChatGPT-4’s higher error rate in com-
plex scenarios, it shows advanced capabilities like ro-
bust error handling and input validation, positioning
it as a valuable tool for more advanced learning and
comprehensive testing environments. These differ-
ences highlight the potential applications and suitabil-
ity of each system in varying contexts of program-
ming education and automated solution testing.
5 RELATED WORK
The evolution of LLMs in code generation has been
pivotal, particularly in the discipline of prompt engi-
neering, which focuses on crafting effective natural
language inputs for LLMs, enabling them to solve
complex problems across diverse domains (Chen
et al., 2023). Studies in this area have emphasized
the importance of prompt structure and leveraged ex-
ternal tools and methods to enhance the capabilities of
LLMs in coding tasks (Yao et al., 2022). For instance,
Yao et al. (2022) integrated LLMs with external cod-
ing frameworks to augment their utility, while van Dis et al. (2023) focused on maximizing the inherent capa-
bilities of LLMs in generating more complex and effi-
cient code structures (van Dis et al., 2023). These ad-
vancements in prompt engineering show particularly
promising results in domains like mathematics, where
straightforward prompting often falls short, necessi-
tating more sophisticated approaches for better out-
comes (Frieder et al., 2023).
Our study delves deeper into the impact of prompt
design on the performance of LLM-generated code.
Existing research primarily employs direct queries
from sources like Stack Overflow, providing a base-
line for our investigation. However, the potential
for refined prompting techniques to yield more effi-
cient and accurate code solutions suggests an expan-
sive field ripe for future exploration. This area of re-
search is critical, especially considering the increas-
ing reliance on AI-driven solutions in software devel-
opment.
Moreover, the reliability and security of code gen-
erated by LLMs have become focal points in recent
studies. Researchers like Borji (2023) and Frieder et al. (2023) have identified and addressed various bugs and security vulnerabilities inherent in LLM-generated code. The compari-
son of the security profile of human-written code ver-
sus LLM-generated code, as explored by Asare et al.
(2022), is also garnering significant attention (Jalil et al., 2023). This line of research is crucial in
understanding the trade-offs between human and AI-
generated code, especially concerning security and re-
liability aspects.
Another emerging area of interest is the impact of
LLMs on software development workflows and devel-
oper productivity. Studies have begun to assess how
LLMs influence the software development lifecycle,
from initial design to deployment, and their role in ac-
celerating development processes while maintaining,
or even improving, code quality. This aspect is partic-
ularly relevant as the industry gravitates towards more
AI-integrated development environments.
Overall, the body of research underscores the mul-
tifaceted impact of LLMs in programming. It high-
lights the challenges in ensuring the reliability and
security of LLM-generated code while also explor-
ing the opportunities in enhancing the efficiency and
effectiveness of software development practices. As
LLMs continue to evolve, so too does the landscape of
research, continually pushing the boundaries of what
can be achieved through AI-driven code generation
and opening new frontiers in the intersection of AI
and software engineering.
6 CONCLUDING REMARKS
This paper presented a comprehensive analysis of
programming automation, comparing AutoGen and
ChatGPT-4, and evaluating top Stack Overflow so-
lutions against those generated by ChatGPT-4. We
observed that ChatGPT-4 can produce solutions com-
petitive with human-crafted ones, especially when
guided by chain-of-thought reasoning. This approach
enhances its problem-solving and code generation ca-
pabilities.
In contrast, despite AutoGen’s slightly lower suc-
cess rate, it excels in handling complex programming
challenges with robust error handling and input vali-
dation, making it suitable for advanced education and
testing. ChatGPT-4, however, demonstrated versatil-
ity in generating optimized solutions for various prob-
lems when effectively prompted.
Key lessons learned from this research include:
• Our experiments demonstrated that prompting and automatically benchmarking generated code effectively leverages LLMs for optimized code. As shown in Section 3, prompting multiple times and selecting the best solution is a promising aid for software engineers to optimize performance-critical code sections.
• A key attribute of ChatGPT-4-based code generation is its ability to search many coding solutions. Developers will likely use LLM-based tools like Code Inspector and Auto-GPT to generate and analyze multiple solutions per query, as discussed in Section 4. Future tools should enable defining metrics and automatically prompting until a quality threshold is met, a prompt limit is reached, and/or time runs out.
• ChatGPT-4 demonstrated a robust 90.2% success rate, and was particularly effective for simpler arithmetic tasks, making it valuable for education and automated testing. As noted in Section 4.4.2, however, its error diagnosis and reporting need further refinement.
• AutoGen's 84.35% success rate demonstrated advanced solutions featuring error handling and input validation, as described in Section 4.5. This finding indicates AutoGen's suitability for advanced education and comprehensive testing environments where robust error handling is essential.
In summary, our analysis of program automa-
tion using AutoGen and ChatGPT-4 reveals distinct
strengths in each system: AutoGen for basic ed-
ucational use and straightforward problem-solving
and ChatGPT-4 for advanced programming solutions
and robust error handling, particularly when utilizing
chain-of-thought reasoning.
Moreover, our analysis demonstrated that
ChatGPT-4 can generate solutions competitive
with—or superior to—top Stack Overflow answers
when given effective prompts. This finding highlights
the potential of LLMs in complex coding tasks but
also points to the limitations of using minimal context
from Stack Overflow titles. Optimized prompting
strategies are essential to fully leverage LLM ca-
pabilities in code generation. The choice between
these two systems should therefore be guided by the
specific needs of the application, i.e., whether the
priority lies in maximizing successful outcomes or
in handling complex programming challenges with
sophisticated error processing.
An intriguing direction for future work is explor-
ing the potential of leveraging LLM-based tools for
full stack software development. Rather than focus-
ing solely on individual modules or components, we
plan to investigate how LLMs perform at generat-
ing complete end-to-end systems encompassing front-
end, back-end, database, and infrastructure elements.
Examining the effectiveness of LLMs across the en-
tire software lifecycle may reveal new capabilities and
limitations. Key areas of analysis include correctness,
security, scalability, maintainability, and modularity
of auto-generated systems. In addition, studying inte-
gration with human developers in a blended workflow
rather than as a wholesale replacement will provide
important insights.
Our future work will also consider if/how other
code quality metrics can be integrated to allow con-
sidering multiple dimensions of code quality beyond
performance. In particular, security and functional
correctness are clearly important points of consid-
eration, but must be supplemented with additional
analyses. Likewise, other quality attributes, such as
memory consumption, long-term maintainability, and
modularity, should also be analyzed. As LLMs con-
tinue to mature, understanding their role in higher-
level software creation and complementing human
programmers offer promising new frontiers.
ACKNOWLEDGEMENTS
We used ChatGPT-4's Advanced Data Analysis capability to generate code for the data visualizations and filter the data sets.
REFERENCES
GitHub Copilot. https://github.com/features/copilot.
Arachchi, S. and Perera, I. (2018). Continuous integration
and continuous delivery pipeline automation for ag-
ile software project management. In 2018 Moratuwa
Engineering Research Conference (MERCon), pages
156–161.
Asare, O., Nagappan, M., and Asokan, N. (2022).
Is github’s copilot as bad as humans at intro-
ducing vulnerabilities in code? arXiv preprint
arXiv:2204.04741.
Bang, Y., Cahyawijaya, S., Lee, N., Dai, W., Su, D., Wilie,
B., Lovenia, H., Ji, Z., Yu, T., Chung, W., et al. (2023).
A multitask, multilingual, multimodal evaluation of
chatgpt on reasoning, hallucination, and interactivity.
arXiv preprint arXiv:2302.04023.
Bommasani, R., Hudson, D. A., Adeli, E., Altman, R.,
Arora, S., von Arx, S., Bernstein, M. S., Bohg, J.,
Bosselut, A., Brunskill, E., et al. (2021). On the
opportunities and risks of foundation models. arXiv
preprint arXiv:2108.07258.
Borji, A. (2023). A categorical archive of chatgpt failures.
arXiv preprint arXiv:2302.03494.
Carleton, A., Klein, M. H., Robert, J. E., Harper, E., Cun-
ningham, R. K., de Niz, D., Foreman, J. T., Goode-
nough, J. B., Herbsleb, J. D., Ozkaya, I., and Schmidt,
D. C. (2022). Architecting the future of software en-
gineering. Computer, 55(9):89–93.
Chen, B., Zhang, Z., Langrené, N., and Zhu, S. (2023). Unleashing the potential of prompt engineering in large language models: a comprehensive review.
De Vito, G., Lambiase, S., Palomba, F., Ferrucci, F., et al.
(2023). Meet c4se: Your new collaborator for soft-
ware engineering tasks. In 2023 49th Euromicro Con-
ference on Software Engineering and Advanced Ap-
plications (SEAA), pages 235–238.
Elnashar, A., Moundas, M., Schmidt, D. C., Spencer-Smith,
J., and White, J. Prompt engineering of chatgpt to
improve generated code & runtime performance com-
pared with the top-voted human solutions.
Espejel, J. L., Ettifouri, E. H., Alassan, M. S. Y., Chouham,
E. M., and Dahhane, W. (2023). Gpt-3.5, gpt-4, or
bard? evaluating llms reasoning ability in zero-shot
setting and performance boosting through prompts.
Natural Language Processing Journal, 5:100032.
Frieder, S., Pinchetti, L., Griffiths, R.-R., Salvatori, T.,
Lukasiewicz, T., Petersen, P. C., Chevalier, A., and
Berner, J. (2023). Mathematical capabilities of chat-
gpt. arXiv preprint arXiv:2301.13867.
Giray, L. (2023). Prompt engineering with chatgpt: A guide
for academic writers. Annals of Biomedical Engineering, 51(12):2629–2633.
Jalil, S., Rafi, S., LaToza, T. D., Moran, K., and Lam,
W. (2023). Chatgpt and software testing education:
Promises & perils. arXiv preprint arXiv:2302.03287.
Krochmalski, J. (2014). IntelliJ IDEA Essentials. Packt
Publishing Ltd.
Liu, P., Yuan, W., Fu, J., Jiang, Z., Hayashi, H., and Neubig,
G. (2023). Pre-train, prompt, and predict: A system-
atic survey of prompting methods in natural language
processing. ACM Computing Surveys, 55(9):1–35.
Pearce, H., Ahmad, B., Tan, B., Dolan-Gavitt, B., and Karri,
R. (2022). Asleep at the keyboard? assessing the se-
curity of github copilot’s code contributions. In 2022
IEEE Symposium on Security and Privacy (SP), pages
754–768. IEEE.
Porsdam Mann, S., Earp, B. D., Møller, N., Vynn, S., and
Savulescu, J. (2023). Autogen: A personalized large
language model for academic enhancement—ethics
and proof of principle. The American Journal of
Bioethics, 23(10):28–41.
van Dis, E. A., Bollen, J., Zuidema, W., van Rooij, R., and
Bockting, C. L. (2023). Chatgpt: five priorities for
research. Nature, 614(7947):224–226.
White, J., Fu, Q., Hays, S., Sandborn, M., Olea, C., Gilbert,
H., Elnashar, A., Spencer-Smith, J., and Schmidt,
D. C. (2023). A prompt pattern catalog to enhance
prompt engineering with chatgpt. arXiv preprint
arXiv:2302.11382.
Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan,
K., and Cao, Y. (2022). React: Synergizing reason-
ing and acting in language models. arXiv preprint
arXiv:2210.03629.
Zhang, Z., Zhang, A., Li, M., and Smola, A. (2022). Au-
tomatic chain of thought prompting in large language
models. arXiv preprint arXiv:2210.03493.