Evaluating the Performance of LLM-Generated Code for ChatGPT-4 and AutoGen Along with Top-Rated Human Solutions

Ashraf Elnashar, Max Moundas, Douglas Schmidt, Jesse Spencer-Smith, Jules White

2024

Abstract

In the domain of software development, making informed decisions about the utilization of large language models (LLMs) requires a thorough examination of their advantages, disadvantages, and associated risks. This paper provides several contributions to such analyses. It first conducts a comparative analysis, pitting the best-performing code solutions selected from a pool of 100 generated by ChatGPT-4 against the highest-rated human-produced code on Stack Overflow. Our findings reveal that, across the spectrum of problems we examined, choosing from ChatGPT-4's top 100 solutions proves competitive with or superior to the best human solutions on Stack Overflow. We next delve into the AutoGen framework, which harnesses multiple LLM-based agents that collaborate to tackle tasks. We employ prompt engineering to dynamically generate test cases for 50 common computer science problems, both evaluating the solution robustness of AutoGen versus ChatGPT-4 and showcasing AutoGen's effectiveness on challenging tasks and ChatGPT-4's proficiency in basic scenarios. Our findings demonstrate the suitability of generative AI for computer science education and underscore the subtleties of these models' problem-solving capabilities, as well as their potential impact on the evolution of educational technology and pedagogical practices.
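The best-of-N selection idea summarized above can be sketched in a purely illustrative way. The snippet below is not the authors' pipeline: the prompt builder and the candidate functions are hypothetical stand-ins (no LLM is called), and it only shows the general shape of generating a test prompt, running candidate solutions against shared test cases, and keeping the fastest one that passes.

```python
# Illustrative sketch only (not the paper's actual harness): select the
# fastest candidate solution that passes a shared set of test cases.
import time


def build_test_prompt(problem: str, n_cases: int = 5) -> str:
    """Hypothetical prompt-engineering step: ask an LLM for test cases."""
    return (
        f"Generate {n_cases} input/expected-output pairs, one per line, "
        f"for this problem:\n{problem}"
    )


def pick_best(candidates, test_cases):
    """Return the fastest candidate passing every (input, expected) pair."""
    best, best_time = None, float("inf")
    for solve in candidates:
        start = time.perf_counter()
        try:
            ok = all(solve(x) == want for x, want in test_cases)
        except Exception:
            ok = False
        elapsed = time.perf_counter() - start
        if ok and elapsed < best_time:
            best, best_time = solve, elapsed
    return best


# Toy example: two candidate Fibonacci solutions, one of them buggy.
tests = [(0, 0), (1, 1), (7, 13)]


def fib_ok(n):
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a


def fib_bad(n):
    return n  # wrong for n >= 2


best = pick_best([fib_bad, fib_ok], tests)
```

In this toy run, `pick_best` discards `fib_bad` (it fails the `(7, 13)` case) and returns `fib_ok`; scaling the candidate list to 100 generated solutions gives the flavor of the best-of-100 comparison described in the abstract.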



Paper Citation


in Harvard Style

Elnashar A., Moundas M., Schmidt D., Spencer-Smith J. and White J. (2024). Evaluating the Performance of LLM-Generated Code for ChatGPT-4 and AutoGen Along with Top-Rated Human Solutions. In Proceedings of the 19th International Conference on Software Technologies - Volume 1: ICSOFT; ISBN 978-989-758-706-1, SciTePress, pages 258-270. DOI: 10.5220/0012820600003753


in Bibtex Style

@conference{icsoft24,
author={Ashraf Elnashar and Max Moundas and Douglas Schmidt and Jesse Spencer-Smith and Jules White},
title={Evaluating the Performance of LLM-Generated Code for ChatGPT-4 and AutoGen Along with Top-Rated Human Solutions},
booktitle={Proceedings of the 19th International Conference on Software Technologies - Volume 1: ICSOFT},
year={2024},
pages={258-270},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0012820600003753},
isbn={978-989-758-706-1},
}


in EndNote Style

TY - CONF
JO - Proceedings of the 19th International Conference on Software Technologies - Volume 1: ICSOFT
TI - Evaluating the Performance of LLM-Generated Code for ChatGPT-4 and AutoGen Along with Top-Rated Human Solutions
SN - 978-989-758-706-1
AU - Elnashar A.
AU - Moundas M.
AU - Schmidt D.
AU - Spencer-Smith J.
AU - White J.
PY - 2024
SP - 258
EP - 270
DO - 10.5220/0012820600003753
PB - SciTePress
ER -