Model-Driven Development Using LLMs: The Case of ChatGPT

Virginia Niculescu, Maria-Camelia Chisăliță-Crețu, Cristina-Claudia Osman and Adrian Sterca

Babeș-Bolyai University, Cluj-Napoca, Romania

{virginia.niculescu, maria.chisalita, cristina.osman, adrian.sterca}@ubbcluj.ro

Keywords:

Model-Driven Development, Large Language Models, Conceptual Diagrams, Business Process Model and

Notation, Entity-Relationship Diagrams, User Productivity, Business Analysts.

Abstract:

The recent rise of Large Language Models (LLMs) suggests the possibility for users with different levels of

expertise to generate software applications from high-level speciﬁcations such as formatted text, diagrams or

natural language. This would enhance productivity and make these activities accessible to users without a

technical background. Approaches such as Model-Driven Engineering (MDE) and Workﬂow Management

Systems (WfMSs) are widely used to enhance productivity and streamline software development through

automation. This study explores the feasibility of using LLMs, speciﬁcally ChatGPT, in software development,

focusing on their capability to assist business analysts (BAs) in generating functional applications. The goal

of this paper is threefold: (1) to assess the extent to which LLMs comprehend conceptual model diagrams,

(2) to evaluate the reliability of diagram-based code generation, and (3) to determine the level of technical

knowledge required for users to achieve viable solutions. Our methodology evaluates the effectiveness of using

LLMs to generate functional applications starting from BPMN process diagrams and Entity-Relationship (ER)

diagrams. The ﬁndings provide insights into the reliability and limitations of LLMs in diagram-based software

generation, the degree of technical expertise required, and the prospects for adopting LLMs as tools for BAs.

1 INTRODUCTION

A signiﬁcant factor behind the difﬁculty of develop-

ing complex software is the wide conceptual gap be-

tween the problem and the implementation domains

of discourse. Business analysts (BAs) are able to

provide a good description of the problem; business

processes can be represented as diagrammatic visual-

izations using standardized or non-standardized lan-

guages. In order to be analyzed and managed, busi-

ness processes need representations that are formal,

standardized and at the same time, easy to understand.

The Business Process Model and Notation (BPMN) is such a representation; it is a standard managed by an international organization, the Object Management Group (OMG). Nowadays, practitioners use BPMN on a daily basis to design various complex business processes.

The term Model-Driven Engineering (MDE) is typically used to describe software development approaches in which abstract models of software systems are created and systematically transformed into concrete implementations (France and Rumpe, 2007). MDE is meant to increase productivity and simplify the process of design and implementation.

Author ORCIDs:
https://orcid.org/0000-0002-9981-0139
https://orcid.org/0000-0002-1414-0202
https://orcid.org/0000-0002-5911-0269

Another approach that has the same goals of in-

creasing productivity and at the same time ensuring

simple and correct development of the business ap-

plications is represented by the Workﬂow Systems

(WfSs). A Workﬂow Management System (WfMS)

is a system that allows deﬁning, managing and exe-

cuting processes.

Large Language Models (LLMs) are a specialized subset of GenAI, focusing on language generation, while GenAI consists of a greater variety of AI models capable of creating various forms of content. The recent popularity of LLMs suggests that, in the near future, they could be used for developing software applications. This would likely increase productivity and allow users without IT specialization to create applications from specifications. The specifications of the software applications could be given in different formats: formatted text, diagrams, or even natural language. The general goal of our research is to study how close we are to this stage.

Niculescu, V., Chisăliță-Crețu, M.-C., Osman, C.-C. and Sterca, A.
Model-Driven Development Using LLMs: The Case of ChatGPT.
DOI: 10.5220/0013484400003928
Paper published under CC license (CC BY-NC-ND 4.0)
In Proceedings of the 20th International Conference on Evaluation of Novel Approaches to Software Engineering (ENASE 2025), pages 328-339
ISBN: 978-989-758-742-9; ISSN: 2184-4895

Business processes are deﬁned as a set of business

activities that represent the steps required to achieve a

business objective. They include the ﬂow and use of

information and resources (OMG, 2013). The most

common modeling languages used in process repre-

sentations are Event-driven Process Chains (EPCs)

(Keller et al., 1992) and Business Process Model and

Notation (BPMN) (OMG, 2013) diagrams.

Targeted users of our research are BAs who have to manage artefacts such as business process diagrams that consist of meaningful actions performed in specific business areas. Often, BAs have to cope with changing requirements or with defining new business processes that should be accommodated into the existing software. They may be asked to provide a proof of concept that helps the decision makers or to build a quick solution that can be extended or adapted later. In this context, LLMs become a tool at hand that may be employed to obtain results faster than other software development methodologies.

We aim to investigate whether LLMs may be used as effective tools in software development by specialists, i.e., BAs, who may not have the technical knowledge that developers have.

Considering BAs as the targeted users, a promising approach to using LLMs is to start by providing a minimal set of information consisting of just the conceptual models described using diagrams. The diagrams implicitly describe the processes and the functionalities of the application. An entity-relationship model (ER model) is also given in order to ensure the correct generation of all entities involved. This approach is similar to the approach used by WfMSs. Based on all these, we may assume that the source code of the application could be generated by AI.

ChatGPT was chosen as the LLM variant in our

study. ChatGPT uses a complex machine learning

model built on a Generative Pre-trained Transformer

(GPT) architecture. The model is trained on a massive dataset of text from books, articles, websites, and other publicly available sources (Bala et al., 2025).

In the speciﬁc context of the proposed endeavour,

the users (BAs) provide ER diagrams and BPMN pro-

cess diagrams. So, we constrained the diagram anal-

ysis to these types of diagrams.

For a structured study approach we have identiﬁed

three research questions, as follows:

• RQ1 – To what extent does ChatGPT comprehend

conceptual model diagrams?

• RQ2 – To what extent is diagram-based code generation reliable?

• RQ3 – How much technical knowledge should the

user have in order to build a reliable solution?

The next section presents work related to our approach. Section 3 describes the methodology employed throughout the research and the evaluation results. The answer to each research question is included in this section, after the analysis of the results of the corresponding experiments. Based on the obtained results, in Section 4 we present an incipient analysis of the possible directions to follow in order to use LLMs successfully in software development. The paper concludes with the final remarks.

2 RELATED WORK

The application of LLMs in BPM semantic quality

improvement is explored by (Ayad and Alsayoud,

2024). The paper investigates the extent to which

GenAI technologies can aid the modeler by suggest-

ing improvements. The study uses GPT-4o and exam-

ines its capabilities by employing various combina-

tions of prompts, incorporating proposed textual syn-

tax, and integrating contextual domain knowledge.

The ﬁndings indicate that the knowledge generated

by GPT-4o is predominantly generic, encompassing

ambiguous and general concepts that extend beyond

the speciﬁc domain. The use of speciﬁc proposed

prompts helps to reﬁne the generated knowledge,

leading to more speciﬁc and comprehensive outcomes

that align closely with the intended domain.

Kanuka et al. (Kanuka et al., 2023) explore the

capabilities of LLMs, particularly OpenAI’s Chat-

GPT, in addressing the challenges associated with

software modeling, explicitly focusing on the bidi-

rectional traceability problem between design models

and code. The study aims to showcase the proﬁciency

of ChatGPT in understanding and integrating speciﬁc

requirements into design models and code. The paper

investigates its potential to offer solutions to the bidi-

rectional traceability problem through a case study.

The ﬁndings indicate that ChatGPT is capable of gen-

erating design models and code from natural language

requirements, thereby bridging the gap between these

requirements and software modeling. ChatGPT showed limitations in suggesting a specific method to resolve the problem, but exhibited the capacity to provide corrections that keep design models and code consistent.

The research conducted by (Rajbhoj et al., 2024)

suggests that generative AI techniques have the po-

tential to reduce the skill requirements necessary for

software development and signiﬁcantly accelerate the

development process.


GitHub Copilot is examined by Wermelinger (Wermelinger, 2023), focusing on its performance in generating code, tests, and details when supporting students in solving computer science problems described as text. The study compares Copilot with OpenAI Davinci in terms of correctness, diversity, and guidance needed to obtain correct solutions. It reports that Davinci was more effective than Copilot regarding correctness and diversity.

Dakhel et al. (Dakhel et al., 2022) carried out an assessment of GitHub Copilot as an AI pair programmer. The researchers investigated the quality of the generated code compared to human-written code on a set of programming tasks. The results showed that Copilot is capable of providing solutions for almost all fundamental algorithmic problems, but some of them are buggy. When comparing Copilot to humans, the results indicate that the ratio of correct human solutions is greater than Copilot's correctness ratio, while the buggy solutions generated by Copilot require less effort to be repaired.

The study of Siam et al. (Siam et al., 2024)

presents a thorough evaluation of leading program-

ming assistants, including ChatGPT, Gemini (Bard

AI), AlphaCode, and GitHub Copilot. The evalua-

tion is based on tasks like natural language process-

ing and code generation accuracy in different pro-

gramming languages like Java, Python, and C++. The

study offers a comparison of different LLMs and pro-

vides essential feedback on the rapidly changing area

of AI models, emphasizing the need for ethical devel-

opmental practices to actualize AI models’ full poten-

tial. Results indicate the strengths, weaknesses, and

the importance of further modiﬁcations to increase the

reliability and accuracy of the latest popular models.

Liukko et al. (Liukko et al., 2024) document the development of a real-life web application in the financial domain using ChatGPT. They adopt an Agile methodology in developing the software.

Dae-Kyoo Kim (Kim, 2024) compares OpenAI’s

ChatGPT and Google’s Bard in developing a tour

reservation web application. Both LLMs are used

for producing various software development life cy-

cle (SDLC) artifacts from a textual description of

the application: generating the functional and non-

functional requirements, domain modeling, and im-

plementation. The author mentions that both mod-

els identiﬁed correctly many entities of the domain of

the application, but also that they missed some of the

entities. Still, they could not produce class diagrams or sequence diagrams for the modeling phase. When it comes to code generation, the author mentions that ChatGPT and Bard produced errors of different kinds in their generated code, including missing import statements, missing modifiers, undefined variables, undefined data types, undefined methods, and parameter mismatches. When asked to correct these compile errors, both tools were able to correct most of them. The author concludes that both tools are helpful in developing an application, but they also have limitations.

GitHub Copilot: https://github.com/features/copilot

3 METHODOLOGY AND

EVALUATION

Our approach implies the use of GenAI tools in the SDLC process. Specifically, we intend to explore the usage of ChatGPT as an LLM tool in the software development process. Moreover, we propose the use of BPMN models and ERDs in the development of the software product. Figure 1 represents the general approach of our study.

Our general endeavor is to evaluate the possibility

to generate source code for a complex web application

starting from business processes (depicted as BPMN

models) and business domain concepts (depicted as

ERDs). This analysis is split into three phases di-

rected by the established research questions.

For each phase of the evaluation we set the following

stages:

1. establish the evaluation criteria appropriate to the

corresponding research question;

2. create the input data set for evaluation;

3. establish the prompts that allow us to evaluate

the established criteria (the initial prompt may be

adapted based on the responses);

4. analyse the results and extract conclusions.

Full details about the prompts, data, and evaluation of the studied diagrams are available at the link given in the reference (Chisăliță-Crețu et al., 2025).

RQ1: To What Extent Does ChatGPT

Comprehend Conceptual Model

Diagrams?

The first goal is to evaluate ChatGPT's degree of comprehension of the diagrams used. We have evaluated two types of diagrams, namely ERDs and BPMN process diagrams.

Evaluation Criteria

Figure 1: General approach.

The response generated by ChatGPT for each diagram provided as input is assessed based on the following criteria:

• C01: ChatGPT is able to correctly identify the

type of the received diagram;

• C02: the extent to which ChatGPT is able to cor-

rectly describe the symbols and relationships in-

cluded in the diagrams;

• C03: the extent to which ChatGPT is able to cor-

rectly transform the diagram into a persistent for-

mat through conversion to another artifact, e.g.,

database schema, XML format;

• C04: ChatGPT is able to improve the existing di-

agrams;

• C05: the extent to which ChatGPT is able to generate a diagram based on a persistent format (database schema, XML format).

Criteria C01 and C04 are evaluated with yes or no,

while criteria C02, C03, and C05 use the scale from 1

(low accuracy) to 5 (high accuracy). Criteria C03 and

C05 are complementary.

ER Diagrams

Data

Five samples of ER diagrams were provided as input

for ChatGPT to analyze. Two samples of ERDs do

not include attributes. From the remaining ones, one

ERD highlights primary keys (PKs) and foreign keys

(FKs) using graphical symbols, one ERD uses the text

acronyms PK and FK, while the last ERD does not

include distinct marks for attributes or keys. Because

it is more commonly used, we opted for Crow’s Foot

notation instead of Chen notation.

Prompts

The prompts given to ChatGPT are listed below; the evaluation of C01 and C02 uses the same prompt.

P1.ERD: ”Explain the uploaded diagram.”

P2.ERD: ”Provide the database schema for the uploaded

ERD.”

P3.ERD: ”Improve the database schema with new entities,

attributes, and relationships between entities.”

P4.ERD: ”Recreate the ERD for the improved database

schema. Please use the same type of diagram as the

uploaded one, including the crow’s foot notation.”

Results Analysis

For ERDs, the responses include details about the

type of the diagram, entities, attributes, and relation-

ships. Additional data produced refers to generated

database schema. A third type of data generated in-

cludes the improvements for the database schema, in

terms of new entities, attributes, and relationships. At

the end, ChatGPT generates a diagram that should

represent the improved database schema.

The prompt P1.ERD requested to explain the up-

loaded diagrams. The responses offered by ChatGPT

indicate that it manages to identify correctly the type

of the diagram. Therefore, criterion C01 is evaluated with yes for all five recorded experiments, in which ChatGPT explicitly stated that an ERD was uploaded. In other, unrecorded experiments, ChatGPT answered that the processed ERD was a concept diagram, and the user had to ask it to clarify what type of concept diagram it actually was. In those cases, ChatGPT identified the diagram type correctly only after additional prompts that refined the response.

The responses to the P1.ERD prompt allowed the evaluation of the C02 criterion, too. ChatGPT manages to correctly describe the ERDs, identifying and detailing entities, attributes, and relationships. The C02 criterion is evaluated with 5 for all recorded experiments. For the cases where the ERDs do not specify attributes, ChatGPT offered details about the entities and relationships only, but also offered information about the cardinality, usage, and purpose of the diagram.

The prompt P2.ERD requested the generation of the database schema for the uploaded ERD. Almost all generated database schemas, expressed as SQL statements, only partially match the initial ERDs. For instance, ERD1 and ERD2 do not specify attributes. Still, ChatGPT adds attributes to the existing entities, as well as new entities that serve as bridge tables to manage m:n relationships. The C03 criterion is evaluated to 4 in these cases because of the improvements suggested for the missing attributes. ERD3 includes attributes, crow's foot notation, and primary and foreign keys. Still, ChatGPT adds two new relationships between the existing entities. For the ERD3 sample and this specific result, the C03 criterion is evaluated to 3, as ChatGPT does not follow the details given in the diagram. ERD4 presents graphical symbols for primary keys and foreign keys, and uses the crow's foot notation. In the experiment for ERD4, ChatGPT generates a database schema that entirely mirrors the provided ERD, and the C03 criterion is evaluated to 5. All database schemas provided are normalized to 3NF.
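The bridge tables mentioned above, which resolve an m:n relationship into two 1:n relationships, can be illustrated with a minimal sketch. The entity names (`student`, `course`, `enrollment`) and the helper functions are hypothetical and are not taken from the evaluated ERDs or from ChatGPT's output; SQLite is used only to keep the example self-contained, whereas the case study targets MySQL.

```python
import sqlite3

# Hypothetical schema: the m:n relationship between student and course
# is resolved through the bridge table enrollment, whose composite
# primary key is made of the two foreign keys.
SCHEMA = """
CREATE TABLE student (
    student_id INTEGER PRIMARY KEY,
    name       TEXT NOT NULL
);
CREATE TABLE course (
    course_id INTEGER PRIMARY KEY,
    title     TEXT NOT NULL
);
CREATE TABLE enrollment (
    student_id INTEGER NOT NULL REFERENCES student(student_id),
    course_id  INTEGER NOT NULL REFERENCES course(course_id),
    PRIMARY KEY (student_id, course_id)
);
"""


def build_demo_db():
    """Create an in-memory database with a few illustrative rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only on request
    conn.executescript(SCHEMA)
    conn.executemany("INSERT INTO student VALUES (?, ?)", [(1, "Ana"), (2, "Dan")])
    conn.executemany("INSERT INTO course VALUES (?, ?)", [(10, "Databases"), (20, "BPM")])
    conn.executemany("INSERT INTO enrollment VALUES (?, ?)", [(1, 10), (1, 20), (2, 10)])
    return conn


def courses_of(conn, student_id):
    """Follow the bridge table from one student to the related courses."""
    rows = conn.execute(
        "SELECT c.title FROM course c "
        "JOIN enrollment e ON e.course_id = c.course_id "
        "WHERE e.student_id = ? ORDER BY c.title",
        (student_id,),
    )
    return [title for (title,) in rows]
```

With this layout, each side of the m:n relationship is reached by a join through `enrollment`, and an enrollment row that references a non-existent student or course is rejected by the foreign key constraints.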

The prompt P3.ERD requested to improve the

database schema with new entities, attributes, and

relationships. For all experiments, the data gener-

ated indicate improvements and the C04 criterion was

evaluated to yes. For some cases, the initial entities

were replaced with new entities and relationships that

ChatGPT has evaluated as offering a better perspec-

tive for the application domain through ﬂexibility and

adaptability.

The prompt P4.ERD requested to recreate the

ERD for the improved database schema. The prompt

explicitly requested to use the crow’s foot notation,

but the generated diagrams do not follow the standard

notation for ERD. The C05 criterion was evaluated to

1 for all run experiments. ChatGPT did not suggest

using particular tools that could successfully generate

the ERD based on the provided database schema in

any recorded experiments.

Table 1 summarizes the results of the recorded ex-

periments on ERDs together with the evaluation for

the ﬁve used criteria.

Table 1: Experimental results for ERDs.
ERD                C01   C02   C03   C04   C05
ERD1, no attr.     yes   5     4     yes   1
ERD2, no attr.     yes   5     4     yes   1
ERD3, with attr.   yes   5     3     yes   1
ERD4, with attr.   yes   5     5     yes   1
ERD5, with attr.   yes   5     3     yes   1
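For readers who want to aggregate the scores in Table 1, the following is a minimal sketch. The values are copied from the table; the dictionary layout and the `summarize` helper are our own assumptions, and averaging an ordinal 1 to 5 scale is shown only as an illustration, not as an analysis performed in the study.

```python
from statistics import mean

# Scores copied from Table 1: C01 and C04 are yes/no, the others use 1-5.
TABLE_1 = {
    "ERD1": {"C01": "yes", "C02": 5, "C03": 4, "C04": "yes", "C05": 1},
    "ERD2": {"C01": "yes", "C02": 5, "C03": 4, "C04": "yes", "C05": 1},
    "ERD3": {"C01": "yes", "C02": 5, "C03": 3, "C04": "yes", "C05": 1},
    "ERD4": {"C01": "yes", "C02": 5, "C03": 5, "C04": "yes", "C05": 1},
    "ERD5": {"C01": "yes", "C02": 5, "C03": 3, "C04": "yes", "C05": 1},
}


def summarize(table):
    """Per criterion: share of 'yes' answers, or mean of the 1-5 scores."""
    criteria = next(iter(table.values())).keys()
    summary = {}
    for criterion in criteria:
        values = [row[criterion] for row in table.values()]
        if all(isinstance(v, str) for v in values):
            # yes/no criterion: fraction of samples evaluated with "yes"
            summary[criterion] = sum(v == "yes" for v in values) / len(values)
        else:
            # graded criterion: arithmetic mean over the samples
            summary[criterion] = mean(values)
    return summary
```

Calling `summarize(TABLE_1)` returns one number per criterion, which makes the contrast between the descriptive criteria (C01, C02) and the diagram regeneration criterion (C05) easy to read at a glance.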

BPMN Diagrams

Data

We have used five BPMN diagrams (source: https://github.com/camunda/bpmn-for-research/tree/master/BPMN%20for%20Research/English):

• BPMN1 – collaboration diagram with different

gateways types and annotations (asynchronous

communication),

• BPMN2 – collaboration diagram with exclusive

gateways (synchronous communication),

• BPMN3 – process diagram with inclusive, exclu-

sive and parallel gateways,

• BPMN4 – collaboration diagram with event-

based gateways and different intermediate events,

• BPMN5 – process diagram with exclusive and

event-based gateways.

Prompts

P1.BPMN: ”Explain the uploaded diagram.”

P2.BPMN: ”Generate the XML-ﬁle (*.bpmn) correspond-

ing to the uploaded diagram.”

P3.BPMN: ”Improve the uploaded diagram by including

new elements (tasks, events, gateways, data objects,

swimlanes etc. - or any other BPMN concept).”

P4.BPMN: ”Recreate the BPMN model for the im-

proved BPMN model. Use BPMN speciﬁcation:

https://www.omg.org/spec/BPMN.”

Results Analysis

Our evaluation criteria are consistent with those previ-

ously employed for the assessment of ERDs, while in-

corporating necessary adaptations to ensure their ap-

plicability within the BPMN context. The ﬁrst cri-

terion C01 evaluates the ability of ChatGPT to iden-

tify the diagram type (process diagram modeled using

BPMN). The next criterion C02 refers to the ability of

ChatGPT to correctly describe the ﬂow of the BPMN

diagrams (namely, to identify the BPMN symbols and

the corresponding relationships between them). The

prompt used for the assessment of the ﬁrst two criteria

is P1.BPMN. The third criterion C03, evaluates the

ability of ChatGPT to extract from the diagrammatic

visualization of a BPMN model, the corresponding

XML representation (*.bpmn ﬁle). The prompt used

for the assessment of this criterion is P2.BPMN. Criterion C04 evaluates the capacity to bring improvements to the uploaded BPMN diagram. The improvements may include the incorporation of new BPMN symbols (such as tasks, gateways, data objects, etc.). The corresponding prompt of the fourth criterion is P3.BPMN. Through the fourth prompt, P4.BPMN, we assess criterion C05, i.e., the ability of ChatGPT to produce a BPMN model that reflects the previously proposed improvements.

The analysis of BPMN diagrams demonstrates ChatGPT's accurate interpretation of BPMN process models, in line with the diagrams' symbol usage, successfully fulfilling criteria C01 and C02. Criteria C03 and C05 are partially met: although ChatGPT proposes new concepts to be added to the diagram (usually tasks, gateways, data objects, intermediate events, swimlanes, sub-processes, text annotations, message flows, etc.), the XML file that is generated does not comply with the BPMN specification (OMG, 2013). The response for P3.BPMN is usually the XML format of the improved model with additional narrative explanations related to the proposed updates. The *.bpmn file cannot be read by BPMN editors like SAP Signavio or bpmn.io; Bizagi is one of the BPMN modeler tools able to partially interpret the XML files (but the visualization it provides overlaps the identified BPMN elements). When additional details are provided in the prompt (for example: ”The position of the symbols is missing from the XML file and the elements are overlapped, can you please update the XML file?” or ”The same position of the symbols is provided for the elements from the XML file and the elements are overlapped, can you please update the XML file?”), the XML file is improved, but it still fails to accurately represent a machine-readable BPMN model. Consequently, a well-defined sequence of prompts is required for ChatGPT to effectively fulfill the initial task (therefore, ChatGPT users should enhance their prompt engineering skills).

Nevertheless, criterion C05 is minimally fulfilled. The provided visualization deviates from the BPMN standard. In some situations it reveals the Python source code used to produce the graphical visualization (see P4.1 for BPMN5, reachable at the link from the reference (Chisăliță-Crețu et al., 2025)). Still, it cannot provide standardized diagrams, not even when the XML structure (for example, the XML file corresponding to a BPMN diagram) is provided.

Table 2 summarizes the results of the recorded experiments on BPMN models, together with the evaluation for the five criteria used.

Table 2: Experimental results for BPMN diagrams.
BPMN     C01   C02   C03   C04   C05
BPMN1    yes   5     2     yes   1
BPMN2    yes   5     2     yes   1
BPMN3    yes   5     2     yes   1
BPMN4    yes   5     2     yes   1
BPMN5    yes   5     2     yes   1
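The two layout problems reported in the follow-up prompts (symbol positions missing from the XML file, or the same position given to several symbols) can be detected mechanically before a file is opened in a modeler. The sketch below is an assumption-laden illustration, not the procedure used in the study: it relies on the standard BPMN 2.0 namespaces, checks only a small subset of flow node types, and the function name `layout_problems` and the sample document are hypothetical.

```python
import xml.etree.ElementTree as ET

# Standard BPMN 2.0 namespaces for the model, the diagram interchange
# (BPMNDI) and the diagram commons (DC) parts of a *.bpmn file.
NS = {
    "bpmn": "http://www.omg.org/spec/BPMN/20100524/MODEL",
    "bpmndi": "http://www.omg.org/spec/BPMN/20100524/DI",
    "dc": "http://www.omg.org/spec/DD/20100524/DC",
}

# Only a subset of flow node types is checked in this sketch.
FLOW_NODE_TAGS = ("startEvent", "endEvent", "task", "exclusiveGateway")


def layout_problems(bpmn_xml):
    """Return (ids without a shape position, ids sharing an earlier position)."""
    root = ET.fromstring(bpmn_xml)
    node_ids = [
        element.get("id")
        for tag in FLOW_NODE_TAGS
        for element in root.iterfind(f".//bpmn:{tag}", NS)
    ]
    positions = {}
    for shape in root.iterfind(".//bpmndi:BPMNShape", NS):
        bounds = shape.find("dc:Bounds", NS)
        if bounds is not None:
            # Missing coordinates are treated as 0 (a simplification).
            x = float(bounds.get("x", 0))
            y = float(bounds.get("y", 0))
            positions[shape.get("bpmnElement")] = (x, y)
    missing = [node for node in node_ids if node not in positions]
    first_at = {}
    overlapping = []
    for node in node_ids:
        if node in positions:
            if positions[node] in first_at:
                overlapping.append(node)
            else:
                first_at[positions[node]] = node
    return missing, overlapping


# Hypothetical sample: "end" has no shape, "t2" reuses the position of "t1".
SAMPLE = """
<definitions xmlns="http://www.omg.org/spec/BPMN/20100524/MODEL"
             xmlns:bpmndi="http://www.omg.org/spec/BPMN/20100524/DI"
             xmlns:dc="http://www.omg.org/spec/DD/20100524/DC">
  <process id="p1">
    <startEvent id="start"/>
    <task id="t1"/>
    <task id="t2"/>
    <endEvent id="end"/>
  </process>
  <bpmndi:BPMNDiagram>
    <bpmndi:BPMNPlane bpmnElement="p1">
      <bpmndi:BPMNShape bpmnElement="start">
        <dc:Bounds x="100" y="100" width="36" height="36"/>
      </bpmndi:BPMNShape>
      <bpmndi:BPMNShape bpmnElement="t1">
        <dc:Bounds x="200" y="80" width="100" height="80"/>
      </bpmndi:BPMNShape>
      <bpmndi:BPMNShape bpmnElement="t2">
        <dc:Bounds x="200" y="80" width="100" height="80"/>
      </bpmndi:BPMNShape>
    </bpmndi:BPMNPlane>
  </bpmndi:BPMNDiagram>
</definitions>
"""
```

A check of this kind does not establish compliance with the BPMN specification; it only flags the specific symptoms described above, so that a follow-up prompt can name the affected elements.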

Answer to RQ1:

From all these experiments we may conclude that

ChatGPT is able to understand the ERDs, the entities

and their relations being extracted from the images.

The database schema is not accurately generated, as some relations are not defined at all or are not appropriately defined. Also, new tables, attributes, and relationships may be added.

Requested improvements show that ChatGPT is able to add new entities, attributes, and relationships based on the deduced data domain represented by the diagram. Conversely, ChatGPT is not able to generate ERDs following domain-specific modeling notation as dedicated standalone tools are able to.

The BPMN model analysis reﬂects a good interpre-

tation of the diagrams, but shows that a well-deﬁned

sequence of prompts is required for ChatGPT to ef-

fectively fulﬁll the transformation into the XML per-

sistent format.

ChatGPT is not able to produce standardized diagrams, not even if the entire XML structure of the BPMN diagram is provided.

We may conclude that the interpretation and transfor-

mation work executed by ChatGPT for the analysed

diagram types is very useful since the number of in-

accuracies is low in percentage, and with additional

further veriﬁcation and prompts the correct solution

could be obtained.

RQ2: To What Extent Is Diagram-Based Code Generation Reliable?

The second goal of our study is to assess the extent to which the code generated by ChatGPT is able to fit the real needs of the BAs from various perspectives.

To assess this capability, we conduct a case study

on developing an application using some of the most

widely adopted technologies in software develop-

ment, including React for the frontend, Spring for the

backend, and MySQL for the database.

The considered application is a sale web appli-

cation that involves two types of actors: seller and

customer. The seller prepares a quote and sends it to

the customer. Using the quote, the customer selects

the desired products. The customer then provides

additional information, such as the shipping cost, the

address, and the payment method. After entering

customer details, described in the data model as

Name, Primary Contact, Phone and Industry, the

customer creates and submits the order to the seller.

Upon receiving the order, the seller generates an

invoice number. Using the order details, the total

price is calculated, and the delegate is included in

the invoice. The shipping and the payment methods

are also incorporated based on the order information.

The seller adds the shipping address details, updates

the total price, generates the invoice and sends it to

the customer.


Evaluation Criteria

• C06: the extent to which ChatGPT is able to gen-

erate code consistent with the diagrams provided

by the user;

• C07: the extent to which ChatGPT is able to gen-

erate an entire application;

• C08: the extent to which ChatGPT is able to im-

prove the generated code as a consequence of us-

ing additional prompts (clariﬁcation);

• C09: the extent to which ChatGPT is able to in-

tegrate new requirements in the previously gener-

ated code;

• C10: the extent to which ChatGPT is able to guide the user regarding the platform prerequisites for successful deployment of the generated application.

Data

For this application we consider that the user (BA)

will provide only two diagrams. From the business

point of view they include enough information about

the entities and the processes into which they are in-

volved:

• an ER diagram that describes the main concepts

together with their attributes and relations be-

tween them (Fig. 2);

• a BPMN diagram that describes the main functionalities of the application (Fig. 3).

Figure 2: ER Diagram for the Sale web application.

Prompts and Results Analysis

For this goal of evaluating the code generation,

we started from an initial prompt and following

prompts were given depending on the analysis of the

responses.

All-in-One Approach

Since we intended to find out whether ChatGPT can generate an application that respects the requirements implicitly given in the diagrams, we first tried an initial prompt that asked ChatGPT upfront to generate all the code files for the web application, both the frontend and the backend.

Initial Prompt:

”Develop a web application using the generated database

and the BPMN process from the uploaded picture. The

web application should use a MySQL database and should

be made from a React frontend and a Java backend. Please

pack everything together as a downloadable zip archive.”

Response:

”Creating a web application with the described architecture

involves the following steps:

1. Database Setup: Set up the MySQL database with the previously generated schema and data.
2. Backend Development: Build a Java Spring Boot backend that connects to the database and exposes RESTful APIs for the BPMN processes.

3. Frontend Development: Create a React frontend that in-

teracts with the backend to follow the BPMN workﬂow.

4. Packaging: Organize the application as an IntelliJ IDEA

project and package it as a downloadable zip archive.

I will prepare the required ﬁles and structure the project.

This process may take some time. The web application

project has been successfully created and packaged as a zip

archive. You can download it using the link below:

Download ProductManagementWebApp.zip ”

The provided zip archive contains just the main Spring Boot class of the application for the backend and a simple JavaScript skeleton file for the frontend. It can be noticed that the response includes a short description of the structure of the application, but it suggests that the entire application was built, which is not correct.

During the next attempt, we gave ChatGPT

the same prompt but without the requirement of

packaging all code in a zip archive. This time it

generated more relevant code, different classes for

the Java backend and .jsx ﬁles for the React frontend

application. Although the code was good and more complete, it still had many inconsistencies between different code files (e.g., objects calling methods that do not exist). We had to spend a lot of time manually correcting the code so that it would compile and, finally, run correctly.

Iterative Approach

Analysing the results of the first two attempts led to the conclusion that the strategy should be changed. For the third attempt we adopted the following iterative development strategy: we would ask ChatGPT to generate small chunks of code, and an experienced software developer (i.e., one of the authors of the paper) would review the code and correct it (sometimes giving the corrected code back to ChatGPT as input) so that errors do not propagate in the development process. In the remainder of this subsection we document the interaction with ChatGPT in this third attempt.

Figure 3: BPMN diagram of the main processes of the Sale application.

For this approach we started with a prompt in which we uploaded the BPMN process diagram and asked ChatGPT:

“I want to implement the above BPMN process in a web application with React frontend and Java Springboot backend. First, please analyze the BPMN diagram and tell me what do you understand from it.”

After we verified that the provided interpretation of the diagram was correct, we gave ChatGPT the ERD image with the database concepts and asked:

“Please generate SQL scripts for creating a Mysql database for the above BPMN process. The conceptual diagram of the database is depicted in the attached image.”

The generated SQL scripts were mainly correct, but they also had problems. Three out of the nine relations in the diagram were not recognized by ChatGPT and some attributes were wrongly identified; on the plus side, ChatGPT correctly identified most of the foreign keys.

We then corrected the SQL scripts and instructed ChatGPT to use the corrected ones in the rest of the dialogue. Next, we asked ChatGPT to generate sample data, which we used to populate the database; the data required minor corrections.

Next, we started to implement the backend API service, using the following prompt:

“Generate Java code for the backend REST API implementation: generate JPA entity classes, service classes, repository classes, API Endpoints for the above database tables.”

ChatGPT generated entity, service, JPA repository, and controller classes, but only two for each category. It also generated the JPA Hibernate configuration. These classes contained correct code.

It was necessary to explicitly ask ChatGPT to generate all the entities. As a result, these were generated, but there were some inconsistencies. For instance, some of them were placed in a different package than the previously generated entities (i.e. model.* instead of entity.*), so we had to manually verify and correct them.
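A check of this kind can also be automated. The following is a minimal Java sketch, not part of the original experiment, that lists the classes declared outside an expected package. The helper PackageConsistencyCheck and the class names Product and Customer are hypothetical; the base package com.example.productsale is the one that appears in a prompt quoted later in this subsection.

```java
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical helper: flags generated classes that ended up in the wrong package.
final class PackageConsistencyCheck {

    // Returns the package part of a fully qualified class name ("" if there is none).
    static String packageOf(String qualifiedName) {
        int lastDot = qualifiedName.lastIndexOf('.');
        return lastDot < 0 ? "" : qualifiedName.substring(0, lastDot);
    }

    // Returns the class names whose package differs from the expected one.
    static List<String> misplaced(List<String> qualifiedNames, String expectedPackage) {
        return qualifiedNames.stream()
                .filter(name -> !packageOf(name).equals(expectedPackage))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Assumed example input: one entity in entity.*, one wrongly placed in model.*
        List<String> generated = List.of(
                "com.example.productsale.entity.Product",
                "com.example.productsale.model.Customer");
        System.out.println(misplaced(generated, "com.example.productsale.entity"));
    }
}
```

Running such a check after each generation round would surface misplaced classes before they cause import errors in the service and controller layers.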

During the following iterations, we explicitly

asked ChatGPT to generate:

• the rest of the JPA repository interfaces,

• the rest of the service classes and

• the rest of the REST controllers.

It was necessary to review each generated class file; some of them required minor corrections related to identifier inconsistencies across different files.

Many of the entity class files required additional annotations such as @Column(name="paymentmethod") for a private String paymentMethod member, so that the Hibernate SQL queries would work with the database structure previously generated by ChatGPT (otherwise, the attribute name would be mapped by default to the column payment_method, which did not exist in the database). We corrected all of these manually.
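The mismatch can be reproduced with a few lines of plain Java. The sketch below approximates the camelCase-to-underscore conversion that Spring Boot's default physical naming strategy applies to attribute names. It is a simplified illustration under that assumption, not the framework's actual implementation, and ColumnNameSketch is a hypothetical helper.

```java
import java.util.Locale;

// Hypothetical helper approximating the default attribute-to-column name mapping.
final class ColumnNameSketch {

    // Inserts '_' before an upper-case letter that sits between two lower-case
    // letters, then lower-cases the whole name.
    static String toColumnName(String attributeName) {
        StringBuilder name = new StringBuilder(attributeName.replace('.', '_'));
        for (int i = 1; i < name.length() - 1; i++) {
            boolean boundary = Character.isLowerCase(name.charAt(i - 1))
                    && Character.isUpperCase(name.charAt(i))
                    && Character.isLowerCase(name.charAt(i + 1));
            if (boundary) {
                name.insert(i++, '_'); // skip past the inserted underscore
            }
        }
        return name.toString().toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        // The generated schema used "paymentmethod", so this derived name does not match.
        System.out.println(toColumnName("paymentMethod")); // prints payment_method
    }
}
```

Because the derived name differs from the column in the generated schema, either the explicit @Column annotation or a schema that follows the underscore convention is needed.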

The controllers were correct, except that they did not have the CORS (Cross-Origin Resource Sharing) security setting configured, which is needed for them to be accessible from the React server.


In addition, some of the required service or repository classes were not generated at all, so we had to remind ChatGPT to generate them, as in the following prompt:

“You forgot to generate the class com.example.productsale.service.ProductService.”

As a result, the missing classes were generated.

We then asked ChatGPT to generate the Gradle build file and the OpenAPI documentation for the generated REST API.

Finally, for the backend side, we asked ChatGPT:

“Please generate the backend code that implements the BPMN process depicted in the diagram that I uploaded in the beginning of this chat.”

In its first response, ChatGPT generated only skeleton files for Camunda that were not specific to the BPMN process we had provided, but after we insisted, it generated the required backend files for the BPMN process.

The service class had many inconsistencies (e.g. calls to methods that did not exist, nonexistent attributes), so eventually we prompted ChatGPT:

“Please look again at the updated database structure (that I have given you previously as SQL statements). The implementation of OrderProcessService class you have provided me is incorrect, it does not use the database structure I mentioned. Can you generate again the OrderProcessService class?”

At this point ChatGPT provided two implementations of the same class and asked us to choose one. After we chose one, there were still some inconsistencies, which we corrected manually.

We were then ready to move to the generation of the React frontend app. We used the prompt:

“Can you generate the REACT frontend for the previously generated backend service?”

There were still some inconsistencies in the generated frontend code that we had to correct manually.

The deployment phase was entirely our responsibility: all components had to be assembled, configured, and set up by hand. On the positive side, after we deployed the frontend and backend applications, the web application worked correctly and implemented the functionality of the BPMN process given in the diagram.

In summary, ChatGPT generated approximately 1000 lines of code for the React frontend and 1300 lines of code for the Java Spring Boot backend.

ChatGPT was of real help: it managed to generate a web application from a BPMN process diagram and a conceptual database diagram. However, the application would not have worked correctly without an experienced software developer reviewing and correcting the generated code (criterion C06). We found that ChatGPT does not generate a complete application all at once; in our experience, the best methodology was an iterative one, with code inspection and review after each round (criterion C07). Regarding criterion C08, we consider ChatGPT very efficient at correcting itself after additional clarifications from the user, although it sometimes offers two candidate code solutions and asks the user to choose one. ChatGPT was least efficient at packaging and deploying the application (a problem more prominent for the backend application) and at integrating the different code files it had previously generated (criteria C09 and C10).

Answer to RQ2: We may conclude that ChatGPT is not able to create a software application directly in a single step: an iterative approach is necessary. During each iteration the tasks should be clear, explicitly stated, and focused on a single responsibility. Each iteration needs verification and corrections, but with additional clarifications from the user ChatGPT can efficiently correct itself. The most important aspect is that the final goal, building a reliable software application, is reached. The effort needed for this software construction is considerably reduced and, in addition, the knowledge support offered by ChatGPT is important, since at each step it also produces explanations together with the generated code.

RQ3: How Much Knowledge Should the

User Have in Order to Reach a Solution?

The third goal of our research concerns the human user who employs ChatGPT in their work. We therefore examine the amount of technical knowledge needed to use ChatGPT successfully to obtain the desired solution. The inquiry addresses the following aspects:

Evaluation Criteria

• C11: The level of technical knowledge required for a user to be able to give effective prompts and to understand the generated responses.

• C12: The level of technical knowledge required for the user to be able to aggregate all the received source code files, to configure the platform, and to deploy it to produce the fully working application.

For this research question we used the same data, prompts, and responses as for research question RQ2.


Results Analysis

The first plain observation is that the user should understand the diagrams included in the first prompt. This is to be expected, since we assume that the targeted users are BAs, for whom these kinds of diagrams represent common knowledge.

As can be seen from the initial prompt of RQ2, information about the technologies is required. This entails knowledge of these technologies at least at the level of knowing their purpose and their context of use, which means that the BA should be familiar with the corresponding terminology.

From the first response we can see that information about the structure of the application and its main components is given. Again, the fundamentals of these should be known in order to understand ChatGPT's response. This leads to the conclusion that, for criterion C11, the required level of technical knowledge includes fundamentals of the design and technologies used in business process management and software application development.

Since ChatGPT was not able to provide the entire code from the beginning, the user should be able to iteratively and explicitly ask for different components to be implemented and integrated. During this phase, ChatGPT gives some information about other technologies that it chooses to use. These should be understood by the user as well.

On the other hand, if the user makes a mistake, ChatGPT may point it out and suggest a correction. For example, in the experiment we used an entity named 'Order', and ChatGPT pointed out that this is an SQL keyword and should be changed. This can be of valuable help to the user.
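Such a name clash can also be detected before the schema is generated. The sketch below, which is an illustration and not part of the experiment, checks entity names against a small, deliberately incomplete subset of MySQL reserved words; ReservedNameCheck and its word list are hypothetical.

```java
import java.util.Locale;
import java.util.Set;

// Hypothetical helper: warns when an entity name collides with an SQL reserved word.
final class ReservedNameCheck {

    // Illustrative subset only; the full list is in the MySQL reference manual.
    private static final Set<String> RESERVED =
            Set.of("ORDER", "GROUP", "SELECT", "TABLE", "KEY", "INDEX");

    // SQL keywords are case-insensitive, so compare in upper case.
    static boolean isReserved(String entityName) {
        return RESERVED.contains(entityName.toUpperCase(Locale.ROOT));
    }

    public static void main(String[] args) {
        for (String name : new String[] {"Order", "Product"}) {
            System.out.println(name + (isReserved(name) ? ": reserved, rename it" : ": ok"));
        }
    }
}
```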

Through iterative requests all the source code was obtained, but it was the user's responsibility to manually extract the code and introduce it into a local project. The generated code contained errors, inaccuracies, and inconsistencies. To correct all of these, advanced knowledge of software development was necessary. In addition, the solution may involve particular project configuration and settings.

In providing the solution, ChatGPT chose to use a diverse set of frameworks and libraries (e.g. Flask, Spring, Axios or Fetch API). All these additional tools have to be installed and the necessary configuration has to be set.

Finally, ChatGPT was asked to give all the information needed to reach a deployable application. Although the response included several details about the necessary steps and guidelines, they could not be executed successfully without appropriate knowledge and experience.

So, we may conclude that for criterion C12 an expert user with advanced technical knowledge is needed.

Answer to RQ3: The experiments show that the user should know at least the fundamentals in order to give correct prompts and to understand the received responses. To achieve a complete functional application, the user should be an experienced, advanced developer. At the current stage, ChatGPT is more of an assistant tool in software development than an instrument able to create concrete, functional software applications for BAs. Obtaining functional applications therefore requires a high level of user knowledge.

Threats to Validity

For the previous analysis we used the free version of ChatGPT-4o; paid plans may provide better results. The ChatGPT-o1 model is trained with a large-scale reinforcement learning algorithm and produces responses using Chain-of-Thought (CoT) reasoning. Consequently, GPT-o1 is reported to have a longer response time than GPT-4o (mini), which may affect the user experience. ChatGPT-o3 mini is fast for advanced reasoning. Currently, free-plan users may send 50 requests every three hours, and this number is reduced when pictures and diagrams have to be analysed. Paid-plan users do not face unavailability issues, even during peak hours. It is therefore possible that we did not obtain the best possible responses. During the experiments, we did not encounter disruptions in ChatGPT's functionality, but rather limits on the number of requests allowed.

Not customizing ChatGPT may increase the number of prompts that have to be provided in order to obtain the desired results.

The computational effort needed to analyze and generate diagrams imposed constraints that kept us from finishing the evaluation in a single working session. Multiple ChatGPT sessions were required to generate the dataset for RQ1.

In addition, we noticed a rather high level of non-determinism in the responses. When ChatGPT is given the same prompt, whether by the same user or by different users, different responses are generated. This means that the quality of the responses is not always the same.


4 DIRECTIONS OF LLMs INTEGRATION IN SDLC

Based on the previous analysis, we may also derive some directions for the integration of LLMs into the software development life cycle (SDLC).

Software Development Life Cycle

SDLC methodologies provide a systematic management framework for the software development process, based on stages with specific deliverables. Some of the common SDLC stages are: Plan, Design, Implement, Test, Deploy, and Maintain. Following an SDLC methodology improves the following aspects: estimation, planning, and scheduling; risk management and cost estimation; software delivery and customer satisfaction; and visibility of the development process for all involved stakeholders.

SDLC has shifted from Waterfall models (Royce, 1970), Iterative models, V-Models (Hill, 1996), Spiral models (Boehm, 1988) or Model Driven Architecture (MDA) (OMG, 2001) to agile methodologies (Extreme Programming (XP) (Beck, 1999), Scrum (Schwaber and Beedle, 2001) or Kanban (Anderson, 2012)).

Among these, we identify two that may benefit the most from allowing an LLM to be used as an actor in the development process.

The Iterative SDLC model, presented in Fig. 4, stands out as a flexible and efficient methodology that promotes continuous improvement and adaptability. Its key principles are: Incremental Progress, Flexibility and Adaptability, Continuous Evaluation, and Risk Management.

Figure 4: Iterative Model.

(source:https://www.geeksforgeeks.org/sdlc-models-types-phases-use)

The Agile SDLC approach is described in Fig. 5. The key principles of this model are: Iterative and Incremental Development, Customer Collaboration, Adaptability to Change, and Cross-Functional Teams.

Both models are based on iterations in the development process. This is important because an LLM actor handles small, well-defined tasks well.

Figure 5: Agile Model.

(source:https://www.geeksforgeeks.org/sdlc-models-types-phases-use)

Comparison with Model-Driven Development

The model-driven development (MDD) approach is meant to increase productivity by using standardized models and by simplifying the design process through models of recurring design patterns in the application domain. It promotes communication between individuals and teams by using standard terminology and recognized best practices.

At their foundation, LLMs also work by identifying and generating models, so they too can provide information organized according to recognized models. Diagrams and programming patterns are examples of such models in the programming domain.

Conceptual diagrams are handled quite well by LLMs (e.g. ChatGPT). Even if the results are not perfect, LLMs interpret and transform diagrams with a high level of accuracy and could become a very useful actor in the MDD approach.

Comparison with Workﬂow Systems Approach

Workflow systems also allow secure and productive software development that starts from process diagrams. They interact with the user at a very high level of abstraction and automatically develop the software from specifications provided as workflows (processes). Obtaining the final application does not require expert-level technical knowledge: it is produced automatically on top of a pre-existing generic implementation, namely the workflow engine.

They efficiently cover exactly the part that is difficult to obtain from the interaction with an LLM actor. Still, in this case the resulting application is not an independent one and can be executed only inside the chosen workflow management system.

5 CONCLUSIONS

We conducted a study that aimed to evaluate the extent to which LLMs (e.g. ChatGPT) can be used to develop an application prototype starting from conceptual diagrams such as ER and BPMN process diagrams. This would be very useful especially from the point of view of a business analyst, who usually starts by defining these kinds of diagrams.

The research questions structure the research along three directions: diagram interpretation and management, source code generation, and the knowledge required of the user. The results of the human user's interaction with ChatGPT were documented, and several criteria were formulated for each research question. The various types of results obtained were evaluated, e.g. explanations, transformations, improvements, code, and offered support.

From the conducted experiments, we may conclude that, at this stage, ChatGPT can be used much more as an assistant tool in developing software applications than as a reliable developer. Since it provides very good results for small and very well specified tasks, it may be included as an assistant actor in iterative or agile software development approaches.

As further work, we propose to repeat the experiments using other LLMs, such as Gemini, or using ChatGPT-Plus. Another direction of investigation would be to modify the initial problem so that it contains not only diagrams but also descriptive functional requirements.

REFERENCES

Anderson, D. J. (2012). Lessons in agile management: On the road to Kanban. Blue Hole Press.

Ayad, S. and Alsayoud, F. (2024). Exploring ChatGPT Prompt Engineering for Business Process Models Semantic Quality Improvement, pages 412–422.

Bala, S., Sahling, K., Haase, J., and Mendling, J. (2025). ChatGPT for tailoring software documentation for managers and developers. In International Conference on Agile Software Development, pages 103–109. Springer.

Beck, K. (1999). Embracing change with extreme programming. Computer, 32(10):70–77.

Boehm, B. W. (1988). A spiral model of software development and enhancement. Computer, 21(5):61–72.

Chisăliță-Crețu, M. C., Osman, C., Sterca, A., and Niculescu, V. (2025). ChatGPT response collection used for evaluation. https://figshare.com/s/3bdcd6a7686a0ce20610.

Dakhel, A., Majdinasab, V., Nikanjam, A., Khomh, F., Desmarais, M., and Ming Jiang, Z. (2022). GitHub Copilot AI pair programmer: Asset or Liability?

France, R. and Rumpe, B. (2007). Model-driven development of complex software: A research roadmap. In Future of Software Engineering (FOSE '07), pages 37–54.

Hill, D. R. (1996). Object-Oriented Analysis and Simulation. Addison-Wesley Longman Publishing Co., Inc., USA.

Kanuka, H., Koreki, G., Soga, R., and Nishikawa, K. (2023). Exploring the ChatGPT approach for bidirectional traceability problem between design models and code.

Keller, G., Scheer, A.-W., and Nüttgens, M. (1992). Semantische Prozeßmodellierung auf der Grundlage “Ereignisgesteuerter Prozeßketten (EPK)”. Inst. für Wirtschaftsinformatik.

Kim, D.-K. (2024). Comparing Proficiency of ChatGPT and Bard in Software Development, pages 25–51. Springer Nature Switzerland, Cham.

Liukko, V., Knappe, A., Anttila, T., Hakala, J., Ketola, J., Lahtinen, D., Poranen, T., Ritala, T.-M., Setälä, M., Hämäläinen, H., and Abrahamsson, P. (2024). ChatGPT as a Full-Stack Web Developer, pages 197–215. Springer Nature.

OMG (2001). Model driven architecture (MDA). https://www.omg.org/cgi-bin/doc?ormsc/01-07-01.pdf.

OMG (2013). Business Process Model and Notation (BPMN) Specification, Version 2.0.2. https://www.omg.org/spec/BPMN/2.0.2/.

Rajbhoj, A., Somase, A., Kulkarni, P., and Kulkarni, V. (2024). Accelerating software development using generative AI: ChatGPT case study. In Proceedings of the 17th Innovations in Software Engineering Conference, pages 1–11.

Royce, W. (1970). Managing the development of large systems: Concepts and techniques. In 9th International Conference on Software Engineering. ACM, pages 328–38.

Schwaber, K. and Beedle, M. (2001). Agile software development with Scrum. Prentice Hall PTR.

Siam, M. K., Gu, H., and Cheng, J. (2024). Programming with AI: Evaluating ChatGPT, Gemini, AlphaCode, and GitHub Copilot for Programmers.

Wermelinger, M. (2023). Using GitHub Copilot to solve simple programming problems. In Proceedings of the 54th ACM Technical Symposium on Computer Science Education V. 1, SIGCSE 2023, pages 172–178, New York, NY, USA. Association for Computing Machinery.
