
ing and generation with agents, internal and external tools (Yao et al., 2022), and source code reflection. First, we justify our contribution of fine-tuning an open-source LLM at the core of the framework and its advantages compared to relying solely on external models such as GPT-4 (OpenAI et al., 2024). Fine-tuning an internal LLM allows organizations to adapt the model to their needs and improve consistency with domain-specific knowledge and terminology (Lv et al., 2024). This approach improves data protection, reduces long-term costs, and provides more control and flexibility when updating and deploying models. The methods also consider the practicality of using the tool in production. The goal is therefore to develop a small open-source model that can run on both the CPUs and GPUs of end users. In this sense, during our experiments, we found that a 7B/8B-class model (seven to eight billion parameters) such as Llama3.1 (Grattafiori et al., 2024) can provide a good trade-off between computational efficiency and ease of use.
Working Mode. The interaction between users and the framework takes place via a conversational assistant that works in the following way. The complete test creation is coordinated step by step, with the human giving instructions in natural language. The assistant then helps with the correct syntax and uses the LLM to determine which available implementation in the source code of the internal project each natural-language instruction can refer to.
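
To make this concrete, below is a minimal sketch (in Python) of how a natural-language instruction could be matched against step implementations extracted from the project's source code. The names StepImpl and match_step, as well as the choice of embedding model, are illustrative assumptions on our part, not the framework's actual API.

from dataclasses import dataclass
from sentence_transformers import SentenceTransformer, util

@dataclass
class StepImpl:
    pattern: str   # e.g. "the player opens the inventory" (hypothetical)
    function: str  # qualified name of the bound step implementation

model = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative model choice

def match_step(instruction: str, steps: list[StepImpl], top_k: int = 3):
    # Rank candidate step implementations by semantic similarity
    # between the instruction and the known step patterns.
    query = model.encode(instruction, convert_to_tensor=True)
    corpus = model.encode([s.pattern for s in steps], convert_to_tensor=True)
    scores = util.cos_sim(query, corpus)[0]
    ranked = sorted(zip(steps, scores.tolist()), key=lambda p: -p[1])
    return ranked[:top_k]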
The tool avoids the problems observed in previous work (Karpurapu et al., 2024), where the most common syntax errors were missing keywords, wrongly ordered parameters or names, and incorrect formatting. The proposed LLM can draw conclusions from features processed with core NLP techniques and fix these problems in almost all cases. In our context, the problem of missing links (i.e., incorrectly linked step implementations) is solved with a human in the loop: the assistant asks for help by informing the user that it could not find a link to the implementation. In addition, the user can edit the generated response if the LLM's suggestions are incorrect.
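
Continuing the sketch above, the human-in-the-loop fallback could look as follows; the similarity threshold and the console prompts are illustrative assumptions, not the framework's actual behavior.

SIMILARITY_THRESHOLD = 0.6  # illustrative value; tuned per project in practice

def resolve_step(instruction: str, steps: list[StepImpl]) -> str:
    # Propose a step binding, deferring to the user when unsure.
    candidates = match_step(instruction, steps)
    if not candidates or candidates[0][1] < SIMILARITY_THRESHOLD:
        # Missing link: inform the user instead of guessing.
        print(f"No implementation found for: '{instruction}'.")
        return input("Please provide the step implementation to use: ")
    best, score = candidates[0]
    print(f"Suggested: {best.function} (similarity {score:.2f})")
    # The user may accept the suggestion or edit it in place.
    edited = input("Press Enter to accept, or type a correction: ").strip()
    return edited or best.function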
In the following, we summarize the contributions of our work:
• According to our research, this is the first study to investigate the use of LLMs in combination with NLP, human-in-the-loop, and Agentic AI techniques to automate BDD test creation in an industrial setting, reducing manual effort and engaging all project stakeholders more effectively. The concept of Agentic AI (Kapoor et al., 2024) is used to let the LLM reason and combine tool (function) calls iteratively to solve requests (see the sketch after this list). By using the human-in-the-loop concept, users retain full control over the generation, with the AI assistant acting as a helper in this process.
• The proposed methods and evaluation examples were developed following discussions with industry partners to understand the gaps in improving product testing. In our case, we used two games publicly available on the market. According to our observations, BDD tests can be written with AI autocorrection, similar to how LLMs help in software development through code autocompletion.
• The proposed framework, called BDDTestAIGen, is available as open source at https://github.com/unibuc-cs/BDDTestingWithLLMs.git. It has a plugin architecture with scripts to adapt to new use cases and domains or to exchange components (e.g., the LLM) as required. A Docker image is also provided for faster evaluation.
• We consider the computational effort justified to run the methods on developer machines. From this perspective, we evaluate and conclude that small models (such as Llama3.1 8B), fine-tuned and combined with various NLP feature-processing and pruning techniques, can provide the right balance between cost and performance.
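
As referenced in the first contribution above, a minimal sketch of the iterative tool-calling loop behind the Agentic AI concept is given below. The tool registry, message format, and the chat helper are hypothetical stand-ins (here mocked so the loop runs end to end), not the framework's internals.

import json
from typing import Callable

TOOLS: dict[str, Callable[..., str]] = {
    # Placeholder tools; real ones would inspect or exercise the project.
    "list_step_implementations": lambda: "step inventory ...",
    "run_bdd_scenario": lambda name: f"ran {name}",
}

def chat(messages: list[dict]) -> dict:
    # Stand-in for a call to the fine-tuned LLM (e.g. Llama3.1 8B).
    # This mock requests one tool call, then answers directly.
    if not any(m["role"] == "tool" for m in messages):
        return {"tool_call": {"name": "list_step_implementations", "args": {}}}
    return {"tool_call": None, "content": "Generated BDD scenario ..."}

def agent_loop(user_request: str, max_steps: int = 5) -> str:
    # Let the model alternate between reasoning and tool calls.
    messages = [{"role": "user", "content": user_request}]
    for _ in range(max_steps):
        reply = chat(messages)
        if reply.get("tool_call") is None:  # model answered directly
            return reply["content"]
        call = reply["tool_call"]
        result = TOOLS[call["name"]](**call.get("args", {}))
        messages.append({"role": "tool", "content": json.dumps(result)})
    return "Stopped after reaching the step limit."

print(agent_loop("Create a BDD test for the login flow"))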
The rest of the article is organized as follows. The next section presents related work in the field and our connections and innovations relative to it. A contextual introduction to BDD and its connection to our goals and methods can be found in Section 3. The architecture and details of our implementation are presented in Section 4. The evaluation in an industrial prototype environment is shown in Section 5. The final section discusses the conclusions.
2 RELATED WORK
Before the LLM trend, a combination of AI and specific NLP techniques was used to improve BDD test generation. The review paper (Garousi et al., 2020) discusses how common NLP methods, code information extraction, and probabilistic matching were used to automatically generate executable software tests from structured English scenario descriptions. A concrete application of these methods is the work in (Storer and Bob, 2019). These methods were very inspiring, as using them together with LLMs can, in our experience, increase computational performance.
A parallel but related and insightful work is (Ouédraogo et al., 2024), in which the authors in-