Enhancing User Experience in Games with Large Language Models

Ciprian Paduraru

, Marina Cernat

and Alin Stefanescu

1,2

Department of Computer Science, University of Bucharest, Romania

Institute for Logic and Data Science, Romania

Keywords:

Large Language Models, Retrieval Augmented Generation, Teacher Model, Fine-Tuning, Video Games,

Active Assistance, Simulation Applications.

Abstract:

This paper explores the application of state-of-the-art natural language processing (NLP) technologies to im-

prove the user experience in games. Our motivation stems from the realization that a virtual assistant’s in-

put during games or simulation applications can signiﬁcantly assist the user in real-time problem solving,

suggestion generation, and dynamic adjustments. We propose a novel framework that seamlessly integrates

large-scale language models (LLMs) into game environments and enables intelligent assistants to take the

form of physical 3D characters or virtual background entities within the player narrative. Our evaluation con-

siders computational requirements, latency and quality of results using techniques such as synthetic dataset

generation, ﬁne-tuning, Retrieval Augmented Generation (RAG) and security mechanisms. Quantitative and

qualitative evaluations, including real user feedback, conﬁrm the effectiveness of our approach. The frame-

work is implemented as an open-source plugin for the Unreal Engine and has already been successfully used

in a game demo. The presented methods can be extended to simulation applications and serious games in

general.

1 INTRODUCTION

Following the current development in the ﬁeld of

Large Language Models (LLMs), our primary con-

tribution in this work is to explore and develop ap-

proaches that allow the end user to beneﬁt from their

potential in games or simulation applications in gen-

eral. The main problem we are addressing is how to

support the user at runtime with the help of an assis-

tant who can be asked either by text or voice. The

entity assisting the user can be represented virtually

in the simulation environment, e.g. a non-playable

character (NPC) or a narrator/environmental listener.

Use Cases. Together with industry partners, we inves-

tigated use cases for the inclusion of such an assistant

in games and simulation applications in general. The

ﬁrst category investigated relates to the classic senti-

ment analysis (Wankhade et al., 2022). An example of

this is when a user is playing with an NPC and states

either verbally or in writing that the difﬁculty of the

simulation is too high or not challenging enough. In

this case, one option would be to dynamically change

the difﬁculty level so that the user has an engaging

experience. Another concrete example: Imagine the

user is playing an ice hockey game like NHL

and is

https://www.ea.com/en-gb/games/nhl

unhappy with the referee’s decision. Reactions that he

expresses loudly could prompt the referee in the game

to penalize the user’s actions in order to entertain and

motivate him.

Furthermore, we have discovered that NLP tech-

niques can be used to construct NPC companions that

physically exist in the simulated environment and can

be prompted by a voice or text command to help the

user. Speciﬁc examples are when the user is stuck at a

destination or is unable to ﬁnd the way to a particular

destination. In this situation, the NPC companion can

understand the user’s request and guide them to a spe-

ciﬁc destination or give advice on how to solve vari-

ous tasks. Another source of irritation for users when

dealing with virtual worlds is their inability to under-

stand the many designed dynamics. This is a common

problem in the industry that leads to users abandoning

the applications before the developers can make rev-

enues or provide an adequate gaming experience. In

this scenario, we see NLP as a viable option where

the user can ask questions that a chatbot can answer

live to assist them. Speciﬁc examples include ques-

tions about healing procedures, ﬁnding certain items

in a particular room, determining what is needed or

missing to achieve certain goals, etc.

Paduraru, C., Cernat, M. and Stefanescu, A.

Enhancing User Experience in Games with Large Language Models.

DOI: 10.5220/0012854600003753

In Proceedings of the 19th International Conference on Software Technologies (ICSOFT 2024), pages 293-304

ISBN: 978-989-758-706-1; ISSN: 2184-2833

293

Contributions. Our main research goal is to create

a novel framework that enables game developers to

use LLMs in their games in a reusable and efﬁcient

way in terms of computational power. To the best of

our knowledge, this is the ﬁrst work that addresses the

problem of integrating LLMs into games for real-time

inference that are locally deployed and fulﬁll develop-

ers’ requirements in terms of use cases and technolo-

gies. The contributions are summarized below: Our

contributions are summarized below.

• A reusable and ﬂexible open-source framework

for game developers (and simulation applica-

tions in general) to integrate LLMs and re-

lated processes into their products. We re-

fer to this as GameAssistant and its open-

source repository https://github.com/AGAPIA/

NLPForVideoGames is intended to act as a plu-

gin for the Unreal Engine

, a game engine

widely used in games, the simulation industry and

academia. User input can be in the form of voice

or text messages.

• Prototyping and evaluation of different techniques

to reuse processes and tools in the LLM domain

for simulation software and video games, such as

Retrieval Augmented Generation (RAG), agents

and tools, ﬁne-tuning pipelines.

• A mechanism to incorporate a teacher LLM

model to create a synthetic dataset with rigorous

structure, input and output for customized game

actions. This is used to reduce the cost of obtain-

ing a ﬁne-tuning dataset.

• Address the efﬁciency problem of integrating

LLMs into games for end users without incur-

ring additional cloud costs or requiring a dedi-

cated GPU, running everything on the machine

where the game is played. Our solution is based

on evaluating and reusing small models that are

further trained (ﬁne-tuned) depending on the use

case to achieve the required quality. The model

currently used is Phi-3-mini (Abdin et al., 2024),

a language model with 3.8 billion parameters re-

cently released by Microsoft. It is integrated in

a plugin-compatible form so that other extensions

can also be integrated.

The rest of the work is structured as follows. Sec-

tion 2 shows use cases of NLP applications and large

language models in different domains that have ex-

plored similar solutions in different environments.

Section 3 gives a sketch presentation of the ﬁeld of

LLMs, both theoretical and practical, including the

motivation for the current model choice. A general

https://www.unrealengine.com

overview of the supported features is shown in Sec-

tion 4. The architecture and implementation of the

GameAssistant framework are presented in Dection

5. The evaluation results are presented in Section 6,

which includes both a quantitative and a qualitative

evaluation as well as information on our setup, ob-

servations and datasets. The ﬁnal section summarizes

our ﬁndings and suggestions for future work.

2 RELATED WORK

Conversational Agents (CA) are widely used in var-

ious industries, including medicine, the military and

online shopping, as reported in the review study (Al-

louch et al., 2021). These agents are usually vir-

tual agents that attempt to communicate with inter-

ested people and answer their queries, at least until

they receive information from them, which is then

passed on to actual human agents. Recently, LLMs

have improved the capabilities of traditional NLP

techniques in building virtual agents. The authors

of (Guan et al., 2023) investigate the application of

LLMs to improve the natural language understanding

and reasoning capabilities of intelligent virtual assis-

tants. The researchers present an innovative LLM-

integrated virtual assistant capable of autonomously

performing multi-step tasks within mobile applica-

tions in response to high-level user commands. The

implemented system, which is at its core a mobile

payment application, provides a comprehensive solu-

tion for interpreting instructions, deriving goals and

performing actions. The article in (Cascella et al.,

2024) explores the potential uses of LLMs in health-

care, focusing on their role in chatbots and systems

that interact in the management of clinical docu-

mentation and medical literature summarization (also

known as Biomedical NLP). The main hurdle in this

area is research into their application in diagnostics,

clinical decision support and patient triage. The re-

ported results are from one year of using LLMs in

real-life situations. CAs were also used for gamiﬁca-

tion reasons. In (Yunanto et al., 2019), the authors

propose an educational game called Turtle Trainer

that uses an NLP method for its non-playable charac-

ters (NPCs). In the game, the NPCs can automatically

respond to other players’ questions in English. Hu-

man users can compete against NPCs, with the player

who answers the most questions correctly winning a

round. While their methods for understanding and an-

swering questions are based on standard NLP meth-

ods, we adopt their scoring strategy, which is based on

qualitative feedback from two perspectives: (a) How

do human users perceive how well their opponents,

ICSOFT 2024 - 19th International Conference on Software Technologies

294

i.e. NPCs, understand and answer the questions? (b)

Does the existence of NPCs in this way indicate a

stronger interest in the learning game itself? Duolingo

(Munday, 2017) is a common platform for learning

different languages. It supports the development of

educational games, such as language learning through

gamiﬁcation , as well as machine learning-based ap-

proaches to performance testing. Using the platform

itself and NLP methods, the authors recommend the

use of automatically generated language tests that can

be scored and psychometrically analysed without hu-

man labour or supervision.

LLMs in Games. The work that comes closest to

our goals is (Paduraru et al., 2023), in which the au-

thor uses classical NLP techniques based on similar-

ity metrics and intent models to implement a chat as-

sistant in video games. However, they do not con-

sider an ongoing conversation or any additional con-

text and are strict regarding the phrases and natural

language variations that can be used. In comparison,

our work integrates a comprehensive language model

that has the following features: a) fast adaptation to

different user styles by training massive datasets con-

taining different language formats, b) support for vari-

able context that can be used as knowledge with the

RAG method, and c) full support for the history of

messages during a conversation. A motivating re-

lated work is presented in (Isaza-Giraldo et al., 2024),

in which the authors explore LLMs as evaluators in

serious games, focusing speciﬁcally on open-ended

challenges. The researchers developed a prototype

sustainability game about energy communities and

tested it with ChatGPT-3.5, ﬁnding that it scored 81%

of player responses correctly and provided valuable

learning experiences. The results suggest that LLMs

have the potential to act as mediators in educational

games and that they can facilitate the development of

game prototypes through natural language prompts.In

(Cox and Ooi, 2024), the authors discuss how the ca-

pabilities of LLMs can be used to improve the dynam-

ics and variety of conversational responses of NPCs

in video games. The paper explores what guidelines

can be created for designers to effectively incorporate

LLMs into NPC dialogs and the potential impact from

different perspectives to drive further research and de-

velopment efforts. We follow their study and integrate

their observations into our current work. In (Huber

et al., 2024), the authors discuss the transformative

potential and related challenges of LLMs in education

and how these challenges could be addressed with the

help of game and game-based learning. The paper ex-

plores how a new pedagogy of learning with AI could

utilize LLMs for learning by generating games and

gamifying learning materials. The work in (Gallotta

et al., 2024) explores the diverse functions that LLMs

can perform within a gaming context and provides

an overview of the latest advancements in the various

uses of LLMs for gaming purposes. The authors also

delve into areas that have not been fully explored and

suggest potential avenues for the future application of

LLMs in gaming. The work strikes a balance between

the capabilities and constraints of LLMs in the realm

of gaming. It offers an understanding of how an LLM

can be incorporated into a broader system for sophis-

ticated gameplay. The authors express optimism that

this article will lay the groundwork for pioneering re-

search and innovation in this rapidly evolving ﬁeld. In

comparison to ours, their work is a survey or highg-

lights of potential without ofering a technical deep so-

lution or architecture. We reuse their presented ideas

and future and ofer a technical framework that can

satisfy their discussed requirements.

Agents and Tools. The main goal of these elements

is to bridge the gap between user requests and back-

end systems through tasks expressed in natural lan-

guage. There are numerous strategies in this area. The

technique we have included in our project is one that

embeds reasoning traces and task-speciﬁc actions into

language models (Yao et al., 2023). This combination

allows the model to gain a comprehensive understand-

ing of the tasks and dynamically adapt its behavior

based on the state of the backend system and the con-

text of the conversation.

We have also experimented with another method,

known as (Schick, 2023), which offers several ad-

vantages and disadvantages compared to the previous

method. In their study, a language model (LLama 2

7B (Touvron et al., 2023)) is trained speciﬁcally for

independent interaction with external tools via sim-

ple APIs. By carefully selecting APIs, incorporat-

ing calls at the right time, and seamlessly integrating

their results into subsequent token predictions, Tool-

former effectively reduces the distance between lan-

guage modeling capabilities and practical tool usage.

However, we could not fully implement this approach

as it requires a dataset of API call examples, which

we could not create in time. A detailed comparison is

planned for future work.

When it comes to large systems and APIs, a more

appropriate strategy might be to include Gorilla (Patil

et al., 2023). In their research, the authors ﬁne-tuned a

language model that outperforms GPT-4, especially in

the area of API request formulation. Gorilla, through

the use of APIs and integration with a document re-

triever, outperforms other methods in terms of adapt-

ability to document changes during testing. This

adaptability signiﬁcantly increases reliability and ap-

plicability across a wide range of tasks, especially for

Enhancing User Experience in Games with Large Language Models

295

large projects.

Retrieval Augmented Generation (RAG). To sup-

port RAG in our application, we use the Faiss library

(Douze et al., 2024), (Johnson et al., 2019), which is

designed for fast vector similarity search. It provides

a comprehensive toolkit with indexing techniques and

associated primitives for tasks such as search, cluster-

ing, compression and vector transformation.

3 A CONTEXTUAL

INTRODUCTION TO LLMs

Language Models (LMs) are statistical models that

assign probabilities to word sequences by capturing

the inherent structure and patterns of natural lan-

guage. An autoregressive language model (LLM), on

the other hand, predicts the next token in a sequence

based on the previous tokens. The main goal is to train

a model that improves the likelihood of text data.

Consider a text sequence represented as

, s

, ..., s

), where s

denotes the token at po-

sition t. The probability of predicting w

considering

the previous context (s

, s

, ..., s

t−1

) is denoted by

P(s

, s

, ..., s

t−1

). The typical objective function in

this case is to minimize the cross entropy loss, i.e., to

maximize the conditional probability associated with

a given text sequence, Eq (1).

∑

t=1

−logP(s

, s

, ..., s

t−1

) (1)

Some of the well-known applications of au-

toregressive language models are: text generation,

translation, code auto- completion, question-

answering.

Architectures. Currently, the known LLMs are

based on the transformer (Vaswani et al., 2023)

architecture and different variants of self-attention

mechanisms. There are three known architectures:

encoder only, encoder-decoder and decoder only. The

models we experiment with in this paper fall into

the decoder-only architecture class, which uses only

the decoder component of the traditional transformer

architecture. In contrast to the encoder-decoder

conﬁguration, which contains both an encoder and a

decoder, the decoder-only architecture focuses solely

on the decoding process. It generates the tokens

sequentially, taking into account the previous tokens

in the sequence. This approach has proven itself for

tasks such as text generation, as it eliminates the need

for an explicit encoding phase. It is also suitable for

our use case in which the chatbot receives a context,

a user task, and has to generate a response in the form

of a text.

Embeddings. The text representation must be con-

verted to a sequence of ﬂoating point numbers to be

useful for training LLMs with transformers. This rep-

resentation is called embedding. It involves moving

a sequence of N words into a space of R

N×D

, where

D is the dimension used to embed each word. Intu-

itively, the goal is to give each word a ﬂuent repre-

sentation such that the distance or angle between two

similar words is small and large when the semantics

of the words are different.

We brieﬂy mention some other concepts used

along Section 5. Prompt Engineering is a technique

used to control the responses of Large Language

Models (LLMs) by formulating speciﬁc statements or

questions to elicit the desired results from the model.

As shown in the literature, providing examples in a

speciﬁc order and selecting the content for the prompt

is important for the quality of the results (Marvin

et al., 2024). Retrieval Augmented Generation (RAG),

is a method that extends the capabilities of LLMs by

incorporating knowledge from external sources, gen-

erally outside the training data used by the LLM. This

helps to improve the accuracy and credibility of the

generated content and avoid hallucinations. (Zhao

et al., 2024). agents refer to entities that can think,

plan and act individually when prompted with a query

by querying the LLM. The original problem is split

into smaller parts, some of which are solved using

tools or functions, which are interfaces provided by

the backends of the system. In this way, the LLM

is able to access external functionalities and provide

better quality answers, again avoiding hallucinations.

Fine-Tuning. The LLM models are trained in a

series of steps: a) preparing a training corpus of

data, b) pre-training the LLM, c) ﬁne-tuning the

model using speciﬁc data or tasks. Since our work

reuses the Phi-3 models, we start with step c) and

use general supervised ﬁne-tuning (SFT). Here, the

baseline LLM obtained after steps a) and b) is further

trained using a dataset of (instruction, output) pairs.

This process of data set acquisition and ﬁne-tuning is

explained in detail in Section 5.5.

Motivation for Fine-Tuning. The efﬁciency of

LLMs can be signiﬁcantly improved by using

domain-speciﬁc datasets (OpenAI et al., 2024) for ap-

plications that require a combination of general and

domain-speciﬁc language as well as the understand-

ing of technical terms. The efﬁciency of smaller ﬁne-

tuned models and their performance after ﬁne-tuning

is also described in the literature, and two main cases

ICSOFT 2024 - 19th International Conference on Software Technologies

296

are of interest for our work.

a) Training of models for code generation from gen-

eral language models, (Rozi

ere et al., 2024), (Li

et al., 2023)

b) Fine-tuning of models for interaction with ex-

ternal agents and tools (Schick, 2023), (Schick,

2023), (Patil et al., 2023) on different data sets

and use cases.

Our methods follow these two themes in ﬁne-

tuning the foundation LLM to achieve the proposed

goals. The former is used by the framework to make

the LLM aware of static content of a game that can be

presented in a textual format (e.g. manuals, guides,

Youtube transcripts, etc.), while the latter trains the

LLM to invoke functionalities exposed by the game

developer in a decoupled way. Sections 5.3 and 5.5

present the technical details.

4 FRAMEWORK OVERVIEW

Assistant Features. After discussions with the game

developers, we have set up some concrete use cases

where an assistant can be used in the game either in

physical form (NPC character) or without one (narra-

tive character). Note that the implementation of the

proposed framework is not limited to the feature list,

but that its architecture and components have been de-

veloped with generality in mind.

1. Question Answering.

This category is about situations where the user

needs help or support. Two subcategories have

been identiﬁed:

• Context-related situations during a game situa-

tion: e.g. How can I defend myself efﬁciently

against this fast attacker? - during a football

match.

• General questions that are in a loaded manual

and do not depend on the game: e.g. In this

UI window, where do I ﬁnd the ammo upgrade?,

”Which buttons do I have to press for a chip shot

to trick the goalkeeper?”.

2. Explicit actions requests.

These are requests from the user to the game wiz-

ard to perform various actions. Typical use cases

are: (a) requesting hints or solutions to solve puz-

zles in contextual situations, (b) requesting a com-

panion NPC (in this case the physical assistant) to

show a path to a destination that the user has difﬁ-

culty reaching, (c) helping to eliminate an enemy

character, or even (d) changing settings to improve

performance (e.g. rendering options) or volume

options without going through menus.

3. Actions derived through sentiment analysis This

category includes cases where the user requests or

responds to a change in settings but does not spec-

ify exactly what they want. For example, the user

could simply say by voice: This opponent is too

hard to defeat/too easy. In this case, the inten-

tion is not clearly deﬁned, but a sentiment analysis

module tries to understand it and transmit it as in-

put to the assistant to then take appropriate action.

Another use case that is required from an indus-

try perspective is to capture automatically detected

defects or problems. For example, if a visual arti-

fact or incorrect physics simulation is displayed in

the current state of a game and the user vocalizes

this, a quick capture of memory dumps and screen-

shots can be grabbed and sent to the developers as

an open issue.

Input. The player can enter a command for the

assistant via voice or text chat, Figure 1. In the case

of a voice message, it is ﬁrst converted into text

using the OpenAI Whisper-tiny speech-to-text model

(Radford et al., 2022). The GameAssistant then uses

both the PlayerState information (e.g. the user’s

current situation as represented and exported by the

game) and the data provided by the game (e.g. the

current context the user is in, available environments,

static information such as manuals or user guides,

etc.) to analyze and answer the user’s questions.

The extracted data is used for the LLM within the

GameAssistant component to provide a more targeted

response, either in the form of a message or an

action or a change in game state. Section 5 provides

details on the implementation of these interactions

and components. To increase the generality of the

interaction between the GameAssistant, the user’s

state (PlayerState) and the game data, the framework

keeps the communication as decoupled as possible

by using interfaces.

Base Language Model. One of the main criteria in

selecting a base model for further ﬁne-tuning was the

required computing resources and latency in respond-

ing to requests, taking into account the available end-

user hardware. With these considerations in mind, we

opted for Phi-3 (Abdin et al., 2024), a model with

3.8B (billion) parameters. In particular, we opted for

the version with token windows of up to 4096 tokens

instead of the 128K version, as it fulﬁls the goals

of the assistance function while being cost-efﬁcient.

A detailed comparison to motivate the choice is pre-

sented in Section 6.

Enhancing User Experience in Games with Large Language Models

297

GameAssistant

User

Is using

voice ?

Text or

vocal

message

Speech

Recognition

Yes

Message processing module

Text

message

PlayerState

e.g. current health,

ammo, statistics,

available items,

mission progress

Physical NPC

character or

virtual narrator.

Game

Static

information:

manuals,

guides.

Environments

specifications

Current running game context:

characters, states, available

locations, etc.

Send

commands

Get data

Respond to messages

Use PlayerState as contextual information

Figure 1: For an overview of the interaction between the components when the user, who is in a certain state during a game,

sends a message either verbally or via text. The ﬁrst step in the pipeline is to convert this message into a text representation

and then pass it on to the GameAssistant framework, which uses NLP techniques to extract information from both the game

and the user’s state and then react or perform actions depending on the situation.

5 ARCHITECTURE AND

IMPLEMENTATION

The architecture and components are designed to use

a plugin pattern. This allows the integration of higher

capacity models or other ﬁne-tuning methods, as well

as the ability to enable or disable different functional-

ities and components depending on the requirements

of speciﬁc use cases. This approach corresponds to a

principle of software engineering known as separation

of concerns, which we want to take into account when

developing the framework. Figure 2 illustrates the in-

teraction between the system and its components dur-

ing a user conversation. The LangChain

APIs are

used to establish connections between the component

interfaces, monitor the conversation and manage the

message ﬂow.

5.1 Conversation Handling

At each time t, a conversation history H

is kept so that

the assistant is able to respond in the context of previ-

ously requested messages. When a new user message

Q is recognized, representing a request such as a ques-

tion, an action request or a response from the user, the

ﬁrst step is to ask the LLM model to create a stan-

dalone request that contains both the history and the

new user’s request aggregated in a single prompt. In

the ﬁgure, a concrete example is given by the element

. In this step, the ability of LLMs to summarize

long texts is used. If the input text is longer than the

4096 token limit, only the last part of the conversation

is kept within this limit. Our experiments have shown

that in a conversational environment, it is beneﬁcial to

store a limited number of the last conversational mes-

sages and forget what the context of the discussion

was for more than the last 10 messages. To obtain

, a summary request is made to LLM via a prompt

template as speciﬁed in Listing 1.

https://www.langchain.com/

1 [ I NST ] Rephras e th e fo l l o win g c o n v e r sat i o n an d su b s e q uent

re q u est so that it is a s t an d -al o ne que sti on , in

it s or i gin a l l ang u a g e .

2 C o n v e r s a tio n :

3 {conversation H var}

4 Fol lo w -up requ e s t :

5 {request Q var}

6 St an d - alo n e re q u est :

7 [/ INS T ]

Listing 1: The prompt template used to generate a

standalone query Q

from the ongoing discussion H and

the new user’s query Q. The variables enclosed in curly

brackets are used to populate the prompt with speciﬁc

instances. label

5.2 Retrieval Augmented Generation

(RAG)

To summarize and extract content from the game and

the user’s current state, the framework uses Retrieval

Augmented Generation (RAG). Figure 3 shows the

detailed architecture of the RAG component, which

was ﬁrst shown in Figure 2. The data potentially

queried at runtime is divided into two parts:

• A static part - it represents data that does not

change during a game depending on user actions,

e.g. manuals, instructions, environment and item

descriptions, etc.

• A dynamic part - data that is constantly changing,

in our case represented by the PlayerState and the

current context of the game.

The static part can be represented in any text for-

mat (PDF, Word documents, etc.), Youtube video

transcripts (a commonly used resource in the ﬁeld

of game development), formatted data sets (in CSV,

SQL, JSON ﬁles, etc.). These are processed ofﬂine

with the Langchain API, i.e. not during the run-

time of a game, by splitting them into text sections,

embedding them and then indexing them in a vector

database.

For the dynamic part, which is constantly chang-

ing, a general interface is needed that allows the de-

velopers of a game to share the information for the

ICSOFT 2024 - 19th International Conference on Software Technologies

298

Chat history

New user question,

request, or reaction

E.g. "How can I get

this weapon upgrade

then?

1. Get standalone

contextual request

How can I get my sword weapon

upgraded to beat this opponent

without loosing too much life?

Discussion was about

challenging a character

in a game level

GameAssistantLLM

2. Given get

relevant contextual

data stored

RAG Safeguarding

3. Get final answer

+ retrieved context

Done

Safe

Unsafe ? Ask for rewrite

given unsafe reason

Evaluate

Embedding

(encode / decode operations)

Tools support

Game

Use

(optional)

User

Player state

Extract data

Conversation and

Sentiment Analysis

Plugin

Figure 2: The conversational chat system implemented in the proposed method and its associated components. The ﬂow is

represented by the blue arrows. The three main steps (the green boxes) are used to aggregate the conversation ﬂow, retrieve

internally relevant stored data, and then obtain the ﬁnal response after the safeguard component approves the results. The

optional step is used by the LLM to interact with the registered tools on the deployed platform.

RAG. The reason for this is that each game has a dif-

ferent context of information (e.g., in a shooter game

it could be information about weapons and locations,

while in a football game it could be about statistics

of the current game, the morale and energy status of

the players, etc.). To represent the information from a

game in a generic form, the framework uses the Jinja

template format, inspired by the way LLMs use the

same method to adapt different types of datasets to

text representation in the process of training or ﬁne-

tuning (Rozi

ere et al., 2024), (Touvron et al., 2023),

(Gallotta et al., 2024).

The conversion of the text representation into a

vector database for embedding is done using the Faiss

(Douze et al., 2024), (Johnson et al., 2019) which is a

state-of-the-art indexing and retrieval method widely

used in the construction of real-time assistants based

on LLMs. The vector database is divided into two

parts, as the static one, VectorStore

static

, does not

change at all during the runtime of a game, while the

dynamic one, VectorStore

dynamic

, is constantly chang-

ing and updated at each V Store

rate

For a new query Q

, both stores are checked,

and the best K results based on the similarity met-

ric are returned by Faiss. However, a threshold

RAG

is used for this similarity to ﬁlter the results

and have ﬁner control over the quality of the out-

put results. The extracted information is used to

help the LLM give a better quality response and

avoid hallucinations as much as possible. For each

query/prompt, the information is fed into the template

variable CONT EXT VAR, as shown in Listing 2, line

16.

https://jinja.palletsprojects.com/en/3.0.x

5.3 Tools and Interaction with the

Backend Systems

The framework uses the ReACT agents (Yao et al.,

2023) from the Langchain API, with ﬁne-tuning of

the LLM as described in Section 5.5. This allows our

project to integrate state-of-the-art methods that han-

dle the interaction between the user, the LLM and the

system processes (in our case a game) in a decoupled

way. The concepts are known in the literature under

the terms Agents, and Tools, or Functions, (Yao et al.,

2023), (Schick, 2023), (Patil et al., 2023). In short,

ReACT is an AI agent based only on the language

model that goes through a cycle of Thought-Action-

Observation in which it tries to decompose the prob-

lem into easier-to-solve sub-problems. At the end,

when it has enough knowledge, it will respond. An

example from the game prototype is shown in Listing

2. Any developer can reuse the same template for the

prompt and add their own customized set of functions

in the variable T OOLS VAR, line 15. The available

functions can be deﬁned in a JSON format, line 1.

When the user (or the ReACT agent itself) makes a

query, line 19, the descriptions of the tools are intu-

itively correlated with the query by the LLM based

on its training. If a strong similarity is detected, an

attempt is made to extract the parameters and return

the function call in order to obtain either an internal

response or a ﬁnal response for the user. To achieve

this goal, the Phi-3 base model had to be ﬁne-tuned,

as shown in Figure 4.

1 T O O L S_V A R =

2 { ... .

3 {’ name ’: ’show path to destination’,

4 ’ d e sc r i pt i o n ’: ’W he n the us e r re q u est s to go to a

ce r t ain p o s iti o n or lo c at i o n in th e ga me , ex t r act

it a n d use as pa r a met e r to th e func tio n ’,

5 ’ p ara m et e r ’: {

6 ’typ e ’: ’o bj ect ’,

7 ’ p ro p ert i es ’: {

Enhancing User Experience in Games with Large Language Models

299

Split into chunks of text

Static Game

information

Embeddings

PDF docs

Chunk 1

Chunk 2

Chunk N

Create embeedings for

each data pieces

Formatted data: sql,

pandas, csv, etc

Retrieval Augmented Generation (RAG)

Youtube transcripts

Player state

Template to expose

information as text

Store

Contextual request

Information

database

Get stored content based on similarity score to

Output:

Top

K results

content as

text.

Score >

Game current state

Figure 3: The pipeline for creating the RAG support for both the static data of the game (gray blocks and black arrows

ﬂowing), as well as for the dynamic state (red colored ﬂow). The upper part represents the indexing process. The lower part

represents the ﬂow of the database query when a new query arrives.

8 ’ loc ati on ’: {

9 ’typ e ’: ’s tr ing ’,

10 ’ d esc r ipt i on ’: ’ The po s it i on , l oca tio n , a re a ,

or ob jec t na me to go ’,

11 ’ req uir ed ’: Tr ue }} }}

12 .. ...

13 }

14 <s > <| s ys t e m |>

15 You are a h elp f u l as s i s t a nt in a vi d eo wit h ac c ess to

th e fo l l owi n g f u nct i o n s : {T OOLS VAR} . Us e them

on ly whe n ne c e s s ary .

16 You can al s o use the fo l low i n g i nfor m a t i o n in yo u r

an s w er : {CONT EX T VAR}.

18 < |user| >

19 Can you show methe path tothe weapons shop whereI can buythe Sword with level 10?

20 <| e nd |>

21 <| a s s i s tan t | >

22 T h o ug h t : To so l ve thi s que ry , I fir s t ne ed to fin d the

lo c a t ion of the le v el 10 sw ord . I w i ll us e the

se a r c h _ m a n u al to ol .

23 A c tio n : search manual

24 A c tio n In p ut : {

′

input

′

sword level 10

′

}

25 O b s e r v atio n : Sw o r d le ve l 10 i s po w erf u l . ..[ o m i tte d te x t

]. .. it can be fou n d in We a p on Sho p Mo u n t a in for

10 coin s and fo r free un d er t h e ro ck ne a r the

wa t e r f all .

26 T h o ug h t : I h ave th e an sw e r no w , it ’ s in Wea p on Shop

Mo u n t ain . I wi l l call a f un c t i o n that sho w s the

pa th .

27 A c tio n : show path to destination

28 A c tio n In p ut : {

′

location

′

Weapon Shop Mountain

′

}

29 R e s pon s e : H e re it is . <| en d | >

Listing 2: An example where a user needs to ﬁnd a path to

a speciﬁc item and the response of the GameAssistantLLM

in ReACT style. Highlighted are the The descriptions of the

tools are deﬁned in the TOOLS VAR variable (for space

reasons, only the path ﬁnding tool is shown). The relevant

information in the current state is extracted with RAG and

fed into the CONTEXT VAR variable. In this example,

the ReACT agent tries to ﬁnd the answer in the internal

manual by calling the search manual function, and when

it has found the answer, it displays the ﬁnal answer with

show path to destination. In the game, this displays a

travel path in 3D to a speciﬁc location, as in Figure 5.

5.4 Safeguarding Interactions

This system component, shown in Figure 2, serves

two main purposes:

(a) Ensuring dialogue compliance: It ensures that in-

teractions between users and chatbots adhere to

various rules, such as gender neutrality and po-

liteness. A comprehensive classiﬁcation of these

rules can be found in the work of (Inan et al.,

2023).

(b) Secure interaction with user requests and tools:

The component guarantees secure interactions

when users make requests or use tools (e.g. when

writing or executing code). If an unsafe response

or request is detected, prompt engineering in-

structs the model to change the original output ac-

cordingly. By integrating these two aspects, the

language learning model (LLM) can usually gen-

erate a safe response in one or more attempts.

In our approach, we ﬁrst investigated the integra-

tion of Llama Guard (Inan et al., 2023), an LLM se-

curity model developed speciﬁcally for this use case.

Llama Guard also provides a taxonomy of security

threats. When prompted, the model ﬁrst determines

whether the response is secure or not. If it is deemed

unsafe, it provides details about the speciﬁc unsafe

element. Considering the computational overhead

of loading a 7B model, our approach also includes

packages from the Natural Language Toolkit (NLTK)

(Bauer et al., 2020). These packages use traditional

NLP techniques to account for different security clas-

siﬁcations in prompts. There is a compromise be-

tween the two solutions: The Llama Guard provides

more precise results without requiring a local taxon-

omy of elements, while the NLTK packages are fast

and memory efﬁcient on the request side.

5.5 Datasets Construction and

ﬁne-tuning

The full ﬂow of the processes that collect the datasets

and ﬁne-tune the Phi-3 model on them is shown in

Figure 4. The purpose is to ﬁne-tune the model to the

three categories of use cases mentioned in Section 4.

ICSOFT 2024 - 19th International Conference on Software Technologies

300

There are two types of datasets that are required for

ﬁne-tuning the base model:

• Data sets for the category Question Answering

(from manuals, guides or video transcripts).

• Records for (explicit or implicit) action requests

related to the state of the game.

The former is obtained directly from the static data

content indexed by the RAG process, Figure 3. We

denote this with D

static

The second area requires more attention in order

to avoid wasting resources. With this in mind, we

ﬁrst start with a set of human annotated datasets (au-

thors and collaborators) of possible questions, Q

initial

as shown in Listing 3. This set contains 15 question

templates, where the variables between curly brackets

are placeholders that must be ﬁlled, e.g. the variable

SETOFOBJECT S in line 1, while the square brack-

ets are optional context variables for different situa-

tions. In the example, there are three types of con-

texts: a) a general one, line 1, b) a situation where

complex logic is required to proceed and the user

might need help or hints, line 2, and c) and object in-

formation in speciﬁc situations, line 3. There are tem-

plates where a context is optional, like a), and those

where the LLM cannot understand the question with-

out a previous context, like cases b) and c). These

contexts could intuitively correspond to the content

retrieved by RAG in Listing 2, line 16. One idea to

further automate this collection process is to use RAG

to annotate more examples.

1 Q

initial

2 Can you sh o w me the pa t h to the { S E T O FOBJ E C T S }? [

CO N T E X T_G ]

3 How can I sol v e this pu zz le , ca n you sh o w me ?{ C ONT E X T _ P }

4 W ha t we a p on s h ou l d I get to de fea t th i s op p o nen t ? {

CO N T E X T_O }

5 ...

6 }

7 TeacherPrompt:

8 You mu st c rea t e {N } vari a n t s for ea c h of th ese :

9 Q_ { i n iti a l }

10 I n s t r u c t ion s :

11 You can va r y the lan g u a ge and / or re p l ace th e va r i a b les

in curl y br a c es and sq u a re b r a cke t s wi th th e

fo l l o w ing :

12 S E T O F O B J ECT S = { Te lep o rte r , W e apo n s Shop ,.. . }

13 C O N T EXT _ P ={ P l aye r is in the wa t er f al l , n ea r the el e va t or

,. . .}

14 C O N T EXT _ O ={ G u a r dia n nea r the temp le , .. . }

15 The va r i a b l es in cu r ly b r ack e t s mu st be fi l le d wi t h one

of t h e sp e c i f ied o pt ion s , the othe r s are o pti o n a l .

Do n o t use an y othe r con t e x ts or op t i o ns o u tsi d e

th e sp e c ifi e d te x t .

Listing 3: Examples from the initial dataset Q

init

and the

prompt used to vary the content with the teacher model

GPT4.

The GPT4 model is considered as a teacher, and

we collect a dataset for instruction tuning using the

TeacherPrompt and the method of (Peng et al., 2023).

We obtain a dataset Q

ug where the model was asked

to create 30 variations for each template, resulting in a

total of 450 questions. Since GPT4 is trained on func-

tion calls, we query it to generate function calls using

the tool description and Q

ug as the user prompt, as

shown in Listing 2. Of these, 314 were actually useful

after being evaluated by the game. We refer to this set

as D

aug

. Finally, the dataset D = Q

initial

∪Q

aug

is used

for ﬁne-tuning the model using standard command

ﬁne-tuning and cross-entropy loss (Hui and Belkin,

2020).

GPT4 and

Instruction

fine-tuning

GameAssistantLLM

Phi-3 base

model

Figure 4: The pipeline for extracting the datasets and ﬁne-

tuning the Phi-3 (3.8B) base model using both static and

possibly contextual actions requests.

6 EVALUATION

6.1 Setup

Demo Application. The decision to use a plugin

on the UE5 engine was made to ensure a clear sep-

aration of responsibilities and minimal dependen-

cies between the application and the proposed frame-

work. During development and testing, Flask https://

ﬂask.palletsprojects.com/en/2.2.x was used to quickly

switch between different models and parts used for

training or inference, etc. Throughout the evaluation

phase, the models were deployed locally using UE5’s

Neural Network Engine (NNE) plugin. This model

was able to run the two models using the ONNX

for-

mat. Using this decoupled format allows the models

to run on other similar solutions, e.g., Unity

. A snap-

shot from the game can be seen in Figure. 5.

Choosing the Foundation LLM Version. In this

study, we evaluated several small models in the

range of 2B-7B, including Llama2-7B (Touvron et al.,

2023), Gemma 2B (Team et al., 2024) and Phi-3

(3.8B) (Abdin et al., 2024). The decision on which

LLM to choose as the basis for further ﬁne-tuning

was mainly constrained by the resources consumed

and the inference speed. With the currently available

hardware, we only need to consider models that can

fulﬁl the runtime expectations on the end-user hard-

ware. Since most games are already pushing GPUs

to the limit with their rendering systems, another lim-

itation is to only use CPUs in comparison. There-

fore, we evaluate the memory requirements and words

per second generated by the LLMs on an entry-level

AMD Ryzen 7700 CPU that is limited to using only 4

https://onnx.ai/

https://unity.com/products/sentis

Enhancing User Experience in Games with Large Language Models

301

Figure 5: A snapshot from the demo application in which

the user has an ongoing conversation with the assistant

(NPC). The user asks which item he needs to defend an en-

emy and where he can ﬁnd it. After the conversation, the

assistant suggests the item needed and shows a path to its

location (the red arrows in the picture).

cores for inference. The results are shown in Table 1.

Although the literature generally evaluates the metric

tokens per second instead of words per second, we be-

lieve that this is a more natural choice for comparison

in the current use case.

Table 1: Inference speed measured in words per second for

an AMD Ryzen 7 CPU limited to 4 cores and memory re-

quirements (RAM, not GPU memory). 8-bit quantization

was used for all models.

Model

Metric

Llama2B-7B Gemma2B Phi-3

Words/s 1.65 7.9 3.61

RAM 8.5GB 2.7GB 4.6GB

Voice-to-Text is processed with OpenAI Whisper-

tiny(Radford et al., 2022) with dynamic quantiza-

tion. On average, the inference time on the same

CPU was 1.6 seconds for a short message of 9-10

seconds. However, for a voice prompt, this time is

always added to the overall latency of the assistant’s

response, as it must ﬁnish the input before querying

the LLM.

Fine Tuning Parameters. The ﬁne-tuning process

was performed over 10 epochs with LoRA (Low-

Rank Adaptation) (Hu et al., 2022). All layers were

ﬁne-tuned with rank r = 16, α = 32, dropout = 0.05.

The default quantization parameters with b f loat − 16

were used to maximize the training runtime. Nucleus

sampling with p = 0.95 and temperature = 0.8 was

used for decoding. However, during test time, the

temperature was set to zero to obtain deterministic re-

sults that are predictable and repeatable, a decision

inspired by previous research (Siddiq et al., 2022),

(Alshahwan et al., 2024). The AdamW optimizer

(Loshchilov and Hutter, 2019) is used, with β1 = 0.9

and β2 = 0.95, with a learning rate γ = 3×10

−3

, and a

decay rate of 0.01. In the ﬁrst training phase, a warm-

up of 200 steps was used. Each batch had a size of 32,

with 4 gradient accumulation steps and different con-

text window sizes within the boundaries of the base

model.

6.2 Quantitative Evaluation

In this perspective, it is of interest to see how the ﬁne-

tuned version of the GameAssistantLLM model per-

forms compared to the base version Phi-3. The BLEU

metric is used for evaluation similar to other related

work ((Touvron et al., 2023)). The score ranges from

0 to 100, with a higher score indicating a stronger

match between the generated response and the refer-

ence. The evaluation is based on samples from the

data set D and the averaging of the results over 100

trials. The averaged results show a score of 79% for

the ﬁne-tuned model and 34% for the base model.

Looking at the different results, it is concluded that

the results of the base model were determined by the

general knowledge that an LLM may possess and that

it can hallucinate around topics even if it does not

know the answer, e.g. telling the user to use differ-

ent roads and modes of transportation when asked to

point the way to different objects.

Other parameters.

6.3 Qualitative Evaluation

The experiments and statistical results for the

qualitative evaluation of our framework and demo

come from a group of 46 people (volunteers from

the quality assurance departments of our industry

partners and students from University of Bucharest)

who played the demo for two hours and tried to

navigate through the application by asking the NPCs

questions on various topics. During our qualitative

review, we focused on three research topics.

RQ1. Do the application or NPCs give correct an-

swers, i.e. do they seem to understand the questions

or requests asked and answer in the right context?

To measure this, each of the 46 participants

played the demo for two hours and were instructed

to send between 80 and 100 messages to the ap-

plication/NPCs, using almost identical amounts of

voice and text input. After each response from the

application, they were asked whether the NPCs or

the application had responded correctly. Table 2

shows participants’ averaged feedback for both types

of input, categorized by message. The results show

that users are generally satisﬁed with the feedback,

ICSOFT 2024 - 19th International Conference on Software Technologies

302

with lower ratings for voice input as expected, as an

additional layer is required to convert voice to text

and performance naturally degrades along the way.

The scores are expectedly lower for requests where

actions need to be performed through function calls

and inference. One way to improve this is to create a

much larger dataset for ﬁne-tuning. Also, the implicit

actions are more difﬁcult to understand as the intent

expressed by the user sometimes does not match the

understanding of the language model.

Table 2: Qualitative evaluation of the correctness of feed-

back depending on the type of input and the message cate-

gory.

Input

type