Supporting Automated Documentation Updates in
Continuous Software Development with Large Language Models
Henok Birru, Antonio Cicchetti and Malvina Latifaj
Mälardalen University, Västerås, Sweden
hbu23001@student.mdu.se, {antonio.cicchetti, malvina.latifaj}@mdu.se
ORCID: Antonio Cicchetti https://orcid.org/0000-0003-0416-1787; Malvina Latifaj https://orcid.org/0000-0002-2754-9568
Keywords:
Continuous Software Development, Continuous Documentation, Large Language Models, Retrieval
Augmented Generation.
Abstract:
Nowadays software is ubiquitous in our society, making its role increasingly mission critical. Software ap-
plications need to be continuously maintained and evolved to keep up with the pace of market demands and
emerging issues. Continuous Software Development (CSD) processes are an effective technological coun-
termeasure to the mentioned evolutionary pressures: practices like DevOps leverage advanced automation
mechanisms to streamline the application life-cycle. In this context, while handling the application develop-
ment and implementation is adequately investigated, managing the continuous refinement of the corresponding
documentation is a largely overlooked issue.
Maintaining accurate technical documentation in CSD is challenging and time-consuming because frequent software changes require continuous updates and this task is handled manually. Therefore, this work
investigates the automation of documentation updates in correspondence with code changes. In particular, we
present CodeDocSync, an approach that uses Large Language Models (LLMs) to automate the updating of
technical documentation in response to source code changes. The approach is developed to assist technical
writers by summarizing code changes, retrieving updated content, and allowing follow-up questions via a chat
interface.
The approach has been applied to an industrial scenario and has been evaluated by using a set of well-known
predefined metrics: contextual relevancy, answer relevancy, and faithfulness. These evaluations are performed
for the retriever and generator components, using different LLMs, embedding models, temperature settings,
and top-k values. Our solution achieves an average answer relevancy score of approximately 0.86 with Ope-
nAI’s gpt-3.5-turbo and text-embedding-3-large. With an emotion prompting technique, this score increases
to 0.94, attesting to the viability of automation support for continuous technical documentation updates.
1 INTRODUCTION
The ever-increasing ubiquity of software in our every-
day lives and the global market competition are ex-
acerbating the evolutionary pressures applications are
subject to. In particular, emerging demands and/or
issues related to a software application need to be
quickly addressed to avoid losing users or, even worse,
risking penalties due to malfunctions. The pressures
for frequent updates stress the effectiveness of the
adopted development process, since any friction in the
workflow tends to cause delays and additional work-
load.
Continuous software development (CSD) method-
ologies specifically target scenarios in which applications require frequent updates. In particular, CSD
proposes to maximize automation for repetitive and time-consuming tasks, both to save time and to free skilled personnel for core aspects of product development (Bosch, 2014; Chui et al., 2016).
Among the software development phases, the preparation of technical documentation (TD) is often overlooked; TD is usually released together with the software product and instructs users on how to use the application effectively.
Especially in industrial contexts, it is critical to
keep the documentation up-to-date; in this respect,
even though some approaches to manage fast-paced release processes exist (e.g., Docs as Code (docs-as-code)), this task remains largely manual. No-
tably, documentation writers (also known as technical
writers) might need to rearrange the contents based
on the code changes, update code snippets or screen-
shots, and other similar documentation refinements
that require advanced understanding of change con-
sequences (Khan and Uddin, 2023).
Large Language Models (LLMs) are increas-
ingly supporting and even completely replacing labor-
intensive tasks in software development processes.
Code generation (Austin et al., 2021; Chen et al.,
2021), code summarization (Ahmed and Devanbu,
2022), and code review (Lu et al., 2023) are just a few
examples of powerful features that come in handy, es-
pecially in large and rapidly changing scenarios. Not
surprisingly, the application of LLMs has also been
investigated for handling software documentation (Su
et al., 2023; Khan and Uddin, 2022). Nonetheless, the
current solutions do not provide support for handling
the update of technical documentation due to source
code changes, a very common scenario in CSD that
makes such a feature essential.
This paper proposes a methodology to set up an
end-to-end system, making it possible for technical
writers to quickly update technical documentation in
response to code changes. The proposed solution uses
LLMs and appropriate prompting techniques to pro-
vide the following three complementary components:
an automatic updater of the existing technical docu-
mentation based on source code changes; a change
descriptor summarizing how the modifications have
been interpreted; a chat replying to custom questions
posed by technical writers. The underlying aim is to drastically reduce the effort required of technical writers and developers to manage the documentation.
The proposed methodology is validated by instan-
tiating it within an industrial CSD scenario. The de-
veloped code is used as a basic framework on top
of which it is possible to create networking related
tools. In this respect, features are continuously added
and updated, making frequent documentation update
a critical requirement. The solution resulting from the
instantiation of our methodology is evaluated using
predefined metrics achieving an average answer rele-
vancy of 0.86 using OpenAI’s gpt-3.5-turbo LLM and
text-embedding-3-large embedding model. More-
over, the score reaches 0.94 when the emotion prompting technique is adopted in the query prompt.
The paper is structured as follows. Section 2 pro-
vides background information on the key concepts.
Section 3 discusses work related to this research.
Section 4 presents the proposed approach, while Sec-
tion 5 describes the instantiation of the approach in an
industrial case study and evaluation results. Section 6
concludes the paper and describes future research di-
rections.
2 BACKGROUND
This section provides basic information about the
technical solutions that underpin the proposed solu-
tion, notably LLMs and prompt engineering. More-
over, it highlights the need for continuous documen-
tation and existing aids to clarify the motivation be-
hind this research work.
2.1 Large Language Models
In machine learning (ML), language models are solutions tailored to predict the next word in a sentence based on the rest of the sentence given as input. The prediction is computed through a probability distribution over the available vocabulary and an estimation of the likelihood of each word in a given context. The technical realization of language models has been evolving over time (see https://www.appypie.com/blog/evolution-of-language-models); the latest ones are called Generative Pre-trained Transformer (GPT)-series LLMs and are the basis for the approach proposed in this paper.
LLMs represented a major breakthrough in natu-
ral language processing (NLP) research thanks to the
adoption of the transformer architecture and the atten-
tion mechanism (Vaswani et al., 2017). The discus-
sion of the technical details of LLMs from an AI/ML
perspective goes beyond the scope of this work. In our
scope, it is worth mentioning that the architecture and
the attention mechanism allow parallel computation
and the capture of long-distance sequences. These
features boost the precision of predictions and keep
the computation time relatively tractable; moreover,
they allow handling huge amounts of training data
(in the order of billions of parameters). Among the
usages for which LLMs have shown interesting per-
formances, it is worth mentioning the following auto-
mated software engineering tasks (ASET) (Shin et al.,
2023):
Code Generation: generate code that fulfills a
given natural language description (requirement);
Code Summarization: automatically create co-
herent, precise, and valuable code comments to
assist developers in understanding a chunk of
code given as input;
Code Translation: convert code from one pro-
gramming language to another while keeping the
functionality intact;
Code Review and Evaluation: perform static
analysis to identify potential faults in a program.
In this work, we use LLMs to analyze code
changes originating from merge requests and update
technical documentation accordingly.
2.2 Prompt Engineering
The input submitted to LLMs is called a prompt and has been shown to play a fundamental role in the quality of the generated outputs (Shin et al., 2023). In fact, since the basic goal of LLMs is predicting words in natural language sentences, generating domain-specific text (as for ASET) might require directing the LLM towards the correct interpretation of the input to avoid erroneous outputs. Notably, there exists the very well-known problem of model hallucinations, i.e., scenarios in which LLMs generate nonsense and/or unreliable results (Ji et al., 2023). As a consequence, a discipline has recently emerged that targets the processes of designing, building, and refining prompts to obtain the most relevant response from an LLM, called Prompt Engineering (Sahoo et al., 2024).
A selection of prompt engineering techniques in-
cludes Zero-Shot Prompting, which uses only in-
structions without examples, and Few-Shot Prompt-
ing, which improves responses by providing exam-
ples in the prompt (Gao et al., 2023). Chain-of-
Thought (CoT) Prompting introduces intermediate
reasoning steps for handling complex tasks and gener-
ating explanations, while Self-Consistency Prompt-
ing extends CoT by exploring multiple reasoning
paths and selecting the most consistent outcome
(Wang et al., 2022). Prompt Chaining breaks tasks
into sub-tasks with detailed sub-prompts, aiding in
complex problem-solving. Tree of Thoughts (ToT)
Prompting explores multiple generation paths, eval-
uating and selecting the best option (Yao et al., 2023).
Expert Prompting improves LLM’s output by creat-
ing detailed, domain-specific prompts that guide the
model to respond as a distinguished expert in the rel-
evant field (Xu et al., 2023a). Emotion Prompting
incorporates emotional stimuli into prompts, improv-
ing LLMs’ performance across various tasks (Li et al.,
2023). Finally, Retrieval Augmented Generation
(RAG) incorporates information retrieval to provide
domain-specific knowledge, ensuring consistency and
reducing hallucinations (Sahoo et al., 2024).
As discussed in more detail in Section 5, our ap-
proach proposes to use RAG as a prompting tech-
nique.
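To make the distinction between the first two techniques concrete, the following minimal Python sketch contrasts a zero-shot prompt with a few-shot prompt for a documentation-update task; the wording, variable names, and the worked example are purely illustrative and are not taken from our actual templates.

```python
# Hypothetical inputs: a documentation excerpt and the corresponding code diff.
doc_section = "attach(device) attaches a device to a network slice."
code_diff = "- def attach(device):\n+ def attach(device, slice_id):"

# Zero-shot: instructions only, no examples.
zero_shot = (
    "Update the documentation below so that it reflects the code change.\n"
    f"Documentation:\n{doc_section}\n"
    f"Code change (diff):\n{code_diff}\n"
)

# Few-shot: the same instructions preceded by one worked example (also hypothetical).
few_shot = (
    "Example:\n"
    "Documentation: 'detach(device) detaches a device from a slice.'\n"
    "Code change: 'detach(device)' renamed to 'detach(device, force)'.\n"
    "Updated documentation: 'detach(device, force) detaches a device, optionally forcing it.'\n\n"
    + zero_shot
)
```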
2.3 Continuous Software Development
and Documentation
Well-maintained documentation is of paramount importance, especially to support practitioners' effectiveness and to maintain an adequate level of user experience (Aghajani et al., 2020). This requirement is
even more relevant in the case of technical software
documentation, that is, the documentation associated
with the source code. For instance, software develop-
ment kit (SDK) and API documentation are examples
of documentation that are used to share information
among upstream and downstream developers. In this
respect, the accuracy of such information allows for
the proper construction of derived applications and
effective maintenance of the code itself (Rai et al.,
2022).
The evolutionary pressures modern software is
subject to have pushed development processes to-
wards enhancing speed. In particular, methodologies
like Agile and DevOps aim to compress the time re-
quired for tasks that do not add tangible value for the
user while improving the responsiveness of the pro-
cess to the need for changes (Bosch, 2014). These
trends apparently collide with an adequate mainte-
nance of the documentation, since this task is often
seen as a less valuable part of a software product re-
lease and can rapidly become a bottleneck when the
rate of changes increases. Aghajani et al. (Aghajani et al., 2020) highlight poor-documentation-related challenges reported by practitioners, such as inaccuracies in information content, erroneous code examples, incorrect comments, and lack of completeness. Addi-
tionally, outdated documentation is a commonly re-
ported issue negatively affecting the developer expe-
rience (Aghajani et al., 2019). In this respect, automa-
tion can play a crucial role in alleviating these issues
by enhancing the technical documentation process.
The remainder of this section illustrates some
countermeasures introduced with the purpose of sup-
porting continuous technical documentation mainte-
nance and positions the contribution of this paper.
2.4 Documentation as Code
Docs-as-code proposes to adopt a documentation
maintenance approach that goes hand-in-hand with
the development of the software; software develop-
ment and documentation shall use the same proce-
dures and tools, including issue trackers, version con-
trol systems, lightweight markup languages, auto-
mated testing, and incorporating documentation re-
views into the writing workflow. The underlying goal
is to promote the collaboration between developers
and technical writers in effectively producing techni-
cal documentation (Holscher, 2024).
Although most of the stages related to documen-
tation writing would benefit from the same automa-
tion mechanisms existing for code development, a few tasks require additional effort from developers
and technical writers. Notably, whenever new fea-
tures are added or large code changes occur, infor-
mation exchanges are required to clarify the purposes
of the new software. In this respect, either develop-
ers write the first draft of the documentation based on
the code change they have made, or technical writers
need to learn more about the technical details of the
changes before starting the actual writing.
This work explores the capability of LLMs to
analyze developers’ code changes and then modify
the existing documentation accordingly. The solu-
tion aims to replace the efforts required by developers
in writing the draft documentation content for every
code change in the docs-as-code approach. In addi-
tion, it limits the efforts of technical writers to learn
and understand the changes the source code has un-
dergone. This support allows documentation maintenance to keep pace with source code development and frees precious resources for other core tasks.
2.5 Code Summarization
Code summarization refers to the automated creation
of some form of source code documentation. In par-
ticular, the summarization can be extractive or ab-
stractive: the former extracts information directly
from the code (e.g., the name and type of parame-
ters passed to a function); the latter provides a more
advanced description of the code that is not limited
to the mere direct reading of the code (Mastropaolo
et al., 2023). As expected, many generative mod-
els including LLMs produce abstractive summariza-
tion, typically natural language summaries of code
snippets that can be used as inline comments or inte-
grated into technical documentation along with other
descriptions.
Code summarization can be performed at differ-
ent granularity levels, ranging from line-by-line to a
code module; correspondingly, for each granularity
level the summarization output will produce different
insights about the code, ranging from, e.g., the input
and output of a function to the features and interfaces
with other modules, respectively (Zheng et al., 2023).
This work investigates the use of summarization techniques to analyze source code changes and, by taking existing documentation into account, automatically generate appropriate updates. The changes are
given in terms of merge requests, and their impact is
determined by examining the entities involved in the
changes and their interdependencies with the existing
code (and hence documentation). Moreover, the sum-
marization is used to provide the users with a human-
friendly description of the changes and leveraged by
a chatbot answering technically detailed questions
about the modifications.
3 RELATED WORK
Automated documentation has been investigated at
length, even before the advent of AI/ML techniques:
for example, the Automatic Summary Comment Gen-
erator (Sridhara et al., 2010) and Automatic Code
Summarization (Moreno et al., 2013) are solutions
that leverage heuristics and/or predefined rules to ex-
tract and summarize source code information. More-
over, there exist solutions using text retrieval tech-
niques to derive documentation from the structure and
syntactical characteristics of the code (Haiduc et al.,
2010a; Haiduc et al., 2010b). Although potentially ef-
fective, these techniques are less resilient to the evolu-
tion of the codebases, since heuristics and rules need
to be kept up to date with the evolution history of the
source code.
AI/ML first, and the advent of LLMs more recently, have triggered a burst of interest in solutions aiming at the automation of repetitive tasks in the software de-
velopment process (Batarseh et al., 2021; Nguyen-
Duc et al., 2023), including the handling of soft-
ware documentation (Khan and Uddin, 2022). In
general, these solutions can be distinguished between
learning-based and pre-trained: the former ones re-
quire a preliminary phase in which a typically large
dataset is used to build (i.e. train) the model support-
ing the ML solution (Hu et al., 2018; Moore et al.,
2019; Ahmad et al., 2020; Rai et al., 2022; Feng
et al., 2020); the latter ones rely on large pre-made
models that have demonstrated advanced capabilities
in natural language processing, including code gener-
ation and summarization (Phan et al., 2021; Ahmad
et al., 2021; Husain et al., 2019; Khan and Uddin,
2022). The learning-based solutions are intrinsically
more effective when working on the development sce-
narios they have been trained for. However, they tend
to degrade when the codebase is affected by extensive
changes (both from a size and a time perspective). On
the contrary, pre-trained solutions might show lower
effectiveness on specific scenarios, but have the ad-
vantage of providing more stable performances in the
general case.
With the recent boost of generative AI models
there has been an increasing interest in using pre-
trained transformer models to automate documenta-
tion tasks: models like Cotext (Phan et al., 2021) and
PLBART (Ahmad et al., 2021) outperformed existing
learning-based solutions like CodeBERT (Feng et al.,
2020) in code understanding and summarization. In
particular, Codex is a GPT-3 pre-trained model spe-
cialized towards natural language and programming
languages. The evaluation on the CodeSearchNet dataset by Khan et al. (Khan and Uddin, 2022) shows that
Codex surpasses existing techniques with one-shot
learning. Moreover, building upon Codex’s capabili-
ties for code understanding and summarization, Khan
et al. also proposed an approach to generate code ex-
amples for documentation purposes (Khan and Uddin,
2023).
While the approaches discussed so far focus on
generating documentation or code examples for indi-
vidual program units such as functions, our approach
goes beyond that scope. It aims to address a broader
context by examining all code changes within a sin-
gle merge request to capture the inter-dependencies
and implications of each modification. Furthermore,
while most existing works utilize basic prompt tech-
niques such as zero-shot or few-shot prompt engineer-
ing for documentation automation tasks, which suf-
fice for their scenario, our approach uses advanced
prompting techniques such as RAG to enhance the
quality of the generated outcomes.
RepoAgent is an open-source framework that uses
LLMs to generate, maintain, and update code doc-
umentation (Luo et al., 2024). RepoAgent lever-
ages the entire code repository as context to deduce
the functional semantics of target code objects. In
particular, meta information, Abstract Syntax Tree
(AST) analysis, and project tree organization and ref-
erence relationships within the code are used to create
representations of code repositories. Subsequently,
such representations are used to construct prompts
devoted to documentation generation tasks. Our ap-
proach leverages merge requests as part of the gen-
eration prompts instead of AST; as such, using the
RAG prompting technique is critical to achieve ap-
propriate results. In this respect, RepoAgent is eval-
uated by means of human inspection, while we adopt
widespread metrics assessing the quality of the gen-
erated results. Besides, although the RAG prompt
technique is not central to the RepoAgent approach, it
is mentioned as a potential future investigation direc-
tion. Interestingly, a chat feature to support a better understanding of code changes is also planned as future work for RepoAgent.
An ongoing open-source project called Autodoc,
similar to RepoAgent, uses LLMs to automatically
generate documentation directly from source code
(Alephium, 2023). Autodoc’s approach analyzes the
entire codebase with an LLM to produce inline doc-
umentation that explains the purpose and function of
each part of the code, embedding these explanations
directly within the source files. This documentation
is primarily focused on detailing code functionality at
a granular level. In contrast, our tool is designed to
update technical documentation maintained in a sep-
arate repository following a docs-as-code approach.
Instead of analyzing the entire codebase, our tool pro-
cesses specific MRs in the context of the existing doc-
umentation, allowing the LLM to efficiently generate
and update documentation based on recent changes.
This includes not only code explanations but also
broader context, such as usage guidelines and code
examples, all of which are important for maintaining
comprehensive and user-focused technical documen-
tation.
It is also worth mentioning that significant efforts
have been invested in identifying and correcting dis-
crepancies between the source code and correspond-
ing documentation. In particular, approaches have
been proposed to enhance the consistency between
code and comments (Wen et al., 2019; Liu et al.,
2023; Dau et al., 2023), and between software repos-
itory documentation and code references (Tan et al.,
2024; Tan et al., 2023).
4 APPROACH
Figure 1 provides a high-level overview of our pro-
posed approach. Since the approach assumes tech-
nical documentation is developed using the docs-as-
code methodology, documentation is treated as part
of the codebase and is placed within the same version
control system (VCS) as the source code and merge
request (MR). The tool connects to the VCS, retrieves
open MRs, analyzes the associated code changes,
and combines this information with existing docu-
mentation to generate context-aware prompts. These
prompts are designed to guide the LLM in fulfill-
ing the three core capabilities: automatically updating
documentation, summarizing code changes, and pro-
viding interactive support for technical writers. The
approach description is structured as follows. Sec-
tion 4.1 covers the preprocessing of input data. Sec-
tion 4.2 outlines the workflow for automated docu-
mentation updates, followed by Section 4.3, which
details the summarization feature. Section 4.4 ex-
plains the workflow for the chat engine, and Sec-
tion 4.5 provides an overview of the selected LLMs.
Eventually, Section 4.6 illustrates the metrics used to
evaluate the quality of both change retrieval and cor-
responding generation of documentation updates.
4.1 MR Data Cleaning
To achieve accurate and contextually appropriate re-
sponses, it is crucial to provide the LLM with clean
and filtered data. In many MRs, certain changes, such as third-party package installations in lock files or modifications to unit and integration test files, are not relevant and introduce unnecessary noise. Our
approach minimizes storage overhead and enhances
the LLM’s performance by filtering out non-essential
files (A), focusing exclusively on relevant data. In ad-
dition, it empowers technical writers to further refine
the data-cleaning process, allowing them to exclude
any extra files that do not contribute to the documen-
tation.
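A minimal sketch of this cleaning step is shown below. It assumes a GitLab-hosted repository and uses the standard merge request changes endpoint; the exclusion patterns, project ID, MR number, and token are illustrative placeholders, not values from our setup.

```python
import fnmatch
import requests

# Hypothetical exclusion patterns chosen by the technical writer (A).
EXCLUDED = ["*.lock", "poetry.lock", "tests/*", "*_test.py"]

def fetch_filtered_changes(base_url, project_id, mr_iid, token):
    """Retrieve the diffs of an open MR and drop non-essential files."""
    resp = requests.get(
        f"{base_url}/api/v4/projects/{project_id}/merge_requests/{mr_iid}/changes",
        headers={"PRIVATE-TOKEN": token},
    )
    resp.raise_for_status()
    changes = resp.json()["changes"]
    # Keep only changes whose path does not match any exclusion pattern.
    return [
        c for c in changes
        if not any(fnmatch.fnmatch(c["new_path"], pat) for pat in EXCLUDED)
    ]
```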
4.2 Automatic Documentation Update
Our approach employs the RAG technique to auto-
matically update the technical documentation. As
presented in Section 2, RAG is an advanced prompt
engineering technique that enhances the LLM’s out-
put by integrating external knowledge sources with
the model’s pre-existing training data. Sources can
include databases, repositories, or other knowledge
bases; in our implementation, the existing techni-
cal documentation serves as the external knowledge
source, enabling the system to reference existing doc-
umentation content before incorporating the changes.
As shown in Figure 1, the RAG pipeline consists of
three key stages: i) loading, ii) indexing, and iii)
querying.
4.2.1 Loading
The first stage of RAG entails retrieving the external knowledge base, which in our case is the existing technical documentation, from its storage location (C), assuming a docs-as-code methodology is in place. In
this setup, technical documentation is stored as mark-
down files in a Git repository or accessed from local
storage. To enhance efficiency, the system identifies
the relevant content directory, excluding non-essential
files. The technical documentation is loaded using
the LlamaIndex SimpleDirectoryReader class, which
reads each file as a document object. The documen-
tation is typically organized into multiple files, each
discussing specific features or topics, and related files
are grouped under common directories. Rather than
dividing the content into very small chunks, we uti-
lize the existing file structure as our chunking strategy
to maintain contextual coherence. By default, Lla-
maIndex splits files into multiple nodes with a chunk
size of 1024 tokens and a 20-token overlap, result-
ing in files being segmented across multiple nodes for
efficient processing and retrieval. The chunking pa-
rameters are adjusted to retain context and minimize
information loss.
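The sketch below illustrates this loading stage with LlamaIndex. The documentation directory path and the adjusted chunking parameters are assumptions made for illustration; the actual values are tuned per repository.

```python
from llama_index.core import SimpleDirectoryReader
from llama_index.core.node_parser import SentenceSplitter

# Load the docs-as-code markdown files from the content directory (hypothetical path).
documents = SimpleDirectoryReader(
    input_dir="docs/content",
    required_exts=[".md"],   # skip non-essential files
    recursive=True,
).load_data()

# LlamaIndex defaults to 1024-token chunks with a 20-token overlap; here the parameters
# are enlarged (illustrative values) so that a node roughly corresponds to one page.
splitter = SentenceSplitter(chunk_size=2048, chunk_overlap=64)
nodes = splitter.get_nodes_from_documents(documents)
```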
4.2.2 Indexing
The second stage of RAG is indexing, which involves
organizing the nodes into a structure optimized for
fast retrieval. This process enables efficient access to
relevant information based on semantic similarity. In-
dexes are built from the nodes created during the load-
ing phase, and once indexed, these nodes are fed to an
embedding model (D) and converted into vector em-
beddings stored in the VectorStoreIndex (E). Embed-
dings represent words or phrases as dense vectors of
real numbers in a continuous vector space. These vec-
tors are lists of floating point numbers. Unlike traditional encoding techniques such as one-hot encoding, embeddings encode semantic meaning and relationships between words, which helps LLMs understand the similarities and differences between words in their context. The embedding pro-
cess starts by tokenizing the given text. The size of the
vector used to represent each token in the embedding
space is called the embedding dimension. The dimen-
sion size or the number of values required to represent
a token is different across various models. Addition-
ally, embedding models have a maximum token size,
which refers to the number of tokens the model can
process at a time. Texts longer than the capacity of the embedding model need to be broken down into smaller chunks before being converted to embeddings.
The embedding process requires an embedding
model that transforms text into numerical represen-
tations to capture semantic meaning. The choice of
embedding model directly impacts both the perfor-
mance and cost-efficiency of the system. In this ap-
proach, we experimented with two different embed-
ding models. The first, BAAI’s bge-small-en-v1.5
2
,
is a general-purpose embedding model ranked 37th
on the Massive Text Embedding Benchmark (MTEB)
Top-40 leaderboard. With 384 embedding dimen-
sions and a token capacity of 512, this model offers
a compact structure and is available for free on Hug-
gingFace, making it a cost-effective option. In con-
trast, OpenAI’s text-embedding-3-large model ranks
14th on the MTEB leaderboard and offers 3072 em-
bedding dimensions with a significantly higher token
limit of 8191. However, this advanced capability in-
curs additional costs, distinguishing it from the cost-free alternative.

Figure 1: CodeDocSync workflow.
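A minimal sketch of the indexing stage, assuming LlamaIndex and the two embedding models discussed above (either one can be plugged in), is shown below; it continues from the loading sketch and reuses its nodes.

```python
from llama_index.core import Settings, VectorStoreIndex
from llama_index.embeddings.openai import OpenAIEmbedding
# from llama_index.embeddings.huggingface import HuggingFaceEmbedding  # cost-free alternative

# Select the embedding model (D); text-embedding-3-large offers 3072 dimensions.
Settings.embed_model = OpenAIEmbedding(model="text-embedding-3-large")
# Settings.embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")

# Convert the documentation nodes into vector embeddings stored in the VectorStoreIndex (E).
doc_index = VectorStoreIndex(nodes)
```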
4.2.3 Querying
The last stage of RAG entails the querying process.
The technical writer’s query (1a) is combined with
the code diff from the filtered MR (2a) and addi-
tional instruction statements. To prepare the query
prompt, the approach employs prompt chaining, emo-
tion prompting, and expert prompting. Prompt chain-
ing directs the model through a series of prompts,
proving particularly useful for multi-step tasks (Gade-
sha and Kavlakoglu, 2024). Prompts are sequentially
sent to the LLM using the create and refine prompt
templates. The initial prompt employs the create tem-
plate to generate a first response from the LLM, which
is then refined through subsequent prompts using the
refine template. This process builds upon each re-
sponse, improving the model’s focus and precision
across steps until all nodes are processed. Emotion prompting improves LLM performance by incorporating psychological principles into prompts through emotional stimuli (Li et al., 2023). In our prompt
templates, phrases like “You’d better be sure” are em-
ployed, drawing on the social identity theory to im-
prove the quality of responses from the model. Ex-
pert prompting improves LLM’s responses by posi-
tioning it as an authoritative figure. Starting prompts
with phrases like “As a technical writer specializing
in developer documentation, your task is to update
the outdated part of the documentation”, we guide
the model to respond with the expertise needed for
technical documentation updates (Xu et al., 2023b).
The query prompt is initially processed by the Vec-
torIndexRetriever (3a), which employs the embed-
ding model (4a) to transform the query into vector
embeddings (5a). The next step is choosing the ap-
propriate searching strategy to find the relevant re-
sponse for the query. The search strategy has evolved
from keyword matching to semantic search. Key-
word matching looks for specific index words that
match the query words, whereas, semantic search-
ing focuses on the meaning of the query words and
finds contextually relevant data. The search is per-
formed on pre-existing data, and the number of rel-
evant data points returned is determined by setting
the value of k in top-k similarity search. In our ap-
proach, the VectorIndexRetriever performs a top-k se-
mantic retrieval (where k=2) from the VectorStoreIn-
dex containing the embeddings of the technical docu-
mentation. This retrieval is based on cosine similar-
ity, calculating the likeness between the query’s em-
bedding and those of the technical documents. This
identifies outdated sections of the documentation by
comparing the current technical documentation with
recent code changes. The top-k retrieved nodes (7a)
are then forwarded to the response synthesis compo-
nent (8a). The response synthesis component com-
bines the user’s query with nodes retrieved in the
previous step and sends a prompt to the LLM (9a).
Among the various methods for generating LLM re-
sponses, our solution uses the compact response gen-
eration technique. This technique is chosen because
it minimizes LLM calls while sending code changes
to the LLM by fitting multiple nodes into a single
prompt, up to the allowed prompt size. The interac-
tion is performed in a question-answer format using a
create and refine prompt template from LLamaIndex.
The result is the generation of the updated technical
documentation (10a).
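The following sketch puts the querying stage together with LlamaIndex, continuing from the indexing sketch above. The prompt wording reflects the expert and emotion prompting phrases described earlier, while mr_diff stands in for the filtered MR diff text; both are illustrative placeholders rather than our exact templates.

```python
from llama_index.core import get_response_synthesizer
from llama_index.core.query_engine import RetrieverQueryEngine
from llama_index.core.retrievers import VectorIndexRetriever

# Top-k semantic retrieval (k=2) over the documentation embeddings (3a-7a).
retriever = VectorIndexRetriever(index=doc_index, similarity_top_k=2)

# Compact synthesis: fit as many retrieved nodes as possible into a single prompt (8a-9a).
synthesizer = get_response_synthesizer(response_mode="compact")
query_engine = RetrieverQueryEngine(retriever=retriever, response_synthesizer=synthesizer)

mr_diff = "<filtered MR diff text>"  # placeholder for the cleaned code changes (2a)
query = (
    "As a technical writer specializing in developer documentation, your task is to "
    "update the outdated part of the documentation. You'd better be sure.\n\n"
    f"Code changes from the merge request:\n{mr_diff}"
)

updated_docs = query_engine.query(query)  # generation of the updated documentation (10a)
```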
4.3 MR Summary
The summary feature relies solely on MR data to gen-
erate a concise summary of the MR itself. Unlike the
previous approach applied for updating the technical
documentation where the MR was directly used as
a query, the current method involves converting the
MR into embeddings and storing them in a Vector-
StoreIndex for future retrieval. The initial step entails
indexing the filtered MR data by leveraging the Git-
Lab API, which provides a list of JSON objects rep-
resenting the modifications within an MR. For each
JSON object, a LlamaIndex document is created with
the main content sourced from the diff text. This doc-
ument is populated with the diff text and enriched
with metadata attributes such as new path, old path,
new file, renamed file, and deleted file to provide ad-
ditional context about the code changes.
The number of documents corresponds to the
number of filtered code change files. Consistent with the prior implementation, documents and nodes are
generated using a chunk size of 1,024 tokens and
a chunk overlap of 20 tokens. If a file’s content
exceeds 1,024 tokens, it is divided across multiple
nodes. These nodes are then processed by the em-
bedding model (B), and the resulting MR embeddings
are stored in a VectorStoreIndex for later use (E). This
embedding process occurs when an MR is selected for
analysis. When a technical writer requests a summary
of MR changes (1b), the SummaryIndexRetriever asks
for the relevant nodes (2b) from the VectorStoreIn-
dex where they are stored. Unlike the prior method
which returned only the top-k nodes, here, all nodes
are returned (3b) because each contributes to the sum-
mary feature. The returned nodes are then passed
to the response synthesis component (4b). To gen-
erate the LLM response, a tree summarization gener-
ation technique is employed, where nodes are recur-
sively sent to the LLM using a summary prompt tem-
plate (5b). Depending on the variety of changes, mul-
tiple responses may be generated. The final summary
is constructed by organizing these responses into a
tree structure, with the root node providing a com-
plete summary (6b).
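A minimal sketch of the summary feature is shown below. It assumes the filtered_changes list produced by the cleaning sketch in Section 4.1 and approximates the "all nodes are returned" behavior with a LlamaIndex SummaryIndex combined with tree summarization; our actual implementation stores the MR embeddings in a VectorStoreIndex as described above.

```python
from llama_index.core import Document, SummaryIndex

# One document per filtered change file, enriched with path metadata (B).
mr_documents = [
    Document(
        text=change["diff"],
        metadata={
            "new_path": change["new_path"],
            "old_path": change["old_path"],
            "new_file": change["new_file"],
            "renamed_file": change["renamed_file"],
            "deleted_file": change["deleted_file"],
        },
    )
    for change in filtered_changes  # output of the Section 4.1 sketch
]

# All nodes are retrieved and recursively summarized into a tree whose root is the summary (1b-6b).
mr_summary_index = SummaryIndex.from_documents(mr_documents)
summary_engine = mr_summary_index.as_query_engine(response_mode="tree_summarize")
mr_summary = summary_engine.query("Summarize the code changes in this merge request.")
```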
4.4 MR Chat
The MR chat functionality, like the summary fea-
ture, relies solely on MR data to perform its oper-
ations. When a technical writer poses a question
about MR changes through the MR chat (1c), the Vec-
torIndexRetriever employs the embedding model (2c)
to convert the question into vector embeddings (3c).
Following this, a top-k semantic retrieval process (4c)
occurs within the VectorStoreIndex, the repository for
MR embeddings generated in advance. Here, only
the top-k nodes (where k=2) are retrieved (5c), as
they contain the most relevant segments of the MR
data necessary for answering the technical writer’s
question. These selected nodes are then passed to
the response synthesis component (6c), which formu-
lates and sends a prompt to the LLM (7c), leverag-
ing a compact response generation technique, similar
to that used in the query engine. This process ulti-
mately generates an answer to the technical writer’s
question (8c).
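A short sketch of the MR chat retrieval path, reusing the MR documents from the summary sketch, is given below; it shows a single question-answer turn, with the example question referring to the attach() method discussed in Section 5, and both the index name and the question are illustrative.

```python
from llama_index.core import VectorStoreIndex

# MR embeddings stored for retrieval; top-k = 2 with compact response synthesis (1c-8c).
mr_vector_index = VectorStoreIndex.from_documents(mr_documents)
mr_chat_engine = mr_vector_index.as_query_engine(similarity_top_k=2, response_mode="compact")

answer = mr_chat_engine.query("What is the new signature of the attach() method?")
```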
4.5 Large Language Model
The three LLMs selected for this study, as detailed in
Table 1, are chosen based on factors such as cost and
context window. However, our approach is not tied to
these specific models; it is designed to be adaptable,
allowing easy integration of additional LLMs via an
API endpoint.
Table 1: LLMs and their utilized parameters in our imple-
mentation.
Model                        Provider   Context Window
gpt-3.5-turbo-0125           OpenAI     16,385 tokens
llama-v2-34b-code-instruct   Meta       16k - 100k tokens
llama-v3-70b-instruct        Meta       8k tokens
The gpt-3.5-turbo-0125 model, developed by
OpenAI, features a substantial context window of
16k tokens. This model is designed for conver-
sational tasks, allowing it to handle extensive in-
puts effectively, making it ideal for applications
that require a deep understanding of context, such
as documentation generation and summarization.
The cost is $0.0005 for every 1,000 input tokens
and $0.0015 for every 1,000 output tokens.
The llama-v2-34b-code-instruct model from Meta
is an open-source model specifically designed for
code-related tasks. It features a context window
that ranges from 16k to 100k tokens, enabling efficient handling of large merge requests. This model is specifically trained for general code synthesis
and comprehension, essential for our goal of ana-
lyzing diff outputs to identify necessary documen-
tation updates.
The llama-v3-70b-instruct model from Meta is an
open-source model that shows excellent perfor-
mance in reasoning, code generation, and follow-
ing instructions (Meta, 2024). It features a context
window of around 8k tokens, which is sufficient
for handling merge requests that are of moderate
size.
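Because the approach is model-agnostic, switching the generator LLM amounts to changing one configuration line. The sketch below shows this with LlamaIndex; the temperature value is an illustrative setting (temperature is one of the evaluated parameters in Section 5.2), and other providers can be plugged in through their respective LlamaIndex integrations or an OpenAI-compatible API endpoint.

```python
from llama_index.core import Settings
from llama_index.llms.openai import OpenAI

# Configure the generator LLM used by all query engines; 0.2 is an example temperature.
Settings.llm = OpenAI(model="gpt-3.5-turbo-0125", temperature=0.2)
```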
4.6 Quality Measures
The RAG pipeline forms the backbone of our so-
lution, with separate evaluations performed on both
the retriever and response synthesis components us-
ing pre-defined metrics.
The contextual relevancy metric is used to evalu-
ate the performance of the retriever component. Dur-
ing the retrieval phase, the query is converted into an
embedding, followed by a similarity search to locate
the most relevant documents from the vector store.
This metric is calculated as the ratio of the relevant
statements retrieved by the retriever for a given query
to the total number of statements.
The answer relevancy and faithfulness metrics are
used to evaluate the performance of the response synthesis component, also referred to as the
generator component. The answer relevancy metric
measures how relevant the generated answer is com-
pared to the input query. Unlike contextual relevancy,
this metric is calculated as the ratio of relevant state-
ments generated by the chosen LLM for a given query
to the total number of generated statements. The
faithfulness metric measures the factual consistency
of the output by assessing how accurately its content
aligns with the information from the retrieved docu-
ments. This metric is calculated by first extracting all
claims from the generated output. Then, among these
claims, those that do not contain any factual contra-
dictions when compared to the retrieved documents
are identified as truthful claims. This metric is the ra-
tio of truthful claims to the total number of claims in
the generated output.
It is worth noting that our approach proposes the
use of standard metrics measuring the quality of the
generated output based on the characteristics of the
prompts. In this respect, our approach does not re-
quire a ground truth dataset to evaluate the quality of
the generated documentation.
5 INDUSTRIAL CASE STUDY
CodeDocSync is applied in Nokia’s industrial CSD
environment, where frequent code changes require
continuous updates to technical documentation. Here,
we focus on Nokia’s Network as Code (NAC) project
to demonstrate CodeDocSync’s capability to auto-
mate technical documentation updates in response to
changes in the Software Development Kit (SDK).
Nokia’s NAC project enables developers to dynam-
ically control network performance directly from
downstream applications. Although Nokia currently
employs a docs-as-code approach, this method of-
ten requires developers to draft updates and techni-
cal writers to finalize them, making the process both
time-consuming and inefficient. Given its frequent
feature enhancements, the NAC project serves as an
ideal case for evaluating CodeDocSync’s effective-
ness in ensuring accurate, up-to-date documentation.
The interested reader can access further details on the
implementation code discussed in this paper in the
GitHub repository (https://anonymous.4open.science/r/codedocsync-85F2/README.md).
5.1 Implementation
CodeDocSync is integrated with NAC’s SDK and
technical documentation repositories, leveraging Git-
Lab’s API to automatically retrieve code changes
from open MRs and existing technical documen-
tation. The existing documentation comprises ap-
proximately 40 technical markdown files, including
concise API references, general concepts, and in-
structions. It features nested structures and domain-
specific terminology related to 5G networking, re-
flecting the complexity nature of the content. The
size of the codebase is less than 1000 files. The
MR changes are processed through a custom pipeline
that filters the original JSON data and performs sev-
eral pre-processing steps: i) extracting the Git diff, ii)
cleaning the data by removing irrelevant changes that
introduce noise, and iii) chunking the changes based
on file paths. The documentation repository cate-
gorizes content into separate markdown files, each
dedicated to specific topics, such as network slicing.
These files are loaded and filtered to exclude unneces-
sary files, and then chunked into distinct nodes, typ-
ically with each node representing a markdown page
on a particular topic. After preprocessing, both the
code changes and documentation nodes are converted
to embeddings and stored in a VectorStoreIndex, en-
abling efficient retrieval and alignment during query
processing.
For evaluation purposes, we focused on the net-
work slice attachment feature within NAC. First, an
MR is created in the SDK repository to modify the
existing attach() and detach() methods of this fea-
ture. The existing technical documentation, which
does not yet reflect these changes, is stored in the Vec-
torStoreIndex. As shown in Figure 2, selecting the
MR displays a page listing the files that are ready for
analysis. This page also includes three tabs - one for
each of the core CodeDocSync features.
The fetch summary tab provides concise informa-
tion for each modified file, with a concluding para-
graph to give an overall explanation of the changes as
shown in Figure 3. The summarization can be cus-
tomized to focus on specific files by excluding se-
lected files from the first tab.
The fetch updated documentation tab provides the
updated documentation reflecting the recent changes.
As shown in Figure 4, the new feature? flag is avail-
able to generate documentation for a new feature that
has not been previously covered in the existing docu-
mentation. If the option new feature? is selected, the
prompt will be updated to generate new documenta-
tion content based on the data from the code changes.
The generated content is formatted in Markdown syn-
tax for easier integration into the final documentation.
The MR-chat tab is designed to assist the technical
writer in understanding changes in the selected merge
request by enabling follow-up questions. As shown in
Figure 5, the technical writer can ask questions about
details such as the new method signature or request a
general explanation of the changes and the chat will
provide the required response.
5.2 Evaluation
This section presents the evaluation results for our
chosen case study. The evaluation is performed using
the DeepEval tool (https://docs.confident-ai.com) by applying the predefined metrics outlined in Section 4.6, and OpenAI's gpt-4o model.
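A minimal sketch of how such an evaluation can be run with DeepEval is shown below; the test-case contents build on the querying sketch of Section 4.2.3, and the input query is a placeholder rather than one of our evaluation prompts.

```python
from deepeval.test_case import LLMTestCase
from deepeval.metrics import (
    AnswerRelevancyMetric,
    ContextualRelevancyMetric,
    FaithfulnessMetric,
)

test_case = LLMTestCase(
    input="Update the documentation for the network slice attach() change.",   # placeholder query
    actual_output=str(updated_docs),                                           # RAG pipeline output
    retrieval_context=[n.node.get_content() for n in updated_docs.source_nodes],  # retrieved nodes
)

# gpt-4o acts as the judge model for all three metrics.
for metric in (
    ContextualRelevancyMetric(model="gpt-4o"),
    AnswerRelevancyMetric(model="gpt-4o"),
    FaithfulnessMetric(model="gpt-4o"),
):
    metric.measure(test_case)
    print(type(metric).__name__, metric.score)
```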
5.2.1 Contextual Relevancy
The retriever component of the RAG pipeline is eval-
uated based on contextual relevancy using three dif-
ferent values of k, which represent the number of rel-
evant nodes expected in the top-k similarity search.
Figure 6 shows that the retriever component of our
solution performs more effectively as the k value de-
creases. The average contextual relevancy score in-
creases from 0.53 at k = 5 to 0.83 at k = 2. This
improvement suggests that lower k values allow the
retriever to focus more precisely on the most relevant
nodes, reducing noise from less relevant results.
5.2.2 Answer Relevancy
The answer relevancy metric is used to evaluate the
performance of the response synthesis component
which in DeepEval is referred to as the generator
component, with assessments conducted for two independent variables: temperature and the LLM-embedding model combination. Temperature is set to ensure
creative but accurate outputs. Both contextual and
answer relevancy metrics are calculated using Equa-
tion 1.
\text{Relevancy} = \frac{\text{Number of Relevant Statements}}{\text{Total Number of Statements}} \quad (1)
Figure 7 shows the answer relevancy scores for
different LLM-embedding model combinations. The
combination of the gpt-3.5-turbo-0125 LLM with the
text-embedding-3-large embedding model achieves
the highest average score of 0.86, which increases
to 0.94 when the emotion-prompting technique de-
scribed in Section 4.2.3 is applied to the prompt tem-
plate (for the sake of space limitations we do not report the results in full detail; the interested reader is referred to (Birru, 2024) for complete information about the prompts used and the corresponding results). In contrast, the combination of the llama-v3-
70b-instruct LLM with the BAAI/bge-small-en-v1.5
embedding model reaches a maximum average score
of only 0.3. The combination of the llama-v3-70b-
instruct LLM with the text-embedding-3-large em-
bedding model improved the score to 0.68. These re-
sults highlight the impact of the LLM-embedding
model combinations on performance.
5.2.3 Faithfulness
The faithfulness metric is used to assess the factual
accuracy of the LLM’s response in relation to the con-
tent retrieved by the retriever component. The LLM-
embedding model combination serves as the indepen-
dent variable. It is calculated using Equation 2.
\text{Faithfulness} = \frac{\text{Number of Truthful Claims}}{\text{Total Number of Claims}} \quad (2)
Figure 8 shows the faithfulness scores, where
the combination of the gpt-3.5-turbo-0125 LLM with
the text-embedding-3-large embedding model has an
average score of 0.82. This is lower than the combination of the llama-v3-70b-instruct LLM with the BAAI/bge-small-en-v1.5 embedding model, which achieves an average score of 0.95. This difference may be due to the llama-v3-70b-instruct model's better alignment with the retrieved content, leading to more factually accurate responses.

Figure 2: List of filtered changes for Network Slice Attachment.
Figure 3: MR-summary for Network Slice Attachment.
5.3 Discussion
The RAG prompt engineering technique is used in
our solution, which enables responses to be gener-
ated directly from existing technical documentation
and code change data. In addition to RAG, we incor-
porated advanced techniques such as emotion prompt-
ing and expert prompting within the prompt templates
used in the RAG pipeline. Initially, we experimented
with simpler techniques, such as zero-shot and one-
shot prompts; however, given the size of the docu-
mentation data and the context window limitations of
LLMs, RAG proved to be the optimal approach. This
choice not only maximized context utilization but also
opened up opportunities to integrate MR-summary
and MR-chat features, improving the solution’s func-
tionality and user experience.
Our solution updates documentation affected by
code changes by embedding the “diff” from MRs
into the RAG pipeline. To handle the token limits
of the embedding model, we filter irrelevant infor-
mation from MRs and then apply chunking to divide large MRs, preserving all content without
exceeding limits. Each chunk’s embedding is then
combined into a single vector using a weighted av-
erage, ensuring proportional representation based on
chunk length. This approach optimizes the embed-
ding process, enabling effective similarity searches
against technical documentation even when handling
large MRs.
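A minimal sketch of this combination step is given below, assuming NumPy and taking per-chunk embeddings together with their token counts as inputs; the function and argument names are illustrative.

```python
import numpy as np

def combine_chunk_embeddings(chunk_embeddings, chunk_token_counts):
    """Combine per-chunk embeddings of one large MR into a single vector,
    weighting each chunk proportionally to its length."""
    weights = np.asarray(chunk_token_counts, dtype=float)
    weights /= weights.sum()                               # normalize to proportions
    vectors = np.asarray(chunk_embeddings, dtype=float)    # shape: (num_chunks, dim)
    return (weights[:, None] * vectors).sum(axis=0)        # shape: (dim,)
```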
Even though a thorough empirical assessment for
the value of our proposed automated documentation
generation goes beyond the scope of this work, we
performed interviews with a technical writer and a de-
veloper at Nokia, who tested the tool and provided
feedback. While they appreciated the overall preci-
sion and usability of the documentation update sup-
port, they also suggested improvements to certain features. Notably, they expressed the need for visual indicators or explanations able to distinguish specific changes in updated documentation. Such a distinction would significantly improve clarity and usability by allowing writers to isolate specific changes and the corresponding updates.

Figure 4: Updated documentation section for Network Slice Attachment.
Figure 5: MR-chat for Network Slice Attachment.
Figure 6: Contextual relevancy score at various k values.
While our solution primarily focuses on updating
existing documentation, some MRs introduce entirely new information not covered in the current documentation. To address this, our solution offers a "New Feature" flag that, when enabled, allows technical writers to generate documentation for new features. When this flag is active, the system applies a prompt specifically designed to create new content, ensuring documentation coverage for new features.

Figure 7: Answer relevancy scores for LLM-embedding model combinations.
Figure 8: Faithfulness metric scores for LLM-embedding model combinations.
Our approach relies on the quality of source docu-
mentation, and its effectiveness may be limited if the
documentation is poorly organized, potentially im-
pacting the overall results. In a broader perspective,
it would be important to test our solution across di-
verse project types, especially those with varying pro-
gramming languages and technical writing styles. Al-
though the solution is not inherently dependent on
any specific programming language, the evaluation
has been conducted using only Python-based projects.
Expanding testing to include projects in other lan-
guages would help assess the solution’s versatility. In
this respect, it is worth noting that the research community lacks a standard procedure to evaluate
the automated generation of documentation. There-
fore, establishing standard benchmarking procedures
would be valuable to enable the validation and refine-
ment of the proposed solutions.
6 CONCLUSIONS AND FUTURE
WORKS
In CSD settings, keeping the technical documentation of software up to date is critical, especially when such software is used further in upstream and downstream developments. Despite the availability of techniques to support documentation updates, this task is still largely manual. As a consequence, when software changes frequently, keeping the documentation up to date rapidly becomes intractable.
We presented CodeDocSync, an approach devoted
to enhancing documentation updates in CSD. The ap-
proach leverages LLMs to provide: an automatic up-
dater of the existing technical documentation based
on source code changes; a change descriptor summa-
rizing how the modifications have been interpreted; a
chat replying to custom questions posed by technical
writers.
The concretization in an industrial case study
demonstrates the potential of the proposed approach
in terms of quality of generated documentation and
corresponding support. Nonetheless, further investi-
gations are required to empirically measure the gains
provided by the automation as well as its applicabil-
ity to diverse kinds of applications and programming
languages. These empirical validations would explore
deeper qualitative feedback from domain experts in-
volved in documentation handling.
From a more technical perspective, our approach
improves the existing state of the art by avoiding a
training phase for the automation support. However,
it would be interesting to provide a more structured
and detailed comparison between pre-trained models
and those fine-tuned on appropriate datasets to mea-
sure the gaps in quality of the generation. Addition-
ally, improvements to the RAG system, such as in-
tegrating re-ranking mechanisms or advanced index-
ing techniques, should be explored to enhance perfor-
mance.
REFERENCES
Aghajani, E., Nagy, C., Linares-Vásquez, M., Moreno, L.,
Bavota, G., Lanza, M., and Shepherd, D. C. (2020).
Software documentation: the practitioners’ perspec-
tive. In Proceedings of the ACM/IEEE 42nd Inter-
national Conference on Software Engineering, ICSE
’20. ACM.
Aghajani, E., Nagy, C., Vega-Marquez, O. L., Linares-
Vasquez, M., Moreno, L., Bavota, G., and Lanza,
M. (2019). Software documentation issues unveiled.
In 2019 IEEE/ACM 41st International Conference on
Software Engineering (ICSE). IEEE.
Ahmad, W., Chakraborty, S., Ray, B., and Chang, K.-W.
(2020). A transformer-based approach for source code
summarization. In Proceedings of the 58th Annual
Meeting of the Association for Computational Lin-
guistics. Association for Computational Linguistics.
Ahmad, W. U., Chakraborty, S., Ray, B., and Chang, K.-W.
(2021). Unified pre-training for program understand-
ing and generation.
Ahmed, T. and Devanbu, P. (2022). Few-shot training llms
for project-specific code-summarization. In Proceed-
ings of the 37th IEEE/ACM International Conference
on Automated Software Engineering. arXiv.
Alephium (2023). AI-Powered Improved Docs: Explaining Alephium Full Node Code. https://medium.com/@alephium/ai-powered-improved-docs-explaining-alephium-full-node-code-6795667fac02. [Accessed 15-11-2024].
Austin, J., Odena, A., Nye, M., Bosma, M., Michalewski,
H., Dohan, D., Jiang, E., Cai, C., Terry, M., Le, Q.,
and Sutton, C. (2021). Program synthesis with large
language models.
Batarseh, F. A., Mohod, R., Kumar, A., and Bui, J. (2021).
The application of artificial intelligence in software
engineering: a review challenging conventional wis-
dom.
Birru, H. (2024). Exploring the use of LLMs in agile technical documentation writing. Master's thesis, Mälardalen University, Västerås, Sweden.
Bosch, J. (2014). Continuous Software Engineering: An
Introduction, page 3–13. Springer International Pub-
lishing.
Chen, M., Tworek, J., Jun, H., Yuan, Q., Pinto, H. P. d. O.,
Kaplan, J., Edwards, H., Burda, Y., Joseph, N., Brock-
man, G., Ray, A., Puri, R., Krueger, G., Petrov, M.,
Khlaaf, H., Sastry, G., Mishkin, P., Chan, B., Gray, S.,
Ryder, N., Pavlov, M., Power, A., Kaiser, L., Bavar-
ian, M., Winter, C., Tillet, P., Such, F. P., Cummings,
D., Plappert, M., Chantzis, F., Barnes, E., Herbert-
Voss, A., Guss, W. H., Nichol, A., Paino, A., Tezak,
N., Tang, J., Babuschkin, I., Balaji, S., Jain, S., Saun-
ders, W., Hesse, C., Carr, A. N., Leike, J., Achiam,
J., Misra, V., Morikawa, E., Radford, A., Knight, M.,
Brundage, M., Murati, M., Mayer, K., Welinder, P.,
McGrew, B., Amodei, D., McCandlish, S., Sutskever,
I., and Zaremba, W. (2021). Evaluating large language
models trained on code.
Chui, M., Manyika, J., and Miremadi, M. (2016). Where
machines could replace humans-and where they can’t
(yet). The McKinsey Quarterly, pages 1–12.
Dau, A. T. V., Guo, J. L. C., and Bui, N. D. Q. (2023). Doc-
checker: Bootstrapping code large language model
for detecting and resolving code-comment inconsis-
tencies.
Feng, Z., Guo, D., Tang, D., Duan, N., Feng, X., Gong, M.,
Shou, L., Qin, B., Liu, T., Jiang, D., and Zhou, M.
(2020). Codebert: A pre-trained model for program-
ming and natural languages.
Gadesha, V. and Kavlakoglu, E. (2024). What is prompt chaining? IBM. https://www.ibm.com/topics/prompt-chaining. [Accessed 14-05-2024].
Gao, S., Wen, X.-C., Gao, C., Wang, W., Zhang, H., and
Lyu, M. R. (2023). What makes good in-context
demonstrations for code intelligence tasks with llms?
In 2023 38th IEEE/ACM International Conference on
Automated Software Engineering (ASE). IEEE.
Haiduc, S., Aponte, J., and Marcus, A. (2010a). Supporting
program comprehension with source code summariza-
tion. In Proceedings of the 32nd ACM/IEEE Interna-
tional Conference on Software Engineering - Volume
2, ICSE ’10. ACM.
Haiduc, S., Aponte, J., Moreno, L., and Marcus, A. (2010b).
On the use of automated text summarization tech-
niques for summarizing source code. In 2010 17th
Working Conference on Reverse Engineering. IEEE.
Holscher, E. (2024). Docs as Code. https://www.writethedocs.org/guide/docs-as-code/. [Accessed 28-01-2024].
Hu, X., Li, G., Xia, X., Lo, D., Lu, S., and Jin, Z. (2018).
Summarizing source code with transferred api knowl-
edge. In Proceedings of the Twenty-Seventh Inter-
national Joint Conference on Artificial Intelligence,
IJCAI-2018. International Joint Conferences on Arti-
ficial Intelligence Organization.
Husain, H., Wu, H.-H., Gazit, T., Allamanis, M., and
Brockschmidt, M. (2019). Codesearchnet challenge:
Evaluating the state of semantic code search.
Ji, Z., Lee, N., Frieske, R., Yu, T., Su, D., Xu, Y., Ishii, E.,
Bang, Y. J., Madotto, A., and Fung, P. (2023). Survey
of hallucination in natural language generation. ACM
Computing Surveys, 55(12):1–38.
Khan, J. Y. and Uddin, G. (2022). Automatic code docu-
mentation generation using gpt-3.
Khan, J. Y. and Uddin, G. (2023). Combining contexts from
multiple sources for documentation-specific code ex-
ample generation.
Li, C., Wang, J., Zhu, K., Zhang, Y., Hou, W., Lian, J.,
and Xie, X. (2023). Emotionprompt: Leveraging psy-
chology for large language models enhancement via
emotional stimulus. ArXiv, abs/2307.11760.
Liu, Z., Xia, X., Lo, D., Yan, M., and Li, S. (2023). Just-
in-time obsolete comment detection and update. IEEE
Transactions on Software Engineering, 49(1):1–23.
Lu, J., Yu, L., Li, X., Yang, L., and Zuo, C. (2023).
Llama-reviewer: Advancing code review automa-
tion with large language models through parameter-
efficient fine-tuning. In 2023 IEEE 34th International
Symposium on Software Reliability Engineering (IS-
SRE), pages 647–658.
Luo, Q., Ye, Y., Liang, S., Zhang, Z., Qin, Y., Lu, Y., Wu,
Y., Cong, X., Lin, Y., Zhang, Y., Che, X., Liu, Z., and
Sun, M. (2024). Repoagent: An llm-powered open-
source framework for repository-level code documen-
tation generation.
Mastropaolo, A., Cooper, N., Palacio, D. N., Scalabrino,
S., Poshyvanyk, D., Oliveto, R., and Bavota, G.
(2023). Using transfer learning for code-related
tasks. IEEE Transactions on Software Engineering,
49(4):1580–1598.
Meta (2024). Introducing Meta Llama 3: The most capable openly available LLM to date. https://ai.meta.com/blog/meta-llama-3/. [Accessed 03-06-2024].
Moore, J., Gelman, B., and Slater, D. (2019). A convolutional neural network for language-agnostic source code summarization. In Proceedings of the 14th International Conference on Evaluation of Novel Approaches to Software Engineering, ENASE 2019, pages 15–26, Setubal, PRT. SCITEPRESS - Science and Technology Publications, Lda.
Moreno, L., Aponte, J., Sridhara, G., Marcus, A., Pollock,
L., and Vijay-Shanker, K. (2013). Automatic genera-
tion of natural language summaries for java classes. In
2013 21st International Conference on Program Com-
prehension (ICPC). IEEE.
Nguyen-Duc, A., Cabrero-Daniel, B., Przybylek, A., Arora,
C., Khanna, D., Herda, T., Rafiq, U., Melegati, J.,
Guerra, E., Kemell, K.-K., Saari, M., Zhang, Z., Le,
H., Quan, T., and Abrahamsson, P. (2023). Genera-
tive artificial intelligence for software engineering – a
research agenda.
Phan, L., Tran, H., Le, D., Nguyen, H., Anibal, J., Peltekian,
A., and Ye, Y. (2021). Cotext: Multi-task learning
with code-text transformer.
Rai, S., Belwal, R. C., and Gupta, A. (2022). A review
on source code documentation. ACM Transactions on
Intelligent Systems and Technology, 13(5):1–44.
Sahoo, P., Singh, A. K., Saha, S., Jain, V., Mondal, S., and
Chadha, A. (2024). A systematic survey of prompt
engineering in large language models: Techniques and
applications.
Shin, J., Tang, C., Mohati, T., Nayebi, M., Wang, S., and
Hemmati, H. (2023). Prompt engineering or fine tun-
ing: An empirical assessment of large language mod-
els in automated software engineering tasks.
Sridhara, G., Hill, E., Muppaneni, D., Pollock, L., and Vijay-Shanker, K. (2010). Towards automatically generating summary comments for java methods. In Proceedings of the IEEE/ACM International Conference on Automated Software Engineering, ASE '10. ACM.
Su, Y., Wan, C., Sethi, U., Lu, S., Musuvathi, M., and Nath,
S. (2023). Hotgpt: How to make software documen-
tation more useful with a large language model? In
Proceedings of the 19th Workshop on Hot Topics in
Operating Systems, HOTOS ’23. ACM.
Tan, W. S., Wagner, M., and Treude, C. (2023). Wait, wasn’t
that code here before? detecting outdated software
documentation. In 2023 IEEE International Confer-
ence on Software Maintenance and Evolution (IC-
SME). IEEE.
Tan, W. S., Wagner, M., and Treude, C. (2024). Detecting outdated code element references in software repository documentation. Empirical Software Engineering, 29(1).
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones,
L., Gomez, A. N., Kaiser, L., and Polosukhin, I.
(2017). Attention is all you need.
Wang, X., Wei, J., Schuurmans, D., Le, Q., Chi, E., Narang,
S., Chowdhery, A., and Zhou, D. (2022). Self-
consistency improves chain of thought reasoning in
language models.
Wen, F., Nagy, C., Bavota, G., and Lanza, M. (2019). A
large-scale empirical study on code-comment incon-
sistencies. In 2019 IEEE/ACM 27th International
Conference on Program Comprehension (ICPC).
IEEE.
Xu, B., Yang, A., Lin, J., Wang, Q., Zhou, C., Zhang, Y.,
and Mao, Z. (2023a). Expertprompting: Instructing
large language models to be distinguished experts.
Yao, S., Yu, D., Zhao, J., Shafran, I., Griffiths, T. L., Cao,
Y., and Narasimhan, K. (2023). Tree of thoughts: De-
liberate problem solving with large language models.
Zheng, Z., Ning, K., Chen, J., Wang, Y., Chen, W., Guo, L.,
and Wang, W. (2023). Towards an understanding of
large language models in software engineering tasks.