Analyzing Declarative Deployment Code with Large Language Models

Giacomo Lanciano (1, a), Manuel Stein (2), Volker Hilt (2, b) and Tommaso Cucinotta (3, c)
(1) Scuola Normale Superiore, Pisa, Italy
(2) Nokia Bell Labs, Stuttgart, Germany
(3) Scuola Superiore Sant’Anna, Pisa, Italy
(a) https://orcid.org/0000-0002-7431-8041
(b) https://orcid.org/0000-0002-1826-8297
(c) https://orcid.org/0000-0002-0362-0657
Giacomo Lanciano was an intern at Nokia Bell Labs.

Keywords: Large Language Models, Infrastructure-as-Code, DevOps, Kubernetes, Machine Learning, Quality Assurance.
Abstract:
In the cloud-native era, developers have at their disposal an unprecedented landscape of services to build
scalable distributed systems. The DevOps paradigm emerged as a response to the increasing need for better automation, capable of dealing with the complexity of modern cloud systems. For instance, Infrastructure-as-
Code tools provide a declarative way to define, track, and automate changes to the infrastructure underlying
a cloud application. Assuring the quality of this part of a code base is of utmost importance. However,
learning to produce robust deployment specifications is not an easy feat, and it is time-consuming for domain experts to conduct code reviews and transfer the appropriate knowledge to novice members of the team.
Given the abundance of data generated throughout the DevOps cycle, machine learning (ML) techniques seem
a promising way to tackle this problem. In this work, we propose an approach based on Large Language
Models to analyze declarative deployment code and automatically provide QA-related recommendations to
developers, such that they can benefit from established best practices and design patterns. We developed a
prototype of our proposed ML pipeline, and empirically evaluated our approach on a collection of Kubernetes
manifests exported from a repository of internal projects at Nokia Bell Labs.
1 INTRODUCTION
During the last decade, cloud technologies have been
evolving at an impressive pace, such that we are
now living in a cloud-native era where developers
can leverage an unprecedented landscape of ad-
vanced services to build highly-resilient distributed
systems, providing compute, storage, networking,
load-balancing, security, monitoring and orchestra-
tion functionality, among others. To keep up with this
pace, development and operations practices have un-
dergone very significant transformations, especially
in terms of improving the automation that makes releasing new software, and responding to unforeseen issues, faster and more sustainable at scale. The result-
ing paradigm is nowadays referred to as DevOps (Al-
nafessah et al., 2021).
Quality assurance (QA) is obviously a fundamen-
tal part of the DevOps cycle. However, the complex-
ity of modern cloud frameworks and services makes
a developer’s job unprecedentedly hard. On top of
that, development teams are typically composed of people with very diverse backgrounds and varying levels of expertise. This makes adhering to best practices as a team anything but straightforward, because
transferring knowledge from experts to novice mem-
bers takes a lot of time. Therefore, in line with the
DevOps philosophy, automating this process as much
as possible seems the right approach. Indeed, there
exists a wide range of tools that provide (static) code
analysis functionality, and that can be seamlessly inte-
grated in existing continuous-integration/continuous-
delivery (CI/CD) pipelines to address QA concerns.
However, given the impressive abundance of data
generated throughout the DevOps cycle, applying ma-
chine learning (ML) techniques in this context seems
a promising path towards providing developers with
high-quality feedback and recommendations, auto-
matically.
When developing a cloud-native application, the
definition of its deployment plays a fundamental role.
Modern cloud management frameworks, like Kuber-
netes and OpenStack (two of the most well-known
open-source and widely adopted projects), typically
offer at least an Infrastructure-as-Code (IaC) solution
(e.g., deployment manifests and Heat templates, re-
spectively). Such a mechanism allows for specify-
ing the desired properties and the relations among the
components of the deployment via declarative code,
which can then be versioned and treated in the same
way as the code that implements the actual application
logic. It is obviously very important to follow best
practices when, e.g., specifying a Kubernetes deploy-
ment manifest, as failing to do so may lead the appli-
cations to experience many types of issues (Mumtaz
et al., 2021; Li et al., 2022). Static analysis tools for manifest files, such as Polaris (https://www.fairwinds.com/polaris) or Kubesec (https://kubesec.io/), allow for mitigating the risk that such issues may actually occur. However, they are typically designed to run relatively simplistic checks that do not take into account complex design patterns.
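To give a concrete idea of the kind of simplistic check such tools run, the following Python sketch (not the actual implementation of Polaris or Kubesec) flags the containers of a Kubernetes Deployment manifest that do not declare a CPU limit:

import sys
import yaml  # PyYAML

def cpu_limits_missing(manifest_path):
    """Return the names of Deployment containers that declare no CPU limit."""
    offenders = []
    with open(manifest_path) as f:
        for doc in yaml.safe_load_all(f):  # a manifest file may hold several documents
            if not doc or doc.get("kind") != "Deployment":
                continue
            containers = (doc.get("spec", {}).get("template", {})
                          .get("spec", {}).get("containers", []))
            for container in containers:
                limits = container.get("resources", {}).get("limits", {})
                if "cpu" not in limits:
                    offenders.append(container.get("name", "<unnamed>"))
    return offenders

if __name__ == "__main__":
    for name in cpu_limits_missing(sys.argv[1]):
        print(f"container '{name}' does not specify a CPU limit")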
In this work, we propose an approach to declar-
ative deployment code analysis based on Large Lan-
guage Models (LLMs), which can automatically provide
QA-related recommendations to developers, based on
established best practices and design patterns, build-
ing on top of standard (static) analysis approaches. To
the best of our knowledge, our approach is novel, in
the sense that we did not find in the research litera-
ture any other proposal to specialize LLMs on deploy-
ment code to specifically address QA-related con-
cerns. Also, while we mainly focus on deployment
code, it is interesting to consider that this information
could eventually be integrated with the other available
data sources in the DevOps cycle to consider a more
comprehensive picture, such as version control system history, code review feedback, test measurements and logs, etc.
This paper is organized as follows. Section 2 pro-
vides an overview of the existing related works in the
space of code analysis with LLMs. Section 3 presents
the main features of our proposed approach and the
prototype ML pipeline we implemented. Section 4
presents the results of our preliminary validation on
a set of Kubernetes manifest files exported from a
Nokia Bell Labs repository. Section 5 concludes the
paper and provides indications for future research di-
rections.
2 RELATED WORKS
In this work, we propose the use of Natural Lan-
guage Processing (NLP) models to detect architec-
tural smells and issues in declarative deployment
code. Indeed, language models are nowadays exten-
sively used in practice to analyze and generate source
code (Sharma et al., 2021; MacNeil et al., 2022). In
particular, we focus on LLMs, i.e., models based
on the transformer architecture (Vaswani et al., 2017),
consisting of millions, or even billions, of learnable
parameters. In recent years, this class of models has been gaining a lot of attention from the research community, due to its fascinating emergent properties, such as unsupervised multitask (Radford et al.,
2019) and few-shot (Brown et al., 2020) learning.
In (Zhang et al., 2021), the authors propose an
LLM-based approach to automatically fix textual and
semantic merge conflicts in a version-controlled code-
base. Their approach relies entirely on few-shot
learning, and exhibits remarkable performance with-
out requiring fine-tuning. In (Chen et al., 2021), the
authors propose Codex, a GPT (Radford et al., 2018)
model extensively fine-tuned on open-source code re-
trieved from GitHub, that exhibits remarkable per-
formance in generating source code when prompted
with the corresponding textual description. Simi-
larly, in (Heyman et al., 2021), the authors propose
an LLM-based approach to code generation that takes
into account both the code already written by devel-
opers and their intent, expressed in plain natural lan-
guage. In particular, such a model is empirically val-
idated on Python code generation for data science
applications. In (Shorten and Khoshgoftaar, 2023),
KerasBERT is proposed. Such a model is trained on
a considerable amount of code examples, notebooks,
blog posts and forum threads regarding the Keras
deep learning framework, to provide an automatic
tool to analyze and generate documentation for re-
lated code snippets. The authors of (Jain et al., 2022)
propose Jigsaw, an approach based on program syn-
thesis techniques, to post-process the source code
generated by specialized LLMs in order to provide
quality guarantees.
The work presented in (Thapa et al., 2022) demon-
strates how LLMs can also be used for detecting
software vulnerabilities. Indeed, the authors provide
the results of an empirical analysis, conducted on
vulnerability datasets for C/C++ source code, show-
ing how LLMs outperform other neural models like
those based on long short-term memory (LSTM) and
gated recurrent units (GRUs). Similarly, the authors
of (Demırcı et al., 2022) propose a malware detection
mechanism that leverages a combination of LSTM
and LLMs to discover malicious instructions in as-
sembly code.
In (Ma et al., 2022), the authors investigate the
reasons behind the emergent capability of LLMs to
learn code syntax and semantics. In particular, they
rely on Abstract Syntax Trees (AST) and static analy-
sis to deeply understand the role that the self-attention
mechanism plays in learning the dependencies among
code tokens. On a related note, in (Wan et al., 2022),
the authors approach the problem of interpreting pre-
trained LLMs for code analysis. Remarkably, their re-
sults show that, in a transformer architecture, the code
syntax structure is typically preserved in the interme-
diate representations of each layer and, as a result, that
such LLMs are able to induce ASTs.
The authors of (Sarsa et al., 2022) empirically
demonstrated how LLMs can be successfully used to
generate, and explain, code for programming exer-
cises that is both novel and reasonable. On the other
hand, in (Sontakke et al., 2022), the authors provide
evidence that models of the same type heavily rely on
contextual cues (e.g., natural-language comments, or
function names) and that, by masking such informa-
tion, their summarization performance drops signifi-
cantly.
The works referenced in this section generally use
LLMs to either provide general-purpose code genera-
tion solutions (e.g., (Chen et al., 2021; Heyman et al.,
2021)), or realize code analysis tools for specific pro-
gramming languages and/or frameworks (e.g., (Thapa
et al., 2022; Shorten and Khoshgoftaar, 2023)). How-
ever, none of them proposes an approach to detect, or
recommend, the usage of specific best practices and high-level design patterns, which are very important for
QA. Furthermore, none of the aforementioned works
specializes LLMs to analyze declarative deployment
code that, nowadays, is ubiquitously used to config-
ure modern cloud environments. Therefore, we be-
lieve our work addresses a very relevant problem and
constitutes an innovative solution.
3 PROPOSED APPROACH
Our work focuses on the analysis of Kubernetes de-
ployment manifest files. In particular, our goal is to
provide non-expert developers with recommendations
regarding the (mis-)usage of relevant Kubernetes ar-
chitectural patterns (e.g., the Operator pattern). We
identified a set of fundamental features that such a
tool should have in order to achieve our goal:
F1: Classifying good- and bad-quality manifests.
F2: Explaining which characteristics contribute
the most to the outcome of the classification.
F3: Pinpointing design smells and issues, and
possibly recommending a suitable fix.
F4: Leveraging the relations specified among
the components to detect highly complex archi-
tectural patterns.
We assume that a (possibly small) set of annotated
manifest examples is available. This is reasonable to
assume in a scenario where DevOps teams conduct
code reviews, such that useful annotations could even
be automatically extracted from the platform used for
such activities. Therefore, implementing F1 can be
approached as a supervised learning problem. In this
context, the notions of good and bad can be inter-
preted in many ways, also according to the nature of
the available annotations. An expert developer can
generally tell “at a glance” whether a manifest seems
to be poorly written or not. However, there are possibly many reasons why a specific manifest is problematic. Therefore, it may not actually be useful to
treat this problem as a simple binary classification
task. Indeed, both F2 and F3 are concerned with aug-
menting the quality of the recommendations. How-
ever, while F2 refers to the possibility to apply spe-
cific techniques (Atanasova et al., 2020; Tenney et al.,
2020; Hoover et al., 2020) to better interpret the out-
put of an arbitrary model, F3 entails that such a model
should be able to solve a more complex task than a
simple classification, in order to provide the end user
with fine-grained recommendations. Implementing
both F2 and F3 inherently requires a trade-off to be
made between the interpretability and the power/com-
plexity of the underlying ML model. Similarly, F4 is
concerned with endowing the model with the capabil-
ity of detecting more convoluted design patterns that
are not easily discoverable when looking at resources
in isolation. Given the set of desired features, and
the fact that the input data mainly consist of source
code (or text, in general), we believe that LLMs are
the most suitable tools to address our problem.
3.1 ML Pipeline
In order to realize the tools described in Section 3,
we propose the ML pipeline that is schematically depicted in Figure 1. Nowadays, Kubernetes is one of the most widely used cloud orchestration frameworks, and
definitely among the most important projects backed
by the Cloud Native Computing Foundation (CNCF).
Therefore, it is very easy to find large open-source
collections of high-quality deployment manifest files,
such as those from CNCF graduated and
incubating projects. On top of that, we have access
to a vast number of (confidential) deployment man-
ifests developed by Nokia Bell Labs research teams
and business units for their products.
Figure 1: The proposed ML pipeline (Kubernetes manifest collection, feature engineering, clustering, clustering assessment, architectural patterns detection).
However, in this case, the main data-quality problem is the scarcity of an-
notations that could be used to train supervised ML
models. To overcome this limitation, we propose
to use an (unsupervised) clustering approach (e.g.,
HDBSCAN (McInnes and Healy, 2017)) to try and
detect significant similarities among the manifests.
In general, clustering approaches are not designed to
handle textual data directly, so it is crucial to establish
a proper feature engineering process such that man-
ifests are transformed into appropriate feature vec-
tors, e.g., using standard tf-idf scores (Rajaraman and
Ullman, 2011). Even though the vector representa-
tions are typically very high-dimensional, the results
of the clustering process can be easily visualized us-
ing embedding techniques such as t-SNE (Maaten
and Hinton, 2008). These approaches are designed to
project high-dimensional spaces onto 2 or 3 dimen-
sions, while retaining the spatial relations among the
data points as much as possible. This way, an expert
can manually inspect some representatives from the
discovered clusters and provide initial annotations.
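As a minimal sketch of this part of the pipeline, the feature engineering, clustering and visualization steps can be realized with off-the-shelf libraries as follows; load_manifest_texts is a hypothetical helper returning the raw manifest strings, the top-10 principal components mirror the setting reported in Section 4, and min_cluster_size is just an illustrative choice:

import hdbscan  # https://hdbscan.readthedocs.io/
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.manifold import TSNE

manifests = load_manifest_texts()  # hypothetical loader returning a list of strings

# Feature engineering: tf-idf scores, reduced to the top-10 principal components.
tfidf = TfidfVectorizer().fit_transform(manifests).toarray()
features = PCA(n_components=10).fit_transform(tfidf)

# Density-based clustering; the label -1 marks points considered as noise.
cluster_labels = hdbscan.HDBSCAN(min_cluster_size=3).fit_predict(features)

# 2D projection for manual inspection of the discovered clusters.
embedding = TSNE(n_components=2).fit_transform(features)
plt.scatter(embedding[:, 0], embedding[:, 1], c=cluster_labels, cmap="tab10")
plt.show()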
Depending on the actual task to be solved, the
annotated data must then be transformed so that they can be consumed by a supervised learning
model. In the case of LLMs, there exist two main
strategies that can be used to solve a supervised learn-
ing task: fine-tuning or few-shot learning. LLMs gen-
erally require a very large amount of resources to
be trained, due to their impressive size, which directly
affects their computational complexity, and the (hu-
mongous) amount of textual training data needed to
make them exhibit the properties they are famous for.
Therefore, it is typically too expensive to train them
from scratch. However, provided that a checkpoint of
the weights of such a model is publicly available, it
is still possible to benefit from them to solve specific
tasks, even though the original training process was
optimized for another type of task and/or was con-
ducted on textual data unrelated to the application
domain. Indeed, one could choose the fine-tuning op-
tion, which is an example of transfer learning, and use
the original model as the initial part of a bigger archi-
tecture. The remaining part is typically optimized for
solving the problem at hand (e.g., a sequence classifi-
cation task), and trained using domain-specific textual
data. Such an approach may generally obtain impres-
sive performance even though the amount of available
data is small. On the other hand, LLMs trained for
causal language modeling (i.e., open-ended text generation) are also capable of few-shot learning. This property means that such a model is able to infer how to solve a given learning task, provided that its description and a few input-output examples can be specified as a textual prompt (see the examples provided in the Appendix). In this way, one does not even
need to develop (and allocate resources for) a training
pipeline, as the LLM is only used in inference mode.
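The following sketch illustrates the few-shot strategy on the classification task described in Section 4, using the prompt format reported in the Appendix; manifest_a, manifest_b and target_manifest are hypothetical YAML strings, and the generation settings are only indicative:

from transformers import pipeline

MODEL_NAME = "EleutherAI/gpt-j-6B"  # one of the checkpoints evaluated in Section 4
generator = pipeline("text-generation", model=MODEL_NAME, device=0)

# Hypothetical few-shot examples: (manifest YAML, label) pairs.
examples = [(manifest_a, "cpu_limit_positive"),
            (manifest_b, "cpu_limit_negative")]

prompt = ""
for manifest, label in examples:
    prompt += f"{manifest}\nAnswer = {label}\n#####\n"
prompt += f"{target_manifest}\nAnswer ="  # the model should complete the label

output = generator(prompt, max_new_tokens=5, do_sample=False,
                   return_full_text=False)
print(output[0]["generated_text"].strip())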
4 PRELIMINARY EXPERIMENTS
In order to validate the ideas presented in Section 3,
we developed some prototypes of the different parts
of the proposed pipeline, and conducted some prelim-
inary experiments considering a simplified version of
our problem. Specifically, we gathered a set of ~100 manifest files from internal Nokia Bell Labs projects
and ran our clustering pipeline on them. While our
initial intent was to try and see whether the cluster-
ing output exposed interesting similarities that could
be used to obtain a tentative data labeling, this step
was particularly useful to filter out some noise from
our data. Figure 2a shows our initial clustering re-
sults. As described in Section 3.1, we obtained tf-
idf-based representations of the manifests and used
Principal Component Analysis (PCA) to get the top-
10 dimensions, to limit the amount of data to be fed
to HDBSCAN. After using t-SNE to project the clus-
tered vectors onto a 2D space, we observed that our
data included a (strangely) regularly-shaped cluster
of manifests (in the top-right corner). Upon inspect-
ing the corresponding manifests, we realized that their
contents were not adding valuable information to our
analysis. Figure 2b reports the result we obtained by
re-running the clustering pipeline after filtering out
the uninteresting manifests.
Given the limited size of our data sam-
ple, we were not able to use the clustering re-
sults to derive interesting annotations at this stage.
Figure 2: Results of the clustering process, (a) before and (b) after filtering out the uninteresting manifest files. The clustered manifests are projected onto a 2D space by using t-SNE (i.e., the axes do not directly refer to any specific feature).
Therefore, we decided to run Polaris on our manifests and considered the output of the (boolean) cpuLimitsMissing check, which reports whether CPU usage limits were correctly specified for Kubernetes resources like Deployment and Service. In this way,
we were able to quickly obtain an annotated dataset that allowed us to reduce our problem to a binary se-
quence classification task and conduct some experi-
ments with LLMs. Given the impressive computa-
tional complexity of such models, we accelerated our
experiments using the following GPUs:
- NVIDIA Quadro RTX 6000 (Turing), 24 GB memory;
- NVIDIA Quadro RTX 8000 (Turing), 48 GB memory.
To run these experiments, we extensively leveraged the HuggingFace transformers library (Wolf et al., 2020) and the pre-trained model checkpoints available on the associated model hub (https://huggingface.co/models). Essentially, we focused our experiments on two LLMs: GPT-2 (medium, https://huggingface.co/gpt2-medium) and GPT-J-6B (https://huggingface.co/EleutherAI/gpt-j-6B).
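As a minimal sketch (not necessarily the exact code used in our experiments), the two checkpoints can be loaded from the model hub as follows; the float16 revision of GPT-J-6B keeps its memory footprint within the capacity of the smaller GPU:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# GPT-2 (medium) in its default precision.
gpt2_tokenizer = AutoTokenizer.from_pretrained("gpt2-medium")
gpt2_model = AutoModelForCausalLM.from_pretrained("gpt2-medium").to("cuda")

# GPT-J-6B in half precision (float16 revision of the checkpoint).
gptj_tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-j-6B")
gptj_model = AutoModelForCausalLM.from_pretrained(
    "EleutherAI/gpt-j-6B", revision="float16", torch_dtype=torch.float16
).to("cuda")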
The medium-sized version of GPT-2 (Radford et al., 2019) consists of 355M parameters and accepts a maximum of 1024 tokens as input. During
our few-shot learning tests, such an input token limit
allowed us to provide just a couple of examples, as
we had to save enough space for the actual input to be
processed (see the Appendix). Furthermore, given that the outputs we obtained were not related in any
way to the labels we specified in the prompt, we con-
cluded that this model is not particularly suitable for
declarative code analysis via few-shot learning. This
is likely due to the fact that the model was trained on
English natural language only, and probably never ob-
served any code example. However, as we were able
to run fine-tuning jobs even on our smaller GPU, we
believe that this model could be easily fine-tuned on a
bigger declarative code training set and yield signifi-
cant results, similarly to what was done in (Heyman et al.,
2021).
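A possible skeleton for such a fine-tuning job, reducing the problem to binary sequence classification, is sketched below; load_annotated_manifests is a hypothetical loader returning the manifest texts and their 0/1 annotations (e.g., derived from the cpuLimitsMissing check), and the hyper-parameters are only indicative:

import torch
from torch.utils.data import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

# Hypothetical loader returning (list of manifest strings, list of 0/1 labels).
manifests, labels = load_annotated_manifests()

tokenizer = AutoTokenizer.from_pretrained("gpt2-medium")
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no native padding token

class ManifestDataset(Dataset):
    def __init__(self, texts, labels):
        self.encodings = tokenizer(texts, truncation=True, max_length=1024,
                                   padding="max_length")
        self.labels = labels

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        item = {k: torch.tensor(v[idx]) for k, v in self.encodings.items()}
        item["labels"] = torch.tensor(self.labels[idx])
        return item

model = AutoModelForSequenceClassification.from_pretrained("gpt2-medium",
                                                           num_labels=2)
model.config.pad_token_id = tokenizer.pad_token_id

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="gpt2-medium-manifest-cls",
                           per_device_train_batch_size=2,
                           num_train_epochs=3,
                           fp16=True),
    train_dataset=ManifestDataset(manifests, labels),
)
trainer.train()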
As the name suggests, GPT-J-6B is instead a 6B-parameter model, inspired by the success of GPT-2/GPT-3, developed and open-sourced by EleutherAI (https://www.eleuther.ai/). Furthermore, the model is trained on
The Pile (Gao et al., 2020), an 800+ GB open dataset
containing a very diverse set of textual documents, in-
cluding source code. Given the significantly bigger
size and input token limit (2048), our few-shot learn-
ing tests were rather successful. We observed that
the model was indeed able to understand the specified
classification task and output a correct label in most
of the cases. Furthermore, we did not have any prob-
lem with running the float16 revision of the model
on our smaller GPU, as the model took only ~12 out
of the available 24 GB of memory. It is also quite
impressive that we obtained comparable results using
the 8-bit quantized version of the same model, which almost halves the memory requirements, by leveraging a recently-added feature of transformers (https://huggingface.co/blog/hf-bitsandbytes-integration), whose
details are described in (Dettmers et al., 2022). How-
ever, using the few-shot learning strategy still imposes
great limitations in terms of the number of training examples that the model can observe. As a result, the
ability of the model to generalize is in turn quite lim-
ited. At the time of writing, the 8-bit quantized ver-
sion seems not to support fine-tuning. Therefore, for
our fine-tuning tests, we used the float16 revision.
However, to avoid getting CUDA out-of-memory errors on our GPU setup, it was necessary to use DeepSpeed (https://www.deepspeed.ai/) (Rasley et al., 2020), a framework that leverages the Zero Redundancy Optimizer (ZeRO) (Rajb-
handari et al., 2020; Rajbhandari et al., 2021) to op-
timize model training memory footprint, either on a
single or multiple GPUs, at the expense of speed. Set-
ting up effective fine-tuning jobs, and properly assess-
ing the resulting model performance, is still a work in
progress. Contrary to few-shot learning tests, fine-tuning requires more, and better annotated, input data than the limited sample we were able to generate from a repository of internal projects at Nokia Bell Labs.
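The two memory-saving options discussed above can be sketched as follows; the DeepSpeed configuration file name is only a placeholder for an actual ZeRO configuration, and the remaining hyper-parameters are indicative:

from transformers import AutoModelForCausalLM, TrainingArguments

# 8-bit quantized loading for few-shot inference (requires the bitsandbytes
# and accelerate packages); roughly halves the memory footprint of GPT-J-6B.
gptj_int8 = AutoModelForCausalLM.from_pretrained(
    "EleutherAI/gpt-j-6B", load_in_8bit=True, device_map="auto")

# Fine-tuning the float16 revision with the memory optimizations of DeepSpeed
# ZeRO, enabled by pointing the HuggingFace Trainer to a configuration file
# ("ds_zero_config.json" is a placeholder).
training_args = TrainingArguments(
    output_dir="gpt-j-6b-manifest-cls",
    per_device_train_batch_size=1,
    fp16=True,
    deepspeed="ds_zero_config.json",
)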
5 CONCLUSIONS AND FUTURE
WORKS
In this work, we proposed a method for analyzing
declarative deployment code (specifically, Kubernetes
deployment manifest files), such that non-expert de-
velopers can benefit from design pattern recommen-
dations. To the best of our knowledge, our proposed
approach is a novel way to address QA-related is-
sues by specializing LLMs on declarative deployment
code analysis. We conducted a preliminary valida-
tion of our ML pipeline on a simplified version of the
problem, that shows that LLMs are indeed a viable
and promising option for achieving our end goal.
We plan to extend the approach beyond rec-
ommendations that can be obtained with standard
static analysis tools (e.g., Polaris), by considering
more convoluted design patterns and architectural
smells (Carrasco et al., 2018; Neri et al., 2020), that
involve a potentially large number of Kubernetes re-
sources, possibly taking into account also security
concerns (Ponce et al., 2022). In these regards, fram-
ing our problem as an extractive question-answering
task seems a promising avenue. However, it would
also be interesting to investigate the feasibility of
a hybrid approach that combines LLMs with other
types of models that can leverage existing struc-
tures (e.g., relations among Kubernetes resources) in
the input data, like Graph Neural Networks (Bacciu
et al., 2020). We also plan to conduct a more thor-
ough comparison of different types of LLMs and their
usage modes (e.g., few-shot learning vs fine-tuning vs
re-training). On a related note, it would be interesting
to explore methods for deriving more compact rep-
resentations of the inputs, to work around the maxi-
mum input tokens limit (e.g., YAML vs JSON encod-
ing; optimize tokenizers for declarative code, simi-
larly to the approach used for the natural language-
guided programming model proposed by (Heyman
et al., 2021)). As Kubernetes is not the only cloud
computing framework that leverages declarative
code for its configuration, we want to generalize our
approach to other forms of deployment configuration
files like, for instance, Heat Orchestration Templates
for OpenStack. Finally, we believe it would be inter-
esting to integrate active learning (Ren et al., 2021)
techniques into our approach, to help expert architects share and embed their knowledge into the underlying model.
REFERENCES
Alnafessah, A., Gias, A. U., Wang, R., Zhu, L., Casale, G.,
and Filieri, A. (2021). Quality-Aware DevOps Re-
search: Where Do We Stand? IEEE Access, 9:44476–
44489.
Atanasova, P., Simonsen, J. G., Lioma, C., and Augen-
stein, I. (2020). A Diagnostic Study of Explainabil-
ity Techniques for Text Classification. In Proceed-
ings of the 2020 Conference on Empirical Methods in
Natural Language Processing (EMNLP), pages 3256–
3274. Association for Computational Linguistics.
Bacciu, D., Errica, F., Micheli, A., and Podda, M. (2020). A
gentle introduction to deep learning for graphs. Neu-
ral Networks, 129:203–221.
Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D.,
Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G.,
Askell, A., Agarwal, S., Herbert-Voss, A., Krueger,
G., Henighan, T., Child, R., Ramesh, A., Ziegler, D.,
Wu, J., Winter, C., Hesse, C., Chen, M., Sigler, E.,
Litwin, M., Gray, S., Chess, B., Clark, J., Berner,
C., McCandlish, S., Radford, A., Sutskever, I., and
Amodei, D. (2020). Language Models are Few-Shot
Learners. In Advances in Neural Information Pro-
cessing Systems, volume 33, pages 1877–1901. Cur-
ran Associates, Inc.
Carrasco, A., Bladel, B. v., and Demeyer, S. (2018). Migrat-
ing towards microservices: migration and architecture
smells. In Proceedings of the 2nd International Work-
shop on Refactoring, pages 1–6. Association for Com-
puting Machinery.
Chen, M., Tworek, J., Jun, H., Yuan, Q., Pinto, H. P. d. O.,
Kaplan, J., Edwards, H., Burda, Y., Joseph, N., Brock-
man, G., Ray, A., Puri, R., Krueger, G., Petrov, M.,
Khlaaf, H., Sastry, G., Mishkin, P., Chan, B., Gray, S.,
Ryder, N., Pavlov, M., Power, A., Kaiser, L., Bavar-
ian, M., Winter, C., Tillet, P., Such, F. P., Cummings,
D., Plappert, M., Chantzis, F., Barnes, E., Herbert-
Voss, A., Guss, W. H., Nichol, A., Paino, A., Tezak,
N., Tang, J., Babuschkin, I., Balaji, S., Jain, S., Saun-
ders, W., Hesse, C., Carr, A. N., Leike, J., Achiam,
J., Misra, V., Morikawa, E., Radford, A., Knight, M.,
Brundage, M., Murati, M., Mayer, K., Welinder, P.,
McGrew, B., Amodei, D., McCandlish, S., Sutskever,
I., and Zaremba, W. (2021). Evaluating Large Lan-
guage Models Trained on Code.
Demırcı, D., Şahın, N., Şirlancis, M., and Acarturk, C.
(2022). Static Malware Detection Using Stacked BiL-
STM and GPT-2. IEEE Access, 10:58488–58502.
Dettmers, T., Lewis, M., Belkada, Y., and Zettlemoyer, L.
(2022). LLM.int8(): 8-bit Matrix Multiplication for
Transformers at Scale.
Gao, L., Biderman, S., Black, S., Golding, L., Hoppe, T.,
Foster, C., Phang, J., He, H., Thite, A., Nabeshima,
N., Presser, S., and Leahy, C. (2020). The Pile: An
800GB Dataset of Diverse Text for Language Model-
ing.
Heyman, G., Huysegems, R., Justen, P., and Van Cutsem,
T. (2021). Natural language-guided programming. In
Proceedings of the 2021 ACM SIGPLAN International
Symposium on New Ideas, New Paradigms, and Re-
flections on Programming and Software, pages 39–55.
Association for Computing Machinery.
Hoover, B., Strobelt, H., and Gehrmann, S. (2020).
exBERT: A Visual Analysis Tool to Explore Learned
Representations in Transformer Models. In Proceed-
ings of the 58th Annual Meeting of the Association for
Computational Linguistics: System Demonstrations,
pages 187–196. Association for Computational Lin-
guistics.
Jain, N., Vaidyanath, S., Iyer, A., Natarajan, N.,
Parthasarathy, S., Rajamani, S., and Sharma, R.
(2022). Jigsaw: large language models meet pro-
gram synthesis. In Proceedings of the 44th Inter-
national Conference on Software Engineering, pages
1219–1231. Association for Computing Machinery.
Li, R., Soliman, M., Liang, P., and Avgeriou, P. (2022).
Symptoms of Architecture Erosion in Code Reviews:
A Study of Two OpenStack Projects. In 2022 IEEE
19th International Conference on Software Architec-
ture (ICSA), pages 24–35.
Ma, W., Zhao, M., Xie, X., Hu, Q., Liu, S., Zhang, J., Wang,
W., and Liu, Y. (2022). Is Self-Attention Powerful to
Learn Code Syntax and Semantics?
Maaten, L. v. d. and Hinton, G. (2008). Visualizing Data
using t-SNE. Journal of Machine Learning Research,
9(86):2579–2605.
MacNeil, S., Tran, A., Mogil, D., Bernstein, S., Ross, E.,
and Huang, Z. (2022). Generating Diverse Code Ex-
planations using the GPT-3 Large Language Model.
In Proceedings of the 2022 ACM Conference on In-
ternational Computing Education Research - Volume
2, volume 2, pages 37–39. Association for Computing
Machinery.
McInnes, L. and Healy, J. (2017). Accelerated Hierarchical
Density Based Clustering. In 2017 IEEE International
Conference on Data Mining Workshops (ICDMW),
pages 33–42.
Mumtaz, H., Singh, P., and Blincoe, K. (2021). A system-
atic mapping study on architectural smells detection.
Journal of Systems and Software, 173:110885.
Neri, D., Soldani, J., Zimmermann, O., and Brogi,
A. (2020). Design principles, architectural smells
and refactorings for microservices: a multivocal re-
view. SICS Software-Intensive Cyber-Physical Sys-
tems, 35(1):3–15.
Ponce, F., Soldani, J., Astudillo, H., and Brogi, A. (2022).
Smells and refactorings for microservices security: A
multivocal literature review. Journal of Systems and
Software, 192:111393.
Radford, A., Narasimhan, K., Salimans, T., and Sutskever,
I. (2018). Improving Language Understanding by
Generative Pre-Training. Technical report.
Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., and
Sutskever, I. (2019). Language models are unsuper-
vised multitask learners.
Rajaraman, A. and Ullman, J. D. (2011). Data Mining. In
Mining of Massive Datasets, pages 1–17. Cambridge
University Press.
Rajbhandari, S., Rasley, J., Ruwase, O., and He, Y. (2020).
ZeRO: memory optimizations toward training trillion
parameter models. In Proceedings of the Interna-
tional Conference for High Performance Computing,
Networking, Storage and Analysis, pages 1–16. IEEE
Press.
Rajbhandari, S., Ruwase, O., Rasley, J., Smith, S., and He,
Y. (2021). ZeRO-infinity: breaking the GPU memory
wall for extreme scale deep learning. In Proceedings
of the International Conference for High Performance
Computing, Networking, Storage and Analysis, pages
1–14. Association for Computing Machinery.
Rasley, J., Rajbhandari, S., Ruwase, O., and He, Y. (2020).
DeepSpeed: System Optimizations Enable Training
Deep Learning Models with Over 100 Billion Param-
eters. In Proceedings of the 26th ACM SIGKDD Inter-
national Conference on Knowledge Discovery & Data
Mining, pages 3505–3506. Association for Comput-
ing Machinery.
Ren, P., Xiao, Y., Chang, X., Huang, P.-Y., Li, Z., Gupta,
B. B., Chen, X., and Wang, X. (2021). A Survey
of Deep Active Learning. ACM Computing Surveys,
54(9):180:1–180:40.
Sarsa, S., Denny, P., Hellas, A., and Leinonen, J. (2022).
Automatic Generation of Programming Exercises and
Code Explanations Using Large Language Models. In
Proceedings of the 2022 ACM Conference on Inter-
national Computing Education Research - Volume 1,
volume 1, pages 27–43. Association for Computing
Machinery.
Sharma, T., Kechagia, M., Georgiou, S., Tiwari, R., and
Sarro, F. (2021). A Survey on Machine Learning
Techniques for Source Code Analysis.
Shorten, C. and Khoshgoftaar, T. M. (2023). Language
Models for Deep Learning Programming: A Case
Study with Keras. In Wani, M. A. and Palade, V.,
editors, Deep Learning Applications, Volume 4, pages
135–161. Springer Nature.
Sontakke, A. N., Patwardhan, M., Vig, L., Medicherla,
R. K., Naik, R., and Shroff, G. (2022). Code Summa-
rization: Do Transformers Really Understand Code?
In Deep Learning for Code Workshop.
Tenney, I., Wexler, J., Bastings, J., Bolukbasi, T., Coenen,
A., Gehrmann, S., Jiang, E., Pushkarna, M., Rade-
baugh, C., Reif, E., and Yuan, A. (2020). The Lan-
guage Interpretability Tool: Extensible, Interactive
Visualizations and Analysis for NLP Models. In Pro-
ceedings of the 2020 Conference on Empirical Meth-
ods in Natural Language Processing: System Demon-
strations, pages 107–118. Association for Computa-
tional Linguistics.
Thapa, C., Jang, S. I., Ahmed, M. E., Camtepe, S.,
Pieprzyk, J., and Nepal, S. (2022). Transformer-Based
Language Models for Software Vulnerability Detec-
tion. In Proceedings of the 38th Annual Computer Se-
curity Applications Conference, pages 481–496. As-
sociation for Computing Machinery.
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones,
L., Gomez, A. N., Kaiser, L., and Polosukhin, I.
(2017). Attention is All you Need. In Advances in
Neural Information Processing Systems, volume 30.
Curran Associates, Inc.
Wan, Y., Zhao, W., Zhang, H., Sui, Y., Xu, G., and Jin, H.
(2022). What do they capture? a structural analysis
of pre-trained language models for source code. In
Proceedings of the 44th International Conference on
Software Engineering, pages 2377–2388. Association
for Computing Machinery.
Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C.,
Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz,
M., Davison, J., Shleifer, S., von Platen, P., Ma, C.,
Jernite, Y., Plu, J., Xu, C., Le Scao, T., Gugger, S.,
Drame, M., Lhoest, Q., and Rush, A. (2020). Trans-
formers: State-of-the-Art Natural Language Process-
ing. In Proceedings of the 2020 Conference on Empir-
ical Methods in Natural Language Processing: Sys-
tem Demonstrations, pages 38–45. Association for
Computational Linguistics.
Zhang, J., Mytkowicz, T., Kaufman, M., Piskac, R., and
Lahiri, S. K. (2021). Can Pre-trained Language Mod-
els be Used to Resolve Textual and Semantic Merge
Conflicts?
APPENDIX
FEW-SHOT LEARNING EXAMPLES
Listing 1: Example of prompt used for few-shot learning.
apiVersion: apps/v1
kind: Deployment
spec:
  replicas: 1
  strategy: {}
  template:
    spec:
      containers:
      - image: aaa-docker-registry.com/image-name-aaa:0.1_dev
        name: image-name-aaa
        ports:
        - name: http
          containerPort: 80
          protocol: TCP
        resources:
          requests:
            cpu: 0.1
            memory: 2Mi
          limits:
            cpu: 2
            memory: 5Gi
      restartPolicy: Always
status: {}
Answer = cpu_limit_positive
#####
apiVersion: apps/v1
kind: Deployment
spec:
  replicas: 1
  template:
    spec:
      containers:
      - image: bbb-docker-registry.com/image-name-bbb:1.0.0
        name: image-name-bbb
Answer = cpu_limit_negative
#####