AsmDocGen: Generating Functional Natural Language Descriptions for Assembly Code

Jesia Quader Yuki, Mohammadhossein Amouei (1, a), Benjamin C. M. Fung (1, b), Philippe Charland (2, c) and Andrew Walenstein (3, d)

(1) School of Information Studies, McGill University, Montreal, QC, Canada

(2) Mission Critical Cyber Security Section, Defence R&D Canada, Quebec, QC, Canada

(3) BlackBerry Limited, Waterloo, ON, Canada

Keywords:

Assembly Code, Reverse Engineering, CodeBERT, Transformers, Code Summarization.

Abstract:

This study explores the ﬁeld of software reverse engineering through the lens of code summarization, which

involves generating informative and concise summaries of code functionality. A signiﬁcant aspect of this re-

search is the application of assembly code summarization in malware analysis, highlighting its critical role

in understanding and mitigating potential security threats. Although there have been recent efforts to develop

code summarization techniques for high-level programming languages, to the best of our knowledge, this

study is the ﬁrst attempt to generate comments for assembly code. For this purpose, we ﬁrst built a carefully

curated dataset of assembly function-comment pairs. We then focused on automatic assembly code summa-

rization using transfer learning with pre-trained natural language processing (NLP) models, including BERT,

DistilBERT, RoBERTa, and CodeBERT. The results of our experiments show a notable advantage of Code-

BERT: despite its initial training on high-level programming languages alone, it excels in learning assembly

language, outperforming other pre-trained NLP models.

1 INTRODUCTION

Reverse engineering refers to the process of analyzing

an existing software system to recover its design: to

understand its functionality, design, and implementa-

tion details. This technique involves examining soft-

ware code, system behavior, and dependencies to cre-

ate a representation of the architecture and function-

ality of the system. It is often used to update or

improve existing systems, create documentation, or

build new software applications compatible with the

original system. Reverse engineering can also help

detect malicious software or potential vulnerabilities.

a https://orcid.org/0000-0003-4208-4014

b https://orcid.org/0000-0001-8423-2906

c https://orcid.org/0000-0003-4051-9942

d https://orcid.org/0000-0003-1103-2465

Not all reverse engineering efforts have the luxury of starting with source code, as software source code may not always be available to reverse engineers for several reasons, such as the code being proprietary or protected by copyright laws. In addition, the code may be lost or corrupted, making it impossible to analyze it directly. However, when starting with just the

executable, we have effective techniques to recover an

assembly-language representation, so understanding

assembly code is a common task in reverse engineer-

ing. But understanding assembly code is more com-

plicated than understanding high-level programming

languages for several reasons. Assembly code is a

low-level language consisting of complex instructions

that can be challenging to read and understand. It

typically lacks the abstraction and structure found in

high-level programming languages, making it harder

to identify program ﬂow and comprehend the overall

purpose of the code. As a result of the unavailability

of source code and the complexity of assembly code,

there is a greater need for automated assistance to the

reverse engineer trying to understand assembly code.

Yuki, J., Amouei, M., Fung, B., Charland, P. and Walenstein, A.

AsmDocGen: Generating Functional Natural Language Descriptions for Assembly Code.

DOI: 10.5220/0012761400003753

Paper copyright by His Majesty the King in Right of Canada as represented by the Minister of National Defence

In Proceedings of the 19th International Conference on Software Technologies (ICSOFT 2024), pages 35-45

ISBN: 978-989-758-706-1; ISSN: 2184-2833

A valuable tool for reverse engineering is code summarization. Also known as "code commenting", it concerns generating a concise and informative summary of a piece of code's functionality or behavior. Code summarization techniques use natural language processing and machine learning algorithms to analyze the syntax, structure, and comments

of the code to generate a human-readable and easy-

to-understand summary (Steidl et al., 2013). The gen-

erated summary can provide a rapidly reviewed pro-

posal of the code’s likely functionality, helping the

reverse engineer identify potential ﬂaws and more

rapidly understand the functionality and behavior of

a large and complex software codebase (Tenny, 1988;

Woodﬁeld et al., 1981). Several prior works have in-

dicated effective advances in code summarization, in-

cluding (Allamanis et al., 2016; LeClair et al., 2019;

Ahmad et al., 2020; Kusupati and Ailavarapu, 2022;

Wang et al., 2020).

However, little appears to be known about uti-

lizing such summarization techniques on assembly

language, speciﬁcally: about how well transformer

technologies work, whether ML models trained on

non-assembly corpora can effectively be leveraged

through transfer learning, and what qualities are re-

quired of training corpora to yield effective learning

performance. We aimed to explore how pre-trained

NLP models can be trained for the speciﬁc task of

summarizing assembly code, leveraging their existing

knowledge and experience with natural language pro-

cessing tasks. By comparing the performance of dif-

ferent pre-trained NLP models, we aimed to identify

which models are most effective for transfer learning

in this context, providing insight into how best to uti-

lize pre-trained models for this speciﬁc task.

The contributions of this paper relating to these

research problems are as follows:

• We propose AsmDocGen, a CodeBERT-based so-

lution that generates human-readable comments

for assembly functions. AsmDocGen makes as-

sembly code easier to understand by automati-

cally creating clear comments for it. This is an

important step forward in making complex code

more accessible and easier to work with, espe-

cially in areas like software reverse engineering.

Our experiments show that the comments gener-

ated are accurate and useful. This progress is cru-

cial to improving the way we handle and docu-

ment lower-level programming languages. To the

best of our knowledge, we are the ﬁrst to report

code summarization techniques applied to assem-

bly code, so the performance results serve as a

baseline for future work in the area.

• We provide a well-curated dataset of 5,084 assem-

bly function comment pairs for training and val-

idating assembly code summarization solutions.

As subject matter experts, we handpicked and

manually edited assembly function comment pairs

to create our dataset. The resulting dataset pro-

vides a valuable resource for future research on

assembly code summarization.

• We provide evidence that NLP models pre-trained

on other corpora can be successfully retrained and

tuned to be applied to assembly code. Speciﬁ-

cally, we show that CodeBERT can be re-trained

to understand assembly language and generate

concise comments, describing the functionality of

a piece of assembly code by leveraging its knowl-

edge of high-level programming languages.

The rest of this paper is organized as follows. Sec-

tion 2 brieﬂy reviews the literature on NLP and code

summarization. In Section 3, we explain our data col-

lection strategy and dataset. Section 4 details our pro-

posed method. In Sections 5 and 6, we outline the

experiments and results we achieved. Finally, Section

7 concludes the article.

2 RELATIONS TO PRIOR WORK

AsmDocGen adapts and extends previous work on so-called "transformer" architectures for NLP. Transformer-based approaches have been shown to be effective in a wide range of NLP tasks, such as text summarization and chatbots (Vaswani et al., 2017; Brown et al., 2020; Devlin et al., 2019). Recent advances, including novel pretraining techniques (Lewis et al., 2020; Radford et al., 2018), have resulted in a state of the art in which transformer-based solutions have been shown to match or exceed previous language modeling techniques for many NLP tasks, including but not limited to text summarization, translation, dialogue generation, and code summarization.

2.1 Text Summarization

Sarkar (Sarkar, 2013) proposed a method to summa-

rize a document by extracting its main concepts. This

approach aims to provide an overall understanding of

the content of a document. Similarly, Christian et

al. (Christian et al., 2016) created an automatic text

summarizer using the TF-IDF algorithm. The TF-

IDF algorithm measures the importance of each word

in a document, and the words with the highest TF-

IDF scores are used to construct the summary. Verma

and Om (Verma and Om, 2018) presented a novel

technique for extracting multi-document summaries

based on Shark Smell Optimization (SSO). The ap-

proach is based on graph-based optimization and aims

to produce concise and coherent summaries. Liu and

Lapata (Liu and Lapata, 2019) applied BERT to the

summarization task and showed that it can perform

well for both extractive and abstractive summariza-

tion tasks. Gupta et al. (Gupta et al., 2022) applied

a few pre-trained models such as BERT, GPT, and

RoBERTa for text summarization. Our work is akin

to text summarization in the sense that we generate

text that purposefully omits details to offer a concise

representation of important elements in the original

work; however, it differs from text summarization in

that the language and even language type of the sum-

marization (natural language, English say) is different

from the source language (assembly).

2.2 Translation

Neural encoder-decoder models, and more recently transformers, have been widely adopted in neural machine translation. Researchers applied recurrent encoder-decoder models to the translation of English into French, as demonstrated by Sutskever et al. and Cho et al. (Sutskever et al., 2014; Cho et al., 2014). Bahdanau et al. (Bahdanau et al., 2014) extended the basic encoder-decoder with an attention mechanism for the English-to-French translation task. Furthermore, Gao

et al. (Gao et al., 2021) introduced a Scalable Trans-

formers model and demonstrated its effectiveness in

translating English to German and English to French.

Like such prior work, AsmDocGen translates seman-

tic content from one language to another, but unlike

translation, where faithful representation of the con-

tent is expected, we purposefully reduce the content

of the output compared to the input for the purposes

of summarization.

2.3 Dialogue Generation and Grammar

Olabiyi and Mueller (Olabiyi et al., 2020) presented

DLGNet, a transformer-based model for dialogue

modeling. Lee et al. (Lee et al., 2021) used a Transformer with a copying mechanism that outperformed two commercial grammar checkers and other NMT-based models. Cao et al. (Cao et al., 2020) investigated dialogue models with numerous input sources

modiﬁed from the pretrained language model GPT2.

They evaluated several strategies for fusing multiple

sources of attention information. Their experimen-

tal results reveal that correct fusion procedures out-

perform simple fusion baselines in terms of relevance

with the dialogue history.

2.4 Code Summarization

Iyer et al. (Iyer et al., 2016) presented CODE-NN, a

novel method that uses LSTM and an attention proce-

dure to generate summaries of C# code snippets and

SQL queries. Hu et al. (Hu et al., 2018b) presented

DeepCom, a model that examines the structural infor-

mation of Java methods to generate better comments.

Furthermore, Hu et al. (Hu et al., 2018a) developed

TL-CodeSum, an RNN-based model that effectively

used API knowledge in conjunction with source code

to generate code summarization.

Allamanis et al. (Allamanis et al., 2016) pre-

sented a unique convolutional self-attention network

to perform extreme summarization based on source

code where “extreme” denotes the production of ex-

tremely brief messages. LeClair et al. (LeClair et al.,

2019) use an attentional GRU encoder-decoder model

to produce summaries for code. Ahmad et al. (Ah-

mad et al., 2020) showed that relative encoding sig-

niﬁcantly enhances summarization efﬁciency by us-

ing transformers to generate a comprehensible sum-

mary that represents the functionality of a program.

Similarly, Kusupati and Ailavarapu (Kusupati and

Ailavarapu, 2022) used transformers for code sum-

marization. PYMT5, the PYTHON method text-to-

text transfer transformer, is presented by Clement et

al. (Clement et al., 2020). This model has the ca-

pability to predict complete methods based on natu-

ral language documentation strings (docstrings), and

it can also condense code into docstrings of various

conventional styles.

Some advanced pre-trained language mod-

els, such as BERT (Devlin et al., 2019), XL-

Net (Yang et al., 2019), GPT (Brown et al.,

2020), RoBERTa (Liu and Lapata, 2019), and

CodeBERT (Feng et al., 2020), have seemed promis-

ing for pairing comments with code (Liu and Lapata,

2019; Husain et al., ). This has inspired different

researchers to outperform those state-of-the-art mod-

els by employing pre-trained language models in the

task of code summarization (Barone and Sennrich,

2021; Wang et al., 2020) or natural language-based

code search (Gu et al., 2018).

Our study is distinctive in its focus on generating

comments that summarize the functionality of assem-

bly code, which has received little attention in previ-

ous research on code summarization. Assembly code

can be particularly challenging to comprehend, due to

its high complexity and lack of high-level abstraction.

2.5 Reference Code Corpora

One of our contributions is a reference code corpus for training or validating software engineering tools, in the spirit of corpora such as CoNaLa (Yin et al., 2018) and Bellon's clone detector corpus (Bellon et al., 2007). Our corpus, available at https://github.com/McGill-DMaS/AsmDocGen under a permissive license, is expected to be a valuable resource for further studies in the area.

Table 1: Number of functions taken from various online platforms.

Platform         No. of functions
GitHub           2428
HackerRank       1505
StackOverflow    574
Codeforces       423
Codechef         154

3 DATASET

The absence of a comprehensive and well-curated

dataset has made it difﬁcult to train machine learning

and deep learning models for assembly code summa-

rization. To overcome this obstacle, we have carefully

curated a dataset of assembly function comment pairs.

To create our dataset, we curated a selection of as-

sembly code functions and their corresponding com-

ments. Initially, we collected a diverse range of C

and C++ source code samples from well-known on-

line sources such as GitHub, Codeforces, StackOver-

ﬂow, HackerRank, and Codechef.

After collecting the source code samples, we man-

ually examined each ﬁle to ensure correct match-

ing between functions and their corresponding com-

ments. Once we ﬁnished curating the source code

comment pairs, we compiled the source code using

the GCC compiler. Then, we used IDA Pro to dis-

assemble the resulting executables. Finally, we man-

ually correlated assembly functions with their corre-

sponding source code functions to identify the orig-

inal source comments that match the disassembled

functions.

During the matching process, we found many

unexplained functions within the source code, and

among those with comments, we observed that many

comments suffer from poor quality. Thus, we man-

ually eliminated low-quality comments. The charac-

teristics of those comments classiﬁed as poor quality

are:

• Stating the obvious without providing additional

insights or details about the functionality.

• Being obsolete and not corresponding to the cur-

rent version of the code.

• Being overly wordy or containing excessively

technical language.

• Having grammatical mistakes.

• Containing irrelevant information.

• Being unclear or ambiguous.

• Being misleading or incorrect.

Figure 1: Snippet of the dataset.

Overall, the process of building a high-quality

dataset involved four months of work with meticulous

attention to detail. As a result of our effort, we col-

lected 5,084 assembly function comment pairs (see

Table 1); Figure 1 shows an excerpt from our dataset.

4 THE METHOD: AsmDocGen

Given the limited size of our assembly language comment dataset, it is not feasible to train large NLP models from scratch. Further, it is not clear that it is necessary to construct a dataset of assembly-comment pairs extensive enough to train, on its own, a highly performant comment generator. Instead, we hypothesize that transfer learning, that is, taking a suitable foundation NLP model and retraining it on our targeted dataset, can yield significantly better performance than either using non-assembly-based foundation models directly or training an NLP model from scratch on a dataset as limited as ours.

We chose CodeBERT (Feng et al., 2020) as our foundation to build AsmDocGen. CodeBERT is a state-of-the-art pretrained language model based on the Transformer architecture. It has been trained on a vast corpus of English text and source code in six high-level languages (Python, Java, JavaScript, PHP, Ruby, and Go), making it one of the most versatile pre-trained models available and adept at understanding code structure and function. CodeBERT's flexibility stems from its ability to capture both syntactic and semantic information from natural language and programming language inputs. Therefore, we hypothesize that CodeBERT's ability to capture semantic representations for natural and programming languages is advantageous for understanding assembly code's unique grammar for assembly code summarization, making it a more effective choice for training on small datasets than training models from scratch. Furthermore, the research conducted by Zhou et al. (Zhou and Su, 2002) has confirmed the ability of CodeBERT to adapt beyond its pre-trained data. This ability makes CodeBERT a more robust and adaptable model, enabling it to perform well on new tasks and domains.

We expect that the semantic representations

learned by CodeBERT are particularly beneﬁcial for

understanding assembly code. Assembly code has

a unique grammar for constructing instructions and

linking operations and operands. CodeBERT’s abil-

ity to capture these structures and their relationships

makes it an effective tool for generating accurate com-

ment representations.

4.0.1 Model Background

CodeBERT is based on the transformer architecture,

similar to the original BERT model. It has a multi-

layer transformer encoder, which takes as input code

and comments. To tokenize the input, CodeBERT

uses the WordPiece tokenization method for both

code and comments (Wu et al., 2016). The input to the

model consists of a sequence of tokens, which are the

individual words and symbols of the code and com-

ments.

CodeBERT’s training objectives include Masked

Language Modeling (MLM) and Replaced Token De-

tection (RTD). MLM involves masking parts of the

text at random and requesting the model to predict

them, while RTD involves replacing tokens in the text

with plausible alternatives and having the model de-

termine which tokens have been replaced. This ap-

proach improves the robustness of the model by al-

lowing it to handle variations in input data.
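The two objectives can be illustrated with a toy corruption routine. This is a minimal sketch and not CodeBERT's actual pre-training code: the function names, the 15% corruption rate, and the use of uniform random sampling for replacements are assumptions made for illustration (in replaced token detection proper, the plausible alternatives come from a generator network).

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """MLM-style corruption: hide some tokens and remember the originals."""
    rng = random.Random(seed)
    corrupted, targets = [], {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            corrupted.append(MASK)
            targets[i] = tok  # the model would be asked to predict this token
        else:
            corrupted.append(tok)
    return corrupted, targets

def replace_tokens(tokens, vocab, replace_prob=0.15, seed=0):
    """RTD-style corruption: swap some tokens and label every position.

    Assumption: replacements are drawn uniformly from `vocab`, which must
    contain at least two distinct tokens.
    """
    rng = random.Random(seed)
    corrupted, labels = [], []
    for tok in tokens:
        if rng.random() < replace_prob:
            alternatives = [v for v in vocab if v != tok]
            corrupted.append(rng.choice(alternatives))
            labels.append(1)  # 1 = replaced
        else:
            corrupted.append(tok)
            labels.append(0)  # 0 = original
    return corrupted, labels
```

In the MLM case the training signal is the dictionary of hidden tokens; in the RTD case it is the per-position 0/1 label.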

In addition to MLM and RTD, CodeBERT in-

cludes a Cross-Lingual Language Model (XLM) ob-

jective, which enables it to learn cross-lingual repre-

sentations by jointly training on monolingual and par-

allel data. This feature makes CodeBERT particularly

useful for natural language processing tasks involving

multilingual inputs.

In the pre-training phase, the input is set as the concatenation of two segments with special separator tokens. The input format is $[CLS], w_1, w_2, \ldots, w_n, [SEP], c_1, c_2, \ldots, c_m, [EOS]$, where [CLS] is added at the beginning of each sentence to capture the sentence representation. The [SEP] token separates the code and comment tokens, making it easier for BERT to understand that the input is made up of two parts, code and comments. The [EOS] token is used to indicate the end of a sentence.

The input tokens are passed through an embedding layer, where they are transformed into numerical vectors that capture the meaning of the tokens. For each token, three types of embeddings are combined to form a single input vector for the model: segmentation embeddings, position embeddings, and token embeddings. Figure 2 demonstrates how the tokens are passed through the various embedding layers to form a single input vector. The roles of these three types of embedding are as follows:

• Segmentation embeddings are used to differenti-

ate between distinct lines of code.

• Position embeddings show the position of each

token inside the line of code.

• Token embeddings refer to the semantics of each

token.

Figure 2: Illustration of the multi-layered process used by

the CodeBERT model to transform individual tokens into

embeddings via various layers.
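The way the three embeddings combine into one input vector per token can be sketched in a few lines. This is an illustrative sketch under stated assumptions, not CodeBERT's implementation: the lookup tables are randomly initialised rather than learned, the dimension of 4 is arbitrary, the embeddings are combined by element-wise summation as in BERT-style models, and segment ids follow the BERT convention of 0 for the first segment and 1 for the second. All function names are hypothetical.

```python
import random

def build_input(first_segment, second_segment):
    """Join two token segments with the special separator tokens."""
    return ["[CLS]", *first_segment, "[SEP]", *second_segment, "[EOS]"]

def segment_ids(tokens):
    """0 for positions up to and including [SEP], 1 afterwards."""
    ids, current = [], 0
    for tok in tokens:
        ids.append(current)
        if tok == "[SEP]":
            current = 1
    return ids

def make_table(num_rows, dim, rng):
    """A randomly initialised lookup table (learned in a real model)."""
    return [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(num_rows)]

def embed(tokens, vocab, dim=4, seed=0):
    """Sum token, position, and segment embeddings for every token."""
    rng = random.Random(seed)
    token_table = make_table(len(vocab), dim, rng)
    position_table = make_table(len(tokens), dim, rng)
    segment_table = make_table(2, dim, rng)
    vectors = []
    for pos, (tok, seg) in enumerate(zip(tokens, segment_ids(tokens))):
        t = token_table[vocab[tok]]
        p = position_table[pos]
        s = segment_table[seg]
        vectors.append([a + b + c for a, b, c in zip(t, p, s)])
    return vectors

tokens = build_input(["movl", "esp", "ebp"], ["check", "number"])
vocab = {tok: i for i, tok in enumerate(sorted(set(tokens)))}
vectors = embed(tokens, vocab)  # one vector of length 4 per token
```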

CodeBERT is pre-trained on a dataset of code

and comment pairs using masked language modelling

and replaced token detection objectives. The training process involves two neural networks, a generator G and a discriminator D. The encoder of both networks, usually a transformer network, converts a sequence of embedding tokens $x = [x_1, \ldots, x_n]$ into a sequence of contextualized vector representations $h(x) = [h_1, \ldots, h_n]$. These embeddings are then passed

through the transformer encoder, which consists of

multiple layers of self-attention and feed-forward

neural networks. The self-attention layers enable the

model to focus on distinct segments of the input se-

quence, while the feed-forward layers help the model

acquire a deeper understanding of the relationships

between the individual tokens.

The ﬁnal output of the transformer encoder is a set

of embeddings for each token in the input sequence,

which captures the meaning of the code and com-

ments in a ﬁxed-length vector representation. These

embeddings are then used to train the model for code

summarization tasks.

In general, the overall structure of CodeBERT for code summarization is an encoder-decoder structure, where the encoder is the transformer-based neural network, and the decoder is the task-specific network that generates the summary of the code based on the encoded representation. We train this encoder-decoder model using our dataset, which consists of assembly functions and their corresponding ground-truth comments.

[Figure 3 diagram; labels: CodeBERT Encoder, Contextual Embedding, Decoder; example inputs "[CLS] movl [MASK] ebp" and "[CLS] check [MASK] [EOS]", with "esp" and "number" shown for the masked positions.]

Figure 3: Overview of training AsmDocGen. During the test, the comment is removed from the input and appears on the model's output.

4.1 Training AsmDocGen

Dataset Construction. We collected a comprehen-

sive dataset of assembly code functions according to

the description in Section 3. We then tokenized the

dataset using the WordPiece algorithm, which is well

suited for tokenizing both natural language and code

(Gu et al., 2018). Assembly language has its own

unique syntax and vocabulary, and we ﬁnd that Word-

Piece is the optimal choice for our problem. The al-

gorithm enables the encoding of any unusual words

in the lexicon with suitable subword tokens, without

requiring the addition of any "unknown" tokens.
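The subword behaviour described above can be sketched with the greedy longest-match-first procedure commonly used for WordPiece inference. This is a simplified illustration, not the tokenizer used in our pipeline: the tiny vocabulary, the "##" continuation prefix, and the "[UNK]" fallback are assumptions borrowed from BERT-style tokenizers.

```python
def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    """Split one word into the longest subword pieces found in `vocab`.

    Pieces that do not start the word carry a "##" prefix. If some part of
    the word cannot be matched at all, the whole word maps to `unk`.
    """
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        while start < end:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece
            if piece in vocab:
                match = piece
                break
            end -= 1  # try a shorter candidate
        if match is None:
            return [unk]
        pieces.append(match)
        start = end
    return pieces

# Hypothetical vocabulary for illustration only.
vocab = {"mov", "##l", "##zx", "e", "##a", "##x", "push"}
print(wordpiece_tokenize("movl", vocab))   # ['mov', '##l']
print(wordpiece_tokenize("movzx", vocab))  # ['mov', '##zx']
print(wordpiece_tokenize("eax", vocab))    # ['e', '##a', '##x']
```

With a sufficiently rich subword vocabulary, unusual mnemonics and register names decompose into known pieces, so the fallback token is rarely needed.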

To maintain consistency in sentence length, we

utilize padding and truncation techniques to generate

sentences of a uniform length of 100 tokens. We de-

termined this length by conducting experiments with

varying sentence lengths and evaluating the trade-off

between the model’s accuracy and training time.
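A minimal sketch of this length normalisation is shown below. The function name, the padding id of 0, and the accompanying attention mask are assumptions for illustration; only the target length of 100 tokens comes from our setup.

```python
def pad_or_truncate(token_ids, max_len=100, pad_id=0):
    """Return (ids, mask), each exactly `max_len` long.

    Sequences longer than `max_len` are cut off at the end; shorter ones
    are right-padded with `pad_id`. The mask marks real tokens with 1 and
    padding with 0.
    """
    clipped = list(token_ids[:max_len])
    n_pad = max_len - len(clipped)
    ids = clipped + [pad_id] * n_pad
    mask = [1] * len(clipped) + [0] * n_pad
    return ids, mask
```

The mask lets a model ignore padded positions, so padding does not influence the learned representation of the real tokens.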

Training. To generate code summaries, we used a sequence-to-sequence pipeline in which the encoder was initialized from CodeBERT, which comprises 12 layers, a maximum sequence length of 512, an embedding size of 768, and 12 attention heads of size 64. For the decoder, we used a randomly initialized Transformer with 6 layers, hidden states of 768 dimensions, and 12 attention heads.
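The architecture figures above are mutually consistent: 12 attention heads of size 64 account for the full 768-dimensional hidden state. The sketch below records the stated values in a small configuration object and checks that arithmetic; the class and field names are hypothetical and do not correspond to any library's configuration API.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Seq2SeqConfig:
    """Architecture values as stated in the text (names are illustrative)."""
    encoder_layers: int = 12
    decoder_layers: int = 6
    hidden_size: int = 768
    attention_heads: int = 12
    max_positions: int = 512

    @property
    def head_dim(self) -> int:
        """Per-head dimension; the hidden size must split evenly across heads."""
        if self.hidden_size % self.attention_heads != 0:
            raise ValueError("hidden_size must be divisible by attention_heads")
        return self.hidden_size // self.attention_heads

config = Seq2SeqConfig()
print(config.head_dim)  # 64
```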

To update the model parameters, Adam Optimizer

was used with a learning rate of 5e-5 and a batch size

of 32. The hyperparameters were optimized and early

stopping was applied based on the validation dataset.

We partitioned the dataset into training, validation,

and test sets with proportions of 70%, 15%, and 15%,

respectively.
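A partition of this kind can be sketched as follows. The shuffling, the seed, and the rounding rule (fractions are truncated and the remainder goes to the test set) are assumptions for illustration; the exact per-split counts used in our experiments are not implied by this sketch.

```python
import random

def split_dataset(pairs, train=0.70, val=0.15, seed=42):
    """Shuffle and partition items into train, validation, and test lists."""
    items = list(pairs)
    random.Random(seed).shuffle(items)
    n_train = int(len(items) * train)
    n_val = int(len(items) * val)
    train_set = items[:n_train]
    val_set = items[n_train:n_train + n_val]
    test_set = items[n_train + n_val:]  # whatever remains, roughly 15%
    return train_set, val_set, test_set

# Placeholder items standing in for the 5,084 function-comment pairs.
train_set, val_set, test_set = split_dataset(range(5084))
print(len(train_set), len(val_set), len(test_set))  # 3558 762 764
```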

Inference. As illustrated in Figure 3, we pass the

tokenized words through the CodeBERT encoder to

produce ﬁxed-length embeddings. These embeddings

capture the meaning of the code and comments, creat-

ing a contextual vector representation for each token

in the input sequence. This contextual vector is ﬁnally

passed to a decoder to generate a code summary.

5 EXPERIMENTAL SETUP

To test our transfer learning hypothesis and evalu-

ate AsmDocGen’s performance, we train a sample of

NLP models to compare their results with AsmDoc-

Gen. The CPU architecture of the code is x86/x64.

5.1 Metrics

We use four widely accepted metrics within this ﬁeld

to compare the performance of AsmDocGen against

the experimental sample of foundation models that

have not been trained for the assembly language com-

menting task. These metrics provide a quantiﬁable

means of determining the success of the model in

achieving its objectives.

5.1.1 BLEU

BLEU (Bilingual Evaluation Understudy) is a met-

ric used to evaluate the quality of machine-generated

text, such as machine translation or text summariza-

tion. It measures the similarity between the generated

text and the reference text (usually human-generated)

based on n-gram overlap. The higher the BLEU score,

the better the machine-generated text matches the reference text. The BLEU score ranges from 0 to 1 (it is often reported on a 0 to 100 scale, as in Table 2), where a score of 1 indicates that the machine-generated text is identical to the reference text. BLEU is commonly used in natural language processing research as a standard metric to evaluate the quality of machine-generated text.
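To make the n-gram overlap concrete, the following sketch computes a simplified sentence-level BLEU against a single reference: the geometric mean of clipped n-gram precisions multiplied by a brevity penalty. It is an illustration under stated assumptions (whitespace tokens, one reference, no smoothing) and is not the evaluation script used for the reported scores.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Count all contiguous n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(candidate, reference, max_n=4):
    """Simplified BLEU in [0, 1] for one candidate and one reference."""
    if not candidate:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        cand, ref = ngrams(candidate, n), ngrams(reference, n)
        # Clip each candidate n-gram count by its count in the reference.
        overlap = sum(min(count, ref[gram]) for gram, count in cand.items())
        total = sum(cand.values())
        if overlap == 0 or total == 0:
            return 0.0  # without smoothing, any empty order gives zero
        log_precisions.append(math.log(overlap / total))
    # Brevity penalty: penalise candidates shorter than the reference.
    if len(candidate) > len(reference):
        bp = 1.0
    else:
        bp = math.exp(1 - len(reference) / len(candidate))
    return bp * math.exp(sum(log_precisions) / max_n)

reference = "check whether number is prime".split()
candidate = "check if number is prime".split()
score = bleu(candidate, reference, max_n=2)  # unigram 4/5, bigram 2/4
```

Because no smoothing is applied, short comments that share no 4-gram with the reference score zero at the default `max_n=4`, which is one reason practical evaluations usually apply a smoothed variant.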

5.1.2 ROUGE-1

ROUGE-1 is a metric used to evaluate the quality of

text summarization or machine translation. It mea-

sures the overlap between word unigrams in the refer-

ence summary (or translation) and the generated sum-

mary (or translation). The ROUGE-1 score ranges

from 0 to 1, where 1 indicates a perfect overlap be-

tween the reference summary and the generated sum-

mary in terms of unigrams.

5.1.3 ROUGE-2

ROUGE-2 is an evaluation metric used for auto-

matic summarization tasks that measures the over-

lap of word bigrams between the system-generated

summary and the reference summary. It is similar

to ROUGE-1, but instead of considering individual

words, it looks at pairs of words in the summary and

reference summary. The score is calculated as the

number of overlapping bigrams divided by the total

number of bigrams in the reference summary.
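The recall-style computation described for ROUGE-1 and ROUGE-2 can be sketched as a single function parameterised by n. This is a minimal illustration assuming whitespace tokens and a single reference; the function name is hypothetical.

```python
from collections import Counter

def rouge_n_recall(candidate, reference, n=1):
    """Overlapping n-grams divided by the number of n-grams in the reference."""
    ref = Counter(tuple(reference[i:i + n]) for i in range(len(reference) - n + 1))
    cand = Counter(tuple(candidate[i:i + n]) for i in range(len(candidate) - n + 1))
    total = sum(ref.values())
    if total == 0:
        return 0.0
    # Each reference n-gram is matched at most as often as it occurs in the candidate.
    overlap = sum(min(count, cand[gram]) for gram, count in ref.items())
    return overlap / total

reference = "check whether number is prime".split()
candidate = "check if number is prime".split()
print(rouge_n_recall(candidate, reference, n=1))  # 0.8
print(rouge_n_recall(candidate, reference, n=2))  # 0.5
```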

5.1.4 ROUGE-L

ROUGE-L is a metric to evaluate the quality of

text summaries or translations. It stands for Recall-

Oriented Understudy for Gisting Evaluation - Longest

Common Subsequence, and it measures the longest

common subsequence (LCS) of words between the

summary and the reference text. The LCS is the

longest sequence of words that appear in the same

order in both the summary and the reference text.

ROUGE-L considers all such LCS sequences and

takes their length into account to compute a weighted

average of the F1 score. The ROUGE-L score gives

more weight to long sequences of words that are sim-

ilar in the summary and reference text, and it is often

used as a more comprehensive evaluation metric than

ROUGE-1 and ROUGE-2.
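The longest common subsequence at the core of ROUGE-L can be computed with standard dynamic programming. The sketch below derives precision and recall from the LCS length and combines them into a plain F1 score; it assumes whitespace tokens and a single reference, and it uses an unweighted F1, which is a simplification of the metric as described above.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

def rouge_l_f1(candidate, reference):
    """F1 of LCS-based precision and recall."""
    lcs = lcs_length(candidate, reference)
    if lcs == 0:
        return 0.0
    precision = lcs / len(candidate)
    recall = lcs / len(reference)
    return 2 * precision * recall / (precision + recall)

reference = "check whether number is prime".split()
candidate = "check if number is prime".split()
# The LCS is "check number is prime" (4 tokens), so precision = recall = 4/5.
score = rouge_l_f1(candidate, reference)
```

Unlike ROUGE-2, the matched words need not be adjacent, only in the same order, so a single substituted word does not break the match on either side of it.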

5.2 Sample and Procedure

We evaluated the performance of our model by comparing it against a convenience sample of four leading NLP models: RoBERTa, BERT, DistilBERT, and Transformer. Each of the models was trained on our assembly-comment pair corpus (see Section 3), and then they and AsmDocGen were given a validation input set of undocumented assembly language functions for which to generate English summaries. The metrics of Section 5.1 were collected for each model and for AsmDocGen.

5.2.1 Transformer

Transformer, introduced by Vaswani et al. (Vaswani

et al., 2017), is a neural network architecture based on

the concept of self-attention, which allows the model

to weigh the importance of different parts of the in-

put sequence when generating output. The Trans-

former consists of an encoder and a decoder, each

containing multiple layers of self-attention and feed-

forward neural networks. Its use of self-attention al-

lows the model to capture long-range dependencies

more effectively than other NLP models and paral-

lelize computations across the input sequence, mak-

ing it more computationally efﬁcient. The Trans-

former has achieved state-of-the-art performance on

a wide range of NLP tasks and has inspired the devel-

opment of other Transformer-based models such as

BERT.

5.2.2 BERT

BERT, introduced by Devlin et al. (Devlin et al., 2019), is an NLP model developed by Google and first released in 2018. It is based on the transformer architecture

and is pre-trained on a large corpus of unannotated

text using a masked language modeling task and a

next-sentence prediction task. BERT can be trained

or ﬁne-tuned on a variety of NLP tasks, achiev-

ing state-of-the-art performance on many benchmark

NLP datasets with relatively small amounts of task-

speciﬁc data. Its ability to handle a wide range of NLP

tasks has made it a popular model for NLP research

and applications.

5.2.3 RoBERTa

RoBERTa, introduced by Liu et al. (Liu and Lapata,

2019) in 2019, is an NLP model that is based on the

same architecture as BERT but with several modiﬁ-

cations to its training process and hyperparameters.

The model is trained on a much larger corpus of data,

with up to 160 GB of text, and uses dynamic mask-

ing during pre-training. RoBERTa also changes the

hyperparameters used in BERT, including removing

the next sentence prediction task, increasing the batch

size, and training the model for longer durations. Ad-

ditionally, RoBERTa uses byte-pair encoding for sub-

word tokenization, which can improve the model’s

ability to handle rare and out-of-vocabulary words.

These modiﬁcations allow RoBERTa to achieve state-

of-the-art performance on a wide range of NLP tasks.

Table 2: Comparison of BLEU scores between our proposed approach and the baselines.

Models       BLEU Score
Transformer  34.54
RoBERTa      50.01
DistilBERT   50.82
BERT         51.85
AsmDocGen    54.10

5.2.4 DistilBERT

DistilBERT, introduced by Sanh et al. (Sanh et al., 2019), is a compact and efficient version of the BERT model. It has 40% fewer parameters than the BERT base model and is trained using knowledge distillation, in which the knowledge of a larger model, in this case BERT, is distilled into a smaller one. Despite its smaller size, DistilBERT achieves performance similar to that of the larger BERT model on many NLP tasks, while being faster and requiring less memory to train and run. It has become a popular choice for NLP tasks where computational resources are limited.
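As a rough illustration of knowledge distillation, the sketch below computes a soft-target loss: the cross-entropy between the teacher's and the student's output distributions after both are softened with a temperature. This is only the distillation term; DistilBERT combines it with further training objectives. The function names, the example logits, and the temperature value are assumptions chosen for illustration.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax, computed in a numerically stable way."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Cross-entropy of the student's softened distribution against the
    teacher's softened distribution (the soft targets)."""
    teacher = softmax(teacher_logits, temperature)
    student = softmax(student_logits, temperature)
    return -sum(t * math.log(s) for t, s in zip(teacher, student))


if __name__ == "__main__":
    teacher_logits = [2.0, 1.0, 0.1]  # hypothetical teacher outputs
    print(distillation_loss([1.8, 1.1, 0.2], teacher_logits))  # close student
    print(distillation_loss([0.1, 1.0, 2.0], teacher_logits))  # distant student
```

The loss is smallest when the student reproduces the teacher's distribution, so minimizing it pushes the smaller model toward the larger model's behavior.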

6 RESULTS

Table 2 shows the BLEU scores for Transformer, RoBERTa, DistilBERT, BERT, and AsmDocGen. Table 3 shows that, on average, AsmDocGen outperforms the strongest pre-trained baseline, BERT, by about 26%, 23%, and 20% in relative terms for precision, recall, and F1-score, respectively, and outperforms RoBERTa and DistilBERT by at least as much.
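The percentages quoted above are consistent with relative gains over BERT computed from the Average column of Table 3. The short sketch below reproduces that arithmetic; the dictionary layout and function names are illustrative assumptions, and the scores are copied from Table 3.

```python
# Average scores copied from the "Average" column of Table 3.
AVERAGES = {
    "precision": {"RoBERTa": 0.39, "DistilBERT": 0.50, "BERT": 0.50, "AsmDocGen": 0.63},
    "recall": {"RoBERTa": 0.30, "DistilBERT": 0.41, "BERT": 0.47, "AsmDocGen": 0.58},
    "f1": {"RoBERTa": 0.34, "DistilBERT": 0.45, "BERT": 0.49, "AsmDocGen": 0.59},
}


def relative_gain(ours, baseline):
    """Relative improvement of `ours` over `baseline`, in percent."""
    return 100.0 * (ours - baseline) / baseline


def gains_over(baseline_name):
    """Relative gain of AsmDocGen over one baseline for each metric."""
    return {
        metric: relative_gain(scores["AsmDocGen"], scores[baseline_name])
        for metric, scores in AVERAGES.items()
    }


if __name__ == "__main__":
    for name in ("BERT", "DistilBERT", "RoBERTa"):
        rounded = {m: round(g) for m, g in gains_over(name).items()}
        print(name, rounded)
```

For BERT this gives 26%, 23%, and 20% for precision, recall, and F1-score; the gains over DistilBERT and RoBERTa are equal or larger.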

6.1 Quantitative Analysis

To investigate the effectiveness of AsmDocGen, we compared it with the baseline models described above using the metrics BLEU, ROUGE-1, ROUGE-2, and ROUGE-L. The comparison between the Transformer and the pre-trained models shows the effectiveness of transfer learning for training a code summarization model on a small dataset. The comparison between AsmDocGen and BERT, RoBERTa, and DistilBERT shows the high capability of CodeBERT to learn and understand assembly language.
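To make the ROUGE metrics concrete, the following sketch computes ROUGE-N (n-gram overlap) and ROUGE-L (longest common subsequence) precision, recall, and F1 for a single prediction and reference. It uses plain whitespace tokenization and is meant only as an illustration; the scoring tool and preprocessing used in our experiments are not reproduced here, and all function names are assumptions.

```python
from collections import Counter


def ngrams(tokens, n):
    """All contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def prf(overlap, pred_total, ref_total):
    """Precision, recall, and F1 from an overlap count."""
    precision = overlap / pred_total if pred_total else 0.0
    recall = overlap / ref_total if ref_total else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def rouge_n(prediction, reference, n):
    """ROUGE-N based on clipped n-gram overlap."""
    pred = Counter(ngrams(prediction.split(), n))
    ref = Counter(ngrams(reference.split(), n))
    overlap = sum((pred & ref).values())
    return prf(overlap, sum(pred.values()), sum(ref.values()))


def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(prediction, reference):
    """ROUGE-L based on the longest common subsequence."""
    pred, ref = prediction.split(), reference.split()
    return prf(lcs_length(pred, ref), len(pred), len(ref))


if __name__ == "__main__":
    predicted = "find second thursday of sept"
    truth = "find second thursday of september 2013"
    print(rouge_n(predicted, truth, 1))
    print(rouge_n(predicted, truth, 2))
    print(rouge_l(predicted, truth))
```

For this pair, taken from Table 5, four of the five predicted words appear in the six-word reference, so ROUGE-1 precision is 0.80 and recall is about 0.67.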

The results in Table 2 show that the pre-trained models, despite their initial unfamiliarity with assembly code syntax, significantly outperform a model (Transformer) trained from scratch on our smaller dataset: their BLEU scores range from 50.01 to 54.10, compared with 34.54 for the Transformer. This suggests that patterns and knowledge captured from extensive datasets during the pre-training phase, even when only marginally relevant, can be significantly beneficial for learning new tasks with limited data.

AsmDocGen outperforms the other models on ROUGE-1, ROUGE-2, and the averaged scores in Table 3, while its ROUGE-L scores are on par with the best baselines. This supports the inference that CodeBERT’s pre-training objectives, which specifically target high-level code and comment pairs, can provide an advantage for understanding low-level languages compared to models that were pre-trained on general language data. Additionally, our results highlight the importance of selecting a model that is well suited to the task at hand rather than relying solely on pre-training size or architecture.

6.2 Qualitative Analysis

We observed three types of correctly generated comments, distinguished by their similarity to the ground-truth comments:

• Identical: Descriptions that contain exactly the same words, in the same order, as the ground-truth description (see Table 4).

• Partially Similar: Comments that are semantically similar to the ground truth but include only a subset of the original words (see Table 5).

• Contextually Similar: Comments whose structure and wording differ from the ground truth but that convey the relevant context or semantics (see Table 6).
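The first two categories can be approximated automatically, whereas contextual similarity requires a human to inspect the code. The sketch below is a hypothetical triage heuristic, not the procedure used in this study: the function name `triage`, the word-overlap criterion, and the 0.5 threshold are all assumptions made for illustration.

```python
def triage(prediction, reference, threshold=0.5):
    """Sort a predicted comment into a coarse similarity bucket.

    Hypothetical heuristic: exact word-for-word matches are "identical";
    predictions that share at least `threshold` of their distinct words
    with the reference are "partially similar"; everything else is left
    for a human to judge.
    """
    pred = prediction.lower().split()
    ref = reference.lower().split()
    if pred and pred == ref:
        return "identical"
    distinct = set(pred)
    if distinct and len(distinct & set(ref)) / len(distinct) >= threshold:
        return "partially similar"
    return "needs manual review"


if __name__ == "__main__":
    pairs = [
        ("perform subtraction then addition", "perform subtraction then addition"),
        ("find second thursday of sept", "find second thursday of september 2013"),
        ("returns the size of the queue", "fuzzy compare operations"),
    ]
    for predicted, truth in pairs:
        print(predicted, "->", triage(predicted, truth))
```

Word overlap says nothing about meaning, so a pair left for manual review may still turn out to be contextually similar, or simply wrong, once the assembly code is examined.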

In this section, we present examples from each group and compare the ground truth with the comments generated by the model. The aim is to illustrate how closely the model’s predictions match the actual descriptions, which helps in evaluating how effectively the model solves the problem it was designed to address.

Table 5 shows some examples of partially similar generated comments. The first example is noteworthy because it suggests that the model learns more than surface patterns. For the ground truth “check too many closing brackets,” the model generated “find unopened closing brackets.” A closing bracket that was never opened is exactly a surplus closing bracket, so the prediction conveys the intended meaning with different words. This outcome aligns with our expectations for the model.

Sometimes, the comments generated by our

model, as presented in Table 6, differ completely from

the ground-truth comments. This raises the question

of whether these predictions are actually relevant to

the code’s functionality.

As shown in Table 6, AsmDocGen generated the comment “returns the size of the queue,” whereas the ground-truth comment is “fuzzy compare operations.”

ICSOFT 2024 - 19th International Conference on Software Technologies

Table 3: Performance comparison of AsmDocGen and baseline results based on the ROUGE-1, ROUGE-2, and ROUGE-L scores in terms of precision, recall, and F1-score.

Metric     Models      ROUGE-1  ROUGE-2  ROUGE-L  Average
Precision  RoBERTa     0.50     0.16     0.50     0.39
Precision  DistilBERT  0.52     0.45     0.52     0.50
Precision  BERT        0.56     0.45     0.50     0.50
Precision  AsmDocGen   0.70     0.69     0.51     0.63
Recall     RoBERTa     0.36     0.16     0.37     0.30
Recall     DistilBERT  0.47     0.30     0.47     0.41
Recall     BERT        0.55     0.39     0.46     0.47
Recall     AsmDocGen   0.70     0.56     0.48     0.58
F1-score   RoBERTa     0.42     0.16     0.43     0.34
F1-score   DistilBERT  0.49     0.36     0.49     0.45
F1-score   BERT        0.55     0.43     0.48     0.49
F1-score   AsmDocGen   0.68     0.59     0.49     0.59

Table 4: Three examples of AsmDocGen’s descriptions that are an exact match to the true description.

Identical
Predicted comment                   Ground truth
find vertex number and edges out    find vertex number and edges out
perform subtraction then addition   perform subtraction then addition
perform multiple multiplication     perform multiple multiplication

Table 5: Samples of the predicted output that use words similar to those of the true description.

Partially Similar
Predicted comment                 Ground truth
find unopened closing brackets    check too many closing brackets
find second thursday of sept      find second thursday of september 2013
find partially paired brackets    find paired and nested brackets

Given that in a fuzzy system, a queue can be utilized to keep track of intermediate results or manage the sequence of various operations, the generated comment seems relevant. Our manual examination of the code confirmed that this assembly code specifically uses a queue in a fuzzy system. Additionally, it is worth noting that while “find edge destination” and “find last node” are not identical, they still convey similar concepts. “Find edge destination” is a specific term in graph theory, while “find last node” is a broader term that can apply to various structures, such as linked lists, trees, or graphs. These findings suggest that although AsmDocGen’s comments may not be identical to the ground truth, they are still relevant to the functionality of the code.

Table 6: Predicted output whose wording differs completely from the ground truth but which conveys a similar concept based on the code’s functionality.

Contextually Similar
Predicted comment                Ground truth
returns the size of the queue    fuzzy compare operations
find last node                   find edge destination

7 CONCLUSIONS

The comparison between the Transformer and the pre-trained models shows the effectiveness of transfer learning for training a code summarization model on a small dataset. The comparison between AsmDocGen and BERT, RoBERTa, and DistilBERT shows that CodeBERT can learn and understand assembly language well, even though it was pre-trained on high-level programming languages only. The evaluation supports the argument that AsmDocGen, a transformer-based model, represents a significant advance in the field of automatic code commenting for low-level programming languages. The reference corpus of assembly-comment pairs that we created was shown to be beneficial for retraining CodeBERT to align it better with this assembly documentation task. To the best of our knowledge, AsmDocGen is the first approach to generate comments for assembly code.

Future work could include improving the performance of the model by training it on a larger dataset. Another area of improvement is extending the system to cope with compiler optimization, so that it generates consistent comments for similar functions compiled with different compilers or optimization levels. In addition, the system could be modified to generate multi-sentence summaries instead of one-sentence comments.

ACKNOWLEDGMENT

This research is supported by NSERC Alliance Grants

(ALLRP 561035-20), BlackBerry Limited, and De-

fence Research and Development Canada (DRDC).

REFERENCES

Ahmad, W. U., Chakraborty, S., Ray, B., and Chang, K.-W. (2020). A transformer-based approach for source code summarization. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 4998–5007.

Allamanis, M., Peng, H., and Sutton, C. (2016). A convo-

lutional attention network for extreme summarization

of source code. In Proceedings of the International

Conference on Machine Learning, page 2091–2100.

Bahdanau, D., Cho, K., and Bengio, Y. (2014). Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473.

Barone, A. V. M. and Sennrich, R. (2021). A parallel cor-

pus of python functions and documentation strings for

automated code documentation and code generation.

In Proceedings of the Eighth International Joint Con-

ference on Natural Language Processing, 2:314–319.

Bellon, S., Koschke, R., Antoniol, G., Krinke, J., and

Merlo, E. (2007). Comparison and evaluation of clone

detection tools. IEEE Transactions on software engi-

neering, 33(9):577–591.

Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D.,

Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G.,

Askell, A., Agarwal, S., Herbert-Voss, A., Krueger,

G., Henighan, T., Child, R., Ramesh, A., Ziegler, D.,

Wu, J., Winter, C., Hesse, C., Chen, M., Sigler, E.,

Litwin, M., Gray, S., Chess, B., Clark, J., Berner,

C., McCandlish, S., Radford, A., Sutskever, I., and

Amodei, D. (2020). Language models are few-shot

learners. In Advances in Neural Information Process-

ing Systems, (33):1877–1901.

Cao, Y., Bi, W., Fang, M., and Tao, D. (2020). Pretrained

language models for dialogue generation with multi-

ple input sources. pages 909–917.

Cho, K., van Merrienboer, B., Gulcehre, C., Bahdanau, D., Bougares, F., Schwenk, H., and Bengio, Y. (2014). Learning phrase representations using RNN encoder-decoder for statistical machine translation.

Christian, H., Agus, M. P., and Suhartono, D. (2016).

Single document automatic text summarization using

term frequency-inverse document frequency (tf-idf).

ComTech: Computer, Mathematics and Engineering

Applications, 7:285.

Clement, C., Drain, D., Timcheck, J., Svyatkovskiy, A., and Sundaresan, N. (2020). PyMT5: Multi-mode translation of natural language and Python code with transformers. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing.

Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. (2019). BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 4171–4186.

Feng, Z., Guo, D., Tang, D., Duan, N., Feng, X., Gong, M., Shou, L., Qin, B., Liu, T., Jiang, D., and Zhou, M. (2020). CodeBERT: A pre-trained model for programming and natural languages. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 1536–1547.

Gao, P., Geng, S., Qiao, Y., Wang, X., Dai, J., and Li,

H. (2021). Scalable transformers for neural machine

translation.

Gu, X., Zhang, H., and Kim, S. (2018). Deep code

search. In Proceedings of IEEE/ACM 40th Interna-

tional Conference on Software Engineering (ICSE),

page 933–944.

Gupta, A., Chugh, D., Anjum, and Katarya, R. (2022).

Automated news summarization using transformers.

pages 249–259.

Hu, X., Li, G., Xia, X., Lo, D., and Jin, Z. (2018a). Deep

code comment generation. In Proceedings of the

26th Conference on Program Comprehension, page

200–210.

Hu, X., Li, G., Xia, X., Lo, D., Lu, S., and Jin, Z. (2018b). Summarizing source code with transferred API knowledge. In Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence Main track, pages 2269–2275.

Husain, H., Wu, H.-H., Gazit, T., Allamanis, M., and Brockschmidt, M. CodeSearchNet challenge: Evaluating the state of semantic code search.

Iyer, S., Konstas, I., Cheung, A., and Zettlemoyer, L.

(2016). Summarizing source code using a neural at-

tention model. In Proceedings of the 54th Annual

Meeting of the Association for Computational Lin-

guistics, 1:2073–2083.

Kusupati, U. and Ailavarapu, V. R. T. (2022). Natural lan-

guage to code using transformers.

LeClair, A., Jiang, S., and McMillan, C. (2019). A neural

model for generating natural language summaries of

program subroutines. In Proceedings of the 41st Inter-

national Conference on Software Engineering, page

795–806.


Lee, M., Shin, H., Lee, D., and Choi, S.-P. (2021). Korean

grammatical error correction based on transformer

with copying mechanisms and grammatical noise im-

plantation methods. Sensors, 21:2658.

Lewis, M., Liu, Y., Goyal, N., Ghazvininejad, M., Mohamed, A., Levy, O., Stoyanov, V., and Zettlemoyer, L. (2020). BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 7871–7880.

Liu, Y. and Lapata, M. (2019). Text summarization with

pretrained encoders.

Olabiyi, O. O., Bhattarai, P., Bruss, C. B., and Kulis, Z. (2020). DLGNet-Task: An end-to-end neural network framework for modeling multi-turn multi-domain task-oriented dialogue. In Proceedings of the 2nd Workshop on Natural Language Processing for Conversational AI.

Radford, A., Narasimhan, K., Salimans, T., and Sutskever,

I. (2018). Improving language understanding by gen-

erative pre-training.

Sanh, V., Debut, L., Chaumond, J., and Wolf, T. (2019). DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter.

Sarkar, K. (2013). Automatic single document text sum-

marization using key concepts in documents. J. Inf.

Process. Syst, 9:602–620.

Steidl, D., Hummel, B., and Juergens, E. (2013). Qual-

ity analysis of source code comments. In Proceed-

ings of the 2013 21st International Conference on

Program Comprehension (ICPC), San Francisco, CA,

USA, page 20–21.

Sutskever, I., Vinyals, O., and Le, Q. V. (2014). Se-

quence to sequence learning with neural networks.

Advances in Neural Information Processing Systems,

4:3104–3112.

Tenny, T. (1988). Program readability: procedures versus

comments. IEEE Transactions on Software Engineer-

ing, 14(9):1271–1279.

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones,

L., Gomez, A., Kaiser, L., and Polosukhin, I. (2017).

Attention is all you need. In Advances in Neural In-

formation Processing Systems, (30):5998–6008.

Verma, P. and Om, H. (2018). MCRMR: Maximum coverage and relevancy with minimal redundancy based multi-document summarization. Expert Systems with Applications, 120.

Wang, R., Zhang, H., Lu, G., Lyu, L., and Lyu, C. (2020). Fret: Functional reinforced transformer with BERT for code summarization. IEEE Access.

Woodﬁeld, S. N., Dunsmore, H. E., and Shen, V. Y. (1981).

The effect of modularization and comments on pro-

gram comprehension. In Proceedings of the 5th in-

ternational conference on Software engineering, page

215–223.

Wu, Y., Schuster, M., Chen, Z., Le, Q. V., Norouzi, M., Macherey, W., Krikun, M., Cao, Y., Gao, Q., Macherey, K., Klingner, J., Shah, A., Johnson, M., Liu, X., Kaiser, Ł., Gouws, S., Kato, Y., Kudo, T., Kazawa, H., Stevens, K., Kurian, G., Patil, N., Wang, W., Young, C., Smith, J., Riesa, J., Rudnick, A., Vinyals, O., Corrado, G., Hughes, M., and Dean, J. (2016). Google’s neural machine translation system: Bridging the gap between human and machine translation.

Yang, Z., Dai, Z., Yang, Y., Carbonell, J., Salakhutdinov, R., and Le, Q. (2019). XLNet: Generalized autoregressive pretraining for language understanding. In Advances in Neural Information Processing Systems, 32:5753–5763.

Yin, P., Deng, B., Chen, E., Vasilescu, B., and Neubig, G. (2018). Learning to mine aligned code and natural language pairs from Stack Overflow. In International Conference on Mining Software Repositories, MSR, pages 476–486. ACM.

Zhou, G. and Su, J. (2002). Named entity recognition using an HMM-based chunk tagger. In Proceedings of the 40th Annual Meeting on Association for Computational Linguistics.
