AsmDocGen: Generating Functional Natural Language Descriptions for Assembly Code

Jesia Yuki, Mohammadhossein Amouei, Benjamin C. M. Fung, Philippe Charland, Andrew Walenstein

2024

Abstract

This study explores the field of software reverse engineering through the lens of code summarization, which involves generating informative and concise summaries of code functionality. A significant aspect of this research is the application of assembly code summarization in malware analysis, highlighting its critical role in understanding and mitigating potential security threats. Although there have been recent efforts to develop code summarization techniques for high-level programming languages, to the best of our knowledge, this study is the first attempt to generate comments for assembly code. For this purpose, we first built a carefully curated dataset of assembly function-comment pairs. We then focused on automatic assembly code summarization using transfer learning with pre-trained natural language processing (NLP) models, including BERT, DistilBERT, RoBERTa, and CodeBERT. The results of our experiments show a notable advantage of CodeBERT: despite its initial training on high-level programming languages alone, it excels in learning assembly language, outperforming the other pre-trained NLP models.

Paper Citation


in Harvard Style

Yuki J., Amouei M., Fung B. C. M., Charland P. and Walenstein A. (2024). AsmDocGen: Generating Functional Natural Language Descriptions for Assembly Code. In Proceedings of the 19th International Conference on Software Technologies - Volume 1: ICSOFT; ISBN 978-989-758-706-1, SciTePress, pages 35-45. DOI: 10.5220/0012761400003753


in Bibtex Style

@conference{icsoft24,
author={Jesia Yuki and Mohammadhossein Amouei and Benjamin C. M. Fung and Philippe Charland and Andrew Walenstein},
title={AsmDocGen: Generating Functional Natural Language Descriptions for Assembly Code},
booktitle={Proceedings of the 19th International Conference on Software Technologies - Volume 1: ICSOFT},
year={2024},
pages={35-45},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0012761400003753},
isbn={978-989-758-706-1},
}


in EndNote Style

TY - CONF

JO - Proceedings of the 19th International Conference on Software Technologies - Volume 1: ICSOFT
TI - AsmDocGen: Generating Functional Natural Language Descriptions for Assembly Code
SN - 978-989-758-706-1
AU - Yuki J.
AU - Amouei M.
AU - Fung B. C. M.
AU - Charland P.
AU - Walenstein A.
PY - 2024
SP - 35
EP - 45
DO - 10.5220/0012761400003753
PB - SciTePress