Source Code Implied Language Structure Abstraction through Backward Taint Analysis

Zihao Wang, Pei Wang, Qinkun Bao, Dinghao Wu

2023

Abstract

This paper presents a novel approach for inferring the language implied by a program’s source code without requiring explicit grammars or input/output corpora. Our technique is based on backward taint analysis, which tracks the flow of data in a program from selected sink functions back to the source functions. By analyzing the data flow of programs that generate structured output, such as compilers and formatters, we can infer the syntax and structure of the language that the code expresses. The approach is particularly effective for domain-specific languages, where the implied language is often unique to a particular problem domain and may not be expressible by a standard context-free grammar. To evaluate the technique, we applied it to libxml2. Our experiments show that the approach can accurately infer the implied language of some complex programs. The inferred language models can then be used to generate high-quality corpora for testing and validation. Our approach offers a new way to understand and reason about the language implied by source code, with potential applications in software testing, reverse engineering, and program comprehension.
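
As a rough illustration of the backward direction described in the abstract, the following minimal sketch walks from output sinks back to the literals and input sources that feed them. It is not the authors' implementation. The toy straight-line IR and the names `Stmt`, `backward_taint`, and `implied_structure` are hypothetical, and the sketch assumes no branches, loops, or function calls, which a real analysis of a program such as libxml2 would have to handle.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stmt:
    """One statement of a hypothetical straight-line IR."""
    target: str   # variable being defined ("" for a sink call)
    op: str       # "const", "source", "concat", or "sink"
    args: tuple   # literal text, source name, or operand variable names


def backward_taint(program, sink_index):
    """Expand the arguments of one sink call, walking backward through
    definitions, into literal fragments and {source} placeholders."""

    def expand(var, before):
        # Find the most recent definition of `var` before index `before`.
        for i in range(before - 1, -1, -1):
            stmt = program[i]
            if stmt.target != var:
                continue
            if stmt.op == "const":
                return [stmt.args[0]]
            if stmt.op == "source":
                return ["{" + stmt.args[0] + "}"]
            if stmt.op == "concat":
                parts = []
                for operand in stmt.args:
                    parts.extend(expand(operand, i))
                return parts
        return ["{?}"]  # no definition found: origin unknown

    parts = []
    for arg in program[sink_index].args:
        parts.extend(expand(arg, sink_index))
    return parts


def implied_structure(program):
    """Concatenate the backward expansions of every sink, in program order."""
    pattern = []
    for index, stmt in enumerate(program):
        if stmt.op == "sink":
            pattern.extend(backward_taint(program, index))
    return pattern


# Toy emitter that writes one XML-like element (assumed example, not libxml2).
program = [
    Stmt("name", "source", ("tag_name",)),
    Stmt("body", "source", ("text",)),
    Stmt("lt", "const", ("<",)),
    Stmt("gt", "const", (">",)),
    Stmt("close", "const", ("</",)),
    Stmt("open_tag", "concat", ("lt", "name", "gt")),
    Stmt("close_tag", "concat", ("close", "name", "gt")),
    Stmt("", "sink", ("open_tag",)),
    Stmt("", "sink", ("body",)),
    Stmt("", "sink", ("close_tag",)),
]

print(implied_structure(program))
# ['<', '{tag_name}', '>', '{text}', '</', '{tag_name}', '>']
```

The recovered pattern mixes fixed literals with placeholders for input-derived data, and the same placeholder appearing in both the opening and closing tag is one example of a dependency that a plain context-free grammar does not capture directly.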

Paper Citation


in Harvard Style

Wang Z., Wang P., Bao Q. and Wu D. (2023). Source Code Implied Language Structure Abstraction through Backward Taint Analysis. In Proceedings of the 18th International Conference on Software Technologies - Volume 1: ICSOFT; ISBN 978-989-758-665-1, SciTePress, pages 564-571. DOI: 10.5220/0012129000003538


in Bibtex Style

@conference{icsoft23,
author={Zihao Wang and Pei Wang and Qinkun Bao and Dinghao Wu},
title={Source Code Implied Language Structure Abstraction through Backward Taint Analysis},
booktitle={Proceedings of the 18th International Conference on Software Technologies - Volume 1: ICSOFT},
year={2023},
pages={564-571},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0012129000003538},
isbn={978-989-758-665-1},
}


in EndNote Style

TY - CONF
JO - Proceedings of the 18th International Conference on Software Technologies - Volume 1: ICSOFT
TI - Source Code Implied Language Structure Abstraction through Backward Taint Analysis
SN - 978-989-758-665-1
AU - Wang Z.
AU - Wang P.
AU - Bao Q.
AU - Wu D.
PY - 2023
SP - 564
EP - 571
DO - 10.5220/0012129000003538
PB - SciTePress