Authors:
Zihao Wang (1), Pei Wang (2), Qinkun Bao (2), and Dinghao Wu (1)
Affiliations:
(1) Pennsylvania State University, University Park, U.S.A.
(2) Individual Researcher, U.S.A.
Keyword(s):
Program Analysis, Context-Free Grammar, Static Analysis, Fuzzing, Data-Flow Analysis, Taint Analysis.
Abstract:
This paper presents a novel approach for inferring the language implied by a program’s source code, without requiring explicit grammars or input/output corpora. Our technique is based on backward taint analysis, which traces the flow of data in a program from designated sink functions back to the source functions. By analyzing the data flow of programs that generate structured output, such as compilers and formatters, we can infer the syntax and structure of the language expressed in the code. Our approach is particularly effective for domain-specific languages, where the language implied by the code is often unique to a particular problem domain and may not be expressible by a standard context-free grammar. To evaluate the technique, we applied it to libxml2. Our experiments show that the approach can accurately infer the implied language of some complex programs. Using the inferred language models, we can generate high-quality corpora for testing and validation. Our approach offers a new way to understand and reason about the language implied by source code, and has potential applications in software testing, reverse engineering, and program comprehension.
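The core idea of walking backward from an output sink to the data sources can be illustrated with a minimal sketch. This is not the authors' implementation, and the abstract gives no algorithmic details; the toy intermediate representation and every name below (`DEFS`, `SINKS`, `backward_trace`, `infer_grammar`, `read_name`, `read_text`) are hypothetical and chosen only for illustration. The sketch assumes each variable is defined exactly once, as a string literal sequence, an input read, or a concatenation.

```python
# Toy IR (hypothetical): variable -> ("source", reader_name)
#                               or  ("concat", [("lit", text) | ("var", name), ...])
DEFS = {
    "name": ("source", "read_name"),
    "body": ("source", "read_text"),
    "open": ("concat", [("lit", "<"), ("var", "name"), ("lit", ">")]),
    "close": ("concat", [("lit", "</"), ("var", "name"), ("lit", ">")]),
    "elem": ("concat", [("var", "open"), ("var", "body"), ("var", "close")]),
    "unused": ("concat", [("lit", "debug")]),  # never reaches a sink
}
SINKS = ["elem"]  # variables passed to an output function


def backward_trace(var, defs, rules):
    """Follow definitions backward from `var`, recording one production per variable."""
    if var in rules:
        return
    kind, payload = defs[var]
    if kind == "source":
        # Input data is treated as an opaque token named after its reader.
        rules[var] = [f"<{payload}>"]
        return
    rhs = []
    rules[var] = rhs  # register before recursing so cyclic definitions terminate
    for part_kind, value in payload:
        if part_kind == "lit":
            rhs.append(repr(value))  # literal text becomes a terminal
        else:
            rhs.append(value)  # a variable becomes a nonterminal
            backward_trace(value, defs, rules)


def infer_grammar(sinks, defs):
    """Collect productions for everything that can flow into a sink."""
    rules = {}
    for sink in sinks:
        backward_trace(sink, defs, rules)
    return rules


if __name__ == "__main__":
    for lhs, rhs in infer_grammar(SINKS, DEFS).items():
        print(f"{lhs} -> {' '.join(rhs)}")
```

Starting from the sink rather than from the inputs means that definitions which never reach an output function, such as `unused` above, contribute nothing to the inferred productions.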