3.1 Modularity
Example multilingual layer MLSA filter programs
include the generation of the forward and reverse
control flow, the identification of variable use and of
variable assignment, the allocation of heap memory,
the identification of procedure calls and so forth.
Consider an illustrative example of such a pipeline:
a Reaching Definitions Analysis (RDA) filter
(Nielson, Nielson, and Hankin, 2005).
A monolingual AST filter, CASN, takes the C AST
as input and generates a CSV file of variable
assignment locations. Another filter, CRCF,
generates the reverse control flow as a CSV file. A
multilingual filter, RDA, takes both as input and
calculates the reaching definitions for each assignment.
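The final stage of this pipeline can be sketched as a small worklist computation. This is a minimal illustration, not the MLSA implementation: the CSV layouts (`node,variable` for the assignment file and `node,predecessor` for the reverse control flow file) and the function names are assumptions made for this example only.

```python
import csv
import io
from collections import defaultdict


def read_rows(text):
    """Parse CSV text into a list of rows, skipping blank lines."""
    return [row for row in csv.reader(io.StringIO(text)) if row]


def reaching_definitions(assign_rows, rcf_rows):
    """Return {node: set of (variable, defining_node)} reaching each node's entry.

    assign_rows: (node, variable) pairs   -- assumed layout of the CASN output
    rcf_rows:    (node, predecessor) pairs -- assumed layout of the CRCF output
    """
    defs = {node: var for node, var in assign_rows}  # node -> variable it assigns
    preds = defaultdict(set)                         # node -> predecessor nodes
    nodes = set(defs)
    for node, pred in rcf_rows:
        preds[node].add(pred)
        nodes.update((node, pred))

    entry = {n: set() for n in nodes}
    exit_ = {n: set() for n in nodes}
    worklist = sorted(nodes)
    while worklist:
        n = worklist.pop()
        # Definitions reaching n are the union of its predecessors' exit sets.
        new_entry = set()
        for p in preds[n]:
            new_entry |= exit_[p]
        entry[n] = new_entry
        if n in defs:
            # An assignment kills earlier definitions of the same variable.
            new_exit = {(v, d) for v, d in new_entry if v != defs[n]}
            new_exit.add((defs[n], n))
        else:
            new_exit = set(new_entry)
        if new_exit != exit_[n]:
            exit_[n] = new_exit
            # The successors of n must be revisited.
            worklist.extend(s for s in nodes if n in preds[s])
    return entry


# Hypothetical program: 1: x=..., 2: y=..., 3: x=... (one branch), 4: join of 2 and 3.
assign_csv = "1,x\n2,y\n3,x\n"
rcf_csv = "2,1\n3,2\n4,2\n4,3\n"
result = reaching_definitions(read_rows(assign_csv), read_rows(rcf_csv))
print(sorted(result["4"]))
```

Because the filter only consumes the two CSV files, it does not need to know which language produced them, which is the property the pipeline relies on.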
The MLSA architecture promotes modularity.
In the RDA example above:
1. Adding extra languages just requires adding new
monolingual filters for the language. For
Python, these are the PASN and PRCF filters.
2. Modifying analyses just requires reconfiguration
of the analysis pipeline: for example,
substituting the CSV output of CCON containing
condition locations for that of CASN would
perform RDA for the condition statements.
The architecture also supports open-source
collaboration: it is relatively easy to specify and
build new filters, or to replace existing filters with
more efficient ones.
3.2 Computational Efficiency
The MLSA architecture was also designed with
computational efficiency in mind. The policy of
dividing the analysis into pipelines was chosen with
the objective of making parallelism and dependency
explicit. For relatively small multilingual codebases,
a static analysis network (as in Figure 1(b)) could be
distributed among multiple cores. The parallelism
and dependency can be derived directly from the
pattern of AST and CSV file use.
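As a rough illustration of how parallelism and dependency could be read off the pattern of file use, the sketch below groups filters into stages whose members share no dependencies. The file names and the `filters` table are hypothetical; only the filter names CASN, CRCF and RDA come from the example above.

```python
def parallel_stages(filters):
    """Group filters into stages that can run in parallel.

    filters: {name: (input_files, output_files)}
    A filter depends on whichever filter produces one of its input files.
    """
    producer = {out: name
                for name, (_, outs) in filters.items()
                for out in outs}
    deps = {name: {producer[f] for f in ins if f in producer}
            for name, (ins, _) in filters.items()}

    stages, done = [], set()
    while len(done) < len(filters):
        # A filter is ready once every filter it depends on has finished.
        ready = {n for n, d in deps.items() if n not in done and d <= done}
        if not ready:
            raise ValueError("cyclic file dependency between filters")
        stages.append(ready)
        done |= ready
    return stages


# Hypothetical file names for the RDA pipeline.
filters = {
    "CASN": (["prog.c.ast"], ["casn.csv"]),
    "CRCF": (["prog.c.ast"], ["crcf.csv"]),
    "RDA": (["casn.csv", "crcf.csv"], ["rda.csv"]),
}
stages = parallel_stages(filters)
```

Here CASN and CRCF fall into the same stage and could be assigned to different cores, while RDA must wait for both.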
For a large codebase, in a realistic large-company
scenario with a widely distributed set of code
developers and contractors, a static analysis network
might need to run continually (on a daily basis, for
example) on cloud computing, regenerating and
updating CSV file components across the network.
In this scenario, the architecture also promotes a
‘just in time’ efficiency: a CSV file is recalculated
only when code changes require it, using
dependency information from the CSV files.
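One possible way to realize this kind of recalculation is a make-style staleness check. The sketch below is an assumption on our part rather than the MLSA mechanism: it uses file modification times as the change signal, and the `refresh` and `run` names are hypothetical.

```python
import os


def is_stale(output, inputs):
    """True if the output file is missing or older than any of its inputs."""
    if not os.path.exists(output):
        return True
    out_time = os.path.getmtime(output)
    return any(os.path.getmtime(path) > out_time for path in inputs)


def refresh(filters, run):
    """Rerun only the filters whose output is out of date.

    filters: list of (name, input_files, output_file) in dependency order
    run:     callback that executes the named filter and rewrites its output
    Returns the names of the filters that were rerun.
    """
    rerun = []
    for name, inputs, output in filters:
        if is_stale(output, inputs):
            run(name)  # rewriting the output makes downstream filters stale
            rerun.append(name)
    return rerun
```

Because the filters are visited in dependency order, a change to one source file propagates only along the chain of CSV files derived from it; everything else is left untouched.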
3.3 Multilingual Analysis
Call graph analysis (CGA) is a useful software
engineering tool (Ali and Lhotak, 2012). In
particular, for multilingual code, the call graph can
be used to investigate the boundary line between
languages, a boundary that is opaque in many tools.
For example, a C program may call a Python
procedure in addition to many C procedures.
Suppose that one such C procedure, OpenPort,
exposes a security risk and that its invocations must
pass a security review. Looking only at the call
graph of the C procedures, some of which may
invoke the Python procedure, can give a false sense
of security that it shows all the call sequences for the
program. However, the Python code may itself call
OpenPort, or may call other procedures in that
same C program that in turn call OpenPort.
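The OpenPort scenario can be made concrete with a toy example. Apart from OpenPort, every procedure name below is hypothetical, and the dictionaries stand in for call graphs that real filters would have to extract from the code.

```python
def merge_graphs(*graphs):
    """Union several caller -> [callees] maps into one call graph."""
    merged = {}
    for graph in graphs:
        for caller, callees in graph.items():
            merged.setdefault(caller, []).extend(callees)
    return merged


def call_paths(graph, start, target, path=()):
    """Enumerate the acyclic call sequences from start to target."""
    path = path + (start,)
    if start == target:
        return [path]
    found = []
    for callee in graph.get(start, ()):
        if callee not in path:  # do not follow recursive cycles
            found.extend(call_paths(graph, callee, target, path))
    return found


# C side: run_script is the Python procedure, a leaf in the C-only graph.
c_calls = {
    "main": ["init", "run_script"],
    "init": ["OpenPort"],
    "reset_device": ["OpenPort"],
}
# Python side: the script calls back into a C procedure.
py_calls = {"run_script": ["reset_device"]}

c_only = call_paths(c_calls, "main", "OpenPort")
multilingual = call_paths(merge_graphs(c_calls, py_calls), "main", "OpenPort")
```

In this toy program the C-only graph reports a single call sequence reaching OpenPort, while the merged graph exposes a second one that passes through the Python procedure, which is exactly the sequence a review based on the C graph alone would miss.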
In the next section, we present one specific
problem we are addressing with MLSA – the
construction of multilingual call graphs – with the
objective of eliminating issues with the opacity of
language boundaries.
4 MULTILINGUAL CALL GRAPH ANALYSIS
The call graph for the program $S_c$ is defined as
$CG(S_c) = (V_c, E_c)$, consisting of a set of nodes
and a set of edges:
• $V_c = \{(pname, parglist)\}$, where
$pname \in Procs(S_c)$ is a procedure name within
the program $S_c$, and $parglist$ is the argument
list in the procedure call;
• $E_c \subseteq V_c^2$ links a node $v$ to a node $u$
iff some execution of $v.pname$ calls $u.pname$
with arguments $u.parglist$.
For imperative languages without first-class functions,
constructing a CG is not challenging. For functional
languages and OO languages, the issue of dynamic
dispatch complicates the construction, and a
technique such as Control-Flow Analysis (CFA)
(Nielson, Nielson, and Hankin, 2005) must be used.
Calling a cross-language procedure may be almost
trivial (in the C/C++ case) or may involve a cross-
language API as in the case of JNI (C/Java) or
Python.h (C/Python) and others. A monolingual
call graph analysis will yield leaf nodes that are the
cross-language API calls. We restrict the call graph
CG to be a tree for ease of display, with recursive
calls as leaf nodes.
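The tree restriction can be sketched as follows. Nodes are (pname, parglist) pairs as in the definition above; treating a call as recursive when its procedure name is already on the current call path is our assumption, and the example program is hypothetical. PyRun_SimpleString is a Python.h function, used here as the kind of cross-language API call that ends up as a leaf of a monolingual analysis.

```python
def call_tree(calls, node, active=()):
    """Expand node into a nested (node, children) tree.

    calls:  {node: [callee nodes]}, each node a (pname, parglist) pair
    active: procedure names on the path from the root to this node
    A call to a procedure that is already active is kept as a leaf.
    """
    pname, _ = node
    if pname in active:
        return (node, [])  # recursive call: leaf node, not expanded
    children = [call_tree(calls, callee, active + (pname,))
                for callee in calls.get(node, [])]
    return (node, children)


# Hypothetical C program: main calls fact(n) and a Python.h API function.
main = ("main", ())
fact_n = ("fact", ("n",))
fact_rec = ("fact", ("n-1",))
py_api = ("PyRun_SimpleString", ("script",))

calls = {
    main: [fact_n, py_api],
    fact_n: [fact_rec],
    fact_rec: [fact_rec],
}
tree = call_tree(calls, main)
```

The recursive call to fact and the cross-language API call both appear as leaves, so the result is finite and can be displayed as a tree.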