fails to recover four functions, resulting in a 99.98%
recovery rate. Upon further examination, we find that
all four missed functions are from the factor program.
Table 1: Function recovery by compilation case.
Case      Ground truth  Functions found  Functions missed  Recovery fraction
strip     18139         18139            0                 1.0000
standard  18139         18139            0                 1.0000
debug     18139         18135            4                 0.9998
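The recovery fraction column can be reproduced from the two count columns. The following is a minimal sketch, assuming the fraction is simply functions found divided by ground-truth functions; the helper name recovery_fraction is hypothetical, and the counts are copied from Table 1.

```python
# Counts copied from Table 1: (ground-truth functions, functions found).
TABLE_1 = {
    "strip": (18139, 18139),
    "standard": (18139, 18139),
    "debug": (18139, 18135),
}


def recovery_fraction(ground_truth: int, found: int) -> float:
    """Fraction of ground-truth functions recovered (hypothetical helper)."""
    if ground_truth <= 0:
        raise ValueError("ground_truth must be positive")
    return found / ground_truth


for case, (truth, found) in TABLE_1.items():
    missed = truth - found
    fraction = recovery_fraction(truth, found)
    print(f"{case:<9} missed={missed}  fraction={fraction:.4f}")
```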
To determine the cause of the missed functions, we
further investigate the Ghidra decompilation of factor
and find that each of the missed functions results in a
decompilation error, “Low-level Error: Unsupported
data-type for ResolveUnion”. This indicates that an
error occurred when attempting to resolve a union data
type within the decompilation of these functions. Since
this error only occurs in the debug compilation case,
Ghidra’s parsing and interpretation of DWARF
information evidently contributes to it. The union data
type that triggers the error is successfully captured and
represented in our ground truth program information;
thus, this is likely a bug within Ghidra’s resolution logic.
5.2 High-Level Variable Recovery
To evaluate the variable (varnode) recovery accuracy of
the Ghidra decompiler, we first measure the inference
performance of high-level varnodes, including varnodes
with complex and aggregate types such as arrays,
structs, and unions. We further measure the varnode
inference accuracy by metatype to determine which
metatypes are most and least accurately inferred by the
decompiler. This analysis is performed under each
compilation configuration (stripped, standard, and debug).
In all our varnode evaluation tables, the Varnode
comparison score metric is defined as follows. We first
assign each varnode comparison level an integer that
increases linearly with the strength of the comparison
(NO MATCH = 0, OVERLAP = 1, SUBSET = 2, ALIGNED = 3,
MATCH = 4). We then normalize these scores to fall
within the range zero to one. For each ground truth
varnode, we compute this normalized score, and we take
the average over all ground truth varnodes to obtain
the resulting metric. This metric approximates how well,
on average, the decompiler infers the ground truth
varnodes.
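The definition above can be sketched in a few lines. This is an illustration only: it assumes that normalization means dividing each integer level by the maximum level (4), which is the simplest linear mapping onto the range zero to one, and the names Level, normalized, and comparison_score are hypothetical rather than taken from the evaluation framework.

```python
from enum import IntEnum
from statistics import mean


class Level(IntEnum):
    """Varnode comparison levels with the integer strengths given in the text."""
    NO_MATCH = 0
    OVERLAP = 1
    SUBSET = 2
    ALIGNED = 3
    MATCH = 4


def normalized(level: Level) -> float:
    # Assumption: linear normalization by the maximum level, so MATCH maps to 1.0.
    return int(level) / int(Level.MATCH)


def comparison_score(levels) -> float:
    """Average normalized score over all ground truth varnodes."""
    levels = list(levels)
    if not levels:
        raise ValueError("at least one ground truth varnode is required")
    return mean(normalized(level) for level in levels)


# Four hypothetical ground truth varnodes and their comparison levels.
example = [Level.MATCH, Level.MATCH, Level.SUBSET, Level.NO_MATCH]
print(comparison_score(example))  # (1.0 + 1.0 + 0.5 + 0.0) / 4 = 0.625
```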
In Table 2, we show the high-level varnode recovery
metrics for each of the compilation conditions,
aggregated from each of the benchmark programs. We find
that Ghidra at least partially infers 97.2%, 99.3%, and
99.6% and precisely infers 36.1%, 38.6%, and 99.7% of
high-level varnodes for the stripped, standard, and
debug compilation cases, respectively. In addition, the
varnode comparison scores for each compilation case are
0.788, 0.816, and 0.998, respectively. These metrics
indicate that the standard compilation case slightly
outperforms the stripped case in varnode inference,
while the debug compilation case results in significant
improvements over both the stripped and standard cases,
particularly in exact varnode recovery.
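The two fractions reported alongside the comparison score can be sketched as follows. The mapping is an assumption made for illustration: "at least partially inferred" is read as any comparison level other than NO MATCH, and "precisely inferred" as the MATCH level. The function name and the sample data are hypothetical.

```python
from collections import Counter

LEVELS = ("NO MATCH", "OVERLAP", "SUBSET", "ALIGNED", "MATCH")


def recovery_fractions(levels):
    """Return (partial, exact) recovery fractions for a list of comparison levels.

    Assumptions: partial recovery is any level other than NO MATCH,
    and exact recovery is the MATCH level only.
    """
    counts = Counter(levels)
    unknown = set(counts) - set(LEVELS)
    if unknown:
        raise ValueError(f"unknown comparison levels: {sorted(unknown)}")
    total = sum(counts.values())
    if total == 0:
        raise ValueError("at least one ground truth varnode is required")
    partial = (total - counts["NO MATCH"]) / total
    exact = counts["MATCH"] / total
    return partial, exact


# Five hypothetical ground truth varnodes.
sample = ["MATCH", "ALIGNED", "OVERLAP", "NO MATCH", "MATCH"]
partial, exact = recovery_fractions(sample)
print(partial, exact)  # 0.8 0.4
```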
In Table 3, we show the inference performance of
high-level varnodes broken down by metatype for each
compilation configuration. From the stripped and
standard compilation cases, we observe that varnodes
with metatype INT are most accurately recovered when
considering varnode comparison score, fraction partially
recovered, and fraction exactly recovered. In the
stripped case, the inference of ARRAY varnodes shows the
worst performance, with a varnode comparison score of
0.315. In the standard case, varnodes with metatype
STRUCT are least accurately recovered, with a varnode
comparison score of 0.560, followed closely by ARRAY and
UNION. We see that, for both the stripped and standard
compilation cases, the complex (aggregate) metatypes
ARRAY, STRUCT, and UNION show the lowest recovery
accuracy with respect to varnode comparison score. Among
the primitive metatypes, FLOAT shows the worst recovery
metrics for these two cases.
The debug compilation case demonstrates high recovery
accuracy across varnodes of all metatypes relative to
the stripped and standard cases. Of the primitive
metatypes, varnodes of the FLOAT metatype are perfectly
recovered, while varnodes of the INT and POINTER
metatypes show exact recovery percentages of 99.8% and
99.9%, respectively. The complex (aggregate) metatypes,
on average, display slightly lower recovery metrics than
the primitive metatypes in the debug compilation case.
The ARRAY metatype has the lowest varnode comparison
score at 0.986, and the UNION metatype has the lowest
exact match percentage at 87.5%.
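A per-metatype breakdown of this kind amounts to grouping the ground truth varnodes by metatype before averaging. The sketch below assumes each varnode is summarized as a (metatype, level) pair with integer levels from 0 to 4 and that scores are normalized by the maximum level; the records are made-up illustrative values, not data from Table 3.

```python
from collections import defaultdict

MAX_LEVEL = 4  # MATCH, the strongest comparison level


def scores_by_metatype(records):
    """Average normalized comparison score per metatype.

    records: iterable of (metatype, level) pairs, level in 0..MAX_LEVEL.
    """
    grouped = defaultdict(list)
    for metatype, level in records:
        if not 0 <= level <= MAX_LEVEL:
            raise ValueError(f"level out of range: {level}")
        grouped[metatype].append(level / MAX_LEVEL)
    return {metatype: sum(vals) / len(vals) for metatype, vals in grouped.items()}


# Hypothetical comparison results for five ground truth varnodes.
records = [("INT", 4), ("INT", 4), ("ARRAY", 1), ("ARRAY", 2), ("STRUCT", 3)]
result = scores_by_metatype(records)
worst = min(result, key=result.get)
print(result)  # {'INT': 1.0, 'ARRAY': 0.375, 'STRUCT': 0.75}
print(worst)   # ARRAY
```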
5.3 Decomposed Variable Recovery
In this section, we repeat a similar varnode recovery
analysis over all varnodes; however, we first
recursively decompose each varnode into a set of
primitive varnodes (see Section 4). We perform this
analysis over all benchmarks for all three compilation
cases.
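The recursive decomposition can be pictured with a small sketch. It is not the procedure defined in Section 4: the DataType class, the (offset, size, metatype) output format, and the treatment of union members (each decomposed at its own offset) are all assumptions made for illustration.

```python
from dataclasses import dataclass

PRIMITIVES = {"INT", "FLOAT", "POINTER"}


@dataclass(frozen=True)
class DataType:
    """Hypothetical type description used only for this illustration."""
    metatype: str                       # INT, FLOAT, POINTER, ARRAY, STRUCT, or UNION
    size: int                           # size in bytes
    element: "DataType | None" = None   # ARRAY only: element type
    count: int = 0                      # ARRAY only: number of elements
    members: tuple = ()                 # STRUCT/UNION: (offset, DataType) pairs


def decompose(offset: int, dtype: DataType) -> list:
    """Recursively flatten a varnode into (offset, size, metatype) primitives."""
    if dtype.metatype in PRIMITIVES:
        return [(offset, dtype.size, dtype.metatype)]
    if dtype.metatype == "ARRAY":
        out = []
        for i in range(dtype.count):
            out.extend(decompose(offset + i * dtype.element.size, dtype.element))
        return out
    if dtype.metatype in ("STRUCT", "UNION"):
        out = []
        for member_offset, member in dtype.members:
            out.extend(decompose(offset + member_offset, member))
        return out
    raise ValueError(f"unknown metatype: {dtype.metatype}")


# A two-element array of a 16-byte struct holding an int and a double.
INT4 = DataType("INT", 4)
FLOAT8 = DataType("FLOAT", 8)
POINT = DataType("STRUCT", 16, members=((0, INT4), (8, FLOAT8)))
POINTS = DataType("ARRAY", 32, element=POINT, count=2)

primitives = decompose(0x100, POINTS)
print(primitives)
# [(256, 4, 'INT'), (264, 8, 'FLOAT'), (272, 4, 'INT'), (280, 8, 'FLOAT')]
```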
Similar to the high-level varnode analysis, we
show the inference of the decomposed varnodes for
each benchmark and for each compilation configu-
ICISSP 2023 - 9th International Conference on Information Systems Security and Privacy