when evaluating a scanner. True positive means that
a scanner reported an existing error. Security errors
are typically spread over several lines, so the question
is whether a scanner reports the right location. We
distinguish the following cases:
True positive+: Correct location
The scanner has reported an error at the same
source code line where this error had been
documented, assuming that we know the exact
location of the error.
True positive: Unspecified location
The scanner has correctly reported a specific
error. The exact location in the source code is
unknown or has not been specified. We typically
know a line range if the error is contained in a
(bad) function.
True positive–: Incorrect location
The scanner has correctly reported an error,
but given a wrong location, which is close
enough to be counted as a true positive.
In the Juliet test suite, we have errors for which we
know either the exact source code line or a range of
source code lines, i.e., we have a list of True positives+
and True positives. A scanner should report as
many True positives as possible. It is better to have a
report of an error with an incorrect location than no
report of that error at all, i.e., a True positive- reported
by a scanner is much better than a false negative.
We can only count a true positive as a True
positive+ when we ourselves know the exact location
in the test suite.
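To make these cases concrete, the following sketch illustrates how a reported finding could be classified against a documented error location. The type names and the line tolerance are hypothetical assumptions chosen for illustration; they are not taken from our evaluation setup.

    enum Verdict { TRUE_POSITIVE_PLUS, TRUE_POSITIVE, TRUE_POSITIVE_MINUS, NO_MATCH }

    final class FindingClassifier {
        // Assumed tolerance (in lines) for counting a near miss as True positive-.
        private static final int LOCATION_TOLERANCE = 3;

        // exactLine is the documented error line (null if unknown);
        // rangeStart/rangeEnd delimit the bad function (null if unknown).
        static Verdict classify(int reportedLine, Integer exactLine,
                                Integer rangeStart, Integer rangeEnd) {
            if (exactLine != null) {
                if (reportedLine == exactLine) {
                    return Verdict.TRUE_POSITIVE_PLUS;      // correct location
                }
                if (Math.abs(reportedLine - exactLine) <= LOCATION_TOLERANCE) {
                    return Verdict.TRUE_POSITIVE_MINUS;     // incorrect but nearby location
                }
                return Verdict.NO_MATCH;
            }
            if (rangeStart != null && rangeEnd != null
                    && reportedLine >= rangeStart && reportedLine <= rangeEnd) {
                return Verdict.TRUE_POSITIVE;               // correct error, unspecified location
            }
            return Verdict.NO_MATCH;
        }
    }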
4.5 Security Model
We have used a security model in which CWE entries
were combined into more abstract categories. For
example, a category “Buffer Overflow” represents
different CWE entries that describe types of buffer
overflows. We have used these categories, following
(Center for Assured Software 2011), to generate a
security model as part of a more general software
quality model (Ploesch et al. 2008). This security
model helps to interpret the scanner results. For
example, with the model we can determine a scanner’s
weaknesses or strengths in one or more areas.
This information can be used to choose a particular
scanner for a software project. Thus, if a scanner is
weak in identifying authentication problems but is
strong in other areas, this scanner can be used for
projects that do not have to deal with authentication.
Alternatively, an additional scanner with strong results
for authentication problems can be used. In general, the security
model helps to analyze the detected errors in more detail.
Another advantage of a security model is that it
provides a connection between generic descriptions
of software security attributes and specific software
analysis approaches (Wagner et al. 2012). This allows
us to automatically detect security differences
between different systems or subsystems as well as
security improvements over time in a software system.
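As an illustration, the following sketch shows how individual CWE entries could be mapped to the more abstract categories of such a security model. The category names and CWE assignments shown here are an illustrative subset only and do not reproduce the complete mapping of (Center for Assured Software 2011).

    import java.util.Map;

    final class SecurityModel {
        // Example mapping of CWE identifiers to abstract categories (illustrative subset).
        static final Map<Integer, String> CWE_TO_CATEGORY = Map.of(
            121, "Buffer Overflow",   // CWE-121: Stack-based Buffer Overflow
            122, "Buffer Overflow",   // CWE-122: Heap-based Buffer Overflow
            89,  "Injection",         // CWE-89: SQL Injection
            287, "Authentication"     // CWE-287: Improper Authentication
        );

        static String categoryFor(int cweId) {
            return CWE_TO_CATEGORY.getOrDefault(cweId, "Uncategorized");
        }
    }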
5 RESULTS
The Juliet Test Suite consists of a Java and a C/C++
test suite. We will start with Java. Subsequently, the
results for C/C++ will be presented. Finally, an
overall discussion of the findings will follow.
5.1 Java
In the Java test suite we had far more true positives
with an unspecified location (True Positive) than true
positives with a conclusive location (True Positive+),
which were determined with the additional ‘SAMATE’
files. Consequently, the scanners can only deliver a
few conclusive locations. Figure 1 contrasts the
percentage of errors with a conclusive location, i.e.,
True Positives+, with errors with an unspecified location,
i.e., True Positives. Figure 2 shows the distribution
of the test cases by the security model. We can
see that the numbers of test cases are not balanced
across the categories. As a result, scanners that find
many issues in categories with many test cases score
better in this comparison than other scanners. Notably,
the category “Buffer Overflow” has no test
cases at all. This should not come as a surprise, as the
Java Runtime Environment prevents buffer over-
flows in Java programs. Figure 3 shows an over-
view of all errors that were detected by the different
scanners. The entry Juliet on top shows the actual
number of documented errors in the test suite. We
can see that for a small percentage the exact location
of the error is known, but for most errors this is not
the case. FindBugs has apparently detected the
most errors, followed by Jlint and PMD. However,
PMD has found more exact locations than Jlint and
FindBugs. As Figure 3 shows, PMD’s numbers of True
Positives+, True Positives, True Positives- and
Wrong Positives are higher than those of Jlint
and FindBugs. Thus, it can be said that PMD has
located errors most accurately with regard to the test
suite. A deeper analysis of the results has shown that
the GDS rule set used within PMD was responsible
for PMD’s results. Without this rule set, PMD would
not have found any errors in the test suite. Nevertheless,
the overall results were poor. The scanners