to effectively manage the tree structure of
code, since most programming languages are
parsed into Abstract Syntax Trees (ASTs).
• The language should make all derivable
metadata and bindings available for querying.
• The language should allow fine-grained
queries down to individual syntax-token level.
• The language must enable optimized,
high-performance query execution that can scale to
millions or billions of source files.
• The language should be composable – that is,
the output of one query should be usable as the
input to another. This allows complex queries to
be built from smaller, simpler ones.
• The language should be extensible, and allow
programmers to include imperative code inline
for maximum flexibility and compactness.
• Finally, the language should have a familiar,
easily-understood SQL-like syntax that will
allow simple, compact construction of queries,
with clear, meaningful results.
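To make the composability and tree-structure criteria concrete, the following is a minimal sketch rather than CASTL syntax. It uses Python's standard `ast` module on a toy Python snippet instead of Java, and the helper names `of_type`, `containing_call`, and `compose` are hypothetical, introduced only for illustration. Each query step consumes a stream of AST nodes and produces another, so the output of one step can serve directly as the input to the next.

```python
import ast
from typing import Callable, Iterable, Iterator

# A query step maps a stream of AST nodes to another stream of AST nodes.
Query = Callable[[Iterable[ast.AST]], Iterator[ast.AST]]


def of_type(node_type: type) -> Query:
    """Step: every node of `node_type` at or below each input node."""
    def run(nodes: Iterable[ast.AST]) -> Iterator[ast.AST]:
        for node in nodes:
            for candidate in ast.walk(node):
                if isinstance(candidate, node_type):
                    yield candidate
    return run


def containing_call(name: str) -> Query:
    """Step: keep only input nodes whose subtree calls the function `name`."""
    def run(nodes: Iterable[ast.AST]) -> Iterator[ast.AST]:
        for node in nodes:
            if any(
                isinstance(c, ast.Call)
                and isinstance(c.func, ast.Name)
                and c.func.id == name
                for c in ast.walk(node)
            ):
                yield node
    return run


def compose(*steps: Query) -> Query:
    """Chain steps so that each one consumes the previous one's output."""
    def run(nodes: Iterable[ast.AST]) -> Iterator[ast.AST]:
        for step in steps:
            nodes = step(nodes)
        return iter(nodes)
    return run


SOURCE = """
def safe(x):
    return x + 1

def risky(cmd):
    return eval(cmd)
"""

tree = ast.parse(SOURCE)

# A small query, reused as a building block of a larger one.
functions = of_type(ast.FunctionDef)
functions_calling_eval = compose(functions, containing_call("eval"))

all_names = [fn.name for fn in functions([tree])]
flagged_names = [fn.name for fn in functions_calling_eval([tree])]
print(all_names)      # ['safe', 'risky']
print(flagged_names)  # ['risky']
```

Because every step has the same node-stream-in, node-stream-out shape, `functions_calling_eval` could itself be passed to `compose` as one step of a still larger query.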
This paper will present the Composable Auditing
and Security Tree-optimized Language (CASTL), a
new, flexible query language for source code. We
will discuss the design of the language, how it meets
the above criteria, its optimizations, some results of
testing queries on Java source code, and our open
source implementation and its performance. The
implementation may be further enhanced or
integrated into other tools. We also discuss a few of the surprises
we found while implementing our language, such as
discovering that the tree-pruning operations we
introduced to optimize query performance were the
same operations we came to use constantly to tailor
our results correctly in our tree-based system.
2 RELATED WORK
The querying of source code is a relatively recent
phenomenon, as publicly available codebases have
grown in number and size, and a growing emphasis
on security has increased interest in both automated
and interactive evaluation of code quality.
Historically, query writers used simple text
searching tools like awk and grep, or relational
database query languages like SQL. These tools are
not well-suited for source code because they do not
capture the rich structure and semantics of code.
However, they have nonetheless been used. Google
Code Search offered a huge archive of code to
search, regular expression queries, and improved
special character handling, a step above most other
options at the time (Cox, 2012). Natural language
query interpretation, correlated against the linguistic
information present in source comments and
identifier names, can also avoid the limitations of
flat searches. Haiduc, Bavota, Marcus, Oliveto, De
Lucia, and Menzies (2013) developed a system to
automatically detect low-quality queries and rewrite
them for more relevant results. Hill (2010) proposed
a hybrid system combining natural language and
program structure that used the natural language
query to prune poor results, allowing the query to
return more promising results.
Not all static analysis needs to query deeply
within source code. Robles and Merelo (2006)
described how non-code artifacts in a project may be
as rich a source of information as the code itself.
CQL (Code Query Language) and its successor,
CQLinq, which underpin the popular NDepend static
analysis tool, are SQL-like languages for .NET
projects that look primarily at high-level .NET
assembly metadata, representing programs as simple
relations (Smacchia, 2008). They have limited
capability to search below the class and member
level. PQL (Martin, Livshits, and Lam, 2005) is a
fascinating query language focused on the
identification of object event patterns, analyzing
sequences of method calls. McMillan, Poshyvanyk,
Grechanik, Xie, and Fu (2013) identify chains of
function calls as the key search result developers
require, and their Portfolio system thus models the
code as a directed graph of function calls.
Query languages focused on general tree or graph
structures also have relevance to code querying.
XQuery (Chamberlin, 2003) is a W3C standard
language, based on XPath, for querying XML
documents. It particularly influenced our work
due to its natural expression of paths through
XML nodes and its straightforward and powerful
FLWOR expression, which allows results to be
filtered, adjusted, and emitted in a customized HTML
format. PMD (PMD Introduction, 2018), a source
code analysis tool also based on XPath, with many
simple and useful rules predefined, allows efficient
querying of the entire AST, but lacks support for
bindings and metadata. Gremlin (Rodriguez, 2015)
is a graph traversal and query language that allows a
mixture of imperative and declarative queries, from
which we have also drawn inspiration.
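The path-based style of querying described above can be sketched with Python's standard `xml.etree.ElementTree` module, which supports a limited subset of XPath. The XML encoding of the AST below, including every element and attribute name, is a hypothetical illustration. It is not PMD's actual node vocabulary, not XQuery, and not CASTL. The final comprehension loosely mirrors the for / where / order by / return shape of a FLWOR expression.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML rendering of a small Java-like AST (names are assumptions).
AST_XML = """
<CompilationUnit>
  <ClassDeclaration name="Account">
    <MethodDeclaration name="getBalance" visibility="public">
      <ReturnStatement/>
    </MethodDeclaration>
    <MethodDeclaration name="close" visibility="public">
      <TryStatement>
        <CatchClause type="Exception"/>
      </TryStatement>
    </MethodDeclaration>
    <MethodDeclaration name="audit" visibility="private">
      <TryStatement>
        <CatchClause type="IOException"/>
        <CatchClause type="Exception"/>
      </TryStatement>
    </MethodDeclaration>
  </ClassDeclaration>
</CompilationUnit>
""".strip()

root = ET.fromstring(AST_XML)

# Path step: every method declaration at any depth below the root.
methods = root.findall(".//MethodDeclaration")

# FLWOR-like shape: for each method, where it contains a catch of the broad
# Exception type at any depth, order by the formatted text, return that text.
report = sorted(
    f"{m.get('name')} ({m.get('visibility')})"
    for m in methods
    if m.find(".//CatchClause[@type='Exception']") is not None
)

print(report)  # ['audit (private)', 'close (public)']
```

ElementTree predicates only test immediate children, so the descendant check is done with a second `find` call per method rather than inside a single path expression. A full XPath or XQuery engine would not need that workaround.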
The BOA language and infrastructure (Dyer,
Nguyen, Rajan, and Nguyen, 2013) is a
comprehensive system with high performance for
source code mining. It provides streamlined ASTs
for querying. However, the visitor-based query
language, derived from Google Sawzall, has a steep