or by some number of authors. An example of this
would be showing that a program was probably
written by three different authors, without actually
identifying the authors in question.
5. Author intent determination. In some cases we
need to know whether a piece of code that caused
a malfunction was written deliberately for that purpose or
was the result of an accidental error. In many cases,
an error during the software development process
can cause serious problems.
The traditional methodology followed in this
area of research is divided into two
main steps (Krsul and Spafford, 1995; MacDonell et al.,
2001; Ding, 2004). The first step is the extraction of
software metrics and the second step is using these
metrics to develop models that are capable of
discriminating between several authors, using a
machine learning algorithm. In general, the software
metrics used are programming-language dependent.
Moreover, the metrics selection process is a
non-trivial task.
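The two steps above can be illustrated with a minimal sketch. The three layout metrics and the nearest-centroid classifier below are hypothetical choices made only for illustration; they are not the metrics or learning algorithms used in the cited studies. Note that the comment metric assumes `//` comments, which shows why such metrics tend to be programming-language dependent.

```python
from collections import defaultdict
import math


def extract_metrics(source):
    """Step 1: reduce one program to a small vector of layout metrics.

    The metrics are illustrative assumptions, not taken from the cited work.
    """
    lines = source.splitlines() or [""]
    n = len(lines)
    mean_len = sum(len(line) for line in lines) / n
    blank_ratio = sum(1 for line in lines if not line.strip()) / n
    # Counting '//' comments only works for C-like languages.
    comment_ratio = sum(1 for line in lines if line.strip().startswith("//")) / n
    return [mean_len, blank_ratio, comment_ratio]


def train_centroids(samples):
    """Step 2a: average the metric vectors of each author.

    `samples` is an iterable of (author, source_text) pairs.
    """
    grouped = defaultdict(list)
    for author, source in samples:
        grouped[author].append(extract_metrics(source))
    return {
        author: [sum(col) / len(vectors) for col in zip(*vectors)]
        for author, vectors in grouped.items()
    }


def predict(centroids, source):
    """Step 2b: assign the program to the author with the nearest centroid."""
    vector = extract_metrics(source)
    return min(centroids, key=lambda a: math.dist(centroids[a], vector))


samples = [
    ("A", "int main() {\n    // start\n    return 0;\n}\n"),
    ("B", "int main(){return 0;}\n\n\n"),
]
centroids = train_centroids(samples)
guess = predict(centroids, "int run() {\n    // begin\n    return 1;\n}\n")
```

In a realistic setting the metrics would be scaled before distances are computed, and choosing which metrics to keep is the non-trivial selection problem mentioned above.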
In this paper we present a new approach, which
is an extension of a method that has been applied to
natural language text authorship identification
(Keselj et al., 2003). In our method, byte-level N-grams
are used together with author profiles. We
propose a new simplified profile and a new
similarity measure which enables us to achieve a
high degree of accuracy for authors for whom we
have a very small training set. Our methodology is
programming-language independent, since it is
based on low-level information, and has been tested on data
sets from two different programming languages. The
simplified profile and the new similarity measure we
introduce provide a less complicated algorithm than
the method used in text authorship attribution and in
many cases they achieve higher prediction accuracy.
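To make the idea of byte-level N-gram author profiles concrete, the following is a minimal sketch. It assumes that a profile is the set of the most frequent byte N-grams in an author's programs and that similarity is the number of N-grams two profiles share. Both are illustrative assumptions, since the simplified profile and the similarity measure are defined later in the paper (Section 3), and the names `build_profile`, `overlap` and `attribute` are hypothetical.

```python
from collections import Counter


def byte_ngrams(data, n):
    """Yield every contiguous run of n bytes in `data`."""
    return (data[i:i + n] for i in range(len(data) - n + 1))


def build_profile(programs, n=3, size=1000):
    """Return the set of the `size` most frequent byte N-grams.

    `programs` is a list of bytes objects; N-grams are counted inside each
    program so that none spans the boundary between two files.
    """
    counts = Counter()
    for program in programs:
        counts.update(byte_ngrams(program, n))
    return {gram for gram, _ in counts.most_common(size)}


def overlap(profile_a, profile_b):
    """Illustrative similarity: how many N-grams the two profiles share."""
    return len(profile_a & profile_b)


def attribute(author_profiles, program, n=3, size=1000):
    """Return the author whose profile overlaps most with the program's."""
    unknown = build_profile([program], n, size)
    return max(author_profiles, key=lambda a: overlap(author_profiles[a], unknown))


profiles = {
    "alice": build_profile([b"for (int i = 0; i < n; i++) { sum += a[i]; }"]),
    "bob": build_profile([b"while(k<n){total=total+arr[k];k=k+1;}"]),
}
guess = attribute(profiles, b"for (int j = 0; j < m; j++) { acc += b[j]; }")
```

Because the programs are read as raw bytes, nothing in the sketch depends on the syntax of a particular programming language, which is the property the approach relies on.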
Special attention is paid to the evaluation
methodology. Disjoint training and test sets of equal
size were used in all the experiments in order to
ensure the reliability of the presented results. Note
that in many previous studies the evaluation of the
proposed methodologies was performed on the
training set. Our approach is able to deal effectively
with cases where there are just a few available
programs per author. Moreover, the accuracy results
are high even for cases where the available programs
are of restricted length.
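The evaluation protocol described above, with disjoint training and test sets of equal size, can be sketched as follows. Splitting per author, shuffling with a fixed seed and dropping a leftover program when an author has an odd number of programs are assumptions of this sketch, not details stated here; `split_equal` and `accuracy` are hypothetical helper names.

```python
import random


def split_equal(programs_by_author, seed=0):
    """Split each author's programs into disjoint halves of equal size."""
    rng = random.Random(seed)
    train, test = {}, {}
    for author, programs in programs_by_author.items():
        shuffled = list(programs)
        rng.shuffle(shuffled)
        half = len(shuffled) // 2
        train[author] = shuffled[:half]
        # With an odd count, one program is left out so both halves match.
        test[author] = shuffled[half:2 * half]
    return train, test


def accuracy(predict, test):
    """Fraction of held-out programs assigned to their true author.

    `predict` maps a program to an author name; `test` must not be empty.
    """
    pairs = [(author, p) for author, programs in test.items() for p in programs]
    return sum(predict(p) == author for author, p in pairs) / len(pairs)


data = {"a": ["a1", "a2", "a3", "a4"], "b": ["b1", "b2", "b3"]}
train, test = split_equal(data)
score = accuracy(lambda program: program[0], test)
```

Since no program appears in both halves, the reported accuracy reflects performance on unseen code rather than on the training set.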
The rest of this paper is organized as follows.
Section 2 reviews past research efforts
in the area of source code authorship analysis.
Section 3 describes our approach and Section 4
presents the experiments we have performed.
Finally, Section 5 contains conclusions and future
work.
2 RELATED WORK
The most extensive and comprehensive application
of authorship analysis is in literature. One famous
authorship question concerns Shakespeare’s
works and dates back several centuries.
Elliot and Valenza (1991) compared the poems of
Shakespeare with those of Edward de Vere, 17th Earl
of Oxford, in the context of attempts to show that
Shakespeare was a hoax and that the real author was
de Vere. Recently, a
number of authorship attribution approaches have
been presented (Stamatatos et al., 2000; Keselj et
al., 2003; Peng et al., 2004), showing that the author of
a natural language text can be reliably identified.
Although source code is much more formal and
restrictive than spoken or written languages, there is
still a large degree of flexibility when writing a
program (Krsul and Spafford, 1996). Spafford and
Weeber (1993) suggested that it might be feasible to
analyze the remnants of software left after a computer
attack, such as viruses, worms or Trojan horses, and
identify their author. This technique, called software
forensics, could be used to examine software in any
form to obtain evidence about the factors involved.
They investigated two different cases where code
remnants might be analyzed: executable code and
source code. Executable code, even if optimized,
still contains many features that may be considered
in the analysis such as data structures and
algorithms, compiler and system information,
programming skill and system knowledge, choice of
system calls, errors, etc. Source code features
include programming language, use of language
features, comment style, variable names, spelling
and grammar, etc.
Oman and Cook (1989) used “markers” based on
typographic characteristics to test authorship on
Pascal programs. The experiment was performed on
18 programs written by six authors. Each program
implemented a simple algorithm and
was taken from a computer science textbook.
They claimed that the results were surprisingly
accurate.
Longstaff and Shultz (1993) studied the WANK
and OILZ worms, which attacked NASA and
DOE systems in 1989. They manually analyzed code
structures and features and reached a
ICETE 2005 - SECURITY AND RELIABILITY IN INFORMATION SYSTEMS AND NETWORKS