Authors:
Georgia Frantzeskou
;
Efstathios Stamatatos
and
Stefanos Gritzalis
Affiliation:
Laboratory of Information and Communication Systems Security, Aegean University, Greece
Keyword(s):
Source Code Authorship Analysis, Software Forensics, Security.
Related
Ontology
Subjects/Areas/Topics:
Data and Application Security and Privacy
;
Data Engineering
;
Data Privacy and Security
;
Databases and Data Security
;
Information and Systems Security
Abstract:
Source code authorship analysis is the particular field that attempts to identify the author of a computer program by treating each program as a linguistically analyzable entity. This is usually based on other undisputed program samples from the same author. There are several cases where the application of such a method could be of a major benefit, such as tracing the source of code left in the system after a cyber attack, authorship disputes, proof of authorship in court, etc. In this paper, we present our approach which is based on byte-level n-gram profiles and is an extension of a method that has been successfully applied to natural language text authorship attribution. We propose a simplified profile and a new similarity measure which is less complicated than the algorithm followed in text authorship attribution and it seems more suitable for source code identification since is better able to deal with very small training sets. Experiments were performed on two different data se
ts, one with programs written in C++ and the second with programs written in Java. Unlike the traditional language-dependent metrics used by previous studies, our approach can be applied to any programming language with no additional cost. The presented accuracy rates are much better than the best reported results for the same data sets.
(More)