results of this work to finally describe the
conclusions and possible future work in the last
section.
2 BACKGROUND
Visual Basic for Applications (Wikipedia, n.d.) is
the language used to create macros in Office. It
appeared in 1993, and its latest version dates from
2013. It is related to Visual Basic in that it needs
the same engine to run, but it is not a standalone
language: its code must run within a host application
that contains it, and it interacts with other
applications through OLE Automation objects (a
Microsoft inter-process communication mechanism).
VBA is compiled into P-Code (also used in Visual
Basic). This is a proprietary Microsoft format that
can be decompiled back to the form in which the code
was originally written. Once compiled, the code is
stored in the corresponding Office document as a
separate stream in an OLE or COM object.
Since 2007 there have been two very different
Office document formats and, depending on which one
is used, this object is found either embedded in the
document or as a separate file. The different formats
are:
- Formats based on the Microsoft formats prior to
  2007, with .doc or .xls extensions ("classic"
  format). These documents are themselves OLE
  objects.
- Formats based on Open XML (Microsoft, n.d.),
  introduced in 2007, with .docx or .xlsx
  extensions, for example. These documents are
  actually ZIP files, which contain the same kind
  of COM object to hold the macros.
The COM or OLE objects that Microsoft uses to store
macros are specifically OLE objects that follow the
"Office VBA File Format" structure.
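
The distinction between the two container formats can be illustrated with a minimal Python sketch. It is not part of the framework described in this paper. The function name locate_vba_container is hypothetical, and the sketch assumes that the macro container inside an Open XML archive is the part named vbaProject.bin. It only inspects file signatures and archive part names, and does not parse the VBA project itself.

```python
import io
import zipfile
from typing import List, Tuple

OLE_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")  # signature of an OLE compound file
ZIP_MAGIC = b"PK\x03\x04"                      # local file header of a ZIP archive


def locate_vba_container(data: bytes) -> Tuple[str, List[str]]:
    """Return (format, candidate locations of the OLE object holding macros)."""
    if data.startswith(OLE_MAGIC):
        # Classic format: the document itself is the OLE object.
        return "classic", ["<document itself>"]
    if data.startswith(ZIP_MAGIC):
        with zipfile.ZipFile(io.BytesIO(data)) as archive:
            # Assumption: the macro container is the part named vbaProject.bin.
            parts = [
                name for name in archive.namelist()
                if name.lower().endswith("vbaproject.bin")
            ]
        return "openxml", parts
    return "unknown", []
```

An empty list for an Open XML document simply means that no part with that name was found; it does not prove the document is benign.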
3 STATE OF THE ART
From the macro malware analysis standpoint, numerous
specific analyses can be found on the Internet of
malware that uses different techniques to maximize
the chances of VBA gaining control of the system and
executing code. In the antivirus industry, numerous
patents have been filed to control this type of
malware, such as (Ko, 2004), which describes how to
extract the macro from a document, analyse the flow
and operations of the code, compare them against a
previously categorized database and issue a verdict.
Improving on this approach, new malware detection
techniques based on program behavior analysis have
appeared since the late 1990s, such as (Chi, 2006),
patented by Symantec in 2006. A third approach, for
example (Shipp, 2009), consists of a more thorough
analysis of the code itself through the use of
statistics, but always limited to its morphological
aspects, that is, comments, character frequency,
names of variables and functions, etc.
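
To make the notion of morphological statistics concrete, the following Python sketch computes three such measurements from VBA source text. It is an illustration only and does not reproduce the method of (Shipp, 2009): the function name, the feature names and the simplified rules for recognising comments and declared names are all assumptions made for this example.

```python
import re
from collections import Counter
from typing import Dict

# Simplified: names introduced by Sub, Function or Dim on the same line.
IDENT_RE = re.compile(
    r"\b(?:Sub|Function|Dim)[ \t]+([A-Za-z_][A-Za-z0-9_]*)", re.IGNORECASE
)


def morphological_features(vba_source: str) -> Dict[str, float]:
    """Compute surface statistics of the code, ignoring what it does."""
    lines = [ln.strip() for ln in vba_source.splitlines() if ln.strip()]
    # Simplified: a comment is a line starting with an apostrophe or "Rem ".
    comments = [
        ln for ln in lines
        if ln.startswith("'") or ln.lower().startswith("rem ")
    ]
    names = IDENT_RE.findall(vba_source)
    chars = Counter(ch for ch in vba_source if not ch.isspace())
    total = sum(chars.values())
    digits = sum(n for ch, n in chars.items() if ch.isdigit())
    return {
        "comment_ratio": len(comments) / len(lines) if lines else 0.0,
        "mean_name_length": sum(map(len, names)) / len(names) if names else 0.0,
        "digit_ratio": digits / total if total else 0.0,
    }
```

None of these values depends on what the macro actually does, which is the limitation of the morphological approach noted above.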
In addition to the above, a different approach can
be considered, based on the use of machine learning
techniques to detect malware. In this case, most of
the existing literature comes from academia and is
considerably less extensive than that covering the
aforementioned approaches. For example, (Nissim, et
al., 2015) use a methodology they have named Active
Learning, in which machine learning techniques are
applied to Open XML documents (.docx extension).
Features that are external to the code are extracted
from the document by a system called SFEM and,
combined with their learning system ALDOCX, help
identify malware in Office documents. SFEM feature
extraction is based on obtaining the internal paths
of the ZIP archive that makes up the document. This
system is restricted to the new formats based on
Office’s XML, and it needs the full document to work
properly, including text or other relevant content,
which may violate privacy if information control
during analysis is not strict enough.
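
The idea of using the internal paths of the ZIP archive as features can be sketched in Python as follows. This is not the SFEM implementation: the function names, the decision to include parent directories as separate features, and the binary vector encoding are assumptions chosen for illustration.

```python
import io
import zipfile
from typing import List, Set


def internal_path_features(docx_bytes: bytes) -> Set[str]:
    """Collect every internal path of the archive and its parent directories."""
    features: Set[str] = set()
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as archive:
        for name in archive.namelist():
            parts = name.strip("/").split("/")
            # Assumption: each path prefix counts as a feature of its own.
            for depth in range(1, len(parts) + 1):
                features.add("/".join(parts[:depth]))
    return features


def to_binary_vector(features: Set[str], vocabulary: List[str]) -> List[int]:
    """Encode presence or absence of each known path for a classifier."""
    return [1 if path in features else 0 for path in vocabulary]
```

Note that this kind of extraction reads the whole archive, which is why such a system needs the full document and not only its macro code.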
Furthermore, (Schreck, et al., 2013) presented
another approach, which they called Binary
Instrumentation System for Secure Analysis of
Malicious Documents, which sought to distinguish
malicious documents by extracting the malware
payload and identifying the exploited
vulnerabilities. However, it only worked for
classifying classic Microsoft formats (.doc
extension).
The technique and framework presented in this
research can work with both classic-format and
Open XML-based documents. In addition, it relies
primarily on the characteristics of the VBA project
code and other metadata of the file, and it is
completely detached from the contents of the document
or any aspect that would allow a connection to be
established with a particular document. The use of
metadata is limited, focusing the machine learning on
the features that define the code at the semantic
level. Moreover, as we will analyze in later
sections, we enhance a feature selection similar to
that used by Schreck, et al., automating it and
making it dynamic over time.