if the structure of a PDF document has been altered,
malware presence in the document is very likely. In
this case, the presented system does not rely on ma-
chine learning, but just uses predefined rules for de-
termining whether the document is malformed (which
implies malware) or not. From our perspective, this
solution is very interesting as an additional source of
information, alongside others, for a machine learning
implementation. In our work, we have included this
information as a feature in the feature vector of our
system.
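As a rough illustration of how a rule-based verdict of this kind can be folded into a feature vector, the following minimal sketch applies three simple structural checks to the raw bytes of a PDF file and exposes the outcome as a binary feature. The rules themselves and the names `structural_anomalies` and `malformed_feature` are assumptions made for illustration; they are neither the rules of the cited system nor the exact feature used in this work.

```python
import re

def structural_anomalies(data: bytes) -> list[str]:
    """Return the names of the (hypothetical) structural rules that `data` violates."""
    anomalies = []
    # Rule 1: the header marker is expected at the very start of the file.
    if not data.startswith(b"%PDF-"):
        anomalies.append("missing_header")
    # Rule 2: an end-of-file marker is expected near the end of the file.
    if b"%%EOF" not in data[-1024:]:
        anomalies.append("missing_eof")
    # Rule 3: every "N G obj" opening should have a matching "endobj".
    opened = len(re.findall(rb"\b\d+\s+\d+\s+obj\b", data))
    closed = len(re.findall(rb"\bendobj\b", data))
    if opened != closed:
        anomalies.append("unbalanced_objects")
    return anomalies

def malformed_feature(data: bytes) -> int:
    """Binary feature: 1 if any rule is violated, 0 otherwise."""
    return int(bool(structural_anomalies(data)))
```

The returned flag can simply be appended to the other features of a document. A production checker would need far more rules, since many benign PDF writers also produce slightly non-conforming files.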
When machine learning approaches are used, success
depends on which areas of the document are examined
and on the strategies used to extract the information
that makes the algorithms more accurate. For example,
in (Laskov and Šrndić, 2011), the authors propose
the use of lexical analysis of JavaScript code as
the input for the machine learning system. However,
if the JavaScript code has been designed to look
legitimate, e.g. by blending the malicious code with
a large portion of legitimate code, as described in (los
Santos, 2016), the lexical analysis of the code may not
be effective enough for malware detection. Another
popular approach is the use of document metadata as
an input for the feature vector used for predictions.
For example, (Smutz and Stavrou, 2012) relies mainly
on the properties of document metadata in combination
with what the authors call "structural features" of the
document. Those features are closely related to the
document content, e.g. number of included images,
string sizes, etc. The use of metadata is an interesting
approach, but this metadata is usually counterfeited
by the attacker, so it is necessary to take into account
not only the typical metadata of malicious documents
but also counterfeited metadata. Furthermore, evaluating
the content of the document could be an effective
solution for detecting a specific set of malware types
(such as phishing attacks, where the attacker emulates
the appearance of an official document of a specific
organization) but may not be effective enough when
the attacker attaches the malicious content to a
legitimate document.
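To make the lexical approach and its weakness more concrete, the following minimal sketch tokenizes JavaScript with a crude regular expression and counts token classes. The token classes, the watch list `WATCHED` and the helper names `lexical_features` and `watched_share` are assumptions made for illustration; this is not the tokenizer or feature set of the cited works.

```python
import re
from collections import Counter

# Crude token classes; a real lexer would follow the ECMAScript grammar.
TOKEN_RE = re.compile(
    r"""
    (?P<string>"(?:\\.|[^"\\])*"|'(?:\\.|[^'\\])*')
  | (?P<number>\d+(?:\.\d+)?)
  | (?P<ident>[A-Za-z_$][A-Za-z0-9_$]*)
  | (?P<punct>[^\sA-Za-z0-9_$])
    """,
    re.VERBOSE,
)

# Hypothetical watch list of identifiers; chosen only for this example.
WATCHED = {"eval", "unescape", "fromCharCode"}

def lexical_features(js_source: str) -> Counter:
    """Count token classes and watched identifiers in a JavaScript string."""
    counts = Counter()
    for match in TOKEN_RE.finditer(js_source):
        counts[match.lastgroup] += 1
        if match.lastgroup == "ident" and match.group() in WATCHED:
            counts["id:" + match.group()] += 1
    return counts

def watched_share(counts: Counter) -> float:
    """Share of all tokens that are watched identifiers."""
    total = sum(counts[k] for k in ("string", "number", "ident", "punct"))
    watched = sum(v for k, v in counts.items() if k.startswith("id:"))
    return watched / total if total else 0.0
```

Padding a script with benign statements leaves the absolute counts of the watched identifiers unchanged but shrinks their share of all tokens, which is one way to picture the blending problem described above.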
A downside common to the aforementioned approaches
is that the samples for both the training and test
datasets were selected from public malware repositories,
which usually implies that the system is trained with
old and well-known samples and malware families. This
loses sight of the main advantage of machine learning,
namely staying a step ahead of signature-based systems
in the detection of new malware.
4 DESCRIPTION OF THE PROPOSAL
The presented proposal is based on the application of
machine learning techniques to detect fresh, real-world
malware, focusing on the PDF format, and on determining
whether it is possible to improve on typical static
solutions and be more effective. Specifically, the aim
is to get ahead of these solutions in the detection of
new types and families of malware, developing a
solution that complements them. This solution
may be used as an isolated framework for malware
detection that does not need the PDF document content
itself to work. This framework can be easily introduced
in cloud computing architectures to analyse
users' documents while preserving their privacy.
4.1 Samples Identification and Collection
The main goal at this stage of the study is to obtain an
initial dataset that represents, as realistically as possible,
the final environment where the classifier will operate.
For this purpose we have collected PDF malware samples
from different sources and of different types (PDF
document format versions). We have used a significant
percentage of recent samples (about 80% of the
PDF samples collected were created in 2016 or 2017).
JavaScript presence is a required characteristic for all
collected samples.
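A selection criterion such as JavaScript presence can be approximated with a simple scan of the raw file. The sketch below looks for the `/JS` and `/JavaScript` name objects that introduce JavaScript actions in PDF. The function name `has_javascript_marker` is hypothetical, and this is only an assumed approximation, not the filter used in this study: names written with hexadecimal escapes or placed inside compressed object streams would be missed.

```python
import re

# /JavaScript and /JS are the PDF name objects used by JavaScript actions.
# The trailing \b avoids matching longer names such as /JSON.
JS_NAME_RE = re.compile(rb"/(?:JavaScript|JS)\b")

def has_javascript_marker(data: bytes) -> bool:
    """Return True if the raw PDF bytes contain a /JS or /JavaScript name."""
    return JS_NAME_RE.search(data) is not None
```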
Regarding the sources from which the samples were
obtained, we selected them with the aim of representing
a real-world scenario as closely as possible.
We have used email accounts that usually receive
spam or malicious attachments, public
malware repositories (malwr.com, Contagio, etc.)
and document repositories in general (P2P networks,
search engines, etc.). In addition, in order to collect as
many recent in-the-wild malware samples as possible,
most of the included samples have been obtained from
the Cyber Threat Alliance (CTA, 2017), a dedicated
threat data sharing platform based on cooperation
between companies in the industry. From all these
sources, we have collected a total of 1712 samples.
For sample identification and classification, we
have defined two different classes or sets:
• Goodware: Samples containing JavaScript, obtained
from trusted sites (public document repositories
of public entities, universities, etc.) and without
suspicious characteristics detected by heuristics, e.g.
obfuscation, suspicious API calls, etc. As a final
layer to ensure that these samples were benign,
we used a set of antivirus engines to verify that no
sample was detected as malware.
ICISSP 2018 - 4th International Conference on Information Systems Security and Privacy