knowledge. The second part of our dataset was then
used to test if the predictions of our classifier were
correct.
Usually, between 60% and 80% of the dataset is used for training. Below 60%, the training set may be too small, and the classifier will perform poorly when inferring classes. On the other hand, if more than 80% of the dataset is used for training, the risk of overfitting the SVM increases. Overfitting is a phenomenon that occurs when a classifier learns noise and details specific to the training set. Furthermore, if one uses more than 80% of the dataset for training, one has to test the classifier on less than 20% of the data, which may not be enough to provide representative results. For these reasons, we first chose to use 60% of our dataset for training, and then observed how the success rate of our classifier evolved as we increased the size of the training set up to 80%.
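As an illustration, a minimal sketch of such a split with Scikit-learn's train_test_split; the feature matrix X and label vector y below are dummy stand-ins, not our actual data:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Dummy stand-ins for the real PDFiD feature vectors and labels
# (0 = clean, 1 = malicious); shapes are illustrative only.
X = np.random.randint(0, 5, size=(1000, 21))
y = np.random.randint(0, 2, size=1000)

# 60% of the shuffled dataset for training, 40% for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.6, shuffle=True, random_state=0)
```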
We used a Support Vector Machine (SVM) as the classification algorithm. Roughly speaking, this algorithm considers one cluster of points per label, and finds a hyperplane (or a set of hyperplanes when more than two labels are considered) separating them. In practice, the considered set is unlikely to be linearly separable. For this reason, we consider the problem in a higher-dimensional space in which the separation is hopefully easier. Furthermore, we want the dot product in this space to be easily computable from the coordinates of the vectors in the original space. We define this new dot product in terms of a kernel function k(x, y), where x and y are two vectors of the original space. This well-known kernel trick is due to (Aizerman et al., 1964) and was first applied to an SVM in (Boser et al., 1992). It allows us to work in a higher-dimensional space without having to compute the coordinates of our vectors in that space, using only dot products, which is computationally less costly.
We denote by n_features the number of features we consider (i.e. the size of our vectors), and we choose to use a Gaussian Radial Basis Function (RBF) kernel:

k(x, y) = exp(−γ · ||x − y||²),

with parameter γ = 1/n_features.
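For concreteness, this kernel can be computed directly from the formula above; in the following sketch, rbf_kernel is an illustrative helper and the vector size is taken to be the 21 default PDFiD features:

```python
import numpy as np

def rbf_kernel(x, y, gamma):
    """Gaussian RBF kernel: k(x, y) = exp(-gamma * ||x - y||^2)."""
    return np.exp(-gamma * np.sum((x - y) ** 2))

n_features = 21              # e.g. the 21 default PDFiD features
gamma = 1.0 / n_features     # the default parameter used in the text

x = np.random.rand(n_features)
y = np.random.rand(n_features)
print(rbf_kernel(x, y, gamma))
```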
2.2 Experiments
We implemented our classifier in Python, using the Scikit-learn package. We initially used PDFiD with its 21 default features to create our vectors. Then, we trained our SVM on 60% of our shuffled dataset. We used the remaining 40% of the data to test our SVM and compute its accuracy: the ratio (number of correctly classified PDF files)/(total number of PDF files). With this split of our dataset, we obtained an accuracy of 99.60%. Out of 2 622 PDF files containing malware, only 29 were detected as clean (1.11%). Out of 5 465 clean files, only 3 were classified as containing malware (0.05%). For comparison, the SVM used in (Kittilsen, 2011) has a success rate of 99.56%: out of 7 454 clean PDF files, 18 were detected as containing malware; out of 16 280 files containing malware, 103 were detected as clean.
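A sketch of this training and evaluation step, reusing X_train, X_test, y_train, y_test from the split sketch above (the exact code we used may differ):

```python
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, confusion_matrix

# gamma='auto' selects gamma = 1/n_features, matching the RBF kernel above.
clf = SVC(kernel='rbf', gamma='auto')
clf.fit(X_train, y_train)

y_pred = clf.predict(X_test)
print("accuracy:", accuracy_score(y_test, y_pred))

# Rows are true classes, columns predicted classes; the off-diagonal
# entries count the misclassified malicious and clean files.
print(confusion_matrix(y_test, y_pred))
```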
Using Different Settings. To go further in our experiments, we slightly modified our SVM. For instance, we tested other values of the γ parameter of the RBF kernel. We also tested the other kernels provided by the Scikit-learn package. We found that the RBF kernel with the default parameter γ = 1/n_features yielded the best results among all the settings we tried.
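Such a parameter sweep can be sketched with Scikit-learn's GridSearchCV; the kernel and γ grids below are illustrative, not the exact values we tested:

```python
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Candidate kernels and gamma values; this grid is illustrative only.
param_grid = {
    'kernel': ['rbf', 'linear', 'poly', 'sigmoid'],
    'gamma': ['auto', 0.001, 0.01, 0.1, 1.0],
}
search = GridSearchCV(SVC(), param_grid, cv=5)
search.fit(X_train, y_train)
print(search.best_params_, search.best_score_)
```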
Change Splitting Ratio. We changed how we split the initial dataset into training and testing sets. We observed that using 80% of our dataset for training and 20% for testing yielded a slightly higher success rate. We also used cross-validation: we repeated our training/testing process several times with different training and testing sets, and combined the results we obtained. We did not notice any overfitting issue (i.e. our SVM does not seem to be affected by the noise of the training set).
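A minimal cross-validation sketch with Scikit-learn's cross_val_score, reusing X and y from above (the fold count is illustrative):

```python
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# 5-fold cross-validation: each fold is used once as the testing set
# while the remaining folds serve as training data.
scores = cross_val_score(SVC(kernel='rbf', gamma='auto'), X, y, cv=5)
print(scores.mean(), scores.std())
```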
Change the Default Features. Instead of the 21 default features proposed by Didier Stevens (Stevens, 2006), we tried other feature selections. In the whole dataset, we found more than 100 000 different tag types. Considering that it requires 12 GB of memory to compute the vectors of the PDF files for the SVM, using 100 000 tags would be too much. A first strategy was to select features by their frequency in the files (e.g. a "90% frequency" means that 90% of the files in the dataset contain this feature). On the one hand, we chose the most common features in the clean PDF files; on the other hand, we chose the most frequently used features in the malicious files; we then combined them into a sublist of features. Once this selection was made, the resulting list could be merged with the 21 default features. A second strategy, which we call better sublist selection, was to remove non-significant features from the first sublist, by removing features one by one and recomputing the SVM (with cross-validation) to check whether accuracy improved or deteriorated (a sketch of this greedy procedure is given below). Note that these two strategies can be combined. In practice, we selected an initial sublist from the 21 default features and/or frequency selection, and then applied the better sublist selection. Table 1 shows some results found by applying these two strategies.
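As an illustration of the better sublist selection, the following sketch implements the greedy removal with cross-validation; better_sublist is a hypothetical helper, and X is assumed to be a 2-D feature matrix with one column per feature:

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

def better_sublist(X, y, features, cv=5):
    """Greedily drop features whose removal does not hurt CV accuracy."""
    def score(cols):
        return cross_val_score(SVC(kernel='rbf', gamma='auto'),
                               X[:, cols], y, cv=cv).mean()

    selected = list(features)
    best = score(selected)
    for f in list(selected):
        candidate = [c for c in selected if c != f]
        s = score(candidate)
        if s >= best:            # removal helped (or was neutral): drop f
            selected, best = candidate, s
    return selected, best
```

Accepting removals that leave accuracy unchanged (s >= best) also shrinks the feature list when a feature is merely redundant.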