knowledge. The second part of our dataset was then
used to test if the predictions of our classifier were
correct.
Usually, between 60% and 80% of the dataset is used for training. Below 60%, the training set may be too small, and the classifier will perform poorly when inferring classes. On the other hand, if more than 80% of the dataset is used for training, the risk of overfitting the SVM increases. Overfitting is a phenomenon that occurs when a classifier learns noise and details specific to the training set. Furthermore, if one uses more than 80% of the dataset for training, one has to test the classifier on less than 20% of the data, which may not be enough to provide representative results. For these reasons, we first chose to use 60% of our dataset for training, and then observed how the success rate of our classifier evolved as we increased the size of the training set up to 80%.
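As an illustration, a minimal sketch of such a split with Scikit-learn's train_test_split; the feature matrix X and label vector y below are dummy stand-ins, not our actual data:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Dummy stand-ins for the real PDFiD feature vectors and labels
# (0 = clean, 1 = malicious); shapes are illustrative only.
X = np.random.randint(0, 5, size=(1000, 21))
y = np.random.randint(0, 2, size=1000)

# 60% of the shuffled dataset for training, 40% for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.6, shuffle=True, random_state=0)
```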
We used a Support Vector Machine (SVM) as the classification algorithm. Roughly speaking, this algorithm considers one cluster of points per label, and finds a hyperplane (or a set of hyperplanes when more than two labels are considered) separating them. In practice, the considered set is unlikely to be linearly separable. For this reason, we consider the problem in a higher-dimensional space in which the separation is hopefully easier. Furthermore, we want the dot product in this space to be easily computable from the coordinates of the vectors in the original space. We define this new dot product in terms of a kernel function k(x, y), where x and y are two vectors of the original space. This well-known kernel trick is due to (Aizerman et al., 1964) and was first applied to an SVM in (Boser et al., 1992). It allows us to work in a higher-dimensional space without having to compute the coordinates of our vectors in that space, using only dot products, which is computationally less costly.
We denote by n_features the number of features we consider (i.e. the size of our vectors), and we choose to use a Gaussian Radial Basis Function (RBF) kernel:

k(x, y) = exp(−γ · ||x − y||²),

with parameter γ = 1/n_features.
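For concreteness, this kernel can be computed directly from the formula above; in the following sketch, rbf_kernel is an illustrative helper and the vector size is taken to be the 21 default PDFiD features:

```python
import numpy as np

def rbf_kernel(x, y, gamma):
    """Gaussian RBF kernel: k(x, y) = exp(-gamma * ||x - y||^2)."""
    return np.exp(-gamma * np.sum((x - y) ** 2))

n_features = 21              # e.g. the 21 default PDFiD features
gamma = 1.0 / n_features     # the default parameter used in the text

x = np.random.rand(n_features)
y = np.random.rand(n_features)
print(rbf_kernel(x, y, gamma))
```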
2.2 Experiments
We implemented our classifier in Python, using the Scikit-learn package. We initially used PDFiD with its 21 default features to create our vectors. Then, we trained our SVM on 60% of our shuffled dataset. We used the remaining 40% of the data to test our SVM and compute its accuracy: the ratio (number of correctly classified PDF files)/(total number of PDF files). With this split of our dataset, we obtained an accuracy of 99.60%. Out of 2 622 PDF files containing malware, only 29 were detected as clean (1.11%). Out of 5 465 clean files, only 3 were classified as containing malware (0.05%). For comparison, the SVM used in (Kittilsen, 2011) has a success rate of 99.56%: out of 7 454 clean PDF files, 18 were detected as containing malware; out of 16 280 files containing malware, 103 were detected as clean.
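A sketch of this training and evaluation step, reusing X_train, X_test, y_train, y_test from the split sketch above (the exact code we used may differ):

```python
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, confusion_matrix

# gamma='auto' selects gamma = 1/n_features, matching the RBF kernel above.
clf = SVC(kernel='rbf', gamma='auto')
clf.fit(X_train, y_train)

y_pred = clf.predict(X_test)
print("accuracy:", accuracy_score(y_test, y_pred))

# Rows are true classes, columns predicted classes; the off-diagonal
# entries count the misclassified malicious and clean files.
print(confusion_matrix(y_test, y_pred))
```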
Using Different Settings. To go further in our experiments, we slightly modified our SVM. For instance, we tested other values of the γ parameter of the RBF kernel. We also tested the other kernels provided by the Scikit-learn package. We found that the RBF kernel with the default parameter γ = 1/n_features yielded the best results among all the settings we tried.
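Such a parameter sweep can be sketched with Scikit-learn's GridSearchCV; the kernel and γ grids below are illustrative, not the exact values we tested:

```python
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Candidate kernels and gamma values; this grid is illustrative only.
param_grid = {
    'kernel': ['rbf', 'linear', 'poly', 'sigmoid'],
    'gamma': ['auto', 0.001, 0.01, 0.1, 1.0],
}
search = GridSearchCV(SVC(), param_grid, cv=5)
search.fit(X_train, y_train)
print(search.best_params_, search.best_score_)
```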
Change Splitting Ratio. We changed how we split the initial dataset into training and testing sets. We observed that using 80% of our dataset for training and 20% for testing yielded a slightly higher success rate. We also used cross-validation: we repeated our training/testing process several times with different training and testing sets, and combined the results we obtained. We did not notice any overfitting issue (i.e. our SVM does not seem to be affected by the noise of the training set).
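A minimal cross-validation sketch with Scikit-learn's cross_val_score, reusing X and y from above (the fold count is illustrative):

```python
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# 5-fold cross-validation: each fold is used once as the testing set
# while the remaining folds serve as training data.
scores = cross_val_score(SVC(kernel='rbf', gamma='auto'), X, y, cv=5)
print(scores.mean(), scores.std())
```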
Change the Default Features. Instead of the 21 default features proposed by Didier Stevens (Stevens, 2006), we tried other feature selections. In the whole dataset, we found more than 100 000 different tag types. Considering that it requires 12 GB of memory to compute the vectors of the PDF files for the SVM, using 100 000 tags would be too much. A first strategy was to select features by their frequency in the files (e.g. a "90% frequency" means that 90% of the files in the dataset contain this feature). On the one hand, we chose the most common features in the clean PDF files; on the other hand, we chose the most frequently used features in the malicious files; we then combined them into a sublist of features. Once this selection was made, the resulting list could be merged with the 21 default features. A second strategy, which we call better sublist selection, was to remove non-significant features from the first sublist, by removing features one by one and recomputing the SVM (with cross-validation) to check whether accuracy improved or deteriorated (a sketch of this greedy procedure is given below). Note that these two strategies can be combined. In practice, we selected an initial sublist from the 21 default features and/or frequency selection, and then applied the better sublist selection. Table 1 shows some results found by applying these two strategies.
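As an illustration of the better sublist selection, the following sketch implements the greedy removal with cross-validation; better_sublist is a hypothetical helper, and X is assumed to be a 2-D feature matrix with one column per feature:

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

def better_sublist(X, y, features, cv=5):
    """Greedily drop features whose removal does not hurt CV accuracy."""
    def score(cols):
        return cross_val_score(SVC(kernel='rbf', gamma='auto'),
                               X[:, cols], y, cv=cv).mean()

    selected = list(features)
    best = score(selected)
    for f in list(selected):
        candidate = [c for c in selected if c != f]
        s = score(candidate)
        if s >= best:            # removal helped (or was neutral): drop f
            selected, best = candidate, s
    return selected, best
```

Accepting removals that leave accuracy unchanged (s >= best) also shrinks the feature list when a feature is merely redundant.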