that the Sample method achieves better results than the SampleSpace one, although the Pass Rate metric shows the inverse. Recall that sampling each character according to the probability distribution of the generative model produces incorrect inputs at a higher rate (Figure 3). One possible explanation is that a high rate of correct inputs lets the program skip some of the instructions that verify the code's correctness in more detail. Thus, those instructions might be encountered by Tracer only when the given inputs are a mix of correct and (slightly) invalid ones. In (Godefroid et al., 2017) there is also a discussion about performing random fuzzing over the inputs learned using RNN methods, but, similar to our evaluation, the results are not better than those of the Sample method. The other technique, presented in (Höschele and Zeller, 2016), learns the grammar of the input through dynamic tainting and is currently applicable only to Java programs; it could not be evaluated since the tool is not (yet) open-source and could not be retrieved in any other way.
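The two generation strategies under comparison can be sketched as follows. This is a minimal illustration of our understanding of the strategies from (Godefroid et al., 2017), not the tool's actual implementation; the function and parameter names (e.g. next_char_probs, standing in for the learned character-level RNN) are ours.

```python
import random

def sample(probs):
    """Draw one character index from a probability distribution."""
    return random.choices(range(len(probs)), weights=probs, k=1)[0]

def generate(next_char_probs, vocab, length, strategy="sample"):
    """Generate an input of `length` characters from a character-level model.

    strategy="sample": every character is drawn from the model's
    distribution, giving diverse but more often invalid inputs.
    strategy="samplespace": characters are taken greedily (argmax), and the
    distribution is sampled only right after a whitespace, which keeps
    individual tokens well-formed at the cost of diversity.
    """
    text = ""
    for _ in range(length):
        probs = next_char_probs(text)  # model's distribution over vocab
        if strategy == "sample" or (strategy == "samplespace"
                                    and text.endswith(" ")):
            idx = sample(probs)
        else:
            idx = max(range(len(probs)), key=lambda i: probs[i])  # argmax
        text += vocab[idx]
    return text
```

The trade-off discussed above falls out directly: Sample explores more of the input space (higher coverage in our evaluation), while SampleSpace stays closer to the mode of the learned distribution (higher Pass Rate).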
6 CONCLUSIONS AND FUTURE WORK
This paper presented an open-source tool that assists users in the automatic generation of test data for evaluating programs, starting from a corpus of example tests. Support for any kind of input file format, efficient operation in distributed environments, online learning, and checkpoints are among its strongest features. The evaluation section shows the efficiency of using recurrent neural networks to learn generative models that can produce new tests, from two main perspectives: improved instruction coverage over random fuzzing and the percentage of correct input files produced from the learned model. As future work, we plan to improve the clustering of files using autoencoder techniques that can learn features from existing inputs, and to study the effectiveness of Generative Adversarial Networks (GANs) in improving test coverage. Another topic is to improve the usability of the tool by providing a visual interface for controlling parameters and injecting expert knowledge into the learning and generation processes more easily.
ACKNOWLEDGMENTS
This work was supported by a grant of the Romanian Ministry of Research and Innovation, CCCDI-UEFISCDI, project no. 17PCCDI/2018. We would like to thank our colleagues Teodor Stoenescu and Alexandra Sandulescu from Bitdefender, and Alin Stefanescu from the University of Bucharest for fruitful discussions and collaboration.
REFERENCES
Abadi, M., Agarwal, A., Barham, P., Brevdo, E., Chen, Z., Citro, C., Corrado, G. S., Davis, A., Dean, J., Devin, M., Ghemawat, S., Goodfellow, I. J., Harp, A., Irving, G., Isard, M., Jia, Y., Józefowicz, R., Kaiser, L., Kudlur, M., Levenberg, J., Mané, D., Monga, R., Moore, S., Murray, D. G., Olah, C., Schuster, M., Shlens, J., Steiner, B., Sutskever, I., Talwar, K., Tucker, P. A., Vanhoucke, V., Vasudevan, V., Viégas, F. B., Vinyals, O., Warden, P., Wattenberg, M., Wicke, M., Yu, Y., and Zheng, X. (2016). Tensorflow: Large-scale machine learning on heterogeneous distributed systems. CoRR, abs/1603.04467.
Bastani, O., Sharma, R., Aiken, A., and Liang, P. (2017).
Synthesizing program input grammars. SIGPLAN
Not., 52(6):95–110.
Cho, K., van Merrienboer, B., Gülçehre, Ç., Bougares, F., Schwenk, H., and Bengio, Y. (2014). Learning phrase representations using RNN encoder-decoder for statistical machine translation. CoRR, abs/1406.1078.
Chorowski, J., Bahdanau, D., Serdyuk, D., Cho, K., and
Bengio, Y. (2015). Attention-based models for speech
recognition. CoRR, abs/1506.07503.
Coppit, D. and Lian, J. (2005). Yagg: An easy-to-use generator for structured test inputs. In Proceedings of the 20th IEEE/ACM International Conference on Automated Software Engineering, ASE '05, pages 356–359, New York, NY, USA. ACM.
Cui, W., Peinado, M., Chen, K., Wang, H. J., and Irun-Briz, L. (2008). Tupni: Automatic reverse engineering of input formats. In Proceedings of the 15th ACM Conference on Computer and Communications Security, CCS '08, pages 391–402, New York, NY, USA. ACM.
Godefroid, P. (2007). Random testing for security: blackbox vs. whitebox fuzzing. In RT '07.
Godefroid, P., Kiezun, A., and Levin, M. Y. (2008). Grammar-based whitebox fuzzing. In Proceedings of the 29th ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI '08, pages 206–215, New York, NY, USA. ACM.
Godefroid, P., Levin, M. Y., and Molnar, D. (2012).
Sage: Whitebox fuzzing for security testing. Queue,
10(1):20:20–20:27.
Godefroid, P., Peleg, H., and Singh, R. (2017). Learn&fuzz: Machine learning for input fuzzing. In Rosu, G., Penta, M. D., and Nguyen, T. N., editors, Proceedings of the 32nd IEEE/ACM International Conference on Automated Software Engineering, ASE 2017, Urbana, IL, USA, October 30 - November 03, 2017, pages 50–59. IEEE Computer Society.
ICSOFT 2018 - 13th International Conference on Software Technologies