2 DATASET CHARACTERIZATION
As the tool used to evaluate the performance of attack detection techniques, datasets must include a wide variety of data. This data should comprise both normal and anomalous samples, so that the evaluation can provide meaningful metrics showing how the algorithm performs in different environments. A good anomaly detection method should not only detect most of the malicious behaviour (i.e. achieve a high True Positive Rate (TPR)) but should also avoid confusing normal behaviour with malicious behaviour (i.e. keep a low False Positive Rate (FPR)). The composition of the dataset can help to reveal poor performance from both points of view, but it is crucial to evaluate the content of a dataset before trusting the results it can yield.
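The two metrics above can be computed directly from the confusion matrix of an evaluation run. The sketch below is an illustrative helper (not from any specific tool), assuming labels where 1 marks an attack and 0 marks normal traffic:

```python
# Illustrative sketch: TPR and FPR from true vs. predicted labels
# (1 = attack, 0 = normal). Names are hypothetical, not from a library.
def rates(y_true, y_pred):
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    # TPR: fraction of attacks that were detected.
    tpr = tp / (tp + fn) if (tp + fn) else 0.0
    # FPR: fraction of normal samples raised as false alarms.
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    return tpr, fpr
```

A dataset with too few normal samples would make the FPR estimate unreliable, which is one reason dataset composition matters for evaluation.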
The KDD99 dataset is the most used dataset in the academic field, as it has become a standard evaluation benchmark. Despite this, the dataset is highly outdated and does not represent the current threats that a critical infrastructure can face. Moreover, several works (Brown et al., 2009) (McHugh, 2000) have highlighted deficiencies in this dataset that can bias the results of the algorithms applied to it. This raises the need to find another dataset that meets the requirements of our scenario and that can be used as a benchmark for anomaly detection algorithms.
Creating a dataset that provides a realistic scenario, while offering as much data as possible and avoiding biased information, poses a significant challenge (Shiravi et al., 2012). In this section we briefly describe the characteristics that we have found most important during our work with different kinds of datasets, and the impact they can have on the evaluation of algorithm performance.
• Generation Method: Dataset generation can be either synthetic or based on real captures. Real capture datasets are built from real traffic collected at a real institution such as a university, a research facility or a private organization. Synthetic datasets, on the other hand, are manually created by injecting malicious traffic into normal traffic samples; these normal samples can themselves be synthetically generated or be part of a real traffic capture. Real capture datasets are inherently better, as they model real network behaviour and therefore offer the most realistic information about the actual characteristics of an attack.
Moreover, the normal part of such a dataset shows the real use of the network, with no need to model it via any kind of traffic generation pattern. Despite all these advantages, datasets of this kind are very hard to find, due to the complexity of capturing real attacks and the privacy issues that can arise from publicly sharing the network traffic of an organization. The method employed to generate a dataset must therefore be taken into account when translating performance metrics into actual conclusions.
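The synthetic injection process described above can be sketched in a few lines. This is a simplified illustration under assumed field names, not the procedure of any particular dataset: labelled attack samples are mixed into a background of normal samples, and the result is shuffled so that ordering does not leak the injection point.

```python
import random

# Hedged sketch of synthetic dataset assembly: attack flows are
# injected into a (real or generated) background of normal flows.
# Field names and labels are illustrative assumptions.
def build_dataset(normal_flows, attack_flows, seed=0):
    samples = [dict(f, label="normal") for f in normal_flows]
    samples += [dict(f, label="attack") for f in attack_flows]
    # Shuffle so injected traffic is interleaved, not appended at the end.
    random.Random(seed).shuffle(samples)
    return samples
```

Note that even with shuffling, artifacts of the injection (e.g. unrealistic timing or address ranges in the attack traffic) can still bias an evaluation, which is the core weakness of synthetic generation.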
• Network Data Format: The format in which a dataset is presented determines the amount of information it offers. As the data represent network traffic, the formats are mainly based on different standardized network traffic representations, and the traffic can be offered raw or after some level of aggregation. For sharing raw network traffic, PCAP is the most widely used format. It is a standardized format that contains a direct copy of the traffic travelling through a network, and therefore avoids losing any information when sharing network traffic data. Its main disadvantages are the size of the data (it takes up the same space as the actual traffic collected from the network) and, being a lossless format, the privacy issues of sharing a raw copy of the data. As a consequence, real raw captures from a critical infrastructure are extremely hard to find, as releasing them would represent a huge threat to the organization itself.
As opposed to PCAP, NetFlow-like formats offer a summarized view of the traffic collected in the dataset. Their information unit is the traffic flow, that is, a sequence of messages exchanged between two network nodes. Each of these flows may comprise many traffic packets, but it is summarized as a single record characterized by its duration, size, number of packets, etc. These formats solve some of the privacy and size issues while retaining most of the core behaviour of the network, and as a consequence are widely employed for dataset generation.
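The flow summarization just described can be illustrated with a minimal sketch. Field names (5-tuple key, timestamps, sizes) are assumptions for illustration, not the actual NetFlow record layout: packets sharing the same source, destination, ports and protocol are collapsed into one record carrying only aggregate features.

```python
from collections import defaultdict

# Minimal NetFlow-style aggregation sketch (assumed field names):
# packets with the same 5-tuple become a single flow record with
# duration, total bytes and packet count; payloads are discarded.
def aggregate(packets):
    flows = defaultdict(list)
    for p in packets:
        key = (p["src"], p["dst"], p["sport"], p["dport"], p["proto"])
        flows[key].append(p)
    records = []
    for key, pkts in flows.items():
        ts = [p["ts"] for p in pkts]
        records.append({
            "flow": key,
            "duration": max(ts) - min(ts),
            "bytes": sum(p["size"] for p in pkts),
            "packets": len(pkts),
        })
    return records
```

As the sketch makes visible, payloads never reach the flow record, which is why this representation relieves both the size and the privacy concerns of raw PCAP sharing.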
• Anonymization Level: To mitigate the privacy issues mentioned above, different techniques are employed to reduce the amount of private information included in a dataset. These techniques try to preserve most of the actual behaviour of the network, so that attacks can still be detected and distinguished from normal traffic. The most basic anonymization method is the aggregation offered by the format itself: as noted for the previous characteristic, if the dataset is offered in a flow-summarized format, the payload of the packets is removed. There also exist datasets in raw format that offer PCAP files without the data payload of the packets. Both techniques prevent leaking
DCCI 2016 - SPECIAL SESSION ON DATA COMMUNICATION FOR CRITICAL INFRASTRUCTURES