Guidelines and a Framework to Improve the Delivery of Network
Intrusion Detection Datasets
Brian Lewandowski¹,²
¹ Computer Science, Worcester Polytechnic Institute, 100 Institute Road, Worcester, U.S.A.
² Raytheon Technologies, 1001 Boston Post Road E, Marlborough, U.S.A.
Keywords:
Network Intrusion Detection, Datasets, Machine Learning, Deep Learning.
Abstract:
Applying deep learning techniques to perform network intrusion detection has expanded significantly in recent
years. One of the main factors contributing to this expansion is the availability of improved network intrusion
detection datasets. Despite recent improvements to these datasets, researchers have found it difficult to effec-
tively compare methodologies across a wide variety of datasets due to the unique features generated as part of
the delivered datasets. In addition, it is often difficult to generate new features using a dataset due to the lack
of source data or inadequate ground truth labeling information for a given dataset. In this work, we look at net-
work intrusion detection dataset development with a focus on improving the delivery of datasets from a dataset
researcher to other downstream researchers. Specifically, we focus on making dataset features reproducible,
providing clear labeling criteria, and allowing a clear path for researchers to generate new features. We outline
a set of guidelines for achieving these improvements along with providing a publicly available implementation
framework that demonstrates the guidelines using an existing network intrusion detection dataset.
1 INTRODUCTION
Network intrusion detection (NID) is a methodology
to protect computer networks by analyzing network
traffic in order to identify malicious network traffic
(Chou and Jiang, 2022). Researchers have begun to
leverage machine and deep learning techniques in or-
der to effectively combat the increasingly complex
and evolving attacks taking place on networks today
(Yang et al., 2022). In order to research and verify the
applicability of these data intensive techniques to net-
work intrusion detection systems (NIDS), one must
utilize datasets consisting of network scenarios that
involve both benign and malicious activity. To sup-
port these efforts a growing number of datasets have
been developed and analyzed (Chou and Jiang, 2022;
Ring et al., 2019; Yang et al., 2022). Despite the great
strides made in NID dataset development, researchers
have identified limitations which make it challenging to benchmark methods and perform feature engineering (Chou and Jiang, 2022; Ferriyan et al., 2021; Sarhan et al., 2021b; Sarhan et al., 2021c; Sarhan et al., 2020; Wolsing et al., 2021).
This document does not contain technology or Technical Data controlled under either the U.S. International Traffic in Arms Regulations or the U.S. Export Administration Regulations.
In this work we seek to reduce the impact of lim-
itations that occur as a result of the handoff of NID
datasets from a dataset developer to downstream NID
researchers. We propose a set of guidelines to help
dataset developers overcome handoff limitations and
extend the positive impact these datasets can have on
downstream researchers. Our focus on the handoff of
NID datasets between researchers has not been well
explored in current NIDS dataset research. While
many of the guidelines are generic in nature, we pro-
vide details on how to specifically implement them for
NID datasets. In addition to the guidelines, we pro-
vide an open source containerized environment and
framework to support implementation of the guidelines¹.
Figure 1 shows the dataset development process
adapted from descriptions in recent research to show
where this work logically fits (Sarhan et al., 2021b;
Komisarek et al., 2021). As can be seen highlighted
in the figure, we focus on improvements for NIDS
dataset feature and label generation which leads to
additional improvements for the final delivery of the dataset. The provided set of guidelines can be used such that the end delivery of a NIDS dataset includes the original source network data as well as concrete scripts for generating each feature and label. With both of these items in hand, researchers will be able to reliably recreate a dataset from source data, ensure the same features are available across multiple datasets, and perform additional feature engineering.
¹ https://github.com/WickedElm/niddff
Lewandowski, B.
Guidelines and a Framework to Improve the Delivery of Network Intrusion Detection Datasets.
DOI: 10.5220/0012052300003555
In Proceedings of the 20th International Conference on Security and Cryptography (SECRYPT 2023), pages 649-658
ISBN: 978-989-758-666-8; ISSN: 2184-7711
Copyright © 2023 by SCITEPRESS Science and Technology Publications, Lda. Under CC license (CC BY-NC-ND 4.0)
[Figure 1 flowchart: Network Traffic Generation → Collect Source Data → Feature Generation → Apply Labels → Evaluation and Benchmarking → Publish Dataset]
Figure 1: An overview of the NID dataset development process. The areas that the guidelines seek to improve are depicted in blue with a gray background.
The main contributions of this work are as follows:
• To the best of the authors’ knowledge, this is the first work to focus specifically on the handoff of NID datasets.
• We identify the limitations that affect the handoff of NID datasets between researchers.
• Guidelines are provided for overcoming limitations currently present in the NID dataset handoff process between researchers.
• An open source containerized environment is provided which contains a standard toolset and a feature engineering framework to support implementation of the guidelines.
The remainder of this work is outlined as follows.
In Section 2 related works that look to improve the
NID dataset development process are explored. Sec-
tion 3 outlines the limitations related to NID datasets
that this work seeks to address. In Section 4 we dis-
cuss the details of our proposed guidelines along with
the developed containerized environment and frame-
work. We conclude and outline future work in Section
5.
2 RELATED WORK
Publicly available NID datasets that have been devel-
oped by researchers are well explored and analyzed
in the literature through a number of surveys (Chou
and Jiang, 2022; Ring et al., 2019; Yang et al., 2022).
These surveys break down the various datasets by dif-
ferent criteria such as data format, real versus syn-
thetic data, availability, as well as statistical data re-
garding the datasets. For this reason we focus this sec-
tion on other NID dataset development tools and ded-
icate Section 3 to discuss work related to NID dataset
limitations.
One of the earliest tools concerned with NID
datasets was FLAME (Brauckhoff et al., 2008). The
main goal of the FLAME tool was to take existing
netflow data and augment it by injecting new anoma-
lies into the existing flows. In doing this, it would
allow researchers to capture live network traffic and
then augment it later with anomalies resulting in a
dataset usable for developing NID methodologies.
With a goal similar to FLAME, the ID2T tool
also focused on augmenting network data with attacks
(Cordero et al., 2015; Cordero et al., 2021). The ID2T
tool, however, approached this from the packet level,
ingesting packet capture (PCAP) files as opposed to
netflow data. In addition, the ID2T tool is capable of
providing reports regarding the attacks injected such
that they can be used to facilitate data labeling. These
qualities allowed ID2T to be capable of injecting a
larger variety of network attacks making it useful for
the generation of new NID datasets that combine live
network traffic with synthetic attacks.
The INSecS-DCS tool (Rajasinghe et al., 2018) is
another NID dataset creation tool which has an ex-
panded scope compared to both FLAME and ID2T.
Rather than focus on injecting attacks, INSecS-DCS
focuses on processing packet data, live or from a
PCAP file, such that one can customize the features
to include in a final processed dataset. These features
can be captured as packet level statistics or based on
time windows.
Another related work introduces NDCT (Acosta
et al., 2021), which provides a toolkit for the collec-
tion and annotation of cybersecurity datasets. NDCT
is presented as a system that is primarily used during
a cybersecurity scenario exercise. During the scenario
execution, users are provided with dialogues to anno-
tate specific packets for tasks such as labeling. These
annotations can also be used to generate rules such
that similar packets receive the same labeling, mak-
ing the labeling task more efficient.
Our work is distinguished from these related
works in several ways. First, our guidelines and
framework do not explicitly focus on injecting attacks
into existing source data. Our implementation, how-
ever, complements both FLAME and ID2T such that
one could use both tools as part of a pipeline for NID
dataset creation within the framework. Similarly, one
could incorporate the INSecS-DCS tool for feature
creation within our framework. We currently incorporate both Zeek² and Argus³ in our container environment; however, the intention is to grow the environment such that tools such as these can be incorporated. While NDCT focuses on supporting the
live annotation of network data during scenario ex-
ecution, our framework would support downstream
feature engineering on the resulting source data. The
other main differentiator for our work, compared to those discussed here, is that we focus on being able to
reproduce a dataset from source files and facilitate its
exchange between researchers. The previous works
in this space do not generally have this focus as they
seek to improve the actual NID data itself as opposed
to the process of creating it.
3 NID DATASET LIMITATIONS
3.1 Reproducibility
The main focus of the guidelines and framework pre-
sented in our work is to increase the ability of re-
searchers to reproduce datasets from source files. To
be clear, we are not concerned with reproducing and
re-executing a NID scenario. Rather, we would like
to take the resulting PCAP and netflow files from
such a scenario and be able to reliably reproduce the
dataset’s original features and then perform further
feature engineering.
The need for this type of reproducibility is illustrated by a recent analysis of how downstream researchers use publicly available datasets (Chou and Jiang, 2022). In most instances, the original datasets are augmented in some way; however, it was found that most of the work related to these augmentations could not be duplicated due to insufficient code, documentation, or both (Chou and Jiang, 2022). The framework delivered in our work seeks to improve this situation by providing a standard way to document and code dataset augmentations from source files.
Other work has proposed a set of content and
process requirements for generating a reproducible
dataset (Ferriyan et al., 2021). The content require-
ments outlined include providing full PCAP files with
their data payload, anonymization of network traffic,
providing ground truth data, using up-to-date network traffic, labeling the data, and providing information regarding encryption. The process requirements pertain to information that should be provided in order to make generating the dataset reproducible. Our guidelines support these requirements, and we look to extend them with our framework through inclusion of the scripts used to generate features and perform labeling, along with full PCAP files. This leaves no ambiguity in descriptions for how to regenerate a dataset.
² https://zeek.org/
³ https://openargus.org/
In addition to these works, there is no shortage of
NID literature that discusses the need to have repro-
ducible datasets (Cordero et al., 2021; Lavinia et al.,
2020; Sharafaldin et al., 2018a; Kenyon et al., 2020).
3.2 Unclear Labeling Criteria
One of the major challenges that researchers face
when working with NID datasets is the lack of
datasets with complete and accurate labeling (Lavinia
et al., 2020; Moustafa et al., 2019). Many of these
labeling issues arise as the task is often performed
by human analysis, making it both time-consuming
and error-prone (Lavinia et al., 2020). In other cases,
the labeling criteria used are either incomplete or influenced negatively by previous errors in the dataset generation process (Lanvin et al., 2022).
For these reasons, accurate labels along with truth data are generally considered a major component of a useful NID dataset (Cordero et al., 2021; Rajasinghe et al., 2018; Komisarek et al., 2021). While ground truth on its own is useful, it can be misleading depending on the granularity at which it is provided. For instance, given only IP addresses and timestamps, one would have to assume that any traffic related to that IP address is malicious; however, it is typically the case that a mixture of both benign and attack traffic is present. For this reason, our proposed guidelines and framework call for the inclusion of labeling scripts, which may make use of ground truth data if necessary. We note that in the case of manually labeled data this scripting could simply be index-based, using the ordering of the data being processed.
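To make the granularity point concrete, the following is a minimal sketch of two labeling scripts applied to the same flows. It is not taken from the framework: the `Flow` and `GroundTruth` records, their field names, and the example addresses are all hypothetical, chosen only to contrast an IP-only rule with a rule that also uses the victim endpoint, port, and time window.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Flow:
    # Hypothetical flow record; field names are illustrative only.
    src_ip: str
    dst_ip: str
    dst_port: int
    start: float  # epoch seconds


@dataclass(frozen=True)
class GroundTruth:
    # Hypothetical ground truth entry describing one attack event.
    attacker_ip: str
    victim_ip: str
    victim_port: int
    start: float
    end: float


def label_by_ip(flow, truths):
    """Coarse rule: any flow touching an attacker IP is labeled malicious."""
    attackers = {t.attacker_ip for t in truths}
    return int(flow.src_ip in attackers or flow.dst_ip in attackers)


def label_by_event(flow, truths):
    """Finer rule: the flow must match the endpoints, port, and time window."""
    return int(any(
        flow.src_ip == t.attacker_ip
        and flow.dst_ip == t.victim_ip
        and flow.dst_port == t.victim_port
        and t.start <= flow.start <= t.end
        for t in truths
    ))


truths = [GroundTruth("10.0.0.5", "10.0.0.9", 445, 100.0, 200.0)]
flows = [
    Flow("10.0.0.5", "10.0.0.9", 445, 150.0),  # inside the attack window
    Flow("10.0.0.5", "10.0.0.7", 80, 150.0),   # same host, unrelated request
]

print([label_by_ip(f, truths) for f in flows])     # [1, 1]
print([label_by_event(f, truths) for f in flows])  # [1, 0]
```

Delivering a script of this kind alongside the ground truth lets downstream researchers see exactly which rule produced each label, and swap in a different rule if they disagree with it.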
3.3 No Standard Feature Set
In a recent series of papers, Sarhan et al. explore
limitations of current datasets and the impact these
limitations have on evaluating methods across mul-
tiple networks and transitioning research into practi-
cal applications (Sarhan et al., 2020; Sarhan et al.,
2021a; Sarhan et al., 2021c; Sarhan et al., 2021b).
The main limitation explored in these works is the fact
that with such varied features included with delivered
datasets, one cannot reliably compare a methodology
across multiple networks to test for generalizability.
This leads to hindrances during the transition from re-
search to practical applications.
Our work looks to extend the ideas expressed by
Sarhan et al. in order to enable researchers to over-
come these identified limitations. We aim to make it
easier for researchers to provide a dataset that is re-
producible from source data and easily expanded or
adjusted. In this way, one could easily use a standard
feature set as well as research augmenting such a fea-
ture set for improvements.
While not the main focus, both (Komisarek et al.,
2021) and (Layeghy et al., 2021) discuss and tackle
the need to use a common feature set for comparison
of their methods as opposed to using proprietary fea-
tures delivered with most NID datasets. Both works
provide informative descriptions regarding the fea-
tures used in their research. In addition, (Layeghy
et al., 2021) provides the actual calculations for the
features used in their work. We believe this is a step
in the right direction for the level of detail necessary
to reproduce datasets from source data. We seek to
naturally extend this information into scripts that are
provided along with source network captures to make
reproducing and extending the dataset more accessi-
ble and leave less opportunity for error.
4 NID DATASET DELIVERY GUIDELINES
4.1 The Intrinsic Value in NID Datasets
We believe it is worthwhile to provide a brief discus-
sion regarding the intrinsic value provided by NID
datasets as related to their development and subse-
quent distribution. Namely, the intrinsic value of a
NID dataset is created during the scenario develop-
ment, execution, and source data collection and not
by the final delivered features. To be clear, the fi-
nal features are valuable, but they are representative
of a separate feature engineering activity that takes
place after the intrinsic value of a network scenario
has been captured in source data. In other words,
the value provided by the NID dataset is derived from
the actual network intrusion scenario and its collected
source data. A researcher could provide any number of derived features with varying degrees of value for attack detection; however, the intrinsic value of the source data remains constant as it is derived from the scenario that was captured.
One goal of our framework is to highlight these two separate activities by advocating for the delivery of both source data and separate scripts that generate the features produced during any subsequent feature engineering. Providing both items delivers the value of both activities to downstream researchers.
4.2 Guidelines in Detail
The main ideas behind the proposed guidelines are
simple in statement but oftentimes overlooked in
practice. Specifically considering the handoff of datasets from one researcher to another, the guidelines focus on ease of access, reproducibility from source data, verification, and extension. The guidelines are meant to provide general guidance for making the delivery of NID datasets meet these four areas of focus and reduce the impact of the limitations
discussed in Section 3. We note that our framework
allows for the specific implementation of the guide-
lines to vary depending on the particular methods em-
ployed by researchers. While some common tools are
provided in our framework environment, we expect it
to expand to meet researchers’ needs as discussed fur-
ther in Section 5. In addition, it is important to make
the distinction that when we reference reproducibility
of a dataset, we refer to reproducing the dataset’s fi-
nal features from the original source data as opposed
to recreating and re-executing the dataset’s NID sce-
nario.
The ten guidelines are outlined and described in
Table 1 along with their justification and details re-
garding how NID researchers can implement each
guideline, with a focus on our companion frame-
work. Guidelines one through four pertain to provid-
ing downstream researchers with the resources neces-
sary to actively reproduce and enhance the provided
dataset. Guidelines five through nine outline steps
that can be taken to ensure that all the dataset features
and labels can be regenerated from source data, and
that the steps for this generation of features can be
verified and understood by downstream researchers.
Finally, guideline ten is specifically included to emphasize that the delivered datasets can be considered active projects and adjusted over time for any errors found after initial presentation to researchers. This
aims to help avoid situations such as with the KDD
Cup ’99 (kdd, 1999) and CICIDS2017 (Sharafaldin
et al., 2018b) datasets, where researchers have found
issues with the original datasets resulting in multiple
variants of datasets being available with specific cor-
rections (Tavallaee et al., 2009; Lanvin et al., 2022;
Engelen et al., 2021).
Table 1: Guidelines for improving the handoff of NID datasets from dataset researchers to downstream researchers.
| Guideline | Justification | Implementation Details |
|---|---|---|
| (1) Provide direct access to all data and scripts for dataset | The main purpose of this guideline is to prevent barriers to obtaining datasets. (Ring et al., 2019; Cordero et al., 2021) | This can be achieved through a simple download script. The implemented framework provides a mechanism such that dataset developers can provide metadata consisting of a download URL and destination file name to meet this guideline. |
| (2) Include complete source data to the most detailed extent possible | Full source data is necessary to adequately reproduce and/or augment a dataset. (Ring et al., 2019; Cordero et al., 2021; Ferriyan et al., 2021) | This should generally be a standard format such as PCAP or netflow. Full PCAP files are more favorable than partial PCAP files with no payload. If only netflow data is available, a full collection of attributes is better than a partial collection. |
| (3) If possible, provide access to all tools needed to generate dataset | Differences in tools, environments, and their versions can limit the ability of downstream researchers to obtain the same results as intended by the original dataset authors. Without this, extending the dataset with feature engineering may not be successful. (Chou and Jiang, 2022; Sarhan et al., 2021c; Cermak et al., 2018) | The implemented framework meets this guideline by providing a containerized environment with specific versions of tools such as Zeek and Argus. This ensures that users of the framework can use the same baseline of tools and environment as was used by the original dataset developers. |
| (4) Provide documentation indicating how to reproduce a dataset from source data | Clear documentation reduces ambiguity provided in general descriptions of dataset creation. Differences in commands used to generate a dataset from source can produce different results than the original dataset. (Chou and Jiang, 2022; Ferriyan et al., 2021) | Versions of tools and specific commands used to execute them should be documented. The provided framework is self-documenting as researchers can review YAML files for each dataset to view the commands used to generate them as discussed in Section 4.3. |
| (5) Include source code needed to reproduce dataset features | Providing feature generation source code ensures downstream researchers can duplicate a dataset, verify feature correctness, and understand details of the feature calculation. (Ferriyan et al., 2021; Ring et al., 2019) | One should avoid making code too specific to a particular user environment. The implemented framework supports this guideline with a containerized environment, specific directories for feature generation scripts, and infrastructure to support features generated with network analysis tools. |
| (6) The source code for each feature should be easily identifiable | This guideline is recommended to make analysis of the features of a dataset more accessible for downstream researchers. (Lanvin et al., 2022; Chou and Jiang, 2022) | This can be implemented through naming conventions for scripts that match the final feature name and techniques such as using a separate script or function for each feature. The implemented framework supports this by enforcing these conventions in its interfaces with network analysis tools. |
| (7) The generation of each feature should be independent from others | This guideline is recommended to avoid execution dependencies between features and it facilitates the ability to remove or add new features by downstream researchers. This also makes the code for each feature more understandable and reviewable. (Lanvin et al., 2022; Chou and Jiang, 2022) | The implemented framework supports this guideline in the way it interfaces with network analysis tools to generate features in an independent manner where possible. In addition, the built-in framework encourages this by providing standard configuration files that can be used to identify each feature script to run. |
| (8) Apply guidelines outlined for features to labels as well | While labels are significant for model training, during dataset generation time, they can be considered a special case of features. In this way, we want to apply guidelines (4), (5), and (6) to labels as well. (Ferriyan et al., 2021; Lavinia et al., 2020; Cordero et al., 2021; Rajasinghe et al., 2018; Komisarek et al., 2021) | The implemented framework supports this goal by providing the same infrastructure available for feature development to label development. |
| (9) Make source code for labeling distinct from other features | This guideline is recommended to make the labeling criteria used for a dataset clear for collaborating researchers. Because the label features/procedure can inform machine learning model design decisions it is helpful to have it distinctly identifiable. For example, if the labeling criteria is based on a single IP address, it is likely that the IP address features should not be provided to a model. (Lanvin et al., 2022) | The implemented framework supports this guideline by having a separate step of processing for label scripts and by having them contained in a separate directory for a given dataset. |
| (10) Provide a mechanism to receive and implement feedback from researchers to correct issues and improve dataset | This guideline encourages collaboration between NID researchers, allows a dataset to remain current, and provides a feedback loop to dataset researchers to correct any issues found by the research community. (Ring et al., 2019; Lanvin et al., 2022) | The implemented framework supports this guideline through its use of scripting and metadata to describe a dataset such that each dataset can be maintained in an independent source code repository or as part of the default environment. |
4.3 Framework Details
In this section we cover the main ideas of our con-
tainerized environment and implementation of the
guidelines. As an example of the implementation,
we developed a demo dataset which takes a single PCAP file from the UNSW-NB15 dataset (Moustafa and Slay, 2015), duplicates most of the original dataset’s features, and extends them to contain new
features. For brevity, many specifics regarding the
framework’s usage have been omitted. For addi-
tional details we recommend consulting the frame-
work repository.
4.3.1 Container Environment
We provide a containerized environment to support
our implementation in order to improve reproducibil-
ity and eliminate the need to install multiple tools
used by other researchers. Currently, this minimal environment includes the Zeek and Argus network analysis tools as well as Python⁴ and a set of default Python libraries as described in the tool’s repository. It is expected that this would grow in the future; however, we consider this an adequate starting point to demonstrate its usefulness.
The intention of our framework is that the tool and our container would be used together; however, the container environment could be used on its own just to ensure specific versions of tools are easily accessible. Running the container without specifying a command to execute will place the user into a shell prompt with access to the installed tools. The intended method of executing the environment, however, is to map the container’s /niddff path to the directory of the user’s local repository of our tool infrastructure. This allows for the development of a dataset using the framework and container in a variety of ways.
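The mapping described above could be scripted roughly as follows. This is a sketch under stated assumptions rather than the framework's documented invocation: it assumes Docker is the container runtime, and the image name `example/niddff:latest` and the helper `build_run_command` are hypothetical. Only the `/niddff` mount point comes from the text; the framework repository should be consulted for the actual image and launch instructions.

```python
from pathlib import Path


def build_run_command(repo_dir, image, command=()):
    """Build a `docker run` argv that maps a local checkout onto /niddff.

    Assumes Docker as the runtime; `image` is whatever image name the
    user has built or pulled (hypothetical in this example).
    """
    repo = Path(repo_dir).resolve()
    return [
        "docker", "run", "--rm", "-it",
        "-v", f"{repo}:/niddff",  # bind-mount the local repository
        image,
        *command,                 # empty: the image's default command runs
    ]


# Hypothetical paths and image name, for illustration only.
argv = build_run_command("/tmp/niddff", "example/niddff:latest")
print(" ".join(argv))
# To actually launch it: subprocess.run(argv, check=True)
```

Leaving `command` empty corresponds to the case described in the text where no command is specified and the user is placed into a shell prompt.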
⁴ https://www.python.org/
4.3.2 Framework Implementation
Our implementation provides a standard format for
defining and delivering NID datasets using configura-
tion files, naming conventions, and a standard direc-
tory structure. At the core of the implementation we
read in a YAML configuration file customized for a
dataset and use that information to fully process the
dataset from source. The high level algorithm fol-
lowed by the tool can be seen in Algorithm 1.
Algorithm 1: General processing used to generate a NID dataset based on an input configuration file. The input file is processed in a top-down manner with a loop for processing multiple source files prior to combining them together at the end.
Input: config, YAML configuration file
Output: dataset, NID dataset suitable for ML
Read in config

Store documentation information from config
Process setup options

Read in metadata for source data
if download_source == TRUE then
    Download all source PCAP and Netflow files
end if

for each source file do
    Execute feature processing commands
    Execute label processing commands
    Execute post-processing commands
    Save intermediary dataset file
end for

Execute final dataset processing commands
Combine intermediary dataset files

return dataset
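The control flow of Algorithm 1 can be sketched in Python as below. This is an illustration of the processing order only, not the framework's implementation: the configuration is passed as an already-parsed dictionary, every key name (`documentation`, `setup_options`, `sources`, and the stage names) is a hypothetical stand-in, and the processing commands are modeled as plain Python callables over lists of rows rather than as tool invocations.

```python
from typing import Callable, Dict, List

Row = Dict[str, object]
Step = Callable[[List[Row]], List[Row]]

# Hypothetical stage names mirroring the per-source steps of Algorithm 1.
PER_SOURCE_STAGES = ("feature_processing", "label_processing", "post_processing")


def build_dataset(config, load_source, download=None):
    """Process each source through the configured steps, then combine."""
    documentation = config.get("documentation", {})  # stored for reference
    options = config.get("setup_options", {})
    sources = config["sources"]                       # source metadata

    if options.get("download_source") and download is not None:
        for src in sources:
            download(src)

    intermediates = []
    for src in sources:
        rows = load_source(src)
        for stage in PER_SOURCE_STAGES:
            for step in config.get(stage, []):
                rows = step(rows)
        intermediates.append(rows)                    # intermediary dataset

    for step in config.get("final_dataset_processing", []):
        intermediates = [step(rows) for rows in intermediates]

    # Combine the intermediary datasets into the final dataset.
    return [row for rows in intermediates for row in rows]


def add_bytes_total(rows):
    return [{**r, "bytes_total": r["sbytes"] + r["dbytes"]} for r in rows]


def add_label(rows):
    return [{**r, "label": int(r["src"] == "10.0.0.5")} for r in rows]


# Toy in-memory "source files" standing in for PCAP-derived records.
fake_sources = {
    "a.pcap": [{"src": "10.0.0.5", "sbytes": 10, "dbytes": 5}],
    "b.pcap": [{"src": "10.0.0.7", "sbytes": 1, "dbytes": 2}],
}
config = {
    "documentation": {"name": "demo"},
    "setup_options": {"download_source": False},
    "sources": ["a.pcap", "b.pcap"],
    "feature_processing": [add_bytes_total],
    "label_processing": [add_label],
}
dataset = build_dataset(config, load_source=fake_sources.__getitem__)
print(dataset)
```

Keeping feature steps and label steps in separate lists mirrors the separation between feature and label processing that the guidelines call for.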
4.3.3 Dataset Directory Structure
Each NID dataset has its configuration and generation
scripts contained in a dedicated directory. This allows
it to be maintained by the original dataset developers
and then plugged into the framework by consumers
of the dataset. The general structure of a dataset di-
rectory is shown in Figure 2 where one can see the
YAML configuration file, directories for source meta-
data, ground truth metadata, output files, and each
processing step’s files.
For the source and ground truth data, the directory contains metadata files which are in a comma-separated format where each line contains a download URL and the destination file name, which is read in by the framework when acquiring source data. In addition, each processing step can contain simple files with the naming convention load.<tool>, where <tool> is one of the framework’s supported tools such as Zeek or Argus. While the particulars of how each tool behaves vary, each line of these files denotes a single feature or process to run for a given tool. If applicable, an associated script with the same name as the feature it generates is contained in the same directory. In other words, users can easily identify the features being generated by reviewing the load.<tool> files and the scripts that they reference. This promotes having easy to identify source code for each feature as indicated in guideline six, as well as having self-contained features as indicated in guideline seven.
dataset/
    config.yaml
    source/
        pcaps.meta
    ground_truth/
        gt.meta
    output/
    step_acquire_source_data/
        load.argus
        load.python
        load.zeek
    step_feature_processing/
        load.argus
        load.python
        load.zeek
    step_label_processing/
        load.argus
        load.python
        load.zeek
    step_post_processing/
        load.argus
        load.python
        load.zeek
    step_final_dataset_processing/
        load.argus
        load.python
        load.zeek
Figure 2: A default directory structure for a dataset within the proposed framework. Each processing step has its own directory intended to contain loading scripts for supported tools as well as any other scripts used in a given step. It should be noted that these directories are only needed if they are used for a given dataset. For instance, the framework takes care of default processing for several stages but the user has the option of customizing each stage with their own scripts.
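The two file conventions described in this section (comma-separated source metadata and one-entry-per-line load files) can be read with a few lines of Python. The sketch below is illustrative and makes assumptions beyond the text: the helper names are hypothetical, the example URL and feature names are invented, and matching a feature to "a script with the same name" is interpreted here as matching the file name without its extension.

```python
import csv
import io
import tempfile
from pathlib import Path


def read_source_metadata(text):
    """Parse lines of 'download_url,destination_file_name'."""
    return [
        (row[0].strip(), row[1].strip())
        for row in csv.reader(io.StringIO(text))
        if len(row) >= 2
    ]


def read_load_file(text):
    """Return one feature or process name per non-empty line."""
    return [line.strip() for line in text.splitlines() if line.strip()]


def missing_scripts(step_dir, feature_names):
    """List features with no same-named file (ignoring extension) in step_dir."""
    present = {p.stem for p in Path(step_dir).iterdir() if p.is_file()}
    return [name for name in feature_names if name not in present]


# Hypothetical metadata line and feature names, for illustration only.
meta = "https://example.org/capture1.pcap,capture1.pcap\n"
print(read_source_metadata(meta))

with tempfile.TemporaryDirectory() as step_dir:
    Path(step_dir, "load.zeek").write_text("duration\nbytes_total\n")
    Path(step_dir, "duration.zeek").write_text("# feature script placeholder\n")
    names = read_load_file(Path(step_dir, "load.zeek").read_text())
    print(missing_scripts(step_dir, names))  # ['bytes_total']
```

A check like `missing_scripts` is one way a dataset developer could verify before publishing that every listed feature has an identifiable script next to it.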
4.3.4 Dataset Configuration File
Each dataset has a YAML configuration file that
drives its creation. As seen in Listing 1, it contains
documentation and options, and it can contain a mix of
built-in framework commands and custom commands
to execute. For example, the framework takes
information from the setup_options section to
determine which source files to download during the
step_acquire_source_data step. Other built-in commands,
such as run_zeek, have default behavior that requires
little setup on the user’s part in the configuration
file. In general, these commands look in
the current step’s directory and read an associated
load.<tool> file. The framework then uses this file
to generate features, generate labels, or perform
some other intermediate processing. Aside from
commands supported by the framework, users can
also specify custom commands or scripts to execute;
these are processed in the order in which they
appear in the file. Custom commands have
access to a number of built-in variables that
direct particulars such as the paths
of source files to read and where to place output.
The main benefit of this single configuration file is
that it fully describes how the dataset is created
and provides the information users need to access
the code used to generate features and perform
labeling.
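The processing order described above can be sketched in a few lines of Python. This is a hypothetical illustration rather than the framework's code: the configuration is shown as an already-parsed dictionary, section and command names are written with underscores, each step is assumed to parse to an ordered list of commands, and the variable names step_dir and output_dir are invented stand-ins for the framework's built-in variables. The sketch only builds a plan of actions and does not run any tool.

```python
from string import Template

# Built-in commands that read a load.<tool> file (a subset, for illustration).
BUILT_IN_TOOLS = {"run_zeek": "zeek", "run_argus": "argus"}


def plan_step(commands, variables):
    """Turn one step's commands into an ordered list of planned actions."""
    plan = []
    for command in commands:
        if command in BUILT_IN_TOOLS:
            tool = BUILT_IN_TOOLS[command]
            # Built-ins read load.<tool> from the current step's directory.
            load_file = f"{variables['step_dir']}/load.{tool}"
            plan.append(("builtin", command, load_file))
        else:
            # Anything else is treated as a custom command; $name
            # placeholders are filled in from the built-in variables.
            plan.append(("custom", Template(command).safe_substitute(variables)))
    return plan


def plan_dataset(config, base_dir="."):
    """Plan every step_* section in the order it appears in the file."""
    dataset = config["setup_options"]["dataset_name"]
    plan = []
    for key, commands in config.items():
        # Sections holding option mappings instead of command lists
        # are skipped in this sketch.
        if not key.startswith("step_") or not isinstance(commands, list):
            continue
        variables = {
            "step_dir": f"{base_dir}/{key}",
            "output_dir": f"{base_dir}/output/{dataset}",
        }
        plan.extend(plan_step(commands, variables))
    return plan


example_config = {
    "setup_options": {"dataset_name": "demo_dataset"},
    "step_acquire_source_data": {"download": True},
    "step_feature_processing": [
        "run_zeek",
        "run_argus",
        "python3 $step_dir/extra.py --out $output_dir",
    ],
}
example_plan = plan_dataset(example_config)
for action in example_plan:
    print(action)
```

Keeping the plan separate from execution is a design choice of the sketch: it lets a reader see exactly which load.<tool> files and custom commands a configuration implies before anything is run.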
4.3.5 Benefits for NID Dataset Developers
The framework implementation provides several ben-
efits for NID dataset developers. First, it provides
enough flexibility such that there are varying degrees
of buy-in for using the framework. For instance, sup-
pose a NID dataset researcher only provides source
files and ground truth data or has a previously gener-
ated dataset that they would like to incorporate into
the framework with little effort. This can be achieved
by generating the source file metadata files and
ground truth metadata files. This requires minimal
effort from the NID dataset researcher, yet it makes
the files more accessible to downstream consumers.
On the other end
of the spectrum, the container environment provides
tools for analyzing source data which can be taken
advantage of by NID dataset researchers. This use of
the container allows downstream researchers to use
the same versions of the software when working with
the dataset.
Another benefit for NID dataset researchers is
that the framework implementation provides an or-
ganized structure to follow and self-documents how
the dataset features and labels were generated from
source data. When updating or expanding the dataset,
the change history of the configuration files within
the framework can be inspected to track the changes,
provided there are no updates to the source data.
Additionally, end users can provide improvements or
feedback to the NID dataset researcher through
lightweight updates to these configuration
files.
The framework is intended to impose no significant
additional work on NID dataset developers, since
all the steps it encapsulates must already
be performed to generate a given dataset. The
framework and guidelines simply organize these
steps in a standardized manner.
documentation:
  niddff: niddff/niddff:0.1
setup_options:
  dataset_name: demo_dataset
  source_data: unsw_nb15
  ground_truth_data: unsw_nb15
  clean_output_directory: True
  expected_outputs:
    unsw_nb15_dataset.csv
  argus:
    clean: True
    arguments: S 60 m
    execute_ra: True
step_acquire_source_data:
  download: True
step_feature_processing:
  run_zeek
  run_argus
  run_python_scripts
step_label_processing:
  run_python_scripts
step_post_processing:
  run_combine_features
step_final_dataset_processing:
  run_combine_data
Listing 1: A sample input file consumed by our framework
specifying where to obtain source data and how to process
it to produce a final dataset. Options can be overridden on
the command line if necessary.
4.3.6 Benefits for NID Dataset Consumers
This framework also provides benefits for down-
stream researchers using NID datasets. For re-
searchers looking to simply use the original dataset
as provided, there are generally no changes in workflow
imposed by the framework, though they would
SECRYPT 2023 - 20th International Conference on Security and Cryptography
Figure 3: A diff comparison of extracted Argus features
from the first PCAP of the UNSW-NB15 dataset. On the
top, the left hand side of the diff shows a portion of the
original Argus features from the original dataset while the
right shows the same section of the output but generated by
running Argus with no command line options on the source
PCAP. On the bottom, the left hand side of the diff shows
the same portion of the original Argus features from the
original dataset while the right now shows the same section
of the output generated by running Argus with the S 60
option. The differences on the top demonstrate the necessity
of having the exact command line options used to generate
dataset features in order to make a dataset reproducible.
be able to easily obtain the dataset using the down-
load metadata. For researchers seeking to analyze a
dataset, the container environment and configuration
files approach provides a way for them to reproduce
the dataset reliably since all the tools and the com-
mand line options used to run them are contained
within the scripts. As an example of this benefit,
we look at the implementation of our demo dataset,
which uses a single PCAP from the UNSW-NB15
dataset (Moustafa and Slay, 2015). As depicted in
Figure 3, running Argus without the particular set of
options used for the original dataset yields an additional
2,529 rows compared to what the original dataset
authors intended. We found this experimentally during our
research, and it shows the value of removing ambiguity
by making the full commands
readily available to researchers. Similar benefits are
gained by having the full labeling criteria laid out in
the dataset configuration files.
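A check of this kind can be automated. The sketch below is a hypothetical Python helper, not part of the framework, that compares a regenerated feature file against the originally published one and reports how many rows are extra or missing. It assumes both outputs have already been exported to CSV with a header row; the function names are our own.

```python
import csv
from collections import Counter


def load_rows(path):
    """Read a CSV file into a multiset of rows, excluding the header."""
    with open(path, newline="") as handle:
        reader = csv.reader(handle)
        next(reader, None)  # skip the header row
        return Counter(tuple(row) for row in reader)


def compare_feature_files(original_path, regenerated_path):
    """Summarize how a regenerated feature file differs from the original."""
    original = load_rows(original_path)
    regenerated = load_rows(regenerated_path)
    return {
        "original_rows": sum(original.values()),
        "regenerated_rows": sum(regenerated.values()),
        # Rows produced by the rerun that the original does not contain.
        "extra_rows": sum((regenerated - original).values()),
        # Rows in the original that the rerun failed to reproduce.
        "missing_rows": sum((original - regenerated).values()),
    }
```

A reproduction that used the same tool version and command line options as the original authors would be expected to report zero extra and zero missing rows.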
An additional benefit comes in the form of being
able to generate a standard feature set from any source
data. If some standard feature set is not included
by the original dataset authors, a researcher can eas-
ily adapt the original dataset with a standard feature
set in order to facilitate comparisons across multiple
datasets. By following the guidelines and using the
framework, the scripts to produce such a feature set
become plug-n-play for any dataset that uses the same
source format.
This plug-n-play benefit of the containerized
environment and framework also extends to individual
features of a dataset. As an example, consider
two researchers who use the same
container version and source dataset but perform
different feature engineering. The container
and framework allow them to exchange their feature
scripts, or just the resulting data for individual
features, and simply merge the results into their work. As
outlined in Section 5, this ability provides additional
benefits if the environment is expanded to include a
server-based component.
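As a rough illustration of exchanging only the resulting data, the Python sketch below merges feature columns received from another researcher into a local feature file. It is not taken from the framework. It assumes both files are CSV and share an identifying column, here given the hypothetical name flow_id, which is plausible when both researchers generate features from the same source data with the same container version.

```python
import csv


def read_table(path):
    """Return (column names, rows as dictionaries) for a CSV file."""
    with open(path, newline="") as handle:
        reader = csv.DictReader(handle)
        return list(reader.fieldnames or []), list(reader)


def merge_features(base_path, shared_path, key, out_path):
    """Append the new columns of shared_path to base_path, matched on key.

    Rows without a match in the shared file get empty values for the
    new columns. Returns the column names of the merged file.
    """
    base_cols, base_rows = read_table(base_path)
    shared_cols, shared_rows = read_table(shared_path)
    new_cols = [col for col in shared_cols if col not in base_cols]
    shared_by_key = {row[key]: row for row in shared_rows}

    with open(out_path, "w", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=base_cols + new_cols)
        writer.writeheader()
        for row in base_rows:
            match = shared_by_key.get(row[key], {})
            merged = dict(row)
            for col in new_cols:
                merged[col] = match.get(col, "")
            writer.writerow(merged)
    return base_cols + new_cols
```

For example, merge_features("mine.csv", "theirs.csv", "flow_id", "merged.csv") would keep every local row and add any columns that appear only in the shared file.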
5 CONCLUSION AND FUTURE WORK
In this work we propose a set of ten guidelines that
will improve the handoff of NID datasets between re-
searchers. The focus of these guidelines is to improve
ease of access, reproducibility from source data, veri-
fication, and the extension of datasets. We believe that
considering these areas while generating new datasets
will benefit both dataset developers and downstream
researchers using the datasets. The provided frame-
work demonstrates these goals and their associated
benefits.
While these guidelines are a step forward
in this area of research, they do not eliminate
all the complexities faced by researchers who
want to extend NID datasets. In future work we
aim to remove many of these additional complexities
by including server-based methods to facilitate these
guidelines. With the availability of a server environ-
ment, researchers could either use the container en-
vironment locally or interact with the server to per-
form scripting while leveraging the same container
environment in both contexts. In this approach, the
server could store source data locally, eliminating the
need to download anything but a final feature set.
Additionally, if other researchers had already created a
feature on the server, the scripting and data could
be reused without the need to regenerate
anything. This future work would be able to leverage
the framework developed here, which makes the framework
a significant step towards further efficiency gains.
ACKNOWLEDGMENTS
We would like to acknowledge Professor Randy Paf-
fenroth from Worcester Polytechnic Institute for his
valuable insights and guidance which helped shape
this work.
REFERENCES
(1999). KDD Cup 99. http://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html.
Accessed: 2022-08-08.
Acosta, J. C., Medina, S., Ellis, J., Clarke, L., Rivas, V., and
Newcomb, A. (2021). Network data curation toolkit:
Cybersecurity data collection, aided-labeling, and rule
generation. In MILCOM 2021 - 2021 IEEE Military
Communications Conference (MILCOM), pages 849–
854.
Brauckhoff, D., Wagner, A., and May, M. (2008). Flame: A
flow-level anomaly modeling engine. In CSET.
Cermak, M., Jirsik, T., Velan, P., Komarkova, J., Spacek, S.,
Drasar, M., and Plesnik, T. (2018). Towards provable
network traffic measurement and analysis via semi-
labeled trace datasets. In 2018 Network Traffic Mea-
surement and Analysis Conference (TMA), pages 1–8.
Chou, D. and Jiang, M. (2022). A survey on data-driven net-
work intrusion detection. ACM Computing Surveys,
54(9):1–36.
Cordero, C. G., Vasilomanolakis, E., Milanov, N., Koch,
C., Hausheer, D., and Mühlhäuser, M. (2015). ID2T: A
DIY dataset creation toolkit for intrusion detection sys-
tems. In 2015 IEEE Conference on Communications
and Network Security (CNS), pages 739–740.
Cordero, C. G., Vasilomanolakis, E., Wainakh, A.,
Mühlhäuser, M., and Nadjm-Tehrani, S. (2021). On
generating network traffic datasets with synthetic at-
tacks for intrusion detection. ACM Trans. Priv. Secur.,
24(2).
Engelen, G., Rimmer, V., and Joosen, W. (2021). Trou-
bleshooting an intrusion detection dataset: the ci-
cids2017 case study. In 2021 IEEE Security and Pri-
vacy Workshops (SPW), pages 7–12.
Ferriyan, A., Thamrin, A. H., Takeda, K., and Murai,
J. (2021). Generating network intrusion detection
dataset based on real and encrypted synthetic attack
traffic. Applied Sciences, 11(17).
Kenyon, A., Deka, L., and Elizondo, D. (2020). Are pub-
lic intrusion datasets fit for purpose characterising the
state of the art in intrusion event datasets. Computers
& Security, 99:102022.
Komisarek, M., Pawlicki, M., Kozik, R., Hołubowicz, W.,
and Choraś, M. (2021). How to effectively collect and
process network data for intrusion detection? Entropy,
23(11).
Lanvin, M., Gimenez, P.-F., Han, Y., Majorczyk, F., Mé,
L., and Totel, E. (2022). Errors in the CICIDS2017
dataset and the significant differences in detection per-
formances it makes. In CRiSIS 2022 - International
Conference on Risks and Security of Internet and Sys-
tems, pages 1–16, Sousse, Tunisia.
Lavinia, Y., Durairajan, R., Rejaie, R., and Willinger, W.
(2020). Challenges in using ml for networking re-
search: How to label if you must. In Proceedings
of the Workshop on Network Meets AI & ML, NetAI
’20, page 21–27, New York, NY, USA. Association
for Computing Machinery.
Layeghy, S., Gallagher, M., and Portmann, M. (2021).
Benchmarking the benchmark analysis of synthetic
nids datasets.
Moustafa, N., Hu, J., and Slay, J. (2019). A holistic re-
view of network anomaly detection systems: A com-
prehensive survey. Journal of Network and Computer
Applications, 128:33–55.
Moustafa, N. and Slay, J. (2015). Unsw-nb15: a compre-
hensive data set for network intrusion detection sys-
tems (unsw-nb15 network data set). In 2015 Mili-
tary Communications and Information Systems Con-
ference (MilCIS), pages 1–6.
Rajasinghe, N., Samarabandu, J., and Wang, X. (2018).
Insecs-dcs: A highly customizable network intrusion
dataset creation framework. In 2018 IEEE Canadian
Conference on Electrical & Computer Engineering
(CCECE), pages 1–4.
Ring, M., Wunderlich, S., Scheuring, D., Landes, D., and
Hotho, A. (2019). A survey of network-based in-
trusion detection data sets. Computers & Security,
86:147–167.
Sarhan, M., Layeghy, S., Moustafa, N., and Portmann,
M. (2020). Netflow datasets for machine learning-
based network intrusion detection systems. In Big
Data Technologies and Applications, pages 117–135.
Springer.
Sarhan, M., Layeghy, S., Moustafa, N., and Portmann, M.
(2021a). A cyber threat intelligence sharing scheme
based on federated learning for network intrusion de-
tection.
Sarhan, M., Layeghy, S., and Portmann, M. (2021b). Eval-
uating standard feature sets towards increased gener-
alisability and explainability of ml-based network in-
trusion detection.
Sarhan, M., Layeghy, S., and Portmann, M. (2021c). To-
wards a standard feature set for network intrusion de-
tection system datasets. Mobile Networks and Appli-
cations, 27(1):357–370.
Sharafaldin, I., Gharib, A., Lashkari, A. H., and Ghor-
bani, A. A. (2018a). Towards a reliable intrusion
detection benchmark dataset. Software Networking,
2018(1):177–200.
Sharafaldin, I., Lashkari, A. H., and Ghorbani, A. A.
(2018b). Toward generating a new intrusion detection
dataset and intrusion traffic characterization. ICISSP,
1:108–116.
Tavallaee, M., Bagheri, E., Lu, W., and Ghorbani, A. A.
(2009). A detailed analysis of the KDD CUP 99 data
set. In 2009 IEEE Symposium on Computational Intel-
ligence for Security and Defense Applications, pages
1–6. IEEE.
Wolsing, K., Wagner, E., Saillard, A., and Henze, M.
(2021). Ipal: Breaking up silos of protocol-dependent
and domain-specific industrial intrusion detection sys-
tems.
Yang, Z., Liu, X., Li, T., Wu, D., Wang, J., Zhao, Y.,
and Han, H. (2022). A systematic literature review
of methods and datasets for anomaly-based network
intrusion detection. Computers & Security, page
102675.