necessary criteria with common up-to-date attacks such
as DoS, DDoS, Brute Force, XSS, SQL Injection, Infiltration,
Port Scan, and Botnet. The dataset is completely
labelled, and more than 80 network traffic features are
extracted and calculated for all benign and intrusive flows
using the CICFlowMeter software, which is publicly available
on the Canadian Institute for Cybersecurity website
(Habibi Lashkari et al., 2017). Secondly, the paper analyzes
the generated dataset to select the best feature sets for
detecting each attack, and evaluates the dataset by executing
seven common machine learning algorithms.
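To make this evaluation step concrete, the following is a minimal sketch of how a labelled flow-feature dataset of this kind could be evaluated with one common classifier in scikit-learn. It is an illustration under assumptions rather than the authors' exact pipeline: the file name flows.csv and the Label column are hypothetical stand-ins for a CICFlowMeter CSV export, and Random Forest stands in for any of the seven algorithms.

    # Minimal sketch (not the authors' exact pipeline): train and evaluate one
    # common classifier on flow features exported to CSV. The file name
    # "flows.csv" and the "Label" column are assumptions for illustration.
    import pandas as pd
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import classification_report

    df = pd.read_csv("flows.csv")                       # one row per flow: ~80 features + label
    X = df.drop(columns=["Label"]).select_dtypes(include="number").fillna(0)
    y = df["Label"]                                     # "BENIGN" or an attack name

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, stratify=y, random_state=42)

    clf = RandomForestClassifier(n_estimators=100, random_state=42)
    clf.fit(X_train, y_train)
    print(classification_report(y_test, clf.predict(X_test)))

In a full evaluation, feature selection per attack category (Section 5) would precede this training step.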
The rest of the paper is organized as follows. Section 2
presents an overview of the datasets available between
1998 and 2016. Section 3 discusses the designed network
topology and the attack scenarios. Section 4 presents the
generated dataset along with an explanation of its eleven
characteristics. Finally, the feature selection and machine
learning analysis are discussed in Section 5.
2 AVAILABLE DATASETS
In this section, we analyze and evaluate eleven publicly
available IDS datasets released since 1998 to demonstrate
their shortcomings and issues, which reflect the real need
for a comprehensive and reliable dataset.
DARPA (Lincoln Laboratory 1998-99): The dataset
was constructed for network security analysis and ex-
posed the issues associated with the artificial injection
of attacks and benign traffic. This dataset includes e-
mail, browsing, FTP, Telnet, IRC, and SNMP activi-
ties. It contains attacks such as DoS, Guess password,
Buffer overflow, remote FTP, Syn flood, Nmap, and
Rootkit. This dataset does not represent real-world
network traffic, and contains irregularities such as the
absence of false positives. Also, the dataset is out-
dated for the effective evaluation of IDSs on modern
networks, both in terms of attack types and network
infrastructure. Moreover, it lacks actual attack data
records (McHugh, 2000) (Brown et al., 2009).
KDD’99 (University of California, Irvine 1998-99):
This dataset is an updated version of DARPA98, created
by processing its tcpdump portion. It contains different
attacks such as Neptune-DoS, pod-DoS, Smurf-DoS, and
buffer-overflow (University of California, 2007). The
benign and attack traffic are merged together in a simulated
environment. This dataset has a large number of redundant
records and is riddled with data corruption that leads to
skewed testing results (Tavallaee et al., 2009). NSL-KDD
was created from KDD'99 (Tavallaee et al., 2009) to address
some of its shortcomings (McHugh, 2000).
DEFCON (The Shmoo Group, 2000-2002): The
DEFCON-8 dataset, created in 2000, contains port
scanning and buffer overflow attacks, whereas the
DEFCON-10 dataset, created in 2002, contains port scans
and sweeps, bad packets, administrative privilege, and FTP
by Telnet protocol attacks. The traffic in this dataset,
produced during the "Capture the Flag (CTF)" competition,
differs from real-world network traffic since it mainly
consists of intrusive traffic as opposed to normal background
traffic. This dataset is mostly used to evaluate alert
correlation techniques (Nehinbe, 2010) (The Shmoo Group, 2000).
CAIDA (Center for Applied Internet Data Analysis
2002-2016): This organization maintains three different
datasets: the CAIDA OC48, which includes different types
of data observed on an OC48 link in San Jose; the CAIDA
DDoS, which includes one hour of DDoS attack traffic split
into 5-minute pcap files; and the CAIDA Internet Traces
2016, which contains passive traffic traces from CAIDA's
Equinix-Chicago monitor on a high-speed Internet backbone.
Most of CAIDA's datasets are very specific to particular
events or attacks and are anonymized, with their payload,
protocol information, and destination removed. They are not
effective benchmarking datasets due to a number of
shortcomings; see (Center for Applied Internet Data Analysis
(CAIDA), 2002; 2007; 2016) (Proebstel, 2008) (Ali Shiravi
and Ghorbani, 2012) for details.
LBNL (Lawrence Berkeley National Laboratory and
ICSI 2004-2005): This dataset consists of full-header
network traffic recorded at a medium-sized site. It does not
include payloads and suffers from heavy anonymization
that removes any information which could identify an
individual IP address (Nechaev et al., 2004).
CDX (United States Military Academy 2009): This
dataset shows that network warfare competitions can be
utilized to generate a modern-day labelled dataset. It
includes network traffic such as Web, e-mail, DNS lookups,
and other required services. The attackers used attack tools
such as Nikto, Nessus, and WebScarab to carry out
reconnaissance and attacks automatically. This dataset can
be used to test IDS alert rules, but it suffers from a lack of
traffic diversity and volume (Sangster et al., 2009).
Kyoto (Kyoto University 2009): This dataset was created
through honeypots, so no process of manual labelling and
anonymization is required, but it has a limited view of the
network traffic because only attacks directed at the
honeypots can be observed. Compared with previously
available datasets, it has ten additional features, such as
IDS Detection, Malware Detection, and Ashula Detection,
which are useful in NIDS analysis.