Data Sets for Cyber Security Machine Learning Models:

A Methodological Approach

Innocent Mbona

and Jan H. P. Eloff

Department of Computer Science, University of Pretoria, Lynwood Road, Pretoria, South Africa

Keywords: Cyber Security, Real-World Data, Synthetic Data, Zero-Day Attack, Network Intrusion Detection Systems.

Abstract: Discovering Cyber security threats is becoming increasingly complex, if not impossible! Recent advances in

artificial intelligence (AI) can be leveraged for the intelligent discovery of Cyber security threats. AI and

machine learning (ML) models depend on the availability of relevant data. ML based Cyber security solutions

should be trained and tested on real-world attack data so that solutions produce trusted results. The problem

is that most organisations do not have access to useable, relevant, and reliable real-world data. This problem

is exacerbated when training ML models used to discover novel attacks, such as zero-day attacks.

Furthermore, the availability of Cyber security data sets is negatively affected by privacy laws and regulations.

The solution proposed in this paper is a methodological approach that guides organisations in developing

Cyber security ML solutions, called CySecML. CySecML provides guidance for obtaining or generating

synthetic data, checking data quality, and identifying features that optimise ML models. Network Intrusion

Detection Systems (NIDS) were employed to illustrate the convergence of Cyber security and AI concepts.

1 INTRODUCTION

The convergence of artificial intelligence (AI) and

Cyber security is high in the research agenda of AI

and Cyber security experts. In general, the state of

Cyber security could benefit from advances made in

AI. Consider for example the discovery of insider

threats, such as disgruntled employees, who may

damage the reputation of an organisation. Insider

threats are difficult, if not impossible to discover

using techniques such as monitoring and auditing.

Alternatively, if you have a well-trained and tested

machine learning (ML) platform, the intelligent

discovery of insider threats becomes a reality!

However, progress made in the successful

convergence of Cyber security and AI is hampered by

the unavailability of relevant and sufficient volumes

of real-world attack data sets. Furthermore, most

organisations do not have access to useable, relevant,

and reliable real-world attack data, such as network

traffic data, to unlock the benefits of AI. In addition,

the availability of quality data sets that can be

efficiently used to train ML algorithms is negatively

https://orcid.org/0000-0001-9240-8035

https://orcid.org/0000-0003-4683-2198

impacted by privacy laws and regulations (Lu et al.,

2023).

This shortage of data sets that can be used on

Cyber security ML platforms is now demanding that

researchers, Cyber security professionals, and

organisations search for alternative data sets. The

Financial Times (Financial Times, 2023) reported

that AI companies are considering innovative ways to

address the shortage of data sets. Furthermore, it was

suggested that companies should consider generating

useable, relevant, and reliable data themselves.

However, it should be noted that generating your own

data for ML environments are likely to increase the

risk exposure of an organisation in the sense that the

data can be biased towards some goal, and therefore

miss real-world attack scenarios (Figueira & Vaz,

2022; Lu et al., 2023).

Existing literature lack a structured approach that

can guide organisations and Cyber security

professionals to ensure that important aspects such as

using quality data and identifying significant features

are adequately implemented, particularly for

discovering complex scenarios such as zero-day

attacks. This paper optimises existing techniques that

Mbona, I. and Eloff, J.

Data Sets for Cyber Security Machine Learning Models: A Methodological Approach.

DOI: 10.5220/0012598400003705

Paper published under CC license (CC BY-NC-ND 4.0)

In Proceedings of the 9th International Conference on Internet of Things, Big Data and Security (IoTBDS 2024), pages 149-156

ISBN: 978-989-758-699-6; ISSN: 2184-4976

149

include high data quality checks, feature selection and

ML models to address this problem.

The remainder of this paper address the following

research questions:

(i) Given the scarcity of real-world attack

data, how to create or obtain Cyber

security data sets for ML?

(ii) What data quality aspects should be

considered for Cyber security data sets

used in ML?

(iii) What are the important features to

monitor on Cyber security data sets to

effectively discover complex attacks

such as zero-day attacks?

2 LITERATURE REVIEW

The literature overview and related work section is

structured as follows:

• Obtaining Cyber security data sets for ML.

• Ensuring the quality of Cyber security data

sets.

• Performing feature selection on Cyber

security data sets.

2.1 Data Collection: Cyber Security

Data for Machine Learning

Obtaining Cyber security data for ML refers to the

methods used for collecting data to be used in the

experiments of a ML model (Abdallah & Webb,

2017; Kilincer et al., 2021). ML models rely on input

data used in the training and testing stages when

designing a ML model (Bonaccorso, 2020; Van Der

Walt & Eloff, 2018). It is important for a ML model

be trained on a data set that is representative of the

problem at hand (Amr, 2019; Bonaccorso, 2020), so

that once deployed, it can produce reliable results.

Specifically, if a ML model is trained on a flawed data

set, then a ML model is prone to producing unreliable

results (Brownlee, 2020).

ML based Cyber security solutions should be

trained on real-world attack data (Kilincer et al.,

2021). However, there are multiple shortcomings

regarding obtaining real-world data for ML based

Cyber security solutions: (i) lack of large real-world

attack data, (ii) sensitive nature of the data, and (iii)

security, confidentiality and privacy concerns

(Bowles et al., 2020; Lu et al., 2023). According to

Singh et al. (Singh et al., 2019) privacy and security

concerns are the main factors that cause organisations

to not publicly share their real-world attack data.

Anonymisation, which is a method of removing

identity related or sensitive information from a data

set, is a technique that can be considered to overcome

privacy and security challenge (Dankar & Ibrahim,

2021); however, it does not address the data size

challenge for the purposes of training a ML model

(Bowles et al., 2020; Kumar & Sinha, 2023).

To overcome these challenges, researchers have

proposed using synthetic attacks (Figueira & Vaz,

2022; Lu et al., 2023). Synthetic attacks are generated

in a controlled environment such as a laboratory,

usually based on a sample of real-world attacks using

data generating tools (Bowles et al., 2020; Dankar &

Ibrahim, 2021). An example of such a tool is

CICFlowMeter (Kaushik & Dave, 2021; Ullah &

Mahmoud, 2020) that generates network traffic

flows.

Synthetic data should depict the true

characteristics of real-world data and be privacy-

preserving (Dankar & Ibrahim, 2021). Specifically

for this paper, the question is, how to obtain or create

Cyber security data representative of novel zero-day

attacks? The problem is that existing literature lack a

standardised approach of training and testing ML

Cyber security solutions for discovering zero-day

attacks, while ensuring that the data sets used are of

high-quality. For instance, Zoppi et al. (Zoppi et al.,

2021) and Pu et al. (Pu et al., 2021) proposed creating

zero-day attack scenarios by using a particular set of

known (synthetic) attacks in the training phase and

use a different set of known attacks in the testing

phase of a ML model. Their argument is that an attack

that appears only in the testing phase, but not in the

training phase of a ML model can be treated as a zero-

day attack (Pu et al., 2021; Zoppi et al., 2021).

However, with this approach it is not clear how best

to choose attacks to use in the training and testing

stages to optimise a ML model. This problem of

obtaining or creating data for zero-day attacks is

addressed in Section 3.1.

2.2 Assess the Data: Checking Cyber

Security Data Quality

There are vast descriptions of data quality in the

literature across different fields of study and

industries. The characteristics to be considered are

mainly informed by the requirements of an application

system that takes the data as input. According to

Mahanti (Rupa Mahanti, 2019) the most common

characteristics of data quality include accuracy,

reliability, legitimacy, consistency, relevance,

flexibility, currency, completeness, availability,

uniqueness, and timeliness amongst others. These

characteristics of data quality are not explicitly

IoTBDS 2024 - 9th International Conference on Internet of Things, Big Data and Security

150

defined for a ML or AI problem. However,

characteristics such as accuracy and relevance are of

prime importance for ML problem solving in

particular for Cyber security solutions. CySecML

methodology aims to, amongst other things, provide

guidelines of data quality checks to be considered for

ML based Cyber security solutions. Cai et al. (Cai &

Zhu, 2015) presented a framework of data quality

characteristics that consists of: availability, usability,

reliability, relevance, and presentation quality.

Although all the data quality frameworks

mentioned in the previous few paragraphs are equally

applicable to the different application domains, the

research at hand adopted the Cai et al. (Cai & Zhu,

2015) framework. The main reason for using Cai et

al. (Cai & Zhu, 2015) is for simplicity and Cyber

security application domain needs.

2.3 Feature Selection: Feature

Selection on Cyber Security Data

Various attributes and features describe activities in

various application domains, such as NIDS

(Aljawarneh et al., 2018; Kurniabudi et al., 2020;

Moustafa & Slay, 2017). The information contained

in attributes and features can be used for analysis,

including input variables into ML models (García et

al., 2015; Khalid et al., 2014; Mukkamala & Sung,

2003). However, not all features are useful or

significant in the design of ML models.

Given the large number of attributes and features

to be potentially found in Cyber security data sets,

such as NIDS, the question is which features should

be monitored to discover malicious activities, in

particular complex malicious attacks such as zero-day

attacks? A feature or attribute is considered

significant if it can differentiate two or more data

classes (Mukkamala & Sung, 2003).

A plethora of feature selection methods exists, see

(Mbona & Eloff, 2022b; Zheng & Casari, 2018). For

the research at hand Benford’s law (BL) is presented

as a feature selection method to assist ML models used

in discovering Cyber security threats, see (Mbona &

Eloff, 2022a, 2022b) for further details. BL is

proposed as a feature selection method within the

CySecML methodology given its unique properties

for identifying anomalous behaviours in Cyber

security data sets (Mbona & Eloff, 2022a, 2022b).

3 CySecML METHODOLOGY

The overall aim of the CySecML methodology is to

assist organisations and researchers to follow a step

wise process for developing ML based solutions for

Cyber security related problems. For the purposes of

this paper, the CySecML methodology consists of

two components: data preparation and

InternetBotDetector (IBD). The data preparation

component consists of data collection or data

preparation, data quality checks, data cleaning,

feature selection and significant features. The IBD

model is based on ML models that assist in the

automation of discovering cyber attacks.

Figure1: Overview of the CySecML methodology.

The Cyber security data set that is used for illustrative

purposes to highlight and explain the application of

the CySecML methodology is the IoT intrusion-2020

(Ullah & Mahmoud, 2020) data set. A brief

description of the IoT intrusion-2020 data set follows:

Ullah et al. (Ullah & Mahmoud, 2020) presented

a network intrusion data set, called IoT Intrusion-

2020, based on synthetic attacks. This data set was

developed in 2020 based on various synthetic botnet

attacks that include flooding attacks amongst others,

and benign network traffic. In addition, the authors of

this data set used the CICFlowMeter tool (Kaushik &

Dave, 2021; Ullah & Mahmoud, 2020) to generate

network traffic data.

3.1 Data Creation: Cyber Security

Data for Zero-Day Attacks

The CySecML methodology includes a specific step

for the process of Cyber security data collection or

data creation (see Figure 1). To illustrate concepts

relating to Cyber security data collection the

discussion that follows focuses on collecting network

traffic and network intrusion data. Cyber security data

collection refers to obtaining relevant data for running

ML models.

Data Sets for Cyber Security Machine Learning Models: A Methodological Approach

151

It is well known that real-world Cyber security

data sets are most of the time not available due to

confidentiality and privacy reasons. Most

organisations and researchers are using synthetic data

that is based on real-world data. For the purposes of

this paper and in particular to develop ML models that

are able to discover zero-day network intrusion

attacks, the following publicly available Cyber

security data sets were considered: the IoT intrusion-

2020 (Ullah & Mahmoud, 2020), UNSW-NB15

(Moustafa & Slay, 2017), CICDDOS2019

(Sharafaldin et al., 2019), and

CIRACICDOHBRW2020 (Montazerishatoori et al.,

2020). In general, and particularly referring to the

network intrusion data sets, the attacks considered in

these data sets were synthetically generated.

However, they do not contain zero-day (i.e.,

unknown) attacks. Specifically, for the purposes of

this paper, a known attack is an attack that is found in

a particular data set with its label. Whereas an

unknown attack is a type of attack not part of a

particular data set (Mbona & Eloff, 2022a).

Now the question is, how do we collect, create,

obtain or fabricate data that is representative of

unknown attacks? The rule of thumb in creating a

zero-day attack type from known attacks is to not use

all known type of attacks in the training and testing

stages of a ML model (Ahmad et al., 2023; Zavrak &

Iskefiyeli, 2020; Zoppi et al., 2021). Using all

available network intrusion type of attacks in both the

training and testing phase of a ML model disregards

the presence of zero-day attacks (Ahmad et al., 2023),

thus this approach is not preferred. According to

Ahmad et al. (Ahmad et al., 2023), the two most

common approaches for creating a zero-day attack

type from known attacks are:

• Approach 1: randomly use different type of attacks

in the training and testing stages of a ML model.

• Approach 2: combine all known attacks from a data

set and create a new zero-day attack type that is based

on known type of attacks.

Approach 1

The first approach for creating a zero-day type of

attack for experimental purposes is to train and test a

ML model using different type of attacks that are

known (Zavrak & Iskefiyeli, 2020; Zoppi et al.,

2021). As far as the authors of this paper are

concerned, there is no scientific evidence in the

literature on choosing the “optimal” combination of

type of attacks to use as a zero-day attack type. In

addition, consider the IoT Intrusion-2020 data set for

example that contains eight malicious network traffic

types, one can have a minimum of: 8!/(8-2)! = 28,

possible combinations of zero-day attack types.

Therefore, the authors of this research do not consider

this approach feasible.

Approach 2

The second approach is to create a new unknown

attack type (i.e., zero-day) by combining known

attack types of a particular data set (Ahmad et al.,

2023; Mbona & Eloff, 2022a). In this way, a zero-day

attack type is created from known type of attacks and

this approach is referred to as creating known

unknowns (Ahmad et al., 2023; Kim et al., 2020;

Scheirer et al., 2014). The premise of this approach is

that a network traffic data set is based on the same

description of features and structure, thus, one can

combine different known type of attacks across the

same features. “Approach 2” is depicted in Figure 2.

Figure 2: Approach 2.

To illustrate, Approach 2 was used to expand the IoT

intrusion-2020 data set. In this data set, the benign

network traffic is labelled as “normal”, and malicious

network traffic is labelled as “anomaly” (Ullah &

Mahmoud, 2020). For the purposes of this paper, the

authors created a new attack (i.e., zero-day) by

combining anomaly network traffic. Specifically, a

zero-day (i.e., anomaly) attack is created by

combining SYN Flooding + UDP Flooding + HTTP

Flooding + ACK Flooding + Host Brute force + ARP

Spoofing + Scan host port + Scan port OS across the

same features.

IoTBDS 2024 - 9th International Conference on Internet of Things, Big Data and Security

152

3.2 Clean the Data: Quality of Cyber

Security Data for Machine

Learning

The CySecML methodology include the process of

ensuring that a Cyber security data set is of sufficient

quality for an ML problem as part of the data

preparation component (see Figure 1). The IoT

intrusion-2020 (Ullah & Mahmoud, 2020) data set is

examined following the requirements for data quality

as reported on by Cai et al. (Cai & Zhu, 2015):

• Availability: characterise the extent to which

data is easily made available and accessible to

individuals who have the right level of access to

obtain data. The IoT intrusion-2020 data set comply

as follows to this criterium: as this is a publicly

available data set, and easily accessible on the

internet.

• Usability: refers to the extent in which data

meets the requirements of a problem at hand. The IoT

intrusion-2020 data set comply as follows to this

criterium: as this data set contains benign network

traffic and a variety of malicious network traffic with

many instances.

• Reliability: once the question of usability has

been answered, the subsequent question is whether

one can trust the data in terms of accuracy and

completeness amongst other factors. The IoT

intrusion-2020 data set comply as follows to this

criterium: given that the cyber attacks such as

flooding attacks considered in this data set are real-

world cyber attacks.

• Relevance: relevance of data is closely linked

with usability of data with the exception that,

relevance asks the question as to why this data is

needed in the first place? The IoT intrusion-2020 data

set comply as follows to this criterium: as this data set

contains more than 80 features that include the

number of packets and flow duration of benign and

malicious network traffic.

• Presentation quality: refers to the manner in

which data is presented in terms of description and

structure amongst other factors. The IoT intrusion-

2020 data set comply as follows to this criterium: this

data set has clearly labelled features and network

traffic types.

3.3 Feature Selection on Cyber

Security Data for Machine

Learning

The CySecML methodology include a step “Feature

selection”, see Figure 1, to identify significant features

that can assist Cyber security ML models. For the case

at hand, the aim is to identify significant features that

are indicative of anomalous behaviour between benign

and malicious zero-day network traffic types. The IoT

Intrusion-2020 data set consists of benign and eight

malicious network traffic type of attacks.

As part of CySecML with specific reference to the

“Feature selection” step, Benford’s law (BL) is

presented as a feature selection method. Experiments

done by the authors of this paper and previously

published work (Mbona & Eloff, 2022a) highlighted

the benefits of employing BL as a feature selection

method to overcome the challenges relating to

analysing large volumes of network traffic data sets.

4 CySecML METHODOLOGY:

PROOF OF CONCEPT

This section presents implementation results of the

CySecML methodology (see Figure 1). In the “Data

collection or data creation” step, the IoT Intrusion-

2020 data set was not only identified as a quality ML

data set but it was also expanded upon to demonstrate

how an existing data set can be made relevant and

useful for Cyber security related ML models. The

“High data quality check” step identified the data

quality framework of Cai et al. (Cai & Zhu, 2015) to

be a useful data quality framework for Cyber security

data sets. The “Feature selection” step employed

Benford’s law as a simple but computationally

effective method for identifying the set of significant

features from the expanded IoT Intrusion-2020 data

set. Based on applying these 3 steps of the CySecML

methodology the following set of significant features,

for the discovery of zero-day network attack types

from: the IoT intrusion-2020 (Ullah & Mahmoud,

2020), UNSW-NB15 (Moustafa & Slay, 2017),

CICDDOS2019 (Sharafaldin et al., 2019), and

CIRACICDOHBRW2020 (Montazerishatoori et al.,

2020), was identified:

• Flow duration: is defined as a measure of a set

of packets passing a particular point in network traffic

during a certain time interval (NetFort Technologies

Limited, 2014; Umer et al., 2017). Flow features have

been shown in the past to be effective for discovering

malicious network traffic, see (Khodjaeva & Zincir-

Heywood, 2021; Umer et al., 2017; Zavrak &

Iskefiyeli, 2020).

• Flow inter-arrival times: measures the time

between two consecutive network flows. The flow

inter-arrival times measure was adopted by (Arshadi

& Jahangir, 2017; Asadi, 2016; Sethi et al., 2020) to

discover anomalous network traffic types.

Data Sets for Cyber Security Machine Learning Models: A Methodological Approach

153

• Forward and backward packets: measures

packets received and sent in a network (Ullah &

Mahmoud, 2020).

• Packet sizes: is a measure of actual packet size

(Ullah & Mahmoud, 2020).

• Segment size: can be described as the largest

amount of data in bytes that a device can handle

(Ullah & Mahmoud, 2020).

• Source to destination / destination to source

packet count: measures the number of packets from

source to destination, and vice versa (Moustafa &

Slay, 2017).

• Source/destination TCP base sequence number:

can be described as a number used to keep track of

every byte sent or received by a host (Moustafa &

Slay, 2017).

• Number of connections of the same

source/destination address: contains information

about the number of connections of the same

source/destination address (Montazerishatoori et al.,

2020).

The above listed significant features can now be

used as “Input data” in a ML model such as the IBD

model for the purposes of discovering zero-day

network intrusion attacks as depicted in Figure 1. The

IBD model was implemented using the one-class

support vector machine (OCSVM), see (Mbona &

Eloff, 2022a) for model and evaluation measure

details. The results in Table 1 demonstrate that

“Approach 2” which is based on a combination of all

malicious network traffic, produced the best results

for the discovery of zero-day network traffic.

Table 1: IBD model experimental results using the IoT

Intrusion-2020 data set.

Training set

“Known attacks”

Testing set

“Zero-day

attack”

score

(%)

MCC

score

(%)

Benign + ARP

spoofing + Scan

host port

UDP flooding 76 60

Benign + Scan port

OS + Scan host port

+ ARP spoofing

SYN flooding

+ HTTP

flooding

73 56

UDP flooding Host brute

force

62 31

Benign + SYN

flooding + UDP

flooding + HTTP

flooding + ACK

flooding +

Host brute force

ARP spoofing

+ Scan host

port + Scan

port OS

70 59

Benign network

traffic only

All combined

attacks

71 64

The results in Table 1 indicate that the OCSVM

algorithms performs well in terms of F

and MCC

scores, although it performed poorly for the UDP

flooding and Host brute force case. Although each

case in Table 1 contains different type of attacks in

the training and testing stages, i.e., “Approach 1” and

“Approach 2” the results of the OCSVM are not

consistent. Consequently, this research makes an

important contribution to the ongoing search for

designing and discovering zero-day attack scenarios.

5 CONCLUSION

This paper illustrates the convergence of Cyber

security and artificial intelligence (AI) with a focus

on data sets. A methodological approach (CySecML)

was proposed to assist Cyber security researchers and

professionals in developing Cyber security ML

solutions. It is demonstrated how the CySecML

methodology optimises existing techniques such as

data collection or data creation, checking data quality,

and identifying features. CySecML was used on

several Cyber security ML projects, most notably the

work performed on NIDS, as illustrated in this paper.

Future work will focus on extending the data

collection step by refining the methods used to

generate synthetic Cyber security data and

developing a metric approach for measuring the

compliance to standards for data quality

characteristics of Cyber security ML data. Regarding

NIDS, this study makes an important contribution to

the ongoing search for designing and discovering

zero-day attack scenarios.

REFERENCES

Abdallah, Z. S., & Webb, G. I. (2017). Data Preparation.

Encyclopedia of Machine Learning and Data Mining,

September 2018, 318–327.

https://doi.org/10.1007/978-1-4899-7687-1

Ahmad, R., Alsmadi, I., Alhamdani, W., & Tawalbeh, L.

(2023). Zero-day attack detection: a systematic

literature review. Artificial Intelligence Review,

February, 1–79. https://doi.org/10.1007/s10462-023-

10437-z

Aljawarneh, S., Aldwairi, M., & Yassein, M. B. (2018).

Anomaly-based intrusion detection system through

feature selection analysis and building hybrid efficient

model. Journal of Computational Science, 25, 152–

160. https://doi.org/10.1016/j.jocs.2017.03.006

Amr, T. (2019). Hands-On Machine Learning with scikit-

learn and Scientific Python Toolkits. OReilly Media,

384.

Arshadi, L., & Jahangir, A. H. (2017). An empirical study

on TCP flow interarrival time distribution for normal

IoTBDS 2024 - 9th International Conference on Internet of Things, Big Data and Security

154

and anomalous traffic. International Journal of

Communication Systems, 30(1).

https://doi.org/10.1002/dac.2881

Asadi, A. N. (2016). An approach for detecting anomalies

by assessing the inter-arrival time of UDP packets and

flows using Benford’s law. Conference Proceedings of

2015 2nd International Conference on Knowledge-

Based Engineering and Innovation, KBEI 2015, 2(6),

257–262. https://doi.org/10.1109/KBEI.2015.7436057

Bonaccorso, G. (2020). Mastering Machine Learning

Algorithms: Expert techniques for implementing

popular machine learning algorithms, fine-tuning your

models, and understanding how they work. In OReilly

Media. OReilly Media.

Bowles, J. K. F., Silvina, A., Bin, E., & Vinov, M. (2020).

On Defining Rules for Cancer Data Fabrication. 168–

176. https://www.ndc.scot.nhs.uk/National-Datasets/

Brownlee, J. (2020). Data preparation for machine

learning: data cleaning, feature selection, and data

transforms in Python. Machine Learning Mastery.

https://books.google.co.za/books?hl=en&lr=&id=uAP

uDwAAQBAJ&oi=fnd&pg=PP1&dq=Data+preparati

on+for+machine+learning:+data+cleaning,+feature+se

lection,+and+data+transforms+in+Python&ots=Cl8Gu

chLpT&sig=MmuT6WgKuVEHvbXGQj91vPH2M_k

Cai, L., & Zhu, Y. (2015). The challenges of data quality

and data quality assessment in the big data era. Data

Science Journal, 14. https://doi.org/10.5334/dsj-2015-

002

Dankar, F. K., & Ibrahim, M. (2021). Fake it till you make

it: Guidelines for effective synthetic data generation.

Applied Sciences (Switzerland), 11(5).

https://doi.org/10.3390/app11052158

Figueira, A., & Vaz, B. (2022). Survey on Synthetic Data

Generation, Evaluation Methods and GANs. MDPI

Mathematics, 10(15).

https://doi.org/10.3390/math10152733

Financial Times. (2023, July 21). AI systems create

synthetic data to train next generation models.

https://ft.pressreader.com/v99c/20230721/2817326839

68639

García, S., Luengo, J., & Herrera, F. (2015). Feature

selection. Intelligent Systems Reference Library, 72(6),

163–193. https://doi.org/10.1007/978-3-319-10247-

4_7

Kaushik, R., & Dave, M. (2021). Malware Detection

System Using Ensemble Learning: Tested Using

Synthetic Data. Data Engineering and Communication

Technology: Proceedings of ICDECT 2021, 63, 153–

164. https://doi.org/10.1007/978-981-16-0081-4_16

Khalid, S., Khalil, T., & Nasreen, S. (2014). A survey of

feature selection and feature extraction techniques in

machine learning. Proceedings of 2014 Science and

Information Conference, SAI 2014, 1(October), 372–

378. https://doi.org/10.1109/SAI.2014.6918213

Khodjaeva, Y., & Zincir-Heywood, N. (2021, August 17).

Network Flow Entropy for Identifying Malicious

Behaviours in DNS Tunnels. ACM International

Conference Proceeding Series.

https://doi.org/10.1145/3465481.3470089

Kilincer, I. F., Ertam, F., & Sengur, A. (2021). Machine

learning methods for cyber security intrusion detection:

Datasets and comparative study. Computer Networks,

188. https://doi.org/10.1016/j.comnet.2021.107840

Kim, S., Hwang, C., & Lee, T. (2020). Anomaly based

unknown intrusion detection in endpoint environments.

Electronics (Switzerland), 9(6), 1–21.

https://doi.org/10.3390/electronics9061022

Kumar, V., & Sinha, D. (2023). Synthetic attack data

generation model applying generative adversarial

network for intrusion detection. Computers and

Security, 125.

https://doi.org/10.1016/j.cose.2022.103054

Kurniabudi, Stiawan, D., Darmawijoyo, Bin Idris, M. Y.

Bin, Bamhdi, A. M., & Budiarto, R. (2020). CICIDS-

2017 Dataset Feature Analysis with Information Gain

for Anomaly Detection. IEEE Access, 8, 132911–

132921.

https://doi.org/10.1109/ACCESS.2020.3009843

Lu, Y., Shen, M., Wang, H., & Wei, W. (2023). Machine

Learning for Synthetic Data Generation: A Review.

ArXiv Preprint ArXiv:2302.04062.

http://arxiv.org/abs/2302.04062

Mbona, I., & Eloff, J. H. P. (2022a). Detecting Zero-Day

Intrusion Attacks Using Semi-Supervised Machine

Learning Approaches. IEEE Access, 10(July), 69822–

69838.

https://doi.org/10.1109/ACCESS.2022.3187116

Mbona, I., & Eloff, J. H. P. (2022b). Feature selection using

Benford’s law to support detection of malicious social

media bots. Information Sciences, 582, 369–381.

https://doi.org/10.1016/j.ins.2021.09.038

Montazerishatoori, M., Davidson, L., Kaur, G., & Habibi

Lashkari, A. (2020). Detection of DoH Tunnels using

Time-series Classification of Encrypted Traffic. 2020

IEEE Intl Conf on Dependable, Autonomic and Secure

Computing, Intl Conf on Pervasive Intelligence and

Computing, Intl Conf on Cloud and Big Data

Computing, Intl Conf on Cyber Science and Technology

Congress (DASC/PiCom/CBDCom/CyberSciTech).,

63–70. https://doi.org/10.1109/DASC-PICom-

CBDCom-CyberSciTech49142.2020.00026

Moustafa, N., & Slay, J. (2017). The significant features of

the UNSW-NB15 and the KDD99 data sets for

Network Intrusion Detection Systems. Proceedings -

2015 4th International Workshop on Building Analysis

Datasets and Gathering Experience Returns for

Security, BADGERS 2015, 1(November), 25–31.

https://doi.org/10.1109/BADGERS.2015.14

Mukkamala, S., & Sung, A. H. (2003). Identifying

Signiﬁcant Features For Network Forensic Analysis

Using Artiﬁcial Intelligent Techniques. International

Journal of Digital Evidence, 1(4), 1–17.

NetFort Technologies Limited. (2014). Flow Analysis

Versus Packet Analysis . What Should You Choose ?

White Paper, 6.

Pu, G., Wang, L., Shen, J., & Dong, F. (2021). A hybrid

unsupervised clustering-based anomaly detection

method. Tsinghua Science and Technology, 26(2), 146–

153. https://doi.org/10.26599/TST.2019.9010051

Data Sets for Cyber Security Machine Learning Models: A Methodological Approach

155

Rupa Mahanti. (2019). Data Quality: Dimensions,

Measurement, Strategy, Management, and Governance.

In Quality Press. Quality Press.

https://www.proquest.com/openview/d4fd7a60ba9f04

a4b438e6df4720bec3/1?cbl=34671&pq-

origsite=gscholar&parentSessionId=l3HZaJC3EtvEQ4

9NCH9M7LjojVXg6OK878K4hWSG%2BB4%3D

Scheirer, W. J., Jain, L. P., & Boult, T. E. (2014).

Probability models for open set recognition. IEEE

Transactions on Pattern Analysis and Machine

Intelligence, 36(11).

https://doi.org/10.1109/TPAMI.2014.2321392

Sethi, K., Kumar, R., Prajapati, N., & Bera, P. (2020). A

Lightweight Intrusion Detection System using

Benford’s Law and Network Flow Size Difference.

2020 International Conference on COMmunication

Systems and NETworkS, COMSNETS 2020, 10, 1–6.

https://doi.org/10.1109/COMSNETS48256.2020.9027

422

Sharafaldin, I., Lashkari, A. H., Hakak, S., & Ghorbani, A.

A. (2019). Developing realistic distributed denial of

service (DDoS) attack dataset and taxonomy.

Proceedings - International Carnahan Conference on

Security Technology, 2019-Octob(October).

https://doi.org/10.1109/CCST.2019.8888419

Singh, U. K., Joshi, C., & Kanellopoulos, D. (2019). A

framework for zero-day vulnerabilities detection and

prioritization. Journal of Information Security and

Applications, 46, 164–172.

https://doi.org/10.1016/j.jisa.2019.03.011

Ullah, I., & Mahmoud, Q. H. (2020). A Technique for

Generating a Botnet Dataset for Anomalous Activity

Detection in IoT Networks. IEEE Transactions on

Systems, Man, and Cybernetics: Systems, 2020-

Octob(May), 134–140.

https://doi.org/10.1109/SMC42975.2020.9283220

Umer, M. F., Sher, M., & Bi, Y. (2017). Flow-based

intrusion detection: Techniques and challenges.

Computers and Security, 70, 238–254.

https://doi.org/10.1016/j.cose.2017.05.009

Van Der Walt, E., & Eloff, J. (2018). Using Machine

Learning to Detect Fake Identities: Bots vs Humans.

IEEE Access, 6, 6540–6549.

https://doi.org/10.1109/ACCESS.2018.2796018

Zavrak, S., & Iskefiyeli, M. (2020). Anomaly-Based

Intrusion Detection from Network Flow Features Using

Variational Autoencoder. IEEE Access, 8, 108346–

108358.

https://doi.org/10.1109/ACCESS.2020.3001350

Zheng, A., & Casari, A. (2018). Feature engineering for

machine learning: principles and techniques for data

scientists. In O’Reilly Media. O’Reilly Media.

https://perso.limsi.fr/annlor/enseignement/ensiie/Featu

re_Engineering_for_Machine_Learning.pdf%0Ahttps:

//www.amazon.com/Feature-Engineering-Machine-

Learning-Principles/dp/1491953241

Zoppi, T., Ceccarelli, A., & Bondavalli, A. (2021).

Unsupervised Algorithms to Detect Zero-Day Attacks:

Strategy and Application. IEEE Access, 9, 90603–

90615. https://doi.org/10.1109/access.2021.3090957

IoTBDS 2024 - 9th International Conference on Internet of Things, Big Data and Security

156