Castles Built on Sand: Observations from Classifying Academic
Cybersecurity Datasets with Minimalist Methods
Laurens D’hooge (https://orcid.org/0000-0001-5086-6361), Miel Verkerken (https://orcid.org/0000-0002-1781-900X),
Tim Wauters (https://orcid.org/0000-0003-2618-3311), Filip De Turck (https://orcid.org/0000-0003-4824-1199)
and Bruno Volckaert (https://orcid.org/0000-0003-0575-5894)
IDLab-Imec, Department of Information Technology, Ghent University,
Technologiepark-Zwijnaarde 126, 9052 Ghent, Belgium
Keywords:
Machine Learning, Cybersecurity, Data Quality Issues, Baselines.
Abstract:
Machine learning (ML) has been a staple of academic research into pattern recognition in many fields, includ-
ing cybersecurity. The momentum of ML continues to speed up alongside the advances in hardware capabili-
ties and the methods they unlock, primarily (deep) neural networks. However, this article aims to demonstrate
that the non-judicious use of ML in two prominent domains of data-based cybersecurity consistently misleads
researchers into believing that their proposed methods constitute actual improvements. Armed with 17 state-of-the-art datasets in traffic and malware classification and the simplest possible machine learning model, this article will show that the lack of variability in most of these datasets immediately leads to excellent models, even if that model is only one comparison per feature.
1 INTRODUCTION
The term pattern recognition has largely become syn-
onymous with machine learning, especially in aca-
demic research. While in itself this is not a bad di-
rection since these methods have proven capable of
solving hitherto unsolvable tasks, eager adoption of
increasingly complex ML models in security-oriented
pattern recognition tasks presents a problem.
That problem is evaluating the true merit of these
newly proposed complex models on datasets where
the margins at the top are very narrow. While model
authors claim the superiority of their novel models,
the reality is that it is not possible to rank-order them
effectively because extremely simple ML models of-
ten score equally well.
This article’s chief contribution is the continued
affirmation of the observation made by (Holte, 1993).
Within data-based cybersecurity, even the simplest
ML models perform well or even perfectly on the
available datasets. This observation from experiment
severely undercuts novel ML models’ claims that
they significantly improve detection. To the authors’
knowledge, the concern of increased model complex-
ity without commensurate gains in classification per-
formance has been raised by researchers in the field,
but has not yet been experimentally validated. Our
motivation for this work stems from both our search
for the smallest effective models for high-throughput
low-latency intrusion detection systems and from our
prior work that established that well-known models
still reach outstanding classification scores even un-
der severe data restrictions (D’hooge et al., 2021).
This article presents true baselines for 17 security
datasets obtained with a model at the lower bound of
ML model complexity, one rule (OneR, 1R). In many
cases its performance sits remarkably close to that of
most recently proposed classification methods.
The article is structured as follows. In
section 2 the literature on two broad domains of pat-
tern recognition in cybersecurity is outlined with a
particular focus on model innovation. Those two
domains are network traffic classification (primar-
ily intrusion detection) and malware detection. The
methodology is described in section 3 and focuses
on the models OneR and ensemble OneR. The re-
sults are conveyed per dataset in chronological order
in section 4. The discussion aggregates the results
and presents the insights and recommendations to im-
prove new detection algorithm proposals and to im-
prove the datasets themselves. The conclusion (section 6) and future work (section 7) round out the article.
2 RELATED WORK
In both security-oriented network traffic classification
and malware classification, academic research had al-
ready embraced machine learning as the method of
choice for further investigation by the 2010s. (Buczak
and Guven, 2015) and (Mishra et al., 2018) made this
observation the central theme of their landmark sur-
veys in intrusion detection. Across the entire spec-
trum of classical data mining and machine learning
methods, they were able to compile and compare
dozens of proposed algorithms. A bibliographical re-
view of malware research by (Ab Razak et al., 2016)
equally noted the rising prevalence of machine learn-
ing in malware recognition. (Shalaginov et al., 2018)
noted that static analysis was quickly getting infused
with ML methods to cope with the explosive growth
in the volume and variations of malware.
The aforementioned (older) surveys focus mostly
on classical ML methods, but a more recent survey by
(Ahmad et al., 2021) explicitly compares the amount
of new proposals with classical methods to new deep
learning approaches in intrusion detection. In their set
of recently proposed methods, they find that 60% are
now pure deep learning methods, 20% are hybrids be-
tween deep learning and classical methods and the re-
maining 20% continue to innovate purely on classical
methods. Two years prior, (Liu and Lang, 2019) se-
lected 26 proposals in intrusion detection, 14 of which
relied on deep learning. In the challenges and future
directions, the authors already identified the low ef-
ficiency of the ever more complicated methods as a
growing issue to be solved.
Within the suite of available deep architectures,
auto-encoders (AE), convolutional neural networks
(CNN), recurrent neural networks (RNN), deep belief
networks (DBN) and restricted Boltzmann machines
(RBM) were the most widely studied before 2020
(Berman et al., 2019), (Aldweesh et al., 2020). The
same architectures enjoyed the greatest popularity in
malware detection during the same period (Naway
and Li, 2018), (Qiu et al., 2020).
Due to the speed with which new architectures are
proposed, tested and adopted in more well-known ap-
plications of ML (e.g. vision and language modelling
tasks), they are adapted equally rapidly by data-based
security researchers. Large models based on the trans-
former architecture, originally designed for language
modelling and quickly transmuted for vision tasks,
are the latest inspiration for methods like MalBERT
(Rahali and Akhloufi, 2021) and DDoSTC (Wang and
Li, 2021).
The overall trend in the surveys and the latest
methods is clear. Ever more complicated detection al-
gorithms are proposed with performance gains indis-
tinguishable from run-to-run variance. Occasionally
model authors compare their new proposal to simpler
ML methods (very often to a stock random forest).
The impulse is commendable, but the execution is too
rushed. More adequate comparisons would include
XGBoost or Catboost and spend the same amount of
time and resources to optimize them as well as the
newly proposed algorithm. The step down to a sim-
pler model rarely occurs and even so, this article will
demonstrate that another model with a tiny fraction of
the complexity of XGBoost or Catboost still performs well or even excellently on the same academic cybersecurity datasets.
This literature section constrains itself to high-
quality reviews which surveyed hundreds of individ-
ual methods. Seventeen datasets were evaluated in this article and, due to limited space, it was not possible to provide individual examples for each dataset. For those readers who are not familiar with the domain, the reviews capture many of the included datasets and report the achievable performance in detail.
Similarly, the dataset authors / publishers often in-
cluded baselines (relevant citations in table 1).
2.1 Included Datasets
Seventeen academic datasets have been evaluated in
this analysis. Broadly, they fall in two categories:
traffic classification and malware classification. The
traffic classification datasets primarily include (multi-
class) intrusion detection datasets as well as some
specialty datasets. The set of malware datasets is not
as expansive, but still covers Android, Windows and
Linux malware as well as one specialized dataset on
malware delivery through malicious PDF files. Ta-
ble 1 presents an overview of the included datasets
with their relevant citation(s) and year of publication.
The author maintains clean versions of every dataset
on Kaggle updated in accordance with the latest re-
search. All computation, preprocessing and analy-
sis are publicly available at https://www.kaggle.com/
dhoogla/datasets.
3 METHODOLOGY
The methodology section introduces the two algo-
rithms which have been used throughout the analy-
sis. Their simplicity is a deliberate design choice in-
formed by (Holte, 1993) meant to highlight the ease
with which many of the state-of-the-art ML security
datasets can be classified.
Table 1: The Included ML-Focused Security Datasets.

Traffic Classification Datasets
Name | Year | Citation | Purpose
NSL-KDD | 2009 | (Tavallaee et al., 2009) | Multi-class intrusion detection
CTU-13 | 2014 | (García et al., 2014) | Botnet detection
UNSW-NB15 | 2015 | (Moustafa and Slay, 2015) | Multi-class intrusion detection
CIDDS-001 | 2017 | (Ring et al., 2017a) | Multi-class intrusion detection
CIDDS-002 | 2017 | (Ring et al., 2017b) | Port scanning detection
CIC-NIDS Collection
CIC-IDS | 2017 | (Sharafaldin et al., 2018) | Multi-class intrusion detection
CIC-DoS | 2017 | (Jazi et al., 2017) | DoS detection
CSE-CIC-IDS | 2018 | (Sharafaldin et al., 2018) | Multi-class intrusion detection
CIC-DDoS | 2019 | (Sharafaldin et al., 2019) | DDoS detection
CIC-Darknet | 2020 | (Habibi Lashkari et al., 2020) | VPN & Tor detection
CIRA-CIC-DoHBrw | 2020 | (MontazeriShatoori et al., 2020) | DNS covert channel detection
CIC-Bell-DNS-EXF | 2021 | (Mahdavifar et al., 2021) | DNS data exfiltration detection
USB-IDS-1 | 2021 | (Catillo et al., 2021) | DoS detection
Distrinet-CIC-IDS | 2021 | (Engelen et al., 2021) | Corrected issues in CIC-IDS2017

Malware Classification Datasets
CCCS-CIC-AndMal | 2020 | (Rahali et al., 2020; Keyes et al., 2021) | Android malware
CIC-Malmem | 2022 | (Carrier et al., 2022) | Obfuscated malware classification
CIC-Evasive-PDFMal | 2022 | (Issakhani et al., 2022) | Hidden malware in PDF files
3.1 The Simplest ML Model: One Rule
If you’re allowed only one comparison, which fea-
ture and which value would you pick? That’s the
central question of the One Rule (OneR) model.
Implementation-wise it is identical to a decision tree
model with just a root node. This article uses OneR
exclusively, but keeps a OneR model for every feature
in each dataset, rather than keeping only one. Figure
1 visually demonstrates the model.
Figure 1: OneR: one optimal split point on a number line.
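To make the model concrete, the following is a minimal sketch (not the authors' published pipeline) of the per-feature OneR setup: every feature gets its own depth-1 decision tree, i.e. a single comparison, and is scored with AUROC. The synthetic data and variable names are placeholders for any of the preprocessed datasets.

```python
# Minimal OneR sketch: one depth-1 decision tree (a single comparison) per
# feature, each scored with AUROC on a held-out test set.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=5000, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.8, random_state=0)

oner_models, oner_scores = {}, {}
for f in range(X_train.shape[1]):
    stump = DecisionTreeClassifier(max_depth=1)      # one optimal split point
    stump.fit(X_train[:, [f]], y_train)
    proba = stump.predict_proba(X_test[:, [f]])[:, 1]
    oner_models[f] = stump
    oner_scores[f] = roc_auc_score(y_test, proba)

# Features whose single comparison has any predictive power (AUROC > 0.5)
useful = [f for f, auc in oner_scores.items() if auc > 0.5]
print(sorted(oner_scores.items(), key=lambda kv: kv[1], reverse=True)[:5])
```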
The hard split point found for each feature either has some predictive power, if a disproportionate number of samples from one class falls below (or above) the split point, or it does not. No such predictive power reveals itself for two possible reasons, each with an edge case:
1. there is an equal proportion of the classes on each side of the split point
2. the value range of the positive samples for the feature is flanked by enough negative samples on both sides
The edge case for reason 1 is that all samples’ val-
ues for the feature fall on the split point (i.e. a fea-
ture with zero variability). Such features should be
removed by the dataset authors prior to publication.
Reason 2 does not need to be a problem if you are
allowed more than 1 decision. Its edge case is that
there is full overlap between the observed values for
a feature for all negative and all positive samples with
a majority of negative samples at each value.
3.1.1 The Simplest Ensemble
A single OneR model disregards all features but one.
This is extremely restrictive. We have opted to also
include an ensemble method with threshold optimiza-
tion. The simplest ensemble takes the outputs of each
OneR model with any predictive power and averages
them. These average outputs per sample then serve
as the final predictions. In general, an uplift in clas-
sification performance is observed for the ensemble
because it combines the predictions of independent
models (so long as a majority of the independent clas-
sifiers predicts the right outcome).
However, partly because an unweighted averaging
was used, it is possible to see a loss in classification
performance. If there is a preponderance of models
with weak classification performance then they will
drag the stronger models down. A change in the
weight contribution of the individual OneR models
tied to their individual performance would alleviate
this.
One final trick which was employed is the use of
Youden’s J statistic (Youden, 1950). J is calculated as
the true positive rate minus the false positive rate and
allows setting the optimal threshold to assign predic-
tions to a class (in a binary context). In the imple-
mentation, the maximum J and subsequent optimal
threshold are calculated based on the predictions for
the training set and the same threshold is re-used for
the test set to avoid leaking information.
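A sketch of how the unweighted ensemble and the threshold selection fit together is shown below, reusing the hypothetical oner_models, useful and train/test arrays from the previous sketch: the contributing stumps' positive-class probabilities are averaged, the threshold maximizing J = TPR - FPR is picked on the training predictions, and that same threshold is applied to the test set.

```python
# Sketch of the unweighted OneR ensemble with Youden's J threshold selection.
from sklearn.metrics import roc_curve

def ensemble_scores(models, X, features):
    # Average the positive-class probabilities of the contributing OneR models.
    probs = [models[f].predict_proba(X[:, [f]])[:, 1] for f in features]
    return np.mean(probs, axis=0)

train_scores = ensemble_scores(oner_models, X_train, useful)
fpr, tpr, thresholds = roc_curve(y_train, train_scores)
j = tpr - fpr                               # Youden's J statistic
best_threshold = thresholds[np.argmax(j)]   # chosen on the training set only

test_scores = ensemble_scores(oner_models, X_test, useful)
y_pred = (test_scores >= best_threshold).astype(int)   # re-used on the test set
```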
3.2 Sampling
Contrary to prevailing wisdom and the literature, an 80-20 train-test split is not required to train performant models. Inverting the standard split to its complement (20% train, 80% test) hardly affects performance. For the datasets with millions of samples, even more aggressive sampling (i.e. 1-99 or even 0.1-99.9 train-test) remains effective with negligible losses in performance (https://www.kaggle.com/code/dhoogla/cic-ids2017-03-minimal-data-binary). For the datasets with pre-designated splits (e.g. NSL-KDD, UNSW-NB15, USB-IDS-1), those splits were kept to preserve comparability with the state of the art, even though more aggressive sampling works equally well.
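As a minimal sketch of the aggressive sampling described above (assuming a feature matrix X and label vector y), a stratified split keeps the class ratio intact even at a 1% training fraction:

```python
# Sketch: train on 1% of the data, test on the remaining 99%.
from sklearn.model_selection import train_test_split

X_train_small, X_test_large, y_train_small, y_test_large = train_test_split(
    X, y, train_size=0.01, stratify=y, random_state=0)
```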
3.3 Common Dataset Preprocessing
All datasets are provided as comma-separated-value (CSV) files. CSVs are readable as text, but not efficient in terms of compute or storage. Every dataset underwent at least the following preprocessing steps:
1. Dropping metadata shortcut features (D’hooge et al., 2022a) and other contaminant features (D’hooge et al., 2022b) (only for NSL-KDD, UNSW-NB15 and the CIC-NIDS collection)
2. Downsizing the features into appropriate types
3. Removing samples with missing or corrupt values
4. Removing duplicate samples to avoid inflating classification scores
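A condensed pandas sketch of these steps is given below; the file name and the metadata column names are hypothetical placeholders, since the exact shortcut features differ per dataset:

```python
# Sketch of the common preprocessing: drop shortcut/contaminant features,
# remove corrupt and duplicate samples, and downsize the numeric types.
import numpy as np
import pandas as pd

df = pd.read_csv("some-dataset.csv")                          # placeholder file name
metadata_cols = ["flow_id", "src_ip", "dst_ip", "timestamp"]  # hypothetical shortcut features
df = df.drop(columns=[c for c in metadata_cols if c in df.columns])

df = df.replace([np.inf, -np.inf], np.nan).dropna()           # missing or corrupt values
df = df.drop_duplicates()                                     # avoid inflating scores

for col in df.select_dtypes(include="float64").columns:       # downsize to smaller types
    df[col] = df[col].astype("float32")
for col in df.select_dtypes(include="int64").columns:
    df[col] = pd.to_numeric(df[col], downcast="integer")
```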
3.4 A Note on Classification Resolution
Throughout this analysis, the choice was made to
work with binary classification. For most datasets
this is the most logical level because it coincides with
the labeling by the dataset authors. For the CIC col-
lection which has massive differences in sample size
for its available classes, classification was done on
each available attack class versus benign traffic. True
multi-class evaluation was eschewed because it would
obscure the class imbalance and because the model is
too simplistic.
4 RESULTS
The results are presented in two big blocks with the
network traffic datasets first and the malware datasets
second. Additional preprocessing (if any) is men-
tioned, the top-5 features are listed and the perfor-
mance of the unweighted ensemble of the successful
OneR models is presented. To keep the article maxi-
mally self-contained, the results are written in a dense
format. We recommend reading table 2 first which
presents a summary of the numerical results and then
circling back to the detailed results for the individual
datasets.
4.1 Network Traffic Classification
Datasets
The results on the fourteen network traffic classifi-
cation datasets are presented first and in chronolog-
ical order of dataset publication. Even though some
datasets are outdated by now and superseded by more
recent entries, certain design aspects of the older
datasets positively influence their usability and legit-
imately decrease OneR’s performance (a detailed ex-
planation is part of section 5).
4.1.1 NSL-KDD
NSL-KDD is included because it has been the most
studied NIDS dataset to date. Although its attacks are
no longer reflective of a modern network landscape,
the original dataset authors (KDD98/99) made sound
decisions to produce a varied dataset with separate
train and test sets with non-overlapping sets of cyber
attacks. For this evaluation, the given train-test split
was respected.
Even though NSL-KDD is already a rework of
the KDD99 dataset, more issues with contaminat-
ing features have recently been found (D’hooge
et al., 2022b). Those features have been removed
and the highest quality version of NSL-KDD is
publicly available at https://www.kaggle.com/dhoogla/
datasets/nslkdd.
OneR models are just moderately capable on this dataset, with the top-5 features all scoring between 0.745 and 0.819 AUROC (src-bytes, dst-bytes, flag, service, dst-host-same-srv-rate). OneR models for 22 out of the 36 features had at least some predictive power (AUROC > 0.5) and their unweighted ensemble scored 0.929 AUROC (precision 0.94 and recall 0.86).
4.1.2 CTU-13
After removing contaminating metadata features
which act as shortcuts for ML models, the botnet
dataset CTU-13 only has 9 features. Worse still, the
observed values for benign and malicious samples
overlap nearly perfectly. None of the OneR models
yield any results better than random guessing.
4.1.3 UNSW-NB15
UNSW-NB15 has pre-designated train and test sets,
so those were respected in the analysis. Nearly all
published articles that use this dataset are unaware
that there are contaminating features, but for this anal-
ysis, the improved version of the dataset is used.
With the corrected UNSW-NB15 dataset, OneR models remain capable, with the top-5 features being rate, sload, dbytes, dpkts and dmean, each scoring above 0.766 AUROC (max 0.781). Compared to the full set of features (contaminating ones included), a flat loss of 0.1 AUROC has been observed. The features sttl and ct-state-ttl used to be the top-2 performers with AUROCs above 0.85. For this dataset, the first instance of the ensemble underperforming an individual feature is observed, as the ensemble reaches just 0.776 AUROC with precision at 0.96 and recall at 0.62.
4.1.4 CIDDS-001
After the removal of contaminating metadata features,
CIDDS-001 has 12 features, 6 of which are one-
hot-encoded versions of TCP flags. With the ad-
ditional imbalance in benign versus malicious traf-
fic (4,354,282-11875) and full overlap in values and
value ranges for the malicious traffic, OneR models
are not sophisticated enough to create useful distinc-
tions.
4.1.5 CIDDS-002
Even though CIDDS-002 has the same minuscule
set of features and a worse class balance (benign
2,640,306, malicious 3,829) compared to CIDDS-
001, a few OneR models prove capable. During the
construction of CIDDS-002 only port scan attacks
were gathered.
Three features demonstrate splitting power with
a single comparison each. The observed values for
proto (0.833 AUROC), tos (0.782) and bytes (0.598)
have overlap between the benign and malicious sam-
ples. However, for the useful features, the malicious
samples have a majority on the edges of the range,
granting them their predictive power. The ensemble
OneR model reaches an AUROC of 0.878 with 78%
precision and 76% recall.
4.1.6 CIC-NIDS-Collection
Rather than evaluating the individual CIC-NIDS
datasets, the analysis has been conducted on the
global aggregate dataset of the NIDS collection.
Working on the collated version should improve the
variability due to the large increase in the number of
represented attacks.
Additionally, the CIC collection as used takes
(D’hooge et al., 2022a) and (D’hooge et al., 2022b)
into account, removing all shortcut metadata features
and all content features which have been identified as
contaminants. In total the dataset has more than 9 mil-
lion unique samples with 58 statistical features. For
academic researchers, it is the most comprehensive,
labeled NIDS dataset available.
The results will be discussed per represented at-
tack class. The analysis was done at this level of res-
olution because there are millions of DDoS and DoS
samples, whereas Portscan is represented by a mere
2255 (positive) samples. Collapsing all attack classes
into one dataset would skew the results to a DDoS /
DoS model.
DDoS: top-5 features: total-backward-packets, bwd-
packets/s, avg-fwd-segment-size, fwd-packet-length-
mean and fwd-packet-length-max (AUROCs between
0.702 and 0.728). 27/57 features contributed to the
ensemble OneR model which reaches 0.842 AUROC
with 0.655 precision and 0.8653 recall. Having the
combination of the DDoS samples of three datasets
(IDS17, IDS18 and DDoS19) definitely improved the
variability in the data (e.g. when compared to 4.1.11
DDoS).
DoS: top-5 features: packet-length-variance, bwd-
packet-length-std, flow-bytes/s, fwd-seg-size-min,
bwd-packets/s (AUROCs between 0.701 and 0.817).
39/57 OneR models contribute to the ensemble which
reaches 0.905 AUROC with 0.820 precision and
0.811 recall. Again, having the DoS samples of three
NIDS datasets (IDS17, DoS17 and IDS18) increased
the classification difficulty, especially for a model
such as OneR.
Bruteforce: top-5 features: subflow-fwd-bytes,
fwd-packets-length-total, fwd-header-length, fwd-
act-data-packets, bwd-header-length (AUROCs be-
tween 0.951 and 0.964). 40/57 OneR models con-
tribute to the ensemble which reaches 0.973 AUROC
with 0.979 precision and 0.936 recall.
Portscan: top-5 features: bwd-packet-length-
mean, bwd-packets-length-total, subflow-bwd-bytes,
packet-length-variance and packet-length-std (AU-
ROCs between 0.849 and 0.865). 34/57 OneR models
contribute to the ensemble which reaches 0.926 AU-
ROC with 0.692 precision and 0.925 recall.
Botnet: top-5 features: flow-bytes/s, flow-iat-mean,
avg-bwd-segment-size, bwd-packet-length-mean and
bwd-packets/s (AUROCs between 0.852 and 0.870).
38/57 OneR models contribute to the ensemble which
reaches 0.9793 AUROC with 0.996 precision and
0.948 recall.
Webattack: top-5 features: avg-packet-size, packet-
length-max, flow-bytes/s, packet-length-variance and
packet-length-std (AUROCs between 0.801 and
0.804). 40/57 OneR models contribute to the ensem-
ble which reaches 0.886 AUROC with 0.728 preci-
sion and 0.796 recall.
Infiltration: represented in both CIC-IDS datasets, but with only 30 samples in CIC-IDS2017. CSE-CIC-IDS2018 expanded the infiltration attacks, but considers the traffic from downloading a malicious file from Dropbox as part of the malicious traffic, heavily padding the sample count. The quality of the experiment to gather this attack class was not sufficient. As a consequence, no models perform well on this attack class. For completeness: the ensemble OneR model relied on positive contributions from just 4 features and reached an AUROC of 0.57.
On the individual datasets of the CIC-NIDS collection, results are more in line with the results on the updated Distrinet-CIC-IDS2017 (subsection 4.1.11).
The grouping into one larger, more varied dataset and
the removal of contaminating features has reduced the
effectiveness of the OneR model.
4.1.7 CIC-Darknet2020
The CIC Darknet dataset from 2020 has two compo-
nents. The dataset is designed around VPN and Tor
recognition and provides this labeling resolution to its
users (Tor/NonTor & VPN/NonVPN).
On the Tor/NonTor task, the top-5 features were
bwd header length, total bwd packets, ack flag count,
fwd init win bytes and bwd init win bytes with AU-
ROCs ranging from 0.753 up to 0.898. The ensem-
ble model reaches 0.980 AUROC with an F1 score of
0.732. A middling performance on the positive class
but high overall AUROC indicates that the model pre-
dicts very few false positives.
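For context, the F1 score is the harmonic mean of precision and recall; with the ensemble's rounded values from table 2 (precision 0.59, recall 0.97) this gives 2 × 0.59 × 0.97 / (0.59 + 0.97) ≈ 0.73, consistent with the reported 0.732.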
The VPN/NonVPN task had significantly worse
performance. Its top-5 features (subflow bwd bytes, bwd packet length max, total length of bwd packet, bwd packet length mean and bwd segment size avg) do not reach higher than 0.719 AUROC. The en-
semble with contributions by 53/77 features improves
slightly to 0.779 AUROC with an F1-score of 0.739,
indicating a pretty mixed performance in terms of
type I and type II errors.
4.1.8 CIRA-CIC-DoHBRW2020
CIRA-CIC-DoHBrw provides two layers of tasks. L1
focuses on the split between DoH and NonDoH traffic
while L2 focuses on the recognition of DoH tunnel
abuse for exfiltration / communication purposes for
malware as opposed to legitimate use.
On the L1-task, the top-5 features are packet-
time-mean, packet-time-stdev, packet-time-var, du-
ration and packet-length-mode (AUROCs between
0.767 and 0.856). The ensemble model reaches 0.903
AUROC with 60% precision and 82% recall. Nine of
the twenty-nine features contributed to the ensemble.
On the L2-task, the top-5 features are packet-time-
stdev, packet-time-var, flow-sent-rate, flow-bytes-sent
and flow-bytes-received (AUROCs between 0.604
and 0.698). The ensemble model reaches an AU-
ROC of 0.799 with a precision-recall pair of 0.971 and
0.972. The ensemble took the predictions of 13/29
models into account.
4.1.9 CIC-Bell-DNSExf2021
CIC-Bell-DNSExf2021 is the second CIC dataset tar-
geting covert communication and data exfiltration
over DNS. Two versions of the cleaning process for
this dataset have been performed. The first tries to
stay as close to the data as possible. The second ver-
sion is much more opinionated and focuses on getting
the data as close to its own documentation as possi-
ble. For this analysis, the minimally cleaned
files were chosen because they deviate the least from
the raw dataset.
Individual OneR models are relatively capable of
discerning the malicious from the benign DNS traf-
fic. The top-5 features are FQDN-count, subdomain-
length, special, sld and longest-word (AUROCs from
0.784-0.796). The ensemble model reaches 0.846
AUROC with an F1 score of 0.862. OneR models
for 24/41 features contributed to the ensemble.
4.1.10 USB-IDS-1
USB-IDS-1 focuses on DoS attacks and their poten-
tial defenses. Because millions of samples of four DoS attacks were captured under various defensive circumstances, it was expected that there would be a great deal of variety in the samples. The compatibility with
the feature set of the CIC collection also opens up fur-
ther investigations into generalization.
The expectation of a varied dataset, impossible to classify with a model as simple as a single comparison, did not hold up. The top-5 features are bwd
bytes/bulk avg, total length of bwd packet, fwd iat
min, bwd psh flags and fwd init win bytes, all with
AUROCs above 0.95. A single comparison with the
top feature fwd init win bytes produces a model with
0.999 AUROC. 46/77 features have immediate sep-
arating power. There is no need for the ensemble
model, but for completeness’ sake, its performance
was 0.997 AUROC with 0.982 F1 score.
4.1.11 Distrinet-CIC-IDS2017
The Distrinet update of CIC-IDS2017 offers the same
attack classes as the CIC collection. Even though this
is a single dataset, the distinction will be kept be-
cause the number of available samples for each of the
classes varies greatly.
DDoS: top-5 features: bwd-packet-length-mean, avg-
bwd-segment-size, packet-length-variance, packet-
length-std and bwd-packet-length-std (AUROCs from
0.993-0.998). 60/77 OneR models contribute to the
ensemble which reaches 0.999 AUROC with 0.995
precision and 0.979 recall.
DoS: top-5 features: fwd-iat-max, bwd-iat-mean,
flow-duration, fwd-iat-total and fwd-seg-size-min
(AUROCs from 0.858-0.899). 64/77 OneR models
contribute to the ensemble which reaches 0.965 AU-
ROC with 0.906 precision and 0.823 recall.
Bruteforce: top-5 features: bwd-header-length,
bwd-psh-flags, psh-flag-count, fwd-psh-flags and
down/up-ratio (AUROCs from 0.951-0.975). 53/77
contribute to the ensemble which reaches 0.998 AU-
ROC with 0.998 precision and 0.987 recall.
Portscan: top-5 features: fwd-packet-length-max,
fwd-packets-length-total, subflow-fwd-bytes, fwd-
packet-length-mean and avg-fwd-segment-size (AU-
ROCs from 0.951-0.955). 45/77 features contribute
to the ensemble which reaches 0.993 AUROC with
0.878 precision and 0.983 recall.
Botnet: top-5 features: packet-length-mean, avg-
packet-size, subflow-bwd-bytes, avg-bwd-segment-
size and bwd-packet-length-mean (AUROCs from
0.910-0.950). 53/77 features contribute to the ensem-
ble which reaches perfect 1.0 AUROC with 0.998 pre-
cision and perfect recall.
Webattack: top-5 features: bwd-header-length, bwd-
iat-std, init-bwd-win-bytes, init-fwd-win-bytes, fwd-
seg-size-min (AUROCs from 0.844-0.894). 56/77
features contribute to the ensemble which reaches
0.969 AUROC with 0.895 precision and 0.855 recall.
Even though the Distrinet lab materially improved
CIC-IDS2017, it remains a trivial dataset to classify.
4.2 Malware Classification Datasets
Although far fewer malware classification datasets
have been included, the results on them highlight the
same issue.
4.2.1 CCCS-CIC-AndMal2020
This mixture of around 400k samples of Android applications (half of which are malware) has 14 categories of malware including adware, backdoors, spyware, trojans, scareware, etc. It has features captured both through static and through dynamic analysis. For the benign software samples, the 150+ dynamic features were not captured, leaving only the 9500+ static features. The static features were not individually named in the dataset, so they will be represented simply by their index (Fxxxx).
The top-5 features were F50, F57, F37, F48, F58
all with AUROC scores above 0.84 (max 0.875). Of
the 9,503 features, fewer than 500 had direct separating
power. The ensemble OneR model reaches 0.953 AU-
ROC with 0.953 precision and 0.869 recall.
4.2.2 CIC-MalMem2022
The second malware dataset in this article has been
designed for the detection of obfuscated malware with
features derived from memory dumps. It is a tabular dataset with 55 unique features and 58,596 records, perfectly balanced with 29,298 benign and 29,298 malicious samples.
Our additional preprocessing was limited to
adding intermediate levels of labeling. Only the most
abstract level, the binary split between malicious and
benign samples, is taken into account.
The top-5 features were svcscan.nservices,
svcscan.shared-process-services, svcscan.kernel-
drivers, handles.nmutant and dlllist.avg-dlls-per-proc,
all landing with AUROC scores above 0.987. Of the
55 available features, 51 can yield single comparison
models with separating power beyond 0.5 AUROC.
The ensemble model of the useful single feature
models reached 0.996 AUROC and 0.984 F1 (0.974
precision, 0.993 recall).
4.2.3 CIC-Evasive-PDFMal2022
The third and final malware dataset centers on mal-
ware hidden in PDF files. It is almost balanced with
5555 malicious samples and 4468 benign samples.
Our preprocessing for this dataset was quite exten-
sive because two classes of data inconsistencies ex-
ist in CIC-Evasive-PDFMal2022. First, there are fea-
tures where a negative value is impossible based on
the documentation, but negative values occur. Sec-
ond, there are features which should be numeric, but
some samples have string or other non-numeric val-
ues.
The top-5 single-decision, single feature models
were built on startxref, metadatasize, javascript, js
and stream, all with AUROCs above 0.814. Luckily,
unlike CIC-MalMem2022, it is not possible to solve
the dataset well with one feature and one comparison.
For 20 of the 31 available features, single comparison
models have (at least some) effectiveness.
The ensemble of the 20 contributing OneR mod-
els reaches an AUROC of 0.986 with an F1 score of
0.9539 (precision 0.969, recall 0.939).
5 DISCUSSION
The results in section 4 are extremely dense, but the core thread is clearly visible. Even with a computational budget of one comparison per feature, good to excellent models can be found. Table 2 summarizes the results on each dataset. The table only mentions the performance of the ensemble of OneR models, since it is more often the better model. Still, for every entry marked with a †, an individual OneR model had the best performance. Figure 2 visualizes a blatant example of the lacking variability which leads to OneR being so powerful.
Table 2: Result Summary.

Traffic Classification Datasets
Dataset | AUROC | Precision | Recall
NSL-KDD | 0.93 | 0.94 | 0.86
CTU-13 | 0.5 | 0.5 | 0.5
UNSW-NB15 † | 0.78 | 0.96 | 0.62
CIDDS-001 | 0.5 | 0.5 | 0.5
CIDDS-002 | 0.88 | 0.78 | 0.76
CIC-NIDS Collection
DDoS | 0.84 | 0.66 | 0.87
DoS | 0.91 | 0.82 | 0.81
Bruteforce | 0.97 | 0.98 | 0.94
Portscan | 0.93 | 0.69 | 0.92
Botnet | 0.98 | 1.0 | 0.95
Webattack | 0.89 | 0.73 | 0.80
CIC-Darknet-Tor | 0.98 | 0.59 | 0.97
CIC-Darknet-VPN | 0.78 | 0.63 | 0.90
CIRA-CIC-DoHBrw-L1 | 0.90 | 0.60 | 0.82
CIRA-CIC-DoHBrw-L2 | 0.80 | 0.97 | 0.97
CIC-Bell-DNS-EXF | 0.85 | 0.80 | 0.94
USB-IDS-1 † | 1.0 | 1.0 | 0.96
Distrinet-improved-CIC-IDS2017
DDoS | 1.0 | 1.0 | 0.98
DoS | 0.97 | 0.91 | 0.82
Bruteforce | 1.0 | 1.0 | 0.99
Portscan | 1.0 | 1.0 | 1.0
Botnet | 1.0 | 1.0 | 1.0
Webattack | 0.97 | 0.90 | 0.86

Malware Classification Datasets
CCCS-CIC-AndMal | 0.95 | 0.95 | 0.87
CIC-Malmem † | 1.0 | 0.97 | 0.99
CIC-Evasive-PDFMal | 0.99 | 0.97 | 0.94
Three additional observations present themselves:
1. The CIC-NIDS collection does introduce more variability, not just additional volume, which leads to weaker performance for the OneR model
2. Datasets with intelligently predefined train-
validation-test splits are less affected
3. More malware datasets have to be examined since
they were amongst the most affected
The first observation leads to a hopeful conclu-
sion for future work. Datasets should strive towards
interoperability with one another to quickly achieve
higher variability, not just higher volume. The dif-
ferences in experimental setup and execution will
lead to new, unique samples. A comparison of fig-
ures 3 (Distrinet-IDS-2017-DDoS) and 4 (CIC-NIDS-
Collection-DDoS) immediately reveals how the dif-
ference in distributions for the same features leads
OneR to score so well on the former and a lot poorer
on the latter.
NSL-KDD and UNSW-NB15, even though they are quite a bit older than the latest datasets, have a design property that protects them from easily finding models with inflated performance. Creating designated train, validation and test splits that are not a mere consequence of random sampling, but contain unseen attacks from the same classes, guarantees new patterns that more rigorously test generalization beyond the training set.
Some datasets are not well-suited to ML evalu-
ation despite being created for that purpose. Both
CTU-13 and CIDDS-001 have so few features and
so much overlap in the available values for those
features between benign and malicious samples that
no method, however sophisticated, will reach ade-
quate performance. Both datasets are also extremely
class-imbalanced (CIDDS: benign 4,354,282, mali-
cious 11,875, CTU-13: benign 8,432,312, malicious
190,210). A visual explanation of the issue is given
in figure 5. That visualization also reveals a short-
coming in the experiment’s design. All attacks were
short in duration and therefore did not generate a large
amount of traffic (bytes). Worse still is the distribu-
tion of the third feature “flows” which has 0 variance
because it has only one potential value (1.0).
Since OneR is the simplest model that can
be devised, it is not always a genuine contender
compared to recently proposed models. The gap in
complexity between OneR and the state-of-the-art
proposals is hard to overstate. Even if method
authors do not compare their proposal to OneR, they
should compare it to XGBoost (Chen et al., 2015),
Catboost (Prokhorenkova et al., 2018) or even just to
randomized decision trees (Geurts et al., 2006). For
the included datasets which cover the state-of-the-art
in NIDS and a fair portion of malware classification,
method authors would discover that their proposals do
not outperform simpler, computationally less expen-
sive models. This rings particularly true for the slew of deep neural networks that have been proposed in recent years.
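To illustrate the kind of stronger-but-still-cheap baseline recommended here, the sketch below fits the extremely randomized trees of (Geurts et al., 2006) via scikit-learn, reusing the hypothetical train/test arrays from the sketches in section 3; an XGBoost or Catboost classifier would slot in the same way.

```python
# Sketch: a cheap tree-ensemble baseline that new detection proposals should
# be compared against (and tuned with the same care) before claiming superiority.
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.metrics import roc_auc_score

baseline = ExtraTreesClassifier(n_estimators=100, n_jobs=-1, random_state=0)
baseline.fit(X_train, y_train)
print(roc_auc_score(y_test, baseline.predict_proba(X_test)[:, 1]))
```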
Figure 2: Lackluster variation in the data creates OneR’s excellent performance on CIC-Malmem2022.
Figure 3: The class-grouped distributions of Distrinet-DDoS demonstrate why it is so easily classified.
Figure 4: The class-grouped distributions of the CIC-NIDS collection DDoS show significantly more variability in its samples.
Figure 5: CIDDS-001: Heavy class imbalance and complete overlap in the value ranges for the benign and malicious samples
or no variation at all.
6 CONCLUSION
Cybersecurity research has ventured far into machine
learning for its pattern recognition tasks. New model
proposals are numerous and often borrow / recom-
bine model advances discovered for other ML tasks.
Along the way it has lost its commitment to deliver-
ing models with excellent performance while keeping
computational cost low. Survey authors had already
noticed that the optimization for ever higher classifi-
cation metrics resulted in growing model complexity
without keeping in mind that many security-related
pattern recognition tasks are time-sensitive and bene-
fit from a dual optimization of model complexity and
computational efficiency. Their objections have not
dissuaded the research community from continuing to
increase model complexity even though the margins
for improvement on many state-of-the-art datasets are
slim to non-existent.
This article is inspired by (Holte, 1993) which
demonstrated that the simplest supervised machine
learning model, one rule (OneR, 1R) performed well
on the most commonly used datasets of its time. On
17 state-of-the-art cybersecurity datasets for network
traffic classification and malware recognition, the ef-
fectiveness of OneR and ensemble OneR is estab-
lished. Despite being nothing more than a single com-
parison, OneR (or ensemble OneR) often proves itself
competitive with recent detection proposals. Further
investigation into the datasets themselves has revealed
that many suffer from a lack of variability which al-
lows even OneR to be effective.
From a practical standpoint, the authors of this
article want to urge authors of new detection pro-
posals to compare their work to a well-tuned XG-
Boost or Catboost model before claiming superiority
for their model. Researchers working on any dataset
should also be aware of OneR’s performance even
if they don’t explicitly mention it in their proposals.
Ultimately, the authors of this article hope that this
practice will lead more researchers to investigate the
datasets themselves, to uncover their flaws and dedi-
cate their future work to improving dataset quality so
that the more complex models will actually become
necessary. As it stands today, the complexity of many
novel detection methods is not warranted for the avail-
able datasets.
7 FUTURE WORK
For the network traffic datasets this article should
clearly convey the issue, but the sampling of malware
classification datasets is too sparse to carry the same
weight. Future work (primarily on Kaggle) will con-
tinue to include more security datasets (especially in
malware classification) to establish baselines. Cyber-
security datasets have to improve, particularly when it
comes to variety. Future work will focus on this data
generation problem with particular interest in dataset
compatibility.
In network intrusion detection datasets another
fundamental problem exists that may be tied to the
lack of variability in the datasets. Proposed methods
yield excellent results in the standard intra-dataset
evaluation (the same dataset for training, validation
and testing), but they fail to generalize to inter-dataset
evaluation, even if the second dataset has been
collected under very similar circumstances.
ACKNOWLEDGEMENTS
This work has partially been supported by the AIDE
project which is a three-way partnership between gov-
ernment, academia and industry. It aims to achieve
robust, resilient and adaptive protection of computer
systems with a particular role for federated learning.
The AIDE project is funded by the Belgian Chan-
cellery of the Prime Minister, a federal public service,
as part of their financing for the development of arti-
ficial intelligence.
REFERENCES
Ab Razak, M. F., Anuar, N. B., Salleh, R., and Firdaus, A.
(2016). The rise of “malware”: Bibliometric analysis
of malware study. Journal of Network and Computer
Applications, 75:58–76.
Ahmad, Z., Shahid Khan, A., Wai Shiang, C., Abdullah, J.,
and Ahmad, F. (2021). Network intrusion detection
system: A systematic study of machine learning and
deep learning approaches. Transactions on Emerging
Telecommunications Technologies, 32(1):e4150.
Aldweesh, A., Derhab, A., and Emam, A. Z. (2020). Deep
learning approaches for anomaly-based intrusion de-
tection systems: A survey, taxonomy, and open issues.
Knowledge-Based Systems, 189:105124.
Berman, D. S., Buczak, A. L., Chavis, J. S., and Corbett,
C. L. (2019). A survey of deep learning methods for
cyber security. Information, 10(4):122.
Buczak, A. L. and Guven, E. (2015). A survey of data min-
ing and machine learning methods for cyber security
intrusion detection. IEEE Communications surveys &
tutorials, 18(2):1153–1176.
Carrier, T., Victor, P., Tekeoglu, A., and Lashkari, A. H.
(2022). Detecting obfuscated malware using memory
feature engineering. In ICISSP, pages 177–188.
Catillo, M., Del Vecchio, A., Ocone, L., Pecchia, A., and
Villano, U. (2021). Usb-ids-1: a public multilayer
dataset of labeled network flows for ids evaluation.
In 2021 51st Annual IEEE/IFIP International Con-
ference on Dependable Systems and Networks Work-
shops (DSN-W), pages 1–6. IEEE.
Chen, T., He, T., Benesty, M., Khotilovich, V., Tang, Y.,
Cho, H., Chen, K., et al. (2015). Xgboost: extreme
gradient boosting. R package version 0.4-2, 1(4):1–4.
D’hooge, L., Verkerken, M., Volckaert, B., Wauters, T., and
De Turck, F. (2022a). Establishing the contaminat-
ing effect of metadata feature inclusion in machine-
learned network intrusion detection models. In Inter-
national Conference on Detection of Intrusions and
Malware, and Vulnerability Assessment, pages 23–41.
Springer.
D’hooge, L., Verkerken, M., Wauters, T., Volckaert, B.,
and De Turck, F. (2021). Hierarchical feature block
ranking for data-efficient intrusion detection model-
ing. Computer Networks, 201:108613.
D’hooge, L., Verkerken, M., Wauters, T., Volckaert, B., and
De Turck, F. (2022b). Discovering non-metadata con-
taminant features in intrusion detection datasets. In
2022 19th Annual International Conference on Pri-
vacy, Security & Trust (PST), pages 1–11. IEEE.
Engelen, G., Rimmer, V., and Joosen, W. (2021). Trou-
bleshooting an intrusion detection dataset: the ci-
cids2017 case study. In 2021 IEEE Security and Pri-
vacy Workshops (SPW), pages 7–12. IEEE.
García, S., Grill, M., Stiborek, J., and Zunino, A. (2014). An empirical comparison of botnet detection methods. Computers & Security, 45:100–123.
Geurts, P., Ernst, D., and Wehenkel, L. (2006). Extremely
randomized trees. Machine learning, 63(1):3–42.
Habibi Lashkari, A., Kaur, G., and Rahali, A. (2020). Di-
darknet: a contemporary approach to detect and char-
acterize the darknet traffic using deep image learning.
In 2020 the 10th International Conference on Com-
munication and Network Security, pages 1–13.
Holte, R. C. (1993). Very simple classification rules per-
form well on most commonly used datasets. Machine
learning, 11(1):63–90.
Issakhani, M., Victor, P., Tekeoglu, A., and Lashkari, A.
(2022). Pdf malware detection based on stacking
learning. In Proceedings of the 8th International
Conference on Information Systems Security and Pri-
vacy - Volume 1: ICISSP, pages 562–570. INSTICC,
SciTePress.
Jazi, H. H., Gonzalez, H., Stakhanova, N., and Ghorbani,
A. A. (2017). Detecting http-based application layer
dos attacks on web servers in the presence of sam-
pling. Computer Networks, 121:25–36.
Keyes, D. S., Li, B., Kaur, G., Lashkari, A. H., Gagnon, F.,
and Massicotte, F. (2021). Entroplyzer: Android mal-
ware classification and characterization using entropy
analysis of dynamic characteristics. In 2021 Recon-
ciling Data Analytics, Automation, Privacy, and Se-
curity: A Big Data Challenge (RDAAPS), pages 1–12.
IEEE.
Liu, H. and Lang, B. (2019). Machine learning and deep
learning methods for intrusion detection systems: A
survey. Applied Sciences, 9(20):4396.
Mahdavifar, S., Hanafy Salem, A., Victor, P., Razavi, A. H.,
Garzon, M., Hellberg, N., and Lashkari, A. H. (2021).
Lightweight hybrid detection of data exfiltration using
dns based on machine learning. In 2021 the 11th Inter-
national Conference on Communication and Network
Security, pages 80–86.
Mishra, P., Varadharajan, V., Tupakula, U., and Pilli, E. S.
(2018). A detailed investigation and analysis of us-
ing machine learning techniques for intrusion de-
tection. IEEE communications surveys & tutorials,
21(1):686–728.
MontazeriShatoori, M., Davidson, L., Kaur, G., and
Lashkari, A. H. (2020). Detection of doh tun-
nels using time-series classification of encrypted
traffic. In 2020 IEEE Intl Conf on Depend-
able, Autonomic and Secure Computing, Intl Conf
on Pervasive Intelligence and Computing, Intl
Conf on Cloud and Big Data Computing, Intl
Conf on Cyber Science and Technology Congress
(DASC/PiCom/CBDCom/CyberSciTech), pages 63–
70. IEEE.
Moustafa, N. and Slay, J. (2015). Unsw-nb15: a compre-
hensive data set for network intrusion detection sys-
tems (unsw-nb15 network data set). In 2015 Mili-
tary Communications and Information Systems Con-
ference (MilCIS), pages 1–6.
Naway, A. and Li, Y. (2018). A review on the use of deep
learning in android malware detection. arXiv preprint
arXiv:1812.10360.
Prokhorenkova, L., Gusev, G., Vorobev, A., Dorogush,
A. V., and Gulin, A. (2018). Catboost: unbiased boost-
ing with categorical features. Advances in neural in-
formation processing systems, 31.
Qiu, J., Zhang, J., Luo, W., Pan, L., Nepal, S., and Xiang,
Y. (2020). A survey of android malware detection
with deep neural models. ACM Computing Surveys
(CSUR), 53(6):1–36.
Rahali, A. and Akhloufi, M. A. (2021). Malbert: Mal-
ware detection using bidirectional encoder representa-
tions from transformers. In 2021 IEEE International
Conference on Systems, Man, and Cybernetics (SMC),
pages 3226–3231. IEEE.
Rahali, A., Lashkari, A. H., Kaur, G., Taheri, L., Gagnon,
F., and Massicotte, F. (2020). Didroid: Android mal-
ware classification and characterization using deep
image learning. In 2020 The 10th international
conference on communication and network security,
pages 70–82.
Ring, M., Wunderlich, S., Grüdl, D., Landes, D., and Hotho, A. (2017a). Creation of flow-based data sets for intrusion detection. Journal of Information Warfare, 16:40–53.
Ring, M., Wunderlich, S., Grüdl, D., Landes, D., and Hotho, A. (2017b). Flow-based benchmark data sets for intrusion detection. In Proceedings of the 16th European Conference on Cyber Warfare and Security (ECCWS), pages 361–369. ACPI.
Shalaginov, A., Banin, S., Dehghantanha, A., and Franke,
K. (2018). Machine learning aided static malware
analysis: A survey and tutorial. Cyber threat intel-
ligence, pages 7–45.
Sharafaldin, I., Habibi Lashkari, A., and Ghorbani, A. A. (2018). Toward generating a new intrusion detection dataset and intrusion traffic characterization. In Proceedings of the 4th International Conference on Information Systems Security and Privacy - ICISSP, pages 108–116. INSTICC, SciTePress.
Sharafaldin, I., Lashkari, A. H., Hakak, S., and Ghorbani,
A. A. (2019). Developing realistic distributed denial
of service (ddos) attack dataset and taxonomy. In 2019
International Carnahan Conference on Security Tech-
nology (ICCST), pages 1–8.
Tavallaee, M., Bagheri, E., Lu, W., and Ghorbani, A. A.
(2009). A detailed analysis of the kdd cup 99 data set.
In 2009 IEEE Symposium on Computational Intelli-
gence for Security and Defense Applications, pages
1–6.
Wang, H. and Li, W. (2021). Ddostc: A transformer-based
network attack detection hybrid mechanism in sdn.
Sensors, 21(15):5047.
Youden, W. J. (1950). Index for rating diagnostic tests. Can-
cer, 3(1):32–35.