8 CONCLUSION
A reliable, publicly available IDS dataset is an important concern
for researchers in this domain. The Canadian Institute for
Cybersecurity is a major provider of IDS datasets. We analyzed one of
them, CIC-IDS2017, and identified its key issues: some features are
not calculated correctly, the protocol information is partly
incorrect, and the way TCP packets are grouped into flows is not
suitable for machine learning processing. We then proposed LycoSTand,
a tool for processing PCAP files, and used it to generate a new
dataset, LYCOS-IDS2017, from the CIC-IDS2017 PCAP files. The tool and
dataset are publicly available so that they can be used to replicate
and improve our results.
A fair comparison of the original dataset with LYCOS-IDS2017
demonstrated that the corrections made by our tool have a positive
impact on all tested machine learning algorithms. Metrics such as
accuracy, precision and recall are above 99.5% for all algorithms
except LDA, which nevertheless improves significantly, from 88% to
96%. The best results are obtained with Random Forest, which
outperforms all other algorithms on all metrics. Finally, we observe
that SVM and QDA, which do not rank well on CIC-IDS2017, become
interesting algorithms for LYCOS-IDS2017. This shows that a corrected
dataset may lead researchers to reconsider their choice of algorithms
in IDS studies.
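For readers who want to reproduce such a comparison, the sketch below recalls how accuracy, precision and recall are derived from confusion-matrix counts. It assumes a binary benign/attack labelling; the class name ConfusionCounts and the example counts are hypothetical illustrations, not values taken from this study.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfusionCounts:
    """Hypothetical container for binary confusion-matrix counts."""
    tp: int  # attack flows correctly flagged as attacks
    tn: int  # benign flows correctly left unflagged
    fp: int  # benign flows wrongly flagged as attacks
    fn: int  # attack flows that were missed


def accuracy(c: ConfusionCounts) -> float:
    """Share of all flows that were classified correctly."""
    total = c.tp + c.tn + c.fp + c.fn
    return (c.tp + c.tn) / total if total else 0.0


def precision(c: ConfusionCounts) -> float:
    """Share of flagged flows that really were attacks."""
    flagged = c.tp + c.fp
    return c.tp / flagged if flagged else 0.0


def recall(c: ConfusionCounts) -> float:
    """Share of real attacks that were flagged."""
    attacks = c.tp + c.fn
    return c.tp / attacks if attacks else 0.0


# Illustrative counts only (not results from the paper).
example = ConfusionCounts(tp=995, tn=990, fp=10, fn=5)
print(f"accuracy={accuracy(example):.4f}")
print(f"precision={precision(example):.4f}")
print(f"recall={recall(example):.4f}")
```

Computing the three metrics from the same counts for both datasets keeps the comparison like for like; on heavily imbalanced traffic, accuracy alone can look high even when recall is poor.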
The issues we identified in CIC-IDS2017 exist only in the CSV files.
This shows how important it is to provide raw data along with the
flow-based features. Markers of the same problems were found in all
five datasets generated with CICFlowMeter. These datasets nevertheless
remain of interest because their PCAP files can be processed with
LycoSTand to obtain better flow-based datasets. Because most
publications related to these datasets rely on algorithms that process
the CSV files, it is important to make a corrected version available
so that research findings are not affected by the erroneous CSV files.
Future work could examine the datasets published by CIC and listed in
Section 7, with the goal of confirming the suspected issues. In
addition, with the rise of IoT devices, it would be interesting to
study how such devices can be protected from network intrusions. We
intend to improve the execution time of LycoSTand, deploy the model on
a resource-constrained system, and replicate attacks to investigate
whether a model trained on the dataset can detect intrusions launched
with a penetration testing toolset. Such research could highlight
limitations of ML-based IDS solutions for embedded systems and
determine whether the current dataset is prone to concept drift.