following way:
In Section 2 we present some state-of-the-art solutions
in the big data context. Section 3 introduces our solution.
Experimental results are described in Section 4.
Finally, Section 5 concludes the paper and presents our
future work.
2 RELATED WORKS
Only a few papers focus on intrusion detection systems
in big data frameworks (Hassan et al., 2020). For
instance, (Terzi et al., 2017) used a clustering algorithm to
detect network anomalies from NetFlow data. Experiments
performed on the CTU-13 dataset achieved 96% accuracy
but a high false alarm rate.
(Hassan et al., 2020) exploited a convolutional neural
network and a weight-dropped long short-term
memory (WDLSTM) network to build the IDS. The
work was tested on the UNSW-NB15 dataset and
achieved an accuracy of 97.17%.
(Mohamed et al., 2022) proposed an intrusion
detection framework using Apache Spark for IoT. Three
algorithms from Spark's MLlib, namely Random Forest,
Decision Tree and Naive Bayes, were tested on the BoT-IoT
dataset and compared by F1-measure. Experiments show that
Decision Tree achieves the highest F1-measure in the big
data context, i.e. when using the whole dataset, with
97.9% for binary classification and 79% for multi-class
classification.
(Liu et al., 2020) proposed a network intrusion
detection system based on Deep Random Forest.
The model was deployed in a Spark environment.
Four datasets were used in the experiments, namely
NSL-KDD, UNSW-NB15, CICIDS2017 and CICIDS2018,
and good results were achieved.
(Al-Rawi, 2019) used two algorithms from
Spark's MLlib. The first is a Multi-Layer Perceptron,
which classifies the data as normal or attack. Data
classified as attacks are fed to the second classifier,
a Random Forest, for further verification.
The proposed IDS achieves an overall accuracy of
99.12% on the UNSW-NB15 dataset.
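The two-stage arrangement described above can be illustrated with a minimal sketch. It is not the authors' implementation: the names `cascade_predict`, `toy_mlp` and `toy_forest` are hypothetical, and the two toy threshold rules only stand in for trained Multi-Layer Perceptron and Random Forest models.

```python
from typing import Callable, Sequence

Features = Sequence[float]
Classifier = Callable[[Features], bool]  # True means "attack"


def cascade_predict(
    record: Features,
    first_stage: Classifier,
    second_stage: Classifier,
) -> bool:
    """Flag a record as an attack only if both stages flag it."""
    if not first_stage(record):
        # The first stage says "normal": accept without further checks.
        return False
    # The first stage says "attack": let the second stage verify.
    return second_stage(record)


# Hypothetical stand-ins for the trained models (assumption, not from the paper).
def toy_mlp(record: Features) -> bool:
    return record[0] > 0.5


def toy_forest(record: Features) -> bool:
    return record[1] > 0.5
```

In such a cascade the second classifier only sees records already flagged by the first, so it can reduce false alarms but cannot recover attacks that the first stage missed.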
(Kurt and Becerikli, 2018) also performed a
comparison between different machine learning
algorithms provided by Spark's MLlib, namely Logistic
Regression, Support Vector Machine, Naive Bayes
and Random Forest. Experiments on the KDD99 dataset
show that Logistic Regression achieves the best
accuracy with 99.1%. However, Naive Bayes achieves the
lowest training and prediction times.
(Vimalkumar and Radhika, 2017) presented an
intrusion detection framework for smart grids using
Apache Spark and various machine learning
techniques, namely Deep Neural Networks, Support
Vector Machines, Random Forest, Decision Trees and
Naive Bayes. Feature selection and dimensionality
reduction algorithms are also exploited. Experiments
were conducted on the synchrophasor dataset and the
results were compared using several metrics, i.e. accuracy,
recall, false rate, specificity, and prediction time. The best
results were achieved by the Naive Bayes classifier with
an accuracy of 79.21%.
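For reference, the count-based metrics listed above can be computed from a binary confusion matrix as in the following sketch. The names `ConfusionCounts` and `ids_metrics` are hypothetical, and reading "false rate" as the false alarm (false positive) rate is an assumption, since the source does not define it.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class ConfusionCounts:
    tp: int  # attacks correctly flagged
    fp: int  # normal records wrongly flagged as attacks
    tn: int  # normal records correctly passed
    fn: int  # attacks missed


def _ratio(numerator: int, denominator: int) -> float:
    """Return numerator / denominator, or 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def ids_metrics(c: ConfusionCounts) -> Dict[str, float]:
    """Compute standard binary-classification metrics for an IDS."""
    total = c.tp + c.fp + c.tn + c.fn
    return {
        "accuracy": _ratio(c.tp + c.tn, total),
        "recall": _ratio(c.tp, c.tp + c.fn),
        "specificity": _ratio(c.tn, c.tn + c.fp),
        # Assumed meaning of "false rate": share of normal traffic flagged.
        "false_alarm_rate": _ratio(c.fp, c.fp + c.tn),
    }
```

Specificity and the false alarm rate sum to one, which is why a detector can report high accuracy on an imbalanced dataset while still raising many false alarms.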
(Ouhssini et al., 2021) proposed a distributed IDS
for cloud systems based on big data tools and machine
learning algorithms. The system is composed
of four components: a network data collector, a
streamer based on Kafka, a preprocessing/data cleaning
stage, and a data normalization/feature selection stage
using the k-means algorithm. Different ML techniques were
evaluated for anomaly detection. After comparison, the authors
chose Decision Tree for their system because of its
accuracy and detection time.
(Bagui et al., 2021) introduced an IDS based on
Random Forest for a distributed big data environment
using Apache Spark. The classifier was tested using the
UNSW-NB15 dataset. The authors used information gain
and principal component analysis (PCA) to address
the high dimensionality of the dataset. The
highest accuracy, obtained by the binary classifier,
was 99.94%.
(Awan et al., 2021) applied machine learning
approaches, namely Random Forest (RF) and Multi-Layer
Perceptron (MLP), through the Spark ML library
for the detection of Denial of Service (DoS) attacks.
The model achieved a mean accuracy of 99.5%.
(Jemili and Bouras, 2021) proposed an intrusion
detection system based on big data fuzzy analytics.
Fuzzy C-Means (FCM) is used to cluster
and classify the training dataset. Experiments
were conducted with the CTU-13 and UNSW-NB15 datasets and
show high performance in terms of accuracy (97.2%)
and recall (96.4%).
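As background, the standard Fuzzy C-Means membership rule assigns each point a degree of membership in every cluster instead of a single hard label. The sketch below shows only that textbook rule; the function name `fcm_memberships` and the fuzzifier default `m = 2.0` are assumptions, and it does not reproduce the cited system.

```python
import math
from typing import List, Sequence


def fcm_memberships(
    point: Sequence[float],
    centers: Sequence[Sequence[float]],
    m: float = 2.0,
) -> List[float]:
    """Return the fuzzy membership of `point` in each cluster center."""
    if m <= 1.0:
        raise ValueError("the fuzzifier m must be greater than 1")
    distances = [math.dist(point, center) for center in centers]
    # A point lying exactly on one or more centers belongs fully to them.
    on_center = [d == 0.0 for d in distances]
    if any(on_center):
        share = 1.0 / sum(on_center)
        return [share if hit else 0.0 for hit in on_center]
    exponent = 2.0 / (m - 1.0)
    # u_i = 1 / sum_k (d_i / d_k)^(2 / (m - 1)); memberships sum to 1.
    return [
        1.0 / sum((d_i / d_k) ** exponent for d_k in distances)
        for d_i in distances
    ]
```

One common way to turn these memberships into a classification, assumed here rather than taken from the source, is to assign the record to the cluster with the highest membership.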
Although the works mentioned above were proposed
for the big data context, most of them did not address some
big data challenges such as velocity: data in a
big data environment arrive at very high speed and
should therefore be processed in real time.
For this reason, we introduce in this paper a
real-time data preprocessing and detection approach for big
data environments.
Table 1 summarizes those works and compares them
with our solution based on experimental results,
especially the accuracy rate and the training
and testing times.
An Efficient Real Time Intrusion Detection System for Big Data Environment