An Efficient Real Time Intrusion Detection System for Big Data Environment

Faten Louati (1, a), Farah Barika Ktata (2, b) and Ikram Amous (3, c)

(1) MIRACL Laboratory, FSEGS, Sfax University, Sfax, Tunisia

(2) MIRACL Laboratory, ISSATSo, Sousse University, Sousse, Tunisia

(3) MIRACL Laboratory, Enet’com, Sfax University, Sfax, Tunisia

Keywords:

Intrusion Detection System, Big Data, Spark Streaming, Real Time Detection, Machine Learning.

Abstract:

Nowadays, security is among the most difficult issues in networks all over the world, and the problem becomes more challenging with the emergence of big data. Intrusion detection systems (IDSs) are among the most efficient solutions. However, traditional IDSs cannot cope with big data challenges and are unable to detect attacks in real time. In this paper, data preprocessing and attack detection are both performed in real time. Experiments on the well-known benchmark NSL KDD dataset show good results in terms of both accuracy and training and testing time, and indicate that our model outperforms other state-of-the-art solutions.

1 INTRODUCTION

All data today could be considered Big Data because of the rapid growth in the use of cloud/edge computing and 5G technologies, which are used in all aspects of life such as economics, politics, culture and health care, to name a few. This rapid development brings along two big challenges, namely security and safety. The complexity of Big Data makes the task of processing and handling data very hard. Hence, data are more vulnerable to different types of attacks. Since their introduction by Anderson in 1980 (Anderson, 1980), intrusion detection systems (IDSs) have been in continuous development and have been widely investigated by researchers as being among the most efficient solutions for network security.

An IDS is a kind of software that monitors and analyzes network traffic and automatically sends an alert once a malicious activity is detected (Louati and Ktata, 2020). There are two main techniques for IDSs: signature-based and anomaly-based. A signature-based IDS relies on a database of attack signatures to decide whether a given pattern is an attack or not. Although this approach achieves high accuracy and a low false alarm rate, it is unable to detect unseen attacks. Thus, this technique is not suitable for the big data context, since new kinds of attacks appear every day. On the other hand, the anomaly-based approach tackles this limitation and achieves a good detection rate for known as well as unknown attacks, but its main drawback is a high false alarm rate, i.e., it may trigger an alert for a benign pattern. For this reason, we use an enhanced anomaly-based intrusion detection approach built on Machine Learning (ML) algorithms, since they achieve a high level of accuracy and a low false alarm rate in classification problems.

a https://orcid.org/0000-0002-8582-6092

b https://orcid.org/0000-0001-5706-4548

c https://orcid.org/0000-0002-5893-9833

Because most existing IDSs are still unable to deal with huge amounts of data in real time, in this paper we propose a new real time intrusion detection system for the big data environment. We address big data challenges such as velocity and volume by performing both data preprocessing and data classification in real time.

For this purpose, we created two clusters. The role of the first is to prepare and preprocess the incoming streams of data in real time and in parallel, then send them to the second cluster, which classifies them in real time and in parallel too. At this stage, we used the benchmark NSL KDD dataset to simulate network traffic. Experimental results show that our work outperforms other state-of-the-art solutions in terms of accuracy as well as preprocessing and detection time.

The remainder of the paper is organized as follows: Section 2 presents some state-of-the-art solutions in the big data context. Section 3 introduces our solution. Experimental results are described in Section 4. Finally, Section 5 concludes the paper and presents our future work.

Louati, F., Ktata, F. and Amous, I.

An Efficient Real Time Intrusion Detection System for Big Data Environment.

DOI: 10.5220/0011885900003393

In Proceedings of the 15th International Conference on Agents and Artificial Intelligence (ICAART 2023) - Volume 3, pages 1004-1011

ISBN: 978-989-758-623-1; ISSN: 2184-433X

© 2023 by SCITEPRESS – Science and Technology Publications, Lda. Under CC license (CC BY-NC-ND 4.0)

2 RELATED WORKS

Only a few papers focus on intrusion detection systems in big data frameworks (Hassan et al., 2020). For instance, (Terzi et al., 2017) used a clustering algorithm to detect network anomalies from NetFlow data. Experiments performed on the CTU-13 dataset achieved 96% accuracy but a high false alarm rate.

(Hassan et al., 2020) exploited a convolutional neural network and a weight-dropped long short-term memory (WDLSTM) network to build the IDS. The work was tested on the UNSW-NB15 dataset and achieved an accuracy of 97.17%.

(Mohamed et al., 2022) proposed an intrusion detection framework using Apache Spark for IoT. Three Spark MLlib algorithms, namely Random Forest, Decision Tree and Naive Bayes, were tested on the BoT-IoT dataset and compared by F1-measure. Experiments show that Decision Tree achieves the highest F1-measure in the big data context, i.e., when using the whole dataset, with 97.9% for binary classification and 79% for multi-class classification.

(Liu et al., 2020) proposed a network intrusion detection system based on Deep Random Forest. The model was deployed in a Spark environment. Four datasets were used in the experiments, namely NSL KDD, UNSW-NB15, CICIDS2017 and CICIDS2018, and good results were achieved.

(Al-Rawi, 2019) used two algorithms from Spark’s MLlib. The first is a Multi-Layer Perceptron, which classifies the data as normal or attack. Data classified as attacks are passed to the second classifier, a Random Forest, for further verification. The proposed IDS achieves an overall accuracy of 99.12% on the UNSW-NB15 dataset.

(Kurt and Becerikli, 2018) also compared different machine learning algorithms provided by Spark’s MLlib, namely Logistic Regression, Support Vector Machine, Naive Bayes and Random Forest. Experiments on the KDD99 dataset show that Logistic Regression achieves the best accuracy with 99.1%, whereas Naive Bayes achieves the lowest training and prediction time.

(Vimalkumar and Radhika, 2017) presented an intrusion detection framework for smart grids using Apache Spark and various machine learning techniques, namely Deep Neural Networks, Support Vector Machines, Random Forest, Decision Trees and Naive Bayes. Feature selection and dimensionality reduction algorithms are also exploited. Experiments are conducted on the synchrophasor dataset and the results are compared using accuracy, recall, false rate, specificity and prediction time. The best results were achieved by the Naive Bayes classifier with an accuracy of 79.21%.

(Ouhssini et al., 2021) proposed a distributed IDS for cloud systems based on big data tools and machine learning algorithms. The system is composed of four components, namely a network data collector, a streamer based on Kafka, preprocessing/data cleaning, and data normalizing/feature selection using the k-means algorithm. Different ML techniques are used for anomaly detection. After comparison, the authors chose Decision Tree for their system because of its accuracy and detection time.

(Bagui et al., 2021) introduced an IDS based on Random Forest for a distributed big data environment using Apache Spark. The classifier is tested using the UNSW-NB15 dataset. The authors used information gain and principal component analysis (PCA) to address the high dimensionality of the dataset. The highest accuracy, obtained by the binary classifier, was 99.94%.

(Awan et al., 2021) applied machine learning approaches, namely Random Forest (RF) and Multi-Layer Perceptron (MLP), through the Spark ML library for the detection of Denial of Service (DoS) attacks. The model achieved a mean accuracy of 99.5%.

(Jemili and Bouras, 2021) proposed an intrusion detection system based on big data fuzzy analytics, in which Fuzzy C-Means (FCM) is used to cluster and classify the training dataset. Experiments conducted on the CTU-13 and UNSW-NB15 datasets show high performance in terms of accuracy (97.2%) and recall (96.4%).

Although the works mentioned above are proposed for the big data context, most of them do not address some big data challenges such as velocity: data in a big data environment arrive at very high speed and hence should be processed in real time.

For this reason, we introduce in this paper real time data preprocessing and detection within a big data environment.

Table 1 summarizes those works and compares them with our solution based on experimental results, especially the accuracy rate and the training and testing time.


Table 1: Comparison of cited works for Big Data context.

Ref. | Approach | Dataset | Results | Training time | Testing time

(Jemili and Bouras, 2021) | Big data fuzzy analytics | CTU-13 and UNSW-NB15 | accuracy = 97.2% | - | -

(Awan et al., 2021) | Random Forest + Multi-Layer Perceptron + Spark | The application layer DDoS dataset | accuracy = 99.5% | 34.11 min | 0.46 min

(Bagui et al., 2021) | Random Forest + Spark | UNSW-NB15 dataset | accuracy = 99.94% | - | -

(Ouhssini et al., 2021) | Decision Tree + Spark + Kafka | CIDDS-001 dataset | accuracy = 99.97% | - | -

(Vimalkumar and Radhika, 2017) | ML algorithms + Spark | Synchrophasor dataset | accuracy = 79.21% | - | 18.23 sec for Random Forest

(Hassan et al., 2020) | Convolutional Neural Network + Weight-Dropped Long Short-Term Memory | UNSW-NB15 | accuracy = 97.17% | - | -

(Terzi et al., 2017) | Clustering algorithm | CTU-13 | accuracy = 96% | - | -

(Mohamed et al., 2022) | (Random Forest / Decision Tree / Naive Bayes) + Spark | BoT-IoT | F1-measure = 97.9% for binary classification and 79% for multi-class classification | - | -

(Liu et al., 2020) | Deep Random Forest + Spark | NSL KDD / UNSW-NB15 / CICIDS2017 / CICIDS2018 | For NSL KDD: accuracy = 99.1% | - | 16.1 sec for NSL KDD

(Al-Rawi, 2019) | Multi-Layer Perceptron + Random Forest + Spark | UNSW-NB15 | accuracy = 99.12% | - | -

(Kurt and Becerikli, 2018) | (Logistic Regression / Support Vector Machine / Naive Bayes / Random Forest) + Spark | KDD | accuracy = 99.1% | 4.041 hours | 0.089 hours

Our solution | Real time streaming data preprocessing + real time streaming data intrusion detection using Spark | NSL KDD | accuracy = 99% | 32.043 sec | 5.76 sec

3 THE PROPOSED SOLUTION

The main idea is to provide a solution that meets the challenge of Big Data velocity while maintaining a high detection rate. Most, if not all, state-of-the-art solutions prepare the data first and then perform the detection/classification task. This sequential approach takes two time periods, i.e., preparation time plus detection time. For this reason, our solution reduces the overall time by performing the preparation and detection tasks in parallel and in real time. This means that the data arriving at the system in a continuous stream are prepared and classified concurrently. As shown in Fig. 1, our model is composed of two main components: the first is responsible for data preparation and preprocessing in real time, and the second is responsible for intrusion detection, also in real time. In this section, we give a brief background on the concepts used and explain the solution in detail.

3.1 Background

1. Big Data: The term big data dates back to 2005. It is typically defined by three words: volume, velocity and variety (Louati et al., 2022). However, some researchers extend these well-known 3Vs of big data to 6Vs, namely:

• Volume: Size of the data

• Variety: Diversity of the data

• Velocity: Speed of the data

• Veracity: Uncertainty of the data

• Value: Usefulness of the data

• Variability: The way in which the data are used and formatted

2. Apache Spark: We used Apache Spark (Spark, 2014), a widely used framework for Big Data analysis, processing and parallel Machine Learning. Spark relies on in-memory computation; it is therefore reported to run programs up to 100 times faster in memory, and 10 times faster on disk, than Apache Hadoop.

Apache Spark provides high-level APIs accessible in several programming languages such as Scala, Java, Python and R. It is composed of Spark Core and higher-level libraries: Spark SQL, which deals with SQL and structured data; Spark MLlib, which contains a large set of ML and data mining algorithms; GraphX, which helps with graph processing; and Spark Streaming, for real-time stream processing. Spark is able to process huge amounts of data in real time using big data analytics tools and its streaming engine, which brings many benefits that can be exploited in the intrusion detection field.

We tested four Spark MLlib algorithms, namely Decision Tree, Random Forest, Logistic Regression and Naive Bayes.

3. Decision Tree: Decision Tree (DT) is a powerful supervised machine learning algorithm. It consists of dividing the dataset into subsets based on attribute value tests. Each internal node in the tree represents a test on an attribute, each branch represents an outcome of the test, and each leaf node represents a class label. The main advantages of Decision Tree are that it can handle high-dimensional data and has a high accuracy rate.

4. Random Forest: Random Forest (Khan et al., 2021) consists of a large number of individual decision trees that operate as an ensemble. Each individual tree in the random forest outputs a class prediction, and the class with the most votes becomes the final model’s prediction (Yiu, 2019).

5. Logistic Regression: The logistic regression algorithm models the relationship between a categorical dependent variable and one or more independent variables, which may be continuous or categorical. It is usually used to predict the probability that a sample belongs to a given class.

6. Naive Bayes: A Naive Bayes (NB) classifier is a probabilistic machine learning model used for the classification task and is based on Bayes’ theorem.
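The decision rules behind items 4 to 6 can be illustrated with a few lines of plain Python. This is only a sketch under stated assumptions: the function names, the toy feature values and the probabilities are invented for illustration, and the code does not reproduce Spark MLlib's implementations. The logistic example covers the binary case only.

```python
import math
from collections import Counter


def majority_vote(tree_predictions):
    """Random Forest rule: the class predicted by the most trees wins.
    Ties are broken by first occurrence (an arbitrary choice here)."""
    label, _ = Counter(tree_predictions).most_common(1)[0]
    return label


def sigmoid(z):
    """Logistic function mapping a real score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def logistic_probability(weights, bias, features):
    """Binary logistic regression: P(class = 1 | features)."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)


def naive_bayes_predict(priors, likelihoods, observed):
    """Naive Bayes rule: pick the class maximizing P(c) * prod_i P(x_i | c).
    Logarithms are summed to avoid numeric underflow."""
    scores = {}
    for label, prior in priors.items():
        score = math.log(prior)
        for value in observed:
            score += math.log(likelihoods[label][value])
        scores[label] = score
    return max(scores, key=scores.get)


# Hypothetical toy inputs, not taken from NSL KDD.
forest_label = majority_vote(["DoS", "normal", "DoS", "Probe", "DoS"])
attack_probability = logistic_probability([1.0, -1.0], 0.0, [2.0, 2.0])
bayes_label = naive_bayes_predict(
    priors={"normal": 0.5, "attack": 0.5},
    likelihoods={
        "normal": {"tcp": 0.6, "many_failed_logins": 0.1},
        "attack": {"tcp": 0.5, "many_failed_logins": 0.7},
    },
    observed=["tcp", "many_failed_logins"],
)
print(forest_label, attack_probability, bayes_label)
```

In the Naive Bayes toy case, the unnormalized scores are 0.5 × 0.6 × 0.1 = 0.03 for normal and 0.5 × 0.5 × 0.7 = 0.175 for attack, so the attack class is selected.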

3.2 The Solution’s Architecture

We created two Spark clusters using Docker (Docker, 2013); each cluster is built on one or more Docker containers. The use of Docker brings many advantages, such as providing an isolated environment for the applications, so that they can be deployed anywhere, i.e., in the cloud or on local machines. Besides, Docker helps in data processing and analysis by packaging and managing dependencies, e.g., Python libraries.

The first cluster (the preprocessor cluster) works as an agent responsible for preparing and preprocessing the incoming data streams in real time so that they are suitable for being fed to a machine learning model. The preprocessed streams are stored on a shared Docker volume. The second cluster (the detector cluster) takes those streams one by one and performs classification in real time. We used the four Spark MLlib algorithms presented in Section 3.1, i.e., Random Forest, Naive Bayes, Decision Tree and Logistic Regression, to train the detector cluster and create models. Then, Spark Streaming is used for testing.

A comparison of those models is performed in terms of training time, testing time and performance metrics.

The application is divided into two phases:

The first phase consists of offline training. This means that we train the model with batches (not streams) of the training dataset. For this purpose, a Spark application was created in the preprocessor cluster and run as a batch job. After being prepared, the new training set is saved in a Docker volume. This volume is shared between the two clusters and can be accessed by both.

Figure 1: The proposed solution.

A second Spark application was created in the detector cluster and also run as a batch job. This cluster trains the Random Forest classifier using the preprocessed training set. Once the training is completed, the detector cluster saves the model in the Docker volume.

In the second phase, online classification is performed on data arriving in the form of streams. To simulate a real time situation, we divided the testing set into 30 parts, where each part represents a collected data flow.
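The paper does not state how the 30 parts were cut, so the following is only a sketch assuming a contiguous, near-equal split; the helper name `split_into_parts` is hypothetical and integers stand in for the 22544 KDDTest+ rows.

```python
def split_into_parts(rows, n_parts):
    """Split rows into n_parts contiguous chunks whose sizes differ by at most one."""
    base, extra = divmod(len(rows), n_parts)
    parts, start = [], 0
    for i in range(n_parts):
        # The first `extra` chunks receive one additional row.
        size = base + (1 if i < extra else 0)
        parts.append(rows[start:start + size])
        start += size
    return parts


# Stand-in for the 22544 test samples (assumption: row order is kept).
test_rows = list(range(22544))
parts = split_into_parts(test_rows, 30)
print(len(parts), len(parts[0]), len(parts[-1]))
```

With 22544 rows and 30 parts, 14 parts hold 752 rows and the remaining 16 hold 751.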

In this phase, the preprocessor cluster applies data preprocessing and data cleaning to each stream of data in the same way as in the training phase and saves the result in the Docker volume. Meanwhile, the detector cluster listens to the volume and checks whether new data have arrived to be classified.
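The listening step can be pictured with a small polling helper. This is a sketch only: the function name `new_stream_files`, the file names, and the temporary directory used in place of the shared Docker volume are assumptions for illustration and are not the authors' code. In a Spark deployment, a file-based streaming source would typically play this role.

```python
import os
import tempfile


def new_stream_files(volume_dir, already_seen):
    """Return the files in volume_dir that have not been processed yet,
    sorted by name, and remember them in already_seen."""
    current = sorted(
        name for name in os.listdir(volume_dir)
        if os.path.isfile(os.path.join(volume_dir, name))
    )
    fresh = [name for name in current if name not in already_seen]
    already_seen.update(fresh)
    return fresh


# Demo: a temporary directory stands in for the shared volume.
with tempfile.TemporaryDirectory() as volume:
    seen = set()
    open(os.path.join(volume, "part-000.csv"), "w").close()
    first = new_stream_files(volume, seen)   # the preprocessor wrote one stream
    open(os.path.join(volume, "part-001.csv"), "w").close()
    second = new_stream_files(volume, seen)  # only the newly arrived stream
    print(first, second)
```

Tracking the set of seen files ensures each preprocessed stream is handed to the classifier exactly once.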

At this stage, the application uses the NSL KDD dataset as input and reads the data from a local file. However, in future work we aim to make the solution more realistic by reading the data from ingestion tools such as Kafka.

The workflow of the application is as follows:

1. Offline training:

1- Create and run a Spark application in the preprocessor cluster as a batch job

2- Create and run a Spark application in the detector cluster as a batch job

3- The preprocessor cluster prepares and preprocesses the training dataset

4- The preprocessor cluster saves the preprocessed dataset in the shared Docker volume

5- The detector cluster trains its model with the preprocessed training set

2. Online classification:

6- The preprocessor cluster reads the streaming data and preprocesses them stream by stream in real time

7- Each preprocessed stream is saved in the shared Docker volume

8- The detector cluster accesses the saved streams and performs detection in real time

9- Output the results

Fig. 1 depicts the workflow described above.
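The online phase (steps 6 to 9) is essentially a two-stage pipeline in which preprocessing of one stream overlaps with classification of the previous one. The sketch below mimics that structure with two threads and an in-memory queue standing in for the two clusters and the shared volume. The `preprocess` and `classify` functions are hypothetical placeholders, not the paper's preprocessing or its trained Random Forest.

```python
import queue
import threading

SENTINEL = None  # marks the end of the incoming streams


def preprocess(record):
    # Placeholder cleaning step (assumption): trim and lowercase.
    return record.strip().lower()


def classify(record):
    # Placeholder rule (assumption), not a trained model.
    return "attack" if "flood" in record else "normal"


def run_pipeline(streams):
    shared = queue.Queue()  # stands in for the shared Docker volume
    results = []

    def preprocessor():
        for stream in streams:
            shared.put([preprocess(r) for r in stream])
        shared.put(SENTINEL)

    def detector():
        while True:
            batch = shared.get()
            if batch is SENTINEL:
                break
            results.append([classify(r) for r in batch])

    workers = [threading.Thread(target=preprocessor),
               threading.Thread(target=detector)]
    for worker in workers:
        worker.start()
    for worker in workers:
        worker.join()
    return results


streams = [[" SYN Flood ", "http get"], ["ICMP FLOOD"]]
results = run_pipeline(streams)
print(results)
```

Because the detector starts consuming as soon as the first stream is ready, the total time approaches the slower of the two stages per stream rather than their sum.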

To build the system, we used Docker Compose, which helps to create and run multiple related Docker containers. As shown in Fig. 2, our Docker Compose setup contains six containers:

• One container for the preprocessor cluster, created from the Bitnami (Bitnami, 2022) Spark image. It is a single-node cluster composed of the master only, i.e., with no workers, because data preparation does not need them. The container runs on port 8888.

• Three containers for the detector cluster, created from the Bitnami Spark image. The cluster is composed of a master and two workers and runs on ports 8080/4040. One core and 1 GB of memory are assigned to each worker.

The two clusters use the same volume, which is bind-mounted to a local directory. To visualize the results, we load them into a database to facilitate querying. For this reason, we created two other containers:

• One container for PostgreSQL (Postgres, 2022), which is a powerful, open source object-relational database system. The container is created from the postgres image and runs on port 5432.

• One container created from the Adminer (Adminer, 2022) image. Adminer is a full-featured database management tool written in PHP and available for MySQL, PostgreSQL, SQLite, MS SQL, Oracle, Firebird, SimpleDB, Elasticsearch and MongoDB. The container runs on port 8088.

Figure 3 depicts the results of the Random Forest classifier as shown by Adminer.

Table 2: Number of samples of each class in the training set (KDDTrain+).

Class | Count

normal | 67343

DoS | 45927

Probe | 11656

R2L | 995

U2R | 52

Total | 125973

Table 3: Number of samples of each class in the testing set (KDDTest+).

Class | Count

normal | 9711

DoS | 7458

R2L | 2754

Probe | 2421

U2R | 200

Total | 22544

4 EXPERIMENTAL RESULTS

4.1 NSL KDD Dataset

We evaluate the performance of our solution using the NSL KDD dataset (Tavallaee et al., 2009), which is widely used in intrusion detection research.

The NSL KDD dataset is an improved version of the KDD99 dataset, composed of 42 features. Each sample of the dataset is labeled as either normal or a specific kind of network attack. All the attacks can be categorized into one of four main classes, i.e., Denial of Service (DoS), Remote to Local (R2L), Probe and User to Root (U2R) (Table 2, Table 3).

The NSL KDD dataset is provided as two files, KDDTest+.txt and KDDTrain+.txt. The first file represents the test set. As shown in Table 3, it is composed of 22544 samples, of which 9711 are labeled as normal, 7458 are DoS attacks, 2754 are R2L attacks, 2421 are Probe attacks and 200 are U2R attacks.

The second file represents the training set. As shown in Table 2, it is composed of 125973 samples, of which 67343 are labeled as normal, 45927 are DoS attacks, 11656 are Probe attacks, 995 are R2L attacks and 52 are U2R attacks.

Hence, the dataset is large and represents the big data context well.

Table 4: Comparison of both training and testing time of used MLlib algorithms.

Algorithm | Training time (s) | Testing time (s)

Random Forest | 32.0438 | 5.7666

Decision Tree | 38.2766 | 3.4811

Naive Bayes | 1.4716 | 3.5905

Logistic Regression | 6.1271 | 3.3405
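The class counts of Tables 2 and 3 can be turned into class shares to make the imbalance between classes explicit. The snippet below is a small illustrative calculation; the helper name `class_shares` is hypothetical.

```python
# Counts copied from Table 2 (KDDTrain+) and Table 3 (KDDTest+).
train_counts = {"normal": 67343, "DoS": 45927, "Probe": 11656, "R2L": 995, "U2R": 52}
test_counts = {"normal": 9711, "DoS": 7458, "R2L": 2754, "Probe": 2421, "U2R": 200}


def class_shares(counts):
    """Return the fraction of samples belonging to each class."""
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}


train_shares = class_shares(train_counts)
test_shares = class_shares(test_counts)
for label, share in train_shares.items():
    print(f"{label:>6}: {share:.2%}")
```

In the training set, normal traffic accounts for a little over half of the samples, while U2R represents well under 0.1%, which is why weighted and per-label metrics are reported alongside plain accuracy.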

4.2 Discussion

To evaluate the solution, we rely on common metrics, namely accuracy, precision, recall, F1-measure, true positive rate, false positive rate, log loss and Hamming loss. All those metrics are computed using the MulticlassClassificationEvaluator class from pyspark.ml.evaluation. Table 5 compares the results of the four classifiers. As shown in the table, Random Forest achieves the best results. Regarding time, as shown in Table 4, Naive Bayes gives the best training time and Logistic Regression the best testing time, whereas Decision Tree has the worst training time and Random Forest the worst testing time.
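To make the relationships between some of these metrics concrete, the following plain-Python sketch computes accuracy, Hamming loss and weighted recall from label lists. It is an illustration under the assumption of single-label multi-class predictions, not the Spark evaluator itself; the function names and toy labels are invented.

```python
def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label equals the true label."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def hamming_loss(y_true, y_pred):
    """For single-label predictions this is the misclassification fraction."""
    return 1.0 - accuracy(y_true, y_pred)


def weighted_recall(y_true, y_pred):
    """Per-class recall averaged with weights equal to class frequency."""
    total = len(y_true)
    score = 0.0
    for label in set(y_true):
        support = sum(t == label for t in y_true)
        hits = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        score += (support / total) * (hits / support)
    return score


# Hypothetical toy labels.
y_true = ["normal", "DoS", "DoS", "Probe"]
y_pred = ["normal", "DoS", "normal", "Probe"]
print(accuracy(y_true, y_pred), hamming_loss(y_true, y_pred),
      weighted_recall(y_true, y_pred))
```

Frequency-weighted recall always equals accuracy, and Hamming loss is its complement in this setting, which matches the pattern visible in Table 5 (for example 0.43 and 0.57 for Naive Bayes).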

We chose the Random Forest classifier for our solution because, although it gives the worst testing time, it remains efficient, especially in view of its excellent results on the other metrics such as accuracy, precision and recall. Furthermore, comparing our results with other state-of-the-art solutions indicates that our model outperforms other models in terms of accuracy and of both training and testing time, as shown in Table 1.
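Since Table 1 reports times in seconds, minutes and hours, converting them to a common unit helps when reading the comparison. The snippet below does this for the rows that report both training and testing time; the dictionary layout and the helper name `to_seconds` are illustrative choices, and the values are copied from Table 1. The cited works use different datasets and hardware, so the converted figures are indicative only.

```python
SECONDS_PER_UNIT = {"sec": 1, "min": 60, "hours": 3600}

# (training time, testing time) as reported in Table 1.
reported_times = {
    "Awan et al., 2021": ((34.11, "min"), (0.46, "min")),
    "Kurt and Becerikli, 2018": ((4.041, "hours"), (0.089, "hours")),
    "Our solution": ((32.043, "sec"), (5.76, "sec")),
}


def to_seconds(value, unit):
    """Convert a reported duration to seconds."""
    return value * SECONDS_PER_UNIT[unit]


times_in_seconds = {
    ref: (to_seconds(*train), to_seconds(*test))
    for ref, (train, test) in reported_times.items()
}
for ref, (train_s, test_s) in times_in_seconds.items():
    print(f"{ref}: training {train_s:.1f} s, testing {test_s:.1f} s")
```

Expressed in seconds, the reported training times are about 2046.6 s and 14547.6 s for the two cited works, against 32.043 s for the proposed solution.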

5 CONCLUSION

The characteristics of Big Data generated in networks, such as high volume and high speed, have made attack detection by traditional approaches very difficult. That is why the invention of new techniques able to analyze big data and to predict and classify large amounts of data in real time is a persistent need. The purpose of this work is to provide a new solution that improves the efficiency and the speed of intrusion detection in the context of big data by performing real time preprocessing and classification of incoming streams of data.

Figure 2: Docker containers architecture.

Figure 3: Visualization of classification results with Adminer.

In summary, the main contributions of this work are:

• Building an intrusion detection system capable of dealing with big data streams in near-real time.

• Addressing the challenges of the big data environment, namely velocity and volume, by reducing the detection time.

• Providing a novel approach that executes preprocessing and classification at the same time as parallel jobs, rather than sequentially as is usually done in previous works. This approach improves the detection time, as shown in Table 4.

• Dealing with big data in a secure way by using Docker technology.

• Taking advantage of well-known ML algorithms in the detection task and providing a comparison between them.

Experimental results show that our solution outperforms state-of-the-art solutions in terms of speed (5.76 sec) and accuracy (99%).

In the future, we plan to use real network traffic instead of a dataset. We also aim to run other efficient algorithms within Spark that do not exist in Spark’s MLlib, such as algorithms that use neural networks.

REFERENCES

Adminer (2022). https://hub.docker.com/_/adminer.

Al-Rawi, A. A. (2019). Intrusion detection system using apache spark analytic system.

Anderson, J. (1980). Computer security threat monitoring and surveillance. Technical report, James P. Anderson Company, Fort Washington.

Awan, M. J., Farooq, U., Babar, H. M. A., Yasin, A., Nobanee, H., Hussain, M., Hakeem, O., and Zain, A. M. (2021). Real-time ddos attack detection system using big data approach. Sustainability, 13(19).

Bagui, S., Jason, S., Russell, P., Bennett, T. A., and Subhash, B. (2021). Classifying unsw-nb15 network traffic in the big data framework using random forest in spark. International Journal of Big Data Intelligence and Applications, 2.


Table 5: Comparison of experimental results of MLlib’s algorithms.

Metric | Random Forest | Decision Tree | Naive Bayes | Logistic Regression

F1-measure | 0.99 | 0.95 | 0.39 | 0.95

accuracy | 0.99 | 0.95 | 0.43 | 0.95

weightedPrecision | 0.99 | 0.95 | 0.71 | 0.95

weightedRecall | 0.99 | 0.95 | 0.43 | 0.95

weightedTruePositiveRate | 0.99 | 0.95 | 0.43 | 0.95

weightedFalsePositiveRate | 0.01 | 0.03 | 0.18 | 0.04

weightedFMeasure | 0.99 | 0.95 | 0.39 | 0.95

truePositiveRateByLabel | 1.00 | 0.98 | 0.16 | 0.98

falsePositiveRateByLabel | 0.01 | 0.06 | 0.00 | 0.07

precisionByLabel | 0.99 | 0.95 | 1.00 | 0.93

recallByLabel | 1.00 | 0.98 | 0.16 | 0.98

fMeasureByLabel | 0.99 | 0.96 | 0.28 | 0.96

logLoss | 0.04 | 0.17 | 19.29 | 0.16

hammingLoss | 0.01 | 0.05 | 0.57 | 0.05

Bitnami (2022). https://hub.docker.com/r/bitnami/spark.

Docker (2013). https://spark.docker.org/.

Hassan, M. M., Gumaei, A. H., Alsanad, A., Alrubaian, M., and Fortino, G. (2020). A hybrid deep learning model for efficient intrusion detection in big data environment. Inf. Sci., 513:386–396.

Jemili, F. and Bouras, H. (2021). Intrusion detection based on big data fuzzy analytics. In Kakulapati, V., editor, Open Data, chapter 4. IntechOpen, Rijeka.

Khan, M. Y., Qayoom, A., Nizami, M., Siddiqui, M. S., Wasi, S., and Syed, K.-U.-R. R. (2021). Automated prediction of good dictionary examples (gdex): A comprehensive experiment with distant supervision, machine learning, and word embedding-based deep learning techniques. Complexity.

Kurt, E. M. and Becerikli, Y. (2018). Network intrusion detection on apache spark with machine learning algorithms. In Pimenidis, E. and Jayne, C., editors, Engineering Applications of Neural Networks, pages 130–141, Cham. Springer International Publishing.

Liu, Z., Su, N., Qin, Y., Lu, J., and Li, X. (2020). A deep random forest model on spark for network intrusion detection. Mobile Information Systems, 2020:1–16.

Louati, F. and Ktata, F. (2020). A deep learning-based multi-agent system for intrusion detection. SN Applied Sciences, 2.

Louati, F., Ktata, F., and Ben Amor, I. (2022). A distributed intelligent intrusion detection system based on parallel machine learning and big data analysis. In Proceedings of the 11th International Conference on Sensor Networks - SENSORNETS, pages 152–157. INSTICC, SciTePress.

Mohamed, A., Mouhammd, A., Mohammad, A., and Muhannad, M. (2022). An accurate iot intrusion detection framework using apache spark.

Ouhssini, M., Afdel, K., Idhammad, M., and Agherrabi, E. (2021). Distributed intrusion detection system in the cloud environment based on apache kafka and apache spark. In 2021 Fifth International Conference On Intelligent Computing in Data Sciences (ICDS), pages 1–6.

Postgres (2022). https://hub.docker.com/_/postgres.

Spark (2014). https://spark.apache.org/.

Tavallaee, M., Bagheri, E., Lu, W., and Ghorbani, A. A. (2009). A detailed analysis of the kdd cup 99 data set. In 2009 IEEE Symposium on Computational Intelligence for Security and Defense Applications, pages 1–6.

Terzi, D. S., Terzi, R., and Sagiroglu, S. (2017). Big data analytics for network anomaly detection from netflow data. In 2017 International Conference on Computer Science and Engineering (UBMK), pages 592–597.

Vimalkumar, K. and Radhika, N. (2017). A big data framework for intrusion detection in smart grids using apache spark. In 2017 International Conference on Advances in Computing, Communications and Informatics (ICACCI), pages 198–204.

Yiu, T. (2019). Understanding random forest. https://towardsdatascience.com/understanding-random-forest-58381e0602d2.
