Detecting IoT Botnet Formation using Data Stream Clustering Algorithms

Gabriel de Carvalho Arimatéa and Admilson de Ribamar Lima Ribeiro

Federal University of Sergipe, Av. Marechal Rondon, s/n, Aracaju, Brazil

Keywords:

Internet of Things, Botnet, Machine Learning, Security.

Abstract:

The Internet of Things has gained much importance due to its applicability to many ecosystems in day-to-day use. However, these embedded systems have several hardware constraints, and the security of these devices has been neglected. Consequently, botnet malware has taken advantage of the poor security schemes on these devices. This paper proposes unsupervised machine learning over data streams to detect botnet formation at the edge of the network. The algorithm achieves an average accuracy of 98.43% and takes about 20.07 ms to evaluate each sample from the stream, making it reliable and fast even on a constrained device such as the Raspberry Pi 3 B+.

1 INTRODUCTION

The use of a botnet to launch malicious attacks is one of the main concerns in network security due to its destructive potential. That potential comes from its spreading and infection capabilities, which create and expand a "zombie army" versatile enough to launch many different attacks (Kambourakis et al., 2017). These botnets can be formed by a mix of device types, ranging from small, simple embedded systems to very capable desktops. At first, the main targets were traditional desktops and notebooks, since they were the mainstream platform worldwide and therefore the most attractive for large-scale scenarios. EarthLink Spammer (2000), Storm (2007), Cutwail (2007), Grum (2008), Kraken (2008), and Mariposa (2008) were the most famous botnets from that time.

In recent years, the target of this infection has shifted to Internet of Things (IoT) scenarios. This change is due to the ubiquity of IoT, its low level of security, and its distributed nature, which make it easier to launch attacks that risk damaging both the network and the devices themselves.

Because of this ubiquity and the ever-growing daily use of IoT, botnets have shifted their interest to those networks. Because botnets serve as platforms for countless dangerous attacks, detecting their formation is essential to the network security of IoT.

https://orcid.org/0000-0003-1725-5934

https://orcid.org/0000-0003-2010-6024

Although some variants of botnet malware are well known, very little research considers a dynamic scenario that accounts, for example, for the evolution of this malware or for new variants. Traditional strategies, which generally use supervised machine learning algorithms, are not the best suited, since they need prior knowledge of the attack in the dataset used to train the models. Besides, most of these solutions run in the cloud, which increases the time to respond to detected threats.

In this paper, we propose the use of DenStream, an unsupervised machine learning algorithm, on an edge device (instead of the cloud) to detect the formation of botnets, improving IoT network security. The goal is to effectively detect the malicious spam attack released by the malware in its attempt to take control of unprotected devices.

The rest of the paper is organized as follows: Section 2 discusses related work; Section 3 shows the impact of IoT botnets on Internet attacks and their relevance; Section 4 explains why an unsupervised approach was chosen; Section 5 presents the proposed solution; Section 6 presents the methodology, metrics, and experimental evaluation; Section 7 presents the conclusions and future work.

Arimatéa, G. and Ribeiro, A.

Detecting IoT Botnet Formation using Data Stream Clustering Algorithms.

DOI: 10.5220/0010180903950402

In Proceedings of the 16th International Conference on Web Information Systems and Technologies (WEBIST 2020), pages 395-402

ISBN: 978-989-758-478-7


2 RELATED WORK

A systematic review was carried out to verify which machine learning algorithms have been used, and what has been done, for network security in restricted environments (such as Wireless Sensor Networks or IoT) using data streams. It showed that almost no work has addressed security under these conditions.

Many works (Zhao, 2005; Amza et al., 2011; Kapoor and Dhavale, 2016; Kanoun et al., 2016; Roopaei et al., 2017; Axenie et al., 2018; Afghah et al., 2018; Bhattacharyya et al., 2018) did not address the constraints of restricted environments, such as processing time and memory. Donovan's work (Donovan et al., 2018), although set in a nearly restricted environment (fog computing), was not concerned with security and did not use any machine learning approach, making it not very suitable for a dynamic scenario.

Dey's paper (Dey et al., 2016) on determining room occupancy using Random Forest and sensors showed great accuracy, but did not consider time or resource usage. Since the Random Forest algorithm generates many combined decision trees, it can consume many of the device's resources.

Li's proposal (Li, 2014), based on Local and Global Consistency (LGC) and Support Vector Machine (SVM), aimed to classify sensor data streams. The author shows that, due to its supervised nature, the patterns must be known beforehand to be identified correctly. The paper also notes the high cost of constantly retraining the model.

A genetic approach (Schmidt et al., 2014) was also considered, but it is slower than algorithms such as SVM and Naive Bayes, and achieving good accuracy requires increasing the number of antibodies, which increases computational cost and memory use. It also did not consider restricted environments.

The AMWR proposal (Akbar et al., 2015; Akbar et al., 2017), based on a moving window technique for analyzing data stream flows, is a supervised algorithm, so all initial data must be labelled before training. Besides, the authors state that the algorithm could have an almost real-time response, but they measured neither response time nor computational cost to evaluate its behaviour in restricted environments.

The Vertical Hoeffding Tree (Kourtellis et al., 2016), used in Elkhoukhi's work (Elkhoukhi et al., 2018), relies on parallelism. It showed promising results for occupancy detection in the experiments of that paper, but outside the scope of security.

Singh (Singh et al., 2013), working in a Wireless Sensor Network (WSN) scenario, combines CluStream on the distributed part with a Support Vector Machine (SVM) on the centralized part to detect forest fires. CluStream was used for its ability to deal with large data streams by distributing the data. The centralized part then uses SVM to process the groups formed and predict whether or not a fire exists.

P-DenStream (Lu et al., 2018) is a variation of the original DenStream that allows parallelization and a better distribution of load and operational cost. The parallelization occurs in the initial part, which consists of creating the micro-clusters and assigning their weights. Due to the nature of DenStream, it appears to be a great candidate for restricted environments such as embedded systems; the authors claim a reduction from 10 seconds with DenStream to 200 milliseconds with P-DenStream.

FlockStream (Spezzano and Vinci, 2015) has conceptual similarities with DenStream and is also a great candidate for restricted environments. It was proposed for enormous volumes of stream data while focusing on low-memory systems. It is an unsupervised algorithm (it needs no external supervision) and has been tested in a WSN.

The paper by Meidan (Meidan et al., 2018) was also considered. It proposes autoencoders that can differentiate benign from malicious data flows in IoT botnet attacks. The autoencoders are used as an anomaly detector, setting up the first clusters and then using them to evaluate new samples as outliers or normal.

All these papers are important in their areas, but using their approaches to detect botnet formation has some caveats. These caveats were considered when proposing a new approach to network security in the IoT scenario.

Although the papers by Spezzano (Spezzano and Vinci, 2015), Lu (Lu et al., 2018) and Singh (Singh et al., 2013) made impactful contributions, none of them was used in real restricted environments such as IoT networks, nor considered for a security application. Spezzano (Spezzano and Vinci, 2015) did not present any experiment or measurement that evaluates the approach in IoT or any other restricted environment.

Although P-DenStream (Lu et al., 2018) and the decentralized CluStream part of Singh's paper (Singh et al., 2013) have a distributed nature that aims to better divide the processing load, neither work evaluated memory and processing usage. Also, like Spezzano's (Spezzano and Vinci, 2015), they did not consider restricted environments or security.

DMMLACS 2020 - 1st International Special Session on Data Mining and Machine Learning Applications for Cyber Security

The work proposed by Meidan (Meidan et al., 2018) is used on an IoT network and considers a security application, but it uses a neural network algorithm, which makes it neither a lightweight nor a fast alternative and therefore a better candidate for a cloud application. Running in the cloud, however, makes this approach more susceptible to latency and communication issues, such as packet loss or not having an Internet connection at all.

Based on those points, this work uses DenStream as a lightweight unsupervised machine learning algorithm that creates clusters of normal behaviour and then acts as an anomaly detection algorithm (similar to Meidan's approach (Meidan et al., 2018)). The anomalies are treated as malicious traffic, allowing a dynamic response to threats. Besides, as a lightweight data stream processing algorithm, it consumes fewer resources and gives faster responses, enabling many low-cost single-board computers to run it effectively.

3 IoT BOTNET IN INTERNET ATTACKS

In IoT, thanks to its ubiquity, if many devices are contaminated and taken over, they form a botnet with great potential to launch devastating attacks. According to Dietz et al. (Dietz et al., 2018), botnets are networks of devices infected with malware, allowing a malicious actor to control them remotely. Especially in IoT, due to weak security schemes and improper configuration at installation, many devices are not well protected and serve as an entry point for contamination. The steps for forming a botnet and launching attacks (which can be seen in Figure 1) are (Dietz et al., 2018):

1. Scan for open ports;
2. Launch brute-force attacks to gain access to IoT devices;
3. Eliminate possible competitors for control of the device;
4. Connect the device to the Command and Control (C&C);
5. Download and execute malicious scripts;
6. Spread to and attack other vulnerable devices;
7. Launch attacks from the C&C.

Some malware specializes in building these botnets. Bashlite (Marzano et al., 2018) was one of the first botnets aimed at controlling IoT devices. Its spread starts with the initial device trying to connect to public IP addresses through a Telnet scan. It then tries to authenticate on the devices found, using the most common ports and credentials. Once successful, the compromised device serves as a source of requests in the next attack. Due to its simplicity, however, Bashlite demands more effort from the attacker to set up the malware and the C&C.

Figure 1: Botnet life cycle (Dietz et al., 2018).

Mirai was discovered in August 2016. It creates botnets for DDoS attacks out of DVRs, web IP cameras and low-security Linux servers (Kambourakis et al., 2017). It has the same structure as Bashlite. The few differences are capabilities that Mirai has natively and that Bashlite offers only through extensions, such as tools for contaminated devices to search for more vulnerable devices, the use of DNS for attacks, and a binary protocol for messages that makes its discovery more difficult. The process and agents involved in a Mirai attack can be seen in Figure 2.

Figure 2: Steps and agents involved in a Mirai attack (Kolias et al., 2017).

Once a botnet is deployed, a plethora of attacks can be launched from it. It serves as a platform for other attacks, such as spam-advertised pharmaceuticals, theft of bank credentials, advertisement click fraud and distributed denial of service (DDoS) (Putman et al., 2018). Kanich et al. (Kanich et al., 2011) wrote about a spam-advertised pharmaceuticals exploit; according to their article, a botnet can earn almost $3.5M per year. These attacks send spam emails selling counterfeit medicine, with Viagra and diet pills being the most common.

Click fraud is another known use of a botnet. Once deployed, the botnet fakes clicks on advertisements, allowing the host to profit from those counterfeit clicks. Despite its simplicity, initiatives such as WhiteOps stated that a group based in Russia profits "$3 to $5 million in counterfeit inventory per day by targeting the premium video advertising ecosystem".

DDoS attacks aim to deplete the resources of a server, leading to an interruption of service. According to Marzano et al. (Marzano et al., 2018), this attack targets servers or network devices. Famous recorded attacks were those against Krebs on Security (reaching 620 Gb/s) and Ars Technica (reaching 1 Tb/s).

4 UNSUPERVISED X SUPERVISED MACHINE LEARNING

What differentiates an unsupervised from a supervised machine learning algorithm is the need to classify the data beforehand in order to train the algorithm. The unsupervised approach needs no previous categorization and groups data with similar characteristics.

This type of algorithm is better suited to scenarios with little or no knowledge of the groups in the dataset. In IoT botnet scenarios, infections are always adapting and evolving, and zero-day attacks make a supervised approach slow to react. Therefore, an unsupervised algorithm was chosen for this approach.

A downside of unsupervised algorithms is that they are less accurate and more susceptible to outliers than supervised algorithms. Besides, in an unsupervised approach, the more features supplied to the algorithm, the better, but more features also increase processing time and resource usage. To overcome these difficulties, this approach uses preprocessing techniques to reduce or eliminate outliers in the training data and to identify the most valuable features.

5 PROPOSED SOLUTION

5.1 DenStream

An unsupervised algorithm was selected for this approach because it needs no prior knowledge and adapts better to evolving behaviour. Only algorithms that accept a data stream as input were considered. These characteristics were chosen because of the massive and ever-growing data flow in IoT, which makes it difficult, if not impossible, to handle all the data at once. Taking a data stream as input makes the algorithms more efficient and gives them a small footprint, which makes them lightweight and fast.

The DenStream algorithm was proposed by Cao et al. (Cao et al., 2006) as a specialized unsupervised clustering algorithm for data mining that adapts to evolving streams with low memory cost. These characteristics are the reasons it was chosen for the proposed solution.

The algorithm has two distinct phases: offline and online. In the offline phase, DBSCAN (Ester et al., 1996) is run to discover the first clusters, which represent the common, benign behaviour of the network. The parameters needed are:
• epsilon: the maximum distance between two points for them to be considered similar;
• minPoints: the minimum number of similar points needed to create a cluster from them.
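To make the role of the two parameters concrete, the following is a minimal pure-Python sketch of DBSCAN. It is an illustration only and not the implementation used in the experiments; the names `region_query` and `dbscan` and the sample points are assumptions chosen for this example.

```python
from math import dist


def region_query(points, i, epsilon):
    """Indices of all points within epsilon of points[i] (including i itself)."""
    return [j for j, q in enumerate(points) if dist(points[i], q) <= epsilon]


def dbscan(points, epsilon, min_points):
    """Return one label per point: 0, 1, ... for clusters and -1 for noise."""
    labels = [None] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbours = region_query(points, i, epsilon)
        if len(neighbours) < min_points:
            labels[i] = -1              # provisional noise, may become a border point
            continue
        cluster += 1                    # points[i] is a core point: start a new cluster
        labels[i] = cluster
        queue = list(neighbours)
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster     # noise reachable from a core point is a border point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbours = region_query(points, j, epsilon)
            if len(j_neighbours) >= min_points:
                queue.extend(j_neighbours)   # expand only through core points
    return labels


# Hypothetical 2-D samples: two dense groups and one isolated point.
samples = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (50, 50)]
labels = dbscan(samples, epsilon=1.5, min_points=3)
```

In this sketch a larger epsilon merges groups more readily, while a larger minPoints turns more samples into noise.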

Ozkok's paper (Ozkok, 2017) proposed an improvement to DBSCAN that estimates the best value of epsilon from the minimum number of points passed as a parameter. The algorithm thus makes this decision automatically and depends on only one parameter. This improvement was applied to make the parameters easier to adjust. In the offline phase, 100 benign samples from each device were used to extract the PCA parameters and the epsilon value.
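As an illustration of deriving epsilon from the minimum number of points alone, the sketch below uses the common k-distance heuristic: it takes a high quantile of each sample's distance to its k-th nearest neighbour. This is an assumption made for the example and is not necessarily the estimation rule of (Ozkok, 2017); the function names and the `quantile` parameter are hypothetical.

```python
from math import dist


def k_distances(points, k):
    """Sorted distance from each point to its k-th nearest neighbour."""
    result = []
    for i, p in enumerate(points):
        others = sorted(dist(p, q) for j, q in enumerate(points) if j != i)
        result.append(others[k - 1])
    return sorted(result)


def estimate_epsilon(points, min_points, quantile=0.9):
    """Pick epsilon as a high quantile of the k-distance curve (k = min_points - 1)."""
    curve = k_distances(points, max(1, min_points - 1))
    index = min(len(curve) - 1, int(quantile * len(curve)))
    return curve[index]


# Hypothetical one-dimensional samples spaced one unit apart.
samples = [(0.0,), (1.0,), (2.0,), (3.0,), (4.0,)]
epsilon = estimate_epsilon(samples, min_points=3)
```

The sketch needs more samples than `min_points`; with the 100 benign samples per device mentioned above, that condition holds.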

After the first clusters are formed in the offline phase, the algorithm receives new samples and tries to insert each one into an existing group. If that is not possible, the sample is labelled as an outlier and stored for a while. If other samples turn out to be similar to an outlier, they form outlier clusters, which lets the model adapt to new scenarios. The code was written in Python using the scikit-learn package, and it is available in a public repository.
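The online step described above can be sketched as follows. This is a simplified, hedged reading of DenStream's micro-cluster update, not the repository's code: the class `MicroCluster`, the function `process_sample`, the decay rate `lam` and the `promote_weight` threshold are all assumptions introduced for illustration.

```python
from math import dist, sqrt


class MicroCluster:
    """Decayed summary of nearby samples: weight, linear sum and squared sum."""

    def __init__(self, point, now):
        self.weight = 1.0
        self.linear = list(point)
        self.squared = [x * x for x in point]
        self.updated = now

    def decay(self, now, lam):
        # Older samples fade by a factor of 2 ** (-lam * elapsed_time).
        factor = 2.0 ** (-lam * (now - self.updated))
        self.weight *= factor
        self.linear = [v * factor for v in self.linear]
        self.squared = [v * factor for v in self.squared]
        self.updated = now

    def center(self):
        return [v / self.weight for v in self.linear]

    def radius_with(self, point):
        """Radius the cluster would have after absorbing `point`."""
        w = self.weight + 1.0
        ls = [a + b for a, b in zip(self.linear, point)]
        ss = [a + b * b for a, b in zip(self.squared, point)]
        variance = sum(s / w - (l / w) ** 2 for s, l in zip(ss, ls))
        return sqrt(max(variance, 0.0))

    def absorb(self, point):
        self.weight += 1.0
        self.linear = [a + b for a, b in zip(self.linear, point)]
        self.squared = [a + b * b for a, b in zip(self.squared, point)]


def process_sample(point, now, normal, outliers, epsilon, lam=0.01, promote_weight=3.0):
    """Return 'benign' if the sample fits a normal cluster, otherwise 'outlier'."""
    for group in (normal, outliers):
        for mc in group:
            mc.decay(now, lam)
    for group, verdict in ((normal, "benign"), (outliers, "outlier")):
        if group:
            nearest = min(group, key=lambda mc: dist(mc.center(), point))
            if nearest.radius_with(point) <= epsilon:
                nearest.absorb(point)
                if group is outliers and nearest.weight >= promote_weight:
                    outliers.remove(nearest)
                    normal.append(nearest)   # the outlier cluster became dense enough
                return verdict
    outliers.append(MicroCluster(point, now))    # stored for a while as a new outlier
    return "outlier"
```

A sample is tried first against the clusters of normal behaviour and then against the stored outliers, which mirrors the order described in the text; how promoted clusters should be treated by the detector is left open in this sketch.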

To improve the overall accuracy of this proposal, a preprocessing phase was also applied before the offline and online phases.

5.2 Preprocessing Phase

Local Outlier Factor (LOF) was applied as a preprocessing algorithm to reduce noise in the benign traffic contained in the dataset.

LOF was proposed in 2000 by Breunig et al. (Breunig et al., 2000). It uses the concepts of "core distance" and "reachability distance", also present in DBSCAN and DenStream, to determine outliers. This algorithm tries to eliminate noise from the benign dataset, allowing a better characterization of the data at the training phase. However, because LOF removes that noise, some outliers can later be misjudged by the algorithm.

https://github.com/gabriel-arimatea/unsupervised-ml-botnet

Figure 3: Architecture proposed by Meidan et al. 2018.
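The LOF filtering of the benign training data can be illustrated with the following simplified pure-Python sketch. It ignores ties in the k-nearest-neighbour set and assumes no duplicate samples; the function names, `k` and the `threshold` of 1.5 are assumptions for this example, not values taken from the experiments.

```python
from math import dist


def lof_scores(points, k):
    """Local Outlier Factor of every point, using its k nearest neighbours."""
    n = len(points)
    neighbours, k_dist = [], []
    for i in range(n):
        ranked = sorted((dist(points[i], points[j]), j) for j in range(n) if j != i)[:k]
        neighbours.append([j for _, j in ranked])
        k_dist.append(ranked[-1][0])            # distance to the k-th neighbour

    def reach_dist(i, j):
        # Reachability distance of i from j: never smaller than j's k-distance.
        return max(k_dist[j], dist(points[i], points[j]))

    # Local reachability density: inverse of the mean reachability distance.
    lrd = []
    for i in range(n):
        mean_reach = sum(reach_dist(i, j) for j in neighbours[i]) / k
        lrd.append(1.0 / mean_reach)

    # LOF: how much denser the neighbours are than the point itself.
    return [sum(lrd[j] for j in neighbours[i]) / (k * lrd[i]) for i in range(n)]


def remove_outliers(points, k=3, threshold=1.5):
    """Keep only samples whose LOF does not exceed the threshold."""
    scores = lof_scores(points, k)
    return [p for p, s in zip(points, scores) if s <= threshold]


# Hypothetical benign samples with one noisy reading.
samples = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (10.0, 10.0)]
cleaned = remove_outliers(samples, k=2)
```

Scores close to 1 indicate a sample as dense as its neighbours; much larger scores mark the noise that is dropped before training.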

Principal Component Analysis (PCA) was also applied to reduce dimensionality, making the algorithm more efficient and reducing processing time. PCA was proposed in 1901 by Pearson. It receives a set of correlated variables and transforms them into linearly uncorrelated variables, producing fewer variables that carry more value for the algorithms and eliminating the redundancy caused by correlation.
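As a small illustration of how PCA collapses correlated variables, the sketch below finds only the first principal component by power iteration on the covariance matrix. This is a didactic stand-in, assuming non-degenerate data, and not the PCA routine used in the experiments; the function names and sample rows are hypothetical.

```python
from math import sqrt


def first_principal_component(rows, iterations=200):
    """Column means and the unit vector along the direction of greatest variance."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[c] for r in rows) / n for c in range(d)]
    centred = [[r[c] - means[c] for c in range(d)] for r in rows]
    cov = [[sum(r[a] * r[b] for r in centred) / (n - 1) for b in range(d)]
           for a in range(d)]
    vec = [1.0] * d
    for _ in range(iterations):
        # Repeated multiplication by the covariance matrix converges to the
        # eigenvector with the largest eigenvalue.
        vec = [sum(cov[a][b] * vec[b] for b in range(d)) for a in range(d)]
        norm = sqrt(sum(v * v for v in vec))
        vec = [v / norm for v in vec]
    return means, vec


def project(row, means, component):
    """Coordinate of one sample along the component (a 1-D reduced feature)."""
    return sum((x - m) * c for x, m, c in zip(row, means, component))


# Two perfectly correlated hypothetical features: the second is twice the first.
rows = [(0.0, 0.0), (1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]
means, component = first_principal_component(rows)
reduced = [project(r, means, component) for r in rows]
```

Here the two redundant columns are replaced by a single coordinate per sample without losing any variance, which is the effect the preprocessing relies on.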

5.3 Architecture

Using the dataset made available by Meidan's paper (Meidan et al., 2018), a hypothetical architecture (seen in Figure 4) was developed based on Meidan's architecture, seen in Figure 3. The proposed architecture made the use of the dataset possible.

This architecture uses a message broker in publish/subscribe mode to store the data stream until it is processed by the device. This ensures that every sample of the data stream is processed.
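The buffering role of the broker can be sketched with Python's standard library. In this hedged example, `queue.Queue` stands in for the publish/subscribe channel, and `run_pipeline` and the `classify` callback are hypothetical names; a real deployment would use an actual message broker instead.

```python
import queue
import threading


def run_pipeline(samples, classify):
    """Publish samples to a buffer and classify them in arrival order."""
    buffer = queue.Queue()              # stand-in for the broker channel
    verdicts = []

    def consumer():
        while True:
            sample = buffer.get()       # blocks until a sample is available
            if sample is None:          # sentinel: the producer has finished
                break
            verdicts.append(classify(sample))

    worker = threading.Thread(target=consumer)
    worker.start()
    for sample in samples:              # the capture side publishes without waiting
        buffer.put(sample)
    buffer.put(None)
    worker.join()
    return verdicts


# Hypothetical classifier: values above 3 are treated as outliers.
result = run_pipeline([1, 5, 2], lambda x: "outlier" if x > 3 else "benign")
```

Because the buffer holds samples until the consumer is ready, a slow classification step delays samples but does not drop them.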

6 EXPERIMENTAL EVALUATION

6.1 Dataset

The dataset used was proposed in Meidan's paper (Meidan et al., 2018). In that work, the devices in the architecture shown in Figure 3 were infected with two major botnet variants: Mirai and Bashlite. The resulting dataset contains benign traffic and five types of attacks executed from the botnets. Of these attacks, only the spam attack used to infect other devices was considered, since it is the only attack used to form the botnet.

Figure 4: Proposed architecture.

The dataset was created from four characteristics of the stream packets, aggregated by different sources, as shown in Table 1. Statistics applied to these characteristics generate 23 features. Computing these 23 features for each of five decay factors of a damped window yields 115 features in total.


Table 1: Meidan's dataset features.

| Characteristics | Statistic | Aggregated by | Number of Features |
|---|---|---|---|
| Outbound Packet size | Mean, Variance | Source IP and MAC-IP, Channel, Socket | 8 |
| Packet count | Number | Source IP and MAC-IP, Channel, Socket | 4 |
| Packet jitter | Mean, Variance, Number | Channel | 3 |
| All Packet size | Magnitude, Radius, Covariance, Correlation coefficient | Channel, Socket | 8 |
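To illustrate what a statistic over a damped window looks like, the sketch below keeps a running mean and variance in which older observations fade exponentially. The decay form 2^(-decay × elapsed time) and the class name `DampedStats` are assumptions made for this example; the dataset's exact decay values are not given here and are not guessed.

```python
class DampedStats:
    """Running mean and variance in which old observations fade over time."""

    def __init__(self, decay):
        self.decay = decay          # larger values forget the past faster
        self.weight = 0.0           # decayed number of observations
        self.linear = 0.0           # decayed sum of values
        self.squared = 0.0          # decayed sum of squared values
        self.last_time = None

    def update(self, value, timestamp):
        if self.last_time is not None:
            factor = 2.0 ** (-self.decay * (timestamp - self.last_time))
            self.weight *= factor
            self.linear *= factor
            self.squared *= factor
        self.weight += 1.0
        self.linear += value
        self.squared += value * value
        self.last_time = timestamp

    def mean(self):
        return self.linear / self.weight

    def variance(self):
        return abs(self.squared / self.weight - self.mean() ** 2)


# Feature count implied by Table 1: 23 statistics per decay factor.
STATS_PER_DECAY = 8 + 4 + 3 + 8
DECAY_FACTORS = 115 // STATS_PER_DECAY

# Hypothetical packet sizes observed one time unit apart.
packet_sizes = DampedStats(decay=1.0)
packet_sizes.update(2.0, timestamp=0)
packet_sizes.update(4.0, timestamp=1)
```

With `decay=1.0` the first packet counts only half as much as the second, so the mean is pulled towards the most recent value; with `decay=0.0` the class reduces to ordinary running statistics.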

6.2 Metrics

Random samples of benign data and malicious spam attack data were supplied to the generated model to evaluate its capability, and some characteristics were analyzed. In this scenario, the objective is to block all malicious data (labelled as positive) and pass all benign data (labelled as negative). Given these labels, metrics such as the true positive and true negative rates are essential.

Together with the false positive and false negative rates (which are their complements), these metrics are derived from the confusion matrix, which also yields further metrics such as accuracy. Also, given how fast a response must be to stop propagation, the time to create the first clusters and the time to analyze each new sample are essential. The metrics used to evaluate the results are:

• True Negative Rate (TNR): benign data classified as benign;
• True Positive Rate (TPR): malicious data classified as malicious;
• False Negative Rate (FNR): malicious data classified as benign;
• False Positive Rate (FPR): benign data classified as malicious;
• Accuracy;
• Offline training time: time used to train the model;
• Online classification time: time to analyse a new sample.

The algorithm was ported to a Raspberry Pi 3 B+, which ran the algorithm together with a Redis instance in a publish/subscribe configuration to store and serve the data collected by the Wireshark sniffer shown in the architecture figure. The evaluation consisted of 40 rounds with random data from the dataset, using all nine devices, with 100 random samples from each device for offline training and a balanced dataset to test the final model generated by the algorithm.

6.3 Results

The confusion matrix obtained by summing all executions can be seen in Table 2. The accuracy of each run is shown in Figure 5.

Figure 5: Accuracy between runs (x-axis: runs, 0 to 40; y-axis: accuracy).

The results fall in the following ranges:

• TNR: between 97.52% and 100% (mean of 99%);
• TPR: between 94.41% and 99.87% (mean of 97.85%);
• FNR: between 0.13% and 5.59% (mean of 2.15%);
• FPR: between 0% and 2.48% (mean of 1%);
• Accuracy: between 97.2% and 99.15% (mean of 98.43%, see Figure 5);
• Offline training time: mean of 45 ms;
• Online classification time: between 17.99 ms and 23.96 ms (mean of 20.07 ms).

Table 2: Confusion matrix considering all rounds.

| Prediction | Original classification: Benign | Original classification: Malicious |
|---|---|---|
| Benign | 337,303 | 7,315 |
| Malicious | 3,401 | 333,623 |
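As a consistency check, the rates can be recomputed from the aggregated counts in Table 2, taking malicious as the positive class. The function name `rates` is chosen for this example; note that these are rates over the summed matrix, whereas the reported means are averages over the individual runs.

```python
def rates(tp, fn, tn, fp):
    """Standard confusion-matrix rates for a binary detector."""
    return {
        "TPR": tp / (tp + fn),          # malicious classified as malicious
        "FNR": fn / (tp + fn),          # malicious classified as benign
        "TNR": tn / (tn + fp),          # benign classified as benign
        "FPR": fp / (tn + fp),          # benign classified as malicious
        "accuracy": (tp + tn) / (tp + fn + tn + fp),
    }


# Counts from Table 2, with "malicious" as the positive class.
summary = rates(tp=333_623, fn=7_315, tn=337_303, fp=3_401)
percentages = {name: round(value * 100, 2) for name, value in summary.items()}
```

Rounded to two decimals, the aggregated rates agree with the reported means: 97.85% TPR, 99% TNR and 98.43% accuracy.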

For the autoencoder proposed by Meidan (Meidan et al., 2018), the metrics reported in that paper are the true positive rate, the false positive rate and the time to detect the infection. The two sets of results are compared in Table 3.

Table 3: Comparison between algorithms.

| Metrics | (Meidan et al., 2018) | DenStream (mean) | DenStream (range) |
|---|---|---|---|
| TPR | 100% | 97.85% | 94.43% ∼ 99.87% |
| FPR | 0.007% ∼ 0.01% | 1% | 0.01% ∼ 2.48% |
| Time | 174 ∼ 212 ms | 20.07 ms | 17.99 ∼ 23.96 ms |

The proposed algorithm is less accurate than Meidan's autoencoder but still achieves a high percentage. In comparison, however, DenStream is much faster, taking almost 90% less time to differentiate benign from malicious data. Since the autoencoder is a neural network, it has a much costlier footprint than DenStream and also needs much more data to train.
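The "almost 90% less time" figure can be checked against the times in Table 3. The helper name `time_reduction` is an assumption for this short example.

```python
def time_reduction(baseline_ms, candidate_ms):
    """Fraction of the baseline time saved by the candidate."""
    return 1.0 - candidate_ms / baseline_ms


# Mean DenStream time compared with both ends of the autoencoder's range.
reduction_vs_fastest = time_reduction(174, 20.07)
reduction_vs_slowest = time_reduction(212, 20.07)
```

The reduction lies between roughly 88% and 91%, depending on which end of the autoencoder's reported range is used as the baseline.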

7 CONCLUSION

This paper showed that more lightweight algorithms, such as DenStream, can be great candidates for detecting botnet formation, making it possible to run the detection on simpler, low-cost devices such as the Raspberry Pi 3 B+ used in the experiment. It also showed that, thanks to its light and efficient way of handling training and prediction, the algorithm can respond to a threat much sooner.

This paper used DenStream as the unsupervised machine learning algorithm, but CluStream appeared to be an option as well. As future work, the approach will be tested with CluStream to verify which one is more effective for the problem.

Applications of the algorithm will also be studied; it could be ported to a specialized IoT device or inserted in an SDN context. For this, an analysis of the minimum hardware requirements for it to perform well has to be made. Possible measures to apply when the algorithm detects an attack will also be studied.

REFERENCES

Afghah, F., Cambou, B., Abedini, M., and Zeadally, S. (2018). A ReRAM Physically Unclonable Function (ReRAM PUF)-Based Approach to Enhance Authentication Security in Software Defined Wireless Networks. International Journal of Wireless Information Networks, 25(2):117–129.

Akbar, A., Carrez, F., Moessner, K., and Zoha, A. (2015). Predicting complex events for pro-active IoT applications. In 2015 IEEE 2nd World Forum on Internet of Things (WF-IoT), pages 327–332.

Akbar, A., Khan, A., Carrez, F., and Moessner, K. (2017). Predictive analytics for complex IoT data streams. IEEE Internet of Things Journal, 4(5):1571–1582.

Amza, C., Leordeanu, C., and Cristea, V. (2011). Hybrid network intrusion detection. In 2011 IEEE 7th International Conference on Intelligent Computer Communication and Processing, pages 503–510.

Axenie, C., Tudoran, R., Bortoli, S., Al Hajj Hassan, M., Foroni, D., and Brasche, G. (2018). Starlord: Sliding window temporal accumulate-retract learning for online reasoning on datastreams. In 2018 17th IEEE International Conference on Machine Learning and Applications (ICMLA), pages 1115–1122.

Bhattacharyya, S., Katramatos, D., and Yoo, S. (2018). Why wait? Let us start computing while the data is still on the wire. Future Generation Computer Systems, 89:563–574.

Breunig, M. M., Kriegel, H.-P., Ng, R. T., and Sander, J. (2000). LOF: Identifying density-based local outliers. In Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data - SIGMOD '00. ACM Press.

Cao, F., Ester, M., Qian, W., and Zhou, A. (2006). Density-based clustering over an evolving data stream with noise. In Proceedings of the 2006 SIAM International Conference on Data Mining. Society for Industrial and Applied Mathematics.

Dey, A., Ling, X., Syed, A., Zheng, Y., Landowski, B., Anderson, D., Stuart, K., and Tolentino, M. E. (2016). Namatad: Inferring occupancy from building sensors using machine learning. In 2016 IEEE 3rd World Forum on Internet of Things (WF-IoT). IEEE.

Dietz, C., Castro, R. L., Steinberger, J., Wilczak, C., Antzek, M., Sperotto, A., and Pras, A. (2018). IoT-botnet detection and isolation by access routers. In 2018 9th International Conference on the Network of the Future (NOF). IEEE.

Donovan, P. O., Gallagher, C., Bruton, K., and Sullivan, D. T. J. O. (2018). A fog computing industrial cyber-physical system for embedded low-latency machine learning industry 4.0 applications. Manufacturing Letters, 15:139–142.

Elkhoukhi, H., NaitMalek, Y., Berouine, A., Bakhouya, M., Elouadghiri, D., and Essaaidi, M. (2018). Towards a real-time occupancy detection approach for smart buildings. Procedia Computer Science, 134:114–120.

Ester, M., Kriegel, H.-P., Sander, J., and Xu, X. (1996). A density-based algorithm for discovering clusters in large spatial databases with noise. In Proceedings of the Second International Conference on Knowledge Discovery and Data Mining, KDD'96, pages 226–231. AAAI Press.

Kambourakis, G., Kolias, C., and Stavrou, A. (2017). The Mirai botnet and the IoT zombie armies. In MILCOM 2017 - 2017 IEEE Military Communications Conference (MILCOM). IEEE.

Kanich, C., Weaver, N., McCoy, D., Halvorson, T., Kreibich, C., Levchenko, K., Paxson, V., Voelker, G. M., and Savage, S. (2011). Show me the money: Characterizing spam-advertised revenue. In Proceedings of the 20th USENIX Conference on Security, SEC'11, page 15, USA. USENIX Association.

Kanoun, K., Tekin, C., Atienza, D., and v. d. Schaar, M. (2016). Big-data streaming applications scheduling based on staged multi-armed bandits. IEEE Transactions on Computers, 65(12):3591–3605.

Kapoor, A. and Dhavale, S. (2016). Control flow graph based multiclass malware detection using bi-normal separation. Defence Science Journal, 66(2):138.


Kolias, C., Kambourakis, G., Stavrou, A., and Voas, J. (2017). DDoS in the IoT: Mirai and other botnets. Computer, 50:80–84.

Kourtellis, N., Morales, G. D. F., Bifet, A., and Murdopo, A. (2016). VHT: Vertical Hoeffding tree. In 2016 IEEE International Conference on Big Data (Big Data), pages 915–922. IEEE.

Li, F. (2014). A pattern query strategy based on semi-supervised machine learning in distributed WSNs. Journal of Information and Computational Science, 11(18):6447–6459.

Lu, J., Feng, J., Zhang, J., Xia, P., and Xiao, X. (2018). A parallel approach on clustering traffic data stream based on the density. In 2018 Sixth International Conference on Advanced Cloud and Big Data (CBD). IEEE.

Marzano, A., Alexander, D., Fonseca, O., Fazzion, E., Hoepers, C., Steding-Jessen, K., Chaves, M. H. P. C., Cunha, I., Guedes, D., and Meira, W. (2018). The evolution of Bashlite and Mirai IoT botnets. In 2018 IEEE Symposium on Computers and Communications (ISCC). IEEE.

Meidan, Y., Bohadana, M., Mathov, Y., Mirsky, Y., Breitenbacher, D., Shabtai, A., and Elovici, Y. (2018). N-BaIoT: Network-based detection of IoT botnet attacks using deep autoencoders. CoRR, abs/1805.03409.

Ozkok, F. O. (2017). A new approach to determine eps parameter of DBSCAN algorithm. International Journal of Intelligent Systems and Applications in Engineering, 4(5):247–251.

Putman, C. G. J., Abhishta, and Nieuwenhuis, L. J. M. (2018). Business model of a botnet. CoRR, abs/1804.10848.

Roopaei, M., Rad, P., and Jamshidi, M. (2017). Deep learning control for complex and large scale cloud systems. Intelligent Automation & Soft Computing, 23(3):389–391.

Schmidt, B., Kountanis, D., and Al-Fuqaha, A. (2014). Artificial immune system inspired algorithm for flow-based internet traffic classification. In 2014 IEEE 6th International Conference on Cloud Computing Technology and Science. IEEE.

Singh, Y., Saha, S., Chugh, U., and Gupta, C. (2013). Distributed event detection in wireless sensor networks for forest fires. In 2013 UKSim 15th International Conference on Computer Modelling and Simulation. IEEE.

Spezzano, G. and Vinci, A. (2015). Pattern detection in cyber-physical systems. Procedia Computer Science, 52:1016–1021.

Zhao, Q. (2005). Learning with data streams – an NNTree based approach. In Embedded and Ubiquitous Computing – EUC 2005 Workshops, volume 3823, pages 519–528. Springer Berlin Heidelberg.
