CNN-HMM Model for Real Time DGA Categorization

Aimen Mahmood

, Haider Abbas

2 a

and Faisal Amjad

1 b

National University of Sciences and Technology, Islamabad, Pakistan

KTH-Royal Institute of Technology, Stockholm, Sweden

Keywords:

Domain Generation Algorithms, Domain Fluxing, Classiﬁcation of DGA, Word-List DGA, Pseudo Random

DGA, Convolution Neural Network.

Abstract:

To remotely control the target machine, hackers manage to establish a connection between victim and their

Command and Control server(C2). In order to hide their C2 they generate domain names algorithmically.

Such algorithms are called Domain Generation algorithms(DGA). These algorithmically generated domain

names are either gibberish as the characters are generated and concatenated randomly, or pure dictionary

words or the combination of the two. This paper presents an algorithm that classiﬁes the DGA running on

a compromised system either as gibberish, dictionary oriented or the mixed one, in real time. The proposed

algorithm consists of two distinct modules i) Network forensics to detect the DGA ii) Classiﬁcation of the

DGA using the combination of Hidden Markov Model and Convolution Neural Network in real time. The

algorithm is trained and tested against more than 0.21 million samples taken from more than 50 different

DGAs. The algorithm gives as good as 99% accuracy for all types of DGAs. In addition it can detect zero day

DGA as well as multiple DGAs running on a system.

1 INTRODUCTION

Nowadays the organizations are opting the paper-

less technology to support the green environment, but

making the data more vulnerable to the cyber attacks.

On the contrary, adversaries are becoming more ac-

tive by producing sophisticated malicious attacks.

The most damaging attacks like credential stealing,

ransomware, banking trojans and many more are con-

ducted through a malicious remote server called Com-

mand and Control(C2) servers. In such attacks, the

compromised system try to establish a remote con-

nection with a C2, which then remotely controls the

infected system.

Malware is an unwelcome software which enters

into the victim system suspiciously. Normally the IP

address or the domain name of C2 is hard-coded in

the malicious binary ﬁle, inturn can be blocked eas-

ily. However, to keep their C2 alive and active, bad

actors have opted a sophisticated technique called do-

main ﬂuxing in which the domain names are gener-

ated dynamically by in-cooperating ”Domain Gener-

ation Algorithm” (DGA) (Vormayr et al., 2017). Such

domain names are random in nature as they are based

https://orcid.org/0000-0002-2437-4870

https://orcid.org/0000-0003-4912-6168

on a seed value. If the seed value is known, then the

security analysts can predict the future domain names

and block them in advance.Alternatively adversaries

put all their effort to use such a seed that also changes

dynamically.

Figure1 shows the complete process of DGA.

Same DGA algorithm, with the same seed, executes

at both ends ie attacker as well as infected system.

Usually date/ time is used as a seed value, which is

taken either from an online source or through the http

response. In the simplest case malware generates the

alphabets/digits randomly and concatenate them into

a second level domain name(SLD) of a ﬁxed length,

deﬁned in the algorithm. Once being fully generated

then top level domains (TLD), which are predeﬁned,

are concatenated with SLD to make a fully qualiﬁed

domain name(FQDN). Usually it generates a huge list

of domain names but only few of them are registered,

with a shorter time to live(TTL). The victim system

sends the DNS queries to the DNS server for all those

domain names. If the domain name is not found in

DNS server, it is forwarded to the root name server

as depicted in ﬁgure 1. In case of un registered do-

main names, Non existent domain name(NX domain),

a prominent characteristic of DGA, is returned as a

response. However if a rendezvous point is achieved

822

Mahmood, A., Abbas, H. and Amjad, F.

CNN-HMM Model for Real Time DGA Categorization.

DOI: 10.5220/0012136800003555

In Proceedings of the 20th International Conference on Security and Cryptography (SECRYPT 2023), pages 822-829

ISBN: 978-989-758-666-8; ISSN: 2184-7711

 2023 by SCITEPRESS – Science and Technology Publications, Lda. Under CC license (CC BY-NC-ND 4.0)

Figure 1: Overview of Domain Generation Algorithm.

the IP address is returned to the victim system, thus

establishing the connection.

Here, as characters and digits are generated and

concatenated randomly thus leading to a gibberish do-

main names(R-DGA), another lead indicator of DGA.

Most of the researchers have used them in their detec-

tion systems. Being easily detectable, so now mali-

cious authors are using dictionary words in the do-

main names(W-DGA), thus making them harder to

detect.

DGAs are incorporated in various types of mal-

ware including banking Trojans(Oppenheim et al.,

2017), adware, malware for IoT (Antonakakis et al.,

2017),(Vinayakumar et al., 2020), Botnets(Wang

et al., 2017), malware for supply chains (Labs,

2017) etc. Over the last decade, banks have be-

come the soft target for the banking Trojans(Atzeni

et al., 2020). These Trojans steal the credentials of

the customers and may also empty their accounts.

Emotet(Bromium, 2019), a banking Trajan, ﬁrst seen

in 2014, is still active and becomes the one of the nas-

tier Trojans due its polymorphic nature(Technologies,

2020) and it incorporates DGA. Till date up to 50000

domain names have been used by it. Mirai is another

malware which has targeted IoT(Antonakakis et al.,

2017) and its variants are still active.

Malware with domain ﬂuxing are ordinarily de-

tected on the basis of their DNS behavior and the

lexical features of SLD . In this paper we are pre-

senting an algorithm to detect DGA by monitoring

the DNS response followed by hybrid model of Hid-

den Markov Model (HMM) and Convolution Neural

Network (CNN) to categorize the given SLD into R-

DGA, W-DGA or the mixture of both. Our contribu-

tions towards this paper are listed as under:

• It presents an algorithm which detects a DGA by

using live Network forensics.

• The algorithm extracts the domain name from the

DNS query and tokenize it using Hidden Markov

model, an extension of probabilistic model, into

its constituent words or subwords if exists any.

• The extracted subwords/words alongwith their

probabilities are then used for feature engineer-

ing. Multiple features, extracted from subwords

are fed into CNN for classiﬁcation.

• We have tested the algorithm for more than 56

different DGAs having the accuracy greater than

99%.

The paper is organized in 5 sections. Section II gives

the literature review, section III explains the proposed

framework. Section IV brieﬂy analyzes the results

and performance. Finally the paper is concluded in

section V.

2 LITERATURE REVIEW

DGA based malware are more sophisticated and dan-

gerous as their C2 remain untraceable due to the ran-

domness inherited by their SLDs. Taxonomy of such

SLD depend on a seed which can be dynamic in na-

ture. Various techniques have been proposed so far to

detect and classify these DGAs in real time or in close

to real time. Zaho et al., based on semantic features

of SLD, developed a DGA detection system(Zhao

et al., 2019). These semantic features include entropy,

length of the SLD, ratio of vowels, consonants, dig-

its etc. Malicious Domain names are detected on the

basis of N-gram and then are tested against the Alexa

dataset (Internet, 2018).

CNN-HMM Model for Real Time DGA Categorization

823

Figure 2: Overview of DGA Detection Algorithm.

DGAs are usually deployed for Bot network. All

the bots generate the DNS requests for the same do-

main names within a small time interval. The de-

tection algorithm tries to identify a pattern among all

these DNS queries, as they may inherit some similar-

ity(Wang et al., 2017). Although the detection accu-

racy in real time is high, but the approach is not suit-

able for the single system.Using the same approach

Erquiaga et al. (Erquiaga et al., 2016) considered the

behavioral models of DNS during a speciﬁc time win-

dow to detect DGAs.

Grill et al. (Grill et al., 2015) based their system

on NetFlow data, combination of a source IP and port

and a destination IP address and port, time stamps,

byte counters and packet counters. S. Tian et al. (Tian

et al., 2016) used mobile web trafﬁc for DGA detec-

tion. The detection was based on textual features, sta-

tistical features and DNS trafﬁc. The textual and se-

mantic features are actually humanly crafted features,

so the detection systems, based on these features, may

perform adversely for a zero day attack. And also a

minor change in the DGA algorithm will be effective

in bypassing such detection systems.

Researchers have deployed machine learning

algorithms using the above mentioned features.

These include deep neural networks, random for-

est(Yan et al., 2020), decision trees(Pereira et al.,

2018), clustering (Amini, 2008),(Bisio et al., 2017)

and (Pereira et al., 2018), support vector ma-

chines(SVM) (Wang et al., 2018), Recurrent Neu-

ral Networks(RNN), CNN(Hwang et al., 2020), Long

short term memory(LSTM)(Shahzad et al., 2021), au-

toencoders(Skansi, 2018). (Wang et al., 2018) has de-

ployed SVM to classify the DGA domain names from

benign domain names. The algorithm takes lexical

features like entropy, length of SLD, vowels, ratio of

consonants and the digits for classiﬁcation. Although

the data set was small but the algorithm gave accu-

racy as good as 89%. F. Bisio et al. (Bisio et al.,

2017) analyzes the DNS trafﬁc using K-means clus-

tering in a near-real-time in order to detect DGA.

(Xu et al., 2019) has developed a model based on N-

gram which works for both R-DGA and W-DGA. The

model gives more than 98% accuracy both on random

an as well as wordlist domain names. But they have

only tested on 16 different DGA malware. (Shahzad

et al., 2021) deployed LSTM but the detection was

focused on R-DGA only. CNN (Hwang et al., 2020)

based on linguistic features which worked well for R-

DGA only. Researchers have deployed deep neural

networks (Ren et al., 2020) but may only detect W-

DGA.

Although machine learning algorithms are best for

intelligent detection but their accuracy may suffer due

to imbalance of samples.(Liu et al., 2020) provided

a technique to handle imbalance sample set and thus

improved the accuracy. It is inferred from the litera-

ture that the above mentioned algorithms either focus

on R-DGA or on W-DGA but not on both and their

accuracy may suffer due to imbalanced training set.

In addition, there exists another DGA which is actu-

ally the mixture of both ie the domain names consists

of gibberish as well as dictionary words. It is inferred

from the above discussion that one machine learning

algorithm may work well for one category of DGA

but may suffer from another category. However, if

one is managed to classify them broadly into three ba-

sic classes and then deploy different machine learning

algorithms as per their class, accuracy will deﬁnitely

increase. So this research presents an algorithm to

classify between R-DGA, W-DGA and mixed DGA.

SECRYPT 2023 - 20th International Conference on Security and Cryptography

824

3 PROPOSED FRAMEWORK

Victim of a DGA based malware initiates the DNS re-

quests at a high rate. The response to most of these

DNS requests is NX domain, which is the key char-

acteristic of DGA malware, thus incorporating it in

the proposed algorithm. As shown in the ﬁgure 2 the

proposed algorithm has two main modules live net-

work monitoring and DGA classiﬁcation which are

discussed below.

3.1 Live Network Monitoring

The main characteristic of DGA malware is their net-

work activity. Most commonly such malware checks

the availability of internet by sending UDP packets

to some renowned websites like google, yahoo, face-

book etc. In addition, sometimes they also need to

read their seeds form online sources, hard-codded in

their binary ﬁles. Once the FQDN is created, they

start sending the http requests for them. Most of such

domain names are non existent and DNS resolver

send NX domain response. To capture live DNS re-

sponse we are using wire-shark tool and incorporated

in Python. Every DNS response is captured, if its suc-

cess full the system ignores it, if its unsuccessful the

system go for packet inspection for error code.Error

code 3 means its NX domain. In addition to the er-

ror code it also gives the source port, destination port,

domain name for which the query name been initiated

and many more.

3.2 DGA Classiﬁcation

SLD is extracted from FQDN which in turn is ex-

tracted from the unsuccessful DNS query response .

SLD, a sequence of alphabets, digits and special char-

acters (−, ) without spaces, is tokenize into its con-

stituent subwords via probabilistic model deﬁned by

Peter Norvig (Norvig, 2009). The model works on

maximum likelihood estimation(MLE) which is de-

termined via Bayes theorem.

The proposed technique is based on a statisti-

cal language model of unigrams, alongwith their oc-

currence frequency. The dataset comprising of uni-

grams and their frequency is taken from Google Cor-

pus (Google, 2012). It contains almost 1 trillion to-

kens, but to keep the size manageable we only utilize

the tokens having the frequency greater than 10000.

Any token having the occurrence frequency less than

10000 is considered as unknown word. The un-

igrams are independent but their characters follow

a particular sequence in a particular domain name,

thus making a Bayesian network, an acyclic directed

graph. HMM being a sequential classiﬁer and a tech-

nique to encode the Bayesian network, is deployed to

determine the probable word boundaries(Yan et al.,

2021)in a SLD. HMM consists of observable states,

Figure 3: HMM.

hidden states, initial probabilities, transition probabil-

ities and emission probabilities. Here, the observ-

able states are the positional characters in the do-

main name, and the hidden states are the probable

unigrams as shown in the ﬁgure 3. Here, in ﬁgure

3, the rightward arrow refers to the transition state

and downward arrow refers to the hidden states. For

example, if the given domain name is ’eatis’, then

the sample combinations are shown in the ﬁgure 3.

These sample sets include { ’e’,’a’,’t’,’i’,’s’ },{’ea’,

’tis’}, {’eati’,’s’},{’eat’,’i’,’s’}. Probabilities of each

set is determined using the relative frequency. Hidden

states or the unigrams in a particular set are indepen-

dent in nature, thus we can apply the chain rule as

depicted in equation 1.

Mathematical notation of the chain rule is:

P(C

1:n

) =

∏

k=1:n

P(C

) (1)

which is also known as Naive Bayesian Assumption.

The algorithm determines all the probable unique

sets of unigrams present in the SLD, compute

their probabilities through the chain rule equation(1)

and then using the Maximum Likelihood Estima-

tion(MLE), given in equation 2 determines the most

probable unigram boundaries within the SLD string.

P(C

1:n

) = argmax

P(c

) (2)

The unknown unigram must not have a zero prob-

ability but instead a lesser one as compared to the

known unigram. So it is calculated via equation 3.

Here, P depends on the length of unigram, so longer

the length smaller its probability.

P(unk) =

N ∗ 10

length

unk

(3)

Another challenge is to deploy an efﬁcient algorithm

for determining the most probable segments or the

subwords. The most common and the easiest tech-

nique is brute force ie computing the all possible se-

quences and selecting the one with the highest prob-

ability, thus having the exponential time complexity.

CNN-HMM Model for Real Time DGA Categorization

825

Figure 4: Machine Learning Framework.

To optimize the solution we have selected Viterbi al-

gorithm (Ciaperoni et al., 2022), a dynamic program-

ming technique. It reduces the time complexity by

avoid redundant calculations. Its time complexity is

linear in the length of the segment and quadratic in

the number of possible segments. This makes it much

more efﬁcient than the brute force approach, espe-

cially for longer domain names with many possible

segments.

3.3 Machine Learning Framework

Table 1: List of Features.

Features Description

Probabilities

of segments

Domain name is divided into prob-

able segments based on their proba-

bilities

Length of seg-

ments

Segments having small lengths

have higher probabilities

Count of Seg-

ments

Total number of segments in a do-

main name

Ratio of digits W-DGA has either 0 or 1 digits

whereas R-DGA may have many

digits depending upon its algorithm

Count of spe-

cial Charatcers

Most of the samples belonging to

W-DGA and Mixed DGA have spe-

cial characters

The selected features, extracted from SLD, are

summarized in table 1. There are DGA like ’abcbot’

which generates gibberish SLD with only 1 seg-

ment. On the other hand, ’NGIOWEB’ gener-

ates word based SLD, having as many as 14 sub-

words, including the special characters. The spe-

cial characters are considered as a separate seg-

ment. However, digits, if exists any, may become

the part of segment depending upon the probabil-

ity. For example, consider a SLD ’9995bc0c’ gen-

erated by ’antavmu’ is a single segment having the

probability 9.75E-20 computed from the equation

3. However, ’7286a423a16a13886a2511c8afa8df65’

is another SLD, a R-DGA generated by bami-

tal. The algorithm divides it into 2 segments:

’7286a423a16a13886a2511c8’ and ’afa8df65’ having

probability 9.75E-36 and 9.75E-20 respectively. It is

evident from the above examples the probability de-

creases as the length of the segment increases. Now

consider a ’Bigviktor’, a W-DGA, it generates do-

main like ’holdthenprofessional’. The proposed al-

gorithm breaks it into ’hold’, ’then’, ’professional’

with 6.222E-5, 3.60E-4 and 1.2E-4 respective prob-

abilities. Usually, the maximum length of a SLD is

upto 64 characters and the maximum segments we

found in more than 0.1 million samples are 14. But to

handle a zero day DGA, the proposed algorithm has

a cushion for 30 segments. As shown in the table 1,

we are also considering the length of these segments.

So, for 30 segments, there are 30 lengths. Missing

segments and the lengths are considered are zero. In

addition, total number of segments in a SLD, ratio

of digits and special character is also included. All

these features are used in fully connected neural net-

SECRYPT 2023 - 20th International Conference on Security and Cryptography

826

Table 2: Summary of Dataset.

Category DGA Samples DGA Samples DGA Samples

R-DGA

ABCBOT 27 ANTAVM 32 Bamital 104

Chinad 1000 Conﬁcker 494 Cryptolocker 1000

Dicrypt 1000 Dmsniff 1000 Dyre 1000

Emotet 5952 Enviserve 491 Flubot 30000

Fobber 596 Gamevoer 12000 Gspy 100

Locky 1178 Nymaim 481 Padcrypt 2279

Necurs 40950 Qakbot 24804 Ramnit 20060

blackhole 180 ccleaner 30 cooperstealer 18

feodo 263 m0yv 63 madmax 530

monerominer 2495 murofet 8560 mydoom 10037

necro 2962 omexo 40 proslikefan 100

pykspa 7930 qadars 2200 Ranybus 13302

rovnix 4453 shifo 2541 shiotob 8004

simda 12383 symmi 4256 tempedreve 2733

tinba 13420 tinyuke 200 tofsee 20

tordwm 500 vawtrak 845 vidro 100

virut 9712 wauchos 5940 xshellghost 100

W-DGA Bigviktor 1007 Matsnu 903 Suppobox 2303

Mixed Ngioweb 5274 Banjori 7306 kfos 121

work (FCNN) and CNN. FCNN gave more than 92%,

whereas CNN gave more than 99% accuracy, so we

selected CNN for DGA classiﬁcation. These results

prove the strength of the selected features as both ma-

chine learning algorithms gives accuracy greater than

90%.

Length of feature vector may vary with the length

of SLD. To normalize the feature set, zero padding

is introduced, thus generating a sparse matrix. This

sparse matrix becomes the input for a CNN. The se-

quential model of tensor ﬂow is used and the arrange-

ment of layers with their dimensions are depicted in

ﬁgure 4. To convert the sparse matrix into a matrix

having real values, embedding layer is deployed. The

output of the embedding layer then becomes the input

of the 1D convolution layer. Convolution layer has

256 ﬁlters of size 3x3, having the stride=1. Maximum

Pooling layer is used here to select the features hav-

ing the maximum values. Then comes the dense layer

with 128 hidden units and relu activation function to

handle the non linearity. In order to avoid overﬁtting,

two dropout layers are added as shown in the ﬁgure

4. And the last dense layer has 3 hidden units hav-

ing the softmax activation function for the multi class

classiﬁcation.

4 PERFORMANCE AND

RESULTS

The proposed algorithm is trained and tested on In-

tel(R) Core(TM) i7-1165G7 CPU @ 2.80 GHz 2.803

GHz and 16 GB RAM.The dataset comprising of dif-

ferent 57 DGA families as listed in table 2, is used for

training and testing. It is collected from Netlab (Net-

lab360, 2023), CIC dataset (for Cybersecurity (CIC),

2021) and kaggle (Kaggle, 2021) which are available

online. In addition, we also executed the DGA al-

gorithms which are available at github (Bader, 2023)

and tested our algorithm on them. In total we incorpo-

rated almost 0.27 million domain names for training

and testing. Almost 23% of the dataset is redundant,

which is discarded and the remaining 0.21 million is

used for training and testing. Out of 0.21 million do-

main names, 60% samples are selected randomly for

training and the remaining 40% are used for testing.

Training data is further divided into training and val-

idation in the ratio of 60:40 respectively. The model

is trained using 300 epochs and the its accuracy and

loss are shown in the ﬁgure 5 (a) and (b) respectively.

Accuracy graph 5(a) shows that the model keeps

on learning till 100 epoch and after that learning rate

becomes stable. On the other hand 5(b) shows that

loss decreases with the increase in number of epoch.

Also the confusion matrix for the test data is plotted

in the ﬁgure 6. The rows represents the true class and

the column represents the predicted class. Here, label

’0’,’1’ and ’2’ represents R-DGA, W-DGA and Mixed

DGA respectively. Confusion matrix shows that R-

DGA has 100% accuracy as compared to the rest of

the two. However, there are some missed classiﬁca-

tion in case of W-DGA and Mixed DGA. Although

the overall accuracy is still greater than 99%.

After the training , then the complete system is

tested against the DGA based malware, which are ex-

ecuted in VMware with customized settings. When

the malware generates the DNS request, wireshark

traces it and looks for NX domain. In case of NX do-

main, the domain name is passed to the proposed al-

CNN-HMM Model for Real Time DGA Categorization

827

(a) Accuracy (b) Loss

Figure 5: Accuracy and Loss.

13203

100%

1094

99%

4617

99%

True Label

Predicted Label

Figure 6: Confusion Matrix.

gorithm which then breaks it into its constituent parts

and ﬁnally enters into machine learning framework

for classiﬁcation.

To optimize the solution in terms of time complex-

ity and space complexity Viterbi algorithm(Cahyani

and Vindiyanto, 2019) is used as it reduces the time

complexity by using dynamic programming to avoid

redundant calculations and considering only the most

likely segment at each step.

5 CONCLUSION

DGAs are deployed to hide the malicious C2. DGAs

may generate gibberish domain names and the sophis-

ticated ones generate the dictionary oriented domain

names or the combination of both, thus difﬁcult to de-

tect. Researchers have devised different techniques,

usually based on humanly crafted features, to detect

and classify such DGAs, but they are speciﬁc to one

class only. This research proposes an algorithm to dif-

ferentiate between the gibberish domain names, dic-

tionary oriented domain names and Mixed ones. Now

the researchers can deploy different machine learning

techniques to each of these categories for better accu-

racy.

REFERENCES

Amini, P. (2008). Kraken botnet inﬁltration. Available at

http://dvlabs.tippingpoint.com/blog/2008/04/28/krak

en-botnet-infiltrationdvlabs.tippingpoint.com.

Antonakakis, M., April, T., Bailey, M., Bernhard, M.,

Bursztein, E., Cochran, J., Durumeric, Z., Halderman,

J. A., Invernizzi, L., Kallitsis, M., et al. (2017). Un-

derstanding the mirai botnet. In 26th {USENIX} secu-

rity symposium ({USENIX} Security 17), pages 1093–

1110.

Atzeni, A., Diaz, F., Lopez, F., Marcelli, A., Sanchez, A.,

and Squillero, G. (2020). The rise of android banking

trojans. IEEE Potentials, 39(3):13–18.

Bader, J. (2023). Reverse engineering of domain generation

algorithms. Available at:https://github.com/baderj/do

main generation algorithms.

Bisio, F., Saeli, S., Lombardo, P., Bernardi, D., Perotti, A.,

and Massa, D. (2017). Real-time behavioral dga de-

tection through machine learning. In 2017 Interna-

tional Carnahan Conference on Security Technology

(ICCST), pages 1–6. IEEE.

Bromium (2019). Emotet:a technical analysis of the de-

structive, polymorphic malware. Available at https:

//www.bromium.com/wp-content/uploads/2019/07/B

romium-Emotet-Technical-Analysis-Report.pdf.

Cahyani, D. E. and Vindiyanto, M. J. (2019). Indonesian

part of speech tagging using hidden markov model–

ngram & viterbi. In 2019 4th International Confer-

ence on Information Technology, Information Systems

and Electrical Engineering (ICITISEE), pages 353–

358. IEEE.

Ciaperoni, M., Gionis, A., Katsamanis, A., and Karras, P.

(2022). Sieve: A space-efﬁcient algorithm for viterbi

decoding. In Proceedings of the 2022 International

Conference on Management of Data, pages 1136–

1145.

Erquiaga, M. J., Catania, C., and Garc

ıa, S. (2016). Detect-

ing dga malware trafﬁc through behavioral models. In

2016 IEEE Biennial Congress of Argentina (ARGEN-

CON), pages 1–6. IEEE.

for Cybersecurity (CIC), C. I. (2021). Cic-bell-dns 2021

SECRYPT 2023 - 20th International Conference on Security and Cryptography

828

dataset. Availabe at : https://www.unb.ca/cic/dataset

s/dns-2021.html.

Google (2012). Corpus-n gram viewer. Available at: http:

//storage.googleapis.com/books/ngrams/books/datase

tsv2.html.

Grill, M., Nikolaev, I., Valeros, V., and Rehak, M. (2015).

Detecting dga malware using netﬂow. In 2015

IFIP/IEEE International Symposium on Integrated

Network Management (IM), pages 1304–1309. IEEE.

Hwang, C., Kim, H., Lee, H., and Lee, T. (2020). Effective

dga-domain detection and classiﬁcation with textcnn

and additional features. Electronics, 9(7):1070.

Internet, A. (2018). Alexa dataset. Available at https://ww

w.alexa.com/topsites.

Kaggle (2021). Kaggle:domain generation algorithm

dataset. Available at:https://www.kaggle.com/dat

asets/gtkcyber/dga-dataset.

Labs, K. (2017). Shadowpad: Popular server management

software hit in supply chain attack. Available at https:

//media.kasperskycontenthub.com/wp-content/uploa

ds/sites/43/2017/08/07172148/ShadowPad\ technic

al\ description\ PDF.pdf.

Liu, Z., Zhang, Y., Chen, Y., Fan, X., and Dong, C. (2020).

Detection of algorithmically generated domain names

using the recurrent convolutional neural network with

spatial pyramid pooling. Entropy, 22(9):1058.

Netlab360 (Accessed: May 19, 2023). Network security

research lab at 360. https://data.netlab.360.com/dga/.

Norvig, P. (2009). Natural Language Corpus

Data:Beautiful Data, pages 219–242. O’Reilly

Media.

Oppenheim, M., Zuk, K., Meir, M., and Kessem, L. (2017).

Qakbot banking trojan causes massive active directory

lockouts. Available at https://securityintelligence.com

/qakbot-banking-trojan-causes-massive-active-direc

tory-lockouts/.

Pereira, M., Coleman, S., Yu, B., DeCock, M., and Nasci-

mento, A. (2018). Dictionary extraction and detection

of algorithmically generated domain names in passive

dns trafﬁc. In International Symposium on Research

in Attacks, Intrusions, and Defenses, pages 295–314.

Springer.

Ren, F., Jiang, Z., Wang, X., and Liu, J. (2020). A dga

domain names detection modeling method based on

integrating an attention mechanism and deep neural

network. Cybersecurity, 3(1):1–13.

Shahzad, H., Sattar, A. R., and Skandaraniyam, J. (2021).

Dga domain detection using deep learning. In 2021

IEEE 5th International Conference on Cryptography,

Security and Privacy (CSP), pages 139–143. IEEE.

Skansi, S. (2018). Autoencoders. In Introduction to Deep

Learning, pages 153–163. Springer.

Technologies, C. P. S. (2020). July‘s most wanted malware:

Emotet strikes again after ﬁve-month absence. Avail-

able at https://blog.checkpoint.com/2020/08/07/julys

-most-wanted-malware-emotet-strikes-again-after-f

ive-month-absence/.

Tian, S., Fang, C., Liu, J., and Lei, Z. (2016). Detecting ma-

licious domains by massive dns trafﬁc data analysis.

In 2016 8th International Conference on Intelligent

Human-Machine Systems and Cybernetics (IHMSC),

volume 1, pages 130–133. IEEE.

Vinayakumar, R., Alazab, M., Srinivasan, S., Pham, Q.-

V., Padannayil, S. B., and Simran, K. (2020). A

visualized botnet detection system based deep learn-

ing for the internet of things networks of smart

cities. IEEE Transactions on Industry Applications,

56:4436–4456.

Vormayr, G., Zseby, T., and Fabini, J. (2017). Botnet com-

munication patterns. IEEE Communications Surveys

Tutorials, 19(4):2768–2796.

Wang, T., Lin, H., Cheng, W., and Chen, C. (2017). Dbod:

Clustering and detecting dga-based botnets using dns

trafﬁc analysis. Computers and Security, 64:1–15.

Wang, Z., Jia, Z., and Zhang, B. (2018). A detection scheme

for dga domain names based on svm. In Proceedings

of the 2018 international conference on Mathematics,

Modelling, Simulation and Algorithms (MMSA 2018).

Atlantis Press.

Xu, C., Shen, J., and Du, X. (2019). Detection method of

domain names generated by dgas based on semantic

representation and deep neural network. Computers&

Security, 85:77–88. https://www.sciencedirect.com/

science/article/pii/S0167404818312938.

Yan, F., Liu, J., Gu, L., and Chen, Z. (2020). A semi-

supervised learning scheme to detect unknown dga

domain names based on graph analysis. In 2020 IEEE

19th International Conference on Trust, Security and

Privacy in Computing and Communications (Trust-

Com), pages 1578–1583. IEEE.

Yan, X., Xiong, X., Cheng, X., Huang, Y., Zhu, H., and

Hu, F. (2021). Hmm-bimm: Hidden markov model-

based word segmentation via improved bi-directional

maximal matching algorithm. Computers & Electrical

Engineering, 94:107354.

Zhao, H., Chang, Z., Bao, G., and Zeng, X. (2019). Mali-

cious domain names detection algorithm based on n-

gram. Journal of Computer Networks and Communi-

cations, 2019.

CNN-HMM Model for Real Time DGA Categorization

829