Deep Convolutional Neural Network and Character Level Embedding for DGA Detection

João Rafael Gregório(1,a), Adriano Mauro Cansian(1,b), Leandro Alves Neves(1,c) and Denis Henrique Pinheiro Salvadeo(2,d)

(1) Department of Computer Science and Statistics (DCCE), São Paulo State University (UNESP), São José do Rio Preto, São Paulo, Brazil

(2) Institute of Geosciences and Exact Sciences (IGCE), São Paulo State University (UNESP), Rio Claro, Brazil

Keywords: Domain Generation Algorithms, DGA, Convolutional Neural Networks, Embedding, NLP, Short Text Classification, Cybersecurity.

Abstract: Domain generation algorithms (DGAs) generate domain names that botnets and malware commonly use to maintain and obfuscate communication between a bot client and command and control (C2) servers. In this work, a method is proposed to detect DGAs based on the classification of short texts, highlighting the use of character-level embedding at the neural network input to obtain meta-features related to the morphology of domain names. A convolutional neural network structure has been used to extract new meta-features from the vectors provided by the embedding layer. Furthermore, relu layers have been used to zero out all non-positive values, and maxpooling layers to analyze specific parts of the obtained meta-features. The tests have been carried out using the Majestic Million dataset for examples of legitimate domains and the NetLab360 dataset for examples of DGA domains, composed of around 56 DGA families. The results obtained show an average accuracy of 99.12% and an average precision of 99.33%. This work contributes a natural language processing (NLP) approach to DGA detection, an assessment of the impact of character-level embedding, relu and maxpooling on the results obtained, and a DGA detection model based on deep neural networks, without feature engineering, with competitive metrics.

1 INTRODUCTION

Botnets are groups of computers or devices infected

by malware that can be controlled remotely through

a command and control (C2) server (Kambourakis

et al., 2019). Attackers use botnets to carry out var-

ious types of cyberattacks, such as distributed denial

of service (DDoS) attacks, directing members of their

botnet to attack a speciﬁc target or for the system-

atic leakage of information from corporate and gov-

ernment networks (Shahzad et al., 2021).

Domain generation algorithms (DGAs) are routines present in most malware families that generate domain names. The generated domains are used to maintain communication between the infected machine and the C2 servers (Wong, 2023; Huang et al., 2022). This strategy allows the C2s to, for example, change their IP addresses without affecting communication with botnet member machines. Therefore, DGAs are used to increase the communication resilience between C2 and botnet members and improve the obfuscation capacity of the C2 server (Kambourakis et al., 2019).

(a) https://orcid.org/0000-0001-7783-2567

(b) https://orcid.org/0000-0003-4494-1454

(c) https://orcid.org/0000-0001-8580-7054

(d) https://orcid.org/0000-0001-8942-0033

Attackers register large numbers of domains compatible with their generating algorithms with registrars* on the internet. This strategy ensures that, if a domain is listed for abuse, another domain name can be used. This large number of DGA domain names makes using blocklists a problematic alternative, as a list of even a few thousand domains will likely be a performance drag on any firewall system (Ren et al., 2020). Thus, identifying requests for DGA domains on a computer network, or even noticing them during the registration process, can help mitigate botnet-related threats, prevent impending attacks on corporate networks, and support proactive cybersecurity.

* Companies authorized to sell second-level domain names directly to individuals or companies (Tanenbaum et al., 2021).

Gregório, J., Cansian, A., Neves, L. and Salvadeo, D.

Deep Convolutional Neural Network and Character Level Embedding for DGA Detection.

DOI: 10.5220/0012605700003690

Paper published under CC license (CC BY-NC-ND 4.0)

In Proceedings of the 26th International Conference on Enterprise Information Systems (ICEIS 2024) - Volume 2, pages 167-174

ISBN: 978-989-758-692-7; ISSN: 2184-4992

In this context, some works focused on detecting DGAs have been developed using approaches based on classic machine learning, with manually engineered attributes that describe these domain names. For instance, the

proposal of (Piras et al., 2022) presents an explainable

approach based on attributes of data collected from

domain name system (DNS) servers and compares the

benchmarked results of some classical machine learn-

ing algorithms such as random forest, adaboost and

decision tree. That approach depends on data collec-

tion windows related to DGAs on DNS servers. The

work presents the comparison of results obtained with

windows of ﬁve and fourteen days, demonstrating that

the best results are obtained with the largest data cap-

ture window.

Approaches using deep neural networks (DNN)

and lexical features of DGAs are also found in recent

literature. B. Yu, J. Pan, J. Hu, A. Nascimento and

M. De Cock (Yu et al., 2018) discuss some models of

DNNs for the detection of DGAs comparing them to

results from models based on lexical features. The researchers applied models of convolutional neural networks (CNN) and long short-term memory (LSTM) trained with about 1.6 million samples (80% of the dataset) and obtained results of about 98% accuracy. H. Vranken and H. Alizadeh (Vranken and Alizadeh, 2022) propose the use of Term Frequency-Inverse Document Frequency (TF-IDF) to evaluate the relevance of 5000 n-grams of 1, 2, and 3 letters in the composition of domain names. Their model based on an LSTM network achieved around 97% accuracy.

The approach with manually developed characteris-

tics can be more complex as it requires more special-

ized work in addition to possibly failing to include

speciﬁc characteristics of DGA families that may not

be present in the used data set.

Other approaches use DNNs, directly exploring

domain names as data input, without manually pro-

ducing characteristics. Ren, F., Jiang, Z., Wang, X.

and Liu, J. (Ren et al., 2020) present a hybrid model

of a DNN, with CNN, bidirectional long short-term memory (BiLSTM) layers and an attention mechanism, also using 80% of the dataset for training and 20% for testing, about 1.3 million samples. The obtained results have been of the order of 99% accuracy for the classification between DGA and legitimate domains. The same work also presents a classification by

DGA family, using 24 families of DGA domains. H.

Shahzad, A. R. Sattar and J. Skandaraniyam (Shahzad

et al., 2021) propose removing the top-level domain

(TLD) during pre-processing, using “google” instead

of “google.com”. The proposed DNN model has been

based on three types of recurrent networks: LSTM, BiLSTM, and gated recurrent unit (GRU). The results have been about 87% accuracy with a training dataset of about 1.8 million samples and 44 DGA

families. It is observed that recent works that address

the detection of DGAs with natural language process-

ing (NLP) techniques also use manually developed re-

sources at some point.

The works that propose directly using domain

names as input for their models are based on hybrid

structures, normally using CNN, Recurrent, LSTM,

BiLSTM networks, or attention mechanisms in the

same model. Yu Wang, Rui Pan, Zuchao Wang and

Lingqi Li (Wang et al., 2023) utilize the dataset pro-

vided by NetLab360 for DGA examples, which in-

cludes approximately 35 DGA families. They en-

code the input characters into values ranging from

0 to 38 to obtain the initial feature vector. A CNN

model concatenated with a BiLSTM network is pre-

sented to extract meta-features for the purpose of de-

tecting DGAs. The results reported in the study show

an accuracy of 94.51%, a recall of 95.05%, and a

precision of 93.11% for binary detection. Weiqing

Huang, Yangyang Zong, Zhixin Shi, Leiqi Wang and Pengcheng Liu (Huang et al., 2022) use pre-trained word embedding models such as BERT (Devlin et al., 2019) and ELMo (Peters et al., 2018) to generate the initial feature vector from domain names. This initial vector is fed into a CNN and then to the classifier, achieving an accuracy of 96.08% in their best result.

Thus, an approach based on NLP with character-

level embedding and CNN, exclusively using domain

names as input for the proposed model, to extract

the best meta-features for detecting DGAs has not

yet been completely covered in specialized literature.

Therefore, our method aims to contribute by verifying

the effectiveness of using character-level embedding

in detecting DGAs in datasets with heterogeneity of

DGA families.

In this scenario, this work presents an approach

based on a CNN, to extract features from the domain

names themselves, without the feature engineering

process. The proposal also uses character-level embedding at the proposed network's input to find characteristics related to the morphology of the analyzed

domain names. Furthermore, the proposed approach

uses relu layers to make the network sparse, returning

zero for any non-positive value, and maxpooling lay-

ers to highlight speciﬁc characteristics of the analyzed

classes. The main contributions of this paper are:

• Indicate the use of character-level embedding as

an effective strategy to improve DGA detection

results in DNN architectures.


• Describe how the combined use of relu, maxpool-

ing and embedding layers can improve the results

and performance of CNNs in detecting DGAs.

• Present a DNN architecture with competitive re-

sults in detecting DGAs without feature engineer-

ing.

This paper is organized as follows. Section 2

presents the methodology used in data pre-processing,

the datasets used, tools, and the technologies that sup-

ported the development of this work, in addition to the

description of the proposed model. Section 3 presents

the results obtained in distinct tests, comparisons with

different combinations of the proposed model and an

overview of the obtained results in relation to the liter-

ature. Finally, in Section 4, conclusions and proposals

for future work are presented.

2 METHODOLOGY

This Section presents the methodology used for the

development of the proposed model, pre-processing

of datasets, and the technologies used. Section 2.1

presents the development environment and the tech-

nologies applied, the data sets, their pre-processing,

and division into sets for training and testing. Section

2.2 presents the proposed model, each of the layers

used along with their functions, hyperparameters, and

a summary to facilitate the reproduction of this exper-

iment. Section 2.3 presents the metrics used to eval-

uate the results obtained with training and testing the

proposed model.

2.1 Development Environment and Datasets

The algorithms of this work have been executed us-

ing the Kaggle Notebook platform (Kaggle Team,

2023), a public and free platform, where it is possi-

ble to use a monthly Graphics Processing Unit (GPU)

processing quota, essential for processing large vol-

umes of data and training DNNs in less time. All

codes have been developed in the Python 3 pro-

gramming language, using the Tensorﬂow (Abadi

et al., 2015), Pandas (The Pandas Development Team,

2020), and Numpy (Harris et al., 2020) libraries.

The notebook containing the complete experiment

has been published, thus allowing the complete repro-

duction of this work. The source code is available

at https://www.kaggle.com/code/rafaelgregrio/deep-

convolutional-embedding-dga-detector.

Also, the Majestic Top Million (Majestic, 2023) dataset containing 1 million domains has been used as the legitimate domain data source. As a source of DGA domain names, the dataset provided by NetLab360 (NetLab 360, 2022) has been used in this investigation, containing about 1 million DGA domains, distributed across 56 families, collected in real traffic. Legitimate domain names have been labeled 0, and the domain names from DGA families have been labeled 1.

Domain names have been encoded character by

character to ASCII code. The maximum length

checked across the domain name samples has been

73 characters, so smaller domain names have been

padded with zeros from the end to the maximum

length. Table 1 shows examples of domain names al-

ready encoded to ASCII.

Table 1: Examples of domain names and each character encoded by ASCII.

domain                  label   1     2     ...
gafsjfdues.com          1       103   97    ...
yvhbrwh.info            1       121   118   ...
nwqpswrwsn.us           1       110   119   ...
awnspwrpnn.com          1       97    119   ...
cghgfedguvol.net        1       99    103   ...
mayermoney.com          0       109   97    ...
sqlservercentral.com    0       115   113   ...
lottomatica.it          0       108   111   ...
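The encoding step described above can be illustrated with a minimal sketch. The function name `encode_domain` and the truncation of names longer than 73 characters are assumptions made for illustration; the published notebook may implement this step differently.

```python
MAX_LEN = 73  # maximum domain length reported in the text


def encode_domain(domain, max_len=MAX_LEN):
    """Map each character to its ASCII code and pad with zeros at the end.

    Assumption: names longer than max_len are truncated.
    """
    codes = [ord(ch) for ch in domain[:max_len]]
    return codes + [0] * (max_len - len(codes))


encoded = encode_domain("gafsjfdues.com")
print(encoded[:2], len(encoded))
```

The first two codes reproduce the first row of Table 1 (103 for "g", 97 for "a"), and every encoded name has the same fixed length.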

In order to train the proposed model, 20% of the

total data have been proportionally selected. To test,

ﬁve samples, each one with 5% of the total data have

been selected randomly and proportionally.
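A proportional (stratified) selection of this kind might be sketched as follows. This is only an illustration: the helper `stratified_split`, the fixed seed, and the assumption that the training subset and the five test samples are mutually disjoint are not stated in the text.

```python
import random
from collections import defaultdict


def stratified_split(labels, train_frac=0.20, test_frac=0.05, n_tests=5, seed=42):
    """Return (train_idx, [test_idx, ...]) preserving class proportions.

    Assumption: the training subset and the test samples do not overlap.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train, tests = [], [[] for _ in range(n_tests)]
    for indices in by_class.values():
        rng.shuffle(indices)
        n_train = round(len(indices) * train_frac)
        n_test = round(len(indices) * test_frac)
        train.extend(indices[:n_train])
        for k in range(n_tests):
            start = n_train + k * n_test
            tests[k].extend(indices[start:start + n_test])
    return train, tests


labels = [0] * 1000 + [1] * 1000  # toy label vector, half legitimate and half DGA
train_idx, test_sets = stratified_split(labels)
print(len(train_idx), [len(t) for t in test_sets])
```

With the toy label vector, the training subset holds 20% of each class and each test sample holds 5% of each class.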

2.2 Proposed Model

The model proposed in this work has been constructed

based on an approach rooted in a DNN, using domain

names converted, character by character, to their re-

spective ASCII codes as input data. This initial fea-

ture vector is then passed through an embedding layer

that converts each ASCII value into a vector of 20

features related to the context of the character in the

initial string. The output of the embedding layer is

connected to a CNN, which aims to extract relevant

features from the domain names based on the vectors

received from the upper embedding layer. In addition,

the relu layers have been applied to sparsify the net-

work, returning zero for any non-positive values, and

maxpooling layers to highlight speciﬁc features of the

analyzed classes.


2.2.1 Embedding Layer

Embedding layers are widely used in NLP, where

each word is represented by an array of ﬂoating point

values (Goodfellow et al., 2016; Zhang et al., 2023).

Embedding layers are trainable. Therefore, the rep-

resentation of each word can be learned by the net-

work. In this work, embedding at the character level

has been used. Therefore, each character of a domain

name, after being transposed into its ASCII code, is

taken to the embedding layer, where it is represented

by a vector of 20 values. This feature vector con-

structed by the embedding layer is then passed to the

CNN part of the proposed model.
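Conceptually, a character-level embedding layer is a trainable lookup table indexed by the character code. The sketch below uses random placeholder weights and assumes a vocabulary of 128 entries (7-bit ASCII, with 0 used for padding); the vocabulary size is an assumption, while the vector size of 20 comes from the text.

```python
import random

VOCAB_SIZE = 128  # assumption: 7-bit ASCII codes, 0 used for padding
EMBED_DIM = 20    # vector size stated in the text

rng = random.Random(0)
# In a trained model these weights are learned; here they are random placeholders.
embedding_table = [[rng.uniform(-0.05, 0.05) for _ in range(EMBED_DIM)]
                   for _ in range(VOCAB_SIZE)]


def embed(codes):
    """Replace every ASCII code with its row of the embedding table."""
    return [embedding_table[c] for c in codes]


name = "google.com"
codes = [ord(ch) for ch in name] + [0] * (73 - len(name))
matrix = embed(codes)
print(len(matrix), len(matrix[0]))
```

Each domain name therefore becomes a 73 x 20 matrix, and repeated characters (such as the two "g" in "google") share the same vector.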

2.2.2 Deep Convolutional Neural Network

Convolutional neural networks are capable of finding specific features in images, requiring minimal pre-processing (Goodfellow et al., 2016; Zhang et al., 2023). In this work, the CNN part of the proposed model is designed to extract relevant features for the analyzed classes, legitimate and DGA, from the feature vector generated by the embedding layer positioned at the model's input. Our model has six convolutional layers of distinct dimensions, forming a deep CNN. The dimensions of each of the convolutional layers and of the other layers used in the proposed model are illustrated in Figure 1.
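To make the operation concrete, a single one-dimensional convolution filter sliding over a sequence of character vectors can be sketched as below. The kernel width, the weights, and the toy input are hypothetical; the actual layer dimensions are those given in Figure 1.

```python
def conv1d(sequence, kernel, bias=0.0):
    """Valid 1D convolution (cross-correlation, as in most deep learning libraries).

    sequence: list of positions, each a list of `channels` floats
    kernel:   list of `width` rows, each a list of `channels` weights
    Returns one output value per window position.
    """
    width = len(kernel)
    out = []
    for start in range(len(sequence) - width + 1):
        total = bias
        for offset in range(width):
            for x, w in zip(sequence[start + offset], kernel[offset]):
                total += x * w
        out.append(total)
    return out


# Toy input: 5 positions with 2 channels each (the model input would be 73 x 20).
seq = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0], [1.0, 2.0]]
kernel = [[1.0, -1.0], [0.5, 0.5], [0.0, 2.0]]  # width 3, hypothetical weights
print(conv1d(seq, kernel))
```

Each output value summarizes a window of three neighboring characters, which is how a filter can respond to local character patterns in a domain name.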

2.2.3 ReLU and MaxPooling Layers

Positioned within the convolutional layers are relu

and maxpooling layers, contributing to maximizing

the model’s intrinsic ability to discern and iden-

tify pertinent attributes associated with each distinct

class (Bahera H. Nayef, 2023). Rectiﬁed linear unit

(ReLU) is an activation function that, in practice,

turns every negative input into zero (Zhang et al.,

2023), as described by equation 1. This feature is desirable for two reasons. Firstly, neurons in the network that receive zero values will not be activated, making the network sparse. Secondly, the positive values that are passed on through the network tend to be more representative. Thus, the relu layer ends up working as a high-contrast filter, highlighting important areas.

$$\mathrm{ReLU}(x) = (x)^{+} = \max(0, x) \tag{1}$$

MaxPooling layers downsample the feature map

extracted by the convolutional layers (Zhang et al.,

2023). In the model proposed in this work, max-

pooling layers are used so that the feature maps high-

lighted by the relu layer can be condensed in the max-

pooling layer. This integration of layers and architec-

tural elements is intended to enhance the model’s ca-

pacity to extract and process information efﬁciently,

thereby facilitating its performance in the identiﬁca-

tion of signiﬁcant features relevant to the classiﬁca-

tion of diverse classes.
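The combined effect of the two layers on a feature map can be shown with a small sketch. The pool size of 2 and the sample values are assumptions chosen for illustration.

```python
def relu(values):
    """Equation 1 applied element-wise: max(0, x)."""
    return [max(0.0, v) for v in values]


def max_pool1d(values, pool_size=2):
    """Non-overlapping max pooling; a trailing partial window is dropped."""
    return [max(values[i:i + pool_size])
            for i in range(0, len(values) - pool_size + 1, pool_size)]


feature_map = [-1.2, 0.4, 3.0, -0.5, 0.0, 2.1, -3.3]  # hypothetical values
activated = relu(feature_map)
pooled = max_pool1d(activated, pool_size=2)
print(activated)
print(pooled)
```

The relu step zeroes the negative responses, and the pooling step keeps only the strongest response in each window, halving the length of the feature map.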

2.2.4 Flatten and Dense Layers

The ﬂatten layer has been used after the convolutional

part of the proposed model to reshape the format of

the feature vector produced by the upper layers, mak-

ing it compatible with the fully connected dense lay-

ers positioned at the end of the model. Finally, two

fully connected layers, with 15 and 1 neuron respec-

tively, act as classiﬁers using the feature vector re-

ceived from the ﬂatten layer. Figure 1 shows the sum-

mary of the model, its layers, and its dimensions.

Figure 1: An illustrative overview of the proposed model.

2.2.5 Hyperparameters and the Experiment Execution

Another important aspect of the proposed model is

the hyperparameters. In deep learning models, hyper-

parameters are parameters that are not learned during

the model training but need to be deﬁned before the

start of this process (Koutsoukas et al., 2017; Dalli,

2022). They are set externally to the model and play

a crucial role in the architecture and training of the

neural network. Unlike model parameters such as

weights and biases, which are adjusted automatically


by the optimization algorithm during training, hyper-

parameters are predetermined and directly impact the

performance and behavior of the model. Thus, the

hyperparameters used to compile the proposed model

are shown in Table 2.

Table 2: Hyperparameters used to compile the proposed model.

Hyperparameter         Value
L1 Regularizer         1e-5
L2 Regularizer         1e-4
Optimizer              Adam
Activation Function    ReLU
Learning Rate          1e-4
Batch Size             500

The L1 and L2 regularizers have been used to apply penalties to the layer parameters, which are added to the loss function to avoid overfitting (Salehin and Kang, 2023). The values adopted, 1e-5 and 1e-4 respectively, are low enough not to have an abrupt impact on the training process. The Adam optimizer has been selected because it is a computationally efficient stochastic gradient descent method for problems with a large number of parameters and data (Kingma and Ba, 2017). The ReLU activation function has been chosen to increase the sparsity of the network due to its characteristic of turning every negative value received into zero (Zhang et al., 2023). The network's learning rate has been set to a low value, 1e-4, to make parameter updates evolve more slowly and thus approach the lowest achievable loss.
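As an illustration of how small these penalties are, the combined L1 and L2 term can be computed for a handful of weights. The weights and the data loss value below are hypothetical, and the form of the penalty (L1 factor times the sum of absolute values plus L2 factor times the sum of squares) is the usual definition, assumed here rather than taken from the notebook.

```python
L1 = 1e-5  # L1 factor from Table 2
L2 = 1e-4  # L2 factor from Table 2


def l1_l2_penalty(weights, l1=L1, l2=L2):
    """Penalty added to the loss: l1 * sum(|w|) + l2 * sum(w^2)."""
    return l1 * sum(abs(w) for w in weights) + l2 * sum(w * w for w in weights)


weights = [0.5, -1.0, 2.0]  # hypothetical layer weights
data_loss = 0.030           # hypothetical loss value before regularization
penalty = l1_l2_penalty(weights)
total_loss = data_loss + penalty
print(round(penalty, 6), round(total_loss, 6))
```

For these weights the penalty is 0.00056, a small fraction of the data loss, which is consistent with the intention of not disturbing training abruptly.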

The experiment has been conducted in three

stages. The ﬁrst involved data pre-processing, where

domain names have been converted, character by

character, to their respective ASCII codes, up to a

limit of 73 characters—the maximum length observed

in the examples used. Domain names shorter than 73

characters have been padded with zeros from the end.

In a second phase, a subset of 20%, approximately

400 thousand examples, has been selected from the

complete dataset for training the proposed model. In

addition, ﬁve separate samples of 5% each have been

set aside for conducting tests with the trained model.

These datasets have been used to train and test, in ad-

dition to the main model, two other models with dif-

ferent layers combinations to assess the impact of us-

ing embedding, relu and maxpooling on the obtained

results. The results are presented in Section 3. Fi-

nally, the model has been trained for 20 epochs and

the best accuracy has been selected using the callback

function. Then the prediction of the ﬁve test samples

has been performed. For each test, an evaluation has

been made based on the metrics presented in Subsec-

tion 2.3. The complete process diagram is detailed in

Figure 2.

Figure 2: Experiment process diagram with data pre-

processing, data sampling, training and testing steps.

2.3 Metrics and Evaluations

The model has been evaluated using the most com-

monly accepted metrics currently: Precision (equa-

tion 2), Recall (equation 3), Accuracy (equation 4),

False Positive Rate (FPR) (equation 5). These met-

rics are obtained as follows:

$$\text{Precision (Prec.)} = \frac{TP}{FP + TP} \tag{2}$$

$$\text{Recall (Rec.)} = \frac{TP}{FN + TP} \tag{3}$$

$$\text{Accuracy (Acc.)} = \frac{TN + TP}{TN + TP + FN + FP} \tag{4}$$

$$\text{False Positive Rate (FPR)} = \frac{FP}{TN + FP} \tag{5}$$

Where:

• True Positive (TP): DGA domain names classiﬁed

as DGA;

• True Negative (TN): Legitimate domain names

classiﬁed as Legitimate;

• False Positive (FP): Legitimate domain names

classiﬁed as DGA;

• False Negative (FN): DGA domain names classi-

ﬁed as Legitimate;
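The four metrics follow directly from the confusion-matrix counts defined above. A minimal sketch, with hypothetical counts that are not taken from the experiments:

```python
def classification_metrics(tp, tn, fp, fn):
    """Equations 2 to 5, with DGA as the positive class."""
    return {
        "precision": tp / (fp + tp),
        "recall": tp / (fn + tp),
        "accuracy": (tn + tp) / (tn + tp + fn + fp),
        "fpr": fp / (tn + fp),
    }


# Hypothetical counts for a test sample of 10,000 DGA and 10,000 legitimate names.
m = classification_metrics(tp=9890, tn=9930, fp=70, fn=110)
print({name: round(value, 4) for name, value in m.items()})
```

Note that precision and FPR both depend on the false positives, but FPR is normalized by the number of legitimate names, which makes it the more direct indicator of how often benign traffic would be flagged.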


3 RESULTS

This section presents the results obtained in the de-

veloped experiments. In order to evaluate the perfor-

mance of the proposed model, called Model 1, and the

effectiveness of combining the embedding, relu and

maxpooling layers with the deep CNN for detecting

DGAs, two other models have been trained and tested

using the same datasets.

Model 2 has the same CNN and embedding net-

work structure but without the relu and maxpooling

layers. Model 3 has the same CNN structure with

relu and maxpooling layers but without the embed-

ding layer in the data input. The structures of the

evaluated models are presented in Table 3.

Table 3: Tested models and their respective base architec-

tures.

Model Architecture

Model 1 CNN+Embedding+ReLU+MaxPooling

Model 2 CNN+Embedding

Model 3 CNN+ReLU+MaxPooling

The average (Avg.), median (Med.) and standard

deviation (Stdev.) of the metrics results obtained for

the positive class in the ﬁve test samples for each

of the three models have been calculated to compare

the results. Table 4 presents the results obtained by

Model 1.

Table 4: Results of the metrics obtained by Model 1 on each of the five test data sets, their average, median, and standard deviation.

          Prec.     Rec.      Acc.      FPR
Test 1    99.32%    98.92%    99.11%    0.69%
Test 2    99.36%    98.98%    99.16%    0.65%
Test 3    99.33%    98.95%    99.13%    0.68%
Test 4    99.33%    98.94%    99.13%    0.68%
Test 5    99.31%    98.86%    99.08%    0.70%
Avg.      99.33%    98.93%    99.12%    0.68%
Med.      99.33%    98.94%    99.13%    0.68%
Stdev.    0.02%     0.04%     0.03%     0.02%

It is possible to verify that Model 1, complete with the embedding layer in the data input, relu and maxpooling layers, obtained satisfactory results in the tests carried out with the dataset used. It is noted that the FPR remained below 1% in all test sets, and the average precision of 99.33% reinforces this assessment. The average recall of 98.93% demonstrates that the model has indeed been successful in capturing the majority of the DGA examples presented. The average accuracy of 99.12% again demonstrates the balance of the proposed model across both classes examined.
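The summary rows of Table 4 can be recomputed from the five per-test values with the Python standard library. The sketch assumes the sample standard deviation (n - 1 denominator), which the text does not specify, and takes the Test 3 accuracy as 99.13%.

```python
from statistics import mean, median, stdev

# Per-test values of Model 1 (Table 4), in percent.
model1 = {
    "Prec.": [99.32, 99.36, 99.33, 99.33, 99.31],
    "Rec.": [98.92, 98.98, 98.95, 98.94, 98.86],
    "Acc.": [99.11, 99.16, 99.13, 99.13, 99.08],
    "FPR": [0.69, 0.65, 0.68, 0.68, 0.70],
}

summary = {
    name: (round(mean(v), 2), round(median(v), 2), round(stdev(v), 2))
    for name, v in model1.items()
}
for name, (avg, med, sd) in summary.items():
    print(f"{name:6s} avg={avg:.2f} med={med:.2f} stdev={sd:.2f}")
```

The recomputed averages, medians and standard deviations agree with the values reported in the table after rounding to two decimal places.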

Table 5 presents the results obtained by Model 2.

Table 5: Results of the metrics obtained by Model 2 on each of the five test data sets, their average, median, and standard deviation.

          Prec.     Rec.      Acc.      FPR
Test 1    98.32%    97.69%    97.99%    1.70%
Test 2    98.28%    97.86%    98.06%    1.74%
Test 3    98.38%    97.75%    98.05%    1.64%
Test 4    98.37%    97.83%    98.09%    1.65%
Test 5    98.37%    97.69%    98.02%    1.65%
Avg.      98.34%    97.76%    98.04%    1.68%
Med.      98.37%    97.75%    98.05%    1.65%
Stdev.    0.04%     0.08%     0.04%     0.04%

Based on the results obtained by Model 2, a model

without the relu and maxpooling layers, it is noticed

that in all metrics the results have been slightly below

those obtained by Model 1, around 1% lower. The

FPR more than doubled, from 0.68% in Model 1 to

1.68% in Model 2. These results demonstrate the im-

portance of the relu and maxpooling layers for detect-

ing DGAs by the proposed model.

Table 6 presents the results obtained by Model 3.

Table 6: Results of the metrics obtained by Model 3 on each of the five test data sets, their average, median, and standard deviation.

          Prec.     Rec.      Acc.      FPR
Test 1    95.97%    91.42%    93.74%    3.91%
Test 2    95.93%    91.39%    93.70%    3.94%
Test 3    95.95%    91.33%    93.69%    3.92%
Test 4    95.90%    91.25%    93.62%    3.97%
Test 5    95.79%    91.24%    93.56%    4.08%
Avg.      95.91%    91.33%    93.66%    3.96%
Med.      95.93%    91.33%    93.69%    3.94%
Stdev.    0.07%     0.08%     0.07%     0.07%

The results obtained by Model 3, a model without the embedding layer in the data input, have been even worse than those of Model 2 in relation to Model 1. On average, precision, recall and accuracy have been about 3.4, 7.6 and 5.5 percentage points below those obtained by Model 1, respectively, and the FPR increased about sixfold, from 0.68% in Model 1 to 3.96% in Model 3. These results confirm the importance of using the embedding layer in the data input of the proposed model.

Table 7 compares the average of the metrics obtained

by the three models.

Table 7: Comparison of the average metrics values obtained by the three tested models.

Model      Prec.     Rec.      Acc.      FPR
Model 1    99.33%    98.93%    99.12%    0.68%
Model 2    98.34%    97.76%    98.04%    1.68%
Model 3    95.91%    91.33%    93.66%    3.96%

Observing the average results of the metrics presented in Table 7, it can be verified that the proposed Model 1, which combines the embedding, relu, and maxpooling layers, provided the best results. It can also be verified that removing the embedding layer (Model 3) had a greater impact on the metrics than removing the relu and maxpooling layers (Model 2). Therefore, it is possible to state, based on these results, that the use of character-level embedding has been fundamental for achieving the best metrics. However, the relu and maxpooling layers also contribute to the better result obtained by Model 1, which is the model proposed in this work.
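The size of each ablation effect can be read off Table 7 by subtracting the averages from those of Model 1 and taking the ratio of the FPR values. A short sketch of that arithmetic, using only the averages reported in the table:

```python
# Average metrics from Table 7, in percent.
averages = {
    "Model 1": {"Prec.": 99.33, "Rec.": 98.93, "Acc.": 99.12, "FPR": 0.68},
    "Model 2": {"Prec.": 98.34, "Rec.": 97.76, "Acc.": 98.04, "FPR": 1.68},
    "Model 3": {"Prec.": 95.91, "Rec.": 91.33, "Acc.": 93.66, "FPR": 3.96},
}

base = averages["Model 1"]
comparison = {}
for name in ("Model 2", "Model 3"):
    row = averages[name]
    # Drop in percentage points relative to Model 1, and FPR growth factor.
    drops = {k: round(base[k] - row[k], 2) for k in ("Prec.", "Rec.", "Acc.")}
    comparison[name] = (drops, round(row["FPR"] / base["FPR"], 2))
    print(name, comparison[name])
```

Removing relu and maxpooling costs roughly one percentage point per metric, while removing the embedding costs between about 3.4 and 7.6 points, with recall affected the most.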

3.1 A Comprehensive Overview of the Proposed Model in Relation to the Literature

In specialized literature, distinct approaches have

been proposed for detecting DGAs, exploring dif-

ferent datasets. The models have been developed

based on diverse combinations, using both classical

machine learning (ML) and deep learning (DL) tech-

niques. Features have been extracted either manually

or through meta-features obtained by neural network

models. An overview of these works helps to situate this study, which adopts a natural language processing approach using character-level embedding and CNN. This overview is presented in Table 8.

Table 8: Overview of the results of some recent works, with different approaches for detecting DGAs.

Work                          Architecture                                 Acc.
Proposed Model 1              Embedding + CNN + ReLU + MaxPooling          99.13%
Ren et al., 2020              CNN + BiLSTM + Attention                     98.82%
Huang et al., 2022            Pre-trained Embeddings + CNN + MaxPooling    96.08%
Wang et al., 2023             CNN + BiLSTM                                 94.51%
Vranken and Alizadeh, 2022    TF-IDF + LSTM                                90.69%
Shahzad et al., 2021          LSTM + BiLSTM + GRU                          87.00%

Analyzing the data presented in the overview, it is

observed that the results obtained by this study align

with works found in specialized literature, consider-

ing the variety of existing approaches. In this compre-

hensive overview, Model 1 exhibited higher accuracy

than the other approaches. However, this naturally

does not imply that our model is superior, given the

differences in techniques and datasets. Nevertheless,

the approach has demonstrated effectiveness and ro-

bust metrics with the dataset used. Finally, it is worth

noting that the majority of works focused on DGA

detection show results close to the ideal, with metrics

above 90%. Therefore, the contribution of this work

lies in presenting an approach and demonstrating the

effectiveness of using character-level embedding for

the DGA detection problem.

4 CONCLUSION AND FUTURE WORKS

This work presented a brief overview of DGAs, how they represent important cyber threats today, and some recent works on detecting DGAs, with their approaches and results.

The proposed model proved to be robust, with re-

sults aligned with the results of the most current stud-

ies in the same direction. From the tests carried out

with changes to the main Model 1, it has been pos-

sible to verify the impact of the combined use of the

relu and maxpooling layers for the CNN model in de-

tecting DGAs. It has been observed from the results

obtained that the inclusion of the embedding layer

contributed signiﬁcantly to improving the results ob-

tained.

It has been verified that the impact of using the relu and maxpooling layers on the results has been smaller than that of using character-level embedding in the data input. While the model without the relu and maxpooling layers obtained metrics around 1 percentage point lower than the full model in the tests, the model without embedding obtained results between roughly 3.4 and 7.6 percentage points lower. Thus, the complete model (Model 1), with all layers, has been more effective in detecting DGAs during the experiment.

It is noteworthy that for the FPR metric, Model 1

achieved superior performance, with an FPR of less

than half compared to Model 2 and almost six times

lower than Model 3. In high DNS query volume en-

vironments, FPR can signiﬁcantly impact the perfor-

mance and usability of predictive models.
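A back-of-the-envelope calculation illustrates this point. The daily volume of one million legitimate queries is a hypothetical figure chosen for illustration; the FPR values are the averages reported for the three models.

```python
def expected_false_alarms(legit_queries, fpr_percent):
    """Expected number of legitimate queries flagged as DGA."""
    return legit_queries * fpr_percent / 100.0


daily_legit_queries = 1_000_000  # hypothetical daily volume of legitimate queries
fpr_by_model = {"Model 1": 0.68, "Model 2": 1.68, "Model 3": 3.96}

alarms = {name: round(expected_false_alarms(daily_legit_queries, fpr))
          for name, fpr in fpr_by_model.items()}
print(alarms)
```

Under this assumption, Model 1 would raise about 6,800 false alarms per day, against about 16,800 for Model 2 and 39,600 for Model 3, which shows how a difference of a few percentage points in FPR scales with traffic volume.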

In future works, an approach based on Transformers is suggested, a recent technique that has surpassed the results of CNNs in some cases. Studies focusing on specific DGA families may help to better understand this threat type, perhaps using models based on NLP approaches. An unsupervised or semi-supervised approach to detect DGA domains is also suggested, including some integration with DNS traffic collection mechanisms in local networks, so that it can work as a powerful cybersecurity tool.


ACKNOWLEDGEMENTS

This study was financed in part by the Coordenação de Aperfeiçoamento de Pessoal de Nível Superior - Brasil (CAPES), the National Council for Scientific and Technological Development CNPq (Grant #313643/2021-0) and Núcleo de Informação e Coordenação do Ponto BR - NIC.br.

REFERENCES

Abadi, M., Agarwal, A., Barham, P., Brevdo, E., Chen, Z., Citro, C., Corrado, G. S., Davis, A., Dean, J., Devin, M., Ghemawat, S., Goodfellow, I., Harp, A., Irving, G., Isard, M., Jia, Y., Jozefowicz, R., Kaiser, L., Kudlur, M., Levenberg, J., Mané, D., Monga, R., Moore, S., Murray, D., Olah, C., Schuster, M., Shlens, J., Steiner, B., Sutskever, I., Talwar, K., Tucker, P., Vanhoucke, V., Vasudevan, V., Viégas, F., Vinyals, O., Warden, P., Wattenberg, M., Wicke, M., Yu, Y., and Zheng, X. (2015). TensorFlow: Large-scale machine learning on heterogeneous systems. Software available from tensorflow.org.

Bahera H. Nayef, Siti Norul Huda Sheikh Abdullah, R. S.

A. M. S. (2023). Text extraction with optimal bi-lstm.

Computers, Materials & Continua, 76(3):3549–3567.

Dalli, A. (2022). Impact of hyperparameters on deep learn-

ing model for customer churn prediction in telecom-

munication sector. Mathematical Problems in Engi-

neering, 2022:4720539.

Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. (2019). BERT: Pre-training of deep bidirectional transformers for language understanding. In Burstein, J., Doran, C., and Solorio, T., editors, Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.

Goodfellow, I., Bengio, Y., and Courville, A. (2016). Deep Learning. MIT Press. http://www.deeplearningbook.org.

Harris, C. R., Millman, K. J., van der Walt, S. J., Gommers, R., Virtanen, P., Cournapeau, D., Wieser, E., Taylor, J., Berg, S., Smith, N. J., Kern, R., Picus, M., Hoyer, S., van Kerkwijk, M. H., Brett, M., Haldane, A., del Río, J. F., Wiebe, M., Peterson, P., Gérard-Marchant, P., Sheppard, K., Reddy, T., Weckesser, W., Abbasi, H., Gohlke, C., and Oliphant, T. E. (2020). Array programming with NumPy. Nature, 585(7825):357–362.

Huang, W., Zong, Y., Shi, Z., Wang, L., and Liu, P. (2022). PEPC: A deep parallel convolutional neural network model with pre-trained embeddings for DGA detection. In 2022 International Joint Conference on Neural Networks (IJCNN), pages 1–8.

Kaggle Team (2023). Kaggle. https://www.kaggle.com/.

Kambourakis, G., Anagnostopoulos, M., Meng, W., and Zhou, P. (2019). Botnets: Architectures, Countermeasures, and Challenges. CRC Press, Boca Raton, 1st edition.

Kingma, D. P. and Ba, J. (2017). Adam: A method for

stochastic optimization.

Koutsoukas, A., Monaghan, K. J., Li, X., and Huan, J. (2017). Deep-learning: investigating deep neural networks hyper-parameters and comparison of performance to shallow methods for modeling bioactivity data. Journal of Cheminformatics, 9(1):42.

Majestic (2023). Majestic Million. https://pt.majestic.com/reports/majestic-million.

NetLab 360 (2022). NetLab360. https://data.netlab.360.com/.

Peters, M. E., Neumann, M., Iyyer, M., Gardner, M., Clark, C., Lee, K., and Zettlemoyer, L. (2018). Deep contextualized word representations.

Piras, G., Pintor, M., Demetrio, L., and Biggio, B. (2022). Explaining machine learning DGA detectors from DNS traffic data.

Ren, F., Jiang, Z., Wang, X., and Liu, J. (2020). A DGA domain names detection modeling method based on integrating an attention mechanism and deep neural network. Cybersecurity, 3:4.

Salehin, I. and Kang, D.-K. (2023). A review on dropout

regularization approaches for deep neural networks

within the scholarly domain. Electronics, 12(14).

Shahzad, H., Sattar, A., and Skandaraniyam, J. (2021). DGA domain detection using deep learning. pages 139–143.

Silveira, M. R., Marcos da Silva, L., Cansian, A. M., and Kobayashi, H. K. (2021). Detection of newly registered malicious domains through passive DNS. In 2021 IEEE International Conference on Big Data (Big Data), pages 3360–3369.

Tanenbaum, A. S., Feamster, N., and Wetherall, D. (2021). Redes de Computadores. Pearson Education do Brasil, Porto Alegre, 6th edition.

The Pandas Development Team (2020). pandas-dev/pandas:

Pandas.

Vranken, H. and Alizadeh, H. (2022). Detection of DGA-generated domain names with TF-IDF. Electronics, 11(3).

Wang, Y., Pan, R., Wang, Z., and Li, L. (2023). A classification method based on CNN-BiLSTM for difficult detecting DGA domain name. In 2023 IEEE 13th International Conference on Electronics Information and Emergency Communication (ICEIEC), pages 17–21.

Wong, A. D. (2023). Detecting domain-generation algorithm (DGA) based fully-qualified domain names (FQDNs) with Shannon entropy.

Yu, B., Pan, J., Hu, J., Nascimento, A., and De Cock, M. (2018). Character level based detection of DGA domain names. In 2018 International Joint Conference on Neural Networks (IJCNN), pages 1–8.

Zhang, A., Lipton, Z. C., Li, M., and Smola, A. J. (2023).

Dive into Deep Learning. Cambridge University

Press. https://D2L.ai.

ICEIS 2024 - 26th International Conference on Enterprise Information Systems
