GANMCMCRO: A Generative Adversarial Network Markov Chain
Monte Carlo Random Oversampling Algorithm for Imbalance
Datasets
Najmeh Abedzadeh and Matthew Jacobs
EECS Dept., School of Engineering, Catholic University of America, Washington, DC, U.S.A.
Keywords: Imbalanced Datasets, Machine Learning, Random Oversampling, Markov Chain Monte Carlo, Generative
Adversarial Networks.
Abstract: Machine learning techniques have showcased their adeptness in identifying patterns within data, yet their
efficacy diminishes when dealing with imbalanced datasets—a pervasive concern, especially apparent in the
realm of Intrusion Detection Systems (IDS). IDS, pivotal for monitoring malicious activities in networks or
systems, requires strategic interventions to address dataset imbalances and increase machine learning model
accuracy. Of note, imbalanced IDS datasets harbour covert cyber-attacks amid their substantial imbalances,
intricately complicating detection for conventional machine learning methods. This study introduces novel
algorithms designed to rectify imbalances within IDS datasets. The first algorithm, named Markov Chain
Monte Carlo Random Oversampling (MCMCRO), seamlessly integrates Markov Chain Monte Carlo
(MCMC) and Random Oversampling techniques to systematically synthesize fresh data. Additionally,
MCMCRO's novel data synthesis capability is harnessed within the Generative Adversarial Network
framework to formulate the second algorithm, GANMCMCRO (Generative Adversarial Networks Markov
Chain Monte Carlo Random Oversampling). This framework augments the potency of MCMCRO's data
generation function within the data generator model. An evaluation conducted on the CSE-CIC-IDS2018
Dataset substantiates the efficacy of both algorithms. MCMCRO showcases a recall of 0.66, precision of 1,
an F1 score of 0.79, and an overall accuracy of 0.91. Similarly, GANMCMCRO attains a recall of 0.81,
precision of 0.82, an F1 score of 0.81, and an overall accuracy of 0.88, providing compelling evidence of their
prowess in mitigating the challenges posed by imbalanced datasets. This research advances the field by
introducing innovative techniques that demonstrate substantial potential in enhancing the accuracy of machine
learning models for imbalanced data domains, particularly IDS datasets.
1 INTRODUCTION
In today's dynamic cybersecurity landscape,
protecting networks against evolving cyber threats
remains paramount. Intrusion Detection Systems
(IDS) play a critical role in identifying and mitigating
malicious activities within network environments
(Vij and Saini, 2021). This study focuses on network-
based intrusion detection systems (NIDS) specialized
in detecting external threats aiming to breach network
defences. IDS employ two primary protective
approaches: location-based defence, guarding
networks or hosts against attacks, and data-centric
defence, preserving data integrity amidst a range of
malicious activities.
Among diverse cybersecurity methodologies,
machine learning uses historical data to refine threat
detection models (Seeber and Rodosek, 2015),
However, machine learning-based IDS face
challenges in training accurate models due to
imbalanced datasets, where certain classes are
underrepresented, skewing model performance.
Consequently, research efforts concentrate on
addressing this issue to enhance algorithm efficacy.
With escalating cyber threats, detection systems
must anticipate established and emerging risks.
Intricate networks and the internet provide a backdrop
for adversaries to blend with legitimate activities
(Abedzadeh and Jacobs, 2022). Thus, IDS
frameworks need to adapt to evolving landscapes and
attack vectors. Intrusion detection primarily falls into
two paradigms: signature-based and anomaly-based.
Abedzadeh, N. and Jacobs, M.
GANMCMCRO: A Generative Adversarial Network Markov Chain Monte Carlo Random Oversampling Algorithm for Imbalance Datasets.
DOI: 10.5220/0012259600003584
In Proceedings of the 19th International Conference on Web Information Systems and Technologies (WEBIST 2023), pages 587-594
ISBN: 978-989-758-672-9; ISSN: 2184-3252
Copyright © 2023 by SCITEPRESS Science and Technology Publications, Lda. Under CC license (CC BY-NC-ND 4.0)
587
Signature-based IDS identify known threats using
predefined patterns or signatures, while anomaly-
based methods focus on deviations from norms,
uncovering unexpected events or behaviours. By
learning typical benign behaviours, anomaly-based
IDS identify anomalies, presenting a more adaptive
approach compared to traditional methods.
Based on our comprehensive survey of various
resampling algorithms aimed at mitigating
imbalanced IDS datasets, it was observed that only a
singular study has explored the utilization of Markov
Chain Monte Carlo (MCMC) and Generative
Adversarial Network (GAN) methodologies for this
purpose (Abedzadeh and Jacobs, 2022). The
outcomes of this study indicated that the individual
application of MCMC, random oversampling, or
GAN techniques to the CSE-CIC-IDS2018 dataset
yielded insufficient outcomes in rectifying the
inherent data imbalance compared with Logistic
Regression (LgR). Consequently, to address this gap,
our paper introduces a novel approach termed
Generative Adversarial Networks Markov Chain
Monte Carlo random oversampling
(GANMCMCRO), wherein various combinations of
these methods, along with random oversampling, are
explored to determine if their integration can lead to
performance improvements.
This study forges ahead by introducing two novel
resampling strategies. Firstly, the innovative MCMC
algorithm is introduced, which generates synthetic
data using a distinctive methodology that merges
MCMC and random oversampling techniques.
Testing tests underscore its superiority over extant
resampling methods. Secondly, the MCMCRO
function takes canter stage as the data generator
within GANMCMCRO—a framework that
assimilates MCMCRO's synthesized data within a
GAN model. Importantly, GANMCMCRO's data
generator employs MCMCRO instead of random
data, signifying a distinctive methodology.
The contribution of this work includes two facets:
1) The introduction of MCMCRO, a novel
imbalanced dataset resampling technique.
2) The implementation of GANMCMCRO, an
innovative framework that seamlessly integrates
MCMCRO-generated data into a GAN model for
novel model training.
Finally, a comprehensive evaluation that
benchmarks the performance of MCMCRO and
GANMCMCRO against established algorithms,
further solidifying their efficacy and innovation
within the realm of imbalanced dataset handling and
intrusion detection.
2 RELATED WORKS
The application of machine learning, Deep Learning,
and Reinforcement Learning has been widely
explored in various research papers to address the
challenge of imbalanced datasets (Abedzadeh and
Jacobs, 2023). This is particularly relevant in IDSs
where these methodologies are leveraged for
detecting malicious cyber-attacks. While these
algorithms have been employed for such purposes,
the imbalance in IDS datasets has garnered limited
attention in comparison. (Abedzadeh and Jacobs,
2022) conducted a comprehensive survey to explore
techniques aimed at addressing imbalance within IDS
datasets. Their work, delved into the intricacies of
handling imbalanced data, shedding light on the
methods prevalent in the domain. Meanwhile,
(Subiksha et al., 2021) provided a critical perspective
on the landscape of machine learning techniques for
IDS by conducting a survey focused on the CSE-CIC
IDS dataset. Remarkably, their findings indicated that
merely a minority of examined IDS research (18%)
ventured into the creation of novel algorithms to
rectify dataset imbalance, with the majority opting for
established methodologies.
(Liu et al., 2020) undertook a comprehensive
investigation into classification algorithms,
encompassing a diverse range including Random
Forest, Support Vector Machine (SVM), XGBoost,
LSTM, Mini-VGGNet, and AlexNet. In their study,
they introduced the innovative Difficult Set Sampling
Technique (DSSTE) to effectively mitigate the
challenge of dataset imbalance. This approach
involves a meticulous division of the dataset into
distinct 'easy' and 'difficult' subsets, strategically
leveraging the Edited Nearest Neighbour (ENN)
algorithm. The 'difficult' subset consists of instances
that share striking similarities, rendering their
differentiation challenging, while the 'easy' subset
comprises instances that are more discernible. By
employing K-Means clustering on the 'easy' subset
and intelligently compressing majority samples, the
authors iteratively merged the refined dataset with a
minority subset extracted from the 'difficult' set. This
systematic approach yielded promising results,
achieving an accuracy of 0.8284 and F1-score of
0.8166 on the NSL-KDD dataset.
(Akila et al., 2019) proposed a hybrid sampling
technique for addressing imbalanced datasets,
integrating oversampling using Synthetic Minority
Over-sampling Technique (SMOTE) and
undersampling through the ENN technique. The
proposed approach was rigorously evaluated on the
NSL-KDD dataset with various classifiers including
DMMLACS 2023 - 3rd International Special Session on Data Mining and Machine Learning Applications for Cyber Security
588
Random Forest, SVM, and Naive Bayes. The hybrid
method exhibited remarkable performance metrics,
achieving accuracy of 98.17%, precision of 96.32%,
recall of 99.84%, and F1-score of 97.02%.
Additionally, Generative Adversarial Networks
(GANs) have been applied in various domains,
including balancing datasets. (Shi et al., 2021)
introduced Sample Equalization for Intrusion
Detection System (SE-IDS), which employs GANs
for balancing industrial network datasets. They
employed a filtering mechanism to under-sample the
majority class, achieving impressive results with
AUC of 99.12, macro F1 of 98.10, macro recall of
97.23, and macro precision of 99.03. In parallel
endeavours, the CSE-CIC-IDS2018 dataset
witnessed separate applications of MCMC and GANs
as resampling techniques, with their efficacy
benchmarked against LgR as a reference. While LgR
showcased peak performance without resampling,
achieving notable accuracy (0.88), precision (0.88),
and recall (0.99), MCMC and GANs emerged as
alternative approaches for addressing data imbalance.
However, their standalone outcomes underscored the
insufficiency of MCMC and GANs as individual
resampling strategies (Abedzadeh and Jacobs, 2022)
Inspired by these findings, we embarked on an
exploration to synergize MCMC, oversampling, and
GANs in a novel framework—GANMCMCRO. This
endeavour aims to ascertain whether their combined
potential could pave the way for effective resampling
of imbalanced IDS datasets, an avenue hitherto
unexplored by existing methodologies.
3 METHODOLLOGY
3.1 Dataset
The CSE-CIC-IDS2018 dataset
(SOLARMAINFRAME, 2018) widely recognized in
the field of IDS, comprises 10 distinct files that capture
network traffic over varying timestamps which was
simulated during early 2018. It exhibits a diverse range
of network traffic attributes, including features related
to packet sizes, durations, protocols, and source-
destination pairs. Notably, this dataset is derived from
a variety of network-based attacks, including DoS,
DDoS, and port scanning. Each file, generated from
network traffic of a single day, encompasses
approximately 79 attributes. The dataset's binary
classification categorizes instances as either benign or
infiltration, maintaining an imbalanced distribution
with a ratio of 1 infiltration instance to 8 benign
instances. For the present study, the dataset dated
02/28/2018 is chosen for evaluation due to its
pronounced class imbalance, containing 540,568
benign instances and 68,462 infiltration instances.
3.2 Pre-Processing
Pre-processing involves addressing missing values
and outliers, followed by the encoding of categorical
data. Subsequently, dimensionality reduction is
carried out using the forward selection algorithm
(Galit, et al., 2019). This iterative method starts with
an empty feature set and gradually incorporates the
most significant variables, employing a Regression
algorithm to select attributes with statistically
significant (p<0.05) and high R^2 values, indicative
of their relevance to the binary classification label
(benign or infiltration). It is noteworthy that the steps
are not applied to the test set, or any other new
unlabelled data that require classification. Upon
employing this algorithm on the CSE-CIC-IDS2018
Dataset, a subset of 38 variables out of the original 79
is retained. Then, a standardization and normalization
process are applied to the data, with the label column
being encoded as 0 for Benign and 1 for Infiltration
instances. Ultimately, the data is partitioned into a
training set comprising 60% of the data and a testing
set encompassing 40%, thereby facilitating the testing
of the proposed models and the number of records did
not change after pre-processing.
3.3 Method
This section explains the MCMC algorithm, the
MCMCRO algorithm, and the GANMCMCRO
framework in a clear manner, highlighting their
unique features. In selecting the machine learning
algorithms for our experimentation on the CSE-CIC-
IDS2018 dataset, we aimed to cover a diverse range
of approaches commonly used in the field of intrusion
detection systems. The choice of algorithms was
based on their popularity in previous research studies,
as well as their potential suitability for the task at
hand. The parameters were selected through a
combination of empirical experimentation and
established best practices recommended in relevant
research. Figure 1 illustrates the method's steps,
starting from initial data preparation to achieving a
well-suited dataset for Machine Learning. We use the
Python programming language for implementation,
in line with current practices.
The process begins by dividing the pre-processed
training data into categorical and continuous parts.
The innovative MCMCRO algorithm generates data
using two strategies. Random Sampling is used for
GANMCMCRO: A Generative Adversarial Network Markov Chain Monte Carlo Random Oversampling Algorithm for Imbalance Datasets
589
the categorical part, while the MCMC technique is
applied to the continuous part. The MCMCRO
algorithm skilfully combines these distinct datasets,
introducing a fresh approach for creating balanced
datasets.
Building on this, the GANMCMCRO framework
takes things further by taking the MCMCRO
algorithm into the generator module of the GAN
algorithm as an input, pushing the field's boundaries.
Both MCMCRO and GANMCMCRO are conceived
as novel contributions in the realm of resampling
algorithms designed explicitly for addressing the
dataset imbalance in the context of IDS. Subsequent
sections (1, 2, and 3) delve into the nuanced
descriptions of each algorithm.
Figure 1: The process of MCMCRO and GANMCMCRO.
3.3.1 MCMC Algorithm
The MCMC algorithm is a way to estimate data
patterns using probability and sampling methods
(Jason, 2019) It helps to gather related samples,
which is different from the separate sampling in
regular Monte Carlo (Rosenbluth, 2023). Leveraging
the Metropolis-Hastings (MH) algorithm (Metropolis
and Ulam, 1949) a pivotal MCMC implementation,
the procedure exhibits a complexity ranging from
lower bound O(d) to upper bound O(d^2), contingent
upon the dimensionality of the parameter space,
denoted as 'd'. The MH algorithm is particularly
useful when direct calculation of next state
probability distribution is unfeasible. Within this
project, the MH algorithm is harnessed for MCMC
realization (Moukarzel, 2018). The algorithm's
involves drawing a sample from a symmetric
distribution, evaluating the likelihood of the
generated data within the distribution, and
subsequently confirming acceptance based on the
fidelity of the new data's selection from the
distribution. Crucially, the logarithm of both the prior
and likelihood functions is employed for acceptance
or rejection determination, assuring numerical
stability and preventing underflow issues.
3.3.2 MCMCRO Algorithm
The MCMCRO algorithm combines the strengths of
both methodologies, strategically leveraging MCMC
for continuous variables and harnessing random
oversampling for categorical counterparts. The
algorithm's intricate mechanics are detailed below.
Experimental testing underscored the inadequacy of
solely oversampling the CSE-CIC-IDS2018 dataset,
as random data generation might not yield substantial
performance improvements (Abedzadeh and Jacobs,
2022). Consequently, the MCMCRO of algorithm
emerges as a solution, seamlessly fusing MCMC and
random oversampling strategies. For categorical
variables, MCMCRO artfully selects values from the
category list, while continuous data undergoes
MCMC-based synthesis, as elucidated in the Method
section. The algorithm's complexity mirrors that of
MCMC, with a lower bound of O(d) and an upper
bound of O(d^2), where 'd' represents the dimension
of the parameter space for continuous variables.
Simultaneously, the time complexity for categorical
variables remains optimal at O(1). Subsequently, the
data synthesized by the MCMCRO algorithm
Algorithm 1: The proposed MCMCRO algorithm.
Input: minority data
Output: newly generated data for the minority class
1. Select Minority Data:
1.1. Identify and isolate the minority class
instances from the dataset.
2. Generate Synthetic Data:
For each variable in the dataset, perform the
following steps:
2.1. Categorical Column:
If the current variable is categorical:
Randomly select one category from the pre-
existing data.
Since the data points are independent, random
selection of a category is appropriate for
generating new instances.
2.2. Continuous Column:
Else, if the current variable is continuous:
Apply the Markov Chain Monte Carlo (MCMC)
algorithm to generate synthetic data.
MCMC techniques leverage the current data
distribution to generate new points that
approximate the distribution.
3. Append Generated Data:
Incorporate the newly generated synthetic data
into the original dataset. This integration enriches
the dataset with artificial instances that align with
the minority class characteristics.
DMMLACS 2023 - 3rd International Special Session on Data Mining and Machine Learning Applications for Cyber Security
590
seamlessly integrates into the GANMCMCRO
framework, serving as pivotal input for training both
generator and discriminator models while iteratively
refining their weight updates. This methodological
integration epitomizes the distinctive contribution of
this work detailed in Algorithm 1.
3.3.3 GANMCMCRO Algorithm
The GANMCMCRO framework combines GANs
and the new MCMCRO resampling algorithm in a
unique way, introducing an innovative approach, as
outlined in Algorithm 2. GANs, an exemplar of
unsupervised Deep Learning, orchestrate a dynamic
interplay between two adversarial Deep Neural
Networks (DNNs) (Goodfellow et al., 2014).
Algorithm 2: The GANMCMCRO algorithm.
input: original dataset,
output: rained generator model
1. Define Discriminator:
1.1. Create a Neural Network model with 128 nodes.
1.2. Set 100 epochs for training.
1.3. Use the Dense function.
1.4. Implement LeakyReLU activation with an alpha
of 0.2.
1.5. LeakyReLU helps mitigate the dying ReLU
problem.
2. Define Generator:
2.1. Create a Neural Network model with 128 nodes.
2.2. Set 100 epochs for training.
2.3. Use the Dense function.
2.4. Implement LeakyReLU activation with an alpha
of 0.2.
3. Create Combined Model:
3.1. Build a sequential model by combining the
generator and discriminator.
3.2. Utilize the Adam optimizer.
3.3. Adam adjusts learning rates for each network
weight [15].
3.4. This model will be used to update the generator.
4. Load Original Dataset:
4.1. Import data from the original dataset for training.
5. Training Phase:
5.1. Train Discriminator:
5.1.1. Train discriminator using real data from the
dataset.
5.2. Train Generator:
5.2.1. Train generator using data generated by the
MCMCRO algorithm.
5.3. Save Generator Model:
5.3.1. Save the trained generator model for future use.
The generator crafts synthetic data by interpolating
training data, while the discriminator undertakes the
intricate task of distinguishing between authentic and
generated data (Jason, 2019). The generator operates
under a probabilistic framework, crafting data to
deceive the discriminator to minimize the
discrepancy between predicted and expected
outcomes through a minmax equation given in
Equation 1 (Jason, 2019). The inherent challenge of
GANs lies in maintaining equilibrium between the
networks, as the convergence speeds of the two
DNNs can diverge (Zhang, 2017).
𝑚𝑖𝑛𝑚𝑎𝑥𝑉
𝐺,𝐷
= 𝐸
[𝑙𝑜𝑔𝐷
𝑥
+ 𝐸
[log
1−𝐷𝐺
𝑧
]
Equation 1 minmax algorithm
The GANMCMCRO framework harnesses the
innovative MCMCRO resampling algorithm for data
generation within the GAN paradigm. By training the
discriminator on genuine data and employing the
MCMCRO algorithm within the generator, the
GANMCMCRO approach forges a symbiotic
relationship that optimally leverages both
methodologies. This symbiosis ultimately culminates
in the generation of a discriminator-enhanced
generator model, capable of engendering synthetic
data closely aligned with authentic instances. The
resultant generator model, honed through the unique
GANMCMCRO framework, emerges as a potent tool
for data synthesis. Upon training, the framework
amalgamates the newly generated data with the
original dataset, harmonizing the distribution to
rectify imbalances. This innovative data balancing
methodology is rigorously evaluated in the
subsequent sections through an array of machine
learning algorithms, demonstrating its contribution in
addressing the pivotal issue of imbalanced datasets.
3.4 Testing
To test our approach's effectiveness, we undertake a
rigorous evaluation of both MCMCRO and
GANMCMCRO using the CSE-CIC-IDS2018
dataset. Past studies have emphasized the limited
superiority of conventional methods like MCMC,
GAN, and random oversampling in comparison to
LgR on the IDS dataset (Abedzadeh and Jacobs,
2022). Our focus shifts to a comparative assessment
involving notable machine learning algorithms,
including LgR to ascertain the suitability of the
MCMC and GAN combination, as a new systematic
resampling technique for imbalanced datasets.
GANMCMCRO: A Generative Adversarial Network Markov Chain Monte Carlo Random Oversampling Algorithm for Imbalance Datasets
591
3.4.1 MCMCRO Testing
The testing process initiates with the dataset division
into distinct training and testing subsets. Within this
framework, we thoroughly train various machine
learning algorithms—Decision Tree (DT), Random
Forest (RF), LgR, Bagging (Bag), Extra Tree (ET), K
Nearest Neighbour (KNN), Linear SVC (LSVC),
Linear Discriminant Analysis (LDA), and Easy
Ensemble (EE)—on the training subset, focusing on
highlighting MCMCRO's impact. Parameters for all
algorithms remain at their default settings. Following
this training phase, each model undergoes a
comprehensive testing procedure to classify the
testing subset. The assessment includes accuracy,
precision, recall, F1 score, temporal efficiency, and t-
test (p-value) analysis. Accuracy shows out of all the
activities the system looked at, how many did it get
right? Precision represents when the system says
there's an intrusion, how often is it actually correct?
F1 Score finds a good balance between avoiding false
positives (wrongly identifying normal activity as
intrusion) and false negatives (missing actual
intrusions). Recall shows out of all the actual
intrusions.
3.4.2 GANMCMCRO Testing
Similar analyses are conducted for the
GANMCMCRO algorithm, employing consistent
dataset partitioning and assessing performance on
distinct training and testing sets. We present results
with and without the GANMCMCRO algorithm,
comparing classification metrics for the best and
worst-performing machine learning algorithms under
both scenarios
4 EXPERIMENTS
4.1 MCMCRO Testing
The discernible strides achieved through the
MCMCRO algorithm are most evident in the
substantial improvement witnessed across crucial
performance indices. This comprehensive
comparison is depicted in Table 1, wherein we
showcase the classification metrics of machine
learning algorithms both with and without the
MCMCRO algorithm applied to the CSE-CIC-
IDS2018 Dataset which was described in the
Methodology section.
Table 1: Comparing accuracy, precision, recall, F1 score, and time of machine learning Algorithms with/without MCMCRO
algorithm. The machine learning algorithms are Decision Tree (DT), Logistic Regression (LgR), Bagging Classifier (Bag),
Random Forest (RF), Extra Tree (ET), K Nearest Neighbor (KNN), Linear SVC (LSVC), Easy Ensemble (EE), and Linear
Discriminant Analysis (LDA). Yellow highlights the best performance and blue highlights the worst performance.
Parameters/Algo
rithms
DT LgR Bag RF ET KNN LSVC LDA EE
accuracy
without
MCMCRO
0.83 0.81 0.84 0.86 0.83 0.88 0.89 0.89 0.48
accuracy with
MCMCRO
0.86 0.87 0.87 0.89 0.87 0.90 0.91 0.91 0.91
precision
without
MCMCRO
0.50 0.47 0.49 0.49 0.48 0.54
0.47 0.47
0.53
precision with
MCMCRO
0.82 0.94 00.86 0.88 0.84 0.91 0.94 0.94 0.94
F1 score without
MCMCRO
0.50 0.47 0.49 0.49 0.48 0.49 0.47 0.47
0.42
F1 score with
MCMCRO
0.81 0.87 0.83 0.84 0.82 0.86 0.87 0.87 0.87
recall without
MCMCRO
0.50 0.47 0.49 0.50 0.49 0.51 0.50 0.50 0.58
recall with
MCMCRO
0.80 0.83 0.81 0.82 0.81 0.83 0.83 0.83 0.83
time without
MCMCRO
17.99 0.50 112.59 187.05 93.74 1513.09 32.38 1.95 123.37
time with
MCMCRO
0.0384 18.51 69.99 112.58 74.98 25.00 94.51 1.94 311.97
t-test
0.042 0.038 0.044 0.045 0.039 0.049 0.054 0.054 0.004
DMMLACS 2023 - 3rd International Special Session on Data Mining and Machine Learning Applications for Cyber Security
592
Table 2: Comparing accuracy, precision, recall, F1 score, and time of machine learning Algorithms with/without
GANMCMCRO algorithm. The machine learning algorithms are Decision Tree (DT), Logistic Regression (LgR), Bagging
Classifier (Bag), Random Forest (RF), E.
Parameters/Algo
rithms
DT LgR Bag RF ET KNN LSVC LDA EE
accuracy without
GANMCMCRO
0.83 0.89 0.84 0.86 0.83 0.88 0.89 0.89 0.48
accuracy with
GANMCMCRO
0.81 0.86 0.85 0.87 0.85 0.76 0.86 0.83 0.88
precision without
GANMCMCRO
0.50 0.47 0.49 0.49 0.48 0.54
0.47 0.47
0.53
precision with
GANMCMCRO
0.79 0.84 0.83 0.85 0.82 0.73 0.84 0.80 0.87
F1 score without
GANMCMCRO
0.50 0.47 0.49 0.49 0.48 0.49 0.47 0.47
0.42
F1 score with
GANMCMCRO
0.78 0.83 0.83 0.85 0.83 0.73 0.83 0.81 0.86
recall without
GANMCMCRO
0.50 0.50 0.49 0.50 0.49 0.51 0.50 0.50 0.58
recall with
GANMCMCRO
0.77 0.83 0.84 0.85 0.84 0.73 0.83 0.82 0.86
time without
GANMCMCRO
17.99 5.95 112.59 187.05 93.74 1513.09 32.38 1.95 123.37
time with
GANMCMCRO
29.60 5.35 223.49 371.23 125.55 9038.73 94.57 2.23 597.04
t-test
0.072 0.075 0.053 0.053 0.048 0.216 0.075 0.097 0.002
4.2 GANMCMCRO Testing
A detailed dissection of the accuracy shifts, before
and after the GANMCMCRO algorithm's application,
underscores LgR, LDA, and LSVC as pre-
GANMCMCRO leaders, achieving an accuracy of
0.89 as detailed in Table 2. However, the post-
GANMCMCRO landscape redefines excellence,
with EE commanding an accuracy of 0.88, thereby
setting a new benchmark. Precision disparities
manifest as KNN exhibits pre-GANMCMCRO
superiority with a precision of 0.54, while EE ascends
as the post-GANMCMCRO victor with a resolute
precision of 0.87.
5 DISCUSSION
Within this paper, a pioneering contribution is
unveiled through the introduction of two novel
resampling algorithms named MCMCRO and
GANMCMCRO, further augmenting its potential.
While our proposed algorithms exhibit significant
promise, it's noteworthy to mention a current
limitation regarding their computational time. The
findings of the previous research (Abedzadeh and
Jacobs, 2022) revealed notable metrics, with the LgR
algorithm trained on different data from the same
dataset achieving an accuracy of 0.88, precision of
0.88, and a commendable recall of 0.99. This proved
to be the highest performance level compared with
MCMC, random oversampling, and GAN. In the
present study, we undertake a comprehensive
comparison between our results obtained with the
LgR algorithm before and after the implementation of
both the MCMCRO and GANMCMCMRO
resampling techniques. The primary aim of this
analysis is to discern the potential enhancement that
these resampling methodologies can bring to the
performance of machine learning algorithms,
specifically the LgR algorithm. By evaluating the
changes in performance metrics following the
application of these techniques, we seek to provide
insights into the effectiveness and impact of these
resampling strategies on model performance and
overall predictive capabilities.
A pivotal aspect of our work lies in the
performance evaluation of different machine learning
algorithms under the influence of MCMCRO and
GANMCMCRO. The outcome of this analysis
highlights the commendable performance of linear
discriminant analysis as the optimal machine learning
algorithm within the MCMCRO context and easy
ensemble within the GANMCMCRO context.
GANMCMCRO: A Generative Adversarial Network Markov Chain Monte Carlo Random Oversampling Algorithm for Imbalance Datasets
593
6 CONCLUSION
In the realm of Intrusion Detection Systems (IDS), a
critical bastion safeguarding vulnerable network
environments from various covert attacks, the
challenge arises from these threats' ability to mimic
legitimate network actions. This proves to be a
significant hurdle for machine learning algorithms,
which struggle due to the scarcity of malicious
instances for effective training. In response, this study
introduces pioneering solutions: the MCMCRO and
GANMCMCRO algorithms. MCMCRO creatively
tackles imbalanced datasets by generating synthetic
data, achieving equilibrium in the CSE-CIC-IDS2018
Dataset. Expanding on this, the GANMCMCRO
framework blends MCMCRO with GANs,
augmenting data generation and balance.
The impact of these innovations resonates in
experimental outcomes. Integrating MCMCRO with
linear discriminant analysis redefines Infiltration
activity detection, yielding substantial precision,
recall, F1 score, and accuracy enhancements. These
advancements are statistically significant (p<0.054),
reflecting the potency of the approach. Beyond
compatibility with diverse machine learning
algorithms, the optimized integration within
GANMCMCRO showcases adaptability and
effectiveness, as seen in the results applying
MCMCRO and GANMCMCRO to the CSE-CIC-
IDS2018 Dataset alongside Easy Ensemble.
In the landscape of IDS and imbalanced data
handling, these contributions mark a pioneering
milestone, forging innovative pathways for network
security. With an anticipation of future extensions to
complex datasets, this approach empowers machine
learning algorithms for precise network defence.
REFERENCES
C. Vij and H. Saini,Intrusion Detection Systems:
Conceptual Study and Review,” 2021 6
th
International
Conference on Signal Processing, Computing and
Control (ISPCC), 2021, pp. 694-700.
S. Seeber and G.D. Rodosek, “Towards an adaptive and
effective IDS using OpenFlow”, IFIP International
Conference on Autonomous Infrastructure
Management and Security, pp. 134-139, 2015, June.
N. Abedzadeh, M. Jacobs, “A Survey in Techniques for
Imbalanced IDS datasets,” ICICCS 2022: 16.
International Conference on Intelligent Computing and
Control Systems, August 08-09, 2022, in Montreal,
Canada.
N. Abedzadeh, M. Jacobs, “Using Markov Chain Monte
Carlo Algorithm for Sampling Imbalance Binary IDS
datasets,” 12
th
International Workshop on Security,
Privacy, Trust for Internet of Things (IoTSPT) at the
ICCCN 2022, IEEE, July 2022 in Hawaii.
N. Abedzadeh, M. Jacobs, “A Reinforcement Learning
Framework with Oversampling and Undersampling
Algorithms for Intrusion Detection System,” AIADFSB
journal of Applied Sciences, 2023.
Subiksha Srinivasa Gopalan1, Dharshini Ravikumar1, Dino
Linekar, Ali Raza1, Maheen Hasib, “Balancing
Approaches towards ML for IDS: A Survey for the
CSE-CIC IDS dataset,” 2020 International Conference
on Communications, Signal Processing, and their
Applications (ICCSPA), IEEE, 2021.
Lan Liu 1, Pengcheng Wang, Jun Lin, and Langzhou Liu,
“Intrusion Detection of Imbalanced Network Traffic
Based on Machine Learning and Deep Learning,” IEEE
access, 2020.
Akila, R., Elayaraja, E., & Sudha, A. (2019). Handling
Imbalanced Intrusion Detection System Data using
Hybrid Sampling and Machine Learning Techniques.
International Journal of Computer Applications,
975(8887), 13-17.
P. Shi, X. Chen, X. Kong and X. Cao, "SE-IDS: A Sample
Equalization Method for Intrusion Detection in
Industrial Control System," 2021 36th Youth Academic
Annual Conference of Chinese Association of
Automation (YAC), Nanchang, China, 2021, pp. 189-
195, doi: 10.1109/YAC53711.2021.9486601.
SOLARMAINFRAME, "IDS 2018 Intrusion CSVs (CSE-
CIC-IDS2018)," https://www.kaggle.com/datasets/
solarmainframe/ids-intrusion-csv, 2018.
Galit Shmueli, Peter C. Bruce, Peter Gedeck, Nitin R. Patel,
“Data Mining for Business Analytics: Concepts,
Techniques and Applications in Python”, Mathematics,
Nov 5, 2019.
Jason Brownlee, "A Gentle Introduction to Markov Chain
Monte Carlo for Probability,”, https://machinelearning
mastery.com/markov-chain-monte-carlo-for-probabili
ty/, 2019
M. N. Rosenbluth, “in The Monte Carlo Method in the
Physical Sciences”, edited by J. E. Gubernatis
(American Institute of Physics), New York, 200
N. Metropolis and S. Ulam, “The Monte Carlo method,” J.
Am. Stat. Assoc., vol. 44, no. 247, pp. 335–341, 1949.
Joseph Moukarzel, “From Scratch: Bayesian Inference,
Markov Chain Monte Carlo and Metropolis Hastings,
in python,” https://towardsdatascience.com/from-
scratch-bayesian-inference-markov-chain-monte-
carlo-and-metropolis-hastings-in-python-
ef21a29e25a, Nov 2018.
Goodfellow et al, Generative Adversarial Networks (2014)
Jason Brownlee, “How to Develop a Conditional GAN
(cGAN) From Scratch”, https://machinelearningmaste
ry.com/how-to-develop-a-conditional-generative-adve
rsarial-network-from-scratch/, July 2019
H. Zhang, C. Luo, X. Yu, and P. Ren, “Mcmc based
generative adversarial networks for handwritten
numeral augmentation,” in Proc. Int. Conf. Commun.
Signal Process. Syst., 2017, pp. 2702–2710.
DMMLACS 2023 - 3rd International Special Session on Data Mining and Machine Learning Applications for Cyber Security
594