
and utilized four machine learning classifiers, namely
k-nearest neighbor (KNN), naïve Bayes (NB), random
forest (RF), and decision tree (DT), to evaluate their
effectiveness. The researchers used a total of seven
features, including packet duration, packet length,
and port information. Although the findings indicated
that their machine learning models could successfully
identify username enumeration attacks with improved
performance, the study lacked testing with
real-world data.
The authors in (Hynek et al., 2020) present a
novel approach for detecting SSH brute-force attacks
in high-speed networks using machine learning. The
detection system architecture includes data prepro-
cessing, an ML-based detector, and a knowledge base
for post-processing detected events. Unlike host-
based methods, this network-level approach captures
detailed traffic information, including packet lengths
and inter-packet times. The authors created a
dataset from real network traffic with over 30,000 la-
beled SSH biflow records, half of which are brute-
force attacks. They evaluated over 70 features and se-
lected 11 that provided good detection accuracy using
the AdaBoosted Decision Tree model.
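To make the kind of flow-level input concrete, the following is a minimal sketch of deriving packet-length and inter-packet-time statistics from a single flow. The function name `flow_features` and the chosen statistics are illustrative assumptions; they are not the 11 features selected by Hynek et al.

```python
from statistics import mean


def flow_features(packets):
    """Summarize one flow given (timestamp_seconds, length_bytes) tuples.

    Hypothetical helper: the statistics below are examples only.
    """
    ordered = sorted(packets)  # order packets by timestamp
    lengths = [length for _, length in ordered]
    times = [ts for ts, _ in ordered]
    # Inter-packet time: gap between each pair of consecutive packets.
    gaps = [later - earlier for earlier, later in zip(times, times[1:])]
    return {
        "packet_count": len(ordered),
        "mean_length": mean(lengths) if lengths else 0.0,
        "max_length": max(lengths, default=0),
        "mean_gap": mean(gaps) if gaps else 0.0,
        "max_gap": max(gaps, default=0.0),
    }
```

A feature dictionary of this shape could then be passed to any classifier; the choice of model is independent of this step.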
The paper described in (Wanjau et al., 2021) proposes
a CNN-based model to detect brute-force attacks
on SSH logs. The authors identify the increasing
difficulty of detecting these attacks due to the high
speed and volume of network traffic, which often ob-
scures malicious activities. The model is trained using
the CIC-IDS 2018 dataset, which includes contem-
porary benign and malicious network activities. The
researchers employ feature selection and data nor-
malization techniques to preprocess the data, trans-
forming it into images suitable for CNN process-
ing. The results show that the CNN-based model
significantly outperforms traditional machine learn-
ing methods such as Naive Bayes, Logistic Regres-
sion, Decision Tree, k-Nearest Neighbour, and Sup-
port Vector Machine in detecting SSH brute-force at-
tacks.
The paper described in (Garre et al., 2021) pro-
poses a machine learning-based approach for detect-
ing SSH botnet infections. This research addresses
the exponential increase in botnet activity, exacer-
bated by zero-day attacks and obfuscation techniques,
which traditional detection methods struggle to man-
age. The authors utilized High-Interaction Honeypots
(HIH) to capture detailed attack behaviors and log
data, creating a dataset consisting of executed com-
mands and network information during SSH sessions.
This dataset was used to train a supervised learning
model to identify botnet infections during the initial
infection phase. This study underscores the poten-
tial of machine learning techniques in enhancing early
botnet detection and preventing compromised devices
from participating in malicious activities.
Our observations on the past approaches to SSH
attack detection are as follows:
• Most proposed solutions, regardless of the tech-
nology used, focus on detecting individual attacks
separately. To the best of our knowledge, none of
them consider the entire spectrum of attack sce-
narios possible on SSH.
• Rule-based approaches are less complex to imple-
ment and can effectively detect malicious behav-
ior. However, they are not robust against sophisti-
cated and obfuscated attack strategies.
3 PROPOSED METHODOLOGY
In this section, we describe the architecture of our
proposed SSH log-based attack detection and classi-
fication system for attacker profiling. Our method-
ology integrates rule-based techniques with machine
learning algorithms to create a robust, multi-faceted
defense mechanism. The architecture, illustrated in
Figure 1, is designed to parse, process, and analyze
SSH logs using a dual analytical strategy. It employs
predefined security rules for immediate threat iden-
tification, while simultaneously using predictive ma-
chine learning models for deeper analysis and classi-
fication.
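The dual strategy can be pictured with a minimal sketch in which the same parsed log entries feed both a rule check and a classifier. Everything here is an assumption made for illustration: the failed-login threshold rule, the two per-IP features, the `classify` callable standing in for a trained model, and the synthetic log lines (which follow the usual OpenSSH "Failed password" message format). It does not reproduce the rules or models of the proposed system.

```python
import re
from collections import Counter

# Matches OpenSSH "Failed password" messages; other message types are ignored.
FAILED_RE = re.compile(
    r"Failed password for (?:invalid user )?(?P<user>\S+) from (?P<ip>\S+) port \d+"
)


def parse_failures(lines):
    """Yield (ip, user) for every failed-password line."""
    for line in lines:
        match = FAILED_RE.search(line)
        if match:
            yield match.group("ip"), match.group("user")


def rule_based_alerts(failures, threshold=3):
    """Hypothetical rule: flag an IP with at least `threshold` failed logins."""
    counts = Counter(ip for ip, _ in failures)
    return {ip for ip, n in counts.items() if n >= threshold}


def extract_features(failures):
    """Per-IP feature vector: (failed attempts, distinct usernames tried)."""
    counts, users = Counter(), {}
    for ip, user in failures:
        counts[ip] += 1
        users.setdefault(ip, set()).add(user)
    return {ip: (counts[ip], len(users[ip])) for ip in counts}


def analyze(lines, classify, threshold=3):
    """Run the rule path and the model path over the same parsed entries."""
    failures = list(parse_failures(lines))
    alerts = rule_based_alerts(failures, threshold)
    labels = {ip: classify(vec) for ip, vec in extract_features(failures).items()}
    return alerts, labels


# Synthetic example lines (documentation-range IP addresses).
SAMPLE = [
    "Jun 16 10:00:01 host sshd[101]: Failed password for invalid user admin from 203.0.113.5 port 40001 ssh2",
    "Jun 16 10:00:02 host sshd[102]: Failed password for root from 203.0.113.5 port 40002 ssh2",
    "Jun 16 10:00:03 host sshd[103]: Failed password for invalid user test from 203.0.113.5 port 40003 ssh2",
    "Jun 16 10:05:00 host sshd[104]: Failed password for alice from 198.51.100.7 port 50001 ssh2",
    "Jun 16 10:05:09 host sshd[105]: Accepted password for alice from 198.51.100.7 port 50002 ssh2",
]


def toy_classifier(vec):
    """Stand-in for a trained model: many distinct usernames look suspicious."""
    return "suspicious" if vec[1] > 1 else "benign"


alerts, labels = analyze(SAMPLE, toy_classifier)
```

In this sketch the rule path yields an immediate set of flagged addresses, while the model path assigns a label to every source address, including those that stay below the rule threshold.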
Data Collection – The raw SSH log data for four
months, spanning from June 16, 2021, to October 10,
2021, are collected from an SSH server hosted in the
cloud. Additionally, we gather Cowrie log data for
six months, from March 1, 2023, to August 23, 2023.
Both datasets include various types of attacks, which
could be either manual or automated. We designate
the Cowrie Honeypot data as D0, which comprises
5,941,378 log entries. We divide the SSH server-
generated log dataset into two parts: D1 and D2. D1
contains data from June 16, 2021, to September 17,
2021, amounting to 3,312,998 log entries, while D2
encompasses data from September 18, 2021, to
October 10, 2021, with 199,853 log entries. We first
use the D0 dataset for pattern-based feature
extraction. The D1 dataset is then employed for
initial training and testing, and the D2 dataset is used
to evaluate the performance of the predictive models on
previously unseen data.
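The chronological split of the SSH server logs can be sketched as follows. The date boundaries are the ones stated above; the function name `split_by_date` and the assumption that each entry already carries a parsed `date` are illustrative only.

```python
from datetime import date

# Date ranges as stated in the text (both ends inclusive).
D1_START, D1_END = date(2021, 6, 16), date(2021, 9, 17)
D2_START, D2_END = date(2021, 9, 18), date(2021, 10, 10)


def split_by_date(entries):
    """Split (date, raw_line) pairs into D1 (train/test) and D2 (unseen data).

    Hypothetical helper: entries outside both ranges are dropped.
    """
    d1, d2 = [], []
    for day, line in entries:
        if D1_START <= day <= D1_END:
            d1.append(line)
        elif D2_START <= day <= D2_END:
            d2.append(line)
    return d1, d2
```

Splitting by time rather than at random keeps D2 strictly later than D1, so the evaluation on D2 resembles deployment on future traffic. With the counts reported above, D2 accounts for about 5.7% of the server log entries.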
ICISSP 2025 - 11th International Conference on Information Systems Security and Privacy