estimation to estimate the parameters. The model is
used to classify URLs into two classes, ‘malicious’
or ‘benign’. It carries a modest computational burden
and is simple to understand and implement. To
implement this model, it is necessary to assume that
the training data can be fitted, at least approximately,
by a logistic distribution, which allows the parameters
to be estimated by maximum likelihood estimation.
Previous work (Chiramdasu et al 2021; Vanitha et al
2019) has shown that logistic regression is a viable
classifier for separating malicious from benign URLs
and sometimes outperforms other traditional machine
learning algorithms such as K neighbors. One
advantage of the logistic regression model is its
modest use of computational resources, which makes
it relatively efficient in processing and training time.
Additionally, logistic regression is simple to
implement and easy to interpret, so users can readily
understand and apply it to predictive modeling tasks.
In this research, the LR model is expected to perform
well overall on the very large dataset and on the
hybrid feature combination because of these
properties. Nevertheless, the LR model may not fit
well on the selected feature groups.
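As an illustration of the maximum likelihood step described above, the following is a minimal sketch that fits a logistic regression by gradient ascent on the average log-likelihood. It is not the implementation used in this study. The two toy features (a scaled URL length and a count of special characters), the learning rate, and the function names are all assumptions made for the example.

```python
import math

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def fit_logistic_regression(X, y, lr=0.5, epochs=2000):
    """Estimate weights and bias by gradient ascent on the average log-likelihood."""
    n, n_features = len(X), len(X[0])
    w, b = [0.0] * n_features, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * n_features, 0.0
        for xi, yi in zip(X, y):
            p = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b)
            err = yi - p  # derivative of the log-likelihood w.r.t. the logit
            for j in range(n_features):
                grad_w[j] += err * xi[j]
            grad_b += err
        w = [wj + lr * gj / n for wj, gj in zip(w, grad_w)]
        b += lr * grad_b / n
    return w, b

def predict(w, b, x, threshold=0.5):
    # Class 1 stands for 'malicious', class 0 for 'benign'.
    p = sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)
    return 1 if p >= threshold else 0

# Hypothetical toy data: [URL length / 100, number of special characters].
X = [[0.2, 0], [0.3, 0], [0.25, 1], [0.9, 3], [1.1, 4], [0.8, 3]]
y = [0, 0, 0, 1, 1, 1]
w, b = fit_logistic_regression(X, y)
predictions = [predict(w, b, x) for x in X]
```

Because the only learned quantities are one weight per feature and a bias, training cost grows linearly with the number of samples and features, which is the modest computational burden noted above.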
2.2.2 Neural Network
An Artificial Neural Network (ANN), also referred to
as a Multi-layer Perceptron (MLP), usually consists of
one input layer, one output layer and at least one
hidden layer between the input and output layers. The
more hidden layers it has, the more complex the
model becomes: it takes longer to train but can also
achieve higher prediction accuracy. The study by
A Aljofe et al (2020) explored the feasibility and
performance of neural networks for phishing URL
detection and proposed some features that might suit
a neural network. However, it does not take other
types of malicious URL into consideration, and it
suffers from noise introduced by some sensitive
words. This paper mainly uses a neural network model
with 'Relu' and 'Logistic' activations for training and
prediction. The neural network serves as the main
point of comparison with LR and KNN when
analyzing and assessing performance and differences,
because it is expected to score better on a large
dataset: the model has a strong capacity to extract
information from big data and to construct highly
complex models. However, this model takes longer to
train and also requires appropriate data
pre-processing.
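To make the layer structure concrete, the sketch below runs a single forward pass through a small MLP with either activation. It is an illustration only: the weights are random and untrained, the three input features are hypothetical, and the choice of a logistic output unit for the binary decision is an assumption of this example.

```python
import math
import random

def relu(z):
    return max(0.0, z)

def logistic(z):
    # Numerically stable logistic (sigmoid) function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

ACTIVATIONS = {"relu": relu, "logistic": logistic}

def init_layers(layer_sizes, seed=0):
    """Create one (weights, biases) pair per connection between layers."""
    rng = random.Random(seed)
    layers = []
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]):
        weights = [[rng.uniform(-1.0, 1.0) for _ in range(n_in)]
                   for _ in range(n_out)]
        layers.append((weights, [0.0] * n_out))
    return layers

def forward(layers, x, activation="relu"):
    """Return the output unit's value for a single input vector x."""
    act = ACTIVATIONS[activation]
    a = list(x)
    for i, (weights, biases) in enumerate(layers):
        z = [sum(w * v for w, v in zip(row, a)) + b
             for row, b in zip(weights, biases)]
        is_output = i == len(layers) - 1
        # Hidden layers use the chosen activation; the single output unit
        # uses the logistic function so the result lies between 0 and 1.
        a = [logistic(v) for v in z] if is_output else [act(v) for v in z]
    return a[0]

# 3 hypothetical input features, two hidden layers of 5 neurons, 1 output.
layers = init_layers([3, 5, 5, 1])
x = [0.4, 1.0, 0.0]
p_relu = forward(layers, x, "relu")
p_logistic = forward(layers, x, "logistic")
```

Each extra hidden layer adds another weight matrix to this loop, which is why deeper networks cost more to train and to evaluate.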
As the dataset contains over 20,000 samples, ‘adam’
(Jais et al 2019) is chosen as the optimizer, since it
has been reported to perform best when training on
large datasets. However, computational resources are
limited, and adding hidden layers and neurons makes
training much more costly. Therefore, after testing
several possible numbers of layers, two hidden layers
with 5 neurons each are applied, as this gave the MLP
classifier its best performance on the dataset. For this
research, ‘Relu’ is tested as the default activation and
‘Logistic’ as a comparison, and the one with better
performance is chosen.
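For readers unfamiliar with the optimizer, the following is a minimal sketch of the standard Adam update rule applied to a toy quadratic function. It is not the training code of this study. The objective, the starting point, and the step size of 0.1 are assumptions chosen so that the example converges quickly.

```python
import math

def adam_minimize(grad_fn, theta, lr=0.001, beta1=0.9, beta2=0.999,
                  eps=1e-8, steps=1000):
    """Minimize a function given its gradient, using the Adam update rule."""
    theta = list(theta)
    m = [0.0] * len(theta)  # running mean of gradients (first moment)
    v = [0.0] * len(theta)  # running mean of squared gradients (second moment)
    for t in range(1, steps + 1):
        g = grad_fn(theta)
        for i in range(len(theta)):
            m[i] = beta1 * m[i] + (1 - beta1) * g[i]
            v[i] = beta2 * v[i] + (1 - beta2) * g[i] ** 2
            m_hat = m[i] / (1 - beta1 ** t)  # bias correction
            v_hat = v[i] / (1 - beta2 ** t)
            theta[i] -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta

def grad(theta):
    # Gradient of the toy objective f(a, b) = (a - 3)^2 + (b + 1)^2.
    a, b = theta
    return [2 * (a - 3), 2 * (b + 1)]

solution = adam_minimize(grad, [0.0, 0.0], lr=0.1, steps=500)
```

The per-parameter scaling by the second moment is what lets Adam use one step size across weights whose gradients differ widely in magnitude.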
2.2.3 K Neighbors Classifier
K-Nearest Neighbors (KNN) is a widely used
supervised classification algorithm known for its
simplicity and for how easily its parameters can be
adjusted. The principle behind KNN is to assign a
class label to an instance based on the categories of
its nearest neighbors, determined by measuring their
distances. Implementing KNN is straightforward, as it
does not require parameter estimation or complex
training procedures. It usually performs well on
classification. However, KNN is considered a lazy
algorithm, because it defers extensive computation to
classification time. It requires scanning all training
samples to calculate distances, leading to high
memory usage and slow inference speed. The larger
the dataset, the more time and resources it costs. In
this research, KNN is applied as a comparative model
against the other models. The KNN model is expected
to perform well at classifying malicious and benign
URLs, as there are only two classes. However, this
model might consume a large amount of
computational resources to train and predict due to
the large volume of data. According to Shah's study
(2020), LR usually outperforms KNN on large
datasets and in complex situations, but KNN could
perform better on a selected feature group. Hence, it
is meaningful to implement the KNN model in this
research, as we compare the lexical feature group
with the hybrid feature group.
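The nearest-neighbor principle can be sketched in a few lines. This is an illustration rather than the classifier used in this study: the Euclidean distance, the value k = 3, and the two toy features (a scaled URL length and a count of special characters) are assumptions of the example.

```python
import math
from collections import Counter

def knn_predict(train_X, train_y, query, k=3):
    """Label a query point by majority vote among its k nearest neighbors."""
    # Lazy learner: there is no training step, so every prediction
    # scans all stored samples to compute distances.
    distances = [(math.dist(x, query), label)
                 for x, label in zip(train_X, train_y)]
    distances.sort(key=lambda pair: pair[0])
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]

# Hypothetical toy data: [URL length / 100, number of special characters].
train_X = [[0.2, 0], [0.3, 0], [0.25, 1], [0.9, 3], [1.1, 4], [0.8, 3]]
train_y = ["benign", "benign", "benign",
           "malicious", "malicious", "malicious"]

pred_a = knn_predict(train_X, train_y, [0.28, 0])
pred_b = knn_predict(train_X, train_y, [1.0, 3])
```

The full scan inside `knn_predict` is the source of the memory and inference cost discussed above: both grow with the number of stored training samples.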
2.3 Evaluation Metrics
Four scores are used as evaluation metrics: accuracy,
F1, precision and recall. Together, these four metrics
give an all-round view of the model's performance.
Precision = TP / (TP + FP)    (1)
Recall = TP / (TP + FN)    (2)
Accuracy = (TP + TN) / (TP + FP + TN + FN)    (3)
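The sketch below computes these scores from true and predicted labels, where TP, FP, TN and FN are the counts of true positives, false positives, true negatives and false negatives. The label vectors are made up for illustration, and the function names are hypothetical. F1 is computed with its standard definition as the harmonic mean of precision and recall, which is an assumption here since its equation is not among those shown above.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count TP, FP, TN and FN, treating `positive` as the malicious class."""
    tp = fp = tn = fn = 0
    for t, p in zip(y_true, y_pred):
        if p == positive and t == positive:
            tp += 1
        elif p == positive:
            fp += 1
        elif t == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, tn, fn

def evaluate(y_true, y_pred, positive=1):
    tp, fp, tn, fn = confusion_counts(y_true, y_pred, positive)
    total = tp + fp + tn + fn
    # Guard each ratio against an empty denominator.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    accuracy = (tp + tn) / total if total else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"precision": precision, "recall": recall,
            "accuracy": accuracy, "f1": f1}

# Made-up labels: 1 = malicious, 0 = benign.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 1, 0, 0, 0, 1, 0, 1]
scores = evaluate(y_true, y_pred)
```

In this toy example one malicious URL is missed and one benign URL is flagged, so all four scores are lowered by the two errors.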
DAML 2023 - International Conference on Data Analysis and Machine Learning