Using Chat GPT for Malicious Web Links Detection

Thomas Kaisser

and Claudia-Ioana Coste

Department of Computer Science, Babes

-Bolyai University, Mihail Kog

alniceanu street, no. 1, Cluj-Napoca, Romania

Keywords:

Malicious Web Links Detection, Machine Learning, Ensembles, OpenAI API, Chat GPT.

Abstract:

Over the last years, the Internet has monopolized most businesses and industries. These outstanding advance-

ments lead to the dangerous development of specialized threats employed to outsmart everyday users, collect

personal data and ﬁnancial beneﬁts. One of the most relevant attacks is malicious web links, which can be

inserted into private messages, emails, social media posts and others to deceive consumers and trick them into

clicking. Present approach will classify links based on multiple manually extracted features. Then, we per-

form a feature importance analysis. Moreover on a smaller dataset, we employ OpenAI’s models to classify

and then add a new feature representing the Chat GPT classiﬁcation. Thus, we manage to improve the overall

performance of multiple machine learning methods. The ﬁrst experiment considers only a Random Forest

classiﬁer but in the second one, we added thirteen other intelligent algorithms and ensembles constructed from

the best performing ones. The best obtained accuracy (95%) is reached by the RF model on the whole dataset.

1 INTRODUCTION

The last decade was marked by an undeniable evolu-

tion of the online environment. According to Statista

(Statista Research Department, 2024), in 2023 there

were 93.09% of all European households having an

Internet connection. Moreover, 2022 was irrevoca-

bly marked by the launch of the Chat GPT 3 by

OpenAI, which proved to be one of the most per-

formant general online chatbot. After it, multiple

companies developed other Large Language Models

(LLMs) for general usage, providing human-like as-

sistance for everyday tasks. LLMs are implemented

using the self-attention mechanism, which was cre-

ated to make the model attentive towards speciﬁc in-

put data that may be relevant later. Transformers are

an encoder - decoder type of architecture, where mul-

tiple multi-headed attention layers are used (Vaswani

et al., 2017). These types of layers will compute an

attention score associated with each token, how rele-

vant it is considering the context.

Even if Chat GPT and other popular LLMs (e.g.,

Gemini, Llama 2, Copilot, etc.) provide online, easy-

to-use and fast functionalities to users for a large vari-

ety of redundant tasks (e.g., code debugging, content

creation, brainstorming, explaining new concepts,

etc.) it also introduced an easy manner to create harm-

https://orcid.org/0009-0001-0025-6415

https://orcid.org/0000-0001-8076-9423

ful content to be spread online. The malicious con-

tent can be created by evading the OpenAI’s guards

(OpenAI, 2024b). An analysis on how Chat GPT is

able to generate phishing attacks is tackled in (Roy

et al., 2023), where nine examples of credentials theft

attacks were deployed by impersonating 50 famous

website brands (e.g., Facebook, Amazon, PayPal,

etc.). The malicious strategies include: regular phish-

ing, reCAPTCHA, vicious links presented as QR

codes, Browser-in-the-Browser attack, iFrame injec-

tion (clickjacking), exploiting DOM classiﬁers, poly-

morphic URLs (Uniform Resource Locators), text en-

coding exploits and browser ﬁngerprinting. All the

attacks were implemented using Chat GPT’s prompts

with little to no human inﬂuence.

Present paper proposes to investigate the usage of

Chat GPT models in detecting malicious tasks. Our

contributions are the following:

• Build a baseline Random Forest (RF) model with

multiple hand-crafted features and analyze their

importance;

• Link classiﬁcation using the OpenAI’s Chat GPT

4 and 3.5-turbo and four different prompts;

• Enrich the baseline model with an additional fea-

ture representing the Chat GPT’s classiﬁcation;

• Experiment with multiple machine learning (ML)

algorithms and ensembles.

This research paper is split into ﬁve chapters. We

continue with the ”Introduction” 1 section, then with

Kaisser, T. and Coste, C.-I.

Using Chat GPT for Malicious Web Links Detection.

DOI: 10.5220/0013069200003825

In Proceedings of the 20th International Conference on Web Information Systems and Technologies (WEBIST 2024), pages 425-432

ISBN: 978-989-758-718-4; ISSN: 2184-3252

425

a section 2 reviewing the state-of-the-art concerning

malicious web links detection. Next, we describe our

empirical methodology with the steps taken towards

achieving our objectives 3. The following section will

describe our experiments, conﬁgurations and results

4, including comparisons with other approaches. Fi-

nally, we conclude this study and draw some future

directions for research.

2 MALICIOUS WEB LINKS

DETECTION

Malicious web links detection is often a binary clas-

siﬁcation or a multi-classiﬁcation problem. To prop-

erly detect the link’s class, scientists take into con-

sideration a large variety of characteristics, extracted

from the URL (i.e., content-independent) or from the

web content (i.e., content-dependent). Most of the

published approaches are working with ML, however

there are ideas considering blacklists, anti-viruses,

and complex network theory.

2.1 Content Independent Approaches

Content-independent solutions are employing just

features extracted from the URL, such as lexical,

computer network information based on Domain

Name System (DNS), WHOIS or other external ser-

vices. As example in (Oshingbesan et al., 2021)

there are a total of 380 lexical features extracted from

the URL (e.g., word2vec, N-grams, etc.). The ap-

proach compared ten ML algorithms such as Logis-

tic Regression (LR), Linear Support Vector Machine

(SVM), Decision Tree (DT), RF, Categorical Boost-

ing, K-Nearest Neighbor (KNN), Feed Forward Neu-

ral Network (FFNN), Naive Bayes (NB), K-Means

and Gaussian Mixture Model. The experiments are

done on multiple datasets, which were aggregated

from multiple sources. KNN was the best perform-

ing algorithm and word2vec features were not found

to be relevant in the link classiﬁcation.

With more types of characteristics extracted from

the URL (i.e., lexical, DNS related and third-party in-

formation), (Mahdavifar et al., 2021) propose a KNN

achieving 98.9% accuracy. The ﬁnal model depicted

the best thirteen features according to information

gain. Additionally, it was proved that third-party data

(e.g., domain age, geolocation, domain name, Alexa’s

rank, etc.) was more relevant in link classiﬁcation.

KNN was compared with SVM, Multi-layer Percep-

tron (MLP), Gaussian Naive Bayes (GNB) and LR.

The experiments were conducted on a novel dataset

with 400,000 benign samples, and 13,011 malicious.

One of a more traditional and old method of de-

tecting the maliciousness of a link is using blacklists,

which are public-available lists containing the mali-

cious domains. These lists are periodically updated.

Such a method was implemented in (Ma et al., 2009)

as the ﬁrst step of the labeling. Blacklist approaches

have disadvantages, such as that they are not suitable

for zero-days attacks. Moreover, web domains and

web content are very dynamic in time. A malicious

website could be taken down and a new safe website

could be replacing it and vise versa.

2.2 Content Dependent Approaches

Content-dependent strategies operate with other fea-

tures which can be extracted from the content of

the web page, from Hypertext Markup Language

(HTML), JavaScript (JS), Cascading Style Sheets

(CSS) or media ﬁles, such as images, audios, videos,

font ﬁles, etc.

In (Wejinya and Bhatia, 2021), there are added

handcrafted content-based features besides lexical

and host-based ones. Out of a total of 30 charac-

teristics, the NB model works best with just 15 of

them considered to be the most relevant. Similarly,

in (Kumi et al., 2021) the model is enriched with

content-based features extracted from the HTML or

JS data. The classiﬁcation is done using a data-mining

algorithm, association classiﬁer, which is achieving

an accuracy of 95.8%. It is compared with other

approaches and with other ML methods (e.g., LR,

SVM, NB). The most important features proved to

be the entropy of the domain name, JS tags, DOM

functions and other information extracted from the JS

ﬁles. Likewise, (Nagy et al., 2023) is proposing a sim-

ilar idea with most attributes elected from the JS and

HTML content. All features were passed through the

chi-square test and depicted just the best 15 of them to

form the detection model. The compared models in-

clude RF, NB, Convolutional Neural Network (CNN)

and Long-Short Term Memory (LSTM), the best re-

sults being reached with NB (96.01% accuracy).

A more complex approach is done by (Rozi

et al., 2021), where JS code was parsed and trans-

formed into a graph, which was then passed through

graph2vec and given as input to the classiﬁcation

model. The intelligent method was chosen based

on multiple comparisons between: a neural network

(NN), MLP, NB, LR, DT, Gradient Boosted Tree

(XGB), and SVM with Radial basis function kernel.

Recently, there have been developed a novel solu-

tion with the rise of generative artiﬁcial intelligence

as the one proposed in (Koide et al., 2023). LLMs are

employed to help with the phishing/non-phishing link

WEBIST 2024 - 20th International Conference on Web Information Systems and Technologies

426

classiﬁcation. The models used take into considera-

tion multiple features: URL, HTML and text infor-

mation collected from the screenshot of the website.

The detection models were developed as text models

and as multimodal ones, the last ones are able to pro-

cess images as input. The proposed solution, Chat-

PhishDetector, is also checking the website for brand

impersonation. The dataset on which the experiments

were driven, contains 1,000 samples for each class.

The LLMs were conﬁgured with two prompts, which

asked for an explaination on why a certain class was

chosen. The authors compared Chat GPT 4, 4vision

and 3.5-turbo; Gemini Pro and Pro Vision; Llama-2.

Even before, cybersecurity companies were en-

riching their products with Generative AI. An ex-

ample is VirusTotal’s Code Insight (Quintero, 2023)

utilizing Sec-PaLM, a proﬁcient LLM provided by

Google. The product helps cybersecurity specialists

to faster analyze code, by translating the code into

natural language. Another example is described in

(Tushkanov, 2023) where a series of simple experi-

ments are delivered with the task to classify phish-

ing websites. The performance achieved is around

87% accuracy with a high false positive rate, which

means that many websites were considered unsafe

even though they were safe.

Still, there are not a large variety of solutions in

using the capacities of LLMs, in different ways. Thus,

we propose to enrich ML models with an supplimen-

tary feature representing the classiﬁcation done by a

LLM.

3 EMPIRICAL METHODOLOGY

Herein, we present the methodology followed when

experimenting with malicious web links detection

task and Chat GPT. Present empirical study passes

through the next stages:

• Dataset selection and preprocessing;

• Feature modeling;

• Chat GPT classiﬁcation;

• ML models development;

• Comparisons.

3.1 Dataset Selection and Preprocessing

The depicted dataset can be accessed at (Siddhartha,

2021). It includes a total of 651,191 URLs, classiﬁed

into four categories: 428,103 benign, 96,457 deface-

ment, 94,111 phishing, and 32,520 malware. It suc-

cessfully captures a large variety of web threats, ag-

gregating data from multiple sources, such as ISCX-

URL-2016 (Mamun et al., 2016), Faizan’s github (Jo-

erg, 2017), Phishtank (PhishTank, 2023), PhishStorm

(Marchal et al., 2014) and ”malwaredomainlist.com”

(malwaredomainlist, 2010).

Even though we started with multi-classiﬁcation

we continued with binary classiﬁcation since it was

easier to compare with other approaches. All labels

from the dataset including: phishing, defacement and

malware were remapped to ”unsafe”. The benign

samples were renamed to ”safe”. The dataset is not

balanced and we consider this a proper and realis-

tic representation on how many benign and malicious

links there are in a real scenario.

3.2 Feature Modeling

The characteristics elected were based on the pre-

vious work done in the domain of malicious / be-

nign link classiﬁcation using manually-engineered

features. Present experiment could be split into three

stages:

• the simple stage - adding each feature one by one

and observe if accuracy increases;

• the feature importance score - where characteris-

tics were dropped based on their relevance score;

• covariance matrix - where the highly coupled fea-

tures were eliminated.

The simple method, where starting from an initial set

of features, we added one at a time to observe some

improvements in accuracy score. We started with an

initial set of features S

. The baseline model on which

this experiment was performed is RF. After running

the experiments, we developed the ﬁnal set of fea-

tures S

. All attributes were extracted based on the

URL and they fall into the lexical and host-based cat-

egories. In table 1 there can be found a compilation of

all features tried and added into the model. The gray

rows indicate the starting set of characteristics for the

model together with a brief explanation. Pink and

white rows contain the added features that we have

selected to enrich the model and improve its perfor-

mance.

The next experiment considers the importance

scores, which are calculated as the mean decrease in

impurity across all trees within the RF model. The

computation is automatically done by the Sklearn li-

brary (Pedregosa et al., 2011). In this stage, we fur-

ther eliminate the features with a small contribution

to the classiﬁcation.

In the ﬁnal stage, the covariance matrix was com-

puted and based on it, the most redundant features

were dropped. If there are multiple features highly

correlated, the one having the lowest importance

score will be eliminated.

Using Chat GPT for Malicious Web Links Detection

427

Table 1: The features selected and their descriptions.

Feature name Description

has IP address checking if the URL contains an IP address

no. full stops counting the ”.” sign

no. ”@” sign counting the ”@” sign

Google Index checks if the URL is indexed by Google service, offering an indication on the legitimacy of the

URL (Vikramaditya, 2024)

no. embedded domains count of the domains found within the URL (separated by ”//”)

no. directories counting the directories found in the URL path (separated by ”/”)

length of the URL total length of the URL

no. digits counting the digits found in the URL

no. special chars. count the special characters (”/”, ”%”, ”#”, ”&”, ”=”, ”?”) found in the URL

Shannon entropy computed on the network location of the URL (including the domain, port or subdomains if

there is any) (Lin, 1991)

longest token FQDN the length of the longest token found in the Fully Qualiﬁed Domain Name (FQDN) considering

the network location

no. vowels counting the vowels (”a”, ”e”, ”i”, ”o”, ”u”) found in the URL path

query length length of the query string (the URL part between the ”?” and the fragment sign ”#”)

is HTTPS checking the presence of the HTTPS protocol in the URL

no. phishing words counting general phishing words (”webscr”, ”secure”, ”banking”, ”ebayisapi”, ”account”, ”con-

ﬁrm”, ”login”, ”signing”) (Alabdulmohsin et al., 2016)

port indicator checking the presence of the port number within the URL

3.3 Chat GPT Classiﬁcation

We employ two OpenAI models: GPT 3.5-turbo and

GPT 4, using OpenAI Python API (OpenAI, 2024a).

Four simplistic prompts were engineered and they are

exempliﬁed in Table 2. The ﬁrst one is the most sim-

ple one. The second one and the third one are inspired

by the Zero-Shot technique detailed in (Kojima et al.,

2022). Lastly, the forth prompt considers the Chain

of Though technique described in (Wei et al., 2022).

Due to ﬁnancial limitations and the costs charged

by the OpenAI’s models we proposed to run our tests

on just 1000 random links. Depending on the prompt

type, GPT-3.5-turbo has a cost between 0.12$ and

0.27$ per testing set, while GPT-4 is more expen-

sive with 1.72$ - 9.06$. The longer the prompt the

more expensive it was to test the model. There will

be eight combinations of prompts and models which

will be tested and compared. Then, we enriched the

RF model and the other ML models with an additional

feature representing the LLM’s labeling. By doing so,

we should achieve a greater performance. The mod-

els using Chat GPT features will be trained and tested

on a total of 1000 links randomly sampled from the

dataset (Siddhartha, 2021). 800 of the web links will

be used for training and 200 of them for testing.

3.4 ML Models Development

The baseline model used is a RF implemented in

Sklearn library (Pedregosa et al., 2011) and a man-

ual parameter calibration after multiple runs. The

rest of the used models are the following: RF, XGB,

ADA, DT, LR, Gaussian NB (GNB), Multinomial NB

(MNB), Complement NB (CNB), Quadratic Discrim-

inant Analysis (QDA), Linear Discriminant Analy-

sis (LDA), Passive Aggressive Classiﬁer (PAC), SVM

and KNN. All algorithms were utilized from SKlearn

Python library (Pedregosa et al., 2011). The parame-

ters for all models were the default ones.

Moreover, we employ an ensemble using the three

of the best performing algorithms. The ensemble is

developed with a Voting Classiﬁer from (Pedregosa

et al., 2011). Multiple types of ensembles are tried

with different voting mechanisms (”hard”, ”soft”) and

different weights generated based on the accuracy

score of the individual classiﬁers. For all tests the

dataset was randomly split into 80% training and 20%

testing.

3.5 Comparisons

Our proposed approach is combining ML models and

ensembles with a Chat GPT feature. This idea will be

compared with other similar literature solutions us-

ing the same dataset (Siddhartha, 2021). All records

from the dataset are used in (Zhang and Yan, 2023),

in (Shetty et al., 2023) and in (Coste, 2024). Ad-

ditionally, there are other literature papers using the

same dataset but on a subset of 40,000 links. Thus,

to preserve a fair comparison, regarding the number

of records, we consider just these three solutions rel-

evant for comparisons.

WEBIST 2024 - 20th International Conference on Web Information Systems and Technologies

428

Table 2: The prompts tried for both Chat GPT models.

No. Prompt

1 Check if the url is safe or not, respond with SAFE for safe url and UNSAFE for unsafe url

2 Check if the url is safe or not, respond with SAFE for safe url and UNSAFE for unsafe url. Try to look at the

entropy of the domain, the length of the domain, the number of the special characters, the longest token in the url.

3 Check if the url is safe or not, respond with SAFE for safe url and UNSAFE for unsafe url. Try to look at the

entropy of the domain, the length of the domain, the number of the special characters, the longest token in the url.

Examples: 1. http://www.ikenmijnkunst.nl/index.php/exposities/exposities-2006,unsafe

2. http://peluqueriadeautor.com/index.php?option=com virtuemart&page=shop.browse&

category id=31&Itemid=70,unsafe

3. movies.yahoo.com/shop?d=hv&cf=info&id=1800340831,safe

4. duckduckgo.com/1/c/Roman Catholic cathedrals in Canada,safe

5. alexpay2.beget.tech,unsafe

6. http://worldoftanks.ru/ru/content/guide/payments instruction/mobile-payments-rostelekom-ural-utel/,safe

7. http://www.artedesignsas.it/catalogo.html?page=shop.browse&category id=14,unsafe

4 Check if the url is safe or not, respond with SAFE for safe url and UNSAFE for unsafe url. Look at these features:

- Number of directory levels

- Length of the URL: 38 characters

- Number of special characters (from ”/%#&=?”)

- Shannon entropy of the domain

- Length of the longest token in the FQDN

- Number of dots in the URL

- Number of vowels in the path

Examples: friars.com/sports/m-baskbl/archive/prov-m-baskbl-2003.html is Safe because the values of the

extracted features are: [4, 58, 4, 0.0, 0, 2, 10]

http://www.martin-busker.de/administrator/help/en-GB/css/Facture/ c4d12146ebce8e1684d3542308399779/8fa39

dab95edb1b676b638a672278eae/particuliers-45636.php is Unsafe because the values of the extracted features are:

[8, 153, 10, 3.78, 13, 3, 25]

4 EXPERIMENTS AND RESULTS

In the following section, we describe the feature im-

portance experiment and how by adding the GPT’s

prediction into different ML models they are able to

improve prediction in malicious web links detection.

4.1 Feature Importance Results

All feature importance experiments could be split into

3 stages (i.e, simple stage, feature importance stage

and covariance matrix stage) as previously described

in the Methodology section 3.2. The ﬁrst assessment

started with a simple RF model and a predeﬁned set of

characteristics selected from other solutions from the

state-of-the-art. The initial features are marked with

pink in Table 1 from Methodology section 3.2. To the

initial set, there were added one at a time the rest of

the features to observe an improvement in accuracy.

Features marked with white (see Table 1) did not im-

prove the performance of the RF, and the experiment

needed more time to extract these features. The gray

ones added a signiﬁcant increase in metrics.

The second experiment took into consideration the

feature importance score computed by the Sklearn RF

model (Pedregosa et al., 2011). Current stage includes

ﬁve runnings and the features obtaining the lowest

score were dropped in an iterative method such that

the evolution of the accuracy could be observed as

well. Finally, the following features were dropped

due to their low score: IP address, number of ”@”

sign, Google index, the number of embedded domains

and of directories. With the ﬁnal set of features (i.e.,

number of full stops, number of directories, length

of the URL, count of special characters, Shannon en-

tropy, length of the longest token in FQDN and num-

ber of vowels), the RF model achieved 93.78% accu-

racy.

Further, the ﬁnal stage will select features consid-

ering the covariance matrix. There was observed a

high correlation between the length of the URL, the

count of digits and the count of special characters.

This correlation is normal to happen since both the

number of digits and special characters are included

in the total length of the web link. If the URL is longer

it is a high probability that the number of special char-

acters and digits is larger. Therefore, considering the

importance score, we dropped the number of digits to

avoid this redundancy.

4.2 Using Chat GPT for Link

Classiﬁcation

OpenAI’s models were used for a standalone classiﬁ-

cation using two models (GPT-3.5-turbo and GPT-4)

Using Chat GPT for Malicious Web Links Detection

429

and four prompts as described in Table 2. Then, based

on the best combination of a model and a prompt, a

new feature was added into the baseline RF model.

The tests were done on 1000 randomly sampled links.

Considering the obtained performance, GPT-4

outperformed GPT-3.5-turbo as it can be observed in

Table 3. The best model was GPT-4 with the third

prompt, reaching 65% accuracy. The best model with

GPT-3.5-turbo was achieved with the second prompt.

Regarding the cost, GPT-3.5-turbo is considerably

cheaper than GPT-4. Moreover, if we compare the

best classiﬁcation with GPT-3.5-turbo and the best of

GPT-4, we can observe a small difference in perfor-

mance but a signiﬁcant one regarding cost.

Afterwards, we added a new feature (”ope-

nAICallCheck”) in the previously developed RF

model with the ﬁnal aim to further increase its ac-

curacy. This feature represents the link classiﬁcation

as done by the GPT-4 model with prompt 3, our best

GPT conﬁguration. We observed that on the evaluat-

ing set of 1000 links, the RF accuracy rose from 88.8

to 89%. The metrics were computed for ﬁve different

dataset splits. Thus, even though GPT models do not

have a high accuracy on their own, using their classiﬁ-

cation as input to a ML model, there may be a modest

increase in performance.

Table 3: The performance of the chat GPT models (3.5-

turbo and 4) for link classiﬁcation (testing set).

Prompt Acc.(%) Precision Recall F1 Cost

GPT-3.5-turbo

1 48.5 65 54 40 0.12 $

2 63 66 65 62 0.15 $

3 54.5 65 58 51 0.24 $

4 44.5 64 51 34 0.23 $

GPT-4

1 60 64 60 60 1.72 $

2 61 67 64 60 3.15 $

3 65 65 65 65 7.35 $

4 64 66 65 64 9.06 $

4.3 Performance of the Other ML

Models

Taking into consideration the potential of the ”ope-

nAICallCheck” feature, we propose to further ex-

periment with multiple ML algorithms and ensemble

models. Table 4 details the accuracy scores for all

ML models employed and they were computed on

the testing set on ﬁve dataset splits. The ﬁrst two

columns contain the accuracies without the openAI’s

feature and with it. The ﬁnal column has the accu-

racies achieved on the whole dataset, which will be

used for reference and to observe if the ML models

generalize well. These experiments do not take into

Table 4: The accuracy of the all ML models (testing set).

Model Acc. Acc. (with GPT) Acc. (all)

RF 88.8 89 95

XGB 93.4 93.3 92.86

ADA 92.6 93.3 91.08

DT 91 90.9 94.35

LR 87.9 88.3 85.43

GNB 85.1 86.2 84.56

MNB 86.9 87.8 84.55

CNB 86.6 86.9 84.47

MLP 90.8 91.2 93.3

KNN 87.8 88.5 93.38

QDA 88.2 87.3 84.67

LDA 88.1 87.9 84.99

PAC 87.7 72.1 83.84

SVC 86.9 87.9 87.33

account OpenAI’s prediction due to ﬁnancial limita-

tions. The total cost for all the 651,191 links would

have reached 4,700 $ by using GPT-4 and prompt 3.

It can be observed that most algorithms have a mod-

est improvement in accuracy when adding the ”ope-

nAICallCheck” feature. Usually, there was a slight

decrease in accuracy was noted for XGB, DT, QDA,

and LDA. However, for the PAC algorithm the GPT’s

information proved to be rather detrimental. This may

happen because PAC is an online learning algorithm,

where the training set is processed sequentially and

the model is updated in the same manner. PAC is suit-

able for large data while small amounts of data may

not be enough. For the rest of the ML methods (i.e.,

RF, ADA, LR, NB algorithms, MLP, KNN, and SVC)

there can be seen an increase between 0.2 and 1.1,

which we consider to be relevant.

Regarding generalization, RF, DT, MLP and KNN

prove to achieve a greater performance when trained

on more data. Even though the rest of the algorithms

generalize well, there is not a signiﬁcant drop in ac-

curacy when trained on all 651,191 records. It is def-

initely a case of overﬁtting and it should be investi-

gated more.

Overall, the best performing ML methods are

XGB, ADA and DT, which will be depicted to form

the heterogeneous ensembles. The weights represent

the accuracy score obtained by the models in an in-

dividual setting. Ensembles were calibrated consid-

ering the voting mechanisms (i.e., soft or hard) and

by adding weights or not. The experiment was con-

ducted in the same conﬁguration as the one for the

single models. While for most ensembles, adding

the GPT’s feature was detrimental, for the no-weights

soft-voting ensemble it was observed a light increase

in accuracy (0.1%). This may be due to the fact that

hybrid models need more data to be effective. This is

sustained by our results on the whole dataset, where

all ensembles signiﬁcantly improved in accuracy (1-

WEBIST 2024 - 20th International Conference on Web Information Systems and Technologies

430

5%). Thus, testing the ”openAICallCheck” feature on

just 1000 links may not be enough to properly train

an ensemble model. Moreover, by comparing the en-

semble with the individual results of the classiﬁers on

1000 URLs, it did not lead to an increase in accu-

racy. Although, on the whole dataset, the ensemble

achieves rather better results (e.g., 94.29%).

4.4 Comparisons and Discussion

Figure 1: Comparisons with other literature approaches.

Our purpose was to investigate the support of using

Chat GPT in classifying web links as malicious or

benign. This idea was tested with multiple intelli-

gent algorithms including ensembles. The best per-

forming algorithm on the whole dataset (Siddhartha,

2021) was RF with 95% accuracy. On the smaller

dataset formed by 1000 randomly sampled links, the

best accuracy was obtained by XGB, closely followed

by ADA. Also, both of these models were the most

accurate when adding the GPT’s prediction as a fea-

ture. The best ensemble was formed by XGB, ADA

and DT and it is characterized by a hard voting mech-

anism and weights. This ensemble has the highest

accuracy rate on the 1000 links dataset. On the whole

dataset, the soft-voting ensemble has a higher accu-

racy rate. Figure 1 presents all our best models. The

blue bars represent the classiﬁcation for the whole

dataset even if it is a binary classiﬁcation or it is a

multi-classiﬁcation. The red ones mark the results

obtained on the smaller dataset of 1000 links. The

green bars signify the accuracy scores reached by our

best algorithms on the smaller dataset, but adding the

GPT’s classiﬁcation to the model. The orange bars

present the best accuracies obtained from the stan-

dalone OpenAI’s models.

Considering the whole dataset, there can be ob-

served that our models outperform the solution pro-

vided by (Zhang and Yan, 2023), but unfortunately,

they are behind with other more performant models

from (Shetty et al., 2023) and (Coste, 2024). Re-

garding the smaller dataset using just hand-crafted

features, the ensemble model does not reach a bet-

ter accuracy rate compared to the individual model,

which we would have expected. By comparing with

the models using the ”openAICallCheck” character-

istic, we can observe a small improvement for ADA,

but for the ensemble or XGB, the accuracy has a tiny

decrease. Nevertheless, set side by side with the stan-

dalone GPT’s models (i.e., gpt-3.5-turbo and gpt-4),

our models certainly are a better solution. We chose

to not include the solution from (Koide et al., 2023)

in the Figure 1 because the comparison would not be

fair since the dataset is different. Still, their approach

is much more accurate with higher metrics obtained

and multiple LLMs engines and tasks. All in all, our

approach can pave the way for novel solutions by ex-

tracting features from OpenAI’s models to advance

the classiﬁcation of malicious links.

5 CONCLUSIONS AND FUTURE

WORK

Malicious web links account for multiple security at-

tacks directed against inexperienced users and can

lead to drive-by-downloads, credentials theft, imper-

sonating brands and deceiving people. Present pa-

per proposes to tackle the application of OpenAI’s

models (i.e., GPT-3.5-turbo and GPT-4) to counter-

act web-malware. Our experiments contain a feature

importance analysis on web links with multiple hand

crafted features. Then, a large variety of intelligent

methods including ML models and ensembles were

extended with a feature considering the GPT’s pre-

diction. The experiments proved that by appending

OpenAI’s prediction of a link as a new feature model

it can slightly improve the accuracy of most algo-

rithms. The best models achieve 94-95% accuracy on

the whole dataset.

Using the capabilities provided by LLMs could

lead to major improvements regarding cybersecurity.

For future work, we propose to add other Chat GPT

related features into the ML classiﬁcation models

such as an explanation on why a link is malicious,

its domain similarities with other brands on the In-

ternet, etc. As well, other engines, such as Gemini,

Llama, GPT-4o etc. should be use for comparisons.

Additionally, by using transfer learning, a large lan-

guage model could be trained speciﬁcally on the task

of malicious web links detection. Moreover, the prob-

lem should be addressed on a larger scale since links

need to be checked in a continuous and fast way not

to interfere with the online environment.

Using Chat GPT for Malicious Web Links Detection

431

REFERENCES

Alabdulmohsin, I., Han, Y., Shen, Y., and Zhang, X. (2016).

Content-agnostic malware detection in heterogeneous

malicious distribution graph. In Proceedings of the

25th ACM International on Conference on Informa-

tion and Knowledge Management, pages 2395–2400.

Coste, C. I. (2024). Malicious web links detection based on

image processing and deep learning models (accepted

for publication). In The 23rd IEEE/WIC International

Conference on Web Intelligence and Intelligent Agent

Technology.

Joerg, S. (2017). Using-machine-learning-to-detect-

malicious-urls. faizan dataset link (Retrieved: August

22, 2024).

Koide, T., Fukushi, N., Nakano, H., and Chiba, D. (2023).

Detecting phishing sites using chatgpt. arXiv preprint

arXiv:2306.05816.

Kojima, T., Gu, S. S., Reid, M., Matsuo, Y., and Iwasawa, Y.

(2022). Large language models are zero-shot reason-

ers. Advances in neural information processing sys-

tems, 35:22199–22213.

Kumi, S., Lim, C., and Lee, S.-G. (2021). Malicious url

detection based on associative classiﬁcation. Entropy,

23(2):182.

Lin, J. (1991). Divergence measures based on the shannon

entropy. IEEE Transactions on Information theory,

37(1):145–151.

Ma, J., Saul, L. K., Savage, S., and Voelker, G. M. (2009).

Beyond blacklists: learning to detect malicious web

sites from suspicious urls. In Proceedings of the 15th

ACM SIGKDD international conference on Knowl-

edge discovery and data mining, pages 1245–1254.

Mahdavifar, S., Maleki, N., Lashkari, A. H., Broda, M.,

and Razavi, A. H. (2021). Classifying malicious do-

mains using dns trafﬁc analysis. In 2021 IEEE Intl

Conf on Dependable, Autonomic and Secure Com-

puting, Intl Conf on Pervasive Intelligence and Com-

puting, Intl Conf on Cloud and Big Data Computing,

Intl Conf on Cyber Science and Technology Congress

(DASC/PiCom/CBDCom/CyberSciTech), pages 60–

67. IEEE.

malwaredomainlist (2010). Malware domain list. URL to

malwaredomainlist (Retrieved: August 22, 2024).

Mamun, M. S. I., Rathore, M. A., Lashkari, A. H.,

Stakhanova, N., and Ghorbani, A. A. (2016). Detect-

ing malicious urls using lexical analysis. In Network

and System Security: 10th International Conference,

NSS 2016, Taipei, Taiwan, September 28-30, 2016,

Proceedings 10, pages 467–482. Springer.

Marchal, S., Franc¸ois, J., State, R., and Engel, T. (2014).

Phishstorm: Detecting phishing with streaming ana-

lytics. IEEE Transactions on Network and Service

Management, 11(4):458–471.

Nagy, N., Aljabri, M., Shaahid, A., Ahmed, A. A., Alnasser,

F., Almakramy, L., Alhadab, M., and Alfaddagh, S.

(2023). Phishing urls detection using sequential and

parallel ml techniques: Comparative analysis. Sen-

sors, 23(7):3467.

OpenAI (2024a). Openai platform. URL to malwaredo-

mainlist (Retrieved: August 22, 2024).

OpenAI (2024b). Usage policies. Link to article (Retrieved:

August 22, 2024).

Oshingbesan, A., Okobi, C., Ekoh, C., Richard, K., and

Munezero, A. (2021). Detection of malicious web-

sites using machine learning techniques. preprint,

none(none):1–5.

Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V.,

Thirion, B., Grisel, O., Blondel, M., Prettenhofer,

P., Weiss, R., Dubourg, V., Vanderplas, J., Passos,

A., Cournapeau, D., Brucher, M., Perrot, M., and

Duchesnay, E. (2011). Scikit-learn: Machine learning

in Python. Journal of Machine Learning Research,

12:2825–2830.

PhishTank (2023). PhishTank - Out of the Net, into the

Tank - Developer Information. PhishTank website

(Retrieved: August 22, 2024).

Quintero, B. (2023). Introducing virustotal code insight:

Empowering threat analysis with generative ai. Link

to article (Retrieved: August 22, 2024).

Roy, S. S., Naragam, K. V., and Nilizadeh, S. (2023). Gen-

erating phishing attacks using chatgpt. arXiv preprint

arXiv:2305.05133.

Rozi, M., Ban, T., Kim, S., Ozawa, S., Takahashi, T., and

Inoue, D. (2021). Detecting malicious websites based

on javascript content analysis. In Computer Security

Symposium 2021, Dubrovnik, Croatia. Computer Se-

curity Symposium 2021.

Shetty, U., Patil, A., and Mohana, M. (2023). Malicious

url detection and classiﬁcation analysis using machine

learning models. In 2023 International Conference

on Intelligent Data Communication Technologies and

Internet of Things (IDCIoT), pages 470–476. IEEE.

Siddhartha, M. (2021). Malicious urls dataset. Kaggle - Ma-

licious URLs dataset (Retrieved: August 22, 2024).

Statista Research Department (2024). Household internet

access in the european union 2023. Link to article

(Retrieved: August 22, 2024).

Tushkanov, V. (2023). Investigating chatgpt phishing detec-

tion capabilities. Link to article (Retrieved: August

22, 2024).

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones,

L., Gomez, A. N., Kaiser, L., and Polosukhin, I.

(2017). Attention is all you need. arXiv preprint

arXiv:1706.03762.

Vikramaditya, N. (2024). Googlesearch-python. Link to

library (Retrieved: August 22, 2024).

Wei, J., Wang, X., Schuurmans, D., Bosma, M., Xia, F.,

Chi, E., Le, Q. V., Zhou, D., et al. (2022). Chain-of-

thought prompting elicits reasoning in large language

models. Advances in neural information processing

systems, 35:24824–24837.

Wejinya, G. and Bhatia, S. (2021). Machine learning for

malicious url detection. In ICT Systems and Sustain-

ability, pages 463–472. Springer, Singapore.

Zhang, L. and Yan, Q. (2023). Detect malicious websites by

building a neural network to capture global and local

features of websites. Research Square.

WEBIST 2024 - 20th International Conference on Web Information Systems and Technologies

432