Enhanced Predictive Clustering of User Profiles: A Model for Classifying Individuals Based on Email Interaction and Behavioral Patterns

Peter Wafik [1, a], Alessio Botta [3, b], Luigi Gallo [3, c], Gennaro Esposito Mocerino [3, d], Cornelia Herbert [2, e], Ivan Annicchiarico, Alia El Bolock [1, f] and Slim Abdennadher [1, g]

[1] Department of Informatics and Computer Science, German International University, Cairo, Egypt

[2] Department of Applied Emotion and Motivation Psychology, Ulm University, Ulm, Germany

[3] Department of Electrical Engineering and Information Technologies, University of Napoli Federico II, Naples, Italy

peter.waﬁk@giu-uni.de, {alessio.botta, luigi.gallo3, gennaro.espositomocerino}@unina.it,

{cornelia.her r@giu-uni.de

Keywords:

Clustering, Predictive Clustering, Deep Learning, Neural Networks, Behavioral Analysis, Personalized

Content Delivery, Social Engineering.

Abstract:

This study introduces a predictive framework to address a gap in user proﬁling, integrating advanced cluster-

ing, dimensionality reduction, and deep learning techniques to analyze the relationship between user proﬁles

and email phishing susceptibility. Using data from the Spamley platform (Gallo et al., 2024), the proposed

framework combines deep clustering and predictive models, achieving a Silhouette Score of 0.83, a Davies-

Bouldin Index of 0.42, and a Calinski-Harabasz Index of 1676.2 with k-means and Self-Organizing Maps

(SOM) for clustering user proﬁles. The results further highlight the effectiveness of Linear Support Vector

Machines (SVM) and neural network models in classifying cluster membership, providing valuable decision-

making insights. These ﬁndings demonstrate the efﬁcacy of advanced non-linear methods for clustering com-

plex user proﬁle features, as well as the overall success of the semi-supervised model in enhancing clustering

accuracy and predictive performance. The framework lays a strong foundation for future research on tailored

anti-phishing strategies and enhancing user resilience.

1 INTRODUCTION

The rise of digital communication, particularly via

email, has created an expansive pool of data that of-

fers rich opportunities to understand user behavior.

Previous research suggests that email interactions are

not only a means of communication, but also reﬂect

individual characteristics, preferences, and cognitive

vulnerabilities and therefore pose a major challenge

to privacy protection. This also applies to the tac-

tics used in email phishing attacks (Lawson et al.,

2020). The exploitation of email as a medium for

phishing attacks has grown alarmingly sophisticated,

underscoring the need for user-centric defenses. Addressing this challenge demands a deeper understanding of both technical patterns and human behaviors (Gallo et al., 2024). Traditional methods of phishing detection predominantly focus on binary outcomes, predicting whether an individual will fall victim to an attack, while overlooking the broader potential of human profiling (Kim and Cho, 2024). These approaches often fail to account for the psychological and behavioral dimensions that influence user decisions, such as impulsivity, risk perception, and trust dynamics. Such traits are critical for understanding how users interact with digital content and for developing tailored defenses against phishing attacks (Van Der Heijden and Allodi, 2019; Allodi et al., 2019).

a https://orcid.org/0009-0002-6151-6775

b https://orcid.org/0000-0002-3365-1446

c https://orcid.org/0000-0001-8770-9773

d https://orcid.org/0009-0009-0655-2280

e https://orcid.org/0000-0002-9652-5586

f https://orcid.org/0000-0002-5841-1692

g https://orcid.org/0000-0003-1817-1855

This study addresses these limitations by intro-

ducing a novel predictive framework combining deep

learning with behavioral analysis. By relating email

interaction patterns to psychological traits, the framework holistically analyzes user profiles to predict personalized email characteristics. This enables the creation of customized email structure elements aligned with user-specific traits, advancing personalized content delivery and mitigating phishing risks.

Wafik, P., Botta, A., Gallo, L., Mocerino, G. E., Herbert, C., Annicchiarico, I., El Bolock, A. and Abdennadher, S.

Enhanced Predictive Clustering of User Profiles: A Model for Classifying Individuals Based on Email Interaction and Behavioral Patterns.

DOI: 10.5220/0013302800003899

Paper published under CC license (CC BY-NC-ND 4.0)

In Proceedings of the 11th International Conference on Information Systems Security and Privacy (ICISSP 2025) - Volume 2, pages 363-374

ISBN: 978-989-758-735-1; ISSN: 2184-4356

This research aims to bridge the gap between tra-

ditional binary phishing detection models and the un-

tapped potential of comprehensive human proﬁling.

By identifying the interplay between email traits and

user proﬁles, the proposed framework seeks to en-

hance phishing prevention strategies. Positioned at

the intersection of psychology, machine learning, and

cybersecurity, this study introduces a scalable and

innovative solution to modern challenges in digital

communication, paving the way for more adaptive

and user-centric defenses.

The remainder of the paper is structured as fol-

lows: Section 2 discusses the background and related

work, emphasizing the role of human factors in phish-

ing and clustering methodologies. Section 3 outlines

the proposed methodology, including dataset char-

acteristics, preprocessing, clustering, and prediction

models implemented. Section 4 presents the results,

followed by an in-depth discussion. Finally, Section 5

concludes with key ﬁndings and directions for future

research.

2 BACKGROUND AND RELATED WORK

Phishing attacks have become increasingly sophisti-

cated over the past decade, posing signiﬁcant chal-

lenges for cybersecurity. Despite the advancements in

detection technologies, phishing continues to exploit

psychological vulnerabilities, emphasizing the need

for solutions that address both technical and psycho-

logical aspects (Dhamija et al., 2006). This section

explores the evolution of phishing research, highlighting the critical role of human factors and the technological advances that address them, which together form the foundation of this study.

2.1 Role of Human Factors in Phishing Susceptibility

Phishing emails are crafted to exploit cognitive and

psychological vulnerabilities, making the human el-

ement a critical weakness in cybersecurity. Studies

have shown that individuals’ susceptibility to phish-

ing often depends on traits such as impulsivity, cu-

riosity, and risk perception (Van Der Heijden and

Allodi, 2019; Allodi et al., 2019). Research has

also linked personality traits, such as those from the

Big Five model, to phishing susceptibility (Parrish Jr

et al., 2009). Demographic factors like age and ed-

ucation, though less predictive, have been studied

to understand the broader landscape of vulnerabili-

ties (Dhamija et al., 2006). Tailored phishing attacks

leveraging persuasion principles, such as authority

and scarcity, further underscore the importance of

psychological factors (Cialdini and Cialdini, 2007).

This work builds on these insights by selecting a dataset capable of capturing all of these traits and clustering users based on their behavioral and cognitive

proﬁles. By correlating email traits with user re-

sponses, the study aims to predict phishing suscep-

tibility and inform tailored interventions.

2.2 Overview of Phishing Susceptibility Based on User Profiles

Recent research on phishing susceptibility has fo-

cused on the impact of personality traits, cognitive

abilities, and online behaviors. Analyzing user pro-

ﬁles has been a key approach, though it faces chal-

lenges due to the lack of datasets speciﬁcally designed

for such studies (Wang et al., 2012). Despite this, notable studies have emerged to address this gap. For instance, Tornblad et al. (2021) identified 32 predictors of phishing susceptibility but noted that existing models used limited predictors and lacked accuracy. Wang et al. (2012) proposed a high-accuracy machine learning model but relied on self-reported data and missed dynamic phishing aspects. Similarly, Albladi and Weir (2018) explored phishing susceptibility on social networks but did not sufficiently analyze how personality traits influence decision-making.

The mentioned studies, along with many others,

often rely on static, limited datasets and lack the inte-

gration of advanced deep proﬁling techniques. Future

research should seek to address these limitations by:

1. Expanding Dataset Scope: Use datasets covering diverse psychological and behavioral dimensions.

2. Applying Advanced Clustering Techniques:

Use deep clustering methods to identify complex

patterns in user behavior and susceptibility.

3. Conducting Comprehensive Analysis: Explore

the interplay between personality traits, cognitive

abilities, and online behaviors in greater depth.

Filling these gaps will enable the development of

more accurate and actionable models for predicting

and mitigating phishing risks, enhancing the effec-

tiveness of anti-phishing strategies.

ICISSP 2025 - 11th International Conference on Information Systems Security and Privacy


2.3 Clustering and Predictive Algorithms for User Profiling

Clustering algorithms have long been used to clas-

sify individuals based on interaction patterns, cogni-

tive traits, and behavioral data. Techniques such as

k-means and hierarchical clustering have proven ef-

fective in identifying user groups, offering insights

that could be utilized for cybersecurity applications

(Chandola et al., 2009). For example, clustering

users by their susceptibility to phishing enables tar-

geted training and awareness programs. Building on

this foundation, this study employs advanced clus-

tering techniques to classify users and predict their

susceptibility to phishing attacks. By integrating be-

havioral and psychological traits, it offers a compre-

hensive perspective on user vulnerabilities, enabling

tailored interventions and strengthening cybersecurity

defenses.

2.4 Ethics Statement

This study complies with ethical standards for data

collection, processing, and analysis. The dataset,

obtained from the Spamley platform, was fully

anonymized to ensure participant privacy and conﬁ-

dentiality. No personally identiﬁable information was

used, and all data handling adhered to GDPR and rele-

vant data protection laws (GDPR, 2016). The predic-

tive clustering framework developed in this research

is intended for ethical applications, such as enhanc-

ing personalized content delivery and improving cy-

bersecurity defenses. The model is speciﬁcally de-

signed to respect user privacy and avoid misuse, such

as unauthorized proﬁling or exploitation of sensitive

user traits. By focusing on anonymized and behav-

ioral insights, the framework provides actionable ben-

eﬁts without compromising ethical principles. This

study emphasizes transparency and integrity in its

methodologies to ensure the responsible use of the

proposed model.

3 METHODOLOGY

The methodology employs a multi-stage process ap-

plied to the Spamley responses dataset. After pre-

processing, a clustering model classiﬁes individuals

based on their proﬁle, followed by a predictive model

to assign new individuals to the generated clusters.

Email traits, such as subject and body content, are identified by analyzing, for each cluster, the emails whose legitimacy individuals most often misjudged and replicating their key features. Figure 1 outlines the methodology workflow.

Figure 1: Methodology Workﬂow.

3.1 Datasets

This study analyzes user responses to phishing emails

using datasets generated from the Spamley platform,

with a focus on behavioral patterns in assessing email

legitimacy. Two primary datasets were employed:

1. Emails Characteristics Dataset: This dataset in-

cludes 136 emails, equally split between phishing

and legitimate types, in both Italian and English,

sourced from actual inboxes. Each email is cat-

egorized by technical and psychological features,

such as subject, context, phishing links, and cog-

nitive manipulations like authority and scarcity

(Gallo et al., 2024). These features are docu-

mented in a standardized schema to allow consis-

tent reference.

2. Email Responses Dataset: A survey was com-

pleted by 1027 participants, with 731 valid re-

sponses after pre-processing. This dataset records

demographic information, internet habits, and

psychological traits, including Big Five personal-

ity traits and self-reported cognitive vulnerabili-

ties (Gallo et al., 2024). Participants subsequently

classiﬁed email legitimacy, with their responses

recorded for clustering analysis.


This approach enables robust analysis of the re-

lationship between user characteristics and phishing

susceptibility, offering valuable insights for design-

ing tailored cybersecurity interventions and aware-

ness programs (Gallo et al., 2024).

3.2 Data Pre-Processing

Effective data pre-processing is crucial for ensuring

the quality and consistency of datasets used in predic-

tive clustering. This stage ensures the data is clean,

structured, and ready for analysis, supporting the reli-

ability of clustering and predictive models (Kotsiantis

et al., 2006; Han et al., 2022). The following 7 pre-

processing steps were applied to the individuals’ re-

sponses dataset to prepare it for the clustering phase:

1. Addressing Missing and Irrelevant Data: Initial cleaning involved removing redundant meta-columns (e.g., hash, first name, last name) deemed irrelevant to the analysis. Completely

empty columns and rows with over 30% missing

values were also removed, adhering to best prac-

tices for handling incomplete data (Little and Ru-

bin, 2019).

2. Feature Engineering: A new column, result, was

created to quantify the number of emails correctly

identiﬁed as legitimate or phishing. This feature

provided additional insights into user behavior,

enhancing the dataset’s predictive power.

3. Feature Selection: To reduce dimensional-

ity, features that uniquely identify the individ-

ual’s biographic traits as well as their psycho-

logical and behavioral traits were selected, so

that the clustering would be built on diverse

meaningfully-related traits. The ﬁnal retained

features include: computer science knowledge,

time on internet, educationField id as well as 27

other features all listed in Appendix A.

4. Outlier Detection and Treatment: Outliers were

identiﬁed using the interquartile range (IQR)

method (Aggarwal and Aggarwal, 2017). De-

pending on their relevance, outliers were either

corrected or removed, ensuring data consistency

and preventing skewed model performance.

5. Feature Normalization: Min-Max scaling was

applied to numerical features, standardizing them

to a uniform range. This step is critical for

distance-based clustering methods (Sammut and

Webb, 2011).

6. Encoding Categorical Variables: While most

categorical variables were already encoded in the

received dataset, label encoding was applied to

three remaining columns to prepare them for anal-

ysis (Pedregosa et al., 2011).

7. Handling Imbalanced Data: Imbalanced cate-

gorical columns were addressed by calculating

weights inversely proportional to the frequency of

each class. These weights emphasized minority

classes during model training without altering the

underlying data distribution.

These steps produced a clean and well-structured dataset that is ready for the clustering and predictive models and supports robust and reproducible results.
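Several of the steps above can be sketched in a few lines of NumPy. The block below is a minimal illustration, not the authors' pipeline: the function names are hypothetical, the 1.5 multiplier for the IQR fences is the conventional default rather than a value stated in the paper, and the weighting formula n_samples / (n_classes * class_count) is one common way to obtain weights inversely proportional to class frequency.

```python
import numpy as np

def drop_sparse_rows(X, max_missing=0.30):
    """Keep rows whose fraction of NaN entries is at most max_missing."""
    missing_fraction = np.isnan(X).mean(axis=1)
    return X[missing_fraction <= max_missing]

def iqr_bounds(column, factor=1.5):
    """Return (lower, upper) outlier fences from the interquartile range.

    factor=1.5 is an assumed, conventional choice.
    """
    q1, q3 = np.nanpercentile(column, [25, 75])
    iqr = q3 - q1
    return q1 - factor * iqr, q3 + factor * iqr

def min_max_scale(X):
    """Scale every column to [0, 1]; constant columns map to 0."""
    lo = np.nanmin(X, axis=0)
    hi = np.nanmax(X, axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)
    return (X - lo) / span

def inverse_frequency_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count) (assumed formula)."""
    classes, counts = np.unique(labels, return_counts=True)
    weights = len(labels) / (len(classes) * counts)
    return dict(zip(classes.tolist(), weights.tolist()))
```

Values outside the fences returned by `iqr_bounds` would then be corrected or dropped, depending on their relevance, as described in step 4.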

3.3 Clustering

This study adopts a quantitative methodology, employing clustering techniques to classify users based on their email interaction characteristics. The objective is to develop a model that effectively groups individuals according to their traits and behavioral patterns. The clustering algorithms were therefore applied to the dataset containing individuals’ responses.

3.4 Clustering Evaluation Metrics

To ensure robust and reliable clustering results, this

study employed a diverse range of clustering evalua-

tion metrics. These metrics assess intra-cluster com-

pactness, inter-cluster separation, and overall topo-

logical accuracy, ensuring the validity of the clus-

tering results. The metrics include the Silhouette

Score, Davies-Bouldin Index, Calinski-Harabasz In-

dex, Quantization Error, Topographic Error, and Gap

Statistics. Each metric and its mathematical formula-

tion is described below.

Silhouette Score: evaluates the quality of cluster-

ing by comparing the average intra-cluster distance to

the mean nearest-cluster distance. It is deﬁned as:

$$S(i) = \frac{b(i) - a(i)}{\max(a(i),\, b(i))}, \tag{1}$$

where a(i) is the mean distance between a data point

i and all other points in the same cluster, and b(i)

is the mean distance between i and all points in the

nearest neighboring cluster. The overall Silhouette

Score is the mean of S(i) for all data points. Higher

scores (closer to 1) indicate well-separated and com-

pact clusters (Rousseeuw, 1987).
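As an illustration of the definition above, the following NumPy sketch computes the mean silhouette over all points using Euclidean distances. The function name `mean_silhouette` is hypothetical, and the convention that a point in a singleton cluster scores 0 is an assumption taken from common practice, not from the paper.

```python
import numpy as np

def mean_silhouette(X, labels):
    """Mean of S(i) = (b(i) - a(i)) / max(a(i), b(i)) over all points."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    # Pairwise Euclidean distances, shape (n, n).
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    clusters = np.unique(labels)
    scores = np.zeros(len(X))
    for i in range(len(X)):
        own = labels == labels[i]
        n_own = own.sum()
        if n_own == 1:
            continue  # assumed convention: singleton clusters score 0
        # a(i): mean distance to the other members of the same cluster.
        a = dist[i, own].sum() / (n_own - 1)
        # b(i): smallest mean distance to the members of any other cluster.
        b = min(dist[i, labels == c].mean() for c in clusters if c != labels[i])
        scores[i] = (b - a) / max(a, b)
    return scores.mean()
```

On two tight, well-separated groups the score approaches 1, while a labeling that mixes the groups yields a negative score.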

Davies-Bouldin Index (DBI): quantiﬁes the aver-

age similarity between each cluster and its most sim-

ilar cluster, where similarity is a ratio of intra-cluster

dispersion to inter-cluster separation. It is calculated


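as:

$$\mathrm{DBI} = \frac{1}{k} \sum_{i=1}^{k} \max_{j \neq i} \left( \frac{\sigma_i + \sigma_j}{d(c_i, c_j)} \right), \tag{2}$$

where $k$ is the number of clusters, $\sigma_i$ is the average distance of points in cluster $i$ to their centroid $c_i$, and $d(c_i, c_j)$ is the distance between centroids $c_i$ and $c_j$. Lower DBI values indicate better cluster separation (Davies and Bouldin, 1979).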
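The Davies-Bouldin computation can be sketched directly from its definition: for each cluster, take the worst ratio of summed dispersions to centroid distance, then average over clusters. The function name is hypothetical and Euclidean distance is assumed.

```python
import numpy as np

def davies_bouldin_index(X, labels):
    """Average, over clusters, of the worst (sigma_i + sigma_j) / d(c_i, c_j)."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    clusters = np.unique(labels)
    k = len(clusters)
    centroids = np.array([X[labels == c].mean(axis=0) for c in clusters])
    # sigma[i]: mean distance from the points of cluster i to its centroid.
    sigma = np.array([
        np.linalg.norm(X[labels == c] - centroids[idx], axis=1).mean()
        for idx, c in enumerate(clusters)
    ])
    worst = np.zeros(k)
    for i in range(k):
        worst[i] = max(
            (sigma[i] + sigma[j]) / np.linalg.norm(centroids[i] - centroids[j])
            for j in range(k) if j != i
        )
    return worst.mean()
```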

Calinski-Harabasz Index: measures the ratio of

between-cluster dispersion to within-cluster disper-

sion. It is deﬁned as:

$$\mathrm{CH} = \frac{\operatorname{trace}(B_k)/(k - 1)}{\operatorname{trace}(W_k)/(n - k)}, \tag{3}$$

where $B_k$ is the between-cluster scatter matrix, $W_k$ is the within-cluster scatter matrix, $k$ is the number of clusters, and $n$ is the number of data points. Higher values indicate well-separated clusters (Calinski and Harabasz, 1974).
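Because only the traces of the two scatter matrices are needed, the index can be computed from sums of squared distances without forming the matrices. The sketch below assumes Euclidean geometry; the function name is hypothetical.

```python
import numpy as np

def calinski_harabasz_index(X, labels):
    """Ratio of between-cluster to within-cluster dispersion, each scaled by its degrees of freedom."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    n = len(X)
    clusters = np.unique(labels)
    k = len(clusters)
    overall_mean = X.mean(axis=0)
    between = 0.0  # trace of the between-cluster scatter matrix
    within = 0.0   # trace of the within-cluster scatter matrix
    for c in clusters:
        members = X[labels == c]
        centroid = members.mean(axis=0)
        between += len(members) * np.sum((centroid - overall_mean) ** 2)
        within += np.sum((members - centroid) ** 2)
    return (between / (k - 1)) / (within / (n - k))
```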

Quantization Error: For Self-Organizing Maps

(SOMs), it measures the average distance between

each data point and its best matching unit (BMU) on

the Self-Organizing Map (SOM). It is calculated as:

$$\mathrm{QE} = \frac{1}{N} \sum_{i=1}^{N} \lVert x_i - m_{\mathrm{BMU}(i)} \rVert, \tag{4}$$

where $N$ is the number of data points, $x_i$ is a data point, and $m_{\mathrm{BMU}(i)}$ is the prototype vector of the BMU for $x_i$. A lower Quantization Error indicates that the SOM effectively captures the data structure (Sun, 2000).

Topographic Error: evaluates how well the

SOM preserves the topological properties of the in-

put space. It is deﬁned as:

$$\mathrm{TE} = \frac{1}{N} \sum_{i=1}^{N} u(x_i), \tag{5}$$

where $u(x_i) = 1$ if the first and second BMUs of $x_i$ are not adjacent, and $u(x_i) = 0$ otherwise. A lower Topographic Error indicates better preservation of input space topology (Vesanto and Alhoniemi, 2000).
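Both SOM-specific errors can be computed from the trained prototype vectors. The sketch below is illustrative only: it assumes a rectangular grid stored as an array of shape (rows, cols, n_features) and treats two units as adjacent when they lie within each other's 8-neighbourhood. A hexagonal grid, as used for the visualization later in the paper, would need a different adjacency rule. The function name is hypothetical.

```python
import numpy as np

def som_errors(X, weights):
    """Return (quantization_error, topographic_error) for a rectangular SOM."""
    X = np.asarray(X, dtype=float)
    weights = np.asarray(weights, dtype=float)
    rows, cols, dim = weights.shape
    codebook = weights.reshape(rows * cols, dim)
    # Distance from every sample to every prototype, shape (n, rows * cols).
    dist = np.linalg.norm(X[:, None, :] - codebook[None, :, :], axis=2)
    order = np.argsort(dist, axis=1)
    first, second = order[:, 0], order[:, 1]
    # Quantization error: mean distance between each sample and its BMU.
    quantization_error = dist[np.arange(len(X)), first].mean()
    # Grid coordinates of the best and second-best matching units.
    r1, c1 = np.divmod(first, cols)
    r2, c2 = np.divmod(second, cols)
    # Assumed adjacency rule: 8-neighbourhood on a rectangular grid.
    adjacent = np.maximum(np.abs(r1 - r2), np.abs(c1 - c2)) <= 1
    topographic_error = 1.0 - adjacent.mean()
    return quantization_error, topographic_error
```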

These metrics collectively offer a comprehensive

framework for evaluating clustering performance, en-

suring reliable and valid results.

3.4.1 Clustering Algorithms Using Principal Component Analysis (PCA)

For the dataset derived from the Spamley platform, fundamental clustering methods, including k-means, Gaussian Mixture Models (GMM), and agglomerative clustering, were tested on features reduced via PCA. These methods served as an initial step to identify the most suitable algorithm for clustering individuals.

After initial clustering, silhouette analysis and the Davies-Bouldin Index, both explained in subsection 3.4, were employed to determine the optimal number of clusters (Rousseeuw, 1987).

The results of all four clustering algorithms were

suboptimal. Among them, k-means performed the

best; however, its clustering quality remains inade-

quate based on the silhouette scores and other evalua-

tion metrics. This suggests that the study should shift

towards more advanced techniques, such as deep clus-

tering algorithms, to improve clustering performance.

3.5 Deep Clustering

3.5.1 Generative Adversarial Network (GAN) for Dimensionality Reduction

K-Means Clustering Using GAN. A hybrid ap-

proach was introduced, combining GANs for di-

mensionality reduction with k-means for clustering.

GANs were chosen for their ability to transform high-

dimensional data into a latent space that captures

meaningful patterns, enhancing its suitability for clus-

tering. This section details the methodology, includ-

ing GANs architecture, training settings, and cluster-

ing evaluation, ensuring clarity and reproducibility.

Dimensionality Reduction with GANs. The

GANs architecture consisted of two primary compo-

nents:

• Generator: The generator transformed random

noise into synthetic samples that mirrored the

structure of the input data. It used a dense layer

with ReLU activation to produce outputs match-

ing the input dimensions.

• Discriminator: The discriminator evaluated the

authenticity of the generated samples using a

dense layer with sigmoid activation. Its training

was optimized using binary cross-entropy loss.

The GAN was trained iteratively, where the gener-

ator and discriminator were updated using the Adam

optimizer with a learning rate of $5 \times 10^{-5}$. Each GAN

conﬁguration was evaluated across encoding dimen-

sions ranging from 2 to 15, with the number of train-

ing epochs set to 50.

Optimal Latent Encoding Selection. Latent fea-

tures were generated by the trained generator for each

encoding dimension. These features were clustered

using k-means, and the clustering quality was as-

sessed using multiple metrics: the Silhouette Score,

Davies-Bouldin Index, and Calinski-Harabasz Index, as briefly explained in subsection 3.4. The encoding dimension with the highest silhouette score and the best combined metric score (maximizing silhouette and Calinski-Harabasz while minimizing Davies-Bouldin) was selected.

Optimal Cluster Determination. K-Means clus-

tering was applied across a range of clusters (k = 2

to 15). The optimal number of clusters was de-

termined by analyzing the same metrics and select-

ing the one with maximum silhouette and Calinski-

Harabasz while having minimum Davies-Bouldin.
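The paper does not give the formula behind the combined metric score, so the following is only one plausible sketch: each metric is min-max normalized across the candidates, and the combined score adds the normalized silhouette and Calinski-Harabasz values and subtracts the normalized Davies-Bouldin value. The function names and the equal weighting are assumptions.

```python
def min_max(values):
    """Rescale a list of numbers to [0, 1]; identical values all map to 0."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

def best_candidate(results):
    """Pick the candidate with the highest combined score.

    results maps a candidate (e.g. a value of k or an encoding dimension)
    to a tuple (silhouette, davies_bouldin, calinski_harabasz).
    """
    keys = list(results)
    sil = min_max([results[c][0] for c in keys])
    dbi = min_max([results[c][1] for c in keys])
    ch = min_max([results[c][2] for c in keys])
    # Assumed equal weighting: reward silhouette and CH, penalize DBI.
    combined = {c: s + h - d for c, s, d, h in zip(keys, sil, dbi, ch)}
    return max(combined, key=combined.get), combined
```

The same helper could be applied to either search, over encoding dimensions or over cluster counts, since both rely on the same three metrics.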

The results demonstrate a better balance be-

tween dimensionality reduction and clustering preci-

sion compared to PCA-based clustering. However,

despite the notable improvement in clustering scores,

this approach still lags behind the other two dimen-

sionality reduction techniques and their correspond-

ing clustering results, discussed below.

3.5.2 Self-Organizing Maps (SOMs) for Dimensionality Reduction

K-Means Clustering Using SOM.

This approach employs Self-Organizing Maps

(SOMs) for dimensionality reduction combined

with k-means clustering to identify patterns in

high-dimensional data. SOMs provide topology-

preserving transformations, while k-means extracts

distinct clusters, resulting in interpretable and struc-

tured representations. The methodology encompasses

dimensionality reduction, clustering, evaluation us-

ing multiple metrics, and visualization to ensure

reproducibility and reliability.

Dimensionality Reduction Using SOMs.

Introduced by Kohonen (1982), SOMs are artificial

neural networks designed to project high-dimensional

data onto a lower-dimensional grid while preserving

topological relationships. For this study, SOMs were

conﬁgured with the following parameters:

• Sigma: 0.5

• Learning Rate: 0.5

• Training Iterations: 100

To pre-process the data, an auto-encoder was used to

compress high-dimensional data into a latent space

before applying SOM. The auto-encoder was trained

with:

• Learning Rate: $5 \times 10^{-5}$

• Batch Size: 50

• Epochs: 20

• Early Stopping Patience: 5

This combination leveraged the topology-preserving

properties of the SOM and the ability of the auto-

encoder to capture latent features.
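To make the role of the SOM parameters concrete, the sketch below trains a small rectangular SOM with the listed sigma, learning rate, and iteration count. It is a from-scratch illustration, not the implementation used in the study: the Gaussian neighbourhood, the linear decay schedule, the random initialization, and the function names are all assumptions.

```python
import numpy as np

def train_som(X, rows, cols, sigma=0.5, learning_rate=0.5, iterations=100, seed=42):
    """Train a rectangular SOM and return weights of shape (rows, cols, n_features)."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    weights = rng.random((rows, cols, X.shape[1]))
    # Grid coordinates of every unit, used by the neighbourhood function.
    grid_r, grid_c = np.meshgrid(np.arange(rows), np.arange(cols), indexing="ij")
    for t in range(iterations):
        x = X[rng.integers(len(X))]
        # Best matching unit: the prototype closest to the sampled point.
        dist = np.linalg.norm(weights - x, axis=2)
        bmu_r, bmu_c = np.unravel_index(np.argmin(dist), dist.shape)
        # Assumed schedule: linear decay of learning rate and radius.
        decay = 1.0 - t / iterations
        lr_t = learning_rate * decay
        sigma_t = max(sigma * decay, 1e-3)
        grid_dist_sq = (grid_r - bmu_r) ** 2 + (grid_c - bmu_c) ** 2
        influence = np.exp(-grid_dist_sq / (2.0 * sigma_t ** 2))
        # Pull every prototype towards the sample, weighted by its influence.
        weights += lr_t * influence[:, :, None] * (x - weights)
    return weights

def map_to_grid(X, weights):
    """Return the (row, col) of the best matching unit for every sample."""
    X = np.asarray(X, dtype=float)
    rows, cols, dim = weights.shape
    codebook = weights.reshape(-1, dim)
    dist = np.linalg.norm(X[:, None, :] - codebook[None, :, :], axis=2)
    return np.column_stack(np.divmod(np.argmin(dist, axis=1), cols))
```

In the pipeline described here, `X` would be the latent features produced by the auto-encoder, and the grid coordinates from `map_to_grid` would be passed on to k-means.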

Optimal Encoding Dimension Selection.

The optimal encoding dimension was determined by evaluating the clustering quality metrics explained in subsection 3.4: silhouette score, Davies-Bouldin index, Calinski-Harabasz index, quantization error, and topographic error. In addition, a combined score that favors a high silhouette score and Calinski-Harabasz index together with a low Davies-Bouldin index guided the selection of the encoding dimension that best captures the structure of the dataset.

Optimal Cluster Determination.

K-Means clustering was applied to the SOM-mapped

features across a range of cluster counts (k = 2 to

15). The optimal k was determined using the same

multi-metric score evaluation mentioned in the previ-

ous paragraph, ensuring robust and meaningful clus-

ter selection.

Visualization of SOM Clusters.

The clustered data was visualized on a 15×15 hexag-

onal grid, where the color of each cell represented

its cluster label. Boundaries and centroids were

highlighted for clarity, and convex hulls were drawn

around clusters to enhance interpretability. Figure 2

provides an example visualization, illustrating cluster

density and distribution.

Figure 2: Clusters Visualization on SOM Hexagonal Grid.

This integrated approach emphasizes the inter-

pretability of SOM clusters while preserving robust

clustering accuracy, offering actionable insights into

the structure of the dataset.

3.5.3 Auto-Encoder-Based Clustering

K-Means Clustering Using Auto-Encoders.

This approach leverages auto-encoders, a type of neu-


ral network for unsupervised learning, in combination

with k-means clustering to analyze user responses

on Spamley’s test. Auto-encoders effectively reduce

the dimensionality of high-dimensional data by map-

ping it into a latent space that retains essential fea-

tures while discarding irrelevant information, making

it a reliable framework for analyzing various types of

complex datasets in different study directions (Abou

El-Naga et al., 2022).

Dimensionality Reduction Using Auto-Encoders.

The auto-encoder architecture was conﬁgured with

the following parameters to achieve effective dimen-

sionality reduction:

• Learning Rate: $5 \times 10^{-5}$

• Batch Size: 50

• Epochs: 20

• Early Stopping Patience: 5

The auto-encoder consists of two components:

• Encoder: Compresses high-dimensional input

into a lower-dimensional latent space using a

dense layer with ReLU activation.

• Decoder: Reconstructs the input from the latent

space, ensuring minimal reconstruction loss, with

a dense layer using sigmoid activation.

The model was trained on the dataset with a val-

idation split of 30%, leveraging early stopping to

prevent overﬁtting. The training and validation loss

trends were plotted for each encoding dimension to

ensure convergence and identify the most suitable di-

mensionality for clustering.
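The early-stopping rule with a patience of 5 can be illustrated independently of any deep learning framework. The sketch below replays a sequence of validation losses and reports when training would stop; the function name and the `min_delta` parameter are hypothetical, and the loss values used with it would come from the actual training run.

```python
def early_stopping_epoch(val_losses, patience=5, min_delta=0.0):
    """Return (stop_epoch, best_epoch) for a sequence of validation losses.

    Epochs are 0-based. Training stops once `patience` consecutive epochs
    fail to improve on the best loss by more than `min_delta`.
    """
    best_loss = float("inf")
    best_epoch = -1
    waited = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss - min_delta:
            # New best loss: remember it and reset the patience counter.
            best_loss, best_epoch, waited = loss, epoch, 0
        else:
            waited += 1
            if waited >= patience:
                return epoch, best_epoch
    # Patience never ran out: training used every available epoch.
    return len(val_losses) - 1, best_epoch
```

With 20 epochs and a steadily decreasing validation loss the rule never triggers, so all epochs are used; with a plateau it ends training after five non-improving epochs.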

Optimal Encoding Dimension Selection.

To determine the best encoding dimension, k-means clustering was applied to the latent features extracted by the auto-encoder across a range of dimensions (2–15). Clustering quality was evaluated using the silhouette score, Calinski-Harabasz index, and Davies-Bouldin index. The encoding dimension with the highest silhouette score and the best overall combined metric score was selected as optimal.

Optimal Cluster Determination.

K-means clustering was applied to the latent features

across a range of cluster numbers (k = 2 to 12). The

optimal k was determined by analyzing multiple met-

rics mentioned in subsection 3.4.

In conclusion, integrating the feature extraction

capabilities of auto-encoders with k-means and val-

idating the results using robust clustering evaluation

techniques provided reliable and adaptable outcomes

for analyzing the user responses dataset from Spam-

ley and generating clusters of user proﬁles.

3.6 Reproducibility and Robustness

To ensure the reliability of all models that used k-means in their clustering approach, the following measures were taken:

• Random Seed: A ﬁxed seed (42) was used for all

stochastic operations.

• Consensus-Based Metrics: Optimal k was se-

lected based on a consensus of multiple metrics.

• Manual Centroid Initialization: Final centroids

were saved and reused for consistent clustering re-

sults.
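The first and third measures can be sketched with a small k-means routine written in NumPy. This is an illustration under stated assumptions, not the study's code: the function names are hypothetical, initial centroids are drawn from the data points, and empty clusters keep their previous centroid.

```python
import numpy as np

def assign(X, centroids):
    """Index of the nearest centroid for every sample."""
    dist = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    return dist.argmin(axis=1)

def kmeans(X, k, seed=42, init_centroids=None, max_iter=100):
    """Lloyd's algorithm with a fixed seed or explicitly supplied centroids."""
    X = np.asarray(X, dtype=float)
    if init_centroids is None:
        # Fixed seed: the same initial centroids are drawn on every run.
        rng = np.random.default_rng(seed)
        centroids = X[rng.choice(len(X), size=k, replace=False)]
    else:
        # Saved centroids from an earlier run are reused as the start point.
        centroids = np.asarray(init_centroids, dtype=float)
    for _ in range(max_iter):
        labels = assign(X, centroids)
        updated = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(updated, centroids):
            break
        centroids = updated
    return assign(X, centroids), centroids
```

Saving the returned centroids (for example with `np.save`) and passing them back through `init_centroids` reproduces the same assignment on later runs.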

3.7 Utilization of Generated Labels

After selecting the best clustering approach, clusters

were assigned labels ranging from 0 to n-1, where n

is the total number of clusters. A new column, "labels", was added to enable easy extraction of all the rows that belong to the same cluster. Additionally, the "labels" column serves as the target variable in the supervised learning algorithm used to predict cluster membership for new users.

A more in-depth analysis was conducted to iden-

tify emails that were misclassiﬁed as legitimate de-

spite being phishing, and vice versa, by the major-

ity of individuals within each cluster. This analy-

sis utilized the email ids feature, which was excluded

during clustering due to the randomized sampling of

emails presented in each test attempt (Gallo et al.,

2024), as its inclusion could negatively inﬂuence clus-

tering outcomes. This insight proved crucial in iden-

tifying email features that tend to deceive users. A

function was then developed to generate a histogram

displaying the top 10 email IDs that misled users. The

identiﬁed deceiving email IDs were then passed to

the emails dataset which was generalized to create

a feature-based scheme rather than relying on static

email attributes. This scheme can then be used to craft new emails that align with the user profile.
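The per-cluster ranking of misleading emails can be sketched with the standard library. The record layout below, a tuple of (cluster label, email ID, whether the judgment was correct), is a hypothetical simplification of the responses dataset, and the function name is illustrative.

```python
from collections import Counter, defaultdict

def top_misjudged_emails(responses, top_n=10):
    """Rank, per cluster, the email IDs that were misjudged most often.

    responses: iterable of (cluster_label, email_id, was_correct) tuples
    (an assumed layout, not the dataset's actual schema).
    """
    counts = defaultdict(Counter)
    for cluster, email_id, was_correct in responses:
        if not was_correct:
            counts[cluster][email_id] += 1
    # most_common returns (email_id, count) pairs sorted by descending count.
    return {cluster: counter.most_common(top_n) for cluster, counter in counts.items()}
```

The resulting counts are what a histogram of the top 10 email IDs per cluster would display.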

3.8 Cluster Prediction Models

The methodology leverages a range of Machine

Learning (ML) and Deep Learning (DL) models to

predict cluster assignments.

Dataset Splitting: Data is split into an 80%-20%

ratio for training and testing. Features (X) include 30

selected attributes, while the target variable (y) repre-

sents cluster labels. Consistent random state initial-

ization ensures reproducibility.
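A reproducible 80%-20% split can be sketched as a seeded permutation of row indices. The function name is hypothetical, and the seed value 42 is borrowed from the reproducibility settings reported for k-means; the paper does not state which random state was used for the split.

```python
import numpy as np

def train_test_split_indices(n_samples, test_fraction=0.2, seed=42):
    """Return (train_indices, test_indices) from a seeded shuffle."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_samples)
    n_test = int(round(n_samples * test_fraction))
    return order[n_test:], order[:n_test]
```

For the 731 valid responses this yields 585 training rows and 146 test rows, and repeated calls with the same seed return the same partition.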


Machine Learning Models:

• Random Forest (RF): An ensemble method that builds multiple decision trees and aggregates their predictions, with the tree depth and the number of estimators tuned as hyperparameters.

• Gradient Boosting Machines (GBM): Sequen-

tially enhances weak classiﬁers, reducing bias and

variance. Fine-tuning included the learning rate

and number of boosting stages.

• XGBoost: Combines gradient boosting with reg-

ularization and early stopping for computational

efﬁciency and accuracy in handling complex data.

• Support Vector Machines (SVM): Utilized lin-

ear and radial basis function (RBF) kernels to

separate data with maximum margin. Parameters

such as C and γ were optimized.

• k-Nearest Neighbors (k-NN): Assigns labels

based on majority class among k nearest neigh-

bors, with k = 5 selected for balanced perfor-

mance.

• Naive Bayes (NB): A probabilistic model lever-

aging Gaussian assumptions, suitable for high-

dimensional data.
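Of the models listed above, k-NN is simple enough to sketch in full. The block below assigns each new profile the majority cluster label among its k = 5 nearest training profiles. Euclidean distance and the function name are assumptions, and ties are resolved by whichever label `Counter` encounters first.

```python
from collections import Counter

import numpy as np

def knn_predict(X_train, y_train, X_new, k=5):
    """Predict a cluster label for each row of X_new by majority vote."""
    X_train = np.asarray(X_train, dtype=float)
    X_new = np.asarray(X_new, dtype=float)
    y_train = np.asarray(y_train)
    # Distance from every new sample to every training sample.
    dist = np.linalg.norm(X_new[:, None, :] - X_train[None, :, :], axis=2)
    nearest = np.argsort(dist, axis=1)[:, :k]
    return [Counter(y_train[row].tolist()).most_common(1)[0][0] for row in nearest]
```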

Deep Learning Models:

• Artiﬁcial Neural Networks (ANN): A feed-

forward network with dense layers and dropout

for overﬁtting control. Trained for 50 epochs us-

ing the Adam optimizer.

• Convolutional Neural Networks (CNN): Imple-

mented as a 1D architecture for sequential data,

extracting local patterns through convolution and

pooling layers.

3.9 Cluster Prediction Evaluation Metrics

The cluster prediction models described above were evaluated with several metrics, each capturing a different aspect of performance, such as accuracy, robustness, and generalization capability.

Accuracy: measures the proportion of correctly

predicted labels out of the total labels and is deﬁned

as:

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \tag{6}$$

where TP, TN, FP, and FN represent True Positives,

True Negatives, False Positives, and False Negatives,

respectively. Accuracy provides a simple and intu-

itive measure but may be misleading in imbalanced

datasets (Powers, 2020).

Precision: evaluates the proportion of true posi-

tive predictions among all positive predictions. It is

calculated as:

$$\text{Precision} = \frac{TP}{TP + FP} \tag{7}$$

High precision indicates that the model produces

fewer false positive predictions (Powers, 2020).

Recall: measures the proportion of true positives

correctly identiﬁed by the model. It is deﬁned as:

$$\text{Recall} = \frac{TP}{TP + FN} \tag{8}$$

Recall is particularly useful in scenarios where mini-

mizing false negatives is critical (Powers, 2020).

F1-Score: harmonic mean of Precision and Re-

call, providing a single metric to balance both mea-

sures. It is given by:

$$\text{F1-Score} = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} \tag{9}$$

The F1-Score is particularly useful when dealing with

imbalanced datasets (Yedidia, 2016).
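The four definitions above can be sketched in a few lines of Python. The one-vs-rest counting helper is an assumption about how the binary formulas would be applied to multi-class cluster labels; the paper does not state its averaging scheme, and the function names and toy values are illustrative only.

```python
def classification_metrics(tp, tn, fp, fn):
    """Accuracy, precision, recall and F1 from confusion-matrix counts."""
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

def one_vs_rest_counts(y_true, y_pred, positive):
    """Count TP, TN, FP, FN when one cluster label is treated as 'positive'."""
    tp = tn = fp = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive and truth == positive:
            tp += 1
        elif pred == positive:
            fp += 1
        elif truth == positive:
            fn += 1
        else:
            tn += 1
    return tp, tn, fp, fn

# Toy counts (assumption): an imbalanced case with few positives.
scores = classification_metrics(tp=8, tn=85, fp=2, fn=5)
counts = one_vs_rest_counts([0, 0, 1, 1, 2, 2], [0, 1, 1, 1, 2, 0], positive=1)
```

In the toy case the accuracy is 0.93 while the recall is only 8/13, which illustrates why accuracy alone can be misleading on imbalanced data.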

4 RESULTS AND DISCUSSION

This section provides a detailed analysis of the clustering and predictive model results, comparing methods to identify the most effective techniques for accurate clustering. Key findings and evaluations are discussed to assess performance and alignment with the research objectives.

4.1 Clustering Performance According to Different Dimensionality Reduction Techniques

The performance of three non-linear dimensionality reduction techniques, namely Self-Organizing Maps (SOM), auto-encoders, and Generative Adversarial Networks (GANs), is evaluated in combination with k-means clustering, with Principal Component Analysis (PCA) serving as a linear baseline. Each method offers a distinct approach to transforming high-dimensional data into lower-dimensional representations, facilitating clustering.

Given the stochastic nature of k-means clustering,

where initial centroid positions are selected randomly

in each run, the outcomes for the optimum encoding

dimension, optimum number of clusters, and evalu-

ation metric scores varied across iterations. To en-

sure reliable and reproducible results, each technique

was subjected to a loop of 300 iterations, where the

most frequently observed optimal number of clusters

(k) was recorded. This iterative approach minimized

ICISSP 2025 - 11th International Conference on Information Systems Security and Privacy


variability and allowed for a robust analysis of the re-

sulting evaluation metrics. At the conclusion of all

iterations, the metrics corresponding to the most con-

sistent clustering outcomes were documented and an-

alyzed.
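The repeated-run procedure described above can be sketched as follows: a stochastic routine that returns an optimal k is called many times, and the most frequent value is kept. The routine `noisy_selector` is a hypothetical stand-in for "fit k-means over a range of k and pick the best-scoring one"; its probabilities, like the function names and seed, are assumptions for illustration only.

```python
from collections import Counter
import random

def most_frequent_k(select_k_once, n_runs=300, seed=0):
    """Run a stochastic k-selection routine n_runs times; return the modal k."""
    rng = random.Random(seed)
    counts = Counter(select_k_once(rng) for _ in range(n_runs))
    best_k, frequency = counts.most_common(1)[0]
    return best_k, frequency, counts

def noisy_selector(rng):
    """Hypothetical stand-in for one clustering run that reports its best k."""
    # Assumption for illustration: the run settles on k=4 most of the time.
    return rng.choices([3, 4, 5], weights=[0.2, 0.6, 0.2])[0]

best_k, frequency, counts = most_frequent_k(noisy_selector)
```

Keeping the full `counts` tally alongside the modal value makes it possible to report how consistently a given k was selected across the runs.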

This assessment procedure ensures that the reported results reflect the clustering effectiveness of each dimensionality reduction technique, providing a reliable basis for comparison and insight into their suitability for the given dataset. Table 1 compares the results of all approaches.

Table 1: Clustering Metrics Comparison Table.

The performance of clustering techniques was

evaluated across four dimensionality reduction meth-

ods: Self-Organizing Maps (SOM), auto-encoders,

Generative Adversarial Networks (GANs), and Prin-

cipal Component Analysis (PCA). The evaluation uti-

lized a range of clustering metrics, including the Sil-

houette Score, Davies-Bouldin Index, and Calinski-

Harabasz Index, to assess the quality of cluster com-

pactness, separation, and overall structure. These

metrics provide complementary insights into the ef-

ﬁcacy of the clustering process, ensuring a compre-

hensive evaluation framework.
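As an illustration of one of these metrics, the following standard-library sketch computes the Davies-Bouldin Index from its usual definition: for each cluster, take the worst ratio of summed within-cluster scatter to centroid distance, then average over clusters. Lower values indicate more compact, better separated clusters. The function names and toy clusters are assumptions; the sketch assumes at least two clusters with distinct centroids.

```python
import math

def centroid(points):
    """Component-wise mean of a non-empty list of equal-length points."""
    dim = len(points[0])
    return [sum(p[d] for p in points) / len(points) for d in range(dim)]

def davies_bouldin(clusters):
    """clusters: list of at least two clusters, each a non-empty list of points."""
    centroids = [centroid(c) for c in clusters]
    # Scatter: mean distance from each member to its own cluster centroid.
    scatter = [
        sum(math.dist(p, ctr) for p in c) / len(c)
        for c, ctr in zip(clusters, centroids)
    ]
    worst_ratios = []
    for i in range(len(clusters)):
        ratios = [
            (scatter[i] + scatter[j]) / math.dist(centroids[i], centroids[j])
            for j in range(len(clusters)) if j != i
        ]
        worst_ratios.append(max(ratios))
    return sum(worst_ratios) / len(worst_ratios)

# Toy clusters (assumption): the same unit square placed far apart vs. close.
square = [[0, 0], [0, 1], [1, 0], [1, 1]]
far = [[x + 10, y + 10] for x, y in square]
near = [[x + 1, y + 1] for x, y in square]
db_separated = davies_bouldin([square, far])
db_overlapping = davies_bouldin([square, near])
```

Moving the second cluster closer to the first raises the index, matching the intuition that overlapping clusters score worse.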

Principal Component Analysis (PCA) and K-

Means: PCA served as the baseline for dimensional-

ity reduction, yielding the lowest performance across

all metrics. The Silhouette Score of 0.113 and the

Calinski-Harabasz Index of 82.851 indicate poor clus-

ter compactness and separation, while the Davies-

Bouldin Index of 0.234 suggests signiﬁcant overlap

between clusters. Unlike the other methods, PCA did

not optimize an encoding dimension, as it reduces di-

mensionality linearly. The results highlight the lim-

itations of PCA in capturing the non-linear relation-

ships inherent in the data, reafﬁrming the superior-

ity of non-linear techniques such as SOMs and auto-

encoders for clustering tasks.

Generative Adversarial Network (GAN) and K-Means: Despite the potential of GANs for generating rich latent representations, they underperformed compared to SOMs and auto-encoders. The Silhouette Score of 0.409 and the Davies-Bouldin Index of 0.941 indicate less cohesive clusters with higher overlap, and the Calinski-Harabasz Index of 441.323 reflects weaker cluster dispersion. The optimal configuration used an encoding dimension of 2 and 3 clusters, suggesting that GANs struggled to identify distinct patterns in the data. This low performance can be attributed to the sensitivity of GANs to noise during training and to the limited number of epochs used to avoid over-fitting. In addition, using GANs to expand the dataset risks introducing synthetic anomalies and noise, reducing the authenticity and reliability of the dataset. For these reasons, GANs will not be considered in further studies.

Auto-Encoder and K-Means: Auto-encoder-based dimensionality reduction delivered competitive results, with a Silhouette Score of 0.795 and a Calinski-Harabasz Index of 1268.813, demonstrating the effectiveness of the method in capturing meaningful latent representations. However, the Davies-Bouldin Index of 0.625 suggests that the clusters were slightly less compact than those obtained with SOM. The auto-encoder reduced the dimensionality to 2, and the optimal number of clusters was 4, matching SOM. This outcome highlights the capability of auto-encoders to balance data compression with the preservation of key features relevant to clustering.

Self-Organizing Maps (SOM) and K-Means:

The combination of SOM with k-means clustering

achieved the highest overall performance across all

metrics. SOM preserved the topological structure of

the data during dimensionality reduction, resulting in

well-separated and cohesive clusters. A Silhouette

Score of 0.834 indicates strong intra-cluster similarity

and inter-cluster separation, while the Davies-Bouldin

Index of 0.424 reﬂects tight and distinct clusters. The

Calinski-Harabasz Index, with a value of 1676.239,

further supports the robustness of this approach. The

optimal conﬁguration was achieved with an encoding

dimension of 2 and 4 clusters, demonstrating the abil-

ity of the model to maintain data integrity while sim-

plifying its representation.

Key Findings.

The results highlight the existence of four distinct user profile clusters, providing a foundation for analyzing the dominant deceptive traits influencing each cluster. The results also demonstrate the superiority of SOM and auto-encoders with k-means for

the responses dataset from Spamley. SOMs, in partic-

ular, provided the most robust and interpretable clus-

ters, while auto-encoders offered competitive perfor-

mance with slightly lower cluster compactness. GAN

proved to be unreliable due to its low scores and sen-


sitivity to noise during training. PCA, while widely

used, proved inadequate for this dataset, underscoring

the importance of using advanced non-linear methods

for complex clustering problems.

These ﬁndings highlight the critical role of di-

mensionality reduction techniques in enabling effec-

tive clustering and provide a strong foundation for fu-

ture work in personalized user proﬁling and predic-

tive analytics. The demonstrated advantages of SOMs

and auto-encoders suggest that they are well-suited

for applications requiring robust clustering in high-

dimensional and behaviorally rich datasets.

4.2 Predictive Clustering Performance

The predictive clustering performance was evaluated

using four key metrics: Accuracy, Precision, Recall,

and F1 Score. These metrics, which are explained in Section 3.9, were chosen to comprehensively assess the ability of the models to predict the cluster of each user.

4.3 Performance Across Models

The results, as illustrated in Figure 3, demonstrate no-

table variations in performance across models.

Figure 3: Comparison of Performance Metrics Across All

Supervised Models.

• SVM (Linear) demonstrated the highest perfor-

mance exceeding 90% in all metrics, making it

the most reliable for predictive clustering.

• ANN and CNN Models performed strongly, with

scores exceeding 80% in all metrics, emphasizing

their ability to handle complex datasets.

• Gradient Boosting, Random Forest, and XG-

Boost showed competitive performance but

slightly lower Recall, suggesting a preference for

Precision over sensitivity.

• Naive Bayes underperformed, particularly in Re-

call and F1 Scores, likely due to its simplifying

assumptions that do not suit complex dependen-

cies in the data.

• k-NN offered balanced results but was outper-

formed by deep learning and ensemble-based

methods.

Key Findings.

• Superiority of Linear SVM: The performance of

the SVM model suggests that the cluster bound-

aries are well-separated in the feature space, mak-

ing it the most effective choice for predictive clus-

tering in this dataset.

• Strength of Deep Learning Models: The robust

performance of ANN and CNN highlights their

ability to capture non-linear relationships and sub-

tle patterns, making them well-suited for proﬁling

tasks.

• Limitations of Naive Bayes: The signiﬁcant gap

in Recall and F1 Scores for Naive Bayes under-

scores the importance of choosing models that

can accommodate the inherent complexity of user

proﬁling datasets.

5 CONCLUSIONS AND FUTURE WORK

This study introduced a novel predictive clustering

framework designed for the Spamley dataset, integrat-

ing email interaction patterns and user traits to en-

hance cybersecurity user proﬁling. By leveraging ad-

vanced dimensionality reduction techniques, includ-

ing Self-Organizing Maps (SOMs), autoencoders,

and Generative Adversarial Networks (GANs), the

framework delivered its most robust performance

with SOMs, achieving a Silhouette Score of 0.83,

a Davies-Bouldin Index of 0.42, and a Calinski-

Harabasz Index of 1676.2. These results address

the limitations of traditional methods, demonstrating

the effectiveness of advanced non-linear techniques

for clustering complex user proﬁles. The clustering

models identified four distinct clusters, whose analysis would provide foundational insights for the development of tailored phishing countermeasures. Additionally, Support Vector Machines (SVMs) and neural network models proved to be effective in classifying cluster membership, enabling predictions of email

characteristics that manipulate user proﬁles. This

framework offers actionable insights for personalized

content delivery and targeted awareness campaigns to

mitigate phishing attacks more effectively.


Future work will focus on analyzing the clus-

ters generated by this model and documenting the

insights, following the expansion of the Spamley

dataset to improve the generalizability and accuracy

of the models. Additionally, efforts will be directed

toward exploring further variables that can be incor-

porated into the model to reﬁne user proﬁling.

REFERENCES

Abou El-Naga, A. H., Sayed, S., Salah, A., and Mohsen,

H. (2022). Consensus nature inspired clustering

of single-cell rna-sequencing data. IEEE Access,

10:98079–98094.

Aggarwal, C. C. (2017). An introduction to outlier analysis. Springer.

Albladi, S. M. and Weir, G. R. (2018). User characteristics

that inﬂuence judgment of social engineering attacks

in social networks. Human-centric Computing and In-

formation Sciences, 8:1–24.

Allodi, L., Chotza, T., Panina, E., and Zannone, N.

(2019). The need for new antiphishing measures

against spear-phishing attacks. IEEE Security & Pri-

vacy, 18(2):23–34.

Calinski, T. and Harabasz, J. (1974). A dendrite method

for cluster analysis. Communications in Statistics,

3(1):1–27.

Chandola, V., Banerjee, A., and Kumar, V. (2009).

Anomaly detection: A survey. ACM Computing Sur-

veys (CSUR), 41(3):1–58.

Cialdini, R. B. (2007). Influence: The psychology of persuasion, volume 55. Collins New York.

Davies, D. L. and Bouldin, D. W. (1979). A cluster separa-

tion measure. IEEE transactions on pattern analysis

and machine intelligence, (2):224–227.

Dhamija, R., Tygar, J. D., and Hearst, M. (2006). Why

phishing works. In Proceedings of the SIGCHI confer-

ence on Human Factors in computing systems, pages

581–590.

Gallo, L., Gentile, D., Ruggiero, S., Botta, A., and Ventre,

G. (2024). The human factor in phishing: Collect-

ing and analyzing user behavior when reading emails.

Computers & Security, 139:103671.

GDPR, G. D. P. R. (2016). General data protection reg-

ulation. Regulation (EU) 2016/679 of the European

Parliament and of the Council of 27 April 2016 on the

protection of natural persons with regard to the pro-

cessing of personal data and on the free movement of

such data, and repealing Directive 95/46/EC.

Han, J., Pei, J., and Tong, H. (2022). Data mining: concepts

and techniques. Morgan kaufmann.

Kim, S.-H. and Cho, S.-B. (2024). Detecting phishing urls

based on a deep learning approach to prevent cyber-

attacks. Applied Sciences, 14(22):10086.

Kohonen, T. (1982). Self-organized formation of topolog-

ically correct feature maps. Biological cybernetics,

43(1):59–69.

Kotsiantis, S. B., Kanellopoulos, D., and Pintelas, P. E.

(2006). Data preprocessing for supervised leaning.

International journal of computer science, 1(2):111–

117.

Lawson, P., Pearson, C. J., Crowson, A., and Mayhorn,

C. B. (2020). Email phishing and signal detection:

How persuasion principles and personality inﬂuence

response patterns and accuracy. Applied ergonomics,

86:103084.

Little, R. J. and Rubin, D. B. (2019). Statistical analysis

with missing data, volume 793. John Wiley & Sons.

Parrish Jr, J. L., Bailey, J. L., and Courtney, J. F. (2009).

A personality based model for determining suscepti-

bility to phishing attacks. Little Rock: University of

Arkansas, pages 285–296.

Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V.,

Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P.,

Weiss, R., Dubourg, V., et al. (2011). Scikit-learn:

Machine learning in python. the Journal of machine

Learning research, 12:2825–2830.

Powers, D. M. (2020). Evaluation: from precision, recall

and f-measure to roc, informedness, markedness and

correlation. arXiv preprint arXiv:2010.16061.

Rousseeuw, P. J. (1987). Silhouettes: a graphical aid to

the interpretation and validation of cluster analysis.

Journal of computational and applied mathematics,

20:53–65.

Sammut, C. and Webb, G. I. (2011). Encyclopedia of ma-

chine learning. Springer Science & Business Media.

Sun, Y. (2000). On quantization error of self-organizing

map network. Neurocomputing, 34(1-4):169–193.

Tornblad, M. K., Jones, K. S., Namin, A. S., and Choi,

J. (2021). Characteristics that predict phishing sus-

ceptibility: a review. In Proceedings of the Human

Factors and Ergonomics Society Annual Meeting, vol-

ume 65, pages 938–942. SAGE Publications Sage CA:

Los Angeles, CA.

Van Der Heijden, A. and Allodi, L. (2019). Cognitive triag-

ing of phishing attacks. In 28th USENIX Security Sym-

posium (USENIX Security 19), pages 1309–1326.

Vesanto, J. and Alhoniemi, E. (2000). Clustering of the self-

organizing map. IEEE Transactions on neural net-

works, 11(3):586–600.

Wang, J., Herath, T., Chen, R., Vishwanath, A., and Rao,

H. R. (2012). Research article phishing susceptibil-

ity: An investigation into the processing of a targeted

spear phishing email. IEEE transactions on profes-

sional communication, 55(4):345–362.

Yedidia, A. (2016). Against the f-score. URL: https://adamyedidia.files.wordpress.com/2014/11/fscore.pdf.


APPENDIX

This appendix lists the characteristics of each individual in the dataset that were selected as features for clustering.

Table 2: Selected Key Features for Clustering.

| Feature | Description |
| --- | --- |
| age | Integer representing age. |
| gender | "Male", "Female", or "Other". |
| years job experience | Integer representing the number of years. |
| computer science knowledge | Score from 1 to 5, where 5 means a strong background. |
| phishing attack | 0 or 1, where 1 means the user has experienced a phishing attack before. |
| antiPhishing course ever | 0 or 1, where 1 means familiarity with cybersecurity awareness content. |
| time on internet | Score from 1 to 10, where 10 means excessive time on the internet. |
| educationField id, jobField id | Both features share the same IDs (1 to 15): 1. Natural Sciences; 2. Mathematics and Physics; 3. Information Technology; 4. Engineering; 5. Architecture and Building; 6. Agriculture and Related Studies; 7. Health; 8. Management and Commerce; 9. Society and Culture; 10. Arts and Entertainment; 11. Culinary, Hospitality; 12. Law; 13. Finance; 14. Psychology; 15. Other. |
| educationLevel id | IDs (1 to 4): 1. High school graduate or below; 2. Bachelor's degree; 3. Master's degree; 4. Doctorate degree. |
| employmentType id | IDs (1 to 9): 1. Trainee; 2. Employee; 3. Manager; 4. Executive; 5. Student; 6. Teacher; 7. R&D; 8. Entrepreneur; 9. Freelancer. |
| work hours prior test | Integer representing the number of hours. |
| test location | Device type used while reading the email, represented as a string. |
| self confidence | Rating of self-confidence from 0 to 5, where 5 means very confident. |
| impulsivity | Rating of impulsivity from 0 to 5, where 5 means very impulsive. |
| curiosity | Rating of curiosity from 0 to 5, where 5 means very curious. |
| risk propensity | Rating of risk propensity from 0 to 5, where 5 is the highest value. |
| risk perception | Rating of risk perception from 0 to 5, where 5 is the highest value. |
| privacy data | Rating of care towards data privacy from 0 to 5, where 5 is the highest. |
| extraversion | Rating of personality trait from 0 to 5, where 5 means very outgoing. |
| agreeableness | Rating of personality trait from 0 to 5, where 5 means very cooperative. |
| conscientiousness | Rating of personality trait from 0 to 5, where 5 means very organized. |
| emotional stability | Rating of personality trait from 0 to 5, where 5 means very calm. |
| openness | Rating of personality trait from 0 to 5, where 5 means very curious. |
| scarcity | Rating of how effective the scarcity persuasion principle is in decision-making, from 0 to 5, where 5 means very effective. |
| consistency | Rating of how effective the consistency persuasion principle is in decision-making, from 0 to 5, where 5 means very effective. |
| social proof | Rating of how effective the social proof persuasion principle is in decision-making, from 0 to 5, where 5 means very effective. |
| gratitude | Rating of how effective the gratitude persuasion principle is in decision-making, from 0 to 5, where 5 means very effective. |
| authority | Rating of how effective the authority persuasion principle is in decision-making, from 0 to 5, where 5 means very effective. |
| education job interaction | Engineered feature computed as educationLevel id × jobField id. |
