
Table 5: Tweets obtained through SBERT showing how tweets containing a description of a CVE were merged into a cluster.
Tweet
CVE-2022-30161 : #Windows Lightweight Directory Access Protocol LDAP Remote Code Execution Vulnerability. This CVE ID is unique from CVE-2022-30139.... https://t.co/pQY3uvtcJH
Attackers could exploit a now-patched spoofing vulnerability (CVE-2022-35829 aka FabriXss) in Service Fabric... https://t.co/LoyRYEmnXZ https://t.co/YTUo4gssFH
4,490,000 article-text corpus and one of 886,000 full arXiv papers. The filtering applied in this work ensures consistent data in which every text mentions a CVE. However, the model would also need to be trained on texts that are more general but still related to the vulnerability domain. This improvement would likely yield a broader set of results.
In addition, the creation of the clusters using the K-means model should be explored in more depth, in particular the choice of the model's initialization parameters. One option is to select the initial centroids by sampling from an empirical probability distribution based on each point's contribution to the overall inertia (the k-means++ strategy), rather than choosing the initial centroids uniformly at random from the data.
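As an illustration of the inertia-weighted seeding mentioned above, the following is a minimal sketch and not the implementation used in this work. The function names `squared_distance` and `kmeans_pp_init`, the fixed `seed`, and the toy 2-D points (which stand in for tweet embeddings) are all assumptions made for the example.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans_pp_init(points, k, seed=0):
    """Pick k initial centroids with k-means++ style seeding (hypothetical helper)."""
    rng = random.Random(seed)
    # The first centroid is drawn uniformly at random from the data.
    centroids = [rng.choice(points)]
    while len(centroids) < k:
        # Weight each point by its squared distance to the nearest centroid
        # chosen so far, i.e. its current contribution to the inertia.
        weights = [min(squared_distance(p, c) for c in centroids) for p in points]
        # Sample the next centroid proportionally to those weights;
        # already chosen points have weight 0 and cannot be drawn again.
        centroids.append(rng.choices(points, weights=weights, k=1)[0])
    return centroids


# Toy data: three well-separated groups standing in for tweet embeddings.
points = [
    (0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
    (10.0, 10.0), (10.0, 11.0), (11.0, 10.0),
    (20.0, 0.0), (20.0, 1.0), (21.0, 0.0),
]
centroids = kmeans_pp_init(points, k=3)
print(centroids)
```

Points far from every centroid chosen so far are more likely to be picked next, so the seeds tend to spread across the data. In scikit-learn, the same choice is exposed through the `init` parameter of `KMeans` (`"k-means++"` versus `"random"`).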
Also, since in this specific case the number of clusters is not known a priori, hierarchical clustering could be considered. An agglomerative algorithm of this type returns a dendrogram as the result of the analysis: it starts with each data point as a separate cluster and then repeatedly joins the closest pair of clusters until all data points belong to a single cluster. The dendrogram can then be cut at a suitable level, which allows an appropriate number of clusters to be chosen.
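The bottom-up merging described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure evaluated in this work: single linkage is used as the cluster distance, the number of clusters is suggested by the largest gap between consecutive merge distances, and the names `agglomerate` and `suggest_k` as well as the toy 2-D points are hypothetical.

```python
import math


def agglomerate(points):
    """Single-linkage agglomerative clustering; returns the merge history."""
    # Start with every point in its own cluster, keyed by a cluster id.
    clusters = {i: [i] for i in range(len(points))}
    merges = []  # one (cluster_id_a, cluster_id_b, distance) entry per merge
    next_id = len(points)
    while len(clusters) > 1:
        best = None
        ids = list(clusters)
        for ai in range(len(ids)):
            for bi in range(ai + 1, len(ids)):
                a, b = ids[ai], ids[bi]
                # Single linkage: distance between the closest members.
                d = min(
                    math.dist(points[i], points[j])
                    for i in clusters[a]
                    for j in clusters[b]
                )
                if best is None or d < best[2]:
                    best = (a, b, d)
        a, b, d = best
        # Join the closest pair into a new cluster and record the merge.
        clusters[next_id] = clusters.pop(a) + clusters.pop(b)
        merges.append((a, b, d))
        next_id += 1
    return merges


def suggest_k(merges):
    """Suggest a cluster count by cutting at the largest merge-distance gap."""
    dists = [d for _, _, d in merges]
    gaps = [dists[i + 1] - dists[i] for i in range(len(dists) - 1)]
    i = max(range(len(gaps)), key=gaps.__getitem__)
    # After the first i + 1 merges, n - (i + 1) clusters remain (n = merges + 1).
    return len(merges) - i


# Toy data: three well-separated groups standing in for tweet embeddings.
points = [
    (0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
    (10.0, 10.0), (10.0, 11.0), (11.0, 10.0),
    (20.0, 0.0), (20.0, 1.0), (21.0, 0.0),
]
merges = agglomerate(points)
k = suggest_k(merges)
print(k)  # 3 for this toy data
```

The merge history plays the role of the dendrogram: each entry records which clusters were joined and at what distance. The largest-gap rule is only one possible heuristic for choosing where to cut; other linkage criteria and cut rules may suit real tweet embeddings better.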
ACKNOWLEDGEMENTS
This work has been partially supported by EU DUCA, EU CyberSecPro, SYNAPSE, PTR 22-24 P2.01 (Cybersecurity) and SERICS (PE00000014) under the MUR National Recovery and Resilience Plan funded by the EU - NextGenerationEU projects.
Cybersecurity-Related Tweet Classification by Explainable Deep Learning