Learning will continue to play a vital role in the
future of malicious URL filtering. In this work,
descriptive features derived from benign and malicious
domain name datasets were used to predict whether a
domain is benign or malicious, using the Random Forests
and SVM algorithms. Using only descriptive features,
without considering host-based features, the final
precision and recall rates were up to 85% and 87% for
Random Forests and up to 90% and 88% for
SVM respectively. These results help clarify the
operational relationship between features and thus
contribute knowledge to the correct grouping of
features and the creation of multi-model classifiers.
The features were found to have significantly different
impact factors from those reported in some other cases
in the literature, which underlines the importance of
taking great care in the selection of training datasets.
After the model was finalised and fine-tuned, it was
published in the Splunk Search and Reporting app, where
it was used against new data to generate alerts. This
was the final step towards the automation of the
detection process. Scheduled training was also configured
using Splunk, furthering the system’s autonomy.
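For readers less familiar with the two metrics reported above, the following is a minimal sketch of how precision and recall are computed from true and predicted labels. It is an illustration only and not the evaluation code used in this work; the function name `precision_recall` and the sample labels are hypothetical, with 1 assumed to denote a malicious domain and 0 a benign one.

```python
from typing import Sequence, Tuple


def precision_recall(
    y_true: Sequence[int], y_pred: Sequence[int], positive: int = 1
) -> Tuple[float, float]:
    """Return (precision, recall) for the class labelled `positive`."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")

    pairs = list(zip(y_true, y_pred))
    # True positives: malicious domains correctly flagged.
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    # False positives: benign domains wrongly flagged.
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    # False negatives: malicious domains that were missed.
    fn = sum(1 for t, p in pairs if t == positive and p != positive)

    # Guard against division by zero when a class is never predicted/present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Hypothetical labels (assumption): 1 = malicious domain, 0 = benign domain.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
precision, recall = precision_recall(y_true, y_pred)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Precision penalises benign domains that are flagged as malicious, while recall penalises malicious domains that go undetected, which is why both are reported for each classifier.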
There are several research pathways which can be
undertaken to improve the performance of this system,
some parallel to and some stemming from the existing
literature. A valuable addition to our methodology would
be the use of a dataset composed of real-world passive
DNS data for the training phase, which would allow for
the generation of more features, thus leading towards
the elimination of noise. As we have a passive DNS
infrastructure under development, we plan in the near
future to use a higher volume of real-world data as
training datasets, which we expect to further improve
our model.
Phishing URL Detection Through Top-level Domain Analysis: A Descriptive Approach