on other NLP tasks, such as Named Entity Recognition
(NER) (Aguilar et al., 2017). Furthermore, the
presented work did not investigate the paths of the files,
which could be pivotal evidence when the file name
is meaningless, such as a name made up of
numbers or random characters (e.g., kf3kfk3985.png).
Also, the metadata of the file, such as its header, size,
and extension, could provide further clues to predict its
class correctly.
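As a rough illustration of how path and metadata cues might be turned into features, the following minimal sketch tokenizes the parent directories and the file stem and records the extension and size. It is not part of the presented work: the function name `path_features`, the tokenization rule, and the feature names are all hypothetical, and reading the file header is left out because it requires access to the file contents.

```python
import re
from pathlib import PurePosixPath
from typing import Optional

# Hypothetical tokenizer: runs of letters or runs of digits.
TOKEN_RE = re.compile(r"[A-Za-z]+|\d+")


def path_features(file_path: str, size_bytes: Optional[int] = None) -> dict:
    """Illustrative path/metadata features (an assumption, not the paper's method)."""
    p = PurePosixPath(file_path)
    # Tokens from every parent directory name, lowercased.
    dir_tokens = [
        tok.lower()
        for part in p.parent.parts
        for tok in TOKEN_RE.findall(part)
    ]
    # Tokens from the file name without its extension.
    stem_tokens = [tok.lower() for tok in TOKEN_RE.findall(p.stem)]
    return {
        "dir_tokens": dir_tokens,
        "stem_tokens": stem_tokens,
        "extension": p.suffix.lower().lstrip("."),
        "size_bytes": size_bytes,  # supplied by the caller, e.g. from os.stat
    }


features = path_features("/home/user/Holiday_Photos/kf3kfk3985.png", size_bytes=2048)
```

In this example the file stem alone yields only uninformative tokens, whereas the directory tokens and the extension still carry usable signal, which is the situation described above.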
The assessment of transformer-based models,
such as BERT (Luo et al., 2018), RoBERTa (Liu et al.,
2019), and XLNet (Yang et al., 2019), for text classification
is part of our immediate future research, as they
have shown promising results on various NLP tasks.
ACKNOWLEDGEMENTS
This research has been funded with support from the
European Commission under the 4NSEEK project
with Grant Agreement 821966. This publication
reflects the views only of the author, and the European
Commission cannot be held responsible for any
use which may be made of the information contained
therein.
REFERENCES
Aguilar, G., Maharjan, S., Monroy, A. P. L., and Solorio, T.
(2017). A multi-task approach for named entity recognition
in social media data. In Proceedings of the 3rd
Workshop on Noisy User-generated Text, pages 148–153.
Aizawa, A. (2003). An information-theoretic perspective of
tf–idf measures. Information Processing & Management,
39(1):45–65.
Al-Nabki, M. W., Fidalgo, E., Alegre, E., and de Paz, I.
(2017). Classifying illegal activities on tor network
based on web textual contents. In Proceedings of
the 15th Conference of the European Chapter of the
Association for Computational Linguistics, volume 1,
pages 35–43.
Al-Nabki, M. W., Fidalgo, E., Alegre, E., and Fernández-Robles,
L. (2019). Torank: Identifying the most influential
suspicious domains in the tor network. Expert
Systems with Applications, 123:212–226.
Alsmadi, I. and Hoon, G. K. (2018). Term weighting
scheme for short-text classification: Twitter corpuses.
Neural Computing and Applications, pages 1–13.
Beaufort, R., Roekhaut, S., Cougnon, L.-A., and Fairon, C.
(2010). A hybrid rule/model-based finite-state framework
for normalizing sms messages. In Proceedings
of the 48th Annual Meeting of the Association for
Computational Linguistics, pages 770–779. Association
for Computational Linguistics.
Breiman, L. (2001). Random forests. Machine Learning,
45(1):5–32.
Chaves, D., Fidalgo, E., Alegre, E., and Blanco, P. (2019).
Improving speed-accuracy trade-off in face detectors
for forensic tools by image resizing. In V Jornadas
Nacionales de Investigación en Ciberseguridad
(JNIC), pages 1–2.
Chen, H., Mckeever, S., and Delany, S. J. (2017). Harnessing
the power of text mining for the detection of abusive
content in social media. In Advances in Computational
Intelligence Systems, pages 187–205. Springer.
Chen, J., Yan, S., and Wong, K.-C. (2018). Verbal aggression
detection on twitter comments: convolutional
neural network for short-text sentiment analysis. Neural
Computing and Applications.
Chouchoulas, A. and Shen, Q. (1999). A rough set-based
approach to text classification. In International Workshop
on Rough Sets, Fuzzy Sets, Data Mining, and
Granular-Soft Computing, pages 118–127. Springer.
Europol (2019a). Child sexual exploitation.
https://www.europol.europa.eu/crime-areas-and-trends/crime-areas/child-sexual-exploitation.
Accessed: 2019-11-08.
Europol (2019b). EU policy cycle - EMPACT.
https://www.europol.europa.eu/crime-areas-and-trends/eu-policy-cycle-empact.
Accessed: 2019-11-08.
Fidalgo, E., Alegre, E., González-Castro, V., and
Fernández-Robles, L. (2018). Boosting image classification
through semantic attention filtering strategies.
Pattern Recognition Letters, 112:176–183.
Fidalgo Fernández, E., Alegre Gutiérrez, E., Fernández
Robles, L., and González Castro, V. (2019). Fusión
temprana de descriptores extraídos de mapas de
prominencia multi-nivel para clasificar imágenes.
Revista Iberoamericana de Automática e Informática
industrial, 0(0).
Gangwar, A., Fidalgo, E., Alegre, E., and González-Castro,
V. (2017). Pornography and child sexual abuse detection
in image and video: A comparative evaluation. In
8th International Conference on Imaging for Crime
Detection and Prevention (ICDP), pages 37–42.
García-Olalla, O., Alegre, E., Fernández-Robles, L.,
Fidalgo, E., and Saikia, S. (2018). Textile retrieval based
on image content from CDC and webcam cameras in
indoor environments. Sensors (Switzerland), 18(5).
Genkin, A., Lewis, D. D., and Madigan, D. (2007). Large-scale
bayesian logistic regression for text categorization.
Technometrics, 49(3):291–304.
Hotho, A., Nürnberger, A., and Paaß, G. (2005). A brief
survey of text mining. In Ldv Forum, pages 19–62.
Citeseer.
Imran, M., Mitra, P., and Srivastava, J. (2016). Cross-language
domain adaptation for classifying crisis-related
short messages. In ISCRAM 2016 Conference
Proceedings - 13th International Conference on Information
Systems for Crisis Response and Management.
Information Systems for Crisis Response and Management,
ISCRAM.
Joachims, T. (1998). Text categorization with support vector
machines: Learning with many relevant features.
In Nédellec, C. and Rouveirol, C., editors, Machine
Learning: ECML-98, pages 137–142, Berlin, Heidelberg.
Springer Berlin Heidelberg.