File Name Classiﬁcation Approach to Identify Child Sexual Abuse

Mhd Wesam Al-Nabki 1,2 a, Eduardo Fidalgo 1,2 b, Enrique Alegre 1,2 c and Rocío Aláiz-Rodríguez 1,2 d

1 Department of Electrical, Systems and Automation, Universidad de León, Spain

2 Researcher at INCIBE (Spanish National Cybersecurity Institute), León, Spain

Keywords:

Short Text Classiﬁcation, File Name Classiﬁcation, Active Learning, Character-level Convolutional Networks,

Child Sexual Abuse.

Abstract:

When Law Enforcement Agencies seize a computer from a potential producer or consumer of Child Sexual Exploitation Material (CSEM), they need accurate and time-efficient tools to analyze its files. However, classifying and detecting CSEM by manual inspection is a highly time-consuming task, and most of the time it is unfeasible within the amount of time available to Spanish police acting under a search warrant. An option for identifying CSEM is to analyze the names of the files stored on the suspect's hard disk, looking in the text for patterns related to CSEM. However, due to the particularities of these file names, mainly their length and the use of obfuscated words, current file name classification methods suffer from a low recall rate, and recall is essential in the context of this problem. This paper presents our ongoing research to identify CSEM through file names. We evaluate two approaches to short text classification: a proposal based on machine learning classifiers, exploring the use of Logistic Regression and Support Vector Machine, and a deep learning approach that adapts two popular character-level Convolutional Neural Network (CNN) models. The presented CNN achieved an average class recall of 0.86 and a recall rate of 0.78 for the CSEM class. The CNN-based classifier could be integrated into forensic tools and services that support Law Enforcement Agencies in identifying CSEM without the need to systematically access the visual content of every file.

1 INTRODUCTION

Child Sexual Exploitation Material (CSEM) is de-

ﬁned as sexual abuse of a person under 18 years old,

together with producing images or videos of the abuse

and sharing the content online (Europol, 2019a). One

of the features of some Darknet networks, such as The Onion Router (Tor, https://www.torproject.org/) or FreeNet (https://freenetproject.org/) (Al-Nabki et al., 2017; Al-Nabki et al., 2019), and of Peer to Peer (P2P) networks, like eDonkey (Panchenko et al., 2012; Peersman et al., 2016), is their capability of preserving a high level of privacy and anonymity for their users. This characteristic allows pedophiles to easily share CSEM away from the monitoring of Law Enforcement Agencies (LEAs). Therefore, in 2017, the Council of the European Union (EU) prioritized cybercrimes related to Child Sexual Abuse (CSA) as the most serious crime between the years 2018 and 2021 (Europol, 2019b).

a https://orcid.org/0000-0002-3975-3478

b https://orcid.org/0000-0003-1202-5232

c https://orcid.org/0000-0003-2081-774X

d https://orcid.org/0000-0003-4164-5887

CSEM producers and consumers might save this content on their local computers, at least temporarily. When an LEA enters a home to inspect a suspect's computer, a police agent reviews the files contained in the investigated hard drive, trying to determine whether or not the suspect has stored CSEM on the computer (Gangwar et al., 2017). This process needs to be accomplished in a limited time and as accurately as possible (Chaves et al., 2019).

In this paper, we present our ongoing research

based on Natural Language Processing to identify

CSEM. More speciﬁcally, we are designing a text

classiﬁer to decide whether a given ﬁle is CSEM

or not, according to its name. The same process,

based on the classiﬁcation of the content of the ﬁles,

whether images or videos, is out of the scope of this

paper and research.

Al-Nabki, M., Fidalgo, E., Alegre, E. and Aláiz-Rodríguez, R.

File Name Classification Approach to Identify Child Sexual Abuse.

DOI: 10.5220/0009154802280234

In Proceedings of the 9th International Conference on Pattern Recognition Applications and Methods (ICPRAM 2020), pages 228-234

ISBN: 978-989-758-397-1; ISSN: 2184-4313

Building a supervised text classifier for CSEM identification is a challenging task for several reasons. First, a binary supervised algorithm requires training samples of both Non-CSEM and CSEM files. However, there are no publicly available datasets of the latter class, and crawling samples from a P2P network or the Darknet is illegal (García-Olalla et al., 2018). Therefore, only CSEM file names obtained legally, i.e., provided by LEAs, could be used.

Second, a file name is typically a short text, which leads to a sparse representation of the samples: the feature space is massive, while each instance is represented by only a few of its features. Finally, CSEM producers and consumers tend to invent a personalized, obfuscated style of file naming, with their own vocabulary, abbreviations, and acronyms, to circumvent detection tools. For example, in a sample named “!!!!yoB0yXX”, the exclamation marks refer to the age of a boy, and the letter O is replaced by the number zero. Hence, most likely, this file is related to the abuse of a four-year-old boy.

In this paper, we present an approach to address the aforementioned challenges by designing a supervised File Name Classification (FNC) system. We explore two different text classification approaches: 1) a conventional machine learning (ML) pipeline and 2) a deep learning (DL) approach using a Convolutional Neural Network (CNN). We evaluate the classifiers on a custom dataset of 65,351 samples.

The rest of the paper is organized as follows: Sec-

tion 2 presents the related work. After that, Section 3

describes the designed classiﬁcation pipelines. Next,

Section 4 explains how the dataset is created and what

are its main features. Then, in Section 5, we describe

the experimentation performed and discuss the results

obtained. Finally, Section 6 presents the conclusions

by pointing to our ongoing and future research.

2 RELATED WORK

The problem of short text classiﬁcation is widely stud-

ied under different research lines (Sriram et al., 2010;

Sun, 2012; Shang et al., 2013; Rana et al., 2014; Lee

and Dernoncourt, 2016; Alsmadi and Hoon, 2018). In

the following, we explore two text classiﬁcation prob-

lems that are close to File Name Classiﬁcation (FNC).

First, the news headline classification task attempts to group news articles based on their titles, where a title is made up of only a few words. Rana et al. (2014) proposed a pipeline for news headline classification that consists of three stages: data pre-processing, text representation, and classification. In the data pre-processing step, the text was tokenized into words, special characters were replaced with spaces, stop words were removed, and the text was stemmed. For the

text representation, the authors used Term Frequency-

Inverse Document Frequency (TF-IDF), Information

Gain (IG) (Shang et al., 2013), and Boolean Weight

(BW) (Chouchoulas and Shen, 1999). Finally, in

the classiﬁcation stage, Rana et al. explored Naive

Bayes (NB) (Kim et al., 2006), Support Vector Ma-

chine (SVM) (Joachims, 1998), K-Nearest Neighbor

(KNN) (Hotho et al., 2005), and Decision Trees (DT)

(Safavian and Landgrebe, 1991). However, the core

difference between our problem and news headlines

classiﬁcation is that the latter has high-quality input

text, where the punctuation marks are maintained cor-

rectly, and there are no misspelled words.

The second research line is the classification of tweets from Twitter. According to Perez (2019), the most common length of a tweet is 33 characters, while the maximum number of characters is 280. Despite the difference in text length, tweet classification also deals with low-quality input text, as our work does: both tweets and file names might use abbreviations to save space or contain misspelled words. Imran et al. (2016) pre-processed the tweets by removing hyperlinks, mentions, and stop words, and then used N-grams and IG for feature extraction together with a Random Forest classifier (Breiman, 2001). Alsmadi and Hoon (2018) addressed short noisy text classification on Twitter. They proposed a supervised word weighting scheme to highlight essential terms in short noisy text, along with an SVM classifier.

Chen et al. (2018) proposed a framework to identify cyberbullying on Twitter. For text representation, they compared pre-trained language models, like Word2Vec (Mikolov et al., 2013) and GloVe (Pennington et al., 2014), with traditional text encoding techniques, such as TF-IDF, and they observed a decline in performance when embedding-based representations were used. For classification, they compared traditional ML classifiers like Logistic Regression (LR)

and SVM with DL classiﬁers, like Long Short-Term

Memory (LSTM) (Lee and Dernoncourt, 2016) and

Convolutional Neural Network (CNN) (Zhang et al.,

2015).

Few researchers have employed file name classification to recognize child sexual abuse activities. Panchenko et al. (2012) attempted to normalize file names using the short message service (SMS) normalization techniques proposed by Beaufort et al. (2010). On top of the normalized text, they trained an SVM classifier and obtained an accuracy of 96.97% on their dataset. Peersman et al. (2014) proposed a framework called iCOP to detect CSEM activities on P2P networks. The first stage of their classification pipeline was a manually constructed dictionary-based filter that held CSEM keywords. They also used character n-grams of sizes two, three, and four to capture more features of the file name. These features were used to train a binary SVM classifier. Afterward, in their more recent work (Peersman et al., 2016), they used the same representation but benchmarked more classifiers, like SVM and NB. They evaluated their proposal on a custom dataset and observed that the SVM classifier could identify CSEM file names with a recall rate of 0.43.

3 METHODOLOGY

In this section, we present the methodology used to

build the proposed ﬁle name classiﬁer (FNC). In the

following, we explore two distinct approaches based

on machine learning and deep learning.

The pre-processing step is common to both approaches: special characters and numbers are replaced by # and $, respectively, which reduces the sparsity of the features. For example, the input text “!!!!yoB0yXX” is encoded as “####yoB$yXX”. Another benefit of this representation is the removal of duplicated instances whose names differ only in their digits or special characters. For example, a folder in a seized computer could have more than 100 images named IMG01.png, IMG02.png, ... IMG100.png, whereas these names repeat the same information, e.g., IMG01.png through IMG99.png are all encoded as IMG$$.png.
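The encoding described above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the function names are hypothetical, and the exact set of "special characters" is an assumption (here, anything that is neither a letter, a digit, nor a dot, since the example above keeps the extension separator).

```python
def encode_file_name(name: str) -> str:
    """Replace every digit with '$' and every special character with '#'.

    Assumption: letters and the dot are kept unchanged; everything else
    that is not a digit counts as a special character.
    """
    encoded = []
    for ch in name:
        if ch.isdigit():
            encoded.append("$")
        elif ch.isalpha() or ch == ".":
            encoded.append(ch)
        else:
            encoded.append("#")
    return "".join(encoded)


def deduplicate(names):
    """Encode all names and keep the first occurrence of each encoded pattern."""
    seen = set()
    unique = []
    for name in names:
        pattern = encode_file_name(name)
        if pattern not in seen:
            seen.add(pattern)
            unique.append(pattern)
    return unique


print(encode_file_name("!!!!yoB0yXX"))                         # ####yoB$yXX
print(deduplicate(["IMG01.png", "IMG02.png", "IMG100.png"]))   # ['IMG$$.png', 'IMG$$$.png']
```

Under these assumptions, names that differ only in their digits collapse to a single pattern as long as they have the same number of digits.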

3.1 Machine Learning Approach

The typical design of a supervised text classiﬁer has

two main components: text representation to encode

the input samples into feature vectors and a classiﬁer

that separates these feature vectors into CSEM and

Non-CSEM examples.

A crucial step in a text classiﬁcation pipeline is

finding an adequate representation of the text. Tokenizing a file name at the word level may not be the most effective solution because the words of a file name might be joined into a single token or separated by a special character. Therefore, we tokenize the text at the character

level, following the work of other researchers (Peers-

man et al., 2014; Peersman et al., 2016).

The N-gram technique extracts all the patterns of

two to ﬁve consecutive characters of a given ﬁle name,

which form a set of tokens. In this work, we com-

bine the widely used Term Frequency-Inverse Docu-

ment Frequency (TF-IDF) technique (Aizawa, 2003)

with N-gram. TF-IDF is a statistical model that as-

signs weights to ﬁle name tokens. It accentuates those

whose frequency is higher in a given ﬁle name and, at

the same time, de-emphasizes tokens that frequently

occur in many ﬁles. This way, it overcomes the issue

of misspelled words or personalized naming style in

ﬁle names. Table 1 shows an example of two to ﬁve

grams of a ﬁle name “####yoB$yXX”. Furthermore,

to discard noisy tokens, we set thresholds for the min-

imum and the maximum term frequency.

Table 1: Example of preprocessing and tokenizing a file name with two, three, four, and five grams.

Original file name:  !!!!yoB0yXX
Preprocessing:       ####yoB$yXX
2-grams:             ##, ##, ##, #y, yo, oB, B$, $y, yX, XX
3-grams:             ###, ###, ##y, #yo, yoB, oB$, B$y, $yX, yXX
4-grams:             ####, ###y, ##yo, #yoB, yoB$, oB$y, B$yX, $yXX
5-grams:             ####y, ###yo, ##yoB, #yoB$, yoB$y, oB$yX, B$yXX
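The tokenization shown in Table 1 can be reproduced with a short sliding-window sketch. The function name `char_ngrams` is hypothetical and chosen for illustration only.

```python
def char_ngrams(text: str, n_min: int = 2, n_max: int = 5):
    """Return every run of n consecutive characters, for n_min <= n <= n_max."""
    grams = []
    for n in range(n_min, n_max + 1):
        # Slide a window of width n over the text, one character at a time.
        for start in range(len(text) - n + 1):
            grams.append(text[start:start + n])
    return grams


tokens = char_ngrams("####yoB$yXX")
two_grams = [t for t in tokens if len(t) == 2]
print(two_grams)    # ['##', '##', '##', '#y', 'yo', 'oB', 'B$', '$y', 'yX', 'XX']
print(len(tokens))  # 34 (10 + 9 + 8 + 7 grams)
```

Repeated grams such as `##` are kept on purpose, because their count is the term frequency that TF-IDF later uses.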

For this work, we select two commonly used super-

vised classiﬁers that have shown good performance

on text classiﬁcation tasks (Peersman et al., 2016; Al-

Nabki et al., 2017). They are Support Vector Machine

(SVM) (Schohn and Cohn, 2000) and Logistic Re-

gression (LR) (Genkin et al., 2007).

3.2 Deep Learning Approach based on

Convolutional Neural Networks

Convolutional Neural Networks (CNN) have been

widely and successfully used for image classiﬁcation

(Fidalgo et al., 2018; Fidalgo Fernández et al., 2019).

Besides the machine learning approaches mentioned

above, we employed CNN to classify ﬁle names. We

adapted the models of (Zhang et al., 2015) and (Kim

et al., 2016), as they showed promising performance

on Natural Language Processing (NLP) tasks.

The model of Kim et al. was intended for the language modeling task: it uses only character-level inputs, but the prediction is performed at the word level. We adapt it to a text classification problem by replacing its recurrent layers with a dense layer that performs softmax over the classes. Similarly, the model of Zhang et al. operates only on characters and does not require knowledge of the syntactic or semantic structure of the addressed language. Unlike Kim’s model, this one is dedicated to text classification tasks.

4 DATASET CONSTRUCTION

Building a supervised file name classifier requires collecting samples of both classes, CSEM and Non-CSEM. For the Non-CSEM class, we rely on a dataset published by the National Software Reference Library (NSRL). We were able to access more than 32 million file names, but using all of them would make our problem very imbalanced and skewed towards the Non-CSEM class. Therefore, we selected an initial subset of 64,000 Non-CSEM examples, which resulted in 56,021 unique instances after applying the pre-processing step.

Regarding the CSEM class, we were able to collect these examples thanks to the collaboration between the Spanish National Cybersecurity Institute (INCIBE) and Spanish LEAs. The latter provided us with a list compiled from dumps of hard disks that had been seized from criminals’ computers. The list had 64,857 CSEM samples. However, after pre-processing them, the number decreased to 9,330 unique instances.

Table 2: Description of the dataset used to train the file name classifier.

File Name Class         CSEM      Non-CSEM    Total
Before preprocessing    64,857    64,000      128,857
After preprocessing     9,330     56,021      65,351

5 EMPIRICAL EVALUATION

5.1 Experimental Setting

The experiments were conducted on a PC with an Intel(R) Core(TM) i7 processor and 32 GB of RAM running Windows 10. We used Python 3 with the Keras framework for the CNN implementation and Scikit-Learn for the machine learning classifiers.

Regarding the configuration of the ML approach, we used character n-grams, extracting patterns from two to five grams. Also, we set the thresholds for the minimum and the maximum gram proportion to 0.001 and 0.995, respectively. For LR, we empirically set the parameter C, which is the inverse of the regularization strength, to 10. Concerning the SVM classifier, we used a linear kernel and left all the other parameters at their default values, as set by the Scikit-Learn library.
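How such thresholds act on the n-gram vocabulary can be illustrated with a small pure-Python sketch. It rests on several assumptions that the text does not confirm: the "gram proportion" is taken to be the fraction of file names that contain a given n-gram, the weighting uses one common TF-IDF variant (term count times ln(N / df)), and all helper names are hypothetical. Since the lower threshold must not exceed the upper one for any n-gram to survive, the sketch uses 0.001 as the default minimum and 0.995 as the default maximum. In Scikit-Learn, a comparable setup is usually expressed through `TfidfVectorizer` with its `analyzer`, `ngram_range`, `min_df`, and `max_df` parameters, whose exact weighting differs from this simplified formula.

```python
import math
from collections import Counter


def char_ngrams(text, n_min=2, n_max=5):
    """All character n-grams of text for n_min <= n <= n_max."""
    return [text[i:i + n]
            for n in range(n_min, n_max + 1)
            for i in range(len(text) - n + 1)]


def build_vocabulary(names, min_prop=0.001, max_prop=0.995):
    """Map each kept n-gram to its document frequency.

    An n-gram is kept when the proportion of names containing it
    lies within [min_prop, max_prop].
    """
    doc_freq = Counter()
    for name in names:
        doc_freq.update(set(char_ngrams(name)))  # count each name once per gram
    total = len(names)
    return {g: df for g, df in doc_freq.items()
            if min_prop <= df / total <= max_prop}


def tfidf(name, vocabulary, total_docs):
    """TF-IDF weight of every in-vocabulary n-gram of one file name."""
    counts = Counter(g for g in char_ngrams(name) if g in vocabulary)
    return {g: tf * math.log(total_docs / vocabulary[g])
            for g, tf in counts.items()}


# Toy corpus of already pre-processed names; thresholds widened for four samples.
names = ["IMG$$.png", "IMG$$$.png", "####yoB$yXX", "holiday.png"]
vocab = build_vocabulary(names, min_prop=0.5, max_prop=0.9)
weights = tfidf("IMG$$.png", vocab, len(names))
print(sorted(weights.items(), key=lambda kv: -kv[1])[:3])
```

In this toy corpus, grams that occur in a single name (such as `yo`) fall below the minimum proportion and are discarded, while grams shared by several names are kept and weighted.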

NSRL: https://www.nist.gov/software-quality-group/about-nsrl/nsrl-introduction

INCIBE: in Spanish, it stands for the Instituto Nacional de Ciberseguridad de España.

Keras: https://github.com/fchollet/keras

Scikit-Learn: https://scikit-learn.org/stable/

Concerning the neural network model of Zhang et al. (2015), we used the same model hyperparameters as described in their paper, except for the input feature length. We set it to 128 instead of 1014 because the longest file name we had was 125 characters. Likewise, for the CNN model of Kim et al. (2016), we kept the hyperparameters described in their paper. The only change was made to the output stage, where we replaced the recurrent layers with a dense layer with a softmax function.
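The fixed input length can be illustrated with a small character-quantization sketch in the spirit of character-level CNNs: each file name becomes a sequence of alphabet indices that is truncated or zero-padded to 128 positions and then one-hot encoded. The alphabet below and the use of index 0 for padding and unknown characters are assumptions made for illustration; they are not taken from the paper.

```python
import string

MAX_LEN = 128  # input feature length chosen for file names

# Assumed alphabet: ASCII letters plus the '#', '$' and '.' symbols.
ALPHABET = string.ascii_lowercase + string.ascii_uppercase + "#$."
# Index 0 is reserved for padding and for characters outside the alphabet.
CHAR_TO_INDEX = {ch: i + 1 for i, ch in enumerate(ALPHABET)}


def quantize(name: str, max_len: int = MAX_LEN):
    """Map a name to exactly max_len alphabet indices (truncate or zero-pad)."""
    indices = [CHAR_TO_INDEX.get(ch, 0) for ch in name[:max_len]]
    return indices + [0] * (max_len - len(indices))


def one_hot(indices, alphabet_size: int = len(ALPHABET)):
    """One row per position; padding/unknown positions stay all-zero."""
    rows = []
    for idx in indices:
        row = [0] * alphabet_size
        if idx > 0:
            row[idx - 1] = 1
        rows.append(row)
    return rows


encoded = quantize("####yoB$yXX")
matrix = one_hot(encoded)
print(len(matrix), len(matrix[0]))  # 128 55
```

Padding to a length slightly above the longest observed name keeps every sample the same shape without discarding any characters.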

In order to estimate the model performance, we

used 70% of the samples to train the model, and 30%

to test the models. The validation set forms 10% of

the training set. Table 3 describes the dataset used to

train and test the model.

Table 3: Description of the dataset used to train, validate, and test the file name classifier.

Subset               Train     Validation    Test
Number of Samples    36,548    9,137         19,666

To control the number of iterations while training the

neural network, we set an early stopping criterion,

which is triggered if there is no further improvement

on the validation set during 10 epochs of training.
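The stopping rule can be sketched independently of any framework. The sketch assumes that the monitored quantity is the validation loss (the text only says "improvement on the validation set") and that 10 consecutive epochs without improvement trigger the stop; the function name is hypothetical. Keras offers a comparable rule through its `EarlyStopping` callback and its `patience` argument.

```python
def stopping_epoch(val_losses, patience=10):
    """Return the 1-based epoch at which training stops, or None if it never does.

    Training stops once `patience` consecutive epochs pass without the
    validation loss dropping below the best value seen so far.
    """
    best = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return None


# Three improving epochs followed by ten epochs without improvement.
history = [0.9, 0.7, 0.6] + [0.65] * 10
print(stopping_epoch(history))  # 13
```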

5.2 Evaluation Metric

The principal objective of this work is to assist LEAs in the detection of CSEM through file names, avoiding the exposure of an agent to CSEM. In this context, a low number of false negatives (i.e., a CSEM file name identified as Non-CSEM) is more desirable than a low number of false positives (i.e., a Non-CSEM file name wrongly categorized as CSEM). Therefore, we need to pay more attention to the recall of the CSEM class than to that of the Non-CSEM class.

Recall metric for a class is calculated as the total

number of samples correctly classiﬁed for that class

(the True Positives TP), over the total number of sam-

ples of that class (the True Positives TP and the False

Negatives FN). Equation (1) shows how Recall is es-

timated for a given class.

\[ \mathrm{Recall} = \frac{TP}{TP + FN} \tag{1} \]

Nevertheless, the precision of a classiﬁer is also a cru-

cial factor in measuring its performance, as it shows

the proportion of samples that were correctly identi-

ﬁed. Class precision is calculated as a ratio of cor-

rectly classiﬁed ﬁle names of that class (the True Pos-

itives TP) to the total number of predicted positive

samples of that class (the True Positives TP and the

False Positives FP), and it is given in Equation (2).

\[ \mathrm{Precision} = \frac{TP}{TP + FP} \tag{2} \]

Finally, the F1 score of a class summarizes the two aforementioned metrics, as it is the harmonic mean of the precision and the recall. It is calculated following Equation (3).

\[ \mathrm{F1} = \frac{2 \cdot (\mathrm{Recall} \cdot \mathrm{Precision})}{\mathrm{Recall} + \mathrm{Precision}} \tag{3} \]

Additionally, it has been shown that the accuracy metric is not reliable when the dataset is imbalanced (Chen et al., 2017), as in our case, where the majority of the samples are Non-CSEM file names. An alternative is the average class accuracy, which refers to the average of the recalls of the CSEM and the Non-CSEM classes, rather than the overall dataset-level accuracy.
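The metrics of this section can be written down directly from Equations (1) to (3). In the sketch below, the confusion counts are hypothetical and chosen only to make the arithmetic easy to follow; they are not the counts of our experiments. The last line applies Equation (3) to the precision and recall reported for the CSEM class of the Zhang et al. model.

```python
def recall(tp, fn):
    """Equation (1): correctly classified samples over all samples of the class."""
    return tp / (tp + fn)


def precision(tp, fp):
    """Equation (2): correctly classified samples over all predicted positives."""
    return tp / (tp + fp)


def f1_score(rec, prec):
    """Equation (3): harmonic mean of recall and precision."""
    return 2 * (rec * prec) / (rec + prec)


def average_class_recall(recalls):
    """Unweighted mean of the per-class recalls."""
    return sum(recalls) / len(recalls)


# Hypothetical counts, with CSEM as the positive class.
tp, fn, fp, tn = 78, 22, 32, 868

csem_recall = recall(tp, fn)        # 78 / 100
non_csem_recall = recall(tn, fp)    # for Non-CSEM, the roles of the counts swap
print(round(csem_recall, 2), round(non_csem_recall, 2))
print(round(average_class_recall([csem_recall, non_csem_recall]), 2))

# Reported CSEM precision 0.71 and recall 0.78 give an F1 of about 0.74.
print(round(f1_score(0.78, 0.71), 2))
```

Because the average class recall weights both classes equally, a classifier cannot score well on it by favoring the majority Non-CSEM class.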

5.3 Empirical Results

In this section, we compare the performance of the four pipelines described in Section 3 on the CSEM dataset. Table 4 shows that both CNN models outperform the machine learning classifiers. Furthermore, by comparing the average class recall of both CNN models, we observe that the model of Zhang et al. obtained an average class recall of 0.86, while the model of Kim et al. scored 0.84. Also, by comparing the performance at the class level, we can see that the model of Zhang et al. has the best recall rate for the CSEM class, 0.78, while the N-gram model with the SVM classifier has the best recall rate for the Non-CSEM class, 0.98. It is noteworthy that the SVM classifier obtained the lowest performance in terms of the F1 score and the average class recall metrics.

Table 4: A comparison between four algorithms to classify file names.

Method Name           Category    Precision    Recall    F1
N-gram LR             Non-CSEM    0.95         0.97      0.96
                      CSEM        0.78         0.65      0.71
                      Average     0.86         0.81      0.84
N-gram SVM            Non-CSEM    0.94         0.98      0.96
                      CSEM        0.81         0.61      0.70
                      Average     0.88         0.79      0.83
Zhang et al. model    Non-CSEM    0.96         0.95      0.96
                      CSEM        0.71         0.78      0.74
                      Average     0.84         0.86      0.85
Kim et al. model      Non-CSEM    0.95         0.97      0.96
                      CSEM        0.78         0.70      0.74
                      Average     0.86         0.84      0.85

In addition to the analyzed measures, the time required to predict the legality of a file name plays a vital role when police forces investigate the hard disk of a suspect. Therefore, we compare the aforementioned classification methods in terms of the time needed to predict the category of one million file names. We carried out this experiment on a consumer-level computer with 16 GB of RAM, an Intel Core i7 processor, and an Nvidia GeForce GTX 1060 GPU. For the CNN-based methods, we set the batch size to 10,000 examples and predicted the file names twice, once using the GPU and once using the CPU, whereas we ran the machine learning methods on the CPU only. Table 5 shows that the logistic regression classifier with character n-grams for text representation achieves the highest prediction speed on the CPU, where the CNN-based models took roughly ten times longer. However, when the GPU is used, the model of Zhang et al. (2015) surpasses the other proposals, with a prediction time of 59 seconds.

Table 5: A comparison between four algorithms in terms of the time (in seconds) required to predict one million file names using a GPU or a CPU.

Method Name           CPU     GPU
N-gram LR             106     -
N-gram SVM            852     -
Zhang et al. model    1022    59
Kim et al. model      1172    62
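To put the timings of Table 5 in perspective, they can be converted into throughput (file names per second) and into the speed-up obtained by moving a CNN from the CPU to the GPU. The sketch below only performs this arithmetic on the reported values; the variable names are illustrative.

```python
N_FILES = 1_000_000  # number of file names predicted in the timing experiment

# Prediction times in seconds, as reported in Table 5.
seconds = {
    ("N-gram LR", "CPU"): 106,
    ("N-gram SVM", "CPU"): 852,
    ("Zhang et al. model", "CPU"): 1022,
    ("Zhang et al. model", "GPU"): 59,
    ("Kim et al. model", "CPU"): 1172,
    ("Kim et al. model", "GPU"): 62,
}

# File names classified per second for every method and device.
throughput = {key: N_FILES / t for key, t in seconds.items()}

fastest = min(seconds, key=seconds.get)
zhang_speedup = seconds[("Zhang et al. model", "CPU")] / seconds[("Zhang et al. model", "GPU")]

print(fastest, round(throughput[fastest]))          # about 16,949 names per second
print(round(throughput[("N-gram LR", "CPU")]))      # about 9,434 names per second
print(round(zhang_speedup, 1))                      # about 17.3 times faster on the GPU
```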

6 CONCLUSIONS AND FUTURE WORK

In this paper, we explored two machine learning classifiers (Support Vector Machine and Logistic Regression) versus two Convolutional Neural Network (CNN) models to classify file names related to Child Sexual Exploitation Material (CSEM). We compared these models, in terms of the average class recall metric, on a dataset of 65,351 samples distributed over two classes (56,021 Non-CSEM and 9,330 CSEM file names).

Our results support the superiority of CNNs over regular machine learning classifiers for categorizing file names. Using the adapted CNN model of Zhang et al., we were able to identify CSEM file names with a recall rate of 0.78 and an average class recall of 0.86. We also showed that this adapted CNN model outperforms the model of Kim et al. (2016) and two supervised machine learning classifiers (Support Vector Machine and Logistic Regression).

The presented results were obtained on a dataset built from file names only. However, there is still further work to be carried out in our ongoing research. In particular, we are planning to include orthographic features extracted from file names, as they showed promising performance on other NLP tasks, such as Named Entity Recognition (NER) (Aguilar et al., 2017). Furthermore, the presented work did not investigate the paths of the files, which could be pivotal evidence when the file name is meaningless, such as a name made up of numbers or random characters (e.g., kf3kfk3985.png). Also, the metadata of the file, such as its header, size, and extension, could provide further clues to predict its class correctly.

The assessment of transformer-based models,

such as BERT (Luo et al., 2018), RoBERTa (Liu et al.,

2019), and XLNet (Yang et al., 2019) for text classiﬁ-

cation is part of our immediate future research, as they

have shown promising results on various NLP tasks.

ACKNOWLEDGEMENTS

This research has been funded with support from the

European Commission under the 4NSEEK project

with Grant Agreement 821966. This publication re-

ﬂects the views only of the author, and the Euro-

pean Commission cannot be held responsible for any

use which may be made of the information contained

therein.

REFERENCES

Aguilar, G., Maharjan, S., Monroy, A. P. L., and Solorio, T.

(2017). A multi-task approach for named entity recog-

nition in social media data. In Proceedings of the 3rd

Workshop on Noisy User-generated Text, pages 148–

153.

Aizawa, A. (2003). An information-theoretic perspective of

tf–idf measures. Information Processing & Manage-

ment, 39(1):45–65.

Al-Nabki, M. W., Fidalgo, E., Alegre, E., and de Paz, I.

(2017). Classifying illegal activities on tor network

based on web textual contents. In Proceedings of

the 15th Conference of the European Chapter of the

Association for Computational Linguistics, volume 1,

pages 35–43.

Al-Nabki, M. W., Fidalgo, E., Alegre, E., and Fernández-Robles, L. (2019). Torank: Identifying the most influential suspicious domains in the tor network. Expert Systems with Applications, 123:212–226.

Alsmadi, I. and Hoon, G. K. (2018). Term weighting

scheme for short-text classiﬁcation: Twitter corpuses.

Neural Computing and Applications, pages 1–13.

Beaufort, R., Roekhaut, S., Cougnon, L.-A., and Fairon, C.

(2010). A hybrid rule/model-based ﬁnite-state frame-

work for normalizing sms messages. In Proceedings

of the 48th Annual Meeting of the Association for

Computational Linguistics, pages 770–779. Associa-

tion for Computational Linguistics.

Breiman, L. (2001). Random forests. Machine Learning,

45(1):5–32.

Chaves, D., Fidalgo, E., Alegre, E., and Blanco, P. (2019). Improving speed-accuracy trade-off in face detectors for forensic tools by image resizing. In V Jornadas Nacionales de Investigación en Ciberseguridad (JNIC), pages 1–2.

Chen, H., Mckeever, S., and Delany, S. J. (2017). Harness-

ing the power of text mining for the detection of abu-

sive content in social media. In Advances in Computa-

tional Intelligence Systems, pages 187–205. Springer.

Chen, J., Yan, S., and Wong, K.-C. (2018). Verbal ag-

gression detection on twitter comments: convolutional

neural network for short-text sentiment analysis. Neu-

ral Computing and Applications.

Chouchoulas, A. and Shen, Q. (1999). A rough set-based

approach to text classiﬁcation. In International Work-

shop on Rough Sets, Fuzzy Sets, Data Mining, and

Granular-Soft Computing, pages 118–127. Springer.

Europol (2019a). Child sexual exploitation.

https://www.europol.europa.eu/crime-areas-and-

trends/crime-areas/child-sexual-exploitation. Ac-

cessed: 2019-11-08.

Europol (2019b). Eu policy cycle - empact.

https://www.europol.europa.eu/crime-areas-and-

trends/eu-policy-cycle-empact. Accessed: 2019-11-

08.

Fidalgo, E., Alegre, E., González-Castro, V., and Fernández-Robles, L. (2018). Boosting image classification through semantic attention filtering strategies. Pattern Recognition Letters, 112:176–183.

Fidalgo Fernández, E., Alegre Gutiérrez, E., Fernández Robles, L., and González Castro, V. (2019). Fusión temprana de descriptores extraídos de mapas de prominencia multi-nivel para clasificar imágenes. Revista Iberoamericana de Automática e Informática industrial, 0(0).

Gangwar, A., Fidalgo, E., Alegre, E., and González-Castro, V. (2017). Pornography and child sexual abuse detection in image and video: A comparative evaluation. In 8th International Conference on Imaging for Crime Detection and Prevention (ICDP), pages 37–42.

García-Olalla, O., Alegre, E., Fernández-Robles, L., Fidalgo, E., and Saikia, S. (2018). Textile retrieval based on image content from CDC and webcam cameras in indoor environments. Sensors (Switzerland), 18(5).

Genkin, A., Lewis, D. D., and Madigan, D. (2007). Large-

scale bayesian logistic regression for text categoriza-

tion. Technometrics, 49(3):291–304.

Hotho, A., Nürnberger, A., and Paaß, G. (2005). A brief survey of text mining. In Ldv Forum, pages 19–62. Citeseer.

Imran, M., Mitra, P., and Srivastava, J. (2016). Cross-

language domain adaptation for classifying crisis-

related short messages. In ISCRAM 2016 Conference

Proceedings - 13th International Conference on Infor-

mation Systems for Crisis Response and Management.

Information Systems for Crisis Response and Man-

agement, ISCRAM.

Joachims, T. (1998). Text categorization with support vector machines: Learning with many relevant features. In Nédellec, C. and Rouveirol, C., editors, Machine Learning: ECML-98, pages 137–142, Berlin, Heidelberg. Springer Berlin Heidelberg.


Kim, S.-B., Han, K.-S., Rim, H.-C., and Myaeng, S. H.

(2006). Some effective techniques for naive bayes text

classiﬁcation. IEEE transactions on knowledge and

data engineering, 18(11):1457–1466.

Kim, Y., Jernite, Y., Sontag, D., and Rush, A. M. (2016).

Character-aware neural language models. In Thirtieth

AAAI Conference on Artiﬁcial Intelligence.

Lee, J. Y. and Dernoncourt, F. (2016). Sequential short-text

classiﬁcation with recurrent and convolutional neural

networks. In Proceedings of the 2016 Conference of

the North American Chapter of the Association for

Computational Linguistics: Human Language Tech-

nologies, pages 515–520, San Diego, California. As-

sociation for Computational Linguistics.

Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D.,

Levy, O., Lewis, M., Zettlemoyer, L., and Stoyanov,

V. (2019). Roberta: A robustly optimized bert pre-

training approach. arXiv preprint arXiv:1907.11692.

Luo, J., Zhou, W., and Du, Y. (2018). An active learn-

ing based on uncertainty and density method for pos-

itive and unlabeled data. In Vaidya, J. and Li, J., ed-

itors, Algorithms and Architectures for Parallel Pro-

cessing, pages 229–241, Cham. Springer International

Publishing.

Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., and

Dean, J. (2013). Distributed representations of words

and phrases and their compositionality. In Advances in

neural information processing systems, pages 3111–

3119.

Panchenko, A., Beaufort, R., and Fairon, C. (2012). De-

tection of child sexual abuse media on p2p networks:

Normalization and classiﬁcation of associated ﬁle-

names. In Proceedings of the LREC Workshop on

Language Resources for Public Security Applications,

pages 27–31.

Peersman, C., Schulze, C., Rashid, A., Brennan, M., and

Fischer, C. (2014). icop: Automatically identifying

new child abuse media in p2p networks. In 2014

IEEE Security and Privacy Workshops, pages 124–

131. IEEE.

Peersman, C., Schulze, C., Rashid, A., Brennan, M., and

Fischer, C. (2016). icop: Live forensics to reveal

previously unknown criminal media on p2p networks.

Digital Investigation, 18:50–64.

Pennington, J., Socher, R., and Manning, C. (2014). Glove:

Global vectors for word representation. In Proceed-

ings of the 2014 conference on empirical methods in

natural language processing (EMNLP), pages 1532–

1543.

Perez, S. (2019). Twitter’s doubling of character count from

140 to 280 had little impact on length of tweets –

techcrunch.

Rana, M. I., Khalid, S., and Akbar, M. U. (2014). News

classiﬁcation based on their headlines: A review.

In 17th IEEE International Multi Topic Conference

2014, pages 211–216. IEEE.

Safavian, S. R. and Landgrebe, D. (1991). A survey of de-

cision tree classiﬁer methodology. IEEE transactions

on systems, man, and cybernetics, 21(3):660–674.

Schohn, G. and Cohn, D. (2000). Less is more: Active

learning with support vector machines. In ICML,

page 6. Citeseer.

Shang, C., Li, M., Feng, S., Jiang, Q., and Fan, J. (2013).

Feature selection via maximizing global information

gain for text classiﬁcation. Knowledge-Based Sys-

tems, 54:298 – 309.

Sriram, B., Fuhry, D., Demir, E., Ferhatosmanoglu, H., and

Demirbas, M. (2010). Short text classiﬁcation in twit-

ter to improve information ﬁltering. In Proceedings

of the 33rd international ACM SIGIR conference on

Research and development in information retrieval,

pages 841–842. ACM.

Sun, A. (2012). Short text classiﬁcation using very few

words. In Proceedings of the 35th international ACM

SIGIR conference on Research and development in in-

formation retrieval, pages 1145–1146. ACM.

Yang, Z., Dai, Z., Yang, Y., Carbonell, J. G., Salakhutdi-

nov, R., and Le, Q. V. (2019). Xlnet: Generalized au-

toregressive pretraining for language understanding.

CoRR, abs/1906.08237.

Zhang, X., Zhao, J., and LeCun, Y. (2015). Character-

level convolutional networks for text classiﬁcation. In

Advances in neural information processing systems,

pages 649–657.
