Towards Automated Comprehensive Feature Engineering for Spam Detection

Fred N. Kiwanuka, Ja’far Alqatawna (1,2), Anang Hudaya Muhamad Amin, Sujni Paul and Hossam Faris

Computer Information Science, Higher Colleges of Technology, Dubai, U.A.E.

King Abdullah II School for Information Technology, The University of Jordan, Amman, Jordan

Keywords: Spam Detection, Dataset Processing, Automated Feature Engineering, Classification, Spam Features, Data Mining, Machine Learning, Python E-mail Feature Extraction and Classification Tool (CPyEFECT).

Abstract: Every day, billions of emails pass through online servers, and about 59% of them are spam according to recent research. Spam emails increasingly carry viruses or other malware and pose a security risk to computer systems, so spam filtering has become more important than ever. Spam now evolves so quickly that previously successful detection methods are failing to cope. In this paper, we propose a comprehensive and automated feature engineering framework for spam classification. The framework first enables the development of a large number of features from any email corpus, and then automatically derives further features using feature transformation and aggregation primitives. We show that automated feature engineering improves spam classification performance by between 2% and 28% for almost all conventional machine learning classifiers. As a by-product of this work, we develop a Python-based open source tool that incorporates the proposed framework.

1 INTRODUCTION

In the context of email systems, spam can be defined as unsolicited messages carrying commercial ads, offers and/or malicious content (Alqatawna et al., 2015; Guzella and Caminhas, 2009; Zaid et al., 2016). Annual statistics and research studies show that spammers keep updating their techniques and tricks to circumvent current anti-spam solutions, resulting in a growing volume of spam messages (Kaspersky, 2017; Ruano-Ordás et al., 2018). This ongoing challenge keeps the door open for more studies to better understand the nature of these unwanted messages and improve detection methods. Data mining techniques can be used to study and analyse spam features, which helps in building spam classification models and can significantly improve their detection rates. Detection is based on the assumption that the content of spam differs from that of a legitimate email in ways that can be quantified (Caruana and Li, 2008). This usually requires feature engineering: a process of identifying representative characteristics or attributes that can be used to build detection models (Koitka and Friedrich, 2016).

The feature engineering process is considered the most expensive and time-consuming part of the data mining process, as it is usually performed manually and requires domain knowledge (Domingos, 2012). The traditional approach is to build features one at a time using domain knowledge, a tedious, time-consuming, and error-prone process known as manual feature engineering. Over the last few years there has been an emerging trend to improve and automate this process, which reduces exploration time and gives researchers the chance to try more ideas (Kanter and Veeramachaneni, 2015a; Lam et al., 2017b). Automated feature engineering improves upon the manual workflow by automatically extracting useful and meaningful features from datasets. It cuts down the time spent on feature engineering and reduces data leakage by filtering time-dependent data.

This paper contributes to this direction by proposing an automated feature engineering approach in the context of spam detection. In our previous work we extensively studied the characteristics of spam email and reviewed the existing literature to gain insight into the various aspects of spam and spammer techniques (Alqatawna et al., 2015; Alqatawna et al., 2016; Hijawi et al., 2017a; Faris et al., 2015; Halaseh and Alqatawna, 2016). We developed a comprehensive list of spam features that we used for spam detection. As part of that work, we developed the E-mail Features Extraction Tool (EMFET), a .NET-based tool that only extracts features (Hijawi et al., 2017b; Hijawi et al., 2017a). In this paper, we extend this work by automating and improving the feature engineering process. Additionally, we integrate it with the remaining data mining steps of classification and feature selection. The proposed approach is implemented as an open source Python E-mail Feature Extraction and Classification Tool (CPyEFECT). CPyEFECT provides a flexible and automated way to perform spam feature engineering: it extracts a large number of features from any email corpus to produce a cleansed dataset, runs classification, and selects the most important features.

Figure 1: Comprehensive Feature Engineering Process.

Kiwanuka, F., Alqatawna, J., Amin, A., Paul, S. and Faris, H.
Towards Automated Comprehensive Feature Engineering for Spam Detection.
DOI: 10.5220/0007393004290437
In Proceedings of the 5th International Conference on Information Systems Security and Privacy (ICISSP 2019), pages 429-437
ISBN: 978-989-758-359-9

The rest of the paper is organized as follows. Background literature is presented in section 2. The proposed approach is presented in section 3, followed by its implementation in section 4. Testing and results are shown in section 5. Finally, section 6 presents the conclusions and future work.

2 LITERATURE REVIEW

Spam detection techniques have evolved as quickly as the ever-increasing tide of spam, with different objectives, evaluation methods, and measures of success. A large number of spam filters aimed at helping eliminate the spam problem exist. They use different techniques for analyzing email and determining whether it is indeed spam. The accuracy of a spam filter is evaluated along two dimensions: how much spam is successfully filtered out, and how few legitimate messages are accidentally deleted (Blanzieri and Bryl, 2008). Maintaining accuracy can be difficult because spam is constantly changing. Spam detection research has focused mainly on improved machine learning techniques to provide effective spam filters (Tretyakov, 2004), while feature engineering for spam has received less attention. The importance of feature engineering for prediction cannot be overstated. Although recent research has turned to feature engineering in general, little of it addresses spam.

The work in (Alqatawna et al., 2015) focuses on feature engineering to handle the problem of spam detection. The authors develop a substantial set (140) of features that can be used to detect spam with considerably good performance. Feature engineering for spam on Twitter is studied in (Herzallah et al., 2018). Spam feature engineering is purely text based, so the work on feature engineering for text classification in (Scott and Matwin, 1999) is also relevant: the authors describe 8 automatically generated representations based on words, phrases, synonyms and hypernyms. Unfortunately, the majority of the features in that work are not feasible for spam emails.

In recent years there has been progress in automating many aspects of machine learning, including model selection and hyperparameter tuning. However, feature engineering, which is considered the most important aspect of the machine learning pipeline, has received less attention. That is why recent work on automatic feature engineering (Kanter and Veeramachaneni, 2015b), (Khurana et al., 2016), (Katz et al., 2016), (Lam et al., 2017a) has received considerable attention in the literature. Automated feature engineering is a relatively new technique. The traditional approach to feature engineering is to build features one at a time using domain knowledge, a tedious, time-consuming, and error-prone process known as manual feature engineering. Automated feature engineering improves upon this workflow by automatically extracting useful and meaningful features from datasets. It cuts down the time spent on feature engineering and reduces data leakage by filtering time-dependent data. Automatic feature engineering has seen tremendous success in image and video processing through deep learning techniques. Deep learning, through its explicit feature construction, has also been used with considerable success in many other fields, such as speech processing and, recently, text classification. The main problems with deep learning highlighted by many authors are its higher computational burden, its need for large datasets to prevent overfitting, and its uninterpretable features.

Table 1: The Header Features.

| ID | Feature Details | Type | Ref. |
|---|---|---|---|
| 1-2 | Year, Month | Metadata | (Alqatawna et al., 2015) |
| 3-6 | Day, Hour, Minute, Second | Metadata | (Tran et al., 2013; Alqatawna et al., 2015) |
| 7-22 | From [Google?, AOL?, Gov?, HTML?, MIL?, Yahoo?, Example?]; To [Hotmail?, Yahoo?, Example?, MSN?, Localhost?, Google?, AOL?, Gov?, MIL?] | Metadata | (Alqatawna et al., 2015) |
| 23 | Count of “To” Email | Metadata | (Tran et al., 2013) |
| 24-33 | Reply to [Google?, Hotmail?, MIL?, Yahoo?, AOL?, Gov?]; X-Mailman-Version; Exist [Text/Plain?, Multipart/Mixed?, Multipart/Alternative?] | Metadata | (Alqatawna et al., 2015) |
| 34-49 | No. of [characters, capitalised words]; No. of words [in all uppercase, that are digits, containing only letters, containing letters and numbers, that are single letters, that are single digits, that are single characters]; Max ratio of [uppercase to lowercase letters of each word, uppercase letters to all characters of each word, digit characters to all characters of each word, non-alphanumerics to all characters of each word]; Min character diversity of each word; Max of the [longest repeated character, character lengths of words] | Subject | (Tran et al., 2013) |

(Hanif Bhuiyan, 2018) present a classification, evaluation and comparison of different email spam filtering systems and summarize the accuracy rates of the existing approaches. A Naive Bayesian classifier is used in (G.Vijayasekaran, 2018), within a three-layer framework that includes a classifier and an anomaly detector, to classify bulk emails as spam. The idea of automatic feature engineering for transaction and relational data was first put forward in (Kanter and Veeramachaneni, 2015b), with the aim of deriving predictive models from raw data automatically. The authors propose and develop the Deep Feature Synthesis algorithm for automatically generating features for relational datasets. The algorithm captures features that are usually supported by human intuition. The appeal of Deep Feature Synthesis is that one can combine raw data with what one knows about the data to build meaningful features for machine learning and predictive modelling. Using aggregation and transformation primitives, a large set of features can be developed. In this paper, we adapt some of the components of that work to spam detection. While (Kanter and Veeramachaneni, 2015b) is designed for relational and transaction data, spam emails do not have the notion of relationships.

3 THE PROPOSED SPAM FEATURE ENGINEERING APPROACH

In this section, we present our proposed comprehensive feature engineering approach, which can be described as a process with two phases: (1) a conventional, or manual, feature engineering phase that produces the initial set of 148 text-based spam features from raw emails, and (2) an automated feature engineering phase that applies transformation primitives to the manual features and yields a number of features that depends on the number of primitives used. These two phases are shown in Figure 1 and discussed in the following subsections.

3.1 Conventional Manual Feature Engineering

This conventional feature engineering phase has evolved out of several years of research on spam detection that investigated malicious spam content and the types of attacks carried by email messages (Alqatawna et al., 2015; Alqatawna et al., 2016; Hijawi et al., 2017a; Faris et al., 2015; Halaseh and Alqatawna, 2016). In this phase we identify 148 features, which we extract from the SpamAssassin corpus. In particular, we extract header, payload (body) and attachment features. As shown in Figure 1, the features comprise 50 header features, 4 attachment features, and 94 payload features.

Header features are shown in Table 1. These features are extracted from the contents of the header of the email message (sender, recipient, date and subject). Other features are extracted from the email payload, which is the part of the transmitted data that is the actual intended message. Features in this part are categorized into three subcategories: email body features, readability features, and lexical diversity features. Email body features are extracted from the body, which contains unstructured data such as images, HTML tags, text, and other objects. This feature set contains 64 features, which are described in Table 2. Four attachment-related features are also extracted: one feature for the total number of attachment files in an email and three features for the number of unique content types of the attachment files (text/plain, multipart/mixed and multipart/alternative).

SpamAssassin corpus: http://spamassassin.apache.org/publiccorpus/

Readability features represent how difficult it is to read a word, a sentence or a paragraph in the given email's body (Shams and Mercer, 2016; Al-Shboul et al., 2016). They are extracted based on syllables (sequences of speech sounds), which are used to distinguish between simple and complex words. This set contains 26 features that measure the difficulty of reading the email's body.
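As an illustration of syllable-based readability measures, the following is a minimal sketch rather than the extraction code used in this work. It assumes a simple vowel-run heuristic for counting syllables, treats words with three or more syllables as complex, and computes the Gunning fog index as one example measure; the function names and the choice of index are hypothetical and are not taken from the 26 features referred to above.

```python
import re

_VOWEL_RUN = re.compile(r"[aeiouy]+")


def count_syllables(word: str) -> int:
    """Rough syllable estimate: count runs of vowels, minus a silent final 'e'."""
    w = word.lower()
    n = len(_VOWEL_RUN.findall(w))
    # Treat a trailing 'e' as silent ("make"), except in "-le" endings ("table").
    if w.endswith("e") and not w.endswith("le") and n > 1:
        n -= 1
    return max(n, 1)


def readability_features(body: str) -> dict:
    """Return a few syllable-based readability measures for an email body."""
    sentences = [s for s in re.split(r"[.!?]+", body) if s.strip()]
    words = re.findall(r"[A-Za-z]+", body)
    if not sentences or not words:
        return {}
    syllables = [count_syllables(w) for w in words]
    # Assumption: a "complex" word has three or more syllables.
    n_complex = sum(1 for s in syllables if s >= 3)
    words_per_sentence = len(words) / len(sentences)
    # Gunning fog index: 0.4 * (average sentence length + percent complex words).
    fog = 0.4 * (words_per_sentence + 100.0 * n_complex / len(words))
    return {
        "n_sentences": len(sentences),
        "n_words": len(words),
        "n_complex_words": n_complex,
        "n_simple_words": len(words) - n_complex,
        "gunning_fog": fog,
    }


sample = "Claim your free money now. This opportunity is unbelievable!"
features = readability_features(sample)
```

The vowel-run heuristic is deliberately crude and miscounts some English words; a pronunciation dictionary would give more reliable syllable counts at the cost of an external resource.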

Lexical diversity features are extracted based on the vocabulary size of the text, where word occurrences are counted under different constraints (Tran et al., 2013; Tweedie and Baayen, 1998; Choi, 2012). This set of features contains seven features as follows: Vocabulary Richness, Hapax legomena, Hapax dislegomena, Entropy measure, YuleK, YuleI, SichelS, Honore, and Unusual words.
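Several of these measures can be computed directly from word frequencies. The sketch below is illustrative only: the tokenisation (lowercasing, whitespace splitting, punctuation stripping) and the reading of vocabulary richness as the ratio of distinct words to total words are assumptions, and the function name is hypothetical. Yule's K and Sichel's S follow their standard definitions in terms of the number of words that occur exactly i times.

```python
import math
from collections import Counter


def lexical_diversity(text: str) -> dict:
    """Compute a few vocabulary-based measures for a piece of text."""
    # Assumed tokenisation: lowercase, split on whitespace, strip punctuation.
    tokens = [w.strip(".,;:!?\"'()[]").lower() for w in text.split()]
    tokens = [w for w in tokens if w]
    n = len(tokens)
    if n == 0:
        return {}
    counts = Counter(tokens)
    v = len(counts)  # vocabulary size (number of distinct words)
    hapax = sum(1 for c in counts.values() if c == 1)  # words seen once
    dis = sum(1 for c in counts.values() if c == 2)    # words seen twice
    # Shannon entropy of the word distribution, in bits.
    entropy = -sum((c / n) * math.log2(c / n) for c in counts.values())
    # Sum over i of i^2 * V_i equals the sum of squared word counts.
    m2 = sum(c * c for c in counts.values())
    return {
        "vocabulary_richness": v / n,
        "hapax_legomena": hapax,
        "hapax_dislegomena": dis,
        "entropy": entropy,
        "yule_k": 1e4 * (m2 - n) / (n * n),
        "sichel_s": dis / v,
    }


measures = lexical_diversity("Buy now, buy cheap pills now. NOW!")
```

In this toy message "now" occurs three times and "buy" twice, so two of the four distinct words are hapax legomena and one is a hapax dislegomenon.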

3.2 Automatic Feature Engineering

The manual feature engineering process depends on domain knowledge, intuition, and data manipulation. This makes it extremely tedious, and the final features are limited by both human subjectivity and time. Automated feature engineering aims to automatically create many candidate features out of the already available features, from which the best can be selected and used for training.

Once we have built the manual features as in section 3.1, we employ Deep Feature Synthesis (DFS), as in (Kanter and Veeramachaneni, 2015b), to perform automated feature engineering. DFS stacks multiple transformation and aggregation operations to create features from data spread across many tables. Whereas DFS is designed for transaction and relational data with multiple tables, in this research we have only one table, so we customise DFS to support a single table. DFS is also not designed for text data such as ours, so an initial set of features must first be built, upon which the automatic features are then constructed.


Table 2: Email Body Features.

| ID | Feature Details | Ref. |
|---|---|---|
| 1 | Count of Spam Words | (Shams and Mercer, 2016; Alqatawna et al., 2015; Al-Shboul et al., 2016) |
| 2 | Count of Function Words | (Shams and Mercer, 2016; Al-Shboul et al., 2016) |
| 3 | Count of HTML Anchor | (Tran et al., 2013; Shams and Mercer, 2016; Alqatawna et al., 2015; Al-Shboul et al., 2016) |
| 4 | Count of Unique HTML Anchor | (Tran et al., 2013; Alqatawna et al., 2015) |
| 5 | Count of HTML Not Anchor | (Shams and Mercer, 2016) |
| 6 | Count of HTML Image | (Al-Shboul et al., 2016) |
| 7 | Count of HTML All Tags | (Shams and Mercer, 2016; Al-Shboul et al., 2016) |
| 8 | Count of Alpha-numeric Words | (Tran et al., 2013; Shams and Mercer, 2016; Al-Shboul et al., 2016) |
| 9 | TF-ISF | (Shams and Mercer, 2016) |
| 10 | TF-ISF without stopwords | (Shams and Mercer, 2016) |
| 11 | Count of duplicate words | (Al-Shboul et al., 2016) |
| 12 | Minimum word length | (Al-Shboul et al., 2016) |
| 13 | Count of lowercase letters | (Al-Shboul et al., 2016) |
| 14 | Longest sequence of adjacent capital letters | (Alqatawna et al., 2015; Al-Shboul et al., 2016) |
| 15 | Count of lines | (Alqatawna et al., 2015; Al-Shboul et al., 2016) |
| 16 | Total No. of digit characters | (Alqatawna et al., 2015) |
| 17 | Total No. of white spaces | (Alqatawna et al., 2015) |
| 18 | Total No. of upper case characters | (Alqatawna et al., 2015; Al-Shboul et al., 2016) |
| 19 | Total No. of characters | (Tran et al., 2013; Alqatawna et al., 2015) |
| 20 | Total No. of tabs | (Alqatawna et al., 2015) |
| 21 | Total No. of special characters | (Alqatawna et al., 2015) |
| 22 | Total No. of alpha characters | (Alqatawna et al., 2015) |
| 23 | Total No. of words | (Alqatawna et al., 2015; Al-Shboul et al., 2016) |
| 24 | Average word length | (Alqatawna et al., 2015; Al-Shboul et al., 2016) |
| 25 | Words longer than 6 characters | (Alqatawna et al., 2015) |
| 26 | Total No. of words (1 - 3 characters) | (Alqatawna et al., 2015) |
| 27 | No. of single quotes | (Alqatawna et al., 2015; Al-Shboul et al., 2016) |
| 28 | No. of commas | (Alqatawna et al., 2015; Al-Shboul et al., 2016) |
| 29 | No. of periods | (Alqatawna et al., 2015; Al-Shboul et al., 2016) |
| 30 | No. of semi-colons | (Alqatawna et al., 2015; Al-Shboul et al., 2016) |
| 31 | No. of question marks | (Alqatawna et al., 2015; Al-Shboul et al., 2016) |
| 32 | No. of multiple question marks | (Alqatawna et al., 2015) |
| 33 | No. of exclamation marks | (Alqatawna et al., 2015; Al-Shboul et al., 2016) |
| 34 | No. of multiple exclamation marks | (Alqatawna et al., 2015) |
| 35 | No. of colons | (Alqatawna et al., 2015; Al-Shboul et al., 2016) |
| 36 | No. of ellipses | (Alqatawna et al., 2015; Al-Shboul et al., 2016) |
| 37 | Total No. of sentences | (Shams and Mercer, 2016; Alqatawna et al., 2015; Al-Shboul et al., 2016) |
| 38 | Total No. of paragraphs | (Alqatawna et al., 2015) |
| 39 | Average No. of sentences per paragraph | (Alqatawna et al., 2015) |
| 40 | Average No. of words per paragraph | (Alqatawna et al., 2015) |
| 41 | Average No. of characters per paragraph | (Alqatawna et al., 2015) |
| 42 | Average No. of words per sentence | (Alqatawna et al., 2015) |
| 43 | No. of sentences beginning with upper case | (Alqatawna et al., 2015) |
| 44 | No. of sentences beginning with lower case | (Alqatawna et al., 2015) |
| 45 | Character frequency “$” | (Alqatawna et al., 2015; Al-Shboul et al., 2016) |
| 46 | No. of capitalized words | (Tran et al., 2013) |
| 47 | No. of words in all uppercase | (Tran et al., 2013) |
| 48 | No. of words that are digits | (Tran et al., 2013) |
| 49 | No. of words containing only letters | (Tran et al., 2013) |
| 50 | No. of words that are single letters | (Tran et al., 2013) |
| 51 | No. of words that are single digits | (Tran et al., 2013) |
| 52 | No. of words that are single characters | (Tran et al., 2013) |
| 53 | Max ratio of uppercase letters to lowercase letters of each word | (Tran et al., 2013) |
| 54 | Min of character diversity of each word | (Tran et al., 2013) |
| 55 | Max ratio of uppercase letters to all characters of each word | (Tran et al., 2013) |
| 56 | Max ratio of digit characters to all characters of each word | (Tran et al., 2013) |
| 57 | Max ratio of non-alphanumerics to all characters of each word | (Tran et al., 2013) |
| 58 | Max of the longest repeating character | (Tran et al., 2013) |
| 59 | Max of the character lengths of words | (Tran et al., 2013; Al-Shboul et al., 2016) |

Figure 2: The implementation structure of CPyEFECT.

We recursively apply a set of predefined mathematical transformations on the manual features from section 3.1 to obtain new features. The number of features computed depends on the number of manual features computed, the number of transformation primitives selected and the depth.

Let β be the total number of features computed, N be the number of manual features computed, m be the number of primitives selected and d be the depth. The number of features computed is given as:

$$\beta = \big( N m (N + 1) \big) \tag{1}$$
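The recursive application of transformation primitives to a single table can be sketched as follows. This is an illustrative toy, not the Deep Feature Synthesis implementation: the primitive sets `UNARY` and `BINARY`, the helper names and the two example columns are all hypothetical, since the primitives actually used are not listed here, and the feature counts it produces are specific to this toy rather than an instance of Eq. (1).

```python
import math
from itertools import combinations

# Hypothetical transformation primitives, chosen only for illustration.
UNARY = {
    "log1p": lambda x: math.log1p(abs(x)),
    "square": lambda x: x * x,
}
BINARY = {
    "add": lambda a, b: a + b,
    "multiply": lambda a, b: a * b,
}


def apply_primitives(table: dict) -> dict:
    """One pass: derive new columns from every column and column pair."""
    new = {}
    for name, col in table.items():
        for prim, fn in UNARY.items():
            new[f"{prim}({name})"] = [fn(x) for x in col]
    for (na, ca), (nb, cb) in combinations(table.items(), 2):
        for prim, fn in BINARY.items():
            new[f"{prim}({na},{nb})"] = [fn(a, b) for a, b in zip(ca, cb)]
    return new


def synthesize(table: dict, depth: int) -> dict:
    """Stack primitives: each level transforms everything produced so far."""
    features = dict(table)
    for _ in range(depth):
        features.update(apply_primitives(features))
    return features


# Two manual features for two emails (made-up values).
manual = {"n_words": [10, 200], "n_links": [0, 7]}
depth1 = synthesize(manual, 1)
depth2 = synthesize(manual, 2)
```

Even in this toy setting, two manual features and four primitives give 8 columns at depth 1 and 74 at depth 2, which shows how quickly stacking enlarges the feature space.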

Comprehensive Feature Engineering. This is a two-fold phase. First, manual features are developed as in section 3.1; then the automated feature engineering process is carried out as described in section 3.2. As highlighted in (Kanter and Veeramachaneni, 2015b), the feature space grows very quickly as the number of primitives and the depth increase.

The number of features grows exponentially with the depth. We utilize transformation primitives rather than aggregation primitives. The number of potential transformations is unbounded, and the more transformations are used, the more features are generated. In principle, we can generate new features using different combinations of the manual features. The only drawback is the exponential growth of the feature computation. The simplest way to reduce the overall classification complexity given the huge feature space is feature selection, which we utilize in this research. There are other ways of reducing this complexity, such as those employed in (Khurana et al., 2016) and (Katz et al., 2016), where the authors propose mechanisms for finding the transformation and aggregation combinations that give the best classification performance while also speeding up the feature computation.
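One simple form of feature selection is a filter that scores each feature against the class label and keeps the highest-scoring ones. The sketch below ranks features by the absolute Pearson correlation with a binary spam label. The scoring criterion is an assumption made for illustration, as the selection method is not specified in this section, and the function names and toy data are hypothetical.

```python
import math


def pearson(xs, ys) -> float:
    """Pearson correlation; returns 0.0 when either input is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0  # a constant column carries no signal
    return sxy / math.sqrt(sxx * syy)


def select_top_k(features: dict, labels: list, k: int) -> list:
    """Return the names of the k features most correlated with the label."""
    ranked = sorted(
        features,
        key=lambda name: abs(pearson(features[name], labels)),
        reverse=True,
    )
    return ranked[:k]


# Toy data: four emails, label 1 = spam, 0 = ham.
table = {
    "n_links": [0, 1, 5, 7],
    "n_words": [100, 90, 110, 95],
    "constant": [1, 1, 1, 1],
}
labels = [0, 0, 1, 1]
selected = select_top_k(table, labels, 2)
```

A univariate filter of this kind is cheap enough to run over thousands of generated features, but it ignores interactions between features, so model-based importance scores are a common alternative.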

In our case, we limit the depth to two and generate 6121 features. The computational burden of computing the features increases substantially with depth. For example, at a depth of 2 it took 6.4 minutes to compute the features, while at a depth of 3 it took at least 9 hours. Moreover, increasing the depth to 3 did not improve the classification results substantially. In this research, we find the transformation primitives that give the best classification results through trial and error as well as domain knowledge of classifiers. This search can be performed automatically, as in (Khurana et al., 2016) and (Katz et al., 2016), but at a substantial cost.

4 IMPLEMENTATION

In this section, we describe the implementation of the proposed comprehensive feature engineering framework and provide a performance evaluation for the classification of spam emails. We developed a tool that incorporates the comprehensive feature engineering framework. There are four functions to cover the various processes required to study email spam detection.

Figure 3: Most informative 50 features for Naive Bayes.

Figure 4: A comparison of ROC curves for the Naive Bayes classifier.

Data Preparation. The tool uses the Python email package to decompose MIME-structured files. Each email message is read as text from its file in the corpus to produce the object structure of the message. This decomposition facilitates extraction of the required features and preparation of the dataset for subsequent phases. Each email is split into its main parts: From, Date, Subject, ContentType, CCC and Body. The manual feature engineering described in section 3.1 extracts its features from these parts.
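A minimal sketch of this decomposition with the standard-library `email` package is given below. It is not the CPyEFECT code: the function names, the returned dictionary layout, the choice of the first text/plain part as the body, and the domain checks used for the example header features are all assumptions made for illustration, and only a subset of the parts listed above is extracted.

```python
from email import message_from_string
from email.utils import getaddresses, parseaddr


def decompose(raw: str) -> dict:
    """Split a raw MIME message into parts used for feature extraction."""
    msg = message_from_string(raw)
    body = ""
    # walk() visits the message itself first, then any nested MIME parts.
    for part in msg.walk():
        if part.get_content_type() == "text/plain":
            body = part.get_payload()
            break
    return {
        "From": msg.get("From", ""),
        "To": msg.get_all("To", []),
        "Date": msg.get("Date", ""),
        "Subject": msg.get("Subject", ""),
        "ContentType": msg.get_content_type(),
        "Body": body,
    }


def header_features(parts: dict) -> dict:
    """Derive a few header features in the spirit of Table 1 (assumed rules)."""
    sender = parseaddr(parts["From"])[1].lower()
    domain = sender.rpartition("@")[2]
    return {
        "from_google": int(domain in ("gmail.com", "google.com")),
        "from_yahoo": int(domain.startswith("yahoo.")),
        "to_count": len(getaddresses(parts["To"])),
        "is_text_plain": int(parts["ContentType"] == "text/plain"),
    }


raw = (
    "From: Alice <alice@gmail.com>\n"
    "To: bob@example.com, carol@yahoo.com\n"
    "Subject: FREE offer\n"
    "Date: Mon, 2 Sep 2002 10:00:00 +0000\n"
    "Content-Type: text/plain\n"
    "\n"
    "Click here now!\n"
)
parts = decompose(raw)
feats = header_features(parts)
```

Real corpus files may use other encodings or contain only HTML parts, in which case this sketch leaves the body empty; a production extractor would need to decode transfer encodings and handle such messages explicitly.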

Classification. We implemented nine conventional machine learning algorithms: Extra Tree Classifier, Gradient Boosting Classifier, Support Vector Machine (SVM), Random Forest Classifier, K-Nearest Neighbor Classifier, Logistic Regression, Naive Bayes, AdaBoost Classifier, and Decision Trees. Additionally, various validation measures are available to test the produced models.

Feature Selection. Thousands of features were computed at a less than proportional increase in computational cost. We computed the most important features for the classifier with the best performance. Figure 2 describes the implementation structure of the framework.


Figure 5: A comparison of classiﬁer precision and recall for various algorithms.

Figure 6: A comparison of classiﬁer accuracy and f-score for various algorithms.

5 RESULTS

In this section, using the SpamAssassin dataset, we apply different classical classification methods, such as decision trees, neural networks, support vector machines, random forests, gradient boosting, logistic regression and k-nearest neighbors, to identify whether an email is spam or ham. We do this first for the manually engineered features and then for the automatically generated features. With manual features, each email is represented by a vector of 148 features, while with automatic features each email is represented by 6121 features. We experimented on 1800 emails, of which 950 are ham and 850 are spam.

For each set of features we computed recall, accuracy, precision, F-score, ROC curves, feature selection and the confusion matrix. Owing to space limitations, only a subset of these results is reported here.
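For reference, accuracy, precision, recall and F-score all follow from the four cells of the confusion matrix. The sketch below computes them for a binary spam/ham task using their standard definitions; the function names and the toy label vectors are hypothetical and do not reproduce any result reported in this paper.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, fp, fn, tn) with `positive` as the spam class."""
    tp = fp = fn = tn = 0
    for t, p in zip(y_true, y_pred):
        if p == positive and t == positive:
            tp += 1
        elif p == positive:
            fp += 1  # ham flagged as spam
        elif t == positive:
            fn += 1  # spam that slipped through
        else:
            tn += 1
    return tp, fp, fn, tn


def scores(y_true, y_pred) -> dict:
    """Accuracy, precision, recall and F-score from the confusion matrix."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    return {
        "accuracy": (tp + tn) / len(y_true),
        "precision": precision,
        "recall": recall,
        "f_score": 2 * precision * recall / denom if denom else 0.0,
    }


# Toy labels: 1 = spam, 0 = ham.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 1, 0, 0, 0, 1, 0, 1]
result = scores(y_true, y_pred)
```

Recall measures how much spam is caught, while precision reflects how rarely legitimate mail is flagged, which mirrors the two dimensions of filter accuracy discussed in section 2.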

From Figure 4, Figure 5 and Figure 6, it is clear that automatic feature engineering improves the performance of classifiers. This is especially so for classifiers that are weak on the manual features, such as Naive Bayes. With the manual features its accuracy was 82%, while with the automatically generated features it increased by 8% to 90%. The F-score improved by 24%, precision by 5%, and recall by 28%. The other classifiers also improved, by between 1.5% and 12%.


6 CONCLUSIONS AND FUTURE WORK

In this paper, we present an automated comprehensive email feature engineering framework developed for spam detection and classification. The framework incorporates a scalable mechanism for automated feature engineering together with classification algorithms for spam classification. Currently, the proposed framework is capable of producing high accuracy for spam classification with the 148 manual email features. The automated feature engineering scheme further improves the classification accuracy of some of the classifiers by up to 12%, as more features are added to the analysis.

For future work, we propose to look into optimizing the processes of the developed framework, to enable more efficient feature engineering and classification.

REFERENCES

Al-Shboul, B. A., Hakh, H., Faris, H., Aljarah, I., and Alsawalqah, H. (2016). Voting-based classification for e-mail spam detection. Journal of ICT Research and Applications, 10(1):29–42.

Alqatawna, J., Faris, H., Jaradat, K., Al-Zewairi, M., and Omar, A. (2015). Improving knowledge based spam detection methods: The effect of malicious related features in imbalance data distribution. International Journal of Communications, Network and System Sciences, 8(5):118–129.

Alqatawna, J., Hadi, A., Al-Zwairi, M., and Khader, M. (2016). A preliminary analysis of drive-by email attacks in educational institutes. In Cybersecurity and Cyberforensics Conference (CCC), pages 65–69. IEEE.

Blanzieri, E. and Bryl, A. (2008). A survey of learning-based techniques of email spam filtering. Artif. Intell. Rev., 29(1):63–92.

Caruana, G. and Li, M. (2008). A survey of emerging approaches to spam filtering. ACM Comput. Surv., 44(2):9:1–9:27.

Choi, W. H. (2012). Finding appropriate lexical diversity measurements for small-size corpus. In Applied Mechanics and Materials, volume 121, pages 1244–1248. Trans Tech Publ.

Domingos, P. (2012). A few useful things to know about machine learning. Communications of the ACM, 55(10):78–87.

Faris, H., Aljarah, I., and Alqatawna, J. (2015). Optimizing feedforward neural networks using krill herd algorithm for e-mail spam detection. In 2015 IEEE Jordan Conference on Applied Electrical Engineering and Computing Technologies (AEECT), pages 1–5. IEEE.

Guzella, T. S. and Caminhas, W. M. (2009). A review of machine learning approaches to spam filtering. Expert Systems with Applications, 36(7):10206–10222.

G.Vijayasekaran, S. (2018). Spam and email detection in big data platform using naive bayesian classifier. International Journal of Computer Science and Mobile Computing, 7(4).

Halaseh, R. A. and Alqatawna, J. (2016). Analyzing cybercrimes strategies: The case of phishing attack. In Cybersecurity and Cyberforensics Conference (CCC), pages 82–88. IEEE.

Hanif Bhuiyan, Akm Ashiquzzaman, T. I. J., S. B., and J. A. (2018). A survey of existing e-mail spam filtering methods considering machine learning techniques. Global Journal of Computer Science and Technology: Software and Data Engineering, 18(2), Version 1.0.

Herzallah, W., Faris, H., and Adwan, O. (2018). Feature engineering for detecting spammers on twitter: Modelling and analysis. Journal of Information Science, 44(2):230–247.

Hijawi, W., Faris, H., Alqatawna, J., Ala’M, A. Z., and Aljarah, I. (2017a). Improving email spam detection using content based feature engineering approach. In Applied Electrical Engineering and Computing Technologies (AEECT). IEEE.

Hijawi, W., Faris, H., Alqatawna, J., Aljarah, I., Al-Zoubi, A., and Habib, M. (2017b). Emfet: E-mail features extraction tool. arXiv preprint arXiv:1711.08521.

Kanter, J. M. and Veeramachaneni, K. (2015a). Deep feature synthesis: Towards automating data science endeavors. In 2015 IEEE International Conference on Data Science and Advanced Analytics (DSAA), pages 1–10. IEEE.

Kanter, J. M. and Veeramachaneni, K. (2015b). Deep feature synthesis: Towards automating data science endeavors. In 2015 IEEE International Conference on Data Science and Advanced Analytics (DSAA), pages 1–10.

Kaspersky (2016). Spam and phishing in Q3 2016. Accessed May 20, 2017.

Katz, G., Shin, E. C. R., and Song, D. (2016). Explorekit: Automatic feature generation and selection. In 2016 IEEE 16th International Conference on Data Mining (ICDM), pages 979–984.

Khurana, U., Turaga, D., Samulowitz, H., and Parthasarathy, S. (2016). Cognito: Automated feature engineering for supervised learning. In 2016 IEEE 16th International Conference on Data Mining Workshops (ICDMW), pages 1304–1307.

Koitka, S. and Friedrich, C. M. (2016). Traditional feature engineering and deep learning approaches at medical classification task of imageclef 2016. In CLEF (Working Notes), pages 304–317.

Lam, H. T., Thiebaut, J., Sinn, M., Chen, B., Mai, T., and Alkan, O. (2017a). One button machine for automating feature engineering in relational databases. CoRR, abs/1706.00327.

Lam, H. T., Thiebaut, J.-M., Sinn, M., Chen, B., Mai, T., and Alkan, O. (2017b). One button machine for automating feature engineering in relational databases. arXiv preprint arXiv:1706.00327.

Ruano-Ordás, D., Fdez-Riverola, F., and Méndez, J. R. (2018). Concept drift in e-mail datasets: An empirical study with practical implications. Information Sciences, 428:120–135.

Scott, S. and Matwin, S. (1999). Feature engineering for text classification. In Proceedings of the Sixteenth International Conference on Machine Learning, ICML ’99, pages 379–388, San Francisco, CA, USA. Morgan Kaufmann Publishers Inc.

Shams, R. and Mercer, R. E. (2016). Supervised classification of spam emails with natural language stylometry. Neural Computing and Applications, 27(8):2315–2331.

Tran, K.-N., Alazab, M., and Broadhurst, R. (2013). Towards a feature rich model for predicting spam emails containing malicious attachments and urls. In Eleventh Australasian Data Mining Conference, Canberra, ACT, volume 146.

Tretyakov, K. (2004). Machine learning techniques in spam filtering. Technical report, Institute of Computer Science, University of Tartu.

Tweedie, F. J. and Baayen, R. H. (1998). How variable may a constant be? Measures of lexical richness in perspective. Computers and the Humanities, 32(5):323–352.

Zaid, A., Alqatawna, J., and Huneiti, A. (2016). A proposed model for malicious spam detection in email systems of educational institutes. In Cybersecurity and Cyberforensics Conference (CCC), pages 60–64. IEEE.
