Phishing Email Detection based on Named Entity Recognition

Vít Listík, Šimon Let, Jan Šedivý and Václav Hlaváč

Faculty of Electrical Engineering, Department of Cybernetics, Czech Technical University in Prague, Technická 2, Prague 6, Czech Republic

Faculty of Information Technology, Department of Theoretical Computer Science, Czech Technical University in Prague, Thákurova 9, Prague 6, Czech Republic

Czech Institute of Informatics, Robotics and Cybernetics, Jugoslávských partyzánů 1580/3, Prague 6, Czech Republic

Keywords:

Named Entity Recognition (NER), Phishing, Email.

Abstract:

This work evaluates two phishing detection algorithms, both based on named entity recognition (NER), on live traffic of Email.cz. The first algorithm was proposed in (Ramanathan and Wechsler, 2013). It uses NER and latent Dirichlet allocation (LDA) as feature extractors for a random forest classifier. This algorithm achieved 100% F-measure on the publicly available testing dataset. We use this algorithm as the baseline for our newly proposed solution. The proposed solution takes the companies detected by NER and compares the URLs present in the email content to the company URL profile (based on history). The company URL profile contains the domains which are frequently mentioned in legitimate traffic from that company's domain. The advantage of the proposed solution is that it does not need a phishing dataset, which is hard to get, especially for languages other than English. Our solution outperforms the baseline solution. Both solutions are able to detect previously undetected phishing attacks. The combination of the two solutions achieves 100% F-measure on a portion of live traffic.

1 INTRODUCTION

Phishing is a fraudulent attempt to steal personal information. It is commonly carried out by email and often relies on impersonation techniques. The attacker wants the victim to click on a malicious URL and fill in personal information. The attack succeeds because the victim believes that the email was sent by a legitimate, known organization. Phishing causes huge financial and reputation losses globally.

1.1 Email.cz Trafﬁc Statistics

Email.cz is the biggest freemail provider in the Czech Republic. The in-house anti-spam solution analyzes approximately 50 million messages a day. The current anti-spam system is not focused on phishing detection, but the negative effects of those attacks affect not only the users but also the company. Phishing became a bigger problem when Email.cz started offering email on custom domains, because this service is commonly used by companies.

Based on language detection (CLD2, https://github.com/CLD2Owners/cld2), 75% of incoming email messages at Email.cz are in the Czech language and 15% in the English language. For the rest, the language is mostly undetected.

1.2 Task

Our task is to create a model for detecting phishing emails in Email.cz traffic. As shown in Sec. 1.1, a large portion of the analyzed messages are in the Czech language. Therefore, the proposed solution should be extensible to languages other than English and should not rely on phishing examples, because they are almost impossible to obtain (described in Sec. 4.1).

2 STATE OF THE ART

Phishing detection has been approached many times. A solution may be based on metadata or on content. We focus on content-based methods. Those methods may use the email text, images contained in the email, or the attachments. We use natural language processing (NLP) methods for phishing detection, more specifically named entity recognition (NER), which was proven efficient for this task (Ramanathan and Wechsler, 2013).


Listík, V., Let, Š., Šedivý, J. and Hlaváč, V.

Phishing Email Detection based on Named Entity Recognition.

DOI: 10.5220/0007314202520256

In Proceedings of the 5th International Conference on Information Systems Security and Privacy (ICISSP 2019), pages 252-256

ISBN: 978-989-758-359-9

Other possible approaches are based on metadata; the best example is information from SMTP traffic (i.e. sender reputation). Those approaches are successful for tasks like spam detection because the messages are sent in bulks big enough to build the reputation. This does not apply to phishing, which is sent in smaller batches, often from stolen SMTP servers. That makes reputation approaches ineffective for phishing detection. Another argument for not exploring this path is that those mechanisms are already used in most anti-spam solutions; therefore, attacks detectable this way would not be delivered anyway.

The most commonly used approach for phishing detection is blacklisting, which is also true for spam detection. This method is simple to implement: the URLs present in the email body are looked up in a blacklist. The benefit of this method is very high precision, because the lists are hand-curated. That is also its biggest downside, because new attacks are added slowly, causing low recall.
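The blacklist lookup described above can be sketched as follows. This is only a minimal illustration: the URL pattern, the function name `blacklisted_urls` and the example URLs are assumptions, not part of any real blacklist service.

```python
import re

# Simplified URL pattern (an assumption; real extractors are more thorough).
URL_PATTERN = re.compile(r"https?://[^\s\"'<>]+")


def blacklisted_urls(body, blacklist):
    """Return the URLs found in the email body that are on the blacklist."""
    return [url for url in URL_PATTERN.findall(body) if url in blacklist]


# Hypothetical data: one known attack URL and one URL not yet on the list.
blacklist = {"http://bad.example/login"}
body = "Visit http://bad.example/login now or http://new.example/x"
hits = blacklisted_urls(body, blacklist)
```

The second URL is not reported because it is not on the list yet, which illustrates why slowly updated lists lead to low recall.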

2.1 Named Entity Recognition

Named entity recognition (NER) is a natural language processing technique which maps a sequence of words to a sequence of tags. The tags denote named entities, e.g. personal names, places, companies, dates, and numbers.

We use the NER implementation called Nametag (Straková et al., 2014) because it achieves the best results for the Czech language. It also performs well for English. We need to target Czech because most of the Email.cz traffic is Czech (Sec. 1.1) and the Czech language is more complicated for machine processing than English.

Nametag is a two-pass algorithm based on a maximum entropy Markov model. It uses morphological analysis and other features such as word clustering, gazetteers and orthographic features (capitalization, punctuation) (Straková et al., 2014).

3 METHODS

We consider two methods for phishing detection. The first one is described in (Ramanathan and Wechsler, 2013). Its authors achieved a result of 100% F-measure. This method is a random forest classifier based on features extracted by NER and latent Dirichlet allocation (LDA). We re-implement this method and validate its performance in live traffic. This method depends on a representative dataset, which is hard to get for this task.

Our second, newly proposed method overcomes this precondition by detecting entity impersonation instead. This method uses NER to detect companies and then compares the link profile of the detected company to the link profile of the email. The email link profile is extracted from the email body. The company link profile is built from historical data and consists of the domains which are referred to in legitimate emails sent by the company.

3.1 Random Forest Classiﬁer

We trained a model based on (Ramanathan and Wechsler, 2013). This model consists of three parts. The first part is named entity recognition. The second part is latent Dirichlet allocation (LDA). The third and last part is a random forest classifier.

The first part is responsible for detecting names of people, places, and organizations. LDA is responsible for estimating topic probabilities. Those two models work well together because NER considers the position of the input words, whereas LDA is based only on word frequencies (position independent) (Ramanathan and Wechsler, 2013).

Stanford NER (Finkel et al., 2005) was used in the original paper. We use Nametag (Straková et al., 2014) because it has more granular tags and it was tested on email language (described in Sec. 5.1). We use the tags starting with "g" (geographical names), "i" (companies and institutions) and "p" (personal names).
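The selection of tags by their first letter can be illustrated with a short sketch. The input format (a list of token and tag pairs), the example tags and the "O" tag for non-entities are assumptions made for illustration; the sketch does not call Nametag itself.

```python
# Tag prefixes named in the text: geographical, institution, personal names.
PREFIXES = ("g", "i", "p")


def count_entity_groups(tagged_tokens):
    """Count tagged tokens per tag prefix from (token, tag) pairs."""
    counts = {prefix: 0 for prefix in PREFIXES}
    for _token, tag in tagged_tokens:
        prefix = tag[:1].lower()  # compare case-insensitively ("IF" vs "if")
        if prefix in counts:
            counts[prefix] += 1
    return counts


# Hypothetical tagger output; "O" marks a token outside any entity.
tagged = [("Praha", "gu"), ("Seznam", "if"), ("Jan", "pf"),
          ("Novák", "ps"), ("dnes", "O")]
group_counts = count_entity_groups(tagged)
```

Counts of this kind are one plausible form of NER-derived feature; the exact 40 features used by the classifier are not spelled out here.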

The LDA model uses the term-document matrix as its input. This matrix is created with the scikit-learn count vectorizer (Pedregosa et al., 2011). We use 1000 features after stop word removal. The vectorizer is trained on 90% of the dataset; we use such a large share because the corpus is small and we want most of the words to be present in the vectorizer. We use 200 topics, as proposed in the original paper. The achieved perplexity is 421, which is comparable to the original paper.

The first two models are used to create the input feature vector for the final classifier. The feature vector consists of 200 topic probabilities predicted by the LDA model and 40 NER-based features consisting of the tags and the entities.

The final model consists of 200 weak learners of maximum depth 5. We trained it on the dataset described in Sec. 4.1, the same way as in the original paper. This model architecture achieved 100% F-measure in the original paper. We achieved 94% F-measure (more details in Sec. 5.2), which is a sufficient result for the production test (described in Sec. 5.3). Our implementation in Python uses scikit-learn (Pedregosa et al., 2011) and is available on GitHub.
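The following sketch shows how the three parts could be wired together with scikit-learn. It is not the published implementation: the toy corpus, the labels, the three NER count features and the reduced number of topics (5 instead of 200) are assumptions chosen so that the example runs on a handful of sentences.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer


def build_features(texts, ner_features, vectorizer, lda, fit=False):
    """Concatenate LDA topic probabilities with NER-derived features."""
    counts = vectorizer.fit_transform(texts) if fit else vectorizer.transform(texts)
    topics = lda.fit_transform(counts) if fit else lda.transform(counts)
    return np.hstack([topics, np.asarray(ner_features, dtype=float)])


# Toy corpus (hypothetical); labels: 1 = phishing, 0 = ham.
texts = [
    "verify your bank account password now",
    "your account is suspended click the link to verify",
    "meeting notes attached for the project review",
    "lunch tomorrow with the project team",
]
labels = [1, 1, 0, 0]
# Hypothetical NER features: counts of [geographical, institution, personal] tags.
ner = [[0, 1, 0], [0, 1, 0], [0, 0, 1], [0, 0, 2]]

vectorizer = CountVectorizer(max_features=1000, stop_words="english")
lda = LatentDirichletAllocation(n_components=5, random_state=0)  # 200 in the text
forest = RandomForestClassifier(n_estimators=200, max_depth=5, random_state=0)

X = build_features(texts, ner, vectorizer, lda, fit=True)
forest.fit(X, labels)
predictions = forest.predict(X)
```

For unseen emails, `build_features` would be called with `fit=False` so that the vectorizer and the LDA model trained on the training split are reused.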

3.2 Impersonation Detection

Phishing is based on entity impersonation. The approach described in Sec. 3.1 may learn to detect entity impersonation because it uses NER features, but it needs a lot of phishing samples to do that. As stated in Sec. 4.1, there is no phishing dataset for the Czech language (or for other languages), and the available English datasets are old and not very representative. We still want to detect those attacks; therefore, we propose an impersonation detection method. This method uses the knowledge that phishing impersonates some trusted entity to make a user click on a malicious link.

Our suggested method consists of three steps.

1. Detect company entities from the email body with the use of NER (Sec. 3.3).

2. Map detected company entities (names) to the domains (Sec. 3.4).

3. Compare domains extracted from links to detected domain link profiles and report unexpected domains present in the email (Sec. 3.5).

Those steps are described in greater detail in the

following sections.

3.3 Entity Detector

The first step of this system is to detect company entities in the email body. The email headers may be edited by the attacker, but the email body almost always contains the company entity, in order to persuade the user that the entity sent the email. We use NER for this task, specifically Nametag with the custom model described in Sec. 2.1. The output, consisting of the company entities (tag IF) detected by the Nametag model, is passed to the Target Detector module (Sec. 3.4).

3.4 Target Detector

The second step of the system is called the target detector. It maps the company names detected by the entity detector (Sec. 3.3) to the corresponding domains (entity URLs). Companies may be referred to by many names; our goal is to normalize all company names to their canonical URL, which may be used as the input to the next part of the algorithm (Sec. 3.5).

We built the mapping from the national registry of companies and added 50 selected international companies by hand. We use entity name expansion (adding suffixes like "co.") because the official name of a company contains such suffixes, but they may be omitted in email communication.

Footnote (Sec. 3.1, implementation repository): https://github.com/tivvit/phishing-ner-lda
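The name expansion can be sketched as a lookup that tries the detected name with each known suffix appended. The registry entries, the suffix list and the function name `target_domain` are hypothetical and serve only to illustrate the idea.

```python
# Hypothetical registry: official company name (lowercase) -> canonical domain.
REGISTRY = {
    "acme co.": "acme.example",
    "example bank a.s.": "examplebank.example",
}
# Hypothetical suffixes; the empty string tries the name as detected.
SUFFIXES = ("", " co.", " a.s.", " s.r.o.")


def target_domain(entity, registry=REGISTRY, suffixes=SUFFIXES):
    """Map a detected company name to its domain, or None if unknown."""
    base = " ".join(entity.lower().split())  # lowercase, collapse whitespace
    for suffix in suffixes:
        domain = registry.get(base + suffix)
        if domain is not None:
            return domain
    return None


domain = target_domain("Acme")
```

A real mapping would also have to deal with inflected forms of company names, which is beyond this sketch.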

3.5 Domain Link Proﬁle

Domain (company) link profile creation is an essential part of the phishing detection. We chose 20 candidate domains which are commonly attacked. This list was created based on Phishtank reports (OpenDNS) and an internal database of historic reports. The advantage of this approach is that new domains may be added (by hand or automatically) and that it significantly lowers the number of false positive detections.

For each of those domains, the following steps are executed to build the link profile.

1. Take emails signed with DKIM (Kucherawy et al., 2011) matching the domain of the analyzed company from the historic traffic (1 week in our case). This ensures that only emails legitimately sent from the company are used for the profile creation.

2. Extract all domains linked from those emails. Linked means present in the href attribute in the HTML body of the email.

3. Filter out the long tail.
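The profile-building steps above can be sketched with the Python standard library. The sketch assumes that the input bodies already passed the DKIM check of step 1, uses the URL host name as the "domain", and cuts the long tail with a hypothetical `min_share` threshold, since the exact filtering criterion is not specified.

```python
from collections import Counter
from html.parser import HTMLParser
from urllib.parse import urlparse


class HrefCollector(HTMLParser):
    """Collect the values of href attributes found in an HTML body."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name == "href" and value:
                self.hrefs.append(value)


def linked_domains(html_body):
    """Return the set of host names linked from one email body."""
    collector = HrefCollector()
    collector.feed(html_body)
    hosts = (urlparse(href).hostname for href in collector.hrefs)
    return {host for host in hosts if host}


def build_profile(html_bodies, min_share=0.05):
    """Keep domains linked from at least min_share of the emails."""
    counts = Counter()
    for body in html_bodies:
        counts.update(linked_domains(body))
    threshold = min_share * len(html_bodies)
    return {domain for domain, n in counts.items() if n >= threshold}


# Hypothetical historic traffic: 20 legitimate emails, one with a rare link.
bodies = ['<a href="https://shop.example/offer">offer</a>'] * 19
bodies.append('<a href="https://shop.example/a">a</a>'
              '<a href="http://rare.example/">b</a>')
profile = build_profile(bodies, min_share=0.1)
```

Each domain is counted once per email, so a single message with many links to a rare domain cannot push that domain into the profile.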

The detection part of this system uses the built domain link profile. It gets the list of links (URLs) from the email body and the detected domain for that email. If the DKIM signature matches the detected domain, the email is considered valid (Kucherawy et al., 2011). If there is no signature, or if it does not match, the module continues with the analysis: it extracts the domains from the email body links and checks whether they are present in the profile of the detected company. If they are not, the email is considered phishing.
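The decision logic of the detection part can be sketched as follows. The function name, its arguments and the example profile are assumptions. In particular, `dkim_domain` stands for the domain of an already verified DKIM signature (or `None`), and emails mentioning a company without a profile are simply not flagged.

```python
from urllib.parse import urlparse


def is_phishing(link_urls, detected_domain, dkim_domain, profiles):
    """Return True when the email links outside the detected company's profile."""
    if dkim_domain is not None and dkim_domain == detected_domain:
        return False  # validly signed by the detected company
    if detected_domain not in profiles:
        return False  # assumption: companies without a profile are not analyzed
    profile = profiles[detected_domain]
    hosts = {urlparse(url).hostname for url in link_urls}
    hosts.discard(None)  # links without a host name (e.g. mailto:) are ignored
    return any(host not in profile for host in hosts)


# Hypothetical profile for one monitored company.
profiles = {"bank.example": {"bank.example", "cdn.bank.example"}}
suspicious = is_phishing(["http://evil.example/login"], "bank.example", None, profiles)
```

A single link to an unexpected domain is enough to flag the email, which mirrors the rule described above.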

4 DATA

This section describes datasets used for training and

evaluation of the presented models.

4.1 Public Phishing Datasets

Most email phishing detectors are tested with publicly available datasets.

The standard dataset of positive phishing samples is (Nazario). This dataset consists of 4450 emails sent from 2004 to 2007. It is no longer available online, so we published it again on the Academic Torrents platform.

It is complicated to publish negative phishing samples because of the private nature of email. The commonly used dataset is the SpamAssassin corpus. It is a collection of ham and spam messages collected from 2002 to 2005 and consists of 6047 messages.

Most of the messages in those datasets are in the English language. To our knowledge, there are no publicly available phishing datasets for the Czech language.

4.2 Email.cz Named Entity Recognition Dataset

We have created a dataset of annotated entities from email conversations. This dataset was created for the internal use of Email.cz. It is based on 2 million email texts dumped from live traffic. Those emails were preselected and no personal messages were used in this dataset. The messages were annotated by 4 people. We used entity tags based on Cnec 2.0 (Ševčíková et al., 2007), but used only 39 of the 45 original tags. We created this dataset because of the poor results (described in Sec. 5.1) of the original model on email data. The annotated part of this dataset consists of 54724 sentences and 125711 tags.

5 EXPERIMENTAL RESULTS

In this section, we present the results of NER on the Email.cz dataset and of the phishing detection models on publicly available datasets and on Email.cz live traffic.

5.1 NER Models

We tested the Nametag model (Straková et al., 2014) trained on Cnec 2.0 (Ševčíková et al., 2007) and our Nametag-based model trained on the internal dataset (described in Sec. 4.2) on testing data separated from the Email NER 2018-04 dataset (described in Sec. 4.2). The results for those models are shown in Tab. 1. The reported Nametag result for Cnec 2.0 is 77.35% F-score (for 45 tags) (Straková et al., 2014).

Table 1: Results for NER models on Email.cz dataset.

Model       Precision   Recall   F-score
Cnec 2.0         8.81    14.82     11.05
Email.cz        77.31    80.58     78.91

Footnote (Sec. 4.1, dataset on Academic Torrents): http://academictorrents.com/details/a77cda9a9d89a60dbdfbe581adf6e2df9197995a

Based on the results shown in Tab. 1, we used the Email.cz NER model as the base for the other experiments. We assume that the low performance of the Cnec 2.0 based model is caused by the different nature of the language used in email (informal speech) and in Cnec 2.0 (based on news articles).

Because the domain link profile model (Sec. 3.2) relies on NER correctly detecting companies (IF tag), we have also done an evaluation for that tag only, shown in Tab. 2.

Table 2: Results for NER models on Email.cz dataset (only companies - IF tag).

Model       Precision   Recall   F-score
Cnec 2.0        26.57    21.11     23.53
Email.cz        75.16    75.67     75.41

5.2 Evaluation for Publicly Available Phishing Datasets

The results for our implementation of the phishing random forest classifier are shown in Tab. 3. The test was performed on a randomly chosen 20% of the emails (a testing sample not used for training) from the dataset (Nazario) described in Sec. 4.1.

Table 3: Results for phishing random forest classifier on publicly available dataset.

Metric       Ham    Phishing   avg / total
Precision    0.93       0.99          0.94
Recall       1.00       0.75          0.94
F-score      0.96       0.85          0.94
Support      1400        447          1847

Domain link profile based detection is not evaluated on publicly available datasets because it would need to build the domain link profile from them, which would mean overfitting the dataset. The present domain link profile (built from Email.cz production data) also cannot be used, because many of the targets in the dataset do not exist anymore or their profile has changed significantly over the years.

5.3 Phishing Detection Results for Email.cz Traffic

Both solutions were used for the analysis of a small portion of live traffic at Email.cz. The systems were evaluated for two days; 132000 messages were analyzed during that period.

Those emails were annotated by hand. Some of the malicious URLs were detected by Phishtank (OpenDNS). Most of the analyzed emails are not unique, which made the annotation simpler. 7 phishing attacks were found in this corpus, although only 4 of the attacks were unique. The portion of attacks is therefore 0.0053%.

Table 4: Phishing detection results for Email.cz traffic.

Model                    Precision   Recall   F-score   Support
Random forest               0.0002     1.00    0.0004    132000
Random forest filtered        0.33     1.00      0.50        77
Link profile                  0.88     1.00      0.93        77
Combined                      1.00     1.00      1.00        77

The results are shown in Tab. 4. Both proposed solutions are able to detect all attacks. The random forest classifier (Sec. 3.1) by itself achieves very low precision. When we filter only the domains which are commonly targeted by phishing attacks (described in Sec. 3.5), the model achieves much better results.

In comparison, the model based on the domain link profile (described in Sec. 3.2) is even more precise. The best result was achieved with the combination of both models.
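The reported figures can be checked with a few lines of arithmetic. The sketch assumes that F-measure means the balanced F1 score (the harmonic mean of precision and recall) and takes its inputs from Table 4 and the text of this section. Because the table lists rounded precision values, recomputed scores may differ from the reported ones in the last digit.

```python
def f_measure(precision, recall):
    """Balanced F-measure (F1): harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# 7 attacks among 132000 analyzed messages, expressed in percent.
attack_share_percent = 100 * 7 / 132000

# Precision and recall values copied from Table 4.
random_forest_f = f_measure(0.0002, 1.00)
filtered_forest_f = f_measure(0.33, 1.00)
combined_f = f_measure(1.00, 1.00)
```

With perfect recall, the F-measure is driven entirely by precision, which is why the unfiltered random forest scores close to zero although it finds every attack.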

6 CONCLUSIONS AND FUTURE WORK

We evaluated two phishing detection systems. The first one uses a random forest classifier with features extracted by NER and LDA and was proposed in (Ramanathan and Wechsler, 2013). The second solution is new in this work. It uses NER to detect organizations and then compares the company URL profile to the URL profile present in the email.

Both solutions were evaluated on live traffic of the freemail provider Email.cz. The first solution achieved 50% F-measure with 100% recall. Our proposed solution achieved 100% recall with 93% F-measure. The combination of the methods achieved 100% F-measure. None of the detected phishing attacks was detected by the system currently used at Email.cz.

We optimized the solution to work also for the Czech language, but over the testing period there were only attacks in English. We suggest running this test for a longer period, which will generate more significant results. The detected attacks should be reported to Phishtank (OpenDNS) in the future. The system should also label and store phishing attacks in order to create a multi-language public dataset, which is currently missing.

REFERENCES

Spamassassin corpus. http://spamassassin.apache.org/publiccorpus/. [Online; accessed 2015-02-01].

Finkel, J. R., Grenager, T., and Manning, C. (2005). Stanford NER - incorporating non-local information into information extraction systems by Gibbs sampling. http://nlp.stanford.edu/ner. [Online; accessed 2018-09-30].

Kucherawy, M., Crocker, D., and Hansen, T. (2011). DomainKeys Identified Mail (DKIM) Signatures. RFC 6376.

Nazario, J. Phishing corpus. http://monkey.org/~jose/wiki/doku.php?id=phishingcorpus. [Online; accessed 2015-02-01].

OpenDNS. PhishTank. http://www.phishtank.com. [Online; accessed 2018-09-30].

Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., and Duchesnay, E. (2011). Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830.

Ramanathan, V. and Wechsler, H. (2013). Phishing detection and impersonated entity discovery using conditional random field and latent Dirichlet allocation. Computers & Security, 34:123–139.

Ševčíková, M., Žabokrtský, Z., and Krůza, O. (2007). Named entities in Czech: annotating data and developing NE tagger. In International Conference on Text, Speech and Dialogue, pages 188–195. Springer.

Straková, J., Straka, M., and Hajič, J. (2014). Open-Source Tools for Morphology, Lemmatization, POS Tagging and Named Entity Recognition. In Proceedings of 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pages 13–18, Baltimore, Maryland. Association for Computational Linguistics.
