Other approaches rely on metadata; the best example is
information derived from SMTP traffic (e.g., sender
reputation). These approaches work well for tasks such as
spam detection, where messages are sent in bulk large enough
to build a reputation. This does not hold for phishing, which
is sent in smaller batches, often through compromised SMTP
servers. That makes reputation-based approaches ineffective
for phishing detection. A further argument against exploring
this path is that these mechanisms are already used in most
anti-spam solutions, so the attacks they can catch would not
be delivered anyway.
The most commonly used approach to phishing detection is
blacklisting, which is also true of spam detection. The
method is simple to implement: the URLs found in the email
body are looked up in the blacklist. Its benefit is very high
precision, because the lists are hand-curated. This is also
its biggest downside: new attacks are added slowly, which
causes low recall.
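The blacklist lookup described above can be sketched in a few lines of Python. This is a minimal illustration, not the implementation of any particular anti-phishing product: the blacklist entries, the URL pattern, and the function names are all assumptions introduced for the example, and matching is done on the host name only.

```python
import re
from urllib.parse import urlsplit

# Hypothetical blacklist entries; a real list would come from a curated feed.
BLACKLIST = {"phish.example", "login-verify.example"}

# Deliberately simple pattern: http(s) URLs up to the next whitespace or quote.
URL_PATTERN = re.compile(r"https?://[^\s<>\"']+", re.IGNORECASE)


def extract_hosts(body):
    """Return the lower-cased host names of all http(s) URLs in the body."""
    hosts = set()
    for url in URL_PATTERN.findall(body):
        host = urlsplit(url).hostname
        if host:
            hosts.add(host.lower())
    return hosts


def is_blacklisted(body, blacklist=BLACKLIST):
    """Flag the email if any linked host appears on the blacklist."""
    return not extract_hosts(body).isdisjoint(blacklist)
```

A host that is not yet on the list is never flagged, which is the low-recall behaviour noted above.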
2.1 Named Entity Recognition
Named entity recognition (NER) is a natural language
processing technique that maps a sequence of words to a
sequence of tags. The tags mark named entities, e.g.,
personal names, places, companies, dates, and numbers.
We use the NER implementation Nametag (Straková et al., 2014)
because it achieves the best results for the Czech language
and also performs well for English. We need to target Czech
because most Email.cz traffic is Czech (Sec. 1.1) and Czech
is more complicated for machine processing than English.
Nametag is a two-pass algorithm based on a maximum entropy
Markov model. It uses morphological analysis and other
features such as word clustering, gazetteers, and
orthographic features (capitalization, punctuation)
(Straková et al., 2014).
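To make the notion of orthographic features concrete, the following sketch computes a few surface-form indicators for a single token. It is only an illustration of this kind of feature; the feature names and the selection are assumptions and do not reproduce Nametag's actual feature set.

```python
import string


def orthographic_features(token):
    """Describe the surface form of one token (illustrative feature set)."""
    return {
        # First character is upper case, e.g. "Praha".
        "is_capitalized": token[:1].isupper(),
        # Every cased character is upper case, e.g. "IBM".
        "is_all_caps": token.isupper(),
        "has_digit": any(ch.isdigit() for ch in token),
        # ASCII punctuation only, e.g. the dot in "Seznam.cz".
        "has_punctuation": any(ch in string.punctuation for ch in token),
    }
```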
3 METHODS
We propose two methods for phishing detection. The first is
described in (Ramanathan and Wechsler, 2013), whose authors
reported an F-measure of 100%. It is a random forest
classifier based on features extracted by NER and latent
Dirichlet allocation (LDA). We re-implement this method and
validate its performance on live traffic. The method depends
on a representative dataset, which is hard to obtain for
this task.
Our second method removes this precondition by detecting
entity impersonation instead. It uses NER to detect the
companies mentioned in an email and then compares each
detected company's link profile to the email's link profile.
The email link profile is extracted from the email body. The
company link profile is built from historical data and
consists of the domains referred to in legitimate emails
sent by the company.
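The comparison of link profiles can be sketched as follows. The exact decision rule is not given in this section, so the rule used here (flag an email that mentions a known company but links to a host outside that company's profile) is an assumption, as are the profile contents and all function names.

```python
from urllib.parse import urlsplit

# Hypothetical company link profiles built from historical legitimate mail.
COMPANY_PROFILES = {
    "Example Bank": {"examplebank.example", "cdn.examplebank.example"},
}


def email_link_profile(urls):
    """Collect the set of host names linked from one email body."""
    return {host for host in (urlsplit(u).hostname for u in urls) if host}


def looks_like_impersonation(companies, urls, profiles=COMPANY_PROFILES):
    """Assumed rule: a known company is mentioned, yet some link is off-profile."""
    links = email_link_profile(urls)
    for company in companies:
        known = profiles.get(company)
        # Companies without a profile cannot be judged and are skipped.
        if known is not None and links and not links <= known:
            return True
    return False
```

Here `companies` stands for the organization names returned by the NER step and `urls` for the links extracted from the email body.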
3.1 Random Forest Classifier
We trained a model based on (Ramanathan and Wechsler, 2013).
The model consists of three parts: named entity recognition,
latent Dirichlet allocation (LDA), and a random forest
classifier.
The NER part is responsible for detecting names of people,
places, and organizations. LDA is responsible for estimating
topic probabilities. The two models work well together
because NER takes the position of the input words into
account, whereas LDA is based only on word frequencies and
is position independent (Ramanathan and Wechsler, 2013).
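The position independence of a frequency-based representation can be seen in a small example: two sentences with the same words in a different order produce identical word counts, so a model built on counts alone cannot tell them apart. The helper name and the whitespace tokenization are assumptions made for illustration.

```python
from collections import Counter


def bag_of_words(text):
    """Count lower-cased, whitespace-separated tokens, ignoring their order."""
    return Counter(text.lower().split())


first = "the bank asks the customer"
second = "the customer asks the bank"

# Same counts for both orderings; only a position-aware model sees a difference.
same_counts = bag_of_words(first) == bag_of_words(second)
```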
Stanford NER was used in the original paper (Jenny Rose
Finkel and Manning, 2005). We use Nametag because it has
more granular tags and it was tested on email language
(described in Sec. 5.1) (Straková et al., 2014). We use the
tags starting with "g" (geographical names), "i" (companies
and institutions), and "p" (personal names).
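Selecting entities by tag prefix can be sketched as below. The input format, a list of (text, tag) pairs, is an assumption about how the recognizer output has been collected, and the two-letter tags in the usage example are illustrative only.

```python
# Tag prefixes kept for the classifier: geographical names, companies and
# institutions, and personal names.
KEPT_PREFIXES = ("g", "i", "p")


def keep_relevant_entities(entities):
    """Keep (text, tag) pairs whose tag starts with one of the kept prefixes."""
    return [(text, tag) for text, tag in entities if tag.startswith(KEPT_PREFIXES)]


# Hypothetical recognizer output for one email.
example = [("Praha", "gu"), ("Seznam.cz", "if"), ("Jan", "pf"), ("2014", "ty")]
filtered = keep_relevant_entities(example)
```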
The LDA model takes the term-document matrix as input. This
matrix is created with the scikit-learn count vectorizer
(Pedregosa et al., 2011). We use 1000 features after
stop-word removal. The vectorizer is trained on 90% of the
dataset; we use this large share because the corpus is small
and we want most of the words to be present in the
vectorizer. We use 200 topics, as proposed in the original
paper. The achieved perplexity is 421, which is comparable
to the original paper.
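A minimal sketch of this step with scikit-learn is shown below, assuming the documents are available as plain strings. The function names, the random seed, and the English stop-word list are assumptions for illustration; scikit-learn ships no Czech stop-word list, so a custom list would have to be passed for Czech text.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer


def fit_topic_model(train_texts, n_topics=200, max_features=1000,
                    stop_words="english", seed=0):
    """Fit a count vectorizer and an LDA model on the training texts."""
    # Term-document counts restricted to the most frequent terms.
    vectorizer = CountVectorizer(max_features=max_features, stop_words=stop_words)
    counts = vectorizer.fit_transform(train_texts)
    lda = LatentDirichletAllocation(n_components=n_topics, random_state=seed)
    lda.fit(counts)
    return vectorizer, lda


def topic_probabilities(vectorizer, lda, texts):
    """Return one row of topic probabilities per input text."""
    return lda.transform(vectorizer.transform(texts))
```

The perplexity of a fitted model on a count matrix can be obtained with `lda.perplexity(counts)`.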
The first two models are used to create the input feature
vector for the final classifier. The feature vector consists
of 200 topic probabilities predicted by the LDA model and
40 NER-based features derived from the tags and the
entities.
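Assembling the feature vector amounts to concatenating the two parts. The sketch below only fixes the layout (200 topic probabilities followed by 40 NER-based values); how the 40 NER-based features are computed is not specified here, so they are treated as an opaque list, and the function name is an assumption.

```python
N_TOPICS = 200
N_NER_FEATURES = 40


def build_feature_vector(topic_probs, ner_features):
    """Concatenate topic probabilities and NER-based features into one vector."""
    topic_probs = list(topic_probs)
    ner_features = list(ner_features)
    if len(topic_probs) != N_TOPICS or len(ner_features) != N_NER_FEATURES:
        raise ValueError("expected 200 topic probabilities and 40 NER features")
    # Resulting length is 240: topics first, NER-based features second.
    return topic_probs + ner_features
```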
The final model consists of 200 weak learners with a maximum
depth of 5. We trained it on the dataset described in
Sec. 4.1, in the same way as the original paper. This model
architecture achieved a 100% F-measure in the original
paper. We achieved a 94% F-measure (more details in
Sec. 5.2), which is sufficient for the production test
(described in Sec. 5.3).
Our implementation in Python is using scikit-learn