2 BACKGROUND AND LITERATURE REVIEW
Humans are the weakest link in cybersecurity. Therefore, cybercriminals target humans to bypass computer security systems and applications. In social engineering, attackers aim to steal personal information and credentials by using specific methods to manipulate users. The most popular tactics include Pretexting (impersonating an acquaintance or pretending to be someone else), Baiting (promising a gift to the victim), and Quid pro Quo (pretending to help the victim in order to obtain confidential information) (Jain and Gupta, 2016). Phishing attacks are social engineering attacks in which attackers use a baiting tactic and aim to steal the victim’s personal information by exploiting the victim’s curiosity and fear (Goodrich and Tamassia, 2018). The attacker usually sends a link accompanied by a text that triggers these emotions. If the victim clicks the link in the text, a connection is made to a mock web server or a fake site that looks like the actual website but is under the control of the attacker. By filling in the HTML forms on the fake webpage, the victim may unwittingly send their credentials to the attacker rather than to the actual website.
A phishing attack is usually executed in three
phases (Aleroud and Zhou, 2017). It starts with a
preparation phase. In that phase, the attacker pre-
pares the attack vector. For this, the attacker may use
penetration test tools or write scripts for preparing the
attack vector. The attacker then clones the original
webpage before starting a phishing campaign. After
that, the attacker picks a URL that looks compatible
with the text in the attack vector. In addition, the fake
website is served on a web server under the control
of the attacker. The second phase of the attack is the execution phase. In this phase, the attacker chooses a medium, which may be an email, an SMS, or a social media platform. The text content is adapted according to the type of medium. Attackers mostly use news media content, email content, a link to a cloud system for changing a password, or a webpage on which the victim submits credit card information for an urgent online payment. The goal of this content is to trigger the victims’ fear and curiosity. The attacker then sends the attack vector to a large number of users, including through social media platforms or advertisements, to make the attack more effective. The final phase is the exploitation phase. In this phase, the attackers use the credentials, credit card numbers, and personal information of the victims obtained during the campaign.
In the present study, we used Bag of Words (Term
Frequency-Inverse Document Frequency), and Bag of
Words (Frequency). Term Frequency-Inverse Docu-
ment Frequency (TF-IDF) is a method for extracting
the weight of a word in a text. The TF-IDF weight of a word in a text is the product of $f_{w,t}$, the frequency of the word in the text, and $\log(|T| / f_{w,T})$, where $|T|$ is the number of texts in the text corpus and $f_{w,T}$ is the frequency of the word in the text corpus.

$$w_d = f_{w,t} \cdot \log \frac{|T|}{f_{w,T}} \qquad (1)$$
To give an example, assume that an email corpus
consists of two emails: Mail-1 = {This is a phishing
attack} and Mail-2 = {This is not a phishing attack}.
When we calculate the TF-IDF values of the corpus, every word is represented by 0 except {not}, which is represented by 1. Every other word occurs twice in the two-text corpus, so its logarithmic factor is log(2/2) = 0, whereas {not} occurs once and, assuming a base-2 logarithm, receives log(2/1) = 1. Accordingly, this example shows the importance of {not} as the single term that identifies the difference between the two sentences. The Bag of Words (Frequency) method has been used in text classification for transforming sentences into vectors. This method is simpler than TF-IDF. All of the text content in our corpus was transformed into sentences represented by frequency values. Each word that we used has a specific position in the Bag of Words (Frequency) representation.
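The two-email example and the two representations described above can be illustrated with a minimal Python sketch. It is not the implementation used in this study. It assumes lowercase whitespace tokenization and a base-2 logarithm, neither of which is specified in the text, and the function names `tokenize`, `tfidf_weights`, and `bow_frequency_vectors` are hypothetical.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (an assumed, simplistic tokenizer)."""
    return text.lower().split()


def tfidf_weights(corpus):
    """Return one {word: weight} dict per text, following Equation 1.

    f_{w,t} is the count of the word in the text, |T| is the number of
    texts, and f_{w,T} is the count of the word across the whole corpus.
    The base-2 logarithm is an assumption made for this sketch.
    """
    tokenized = [tokenize(text) for text in corpus]
    corpus_counts = Counter(word for tokens in tokenized for word in tokens)
    n_texts = len(tokenized)
    weights = []
    for tokens in tokenized:
        text_counts = Counter(tokens)
        weights.append({
            word: count * math.log2(n_texts / corpus_counts[word])
            for word, count in text_counts.items()
        })
    return weights


def bow_frequency_vectors(corpus):
    """Return a sorted vocabulary and one count vector per text.

    Each word has a fixed position in every vector, given by its index
    in the vocabulary.
    """
    tokenized = [tokenize(text) for text in corpus]
    vocabulary = sorted({word for tokens in tokenized for word in tokens})
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append([counts[word] for word in vocabulary])
    return vocabulary, vectors


mails = ["This is a phishing attack", "This is not a phishing attack"]
weights = tfidf_weights(mails)
vocabulary, vectors = bow_frequency_vectors(mails)

print(weights[1])   # only "not" has a non-zero weight (1.0)
print(vocabulary)   # ['a', 'attack', 'is', 'not', 'phishing', 'this']
print(vectors)      # [[1, 1, 1, 0, 1, 1], [1, 1, 1, 1, 1, 1]]
```

Under these assumptions, the sketch reproduces the values stated in the example: {not} receives a weight of 1 and every other word receives 0, while the frequency vectors differ only at the position reserved for {not}.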
A review of the previous work reveals numer-
ous studies for mitigating phishing attacks, including
technical solutions, such as white and black listings of
phishing URLs (Jain and Gupta, 2016; Prakash et al.,
2010; Azeez et al., 2021), sending fake entries (Li
and Schmitz, 2009), and employing a one-time password (Khan, 2013). Machine learning has been widely used for detecting phishing attacks. For instance, Garera et al. used logistic regression to detect
malicious phishing URLs. Their dataset consisted of
1,245 malicious and 1,263 non-malicious URLs (Garera et al., 2007). Zareapoor et al. classified phishing attacks into two categories, namely deceptive phishing and malware phishing. They proposed a multi-step methodology to detect deceptive phishing attacks (Zareapoor and K.R., 2015). Fette et al. used not only the text content but also features such as whether emails contained JavaScript, the number of dots, and the number of reference links (Fette et al., 2007). Abu-Nimeh et al. used 71 features, 60 of them
being BoW (TF-IDF) values. They used 5,152 raw
English emails, 1,409 of them being phishing emails
(Abu-Nimeh et al., 2009). Some of those datasets
did not include social media posts. Miyamoto et al.
employed nine machine learning techniques on 1,500
phishing URLs and 1,500 legitimate URLs, by us-
ing heuristic models employed in CANTINA, also
including BoW (TF-IDF) (Miyamoto et al., 2009).
DMMLACS 2021 - 2nd International Special Session on Data Mining and Machine Learning Applications for Cyber Security