information retrieval and text mining. The TF-IDF
value increases proportionally to the number of times
a feature appears in the document, but is offset by the
frequency of the feature in the corpus.
Note that the base of the log function does not
matter: changing it only scales every weight by the
same constant multiplicative factor. Thus, a term t
has a high TF-IDF weight when it has a high term
frequency in a given document D (i.e. the feature
is common in the document) and a low document
frequency in the whole collection of documents (i.e.
it is relatively uncommon in other documents).
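The weighting described above can be illustrated with
a minimal sketch. It assumes the common formulation
tf * log(N / df), which may differ in detail from the
variant used in the proposed system; the function
name tf_idf and the toy corpus are hypothetical.

```python
import math

def tf_idf(term, doc, corpus, log=math.log):
    """TF-IDF of `term` in `doc` (a list of tokens) against `corpus` (a list of docs).

    Assumed formulation: raw term frequency times log(N / document frequency).
    """
    tf = doc.count(term)                      # occurrences of the term in this document
    df = sum(1 for d in corpus if term in d)  # number of documents containing the term
    if tf == 0 or df == 0:
        return 0.0
    return tf * log(len(corpus) / df)

# Toy corpus of three tokenized pages (illustrative only).
corpus = [
    ["examplebank", "login", "secure", "examplebank"],
    ["shop", "login", "cart"],
    ["news", "weather", "login"],
]
doc = corpus[0]

w_brand = tf_idf("examplebank", doc, corpus)  # frequent here, absent elsewhere
w_login = tf_idf("login", doc, corpus)        # present in every document

# Switching the log base rescales every weight by the same constant.
ratio = tf_idf("examplebank", doc, corpus, log=math.log2) / w_brand
```

Here the brand term receives a positive weight, while
"login", which occurs in every document, receives a
weight of zero. The ratio between the base-2 and
natural-log weights is 1 / ln 2 for any term, which is
why the choice of base does not affect the ranking.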
The keyword extraction heuristics used in our
proposed system are as follows:
6. Domain Name Credibility: The domain name
credibility feature assesses the genuineness of the
domain hosting the page using Google’s PageRank
system. If a kit contains files for one or more target
organizations, then the rank of the hosting domain
is compared with a threshold value (usually 5 on a
0 to 10 scale) that indicates the legitimacy of the
site.
7. Domain Name Identity: Most website domain
names are related to their contents. The keywords
in a domain name are usually part of the base
domain URL. If the keyword identity set of a page
is not related to its contents, then it is phishing.
Otherwise, it is legitimate and non-kit related.
8. Out-of-Position Brand Name: Legitimate sites
often put their brand name in their domain name.
Phishing sites, on the other hand, are typically
hosted on compromised or newly registered
domains. If the domain keywords are not related
to the brand, then the page is suspicious.
9. Age of Domain: This feature checks the age of
the domain name. Many phishing pages claim the
identity of a known brand that has a relatively long
history. If the age of the domain reported by a
WHOIS lookup does not correspond to that history,
then the page is likely to be deceptive.
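As a rough illustration of heuristics 6 to 8, the
sketch below flags a page from its URL, its extracted
keywords, the claimed brand, and a rank on the 0 to 10
scale. All names (keyword_heuristics, base_domain,
RANK_THRESHOLD) are hypothetical, the rank is assumed
to be supplied by the caller, and treating a rank
below the threshold as suspicious is an assumption.
The age-of-domain check is omitted because it needs a
live WHOIS lookup.

```python
from urllib.parse import urlparse

RANK_THRESHOLD = 5  # threshold on the 0 to 10 scale mentioned in the text

def base_domain(url):
    """Naive base domain: the last two labels of the host.

    This ignores multi-label suffixes such as co.uk and IP-based hosts.
    """
    host = urlparse(url).hostname or ""
    return ".".join(host.split(".")[-2:])

def keyword_heuristics(url, page_keywords, brand, rank):
    """Return boolean flags; True means the heuristic considers the page suspicious.

    `page_keywords`, `brand` and `rank` come from the caller; how they are
    obtained (keyword extraction, rank lookup) is outside this sketch.
    """
    name = base_domain(url).lower().split(".")[0]
    return {
        # Heuristic 6: hosting domain rank below the threshold (assumed direction).
        "low_credibility": rank < RANK_THRESHOLD,
        # Heuristic 7: none of the page keywords appears in the base domain.
        "identity_mismatch": not any(k.lower() in name for k in page_keywords),
        # Heuristic 8: the claimed brand is not part of the base domain.
        "brand_out_of_position": brand.lower() not in name,
    }

legit = keyword_heuristics(
    "https://www.examplebank.com/signin", ["examplebank", "login"], "ExampleBank", 8
)
phish = keyword_heuristics(
    "http://examplebank.secure-update.example.net/login",
    ["examplebank", "login"], "ExampleBank", 1,
)
```

In the second call the brand appears only as a
subdomain label, so the base domain (example.net)
matches neither the page keywords nor the brand, and
all three flags are raised.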
The URL identity of a webpage is determined by
analyzing the patterns in its hyperlink structure.
In a legitimate website most of the links point to its
own domain or an associated domain, but in phishing
sites (including the kit-based phish sites) most of the
links point to a foreign domain to imitate the behavior
of a legitimate page. For URL identity extraction, the
SDM considers the “href” and “src” attributes of the
link-bearing elements, particularly the <a>, <area>,
<link>, <img>, and <script> tags, from the DOM tree
of a webpage. For each link, the SDM extracts the base
domain portion from the URL, and then calculates the
number of times each base domain appears. The base
domain that has the highest frequency is taken as the
URL identity of the webpage. This step is necessary
in determining the behavior of the URLs embedded in
a suspicious webpage. In the end, the following
features are considered from URL identity to
improve the detection accuracy of the proposed
system:
$hostname = gethostbyaddr($ip);
$message = "Chase Bank Spam ReZulT\n"; ...
$message .= "User ID : $user\n";
$messege .= "hostip";
$message .= "Full Name : $fullname\n"; ...
$message .= "City : $city\n";
$messege .= "port";
$message .= "State : $state\n"; ...
$message .= "Mother Maiden Name : $mmn\n";
$messege .= "@"; ...
mail($to,$subject,$message,$headers);
mail($messege,$subject,$message,$headers);
Figure 3: Sample of Drop Email Code (Cova et al., 2008).
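The URL identity extraction described before Figure 3
can be sketched as follows. This is a minimal
illustration using only the Python standard library,
not the SDM implementation; the names LinkCollector,
url_identity and base_domain are hypothetical, and the
base domain is approximated by the last two labels of
the host name.

```python
from collections import Counter
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

LINK_TAGS = {"a", "area", "link", "img", "script"}  # tags named in the text

def base_domain(url):
    """Naive base domain: last two host labels (no public-suffix or IP handling)."""
    host = urlparse(url).hostname or ""
    return ".".join(host.split(".")[-2:])

class LinkCollector(HTMLParser):
    """Counts the base domains referenced by href/src attributes."""

    def __init__(self, page_url):
        super().__init__()
        self.page_url = page_url
        self.domains = Counter()

    def handle_starttag(self, tag, attrs):
        if tag not in LINK_TAGS:
            return
        for name, value in attrs:
            if name in ("href", "src") and value:
                # Resolve relative links against the page URL before extracting the host.
                domain = base_domain(urljoin(self.page_url, value))
                if domain:  # skips mailto:, javascript: and similar host-less links
                    self.domains[domain] += 1

def url_identity(html, page_url):
    """Return the most frequently linked base domain, or None if there are no links."""
    collector = LinkCollector(page_url)
    collector.feed(html)
    if not collector.domains:
        return None
    return collector.domains.most_common(1)[0][0]

page_url = "http://phish.example.net/kit/index.html"
html = (
    '<a href="https://www.examplebank.com/login">Sign in</a>'
    '<img src="https://static.examplebank.com/logo.png">'
    '<script src="https://cdn.example.org/a.js"></script>'
    '<a href="/local">Help</a>'
)
identity = url_identity(html, page_url)
is_foreign = identity != base_domain(page_url)
```

In this toy page most links point to examplebank.com
while the page itself is served from example.net, so
the URL identity is a foreign domain, which is the
pattern the text associates with phish sites.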
10. URL of Original Site: Most phish sites keep
the URL of the original site they imitate as a
comment at the top of the HTML page, showing where
the website was copied from. If such a comment
exists on a page, then it is phishing and possibly
kit-based.
11. Presence of User-info in the Domain Name: This
feature checks for the presence of an @ symbol or a
dash (-) within the URL. If either is found, then
the page is a phish site.
12. IP Address Behavior (Either Irreversible or
Reversible): This feature checks whether the URL
of a website is a permanent IP address that does
not have DNS entries. Phishing sites commonly use
an IP address-based URL because of its low cost.
Therefore, if this feature exists, then the page is
a phish site.
13. Number of Dots in the URL: This feature counts
the number of dots in the URL, as most phishing
pages tend to use more dots in their URLs. If the
URL of a page contains an unusually high number of
dots, then it is a phish site.
14. Domain Name in the Path of the URL: This
feature checks for the presence of a dot-separated
domain or host name in the path part of the URL.
If this feature exists in a page, then it is a phish
site.
15. Presence of Foreign Anchors: This feature
examines the presence of foreign anchors in a webpage. If a