method correctly classified 98% of phishing emails
and 99.3% of 1000 legitimate emails from the
authors’ inboxes.
9 CONCLUSION
We have presented a novel approach that is simple yet
effective for detecting and classifying phishing emails.
We have shown how the unique characteristics of
Message-IDs can be exploited with n-gram analysis to
produce features that distinguish phishing emails from
legitimate ones. We studied the performance of different
classifiers on n-gram features of different orders across
several datasets. It is also the first method to apply a
confidence-weighted learning algorithm to an email
header field instead of the body. The results we obtained
are very promising considering the minimal information
our technique requires. If combined with existing
phishing detection methods based on header or body
analysis, it might reach even higher detection rates with
low false positives, and such combinations would be
more robust and harder to attack than the individual
methods.
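As an illustration of the kind of feature extraction summarized above, the
following minimal Python sketch counts character n-grams in a Message-ID
value. It is not the implementation used in our experiments: the function
names, the choice of overlapping character n-grams, the stripping of angle
brackets, and the sample Message-ID are all assumptions made for this example.

```python
from collections import Counter


def char_ngrams(text: str, n: int) -> Counter:
    """Count overlapping character n-grams of length n in text."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def message_id_features(message_id: str, max_n: int = 2) -> Counter:
    """Count character n-grams of orders 1..max_n in a Message-ID value.

    Hypothetical helper: the tokenization is an assumption for illustration.
    """
    # Drop surrounding whitespace and the angle brackets around the identifier.
    value = message_id.strip().strip("<>")
    features = Counter()
    for n in range(1, max_n + 1):
        features.update(char_ngrams(value, n))
    return features


# Made-up Message-ID, used only to show the shape of the output.
features = message_id_features("<ab12@mail.example.com>", max_n=2)
```

The resulting counts could be turned into a sparse feature vector and passed
to any of the classifiers discussed in the paper; how the vectors are
normalized or weighted is left open here.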
ACKNOWLEDGEMENTS
This research is supported in part by NSF grants CNS
1319212 and DUE 1241772.
REFERENCES
Basnet, R. B. and Sung, A. H. (2010). Classifying phishing
emails using confidence-weighted linear classifiers. In
International Conference on Information Security and
Artificial Intelligence (ISAI), pages 108–112.
Breiman, L. (1996). Bagging predictors. Machine learning,
24(2):123–140.
Breiman, L. (2001). Random forests. Machine learning,
45(1):5–32.
Chen, T.-C., Stepan, T., Dick, S., and Miller, J. (2014). An
anti-phishing system employing diffused information.
ACM Transactions on Information and System Secu-
rity (TISSEC), 16(4):16.
Costales, B., Jansen, G., Assmann, C., and Shapiro, G. N.
(2007). Sendmail (4th ed.). O’Reilly.
Crammer, K. (2009). Confidence weighted learning li-
brary. http://webee.technion.ac.il/people/koby/code-
index.html.
Fette, I., Sadeh, N., and Tomasic, A. (2007). Learning to
detect phishing emails. In Proceedings of the 16th
international conference on World Wide Web, pages
649–656. ACM.
Freund, Y. and Schapire, R. E. (1996). Experiments with a
new boosting algorithm. In Thirteenth International
Conference on Machine Learning, pages 148–156,
San Francisco. Morgan Kaufmann.
Hall, M., Frank, E., Holmes, G., Pfahringer, B., Reutemann,
P., and Witten, I. H. (2009). The WEKA data mining
software: an update. ACM SIGKDD Explorations
Newsletter, 11(1):10–18.
Hamid, I. R. A. and Abawajy, J. (2011). Hybrid feature
selection for phishing email detection. In Algorithms
and Architectures for Parallel Processing, pages 266–
275. Springer.
Irani, D., Webb, S., Giffin, J., and Pu, C. (2008). Evolution-
ary study of phishing. In eCrime Researchers Summit,
2008, pages 1–10. IEEE.
John, G. H. and Langley, P. (1995). Estimating continuous
distributions in Bayesian classifiers. In Eleventh
Conference on Uncertainty in Artificial Intelligence, pages
338–345, San Mateo. Morgan Kaufmann.
Mejer, A. and Crammer, K. (2010). Confidence
in structured-prediction using confidence-weighted
models. In Proceedings of the 2010 conference on em-
pirical methods in natural language processing, pages
971–981. Association for Computational Linguistics.
Nazario, J. (2004). The online phishing corpus.
http://monkey.org/~jose/wiki/doku.php.
Pasupatheeswaran, S. (2008). Email ‘message-ids’ helpful
for forensic analysis?
Platt, J. et al. (1999). Fast training of support vector ma-
chines using sequential minimal optimization. Ad-
vances in kernel methods-support vector learning, 3.
Quinlan, J. R. (2014). C4.5: Programs for Machine
Learning. Elsevier.
Resnick, P. (2001). Internet message format.
http://www.ietf.org/rfc/rfc2822.txt.
Apache SpamAssassin (2006). SpamAssassin public mail
corpus. https://spamassassin.apache.org/publiccorpus/.
Toolan, F. and Carthy, J. (2010). Feature selection for spam
and phishing detection. In eCrime Researchers Sum-
mit (eCrime), 2010, pages 1–12. IEEE.
Verma, R. and Hossain, N. (2014). Semantic feature selec-
tion for text with application to phishing email detec-
tion. In Information Security and Cryptology–ICISC
2013, pages 455–468. Springer.
Verma, R., Shashidhar, N., and Hossain, N. (2012). De-
tecting phishing emails the natural language way. In
ESORICS, pages 824–841.
SECRYPT 2015 - International Conference on Security and Cryptography