Clustering Spam Emails into Campaigns
Mina Sheikh Alishahi, Mohamed Mejri, Nadia Tawbi
2015
Abstract
Spam emails constitute a fast-growing and costly problem associated with the Internet today. To fight spammers effectively, it is not enough to block spam messages; it is also necessary to analyze the behavior of spammers. This analysis is extremely difficult if the huge volume of spam messages is considered as a whole. Clustering spam emails into smaller groups according to their inherent similarity facilitates the discovery of spam campaigns sent by a spammer, which in turn makes the spammer's behavior easier to analyze. This paper proposes a methodology for grouping large sets of spam emails into spam campaigns on the basis of categorical attributes of the spam messages. A new informative clustering algorithm, named Categorical Clustering Tree (CCTree), is introduced to cluster and characterize spam campaigns. The complexity of the algorithm is also analyzed, and its efficiency is proven.
Paper Citation
in Harvard Style
Sheikh Alishahi M., Mejri M. and Tawbi N. (2015). Clustering Spam Emails into Campaigns. In Proceedings of the 1st International Conference on Information Systems Security and Privacy - Volume 1: ICISSP, ISBN 978-989-758-081-9, pages 90-97. DOI: 10.5220/0005244500900097
in Bibtex Style
@conference{icissp15,
author={Mina Sheikh Alishahi and Mohamed Mejri and Nadia Tawbi},
title={Clustering Spam Emails into Campaigns},
booktitle={Proceedings of the 1st International Conference on Information Systems Security and Privacy - Volume 1: ICISSP},
year={2015},
pages={90-97},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0005244500900097},
isbn={978-989-758-081-9},
}
in EndNote Style
TY - CONF
JO - Proceedings of the 1st International Conference on Information Systems Security and Privacy - Volume 1: ICISSP
TI - Clustering Spam Emails into Campaigns
SN - 978-989-758-081-9
AU - Sheikh Alishahi M.
AU - Mejri M.
AU - Tawbi N.
PY - 2015
SP - 90
EP - 97
DO - 10.5220/0005244500900097