ANOMALY-BASED SPAM FILTERING

Igor Santos, Carlos Laorden, Xabier Ugarte-Pedrero, Borja Sanz, Pablo G. Bringas

Abstract

Spam has become a significant computer security problem because it is a channel for spreading threats such as computer viruses, worms and phishing. Currently, more than 85% of received e-mails are spam. Historical approaches to combating these messages, including simple techniques such as sender blacklisting or the use of e-mail signatures, are no longer completely reliable. Many solutions use machine-learning approaches trained on statistical representations of the terms that usually appear in e-mails. However, these methods require a time-consuming training step with labelled data. When labelled training instances are scarce, the progress of filtering systems slows down, which gives spammers an advantage. In this paper, we present the first spam filtering method based on anomaly detection, which reduces the need to label spam messages and employs only the representation of legitimate e-mails. This approach represents legitimate e-mails as word frequency vectors. An e-mail is then classified as spam or legitimate by measuring its deviation from the representation of the legitimate e-mails. We show that this method achieves high accuracy in detecting spam while maintaining a low false positive rate and reducing the effort required to label spam.
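
The idea summarised in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the whitespace tokenisation, the Euclidean distance, the mean-distance deviation score and the threshold value are all assumptions made here for illustration, and the names `term_frequencies` and `AnomalySpamFilter` are hypothetical.

```python
import math
from collections import Counter


def term_frequencies(text, vocabulary):
    """Map a text to a vector of relative word frequencies over `vocabulary`."""
    words = text.lower().split()  # assumption: naive whitespace tokenisation
    counts = Counter(words)
    total = len(words) or 1  # avoid division by zero for empty messages
    return [counts[term] / total for term in vocabulary]


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


class AnomalySpamFilter:
    """Toy anomaly detector built only from legitimate e-mails."""

    def __init__(self, legitimate_emails, threshold):
        # The vocabulary comes from legitimate mail only; no spam is labelled.
        self.vocabulary = sorted(
            {word for email in legitimate_emails for word in email.lower().split()}
        )
        self.legit_vectors = [
            term_frequencies(email, self.vocabulary) for email in legitimate_emails
        ]
        self.threshold = threshold  # assumption: chosen by hand for this sketch

    def deviation(self, email):
        """Mean distance from the e-mail to every legitimate vector."""
        vector = term_frequencies(email, self.vocabulary)
        distances = [euclidean(vector, legit) for legit in self.legit_vectors]
        return sum(distances) / len(distances)

    def is_spam(self, email):
        """Flag the e-mail as spam when it deviates more than the threshold."""
        return self.deviation(email) > self.threshold


legitimate = [
    "meeting agenda for monday",
    "please review the project report",
    "monday project meeting notes",
]
spam_filter = AnomalySpamFilter(legitimate, threshold=0.45)
print(spam_filter.is_spam("win free money now"))
print(spam_filter.is_spam("project meeting on monday"))
```

In this sketch only legitimate messages are needed to build the model; a real system would have to choose the distance measure, the way distances are combined, and the threshold empirically.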


Paper Citation


in Harvard Style

Santos I., Laorden C., Ugarte-Pedrero X., Sanz B. and G. Bringas P. (2011). ANOMALY-BASED SPAM FILTERING. In Proceedings of the International Conference on Security and Cryptography - Volume 1: SECRYPT, (ICETE 2011) ISBN 978-989-8425-71-3, pages 5-14. DOI: 10.5220/0003444700050014


in Bibtex Style

@conference{secrypt11,
author={Igor Santos and Carlos Laorden and Xabier Ugarte-Pedrero and Borja Sanz and Pablo G. Bringas},
title={ANOMALY-BASED SPAM FILTERING},
booktitle={Proceedings of the International Conference on Security and Cryptography - Volume 1: SECRYPT, (ICETE 2011)},
year={2011},
pages={5-14},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0003444700050014},
isbn={978-989-8425-71-3},
}


in EndNote Style

TY - CONF
JO - Proceedings of the International Conference on Security and Cryptography - Volume 1: SECRYPT, (ICETE 2011)
TI - ANOMALY-BASED SPAM FILTERING
SN - 978-989-8425-71-3
AU - Santos I.
AU - Laorden C.
AU - Ugarte-Pedrero X.
AU - Sanz B.
AU - G. Bringas P.
PY - 2011
SP - 5
EP - 14
DO - 10.5220/0003444700050014