EMBEDDED FEATURE SELECTION FOR SPAM AND PHISHING FILTERING USING SUPPORT VECTOR MACHINES

Sebastián Maldonado, Gastón L'Huillier

Abstract

Today, the Internet is full of harmful and wasteful elements, such as phishing and spam messages, which must be properly classified before reaching end-users. This issue has attracted the attention of the pattern recognition community and motivated research into which strategies achieve the best classification results. Several methods use as many features as the data set has content-based properties, which leads to a high-dimensional classification problem. In this context, this paper presents a feature selection approach that simultaneously determines a nonlinear classification function with minimal error and minimizes the number of features by penalizing their use in the dual formulation of binary Support Vector Machines (SVM). The method optimizes the width of an anisotropic RBF kernel via successive gradient descent steps, eliminating features that have low relevance for the model. Experiments with two real-world spam and phishing data sets demonstrate that our approach achieves the best performance among the well-known feature selection methods compared, while consistently using a small number of features.
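
The anisotropic RBF kernel mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not give the kernel's exact parameterization, so the form below, K(x, z) = exp(-sum_j (w_j (x_j - z_j))^2) with one width w_j per feature, is an assumption. The function names and the pruning threshold `eps` are hypothetical, and the gradient descent update on the widths is not shown. The point of the sketch is only that a feature whose width reaches zero no longer influences the kernel and can be dropped.

```python
import numpy as np


def anisotropic_rbf(X, Z, widths):
    """Kernel matrix with one width per feature (assumed form):
    K[i, k] = exp(-sum_j (widths[j] * (X[i, j] - Z[k, j]))**2).
    """
    Xs = X * widths  # scale each feature by its own width
    Zs = Z * widths
    # Pairwise squared distances between the scaled rows of X and Z.
    sq = (
        np.sum(Xs**2, axis=1)[:, None]
        + np.sum(Zs**2, axis=1)[None, :]
        - 2.0 * Xs @ Zs.T
    )
    # Clip tiny negative values caused by floating-point round-off.
    return np.exp(-np.maximum(sq, 0.0))


def prune_widths(widths, eps=1e-3):
    """Zero out widths smaller than eps (hypothetical threshold) and
    return the pruned widths plus the indices of the surviving features."""
    pruned = np.where(np.abs(widths) < eps, 0.0, widths)
    return pruned, np.flatnonzero(pruned)
```

With `widths = np.array([1.0, 0.0])`, for example, two points that differ only in the second feature get a kernel value of 1, which is the sense in which a zero width removes that feature from the model.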



Paper Citation


in Harvard Style

Maldonado S. and L'Huillier G. (2012). EMBEDDED FEATURE SELECTION FOR SPAM AND PHISHING FILTERING USING SUPPORT VECTOR MACHINES. In Proceedings of the 1st International Conference on Pattern Recognition Applications and Methods - Volume 2: ICPRAM, ISBN 978-989-8425-99-7, pages 445-450. DOI: 10.5220/0003782004450450


in Bibtex Style

@conference{icpram12,
author={Sebastián Maldonado and Gastón L'Huillier},
title={EMBEDDED FEATURE SELECTION FOR SPAM AND PHISHING FILTERING USING SUPPORT VECTOR MACHINES},
booktitle={Proceedings of the 1st International Conference on Pattern Recognition Applications and Methods - Volume 2: ICPRAM},
year={2012},
pages={445-450},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0003782004450450},
isbn={978-989-8425-99-7},
}


in EndNote Style

TY - CONF
JO - Proceedings of the 1st International Conference on Pattern Recognition Applications and Methods - Volume 2: ICPRAM
TI - EMBEDDED FEATURE SELECTION FOR SPAM AND PHISHING FILTERING USING SUPPORT VECTOR MACHINES
SN - 978-989-8425-99-7
AU - Maldonado S.
AU - L'Huillier G.
PY - 2012
SP - 445
EP - 450
DO - 10.5220/0003782004450450