Igor Santos, Carlos Laorden, Pablo G. Bringas


Malware is any type of computer software harmful to computers and networks. The amount of malware is increasing every year and poses a serious global security threat. Signature-based detection is the most widely used commercial antivirus method; however, it fails to detect new and previously unseen malware. Supervised machine-learning models have been proposed to solve this issue, but supervised learning is far from perfect because it requires a significant amount of malicious code and benign software to be identified and labelled beforehand. In this paper, we propose a new method that adopts a collective learning approach to detect unknown malware. Collective classification is a type of semi-supervised learning that presents an interesting method for optimising the classification of partially-labelled data. Accordingly, we propose here, for the first time, the use of collective classification algorithms to build different machine-learning classifiers using a set of labelled (as malware or legitimate software) and unlabelled instances. We perform an empirical validation demonstrating that the labelling effort is lower than when supervised learning is used, while maintaining high accuracy rates.
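
To make the idea of learning from partially-labelled data concrete, the following is a minimal sketch and not the collective classification algorithms evaluated in the paper. It uses a simple nearest-neighbour self-training loop as a stand-in: starting from a few labelled instances, it repeatedly labels the unlabelled instance that lies closest to an already-labelled one and adds it to the labelled pool. The function name `self_train_nearest`, the two-dimensional feature vectors, and the label strings are all hypothetical and chosen only for illustration.

```python
from math import dist  # Euclidean distance between two points (Python 3.8+)


def self_train_nearest(labelled, unlabelled):
    """Label every unlabelled vector by nearest-neighbour self-training.

    labelled:   non-empty list of (vector, label) pairs
    unlabelled: list of vectors
    Returns the predicted labels in the same order as `unlabelled`.
    This is an illustrative stand-in, not the paper's method.
    """
    if not labelled:
        raise ValueError("at least one labelled instance is required")

    pool = list(labelled)                  # grows as predictions are accepted
    pending = dict(enumerate(unlabelled))  # index -> vector still to label
    predicted = {}

    while pending:
        # Find the (pending instance, pooled instance) pair with the
        # smallest distance; that pending instance is labelled next.
        idx, label, _ = min(
            (
                (i, known_label, dist(vec, known_vec))
                for i, vec in pending.items()
                for known_vec, known_label in pool
            ),
            key=lambda candidate: candidate[2],
        )
        predicted[idx] = label
        # The newly labelled instance now helps to label the rest.
        pool.append((pending.pop(idx), label))

    return [predicted[i] for i in range(len(unlabelled))]


# Hypothetical toy features (e.g. two normalised frequencies per executable).
labelled = [((0.9, 0.1), "malware"), ((0.1, 0.9), "benign")]
unlabelled = [(0.8, 0.2), (0.6, 0.35), (0.2, 0.7), (0.4, 0.6)]

labels = self_train_nearest(labelled, unlabelled)
print(labels)
```

Only two instances are labelled by hand here; the remaining four receive their labels from the structure of the data, which is the kind of reduction in labelling effort the abstract refers to.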


Paper Citation

in Harvard Style

Santos I., Laorden C. and Bringas P. G. (2011). COLLECTIVE CLASSIFICATION FOR UNKNOWN MALWARE DETECTION. In Proceedings of the International Conference on Security and Cryptography - Volume 1: SECRYPT, (ICETE 2011) ISBN 978-989-8425-71-3, pages 251-256. DOI: 10.5220/0003452802510256

in Bibtex Style

author={Igor Santos and Carlos Laorden and Pablo G. Bringas},
booktitle={Proceedings of the International Conference on Security and Cryptography - Volume 1: SECRYPT, (ICETE 2011)},

in EndNote Style

JO - Proceedings of the International Conference on Security and Cryptography - Volume 1: SECRYPT, (ICETE 2011)
SN - 978-989-8425-71-3
AU - Santos I.
AU - Laorden C.
AU - Bringas P. G.
PY - 2011
SP - 251
EP - 256
DO - 10.5220/0003452802510256