A Structural and Content-based Approach for a Precise and Robust Detection of Malicious PDF Files

Davide Maiorca, Davide Ariu, Igino Corona, Giorgio Giacinto

Abstract

Over the past years, malicious PDF files have become a serious threat to the security of modern computer systems. They have a complex structure and come in a wide variety of forms. Several academic solutions have been proposed to mitigate such attacks. However, these rely on information extracted from either the structure or the content of the PDF file, but not both. This creates problems when trying to detect non-JavaScript or targeted attacks. In this paper, we present a novel machine learning system for the automatic detection of malicious PDF documents. It extracts information from both the structure and the content of the PDF file, and it features an advanced parsing mechanism. In this way, it can detect a wide variety of attacks, including non-JavaScript and parsing-based ones. Moreover, with a careful choice of the learning algorithm, our approach provides significantly higher accuracy than other static analysis techniques, especially in the presence of adversarial malware manipulation.
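
As a rough illustration of what combining structural and content information can look like, the sketch below builds a single feature vector from raw PDF bytes. It is not the system described in the paper: the keyword list, the two content indicators, and the function names are all assumptions chosen for illustration, and a real detector would use a proper PDF parser rather than regular expressions.

```python
import re
from collections import Counter

# Hypothetical keyword list (an assumption, not the paper's feature set).
STRUCTURAL_KEYWORDS = [b"/JavaScript", b"/JS", b"/OpenAction",
                       b"/EmbeddedFile", b"/Page"]


def structural_features(raw: bytes) -> list:
    """Count how often each chosen name keyword appears in the raw bytes."""
    names = Counter(re.findall(rb"/[A-Za-z]+", raw))
    return [names[k] for k in STRUCTURAL_KEYWORDS]


def content_features(raw: bytes) -> list:
    """Coarse content indicators: number of indirect objects and streams."""
    n_objects = len(re.findall(rb"\b\d+\s+\d+\s+obj\b", raw))
    # \b keeps "endstream" from being counted as a second "stream".
    n_streams = len(re.findall(rb"\bstream\b", raw))
    return [n_objects, n_streams]


def feature_vector(raw: bytes) -> list:
    """Concatenate structural and content features into one vector."""
    return structural_features(raw) + content_features(raw)
```

A vector of this kind could then be passed to any off-the-shelf classifier; which learning algorithm the authors chose is not stated in the abstract.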


Paper Citation


in Harvard Style

Maiorca D., Ariu D., Corona I. and Giacinto G. (2015). A Structural and Content-based Approach for a Precise and Robust Detection of Malicious PDF Files. In Proceedings of the 1st International Conference on Information Systems Security and Privacy - Volume 1: ICISSP, ISBN 978-989-758-081-9, pages 27-36. DOI: 10.5220/0005264400270036


in Bibtex Style

@conference{icissp15,
author={Davide Maiorca and Davide Ariu and Igino Corona and Giorgio Giacinto},
title={A Structural and Content-based Approach for a Precise and Robust Detection of Malicious PDF Files},
booktitle={Proceedings of the 1st International Conference on Information Systems Security and Privacy - Volume 1: ICISSP},
year={2015},
pages={27-36},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0005264400270036},
isbn={978-989-758-081-9},
}


in EndNote Style

TY - CONF
JO - Proceedings of the 1st International Conference on Information Systems Security and Privacy - Volume 1: ICISSP
TI - A Structural and Content-based Approach for a Precise and Robust Detection of Malicious PDF Files
SN - 978-989-758-081-9
AU - Maiorca D.
AU - Ariu D.
AU - Corona I.
AU - Giacinto G.
PY - 2015
SP - 27
EP - 36
DO - 10.5220/0005264400270036