Authors:
Emil Wåreus (1, 2); Anton Duppils (1); Magnus Tullberg (1) and Martin Hell (2)
Affiliations:
(1) Debricked AB, Malmö, Sweden
(2) Dept. of Electrical and Information Technology, Lund University, Lund, Sweden
Keyword(s):
Machine Learning, Open-Source Software, Vulnerabilities, Semi-supervised Learning, Classification.
Abstract:
Open-Source Software (OSS) is increasingly common in industry software and enables developers to build better applications, at a higher pace, and with better security. These advantages come at the cost of including vulnerabilities through these third-party libraries. The largest publicly available database of easily machine-readable vulnerabilities is the National Vulnerability Database (NVD). However, reporting to this database is a human-dependent process, and it fails to provide acceptable coverage of all open-source vulnerabilities. We propose the use of semi-supervised machine learning to classify issues as security-related, providing additional vulnerabilities in an automated pipeline. Our models, based on a Hierarchical Attention Network (HAN), outperform previously proposed models on our manually labelled test dataset, with an F1 score of 71%. Based on the results and the vast number of GitHub issues, our model potentially identifies about 191 036 security-related issues with prediction power over 80%.