Authors:
Sabrina Kall (1) and Slim Trabelsi (2)
Affiliations:
(1) EPFL, Lausanne, Switzerland; (2) SAP Labs France, Mougins, France
Keyword(s):
Federated Learning, Machine Learning, Cyber Threat Intelligence, Password Detection, Privacy, Security, Threat Awareness, Personalization.
Abstract:
Hard-coded tokens and secrets leaked through source code published on open-source platforms such as GitHub are a pervasive security threat and a time-consuming problem to mitigate. Scanners that identify leaks can speed up prevention and damage control, but such tools tend to have low precision. Attempts to improve them with machine learning have been hampered by a lack of training data, since the information the models need to learn from is by nature meant to be kept secret by its owners. This problem can be addressed with federated learning, a machine learning paradigm that allows models to be trained on local data without its owners having to share it. After local training, the personal models can be merged into a combined model, which has learned from all available data, for use by the scanner. To optimize local machine learning models so that they better identify leaks in code, we propose an asynchronous federated learning system combining personalization techniques for local models with merging and benchmarking algorithms for the global model. We propose to test this new approach on leaks collected from the code-sharing platform GitHub. This use case demonstrates the impact of our proposed approach on the accuracy of the local models employed by the code scanners, balancing federation and personalization to handle datasets that are often highly diverse and unique.
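The merge-then-personalize idea described in the abstract can be illustrated with a minimal sketch. The abstract does not specify the paper's merging, benchmarking, or personalization algorithms, so this sketch substitutes a plain sample-weighted parameter average (in the style of federated averaging) and a simple interpolation between local and global weights. It is also synchronous, whereas the proposed system is asynchronous. The names `merge_models`, `personalize`, and `alpha` are hypothetical and chosen only for illustration.

```python
from typing import Dict, List

# A model is represented here as named lists of parameters (an assumption
# made for illustration; real models would use tensors).
Weights = Dict[str, List[float]]


def merge_models(local_models: List[Weights], sample_counts: List[int]) -> Weights:
    """Merge local models into a global one by sample-weighted averaging.

    Hypothetical stand-in for the paper's merging algorithm: each client's
    parameters are weighted by the number of local training samples.
    """
    if not local_models or len(local_models) != len(sample_counts):
        raise ValueError("need one sample count per local model")
    total = sum(sample_counts)
    if total <= 0:
        raise ValueError("total sample count must be positive")

    merged: Weights = {}
    for name, reference in local_models[0].items():
        merged[name] = [
            sum(model[name][i] * n for model, n in zip(local_models, sample_counts)) / total
            for i in range(len(reference))
        ]
    return merged


def personalize(global_weights: Weights, local_weights: Weights, alpha: float) -> Weights:
    """Blend global and local parameters (hypothetical personalization step).

    alpha = 1.0 keeps the purely local model; alpha = 0.0 adopts the global one.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    return {
        name: [alpha * l + (1.0 - alpha) * g
               for l, g in zip(local_weights[name], global_weights[name])]
        for name in global_weights
    }


if __name__ == "__main__":
    client_a = {"w": [1.0, 2.0]}   # trained on 1 sample
    client_b = {"w": [3.0, 4.0]}   # trained on 3 samples
    global_model = merge_models([client_a, client_b], [1, 3])
    print(global_model)
    print(personalize(global_model, client_a, alpha=0.5))
```

Only model parameters and sample counts leave each client in this sketch; the raw training data (the secrets themselves) stays local. The `alpha` parameter is one simple way to express the balance between federation and personalization that the abstract mentions.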