tational power. However, most of the heuristics are
not adapted to all use cases, so the developer has to
manually configure the tool without efficiency guar-
antees. In our approach, we choose to adapt the scan-
ning process to each developer, thus the fine-tuning
is performed by the Leak Generator rather than the
user herself. The continuous training parameter is
the ability of the tool to re-train the machine learning
models when the user flags a discovery, so as to improve
future classifications. Open-source solutions
are more focused on single use cases, offering limited
interactions with the developers. Our approach, like the
GitGuardian platform, improves accuracy as the user reviews
findings, which decreases monitoring time. The user
experience is also key for a tool to be used efficiently.
Price could represent an important barrier for small
companies wishing to protect themselves, encouraging bad
development habits.
Commercial products provide a user interface, mak-
ing the tool more accessible to developers, and even
to non-technical people. Since the origin of a leak
does not depend on the level of expertise of the devel-
opers (Meli et al., 2019), tools with a user interface
could also be used easily by beginners to protect their
code.
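As an illustration of the continuous training parameter discussed above, the following minimal Python sketch re-trains a toy classifier each time the user flags a discovery. It is not the implementation of any tool compared here: the class name FeedbackClassifier, the token-count scoring, and the sample strings are all assumptions made for the sake of the example.

```python
import math
import re
from collections import Counter
from dataclasses import dataclass, field

TOKEN = re.compile(r"[A-Za-z0-9]+")


def tokens(text):
    """Split a code snippet into lowercase alphanumeric tokens."""
    return [t.lower() for t in TOKEN.findall(text)]


@dataclass
class FeedbackClassifier:
    """Hypothetical toy model, re-trained whenever the user flags a discovery."""
    examples: list = field(default_factory=list)      # (text, is_leak) pairs
    leak_counts: Counter = field(default_factory=Counter)
    safe_counts: Counter = field(default_factory=Counter)

    def retrain(self):
        # Rebuild the token counts from every labelled example seen so far.
        self.leak_counts = Counter()
        self.safe_counts = Counter()
        for text, is_leak in self.examples:
            target = self.leak_counts if is_leak else self.safe_counts
            target.update(tokens(text))

    def flag(self, text, is_leak):
        # User feedback: store the label, then re-train immediately.
        self.examples.append((text, is_leak))
        self.retrain()

    def score(self, text):
        # Sum of per-token log-odds with add-one smoothing.
        leak_total = sum(self.leak_counts.values())
        safe_total = sum(self.safe_counts.values())
        vocab = len(set(self.leak_counts) | set(self.safe_counts)) or 1
        total = 0.0
        for t in tokens(text):
            p_leak = (self.leak_counts[t] + 1) / (leak_total + vocab)
            p_safe = (self.safe_counts[t] + 1) / (safe_total + vocab)
            total += math.log(p_leak / p_safe)
        return total

    def is_leak(self, text):
        return self.score(text) > 0


clf = FeedbackClassifier()
clf.flag('password = "hunter2secret"', True)    # user confirms a real leak
clf.flag('password = "changeme"', False)        # user marks a false positive
print(clf.is_leak('password = "hunter2secret"'))  # True
print(clf.is_leak('password = "changeme"'))       # False
```

Re-training from the full list of labelled examples keeps the sketch simple; a real tool would more plausibly update its models incrementally or in batches.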
8 PRIVACY CONCERNS DISCLOSURE
In this paper, we deal with critical data, which could
harm users’ privacy if used for malicious purposes.
Thus, we need to discuss privacy issues in the scope of
our research. First, with regard to
the experiment shown in Section 6.1, public repositories
represent open-source data found on public websites
(in particular, github.com), while access to the
proprietary platform was granted by the company
that owns all the rights to it. In both cases, no
intrusion or hacking techniques were used to obtain
data. We ensure that the collected data are accessible
only to our working team, for analysis purposes only,
and that sensitive information has not been used to
train predictive models. The training of the models,
together with the evaluation of our approach shown in
Section 6.3, was carried out using sanitized data.
Furthermore, we did not attempt to use any actual
leaks we discovered to verify their authenticity, and
we tried, when possible, to notify the developer re-
sponsible for publishing credentials. Finally, all the
real data we collected were deleted after the experimental
evaluation of our approach.
9 CONCLUSION
We proposed an approach to detect data leaks in open-
source projects with a low false positive rate. Our
solution improves classic regular expression scanning
methods by leveraging machine learning models, filtering out
a significant number of false positives. Through our
series of experiments, we show that our approach
outperforms classic scanning methods, produces a negligible
number of undetected leaks, and results in a false
positive rate of at most 6% of the output data.
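The two-stage idea summarized above (a regular expression scan followed by a filter that discards false positives) can be sketched as follows. This is a minimal illustration under stated assumptions, not the approach evaluated in this paper: the pattern CANDIDATE, the placeholder list, and the hand-written rule looks_real are hypothetical stand-ins, the last one replacing the machine learning models. The helper false_positive_share assumes that a rate "of the output data" means the fraction of reported findings that are not true leaks.

```python
import re

# Hypothetical pattern: a quoted value assigned to a name that suggests a secret.
CANDIDATE = re.compile(
    r"""\b(\w*(?:password|secret|token|api_key)\w*)\s*[:=]\s*["']([^"']+)["']""",
    re.IGNORECASE,
)

# Hypothetical list of dummy values that usually indicate a false positive.
PLACEHOLDERS = {"changeme", "password", "example", "xxx", "todo"}


def scan(source):
    """Regex stage: return every (name, value) candidate found in the source."""
    return [(m.group(1), m.group(2)) for m in CANDIDATE.finditer(source)]


def looks_real(value):
    """Stand-in for the learned filter: a hand-written rule, not a trained model."""
    return value.lower() not in PLACEHOLDERS and len(value) >= 8


def detect(source):
    """Full pipeline: regex candidates that survive the filter."""
    return [(name, value) for name, value in scan(source) if looks_real(value)]


def false_positive_share(reported, true_leaks):
    """Fraction of reported findings that are not in the set of true leaks."""
    if not reported:
        return 0.0
    false_positives = sum(1 for item in reported if item not in true_leaks)
    return false_positives / len(reported)


source = (
    'db_password = "s3cr3t-Pr0d-9f2a"\n'
    'test_password = "changeme"\n'
    'timeout = "30"\n'
)
truth = {("db_password", "s3cr3t-Pr0d-9f2a")}
print(false_positive_share(scan(source), truth))    # 0.5
print(false_positive_share(detect(source), truth))  # 0.0
```

On this toy input, the regex stage alone reports one dummy password next to the real one, and the filter stage removes it.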
ACKNOWLEDGMENTS
We would like to thank Sabrina Kall for her help dur-
ing the writing of this paper. We also would like to
thank the Institute for artificial intelligence 3IA and
the Council of Industrial Research for Artificial
Intelligence ICAIR for their support.
REFERENCES
Alon, U., Zilberstein, M., Levy, O., and Yahav, E. (2018).
code2vec: Learning distributed representations of
code.
Bellman, R. (1957). Dynamic Programming.
Bronshtein, A. (2017). Train/test split and cross validation
in Python. Understanding Machine Learning.
Bursztein, E. The bleak picture of two-factor authentication
adoption in the wild. https://tinyurl.com/yctk4aja.
Cambronero, J., Li, H., Kim, S., Sen, K., and Chandra, S.
(2019). When deep learning met code search.
Pew Research Center (2019). Americans and digital knowledge.
https://tinyurl.com/y8ftudoh.
Dahl, G. E., Stokes, J. W., Deng, L., and Yu, D. (2013).
Large-scale malware classification using random pro-
jections and neural networks. In ICASSP.
Gelman, B., Hoyle, B., Moore, J., Saxe, J., and Slater,
D. (2018). A language-agnostic model for semantic
source code labeling. In MASES.
Gousios, G., Vasilescu, B., Serebrenik, A., and Zaidman, A.
(2014). Lean GHTorrent: GitHub data on demand. In
MSR, pages 384–387.
Guzman, E., Azócar, D., and Li, Y. (2014). Sentiment
analysis of commit comments in GitHub: an empirical
study. In MSR, pages 352–355.
Husain, H., Wu, H.-H., Gazit, T., Allamanis, M., and
Brockschmidt, M. (2019). CodeSearchNet challenge:
Evaluating the state of semantic code search.
Joulin, A., Grave, E., Bojanowski, P., and Mikolov, T.
(2016). Bag of tricks for efficient text classification.
Kalliamvakou, E., Gousios, G., Blincoe, K., Singer, L., Ger-
man, D. M., and Damian, D. (2014). The promises and
perils of mining GitHub. In MSR.
ICISSP 2021 - 7th International Conference on Information Systems Security and Privacy