We believe that a promising direction for future research is the use of pretrained code models. Applying such models to build semantic embeddings from the context of a secret (its preceding and subsequent tokens) should improve detection quality, although it will require more computing resources. Pretrained code models encode a language model of source code, typically trained with the masked-language-modeling (MLM) objective, which makes it possible to infer from context what is happening in a given code fragment, and in particular whether it contains sensitive information. A similar setting arises in code completion and in variable-name prediction, where pretrained code models perform very well (Guo et al., 2022).
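As an illustration, the context of a candidate secret can be prepared in the masked form that MLM-pretrained code encoders consume; the resulting window would then be passed to a model such as GraphCodeBERT to obtain a semantic embedding. The following is only a sketch: the whitespace-and-symbol tokenizer, the window size, and the `context_window` helper are illustrative assumptions, not part of the study, and the key shown is AWS's documented example value.

```python
import re

def context_window(tokens, idx, k=5):
    """Collect up to k tokens before and after position idx, replacing
    the candidate secret itself with a <mask> placeholder, the same
    shape of input an MLM-pretrained code encoder expects."""
    left = tokens[max(0, idx - k):idx]
    right = tokens[idx + 1:idx + 1 + k]
    return left + ["<mask>"] + right

# A line containing a candidate secret (AWS's documented example key).
line = 'aws_secret = "AKIAIOSFODNN7EXAMPLE"  # config'
tokens = re.findall(r"\w+|\S", line)
idx = tokens.index("AKIAIOSFODNN7EXAMPLE")
print(context_window(tokens, idx, k=3))
# -> ['aws_secret', '=', '"', '<mask>', '"', '#', 'config']
```

In a full pipeline, the masked window would be detokenized and encoded by the pretrained model, and the embedding of the `<mask>` position used as the feature vector for a sensitive/non-sensitive classifier.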
REFERENCES
Abnar, S., Dehghani, M., Neyshabur, B., and Sedghi, H.
(2021). Exploring the Limits of Large Scale Pre-
training. ArXiv, abs/2110.02095.
Collins, K. (2016). Developers keep leaving secret keys
to corporate data out in the open for anyone to take.
https://qz.com/674520.
Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K.
(2019). BERT: Pre-training of deep bidirectional
transformers for language understanding. In Proceed-
ings of the 2019 Conference of the North American
Chapter of the Association for Computational Lin-
guistics, pages 4171–4186, Minneapolis, Minnesota.
Ding, Z. Y., Khakshoor, B., Paglierani, J., and Rajpal,
M. (2020). Sniffing for Codebase Secret Leaks
with Known Production Secrets in Industry. CoRR,
abs/2008.05997.
Dorogush, A. V., Ershov, V., and Gulin, A. (2018). Cat-
boost: gradient boosting with categorical features sup-
port. ArXiv, abs/1810.11363.
Guo, D., Ren, S., Lu, S., Feng, Z., Tang, D., Liu, S.,
Zhou, L., Duan, N., Svyatkovskiy, A., Fu, S., Tu-
fano, M., Deng, S. K., Clement, C. B., Drain, D., Sun-
daresan, N., Yin, J., Jiang, D., and Zhou, M. (2021).
GraphCodeBERT: Pre-training code representations
with data flow. In ICLR 2021.
Guo, D., Svyatkovskiy, A., Yin, J., Duan, N.,
Brockschmidt, M., and Allamanis, M. (2022).
Learning to Complete Code with Sketches. ArXiv,
abs/2106.10158.
Kall, S. and Trabelsi, S. (2021). An Asynchronous Feder-
ated Learning Approach for a Security Source Code
Scanner. In ICISSP, pages 572–579.
Klieber, W., Flynn, L., Bhosale, A., Jia, L., and Bauer, L.
(2014). Android taint flow analysis for app sets. In
SOAP.
Knight, S. (2016). 10,000 AWS secret access keys carelessly left in code uploaded to GitHub. https://www.techspot.com/news.
Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen,
D., Levy, O., Lewis, M., Zettlemoyer, L., and Stoy-
anov, V. (2019). RoBERTa: A robustly optimized
bert pretraining approach. arXiv e-prints, page
arXiv:1907.11692.
Lounici, S., Rosa, M., Negri, C. M., Trabelsi, S., and Önen, M. (2021). Optimizing Leak Detection in Open-source Platforms with Machine Learning Techniques. In ICISSP, pages 145–159.
Manning, C. D., Raghavan, P., and Schütze, H. (2008). Introduction to Information Retrieval. Cambridge University Press, USA.
Marlow, P. (2019). Finding Secrets in Source Code the De-
vOps Way. In SANS Institute.
McCullagh, P. and Nelder, J. A. (1989). Generalized Linear Models. CRC Press, volume 37.
Meli, M., McNiece, M. R., and Reaves, B. (2019). How Bad
Can It Git? Characterizing Secret Leakage in Public
GitHub Repositories. Proceedings 2019 Network and
Distributed System Security Symposium.
Miessler, D. (2021a). 000webhost. https://github.com/
danielmiessler/SecLists/blob/master/Passwords/
Leaked-Databases/000webhost.txt.
Miessler, D. (2021b). 10 million password list top
1000000. https://github.com/danielmiessler/SecLists/
blob/master/Passwords/Common-Credentials/
10-million-password-list-top-1000000.txt.
Rahman, M. R., Imtiaz, N., Storey, M.-A., and Williams,
L. (2022). Why secret detection tools are not enough:
It’s not just about false positives — An industrial case
study. Empirical Software Engineering, 27(3):1–29.
Ray, B., Hellendoorn, V., Godhane, S., Tu, Z., Bacchelli,
A., and Devanbu, P. (2016). On the "naturalness" of buggy code. ICSE '16, pages 428–439.
Roesch, M. (1999). Snort — Lightweight Intrusion Detec-
tion for Networks. In LISA.
Saha, A., Denning, T., Srikumar, V., and Kasera, S. K.
(2020). Secrets in Source Code: Reducing False Pos-
itives using Machine Learning. In 2020 International
Conference on COMmunication Systems NETworkS
(COMSNETS), pages 168–175.
Sinha, V. S., Saha, D., Dhoolia, P., Padhye, R., and Mani, S.
(2015). Detecting and Mitigating Secret-Key Leaks
in Source Code Repositories. In Proceedings of the
12th Working Conference on Mining Software Repos-
itories, MSR ’15, pages 396–400. IEEE Press.
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones,
L., Gomez, A. N., Kaiser, L., and Polosukhin, I.
(2017). Attention is all you need. NIPS’17, pages
6000–6010.
Viennot, N., Garcia, E., and Nieh, J. (2014). A measure-
ment study of Google Play. In The 2014 ACM Inter-
national Conference on Measurement and Modeling
of Computer Systems, ser. SIGMETRICS, pages 221–
233.
Zhou, Y., Zhang, X., Jiang, X., and Freeh, V. W. (2011). Taming information-stealing smartphone applications (on Android). In TRUST.
ENASE 2023 - 18th International Conference on Evaluation of Novel Approaches to Software Engineering