unidirectional RNNs. The GRU and Bi-GRU models detect vulnerabilities more effectively than the LSTM and Bi-LSTM models when trained on the Nine-projects dataset. This is because the structure of GRU networks is better suited to small datasets. Since the Nine-projects dataset is smaller than the SARD dataset, the GRU models benefited from their smaller parameter count and lower memory consumption.
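The GRU's memory advantage follows directly from its gate count: an LSTM cell has four gate blocks while a GRU has only three. The following sketch (illustrative only, not from the paper; the layer width and embedding dimension are hypothetical, and the classic cell formulation is used, so frameworks that add extra bias terms will differ slightly) makes the arithmetic concrete:

```python
# Parameter-count arithmetic for LSTM vs. GRU cells (classic formulation).
# Each gate block has an input weight matrix (units x features),
# a recurrent weight matrix (units x units), and a bias vector (units).

def lstm_params(units: int, features: int) -> int:
    # LSTM: 4 gate blocks (input, forget, cell candidate, output).
    return 4 * (units * features + units * units + units)

def gru_params(units: int, features: int) -> int:
    # GRU: 3 gate blocks (update, reset, candidate).
    return 3 * (units * features + units * units + units)

if __name__ == "__main__":
    u, f = 64, 100  # hypothetical layer width and embedding dimension
    print(lstm_params(u, f), gru_params(u, f))  # → 42240 31680
```

With the same layer width, the GRU needs 25% fewer parameters than the LSTM (a 3:4 ratio), which translates into less memory and fewer weights to fit when training data is scarce.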
6 CONCLUSION AND FUTURE WORK
Automated detection of software vulnerabilities is an important direction in cybersecurity research. However, conventional techniques such as dynamic analysis and symbolic execution are inefficient when dealing with immense amounts of source code (Lin et al., 2019b). To enhance vulnerability discovery, deep learning techniques are needed to speed up the code analysis process. Our work presented an approach to examine the effectiveness of word embeddings combined with four deep learning models for the vulnerability detection task. The system trained and tested the models on two genres of datasets. On the synthetic dataset, all models produced adequate but nearly identical vulnerability retrieval results. In contrast, the models showed clear differences on the real-world dataset. This is worth noticing, since real vulnerable datasets drawn from released software code can be limited in size and number across varied scenarios. Thus, it is vital to select the right combination of embedding methods and neural network structures to build an effective detection system that adapts well to the dataset.
Our approach investigated the use of embedding algorithms with supervised learning methods, and the system can generate vulnerability detectors at the function level. It can serve as an assisting tool for selecting good combinations of embedding methods and deep learning models when building effective vulnerability detection systems. There are several research directions for extending our work and improving system performance. First, we can collect and build up the volume of the real vulnerable dataset to resolve the imbalance issue in the open-source dataset. Second, we can implement other embedding solutions, such as adapting an AST extractor (Kovalenko et al., 2019); this could extract different patterns of information from source code for the machine learning models to learn in a later stage. Finally, building better neural network models should be investigated to reduce the gap between natural language text and programming languages. This would allow the vulnerability detection system to learn better and adapt to other programming languages.
REFERENCES
Abadi, M., Barham, P., Chen, J., Chen, Z., Davis, A.,
Dean, J., Devin, M., Ghemawat, S., Irving, G., Isard,
M., et al. (2016). TensorFlow: A system for large-scale machine learning. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), pages 265–283.
Allamanis, M., Barr, E. T., Devanbu, P., and Sutton, C.
(2018). A survey of machine learning for big code
and naturalness. ACM Computing Surveys (CSUR),
51(4):1–37.
Black, P. E. (2018). Juliet 1.3 Test Suite: Changes From
1.2. US Department of Commerce, National Institute
of Standards and Technology.
Bojanowski, P., Grave, E., Joulin, A., and Mikolov, T.
(2017). Enriching word vectors with subword infor-
mation. Transactions of the Association for Computa-
tional Linguistics, 5:135–146.
Chollet, F. et al. (2015). Keras. https://github.com/fchollet/keras.
CVE (2019). Common vulnerabilities and exposures web-
site. https://cve.mitre.org/.
Fang, Y., Liu, Y., Huang, C., and Liu, L. (2020). FastEmbed: Predicting vulnerability exploitation possibility based on ensemble machine learning algorithm. PLoS ONE, 15(2):e0228439.
Harer, J. A., Kim, L. Y., Russell, R. L., Ozdemir, O., Kosta,
L. R., Rangamani, A., Hamilton, L. H., Centeno, G. I.,
Key, J. R., Ellingwood, P. M., et al. (2018). Auto-
mated software vulnerability detection with machine
learning. arXiv preprint arXiv:1803.04497.
Henkel, J., Lahiri, S. K., Liblit, B., and Reps, T. (2018).
Code vectors: understanding programs through em-
bedded abstracted symbolic traces. In Proceedings of
the 2018 26th ACM Joint Meeting on European Soft-
ware Engineering Conference and Symposium on the
Foundations of Software Engineering, pages 163–174.
Hochreiter, S. and Schmidhuber, J. (1997). Long short-term
memory. Neural computation, 9(8):1735–1780.
Joulin, A., Grave, E., Bojanowski, P., and Mikolov, T.
(2016). Bag of tricks for efficient text classification.
arXiv preprint arXiv:1607.01759.
Kim, Y. (2014). Convolutional neural networks for sentence
classification. arXiv preprint arXiv:1408.5882.
Kostadinov, S. (2017). Understanding GRU networks. https://www.towardsdatascience.com. Accessed 25 Jan 2020.
Kovalenko, V., Bogomolov, E., Bryksin, T., and Bacchelli,
A. (2019). Pathminer: a library for mining of path-
based representations of code. In Proceedings of the
16th International Conference on Mining Software
Repositories, pages 13–17. IEEE Press.
Kula, M. (2019). A Python implementation of GloVe: glove-python. https://github.com/maciejkula/glove-python.
The Comparison of Word Embedding Techniques in RNNs for Vulnerability Detection