vast difference between the two cases was primarily
due to the need to train multiple HMMs (i.e., multiple
random restarts) in cases where the amount of
training data is relatively small. Word2Vec can be
trained on short opcode sequences, since a larger
window size W effectively inflates the number of
training samples that are available.
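The effect of the window size on the amount of training data can be illustrated with a small counting experiment. The following is a minimal sketch, not the implementation used in this research: it assumes a skip-gram style pairing in which each opcode is paired with up to W neighbors on either side, and both the opcode sequence and the helper name skipgram_pairs are hypothetical.

```python
def skipgram_pairs(opcodes, window):
    """Return (center, context) pairs, pairing each opcode with up to
    `window` neighbors on each side (assumed skip-gram convention)."""
    pairs = []
    for i, center in enumerate(opcodes):
        lo = max(0, i - window)
        hi = min(len(opcodes), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, opcodes[j]))
    return pairs


# Hypothetical short opcode sequence, for illustration only.
opcodes = ["push", "mov", "call", "add", "mov", "pop", "jmp", "ret"]

for window in (1, 2, 5):
    n_pairs = len(skipgram_pairs(opcodes, window))
    print(f"W = {window}: {n_pairs} training pairs")
```

For this eight-opcode sequence, the number of pairs grows from 14 at W = 1 to 26 at W = 2 and 50 at W = 5, so the same short sequence yields several times as many training samples at the larger window sizes.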
As a future extension of this research, similar
experiments could be performed on a larger and more
diverse set of malware families. Also, here we
considered only opcode sequences; analogous experiments
on other features, such as byte n-grams or dynamic
features such as API calls, would be interesting. In
addition, other word embedding techniques could be
considered, such as those based on principal
component analysis (PCA), as in, for example,
(Chandak et al., 2021).
Further experiments involving the many parameters
found in the various machine learning techniques
considered here would be worthwhile. To mention
just one of many such examples, additional combinations
of window sizes and feature vector lengths could
be considered in Word2Vec. Finally, other machine
learning paradigms would be worth considering in the
context of malware detection based on vector embedding
features. Examples of other machine learning
approaches that could be advantageous for this problem
include adversarial networks and reinforcement
learning.
REFERENCES
Aycock, J. (2006). Computer Viruses and Malware.
Springer, New York.
Chandak, A., Lee, W., and Stamp, M. (2021). A comparison
of Word2Vec, HMM2Vec, and PCA2Vec for malware
classification. In Malware Analysis using Artificial
Intelligence and Deep Learning. Springer.
Chollet, F. (2015). Keras. https://github.com/fchollet/keras.
Cortes, C. and Vapnik, V. (1995). Support-vector networks.
Machine Learning, 20(3):273–297.
Dhanasekar, D., Di Troia, F., Potika, K., and Stamp, M.
(2018). Detecting encrypted and polymorphic malware
using hidden Markov models. In Guide to
Vulnerability Analysis for Computer Networks and
Systems: An Artificial Intelligence Approach,
pages 281–299. Springer.
Gael, V. (2014). hmmlearn.
https://github.com/hmmlearn/hmmlearn.
Kim, S. (2018). PE header analysis for malware detection.
https://scholarworks.sjsu.edu/etd_projects/624/.
Kolter, J. Z. and Maloof, M. A. (2006). Learning to detect
and classify malicious executables in the wild.
Journal of Machine Learning Research, 7:2721–2744.
Krogh, A., Brown, M., Mian, I., Sjölander, K., and
Haussler, D. (1994). Hidden Markov models in
computational biology: Applications to protein modeling.
Journal of Molecular Biology, 235(5):1501–1531.
Lo, W. W., Yang, X., and Wang, Y. (2019). An Xception
convolutional neural network for malware classification
with transfer learning. In 2019 10th IFIP
International Conference on New Technologies, Mobility
and Security (NTMS), pages 1–5.
Microsoft Security Intelligence (2020a).
Rogue:Win32/FakeRean.
Microsoft Security Intelligence (2020b).
Trojan:Win32/BHO.BO.
Microsoft Security Intelligence (2020c).
Trojan:Win32/OnLineGames.A.
Microsoft Security Intelligence (2020d).
VirTool:Win32/CeeInject.
Microsoft Security Intelligence (2020e). Win32/Renos
threat description.
Microsoft Security Intelligence (2020f). Win32/Vobfus.
Microsoft Security Intelligence (2020g).
Win32/Winwebsec threat description.
Mikolov, T., Chen, K., Corrado, G., and Dean, J. (2013a).
Efficient estimation of word representations in vector
space. https://arxiv.org/abs/1301.3781.
Mikolov, T., Chen, K., Corrado, G. S., and Dean, J. (2013b).
Efficient estimation of word representations in vector
space. https://arxiv.org/abs/1301.3781.
Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V.,
Thirion, B., Grisel, O., Blondel, M., Prettenhofer,
P., Weiss, R., Dubourg, V., Vanderplas, J., Passos,
A., Cournapeau, D., Brucher, M., Perrot, M., and
Duchesnay, E. (2011). Scikit-learn: Machine learning
in Python. Journal of Machine Learning Research,
12:2825–2830.
Popov, I. (2017). Malware detection using machine
learning based on word2vec embeddings of machine code
instructions. In 2017 Siberian Symposium on Data
Science and Engineering (SSDSE), pages 1–4.
Rabiner, L. (1989). A tutorial on hidden Markov models
and selected applications in speech recognition.
Proceedings of the IEEE, 77(2):257–286.
Sethi, A. (2019). Classification of malware models.
Shaily, S. and Mangat, V. (2015). The hidden Markov
model and its application to human activity
recognition. In 2015 2nd International Conference on
Recent Advances in Engineering Computational
Sciences (RAECS), pages 1–4.
Stamp, M. (2017). Introduction to Machine Learning with
Applications in Information Security. Chapman and
Hall/CRC, 1st edition.
Vemparala, S., Di Troia, F., Visaggio, C. A., Austin, T. H.,
and Stamp, M. (2016). Malware detection using
dynamic birthmarks. In Verma, R. M. and Rusinowitch,
M., editors, Proceedings of the 2016 ACM on
International Workshop on Security And Privacy Analytics,
pages 41–46. ACM.
Zhang, Z. (2018). Improved Adam optimizer for deep
neural networks. In 2018 IEEE/ACM 26th International
Symposium on Quality of Service (IWQoS), pages 1–2.
ForSE 2021 - 5th International Workshop on FORmal methods for Security Engineering