code sequences and API sequences to improve the
classification accuracy. Earlier work on this problem
experimented with a single dataset that is a direct
combination of the feature sets. We have introduced
derived datasets on which multi-level classification
is performed, which greatly improves the
classification accuracy. With these improvements, our
model achieves a classification accuracy of more than
99.90%, which is better than existing approaches:
P. V. Shijo and A. Salim (Shijo and Salim, 2015)
achieved 98.7% accuracy in their work titled
"Integrated static and dynamic analysis for malware
detection", and Igor Santos et al. (Santos et al., 2013)
achieved 96.60% accuracy in their work titled
"OPEM: A Static-Dynamic Approach for
Machine-learning-based Malware Detection".
5 CONCLUSION
In this paper we have presented the detection of
malicious Windows binaries using behavioural analysis
together with machine intelligence approaches.
Feature sets such as API sequences, opcode sequences,
file meta information, custom attributes and import
functions are extracted from the binaries, and the
dataset is created by taking the union of these
feature sets. All feature sets except file meta
information are binary; file meta information is a
real-valued feature set. The derived datasets are
created by mechanisms such as (i) direct concatenation
of all individual feature sets, where machine learning
models are used to reduce the real-valued file meta
information to a single binary feature; (ii)
concatenation of the predictions made for each
individual feature set by a machine learning model
previously trained on it; and (iii) concatenation of
the individual feature set predictions across all five
classifiers previously trained on it. The results show
that the derived datasets give better performance than
the individual feature sets. Among the derived
datasets, the TYPE3 dataset outperforms the others. In
this work we have used four additional feature sets
besides opcode sequences and API sequences, along with
classifier ensemble methods, to improve classifier
performance, which in turn improves the malware
detection rate by reducing false positives. We have
achieved more than 99.90% classification accuracy with
our work. In future, we plan to extend this mechanism
to other file formats such as MS Office suite
documents and PDF. We also plan to use unsupervised
and soft computing mechanisms for automatic feature
extraction and selection.
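The derived-dataset constructions described above can be illustrated with a minimal Python sketch. It is not the implementation used in this work: the function names, the toy feature values and the threshold "models" are hypothetical stand-ins for classifiers that would already have been trained on each feature set, and fewer than five classifiers per feature set are used for brevity. The mapping of these constructions to the TYPE1 to TYPE3 names is not restated here, so descriptive names are used instead.

```python
from typing import Callable, Dict, List, Sequence

# A classifier here is any callable mapping a feature vector to 0 (benign)
# or 1 (malicious). These stand in for already-trained models.
Classifier = Callable[[Sequence[float]], int]


def direct_dataset(binary_sets: Dict[str, Sequence[int]],
                   meta: Sequence[float],
                   meta_model: Classifier) -> List[int]:
    """Concatenate the binary feature sets and append one bit for the metadata."""
    row: List[int] = []
    for name in sorted(binary_sets):  # fixed order keeps columns aligned
        row.extend(int(v) for v in binary_sets[name])
    row.append(meta_model(meta))  # real-valued set reduced to one binary feature
    return row


def per_set_predictions(feature_sets: Dict[str, Sequence[float]],
                        chosen: Dict[str, Classifier]) -> List[int]:
    """One column per feature set: the prediction of its chosen model."""
    return [chosen[name](feature_sets[name]) for name in sorted(feature_sets)]


def all_classifier_predictions(feature_sets: Dict[str, Sequence[float]],
                               models: Dict[str, List[Classifier]]) -> List[int]:
    """One column per (feature set, classifier) pair."""
    return [clf(feature_sets[name])
            for name in sorted(feature_sets)
            for clf in models[name]]


def at_least(t: int) -> Classifier:
    """Hypothetical stand-in model: flags a sample with >= t active features."""
    return lambda x: int(sum(x) >= t)


def meta_model(m: Sequence[float]) -> int:
    """Hypothetical stand-in model for the real-valued metadata."""
    return int(m[0] > 0.5)


# Toy sample with two binary feature sets and a two-value metadata vector.
binary_sets = {"api": [1, 0, 1], "opcode": [0, 0, 1]}
meta = [0.7, 12.0]
feature_sets = {**binary_sets, "meta": meta}

direct = direct_dataset(binary_sets, meta, meta_model)
per_set = per_set_predictions(
    feature_sets,
    {"api": at_least(2), "opcode": at_least(2), "meta": meta_model})
stacked = all_classifier_predictions(
    feature_sets,
    {"api": [at_least(1), at_least(3)],
     "opcode": [at_least(1), at_least(2)],
     "meta": [meta_model]})

print(direct)   # [1, 0, 1, 0, 0, 1, 1]
print(per_set)  # [1, 1, 0]
print(stacked)  # [1, 0, 1, 1, 0]
```

In each case the resulting row would be one sample of a derived dataset on which a second-level classifier is trained. The prediction-based variants have far fewer columns than the direct concatenation.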
6 FUTURE WORK
In this paper we have discussed the detection of
malicious Windows binaries only. This mechanism will
be extended to other file formats such as Microsoft
Office documents (Word, PowerPoint, Excel), Portable
Document Format (PDF) and web applications (HTML,
HTA, JS). We also plan to extend this mechanism to
other operating systems (Linux, macOS) by developing
appropriate automatic malware analysis engines. We
further plan to create and experiment with another
type of dataset, the Malware Instruction Set (MIST)
dataset discussed by Konrad Rieck et al. (Rieck et
al., 2011), with modifications to the number of levels
and argument blocks based on the output format of the
automated malware analysis framework. We also plan to
use unsupervised models such as Latent Dirichlet
Allocation (LDA) (Blei et al., 2003) to automatically
extract features from analysis logs. Soft computing
techniques will be used for feature selection and
weighting.
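As an illustration of how LDA could turn analysis logs into features, the following is a minimal sketch of LDA fitted by collapsed Gibbs sampling in plain Python. It rests on several assumptions: each log is treated as a bag of API-call tokens, the three toy traces are invented for the example, and the function name and hyperparameter values (alpha, beta, number of topics and iterations) are illustrative choices rather than settings from this work. A library implementation would normally be preferred in practice.

```python
import random
from typing import List


def lda_topic_features(docs: List[List[str]], n_topics: int,
                       n_iter: int = 100, alpha: float = 0.1,
                       beta: float = 0.01, seed: int = 0) -> List[List[float]]:
    """Return one topic-proportion vector per document (collapsed Gibbs LDA)."""
    rng = random.Random(seed)
    vocab = {w: i for i, w in enumerate(sorted({w for d in docs for w in d}))}
    n_words = len(vocab)
    doc_topic = [[0] * n_topics for _ in docs]
    topic_word = [[0] * n_words for _ in range(n_topics)]
    topic_total = [0] * n_topics

    # Random initial topic assignment for every token.
    assign: List[List[int]] = []
    for d, doc in enumerate(docs):
        assign.append([rng.randrange(n_topics) for _ in doc])
        for w, k in zip(doc, assign[d]):
            doc_topic[d][k] += 1
            topic_word[k][vocab[w]] += 1
            topic_total[k] += 1

    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                v, k = vocab[w], assign[d][i]
                # Remove the token's current assignment from the counts.
                doc_topic[d][k] -= 1
                topic_word[k][v] -= 1
                topic_total[k] -= 1
                # Resample its topic from the collapsed conditional.
                weights = [(doc_topic[d][t] + alpha)
                           * (topic_word[t][v] + beta)
                           / (topic_total[t] + n_words * beta)
                           for t in range(n_topics)]
                k = rng.choices(range(n_topics), weights=weights)[0]
                assign[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][v] += 1
                topic_total[k] += 1

    # Smoothed per-document topic proportions serve as the feature vector.
    return [[(c + alpha) / (len(doc) + n_topics * alpha) for c in counts]
            for counts, doc in zip(doc_topic, docs)]


# Invented API-call traces standing in for tokenized analysis logs.
logs = [
    ["CreateFileW", "WriteFile", "CloseHandle", "WriteFile"],
    ["RegOpenKeyExW", "RegSetValueExW", "RegCloseKey"],
    ["CreateFileW", "WriteFile", "RegSetValueExW"],
]
features = lda_topic_features(logs, n_topics=2)
for row in features:
    print([round(p, 3) for p in row])
```

Each row is a probability vector over topics, so every log is mapped to a fixed-length real-valued feature vector regardless of its length. Such vectors could then be fed to the classifiers in place of, or alongside, hand-selected features.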
REFERENCES
Blei, D. M., Ng, A. Y., and Jordan, M. I. (2003).
Latent Dirichlet allocation. J. Mach. Learn. Res.,
3:993–1022.
(BusterBSA), P. L. (2020). Buster Sandbox Analyser.
http://bsa.isoftware.nl/.
Carrera, E. (2020). pefile - Multi-platform Python module
to parse and work with Portable Executable (PE) files.
https://github.com/erocarrera/pefile.
Chollet, F. (2020). Keras: The Python Deep Learning
library - The Sequential model.
https://keras.io/getting-started/sequential-model-guide/.
Manning, C. D., Raghavan, P., and Schütze, H. (2008).
Introduction to Information Retrieval, chapter 13.5.
Cambridge University Press.
Community, C. S. (2017). Malwr (Free malware analysis
service). https://malwr.com/.
Dabah, G. (2020). Powerful Disassembler Library For
x86/AMD64. https://github.com/gdabah/distorm.
Goppit (2006). Portable executable file format – a
reverse engineer view. CodeBreakers Magazine
(Security & Anti-Security - Attack & Defense), 1 issue
2. http://index-of.es/Windows/pe/CBM 1 2 2006
Goppit PE Format Reverse Engineer View.pdf.
Hunt, G. and Brubacher, D. (1999). Detours: Binary
interception of win32 functions. In Third USENIX
Windows NT Symposium, page 8. USENIX.
Mellissa (2020). VirusShare (Repository of malware
samples). https://virusshare.com/.
Ligh, M., Adair, S., Hartstein, B., and Richard, M.
(2010). Malware Analyst’s Cookbook and DVD: Tools and
Techniques for Fighting Malicious Code. Wiley
Publishing.
IoTBDS 2021 - 6th International Conference on Internet of Things, Big Data and Security