The best-performing pipelines can be placed into production for processing new data and generating predictions based on model training (Rauf and Alanazi, 2014). Automated artificial intelligence (AutoAI) can also be used to ensure that the model has no inherent bias and to automate the tasks required for continuous model development. However, even with the advanced technology of AutoAI, machine learning experts can, at times, obtain better results.
We propose a model that achieves an increase in accuracy over AutoAI on Kaggle's Microsoft Malware Prediction dataset, whose task is to predict the probability of a Windows machine being infected by different malware families, based on different properties of that machine. The telemetry data containing these properties and the system infections were created by combining heartbeat and threat reports collected by Windows Defender, Microsoft's endpoint protection solution (Microsoft, 2018). The architecture that we propose is designed to detect malware with little computational power, yet with increased accuracy.
2 RELATED WORK
In (LIN, 2019), the authors described a method that used a Naïve Transfer Learning approach on Kaggle's Microsoft Malware Prediction dataset. They trained a Gradient Boosting Machine (GBM) to obtain a simple prediction model based on the training data, and then fine-tuned it to suit the test dataset. The authors tried to minimize the marginal distribution gap between the source and target domains, identified the key features for domain adaptation, and adjusted the predictions according to general statistical regularities extracted from the training set. They ran a GBM to collect each feature's importance ratings, and then picked the 20 most important categorical features for further study. This was done to simplify the problem and reduce the computational cost. The first model used these 20 most important features and achieved an accuracy of 63.7%. A second model, in which the columns with maximum mean discrepancy were removed, achieved an accuracy of 64.3%.
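A minimal sketch of such importance-based feature selection follows, assuming LightGBM as the GBM implementation and the Kaggle file layout with a binary HasDetections label; the file path and model settings are our assumptions, not the authors' code.

import lightgbm as lgb
import pandas as pd

train = pd.read_csv("train.csv")            # hypothetical path to the Kaggle training file
X = train.drop(columns=["HasDetections"])
y = train["HasDetections"]

# Encode string columns as pandas categoricals so LightGBM handles them natively.
for col in X.select_dtypes(include="object").columns:
    X[col] = X[col].astype("category")

model = lgb.LGBMClassifier(n_estimators=200)
model.fit(X, y)

# Collect each feature's importance rating and keep the 20 highest-ranked features.
importances = pd.Series(model.feature_importances_, index=X.columns)
top20 = importances.sort_values(ascending=False).head(20).index.tolist()
print(top20)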
In (Ren et al., 2018), the authors presented a lightweight security framework for mobile malware detection and categorisation. They evaluated the method on malware targeting Android devices, a platform that, because of its success and openness, is constantly under attack. They performed the analysis on a very large dataset consisting of 184,486 benign applications and 21,306 malware samples. They randomly divided the dataset into two subsets for training (80%) and testing (20%), and evaluated five classifiers: k-nearest neighbor (KNN), Ada, random forest (RF), support vector machine (SVM), and GBM. The GBM classifier achieved the best accuracy, of 96.8%. Since the gradient boosting algorithm outperformed all the other well-known algorithms in predicting malware, we decided to use it on another malware problem.
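Such a comparison can be sketched with scikit-learn and LightGBM as follows; the numeric feature matrix X and labels y, the default hyperparameters, and the reading of "Ada" as AdaBoost are our assumptions, not the setup of (Ren et al., 2018).

from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.svm import LinearSVC
from lightgbm import LGBMClassifier

# Random 80/20 split, as in the study above.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

classifiers = {
    "KNN": KNeighborsClassifier(),
    "Ada": AdaBoostClassifier(),     # assuming "Ada" refers to AdaBoost
    "RF": RandomForestClassifier(),
    "SVM": LinearSVC(),              # linear SVM chosen here for tractability on large data
    "GBM": LGBMClassifier(),
}
for name, clf in classifiers.items():
    clf.fit(X_train, y_train)
    print(name, accuracy_score(y_test, clf.predict(X_test)))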
In (Rai and Mandoria, 2019), the authors used classifiers such as XGBoost and LGBM to detect network intrusions, and evaluated them on the NSL-KDD dataset (Choudhary and Kesswani, 2020). The dataset is built on 41 features, including basic, traffic, and content features, and 21 classes of attack. The authors' experimental results showed that Gradient Boosting Decision Tree ensembles such as LGBM, XGBoost, and a stacked ensemble outperformed linear models and deep neural networks. Similarly to the previous related work, since ensemble methods outperformed linear models and a deep neural network, we decided to evaluate such methods on a more recent malware problem.
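As an illustration of this kind of stacked gradient-boosting ensemble, the following sketch combines LGBM and XGBoost base learners under a logistic-regression meta-learner; the composition of the stack and the prepared NSL-KDD splits are illustrative assumptions, not the authors' configuration.

from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from lightgbm import LGBMClassifier
from xgboost import XGBClassifier

# Stack two gradient-boosted tree models; the meta-learner combines their predictions.
stack = StackingClassifier(
    estimators=[("lgbm", LGBMClassifier()), ("xgb", XGBClassifier())],
    final_estimator=LogisticRegression(max_iter=1000),
)
stack.fit(X_train, y_train)   # X_train, y_train: numeric NSL-KDD features and attack labels (assumed)
print(stack.score(X_test, y_test))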
In (Stephan Michaels, 2019), the author proposed a method for malware prediction in which two models were trained and evaluated using LightGBM (LGBM). In the first, the dataset was cleaned and string values were encoded, after which a LightGBM model was trained. In the second, the preprocessed data from the first model was extended with new features; then, important features were selected and another LightGBM model was trained. Finally, the predictions of the two models were averaged. We replicated the experiment because the author did not present results on the Microsoft Malware Prediction data (Microsoft, 2018). We obtained an accuracy of 66.18%, which is below our score but higher than that of AutoAI.
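A minimal sketch of this two-model blend follows; the variable names for the cleaned data (X_base) and the feature-extended data (X_extended) are placeholders of ours, not the author's code.

import numpy as np
from lightgbm import LGBMClassifier

# Model 1: trained on the cleaned, label-encoded data; Model 2: on the extended feature set.
model_base = LGBMClassifier().fit(X_base, y)
model_ext = LGBMClassifier().fit(X_extended, y)

# The final prediction is the average of the two models' predicted probabilities.
pred_base = model_base.predict_proba(X_test_base)[:, 1]
pred_ext = model_ext.predict_proba(X_test_extended)[:, 1]
final_pred = (pred_base + pred_ext) / 2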
In (Onodera, 2019), the author engineered five features, which were discovered by trying hundreds of engineered variables to increase the Time Split Validation score. Each variable was added to the model one at a time and the validation score was recorded. After every variable was converted to an integer dtype, each variable was tested one by one to see whether making it categorical increased the LGBM validation score. We replicated the experiment and obtained an accuracy of 64.91%, which is below our score and very similar to that of AutoAI.
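The per-feature categorical test can be sketched as follows; the time-based split into (X_tr, y_tr) and (X_va, y_va), the candidate column list, and AUC as the validation metric are our assumptions rather than the author's exact procedure.

import pandas as pd
import lightgbm as lgb
from sklearn.metrics import roc_auc_score

def validation_auc(X_train, y_train, X_valid, y_valid):
    model = lgb.LGBMClassifier(n_estimators=200)
    model.fit(X_train, y_train)
    return roc_auc_score(y_valid, model.predict_proba(X_valid)[:, 1])

baseline = validation_auc(X_tr, y_tr, X_va, y_va)
for col in candidate_columns:                      # columns already converted to integer dtype
    cats = pd.CategoricalDtype(sorted(X_tr[col].unique()))
    X_tr2, X_va2 = X_tr.copy(), X_va.copy()
    X_tr2[col] = X_tr2[col].astype(cats)           # same category set for train and validation
    X_va2[col] = X_va2[col].astype(cats)           # unseen values become NaN (treated as missing)
    score = validation_auc(X_tr2, y_tr, X_va2, y_va)
    if score > baseline:
        print(f"treat {col} as categorical (+{score - baseline:.4f} AUC)")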
3 EXPERIMENTAL DESIGN
The goal of our experimental design is to test our framework on the Microsoft Malware Prediction dataset and compare the results with AutoAI as well as