The packets of each traffic flow were sorted by the time they were received or sent. Then, since prior work showed that inspecting only the first few packets of each traffic flow is sufficient for malware detection (Hwang et al., 2019), and to keep the input size of each traffic sample consistent for machine learning, only the first 10 packets of each flow were kept. Next, for each flow, the 10 packets were placed side by side so that each row of 9,600 sub-features represents one sample. This is the standard representation for each sample in this research. In the experiments, the samples are modified by including and excluding groups of sub-features according to the set of features chosen to be tested.
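For concreteness, a minimal sketch of this representation step in Python, assuming each packet has already been flattened into a 960-bit vector (10 packets × 960 bits = 9,600 sub-features) and that flows with fewer than 10 packets are zero-padded (the padding rule and the function name are assumptions, not taken from the pipeline described above):

```python
import numpy as np

SUB_FEATURES_PER_PACKET = 960  # assumed width of one flattened packet header
PACKETS_PER_FLOW = 10          # only the first 10 packets of a flow are kept

def flow_to_sample(packets):
    """Build one 9,600-sub-feature sample from a flow's time-ordered packets."""
    kept = list(packets[:PACKETS_PER_FLOW])
    # assumption: shorter flows are padded with all-zero packet vectors
    while len(kept) < PACKETS_PER_FLOW:
        kept.append(np.zeros(SUB_FEATURES_PER_PACKET, dtype=np.uint8))
    return np.concatenate(kept)  # shape: (9600,)
```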
The 170,000 samples of processed labeled traffic were split into 120,000 training samples and 50,000 testing samples. The 170,000 samples fall into 19 categories; however, some categories contained very few samples, so only the 10 categories containing the most samples were considered: benign, magichound, htbot, trickster, ursnif, artemis, trickbot, dridex, emotet, and minertorjan. After restricting the data to these 10 categories, 117,842 samples remained for training and 49,124 samples for testing.
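A sketch of this filtering step, assuming the labeled samples are held in pandas DataFrames train_df and test_df with a 'label' column (the DataFrame and column names are illustrative):

```python
import pandas as pd

TOP_K = 10  # keep only the 10 most populated categories

top_labels = train_df['label'].value_counts().nlargest(TOP_K).index
train_df = train_df[train_df['label'].isin(top_labels)]  # 117,842 samples remain
test_df = test_df[test_df['label'].isin(top_labels)]     # 49,124 samples remain
```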
Before using the samples for training and testing, the importance of each of the 36 features for malware classification was determined with a random forest implemented in sklearn (Pedregosa et al., 2018). The set of samples used to determine feature importance was all of the samples in the training set. The 117,842 samples were adjusted to suit the random forest: the 9,600 sub-features were grouped into 360 features, corresponding to the 36 header fields of each of the 10 packets in a sample. Then the binary values of the features ipv4_hl, ipv4_tl, ipv4_foff, ipv4_ttl, ipv4_cksum, tcp_seq, tcp_ackn, tcp_doff, tcp_wsize, tcp_cksum, tcp_urp, udp_len, and udp_cksum were converted into base-10 integers, since these features represent magnitudes. The binary values of the remaining features were converted into strings. Finally, because the sklearn random forest only accepts numerical inputs, the string-valued features were expanded into multiple features using one-hot encoding. After the expansion, each sample had 1,933 features.
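A sketch of this encoding step, assuming the 360 grouped features sit in a DataFrame raw_df whose columns are named like pkt3_tcp_wsize and hold per-field bit strings (the DataFrame and the column-naming scheme are assumptions):

```python
import pandas as pd

# header fields that represent magnitudes and are converted to base-10 integers
MAGNITUDE_FIELDS = {
    'ipv4_hl', 'ipv4_tl', 'ipv4_foff', 'ipv4_ttl', 'ipv4_cksum',
    'tcp_seq', 'tcp_ackn', 'tcp_doff', 'tcp_wsize', 'tcp_cksum',
    'tcp_urp', 'udp_len', 'udp_cksum',
}

def encode_field(field_name, bits):
    """Magnitude fields become integers; all other fields stay as bit strings."""
    return int(bits, 2) if field_name in MAGNITUDE_FIELDS else bits

# strip the assumed 'pktN_' prefix to recover the header field name per column
encoded = raw_df.apply(
    lambda col: col.map(lambda bits: encode_field(col.name.split('_', 1)[1], bits))
)
# one-hot encode the string-valued columns so the random forest sees only numbers
encoded = pd.get_dummies(encoded)  # 1,933 columns on this training set
```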
2.2 Random Forest
2.2.1 Decision Tree
A decision tree is a binary tree for classification tasks that contains two types of nodes: decision nodes and leaf nodes. The tree starts from a decision node, which represents the feature whose split of values gives the most information gain (computed using the training samples) compared to the other features. The decision node is connected to two child nodes by two edges, each representing one of the value sets formed by the split. Each child node can be either a leaf node, representing one of the categories in the classification, if all samples satisfying all the conditions set by the ancestors of that node belong to that category, or another decision node, chosen in the same way its ancestors were chosen given the conditions set by its ancestors. The tree is expanded by this process until it can no longer be expanded.
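As an illustration, sklearn's DecisionTreeClassifier builds a binary tree in essentially this way, growing it greedily until no further split helps; the arrays X and y below are placeholders for the encoded features and category labels:

```python
from sklearn.tree import DecisionTreeClassifier

tree = DecisionTreeClassifier(criterion='gini')  # split quality measured by the Gini index
tree.fit(X, y)                                   # grown until it cannot be expanded further
print(tree.get_depth(), tree.get_n_leaves())
```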
This research used the Gini index to measure the information gain produced by a split of values for any particular feature. The Gini index of a node is calculated by the function:
$\mathrm{GiniIndex} = 1 - \sum_{i} p_i^2$ (1)

in which $p_i$ represents the probability of a sample being in category $i$, given that the sample satisfies all the conditions set by the ancestors of the node.
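A minimal sketch of Equation (1); the function name and the NumPy dependency are illustrative, not part of the paper's implementation:

```python
import numpy as np

def gini_index(labels):
    """Gini index of a node: 1 minus the sum over categories of p_i squared."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()    # p_i: probability of each category at this node
    return 1.0 - np.sum(p ** 2)
```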
Information Gain is calculated by the function:
$IG = w_p \, G(\mathrm{parent}) - \sum_{i} w_i \, G(\mathrm{child}_i)$ (2)

in which $G$ is the Gini index, $w_p$ is the proportion of samples satisfying all the conditions set by the ancestors of the parent, and $w_i$ is the proportion of samples satisfying all the conditions set by the ancestors of child $i$.
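A corresponding sketch of Equation (2), reusing the gini_index helper above; the n_total argument (the number of samples over which the proportions $w_p$ and $w_i$ are computed) is an assumption about how the weights are normalized:

```python
def information_gain(parent_labels, child_label_lists, n_total):
    """IG = w_p * G(parent) - sum_i w_i * G(child_i)."""
    w_parent = len(parent_labels) / n_total
    ig = w_parent * gini_index(parent_labels)
    for child_labels in child_label_lists:
        w_child = len(child_labels) / n_total
        ig -= w_child * gini_index(child_labels)
    return ig
```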
2.2.2 Random Forest
A random forest is a collection of n trees, each trained on a set of samples of the same size as the entire training set, drawn randomly from the training set with replacement. The features considered by each tree were selected without replacement from all the features, and the number of features considered by each tree was the square root of the total number of features. After the n trees were trained, classification was performed by giving the input to all n trees and choosing the category returned by the most trees. In this research, 100 trees were trained in the random forest.
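As a rough sketch, this setup corresponds to sklearn's RandomForestClassifier with the parameters below; note that sklearn draws the square-root-sized feature subset at each split rather than once per tree, a small difference from the description above, and the array names X_train, y_train, and X_test are assumptions:

```python
from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier(
    n_estimators=100,     # 100 trees, as used in this research
    criterion='gini',     # information gain measured with the Gini index
    max_features='sqrt',  # square root of the number of features per split
    bootstrap=True,       # each tree sees a bootstrap sample the size of the training set
    n_jobs=-1,
)
clf.fit(X_train, y_train)
predictions = clf.predict(X_test)        # majority vote over the 100 trees
importances = clf.feature_importances_   # used for the feature ranking (Sec. 2.2.3)
```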
2.2.3 Feature Importance
The feature importance of each feature in each tree
was calculated by taking the sum of the information
gain caused by all instances of decision nodes that