written in the documentation section.
Efforts to detect mobile malware have taken various
approaches. A behavior-based method that uses
permissions and system calls as features achieved
relatively low accuracy, with an average of 60%
(Kaushik and Jain, 2015).
The reported accuracies were 65.29% with Simple
Logistic Regression, 65.29% with Naive Bayes, 70.31%
with SMO, and 54.79% with Random Tree (Kaushik and
Jain, 2015). Other research that applied the Neural
Network (NN) method to network traffic features
identified malware botnets on smartphones with a
precision of around 88.3% (Stevanovic and Pedersen,
2015). This result is much higher than those of the
Naive Bayes and logistic regression methods, which
reached 7% and 32%, respectively (Stevanovic and
Pedersen, 2015). Besides,
the neural network method outperformed the Support
Vector Machine (SVM) method in classifying network
traffic (Zhang et al., 2012). Detecting malware
through the analysis of network traffic, which is
time-series data, is therefore well suited to the
neural network method.
A weakness of that research was that the NN was
applied to all network traffic features. First, some
network traffic features play a significantly larger
role than others; for example, the network
destination port is more important than the length of
the header contents. Second, using all network
traffic features increases the internal errors
carried in the data. Third, features with large
values automatically receive a higher weight; for
example, a commonly used port number is much smaller
than the amount of data flowing across the network
(Lashkari et al., 2017). Principal Component Analysis
(PCA), on the other hand, is a method for feature
extraction. Research showed that in network traffic
classification, PCA was faster, more accurate, and
more stable than the Naive Bayes estimation method
(Yan and Liu, 2014).
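As a rough illustration of the two points above (large-valued features
dominating, and PCA as a feature extraction step), the following NumPy
sketch standardizes a small, made-up traffic feature matrix and projects
it onto its principal components. The feature columns, the sample values,
and the helper name `pca` are illustrative assumptions, not the
configuration used in this study.

```python
import numpy as np

def pca(X, n_components):
    """Project X (samples x features) onto its top principal components."""
    # Standardize so large-valued features (e.g. byte counts) do not dominate.
    mean = X.mean(axis=0)
    std = X.std(axis=0)
    std[std == 0] = 1.0  # guard against constant columns
    Z = (X - mean) / std
    # Eigen-decomposition of the covariance matrix of the standardized data.
    cov = np.cov(Z, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)
    order = np.argsort(eigvals)[::-1]  # largest variance first
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    explained = eigvals / eigvals.sum()  # share of variance per component
    components = eigvecs[:, :n_components]
    return Z @ components, components, explained

# Hypothetical flows: destination port, header length, bytes per flow.
X = np.array([
    [80.0, 20.0, 1500.0],
    [443.0, 20.0, 52000.0],
    [53.0, 8.0, 120.0],
    [8080.0, 32.0, 980000.0],
    [443.0, 20.0, 47000.0],
])
scores, components, explained = pca(X, n_components=2)
```

The `explained` vector indicates how much variance each component
carries, which is one possible basis for deciding how many components
(or which original features, via the loadings in `components`) to keep.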
This study differs from previous research in the
network traffic dataset, the combination of features,
and the iteration over NN configurations. The dataset
came from the Canadian Institute for Cybersecurity,
University of New Brunswick (Canadian Institute for
Cybersecurity, 2017), combined with sample data
collected in the Harapan Bangsa computer laboratory.
The feature set derived from PCA is compared with
features obtained from literature studies and
features chosen by the researchers. The NN
configuration is iterated programmatically, taking
into account the learning rate, the number of epochs,
and parameter evaluation. The purpose of this study
is to investigate
the combination of network traffic features that can
produce high precision, recall, and F-measure.
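One possible way to iterate over NN configurations programmatically is a
plain grid loop, sketched below. The candidate values, the function name
`search_configs`, and the `dummy_evaluate` stand-in are hypothetical
placeholders; the settings actually used in this study are not given in
this section.

```python
from itertools import product

def search_configs(evaluate, learning_rates, epoch_counts):
    """Try every combination; return (best_score, best_config, results)."""
    results = []
    for lr, epochs in product(learning_rates, epoch_counts):
        config = {"learning_rate": lr, "epochs": epochs}
        score = evaluate(config)  # e.g. F-measure on held-out traffic
        results.append((score, config))
    best_score, best_config = max(results, key=lambda r: r[0])
    return best_score, best_config, results

# Stand-in scoring function so the sketch runs on its own.
# A real one would train the NN with `config` and test it.
def dummy_evaluate(config):
    return 1.0 - abs(config["learning_rate"] - 0.01) - 1.0 / config["epochs"]

best_score, best_config, results = search_configs(
    dummy_evaluate,
    learning_rates=[0.1, 0.01, 0.001],
    epoch_counts=[50, 100],
)
```

Keeping every `(score, config)` pair in `results` makes it possible to
compare configurations afterwards rather than only reporting the best one.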
2 METHOD
2.1 Research Framework
This part outlines the research framework, namely the
indicators, proposed method, objectives, and
measurements. The indicators are the factors that
affect the results of the objectives. The first
factor is the number of packets in the analyzed
datasets. The second is the number and names of the
features used in the training and testing process.
The third is the neural network hyperparameters,
which include the number of hidden neurons, the
number of epochs, and the learning rate. The proposed
method starts from a dataset source and then enters
the feature selection stage, in which the features
are analyzed before being selected. The features
obtained from this analysis are used for training
with the neural network method, followed by the
testing phase. The test results are processed to
produce the objectives in the form of precision,
recall, and F-measure.
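For reference, the three measurements can be computed per class from the
counts of true positives, false positives, and false negatives. The
sketch below uses the standard definitions; the function name
`precision_recall_f` and the label lists are made-up illustrations, not
results from this study.

```python
def precision_recall_f(y_true, y_pred, positive):
    """Return (precision, recall, F-measure) for one class label."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # F-measure is the harmonic mean of precision and recall.
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure

# Hypothetical ground truth and predictions for six flows.
y_true = ["benign", "adware", "malware", "benign", "adware", "malware"]
y_pred = ["benign", "adware", "benign", "benign", "malware", "malware"]

benign_scores = precision_recall_f(y_true, y_pred, "benign")
malware_scores = precision_recall_f(y_true, y_pred, "malware")
```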
2.2 Flowchart
The steps of this research were arranged in the form
of a flowchart that begins with preprocessing.
Preprocessing normalizes each feature by dividing it
by that feature's maximum value, so that features
with large values do not dominate the others. The
learning stage then used the neural network method
with the backpropagation algorithm, and the testing
stage used a feed-forward pass. In the initial phase,
the weights are randomized according to the previous
provisions and stored in a weight file. The learning
stage produces new weight values, which are used in
the test phase. The test output was divided into
three classes, namely benign network traffic, adware,
and general malware. The flowchart is shown in
Figure 1.
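The steps above can be sketched in NumPy as follows: max-value
normalization, random initial weights, backpropagation training, and a
feed-forward pass for testing. This is a minimal sketch under stated
assumptions, not the implementation used in this study: the network size,
the sigmoid and softmax activations, the learning rate, the function
names, and the sample flows are all illustrative, features are assumed to
be non-negative, and the weights are kept in memory rather than written
to a weight file.

```python
import numpy as np

def max_normalize(X):
    """Divide each feature column by its maximum (assumes values >= 0)."""
    col_max = X.max(axis=0)
    col_max[col_max == 0] = 1.0  # avoid division by zero
    return X / col_max

def forward(X, W1, b1, W2, b2):
    """Feed-forward pass: sigmoid hidden layer, softmax output."""
    H = 1.0 / (1.0 + np.exp(-(X @ W1 + b1)))
    logits = H @ W2 + b2
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    P = np.exp(logits)
    return H, P / P.sum(axis=1, keepdims=True)

def train(X, y, n_hidden=4, n_classes=3, lr=0.5, epochs=2000, seed=0):
    """Full-batch gradient descent with backpropagation."""
    rng = np.random.default_rng(seed)
    W1 = rng.normal(0, 0.5, (X.shape[1], n_hidden))  # random initial weights
    b1 = np.zeros(n_hidden)
    W2 = rng.normal(0, 0.5, (n_hidden, n_classes))
    b2 = np.zeros(n_classes)
    Y = np.eye(n_classes)[y]  # one-hot targets
    for _ in range(epochs):
        H, P = forward(X, W1, b1, W2, b2)
        d_out = (P - Y) / len(X)              # softmax + cross-entropy gradient
        d_hid = (d_out @ W2.T) * H * (1 - H)  # backpropagate through sigmoid
        W2 -= lr * (H.T @ d_out)
        b2 -= lr * d_out.sum(axis=0)
        W1 -= lr * (X.T @ d_hid)
        b1 -= lr * d_hid.sum(axis=0)
    return W1, b1, W2, b2

# Made-up flows: destination port, packets per flow, bytes per flow.
X_raw = np.array([[80, 10, 1500], [443, 12, 1800],
                  [6667, 300, 90000], [6667, 280, 85000],
                  [8080, 40, 400000], [8080, 45, 420000]], dtype=float)
y = np.array([0, 0, 1, 1, 2, 2])  # 0 = benign, 1 = adware, 2 = general malware

X = max_normalize(X_raw)
weights = train(X, y)
_, probs = forward(X, *weights)
predicted = probs.argmax(axis=1)  # class index with the highest probability
```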
Neural Network with Principal Component Analysis for Malware Detection using Network Traffic Features