The Investigation of Packet Header Field Importance on Malware Classification Following Nprint Processing

Fangzhou Xing



In 2021, a research effort sought to standardize and automate the use of machine learning in network traffic analysis by introducing Nprint. Nprint converts complete packets into a fixed bit-level representation (1s, 0s, and -1s) and feeds the processed data into an AutoML system. That study reported strong performance across various network traffic analysis tasks, including malware classification. However, it did not investigate how excluding certain packet header fields affects the results. This research therefore explores how the malware classification task is affected when Nprint-processed data is restricted to selected packet header fields. A random forest was trained on Nprint-processed network traffic to determine the importance of each header field for malware classification. Only the information from the top n most important header fields was then fed into AutoGluon to measure how classification accuracy and training time changed. The research found that using only 3 of the packet header fields still achieved 99.9% of the accuracy obtained with all header fields, while reducing the training time of the best-performing model produced by AutoGluon by more than half.
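The field-ranking step described above can be sketched as follows. This is a minimal illustration, not the pipeline used in this research: it assumes that each Nprint column is named `<protocol>_<field>_<bit index>` (for example `ipv4_ttl_0`), that a per-column importance score is already available (for instance from a trained random forest), and that a field's importance is the sum of the importances of its bit columns. The function names and the scores in the usage example are hypothetical.

```python
from collections import defaultdict


def field_of(column):
    """Map a per-bit column name to its header field (assumed naming scheme)."""
    # "ipv4_ttl_3" -> "ipv4_ttl": drop the trailing bit index.
    return column.rsplit("_", 1)[0]


def field_importances(bit_importances):
    """Sum per-bit importance scores into one score per header field."""
    totals = defaultdict(float)
    for column, score in bit_importances.items():
        totals[field_of(column)] += score
    return dict(totals)


def top_n_fields(bit_importances, n):
    """Return the n header fields with the highest summed importance."""
    totals = field_importances(bit_importances)
    # Sort by descending importance; break ties by field name for determinism.
    ranked = sorted(totals, key=lambda field: (-totals[field], field))
    return ranked[:n]


def select_columns(columns, fields):
    """Keep only the per-bit columns that belong to the chosen fields."""
    keep = set(fields)
    return [column for column in columns if field_of(column) in keep]


# Illustrative scores only; these are not results from this research.
example_scores = {
    "ipv4_ttl_0": 0.25,
    "ipv4_ttl_1": 0.25,
    "tcp_wsize_0": 0.125,
    "tcp_wsize_1": 0.125,
    "ipv4_id_0": 0.1875,
    "udp_len_0": 0.0625,
}

chosen_fields = top_n_fields(example_scores, 2)
chosen_columns = select_columns(list(example_scores), chosen_fields)
print(chosen_fields)
print(chosen_columns)
```

The reduced column list would then define the training table passed to the AutoML system, so that accuracy and training time can be compared against the full-feature baseline.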


