Malware Analysis Using Transformer Based Models: An Empirical

Study

Abhishek Joshi

1,∗

, Divyateja Pasupuleti

1,∗

, Nischith P.

1,∗

, Sarvesh Sutaone

1,∗

, Soumil Ray

1,∗

Soumyadeep Dey

and Barsha Mitra

1,∗∗

Department of CSIS, BITS Pilani, Hyderabad Campus, Hyderabad, India

Microsoft, India

Keywords:

Malware, Android Malware, IoT Malware, Transformer Models, Malware Classiﬁcation.

Abstract:

The massive demand for connected and smart applications and the growth of high-speed communication tech-

nologies like 5G have led to a surge in the use of Android and Internet-of-Things (IoT) devices. The popularity

of such devices has resulted in a huge number of malware attacks and infections being inﬂicted upon these de-

vices. Cyber criminals relentlessly target the Android and IoT devices by developing new strains of malware.

To defend against these malware attacks, researchers have developed different types of malware detection

and categorization techniques. In this paper, we investigate the applicability and effectiveness of different

transformer-based models, which use self-attention to learn global dependencies and contextual information,

for malware classiﬁcation on two platforms: Android and IoT. We consider two types of inputs for malware

analysis - images and sequences. For image-based analysis, we convert Android APKs and IoT trafﬁc into

images that reﬂect their structural and behavioral features. We compare various convolutional neural network

(CNN) based models with and without transformer layers, and a pure transformer model that directly processes

the images. For sequence-based analysis, we extract the API call sequences from Android APKs, and apply

a transformer model to encode and classify them. We also explore the effect of pretraining and embedding

initialization on the transformer models. Our experiments demonstrate the advantages and limitations of using

transformer-based models for malware classiﬁcation, and provide insights into the training strategies and chal-

lenges of these models. To the best of our knowledge, this is the ﬁrst work that systematically explores and

compares different transformer-based models for malware classiﬁcation on both image and sequence inputs.

1 INTRODUCTION

Malicious applications, or malware, pose a serious

threat to the security and privacy of various comput-

ing systems, as they can damage, steal, or spy on the

data and resources of the host devices. The Internet

plays a key role in the dissemination and infection

of malware across different types of device, such as

desktops, laptops, servers, mobile phones, tablets, and

Internet-of-Things (IoT) devices. These devices have

diverse resource constraints, such as memory, bat-

tery, and processing power. Mobile phones, tablets,

and IoT devices have become essential for our daily

activities and communications. However, they also

face a surge of malware attacks, with more than 5.04

∗

Abhishek Joshi, Divyateja Pasupuleti, Nischith P.,

Sarvesh Sutaone and Soumil Ray have equal contribution

∗∗

Corresponding Author

billion incidents reported in 2022 as per 2022 Son-

icWall Cyber Threat Report

. 2023 experienced a

worldwide increase in malware volume by 2% with

a massive surge of 87% in IoT malware

. Moreover,

in 2023, 6.06 billion malware attacks were recorded

alone by researchers associated with SonicWall Cap-

ture Labs

. Android, being one of the most popular

mobile operating systems, attracts a large number of

malware developers and distributors. To combat this

challenge, various malware detection and classiﬁca-

tion techniques have been proposed and implemented

https://www.sonicwall.com/resources/white-papers/

2022-sonicwall-cyber-threat-report/

https://www.sonicwall.com/news/2023-sonicwall-cyb

er-threat-report-casts-new-light-on-shifting-front-lines-t

hreat-actor-behavior/

https://www.sonicwall.com/medialibrary/en/white-p

aper/2024-cyber-threat-report.pdf

858

Joshi, A., Pasupuleti, D., Nischith, P., Sutaone, S., Ray, S., Dey, S. and Mitra, B.

Malware Analysis Using Transformer Based Models: An Empirical Study.

DOI: 10.5220/0012855100003767

Paper published under CC license (CC BY-NC-ND 4.0)

In Proceedings of the 21st International Conference on Security and Cryptography (SECRYPT 2024), pages 858-865

ISBN: 978-989-758-709-2; ISSN: 2184-7711

in the literature.

Attention mechanism is a powerful technique that

allows a model to dynamically weigh the relevance

of different parts of the input and output sequences,

and to process them in parallel without relying on re-

current or convolutional operations. This enables the

model to capture long-range dependencies and con-

textual information in data, which are essential for

many natural language processing and computer vi-

sion tasks. Transformer architecture, which is based

on attention mechanism, has achieved remarkable re-

sults and efﬁciency in these domains, surpassing pre-

vious state-of-the-art models (Vaswani et al., 2023).

However, the applications of attention and trans-

former are not limited to natural language and com-

puter vision. Recently, they have also been explored

in various malware analysis and classiﬁcation tasks,

where they have shown promising performance and

robustness against malware obfuscation and evasion

techniques. Malware analysis and classiﬁcation is a

challenging problem that requires the model to un-

derstand the semantic and syntactic features of mali-

cious code as well as the behavioral and contextual

aspects of its execution. Attention and transformer

can potentially address these challenges by learning

high-level representations of malware from different

sources of data, such as binary, assembly, API calls,

or network trafﬁc (Seneviratne et al., 2022), (Jo et al.,

2023), (Ravi et al., 2023).

Malware classiﬁcation is a complex task that re-

quires the ability to capture the complex and diverse

behaviors of ever-evolving malicious programs. In re-

cent years, transformer-based models have been in-

creasingly used for malware categorization. In this

paper, we investigate the applicability and effective-

ness of different transformer-based models for mal-

ware classiﬁcation on two platforms: Android and

IoT. Our inputs are primarily of two types: images

and sequences. Image-based malware analysis in-

volves conversion of the Android APKs and IoT traf-

ﬁc traces into images. Such images capture the

morphological and behavioral features of the corre-

sponding data. We then compare the performance of

various convolutional neural network (CNN) based

models, such as ResNet-50 (He et al., 2015), Mo-

bilenetV2 (Sandler et al., 2018), and LCNN (Yuan

et al., 2022), with and without transformer layers on

top of them. We also evaluate the effect of a pure

transformer model, namely the Visual Transformer

(ViT), which directly processes the malware images

without any CNN layers. For sequence-based mal-

ware analysis, the API call sequences are extracted

from Android APKs. These sequences represent the

interactions between the apps and the system. We

then apply a transformer model, namely BERT (De-

vlin et al., 2018), to encode and classify the API call

sequences. Our experiments present a detailed perfor-

mance comparison of malware classiﬁcation strate-

gies using transformer-based models. Moreover, var-

ious aspects of the training strategies and the chal-

lenges of these models are also highlighted. To the

best of our knowledge, this is the ﬁrst work that ex-

plores the capabilities of different transformer-based

models for malware classiﬁcation on both image and

sequence inputs.

The rest of the paper is organized as follows. Sec-

tion 2 provides details of how Android applications

can be converted into images and sequences and the

manner in which images can be obtained from IoT

trafﬁc traces. Section 3 outlines the models consid-

ered and the experimental setup. In Section 4, we

present the details of the datasets that we have used

for our experimental analysis. Section 5 presents the

detailed experimental results and the corresponding

analysis providing insights into the relative perfor-

mances of the different models. Finally, Section 6

concludes the paper.

2 PRELIMINARIES

This section describes the approaches for generat-

ing images and sequences from Android applications

(APKs), which are the main inputs for our malware

detection and analysis models. Additionally, we illus-

trate how we transform IoT trafﬁc data into images,

which can capture the patterns and anomalies of the

network activities of the different connected devices.

Android Application to Image: One way to visu-

alize the internal structure of a binary ﬁle is to con-

vert it into an array of 8-bit unsigned integers, where

each array element represents a byte value from 0 to

255. This array can then be displayed as a grayscale

image, where darker pixels correspond to lower byte

values and lighter pixels represent higher byte values.

This image representation of a binary ﬁle is called a

byteplot and it reveals the patterns and features that

reﬂect the ﬁle’s content and format. Byteplot was ﬁrst

proposed by Conti et al. (Conti et al., 2008) as a tool

to analyze and compare binary ﬁles. Later, Nataraj

et al. (Nataraj et al., 2011) used byteplot to transform

malware samples into images and classify them using

image processing techniques. In this work, we use

byteplot to convert APK ﬁles into images and extract

features from them for malware classiﬁcation.

Android Application to Sequences: To obtain se-

quences of API invocations from an APK, we ﬁrst

need to extract the function call graph (FCG) of the

Malware Analysis Using Transformer Based Models: An Empirical Study

859

APK. FCG represents the possible paths of execution

among the methods in the APK, and the API invoca-

tions are the calls to the methods provided by the An-

droid framework or other libraries. We then apply dif-

ferent graph traversal algorithms on the FCG to gen-

erate sequences that capture the relative order of API

invocations as proposed in (Cannarile et al., 2022) in

different scenarios. These sequences are called API

sequences and they can be used to characterize the

behavior and functionality of the APK.

IoT Trafﬁc Data to Image: One of the most com-

mon sources of IoT malware analysis is pcap ﬁles,

which are a standard format for storing network pack-

ets captured by packet snifﬁng tools. These tools can

monitor network trafﬁc on a speciﬁc interface, either

in real-time or based on predeﬁned ﬁlters and set-

tings. Each packet contains information like the IP

addresses for source and destination, protocols, port

numbers, and payload data. The packets are the ba-

sic units of data transmission over a network. By

analysing pcap ﬁles, we can examine the communi-

cation patterns, anomalies, and potentially malicious

activities within IoT networks, as malware often ex-

hibits distinctive network behaviour, like using un-

usual protocols, generating high trafﬁc volume, or

connecting to suspicious IP addresses. We can also

use pcap ﬁles for behavioural analysis of IoT devices

based on how they interact with the network and other

devices using ﬂow or session data. A ﬂow is a se-

quence of packets that have common attributes, such

as the source and destination IP or port pairs and the

protocol type (e.g., TCP, UDP). A session is a bi-

directional ﬂow, i.e., the source and destination IP or

port pairs are interchangeable. To generate images

from pcap ﬁles, we use SplitCap, an open-source tool

that can split a pcap ﬁle into multiple pcap ﬁles based

on a given criterion, such as splitting per ﬂow or per

session. We also use SplitCap to select whether to ex-

tract only the application layer data (the 7th layer in

the OSI model) or the whole packet. After splitting,

we either trim the ﬁles or pad the ﬁles with 0s until

they reach a uniform size that can be reshaped into an

NxN image, for example, 784 bytes for a 28x28 im-

age. We then read the ﬁles byte-wise and append the

values to an array, where each value represents the

pixel intensity. We reshape the array to the desired

size, resulting in an NxN image.

3 METHODOLOGIES

To explore the potential and performance of different

transformer-based models for malware classiﬁcation,

we conduct experiments using Android APKs and IoT

trafﬁc traces. We choose data corresponding to the

Android and IoT platforms because they are widely

used and are vulnerable to various types of malware

attacks. We design two types of inputs for our mod-

els: images and sequences, which capture different

aspects of malware features.

For image-based malware analysis, we transform

the raw bytes of Android APKs and IoT trafﬁc

into grayscale images (as described in Section 2).

We adopt several CNN based models, which are

commonly used for image recognition tasks, and

enhance them with transformer layers, which can

learn global dependencies and attention among im-

age patches. Speciﬁcally, we compare the follow-

ing models: ResNet-50 (He et al., 2015), which uses

residual connections and deep layers to achieve high

accuracy; MobilenetV2 (Sandler et al., 2018), which

uses depth wise separable convolutions and inverted

residuals to reduce the computational cost and model

size; and LCNN (Yuan et al., 2022), which uses lo-

cal convolutional layers to capture local features and

reduce the number of parameters. We also append a

transformer encoder and a classiﬁcation head on top

of these CNN models, and examine how the trans-

former layers affect the classiﬁcation performance

and the model complexity. Furthermore, we evaluate

a pure transformer model, namely Vision Transformer

(ViT) (Dosovitskiy et al., 2020). ViT does not em-

ploy any CNN layers and directly processes the image

patches. This model relies on the transformer encoder

and the classiﬁcation head to encode and classify the

malware images.

API call sequences imbibe the high-level interac-

tions between Android apps and the system. Such

sequences can be inspected to classify malware. To

analyze malware based on their API call sequences,

we employ a transformer model architecture, namely

BERT (Devlin et al., 2018), that can encode and

comprehend natural language sequences. We train

BERT from scratch on our malware dataset, and use

its output embeddings and a classiﬁcation head to

distinguish the API call sequences. We experiment

with two methods for initializing the embeddings of

the APIs. In one method, we assign a unique 128-

dimensional embedding to each API in the dataset,

and learn them using a linear projection layer. In

another method, we use a Skip-gram model similar

to Word2Vec (Mikolov et al., 2013) to generate vec-

tor representations for the API calls. The Skip-gram

model is adapted from natural language processing

to learn embeddings from large corpora of Android

APKs. It processes pairs of API calls that occur

within the same method, and assigns a unique index

to each API call in the vocabulary. The model iterates

SECRYPT 2024 - 21st International Conference on Security and Cryptography

860

through batches of API-API pairs from both benign

and malicious APKs during training.

4 DATASET

In this section, we describe the datasets that we have

considered for our experiments.

AndroZoo: The AndroZoo dataset was ﬁrst pub-

lished in (Allix et al., 2016). Since then, the dataset

has been continuously augmented. In its current

form, AndroZoo consists of 24,470,628 APKs

collected from various sources, including the Google

Play app market and is possibly the largest malware

dataset. These APKs are analyzed by numerous

anti-virus tools to identify malicious and benign

contents. Androzoo uses a vt detection (Allix et al.,

2016) mechanism which is the number of virus total

engines that detected an APK as malicious. For the

set of APKs that we have considered, we have used

a criterion that an APK is considered as malicious

if it is identiﬁed as malicious by a minimum of 5

engines, otherwise the APK is considered as benign.

We randomly chose APKs such that the total size

of the APKs is not more than 150 GB. Thus, we

considered 1,23,880 malicious APKs and 3,23,095

benign APKs.

MALNET-IMAGE and MALNET-IMAGE TINY

Datasets: The MALNET-IMAGE dataset (Freitas

et al., 2022) consists of 12,62,024 images from APKs

in the Androzoo dataset, which spans 47 types and

696 families of malware. The MALNET-IMAGE

Tiny dataset is a reduced version of MALNET-

IMAGE that removes the four most common

malware types in the original dataset. In the rest of

the paper, we will refer to MALNET-IMAGE-Tiny

dataset as MALNET-Tiny. Freitas et al. generated

images from APKs by extracting the DEX ﬁle from

each APK and converting it into a byte sequence,

which correspond to a 1D array of pixel values

between 0 and 255. They then converted the 1D array

into a 2D array using linear plotting with the size

parameters recommended in (Nataraj et al., 2011),

thereby creating a grayscale image. They resized the

image to 256x256 and assigned a colour to each byte

based on its position in the DEX ﬁle by adopting the

approach in (Gennissen and Blasco, 2017).

Drebin: Drebin (Arp et al., 2014) is a public dataset

for Android malware family classiﬁcation, consisting

of 1,29,013 applications collected from various

sources over two years (August 2010 - October

2012). Among them, 5,560 are malicious, belonging

to 179 families, some of which are still active in the

market. The malicious applications have been identi-

ﬁed by at least two of the ten anti-virus scanners used.

The dataset is highly imbalanced in terms of malware

family distribution. Therefore, we focus on the top

20 malware families for our experiments. Drebin

is one of the largest and most widely used datasets

for Android malware classiﬁcation and provides a

valuable resource for testing and developing malware

identiﬁcation algorithms.

CICAndMal2017: CICAndMal2017 (Lashkari

et al., 2018) is a collection of Android applications

with different security labels, obtained from various

sources. The dataset consists of 5,491 Android appli-

cations, out of which 426 applications are malware

and 5,065 applications are benign. The malware

samples belong to four categories: Ransomware,

Adware, SMS Malware and Scareware.

IoT dataset (USTC-TFC2016): The USTC-

TFC2016 dataset (Wang et al., 2017) is a collection

of 20 raw network trafﬁc capture ﬁles (pcap) that

represent 20 different classes of trafﬁc. Out of

the 20 classes, 10 are malicious and the remaining

10 are benign. The malicious trafﬁc classes were

obtained from public sources by Stratosphere Re-

search Laboratory, and these classes include various

types of malware and botnet activities. Due to the

large sizes of some of the malicious trafﬁc ﬁles,

only segments of them were used for the analysis,

while smaller ﬁles that were generated by the same

applications were merged together. The benign trafﬁc

classes were generated by IXIA BPS, a professional

network trafﬁc simulation tool that can emulate

realistic network scenarios. The benign trafﬁc classes

cover eight common application categories, such as

MySQL, Facetime, and others and reﬂect typical

network usage patterns.

5 RESULTS AND DISCUSSION

Table 1 shows the performance of different image-

based models for Android malware classiﬁcation on

four datasets: MALNET-Tiny (Freitas et al., 2022),

Drebin Top 10, Drebin Top 20 (Arp et al., 2014), and

CICAndMal2017 (Lashkari et al., 2018). The met-

rics used to evaluate the models are Accuracy (Acc)

and F1-score (F1-s), which is the harmonic mean of

precision and recall. The models include ResNet-

50 (He et al., 2015), MobileNetV2 (Sandler et al.,

2018), and LCNN (0.25) (Yuan et al., 2022). These

are all convolutional neural networks (CNNs) with

different architectures and complexities. The table

also shows the variants of these models with one or

two transformer layers (1T or 2T) attached after the

CNN feature extraction. The rationale behind us-

Malware Analysis Using Transformer Based Models: An Empirical Study

861

Table 1: Results for Android Malware Classiﬁcation using CNN and Transformer based Models on 4 datasets: MALNET-Tiny

(43 malware classes), Drebin Top 10 (top 10 malware classes), Drebin Top 20 (top 20 malware classes), and CICAndMal2017

(1 benign, and 4 malware classes). Acc represents Accuracy and F1-s denotes F1-score. NA values indicate that pretaining

has been done on the respective dataset. For a dataset, values in boldface correspond to the best performing model. The gray

colored cells represent the best performance across a model and its two variations for a speciﬁc dataset.

MALNET-Tiny Drebin Top 10 Drebin Top 20 CICAndMal2017

Model

Acc F1-s Acc F1-s Acc F1-s Acc F1-s

ResNet-50 70.05 70.95 90.56 90.60 86.88 86.99 75.00 75.05

ResNet-50-1T 74.23 75.48 87.95 88.64 84.84 86.10 64.21 63.94

ResNet-50-2T 72.01 72.66 89.56 89.70 80.86 82.50 60.78 61.21

MobileNetV2 74.54 75.11 87.45 87.65 84.62 85.14 69.61 69.64

MobileNetV2-1T 72.29 73.19 81.12 81.77 75.91 77.42 66.18 67.01

MobileNetV2-2T 71.30 72.72 75.78 77.00 74.73 77.03 66.18 61.80

LCNN (0.25) 70.68 71.52 90.56 90.63 84.30 86.59 77.94 78.22

LCNN (0.25)-1T 71.27 72.20 88.57 89.38 82.69 84.01 74.51 70.81

LCNN (0.25)-2T 68.91 69.76 85.22 85.51 77.63 79.93 67.64 67.85

ViT 70.77 70.26 66.83 68.06 64.19 63.66 58.33 58.30

ViT-PreMALNET-Tiny NA NA 70.65 71.23 61.74 56.99 68.89 66.27

ViT-PreMALNET NA NA 60.41 59.52 49.42 41.55 61.24 62.71

ing transformer layers is that they are expected to

capture the global dependencies and long-range in-

teractions among the image pixels. Table 1 also in-

cludes results for the visual image transformer model,

ViT (Dosovitskiy et al., 2020). ViT is the model that

uses only transformer layers for feature extraction and

classiﬁcation, without any convolutional layers. Ad-

ditionally, the table shows the results for ViT mod-

els pretrained on MALNET-Tiny (model name ViT-

PreMALNET-Tiny) and MALNET (model name ViT-

PreMALNET) datasets, which are larger and more di-

verse than the target datasets. From the table, we can

observe the following:

• ResNet-50 is the best performing CNN model

on Drebin Top 10 and CICAndMal2017 datasets,

achieving the highest Accuracies and F1-scores

among all the models. However, it is outper-

formed by MobileNetV2 and LCNN (0.25) on

MALNET-Tiny dataset, and by LCNN (0.25) on

Drebin Top 20 dataset. This suggests that ResNet-

50 is more effective on datasets with fewer classes

and more balanced samples, but may suffer from

overﬁtting or complexity issues on datasets with

more classes and imbalanced samples.

• Adding one or two transformer layers after CNN

feature extraction generally reduces the perfor-

mance of the models on all the datasets, except for

LCNN (0.25)-1T on MALNET-Tiny dataset. This

indicates that the transformer layers may not be

beneﬁcial for the Android malware image classi-

ﬁcation task, as they may introduce more parame-

ters, noise, or redundancy into the models. Alter-

natively, the transformer layers may require more

data or ﬁne-tuning to adapt to the task.

• MobileNetV2 and LCNN (0.25) are the best per-

forming lightweight CNN models on the datasets.

These two models achieve comparable or bet-

ter results than ResNet-50 on MALNET-Tiny and

Drebin Top 20 datasets, and exhibit slightly lower

performance on Drebin Top 10 and CICAnd-

Mal2017 datasets. These models have fewer

parameters and lower computational cost than

ResNet-50, and may be more suitable for mobile

or resource-constrained devices.

• ViT is the worst performing model on all the

datasets, achieving the lowest accuracy and F1-

score values among all the models. This suggests

that ViT is not well suited for image-based An-

droid malware classiﬁcation. The reason for this

can be because ViT may not be able to capture the

semantic or structural features of the images, or

may require more data or ﬁne-tuning to adapt to

the task.

• Pretraining ViT on MALNET-Tiny or MALNET-

IMAGE datasets slightly improves the perfor-

mance of ViT on some of the datasets, but not

on others. For example, ViT-PreMALNET-Tiny

achieves higher accuracies and F1-scores than

ViT on Drebin Top 10 and CICAndMal2017

datasets, but lower metric values on Drebin Top

20 dataset. ViT-PreMALNET achieves higher

accuracy and F1-score than ViT on CICAnd-

Mal2017 dataset, but lower scores on Drebin Top

10 and Drebin Top 20 datasets. This indicates that

the pretraining data may not be representative or

sufﬁcient for the target datasets, or that the trans-

fer learning may not be effective for the task.

Table 2 shows the results of using BERT to clas-

SECRYPT 2024 - 21st International Conference on Security and Cryptography

862

Table 2: Results for Android Malware Classiﬁcation

through API Sequence Classiﬁcation using BERT on three

datasets: Androzoo (benign and malware classes), Drebin

Top 20 (top 20 malware classes) and CICAndMal2017 (4

malware classes). The upper half of the table presents re-

sults where APIs are initialized with linear projection and

the lower half shows results where APIs are initialized with

Skip-gram. Acc represents Accuracy and F1-s denotes F1-

score. For a dataset, values in boldface correspond to the

best performing model.

API initialized with Linear Projection

Dataset Acc Precision Recall F1-s

Androzoo 70.75 70.23 74.01 72.07

Drebin Top 20 69.27 02.99 03.70 03.19

CICAndMal2017 63.53 16.05 15.68 15.80

API initialized with Skip-gram

Dataset Acc Precision Recall F1-s

Androzoo 84.00 85.10 82.56 83.81

Drebin Top 20 79.42 05.56 05.65 05.52

CICAndMal2017 63.53 16.75 16.12 16.18

sify Android malware based on API sequences, with

two different ways of initializing the API embed-

dings: Linear Projection and Skip-gram. Skip-gram is

a method of learning API embeddings from the con-

text of the API corpus, which can capture the seman-

tic and syntactic similarities of the APIs. The table

reports the accuracy, precision, recall and F1-score of

the models on each dataset. The results are reported

on three datasets: Androzoo, Drebin Top 20 and CI-

CAndMal2017, which vary in size, diversity and im-

balance of malware classes. Some of the key ﬁndings

from the analysis are:

• BERT with Skip-gram consistently outperforms

BERT with linear projection on all three datasets,

achieving higher accuracy, precision, recall and

F1-score. This suggests that Skip-gram is a better

way of initializing the APIs for BERT, as it pre-

serves the semantic and syntactic information of

the APIs, which may be useful for malware clas-

siﬁcation.

• BERT with Skip-gram achieves the highest per-

formance on Androzoo, a large and balanced

dataset of benign and malicious Android applica-

tions. The model achieves an accuracy of 84.00%,

a precision of 85.10%, a recall of 82.56% and an

F1-score of 83.81%, indicating that it can effec-

tively distinguish between benign and malicious

applications based on their API sequences. This

also shows that BERT can handle long and com-

plex API sequences, as Androzoo contains appli-

cations with varying lengths and complexities of

API sequences.

• BERT with Skip-gram performs poorly on Drebin

Top 20 and CICAndMal2017, two smaller and

imbalanced datasets of Android malware sam-

ples across different malware classes. The model

achieves low accuracy, precision, recall and F1-

score on both datasets, indicating that it cannot

capture the subtle differences between different

malware classes based on their API sequences.

This may be due to the lack of sufﬁcient train-

ing data, high class imbalance, or the high simi-

larity of API sequences among different malware

classes.

Table 3: IoT Malware classiﬁcation Results on USTC-

TFC2016 dataset. Acc represents Accuracy and F1-s de-

notes F1-score. For a dataset, values in boldface correspond

to the best performing model. The gray colored cells repre-

sent the best performance across a model and its two varia-

tions for a speciﬁc dataset.

Model Acc Precision Recall F1-s

ResNet-50 99.81 99.83 99.76 99.79

ResNet-50-1T 99.67 99.24 99.71 99.45

MobileNetV2 99.79 99.69 99.70 99.69

MobileNetV2-1T 99.58 99.55 99.47 99.51

LCNN (0.25) 99.20 99.38 99.17 99.27

LCNN (0.25)-1T 98.34 97.77 97.75 97.74

ViT 82.67 76.41 72.83 72.45

Table 3 shows the performance of different im-

age classiﬁcation models on the IoT dataset, USTC-

TFC2016 (Wang et al., 2017). This dataset consists of

20 trafﬁc classes that are commonly found in the IoT

domain. The models include ResNet-50 (He et al.,

2015) MobileNetV2 (Sandler et al., 2018), LCNN

(0.25) (Yuan et al., 2022), and ViT (Dosovitskiy et al.,

2020), as well as their variants with one transformer

layer (1T) added after the convolutional feature ex-

traction. The metrics used to evaluate the models are

Accuracy, Precision, Recall, and F1-score. From Ta-

ble 3, we observe the following:

• ResNet-50 achieves the highest performance

among all the models, with an accuracy of

99.81%, a precision of 99.83%, a recall of

99.76%, and an F1-score of 99.79%. This indi-

cates that ResNet-50 is very effective at recogniz-

ing the classes in the IoT dataset, with high accu-

racy and balanced precision and recall. ResNet-

50 is a deep residual network that uses skip con-

nections to overcome the degradation problem of

deep networks. ResNet-50 has 50 layers and is

widely used for various computer vision tasks.

• Adding one transformer layer to ResNet-50

slightly degrades its performance, as shown by

ResNet-50-1T, which has an accuracy of 99.67%,

a precision of 99.24%, a recall of 99.71%, and an

F1-score of 99.45%. The transformer layer is a

Malware Analysis Using Transformer Based Models: An Empirical Study

863

self-attention mechanism that can capture long-

range dependencies and semantic relations among

the features. However, in this case, the trans-

former layer may introduce some noise or redun-

dancy which in turn reduces the precision of the

model, while maintaining a high recall.

• MobileNetV2 is another high-performing model,

with an accuracy of 99.79%, a precision of

99.69%, a recall of 99.70%, and an F1-score of

99.69%. MobileNetV2 is a lightweight and ef-

ﬁcient network that uses inverted residual blocks

and depth wise separable convolutions to reduce

the computational cost and parameter size. Mo-

bileNetV2 is suitable for mobile and embedded

devices as well as IoT applications.

• Similar to ResNet-50, adding one transformer

layer to MobileNetV2 also lowers its perfor-

mance, as shown by MobileNetV2-1T, which has

an accuracy of 99.58%, a precision of 99.55%, a

recall of 99.47%, and an F1-score of 99.51%. The

transformer layer may not bring much beneﬁt to

the already compact and powerful MobileNetV2,

and may instead introduce some overhead or com-

plexity that affects the precision and recall of the

model.

• LCNN (0.25) is a low-complexity network that

uses a fraction of the parameters and operations

of standard CNNs by applying a low-rank decom-

position to the convolutional ﬁlters. LCNN (0.25)

has an accuracy of 99.20%, a precision of 99.38%,

a recall of 99.17%, and an F1-score of 99.27%.

This shows that LCNN (0.25) can achieve com-

parable performance to the other models, while

reducing the computational and memory require-

ments. LCNN (0.25) is also suitable for resource-

constrained IoT devices.

• However, adding one transformer layer to LCNN

(0.25) signiﬁcantly lowers its performance, as

shown by LCNN (0.25)-1T, which has an accu-

racy of 98.34%, a precision of 97.77%, a recall of

97.75%, and an F1-score of 97.74%. The trans-

former layer may not be compatible with the low-

rank structure of LCNN (0.25), and may instead

degrade the quality and efﬁciency of the model.

The transformer layer may also increase the pa-

rameter size and computational cost of the model,

which defeats the purpose of using LCNN (0.25).

• ViT is a visual image transformer model that

uses only transformer layers to process the images

without any convolutional layers. ViT has an ac-

curacy of 82.67%, a precision of 76.41%, a recall

of 72.83%, and an F1-score of 72.45%. This indi-

cates that ViT performs poorly on the IoT dataset,

compared to the other models. ViT is not able to

capture the ﬁne-grained and local features of the

IoT data, and may suffer from the low resolution

and diversity of the images. ViT may also require

more data and training to achieve optimal perfor-

mance, as it is a data-hungry and complex model.

Thus, ViT may not be suitable for IoT applications

as it has a large parameter size and computational

cost.

6 CONCLUSIONS

In this paper, we have investigated the applicability

and effectiveness of different transformer-based mod-

els for classifying malware targeting Android applica-

tions and IoT devices. We have performed two types

of malware analysis - image-based and sequence-

based. For image-based malware analysis, we have

converted Android APKs and IoT trafﬁc into ﬁxed

size images, and have compared various CNN archi-

tectures with and without transformer layers, and a

pure transformer model to classify the converted im-

ages. For sequence-based malware analysis, we ex-

tracted the API call sequences from Android APKs,

and applied a transformer model to encode and clas-

sify them. We have also explored the effect of pre-

training and embedding initialization on the trans-

former models.

Our experiments demonstrate the advantages and

limitations of using transformer-based models for

malware categorization, and provide insights into the

training strategies of these models. The main ﬁndings

of our study can be summarized as follows:

• For image-based malware analysis, CNN based

models in general, have outperformed the

transformer-based models on all the datasets.

ResNet-50 is the best performing CNN model on

datasets with fewer classes and more balanced

samples, while MobileNetV2 and LCNN (0.25)

are the best performing lightweight CNN mod-

els on datasets with more classes and imbalanced

samples.

• Adding transformer layers after the CNN feature

extraction generally reduces the performance of

the lightweight CNN models.

• ViT is the worst performing model on all the

datasets, and pretraining ViT on larger and more

diverse datasets slightly improves its performance

on some of the datasets, but not on others. These

results suggest that the transformer-based models

may not be beneﬁcial or suitable for the image-

based malware analysis task, as they may intro-

SECRYPT 2024 - 21st International Conference on Security and Cryptography

864

duce more parameters, noise, or redundancy to the

models, or may require more data or ﬁne-tuning to

adapt to the task.

• For sequence-based malware analysis, BERT is an

effective model for encoding and classifying the

API call sequences extracted from Android APKs.

Initializing the API embeddings with Skip-gram,

an API2vec technique, signiﬁcantly improves the

performance of BERT. These results suggest that

the Skip-gram embeddings capture more seman-

tic and syntactic information about the API se-

quences than the linear projection, and that they

help BERT to learn better representations and

classiﬁcations of the malware families. However,

the Skip-gram embeddings may fail to capture the

diversities of the API sequences across different

malware and benign families. Moreover, BERT

may need higher volume of data or more sophis-

ticated methods to achieve higher performance

on dataset containing a large number of malware

families.

Our study provides a comprehensive and system-

atic comparison of different transformer-based mod-

els for malware analysis on image and sequence in-

puts, and has revealed the strengths and weaknesses

of these models. Our study has also highlighted the

challenges and opportunities for future research on

applying transformer-based models for accurate mal-

ware analysis and related tasks.

ACKNOWLEDGEMENT

The authors gratefully acknowledge the computing

time provided on the high performance computing fa-

cility, Sharanga, at the Birla Institute of Technology

and Science - Pilani, Hyderabad Campus.

REFERENCES

Allix, K., Bissyand

e, T. F., Klein, J., and Le Traon, Y.

(2016). Androzoo: Collecting millions of android

apps for the research community. In 13th Int. Conf.

on Mining Software Repositories, pages 468–471.

Arp, D., Spreitzenbarth, M., Hubner, M., Gascon, H., and

Rieck, K. (2014). Drebin: Effective and explain-

able detection of android malware in your pocket. In

NDSS. The Internet Society.

Cannarile, A., Carrera, F., alantucci1, S., Iannacone, A.,

and Pirlo, G. (2022). A study on malware detection

and classiﬁcation using the analysis of api calls se-

quences through shallow learning and recurrent neural

networks. Italian Conference on Cybersecurity, 3260.

Conti, G., Dean, E., Sinda, M., and Sangster, B. (2008).

Visual reverse engineering of binary and data ﬁles. In

Int. Workshop on Visualization for Computer Security,

page 1 – 17.

Devlin, J., Chang, M., Lee, K., and Toutanova, K. (2018).

BERT: pre-training of deep bidirectional transformers

for language understanding. CoRR, abs/1810.04805.

Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn,

D., Zhai, X., Unterthiner, T., Dehghani, M., Min-

derer, M., Heigold, G., Gelly, S., Uszkoreit, J., and

Houlsby, N. (2020). An image is worth 16x16 words:

Transformers for image recognition at scale. ArXiv,

abs/2010.11929.

Freitas, S., Duggal, R., and Chau, D. H. (2022). Malnet:

A large-scale image database of malicious software.

In 31st ACM Int. Conf. on Information & Knowledge

Management, page 3948–3952.

Gennissen, J. and Blasco, J. (2017). Gamut : Sifting

through images to detect android malware.

He, K., Zhang, X., Ren, S., and Sun, J. (2015). Deep resid-

ual learning for image recognition.

Jo, J., Cho, J., and Moon, J. (2023). A malware detection

and extraction method for the related information us-

ing the vit attention mechanism on android operating

system. Applied Sciences, 13(11).

Lashkari, A. H., Kadir, A. F. A., Taheri, L., and Ghor-

bani, A. A. (2018). Toward developing a system-

atic approach to generate benchmark android malware

datasets and classiﬁcation. In 2018 Int. Carnahan

Conf. on Security Technology, pages 1–7.

Mikolov, T., Chen, K., Corrado, G., and Dean, J. (2013).

Efﬁcient estimation of word representations in vector

space.

Nataraj, L., Karthikeyan, S., Jacob, G., and Manjunath, B.

(2011). Malware images: Visualization and automatic

classiﬁcation. In 8th Int. Symposium on Visualization

for Cyber Security.

Ravi, A., Chaturvedi, V., and Shaﬁque, M. (2023). Vit4mal:

Lightweight vision transformer for malware detection

on edge devices. ACM Transactions on Embedded

Computing Systems, 22(117).

Sandler, M., Howard, A., Zhu, M., Zhmoginov, A., and

Chen, L.-C. (2018). Mobilenetv2: Inverted residuals

and linear bottlenecks. In 2018 IEEE/CVF Conference

on Computer Vision and Pattern Recognition (CVPR),

pages 4510–4520.

Seneviratne, S., Shariffdeen, R., Rasnayaka, S., and

Kasthuriarachchi, N. (2022). Self-supervised vision

transformers for malware detection. IEEE Access,

10:103121–103135.

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones,

L., Gomez, A. N., Kaiser, L., and Polosukhin, I.

(2023). Attention is all you need. In 31st Conf. on

Neural Information Processing Systems.

Wang, W., Zhu, M., Zeng, X., Ye, X., and Sheng, Y. (2017).

Malware trafﬁc classiﬁcation using convolutional neu-

ral network for representation learning. In 2017 Int.

Conf. on Information Networking, pages 712–717.

Yuan, B., Wang, J., Wu, P., and Qing, X. (2022). Iot mal-

ware classiﬁcation based on lightweight convolutional

neural networks. IEEE Internet of Things Journal,

9(5):3770–3783.

Malware Analysis Using Transformer Based Models: An Empirical Study

865