Labelled Vulnerability Dataset on Android Source Code (LVDAndro) to
Develop AI-Based Code Vulnerability Detection Models
Janaka Senanayake (1,2,a), Harsha Kalutarage (1,b), Mhd Omar Al-Kadri (3,c), Luca Piras (4,d) and Andrei Petrovski (1,e)
1 School of Computing, Robert Gordon University, Aberdeen, U.K.
2 Faculty of Science, University of Kelaniya, Kelaniya, Sri Lanka
3 School of Computing and Digital Technology, Birmingham City University, Birmingham, U.K.
4 Department of Computer Science, Middlesex University, London, U.K.
Keywords:
Android Application Security, Code Vulnerability, Labelled Dataset, Artificial Intelligence, Auto Machine
Learning.
Abstract:
Ensuring the security of Android applications is a vital and intricate aspect requiring careful consideration
during development. Unfortunately, many apps are published without sufficient security measures, possibly
due to a lack of early vulnerability identification. One possible solution is to employ machine learning models
trained on a labelled dataset, but currently, available datasets are suboptimal. This study creates a sequence
of datasets of Android source code vulnerabilities, named LVDAndro, labelled based on Common Weakness
Enumeration (CWE). Three datasets were generated through app scanning by altering the number of apps
and their sources. LVDAndro includes over 2,000,000 unique code samples, obtained by scanning over
15,000 apps. The AutoML technique was then applied to each dataset, as a proof of concept, to evaluate
the applicability of LVDAndro in detecting vulnerable source code using machine learning. The AutoML
model, trained on the dataset, achieved an accuracy of 94% and an F1-Score of 0.94 in binary classification, and
an accuracy of 94% and an F1-Score of 0.93 in CWE-based multi-class classification. The LVDAndro dataset is
publicly available and continues to expand as more apps are scanned and added regularly. The
LVDAndro GitHub repository also includes the source code for dataset generation and model training.
1 INTRODUCTION
Approximately 90,000 Android mobile apps are re-
leased through the Google Play Store monthly. As of
January 2023, Android held a 71.74% market share
(Statista, 2023; Statcounter, 2023). However, many
of these apps are developed without adhering to se-
cure coding best practices and standards, resulting in
source code vulnerabilities, which appeal to attack-
ers. In contrast to iOS, Android applications are not
thoroughly checked for security aspects (Senanayake
et al., 2021), and therefore the security of these apps
is not guaranteed, and they may fail to comply with
rigorous security protocols.
a https://orcid.org/0000-0003-2278-8671
b https://orcid.org/0000-0001-6430-9558
c https://orcid.org/0000-0002-1146-1860
d https://orcid.org/0000-0002-7530-4119
e https://orcid.org/0000-0002-0987-2791
To ensure the security of apps, it is recommended
to implement secure coding practices while writing
the code, as many vulnerabilities stem from flaws in
the source code. The Security Development Life-
cycle (SDL) recommends following secure develop-
ment practices in real-time, rather than waiting until
the application is developed (Souppaya et al., 2021).
To help enforce these practices, researchers have de-
veloped automated tools for identifying Android app
vulnerabilities using various scanning methods, in-
cluding conventional, Machine Learning (ML), and
Deep Learning (DL) methods (Shezan et al., 2017;
Senanayake et al., 2023). These methods utilise three
analysis approaches: static, dynamic, and hybrid.
However, many existing vulnerability detection meth-
ods require Android Application Package (APK) files
that are ready to be installed, limiting their usefulness
during development. To overcome that, it is possible
to use well-trained ML/DL models, which can detect
vulnerabilities as the code is being written.
A properly labelled dataset on Android source
code vulnerability is required to train such models.
Hence, this paper makes the following contributions.
1. Producing a properly labelled novel dataset of Android source code vulnerabilities named LVDAndro, which offers the following characteristics:
A. LVDAndro contains more than fifteen million distinct code samples scanned from over fifteen thousand Android apps;
B. within LVDAndro, vulnerable code examples were labelled with Common Weakness Enumeration (CWE, https://cwe.mitre.org/) identifications and contain additional attributes such as vulnerability category, severity, and description. As a result, this dataset is unique compared to existing ones;
C. within LVDAndro, the labelling process was done by combining multiple vulnerability scanners, including Mobile Security Framework (MobSF, https://github.com/MobSF/Mobile-Security-Framework-MobSF) and Quick Android Review Kit (Qark, https://github.com/linkedin/qark/). Hence, ML models trained with LVDAndro learn the capabilities of all scanners.
2. Performing binary and multi-class classification-related AutoML experiments as a Proof-of-Concept (PoC) to determine the applicability of the LVDAndro dataset for Android source code vulnerability detection with ML. The classifiers achieved:
A. an accuracy of 94% and an F1-Score of 0.94 in binary classification, which predicts vulnerable code;
B. an accuracy of 94% and an F1-Score of 0.93 in multi-class classification, which predicts the CWE-ID of a vulnerable code line.
3. Making the dataset available for public access as a GitHub repository (https://github.com/softwaresec-labs/LVDAndro), along with the dataset generation scripts and the instructions to enhance the dataset by adding more data as needed.
The remaining sections of the paper are structured
as follows: in Section 2, prior research on the sub-
ject is reviewed, and in Section 3, the dataset gener-
ation is discussed. Section 4 outlines the attributes
and the statistics of the LVDAndro dataset, and Sec-
tion 5 examines how the LVDAndro can be used to
train ML models to identify vulnerabilities in Android
code. Section 6 provides the conclusion by discussing
the findings and future plans.
2 RELATED WORK
This section discusses the related studies on code vul-
nerabilities and datasets, which can be used to detect
software vulnerabilities using ML-based methods.
Organisations and communities have identified a
number of vulnerabilities. CWE and Common Vul-
nerabilities and Exposures (CVE, https://cve.mitre.org/) are generally used
as references for identifying weaknesses and vulner-
abilities across many programming languages. As a
result, mobile app developers can also refer to these
references to identify vulnerabilities and address se-
curity issues in their source code.
Previous research has proposed a number of
datasets and repositories focused on vulnerabilities.
For instance, AndroVul (Namrud et al., 2019) is a
repository that deals with security issues related to
Android, such as security code smells, dangerous per-
missions, and dangerous shell commands. It was cre-
ated by analysing APKs downloaded from Andro-
Zoo (Allix et al., 2016), and serves as a benchmark
for detecting Android malware. It can also be used
for ML experiments to detect malware with static
analysis. Another dataset, introduced in (Challande
et al., 2022), is a commit-level dataset for real-world
vulnerabilities, which has analysed more than 1,800
projects and over 1,900 vulnerabilities based on CVE
from the Android Open Source Project. Ghera (Mi-
tra and Ranganath, 2017), an open-source repository
of benchmarks, has captured 25 known vulnerabili-
ties in Android apps and also presented some com-
mon characteristics of vulnerability benchmarks and
repositories. Additionally, the National Vulnerabil-
ity Database (NVD, https://nvd.nist.gov/vuln) is another dataset that can be utilised as a reference for vulnerabilities. However, it lacks the ability to support the development of AI-based models because vulnerable categories for code lines are not correctly labelled.
It is possible to create new datasets by examin-
ing Android apps for vulnerabilities. There are two
methods for analysing Android applications. The first
approach involves reverse-engineering the developed
APKs and analysing the code. However, this method
requires a pre-built application and is not applicable
in the early stages of the SDLC (Senanayake et al.,
2022). The second approach involves analysing the
source code while it is being written. Both meth-
ods utilise static, dynamic, and hybrid analysis tech-
niques as the initial step of application scanning.
Static analysis techniques identify code issues with-
out executing the application or source code and can
be categorised into two types: manifest analysis and
code analysis. Manifest analysis can identify package
names, permissions, activities, services, intents, and
providers. On the other hand, code analysis provides
deeper insights into the source code by analysing
features such as API calls, information flow, native
code, taint tracking, clear-text analysis, and opcodes
(Senanayake et al., 2021). Dynamic analysis, in con-
trast to static analysis, requires a runtime environment
to execute the application for scanning. This approach
is commonly used for malware detection and identi-
fying vulnerabilities in pre-built applications. Hybrid
analysis combines both static and dynamic analysis
techniques, where static analysis is used to analyse
the manifest file and source code files, and dynamic
analysis is used to analyse the application’s charac-
teristics at runtime.
Various tools are available to conduct such anal-
ysis. For instance, APKTool (https://ibotpeaches.github.io/Apktool/) can extract code-
level information by decompiling the APK using a
static analyser. This tool is widely used as the foun-
dation for vulnerability detection methods based on
static analysis that involve reverse-engineering APKs
(Senanayake et al., 2021). Qark is another tool that
can identify vulnerabilities in Android apps by exam-
ining pre-built APKs or source code files. MobSF,
on the other hand, uses a hybrid analysis model to de-
tect vulnerabilities, malware, and perform penetration
testing. It is a security framework designed for An-
droid and iOS, which offers REST API for develop-
ment integration. HornDroid (Calzavara et al., 2016)
is a tool that can analyse information flow in Android
apps by abstracting their semantics to construct secu-
rity properties, while COVERT (Bagheri et al., 2015)
can perform compositional analysis of inter-app vul-
nerabilities in Android. Another tool that can be use-
ful for identifying vulnerabilities in Android source
code through static analysis is Android Lint (Goaër,
2020), which uses Abstract Syntax Trees (AST) or
Universal AST generated from source code.
Previous research has noted that both ML-based
and non-ML-based methods can be employed to de-
tect vulnerabilities. However, in recent years, there
has been a greater tendency to use ML-based ap-
proaches over non-ML-based methods (Ghaffarian
and Shahriari, 2017). Additionally, model accu-
racy and performance can be improved by enhancing
datasets and tuning parameters through various ML
experiments. While Alloy (Bagheri et al., 2018) and
VulArcher (Qin et al., 2020) have presented non-ML-
based techniques such as formal and heuristic-based
methods, ML-based vulnerability detection methods
have been proposed in studies such as (Senanayake
et al., 2022; Gajrani et al., 2020). These studies have
employed various classifiers, including Decision Tree
(DT), Naive Bayes (NB), AdaBoost (AB), Random
Forest (RF), Gradient Boosting (GB), Extreme Gradi-
ent Boosting (XGB), Logistic Regression (LR), Sup-
port Vector Classifier (SVC), and Multi-Layer Perceptron (MLP), trained on labelled datasets.
The detection of source code vulnerabilities in
Android has been a challenge due to the absence of
an accurate method during code writing, as well as
a lack of appropriately labelled datasets to train ma-
chine learning (ML) models for vulnerability predic-
tion (Senanayake et al., 2023). To address this gap,
the LVDAndro dataset is introduced in this study, and
a proof of concept (PoC) is presented that uses ML
techniques to detect Android code vulnerabilities.
3 DATASET GENERATION
PROCESS
The LVDAndro dataset is a comprehensive and di-
verse collection of labelled data that is specifically de-
signed to address the challenges of detecting Android
source code vulnerabilities using ML techniques. The
dataset contains a wide range of source code samples
with varying degrees of complexity and security vul-
nerabilities. The overall process of LVDAndro dataset
generation is illustrated in Figure 1, and the genera-
tion process consists of three main stages, as follows.
1. Scraping of APKs and corresponding source
files (Data collection).
2. Scanning APKs for vulnerabilities using exist-
ing tools to label the source code with CWE-IDs
(Data labelling).
3. Generating processed dataset (Preprocessing).
Figure 1: LVDAndro dataset generation process.
3.1 Scraping APKs and Source Files
(Data Collection)
To generate the LVDAndro dataset, the first step is to
scrape APKs, and their source code, from application
repositories. This includes Google Play, Fossdroid
(Simonin, 2023), AndroZoo (Allix et al., 2016), and
some well-known malware repositories (Senanayake
et al., 2021). Python scripts were used to download
APKs and their source code from GitHub reposito-
ries. An experiment was also carried out to investigate
whether source code from reverse-engineered APKs
could be used to generate the dataset instead of rely-
ing on the original source code. This was due to the
lack of open-source APKs and the high availability of
closed-source APKs. Figure 2 illustrates the sources
of the downloaded apps in the current version of the
dataset, which will be increased in future versions.
Figure 2: Source of downloaded apps.
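To illustrate the data collection step, the following Python sketch downloads a list of APKs with the requests library; the URLs and the output directory are placeholders, and the actual LVDAndro collection scripts in the GitHub repository may differ.

import os
import requests

# Hypothetical direct-download links from an app repository (placeholders).
APK_URLS = [
    "https://example.org/apps/sample_app_1.apk",
    "https://example.org/apps/sample_app_2.apk",
]
OUT_DIR = "downloaded_apks"

def download_apks(urls, out_dir):
    os.makedirs(out_dir, exist_ok=True)
    for url in urls:
        name = url.rsplit("/", 1)[-1]
        response = requests.get(url, timeout=60)
        response.raise_for_status()
        with open(os.path.join(out_dir, name), "wb") as fh:
            fh.write(response.content)  # store the APK for later scanning
        print(f"saved {name} ({len(response.content)} bytes)")

if __name__ == "__main__":
    download_apks(APK_URLS, OUT_DIR)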
3.2 Scanning APKs for Vulnerabilities
(Data Labelling)
LVDAndro was developed to leverage ML to iden-
tify source code vulnerabilities in real-time. To create
robust and effective machine learning models, a di-
verse dataset of APKs and source files was necessary.
To achieve this, various scanning methods were em-
ployed as the second step in dataset generation, which
involved scanning both the APKs and source files for
vulnerabilities. This enabled the ML models to be
trained on a comprehensive range of vulnerabilities.
The LVDAndro dataset was developed using the
code analysis approach in the static analysis method
by scanning APKs and Android project files (which
include source code and file structure). Vulnerability
scanning tools such as MobSF and Qark were used
for this purpose. During the scanning process, these
tools could identify the vulnerable lines of code and
the corresponding CWE-IDs. The idea was that by
using the resulting dataset to train machine learning
models, the models would be able to learn from the
capabilities of both scanners and perform better than
either tool alone in terms of detection. A Python script
was developed to automate the scanning process, and
all applications were scanned using this script.
To analyse an APK or Android project using
MobSF, it needs to be set up as a server, and several
API requests can be made, including upload, scan,
and download. When an APK or project is uploaded,
MobSF decompiles it using tools such as JADX,
dex2jar, and JD-GUI. The decompiled source code or
project files are then scanned for vulnerabilities. Af-
ter the scan is complete, the results are stored in a
local database, which is mapped to a generated hash
value. The results are retrieved as a JSON object and
passed to the automation Python script using the hash
value. The JSON object contains details of the uploaded files, including vulnerability status, manifest analysis
details, code analysis details, and associated files. A
separate Python script is then used to extract the nec-
essary details and the source code lines of both vul-
nerable and non-vulnerable codes labelled by MobSF.
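As an illustration of this workflow, the following Python sketch drives a locally running MobSF server through its REST API (the /api/v1/upload, /api/v1/scan and /api/v1/report_json endpoints); the server address, API key and the report fields used here are assumptions rather than the exact automation script used for LVDAndro.

import requests

MOBSF_URL = "http://127.0.0.1:8000"   # local MobSF server (placeholder)
API_KEY = "<mobsf_api_key>"           # shown on the MobSF REST API page
HEADERS = {"Authorization": API_KEY}

def scan_apk(apk_path):
    # 1. Upload the APK; MobSF returns a hash (plus file name and scan type).
    with open(apk_path, "rb") as fh:
        files = {"file": (apk_path, fh, "application/octet-stream")}
        upload = requests.post(f"{MOBSF_URL}/api/v1/upload",
                               files=files, headers=HEADERS).json()
    # 2. Trigger static analysis; the upload response already carries the
    #    hash and file details the scan endpoint expects.
    requests.post(f"{MOBSF_URL}/api/v1/scan", data=upload, headers=HEADERS)
    # 3. Retrieve the full JSON report mapped to the hash value.
    report = requests.post(f"{MOBSF_URL}/api/v1/report_json",
                           data={"hash": upload["hash"]}, headers=HEADERS).json()
    return report

report = scan_apk("downloaded_apks/sample_app_1.apk")
print(report.get("code_analysis", {}))  # per-file findings with CWE details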
To perform analysis with Qark, it must be run as a shell command, because it does not offer an API like MobSF does. The APK or the project source file directory should be passed as a parameter when running Qark. When an APK is passed, Qark decompiles it using tools such as Fernflower, Procyon, and CFR, and then scans it to identify vulnerable lines of code. If a source file is submitted, it is scanned directly. After identifying the vulnerable code lines, a Python script labels and stores them, along with a description from the scanner, the vulnerability type, and the severity level.
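The Qark step can be scripted in a similar way by invoking the command-line tool from Python, as in the sketch below; the --apk/--java and report options are taken from Qark's documentation and may vary between versions, and the report path used here is an assumption.

import json
import subprocess

def run_qark(target, is_apk=True, report_dir="qark_reports"):
    # Choose --apk for a packaged app or --java for a source directory
    # (assumed option names; check `qark --help` for the installed version).
    flag = "--apk" if is_apk else "--java"
    cmd = ["qark", flag, target,
           "--report-type", "json", "--report-path", report_dir]
    subprocess.run(cmd, check=True)          # blocks until the scan completes
    # Parse the JSON report for vulnerable lines, descriptions and severity.
    with open(f"{report_dir}/report.json") as fh:
        return json.load(fh)

findings = run_qark("downloaded_apks/sample_app_1.apk")
print("Qark reported", len(findings), "entries")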
Python scripts were created to scan APKs and
source files, utilising a unified approach that inte-
grates the functionalities of MobSF and Qark. These
scripts employ techniques for scanning and identify-
ing any potential vulnerabilities in the application or
source code, and the results are tagged with CWE IDs
to provide relevant information.
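The combined labelling idea can be sketched as follows: a code line is marked as vulnerable if either scanner flags it, and the CWE-ID is taken from whichever scanner(s) reported it. The data structures and field names below are illustrative only, not the exact LVDAndro scripts.

def combine_labels(mobsf_findings, qark_findings, all_code_lines):
    """mobsf_findings / qark_findings: dict mapping a code line to a CWE-ID."""
    labelled = []
    for line in all_code_lines:
        cwes = set()
        if line in mobsf_findings:
            cwes.add(mobsf_findings[line])
        if line in qark_findings:
            cwes.add(qark_findings[line])
        labelled.append({
            "code": line,
            "vulnerability_status": 1 if cwes else 0,   # flagged by either scanner
            "cwe_id": ",".join(sorted(cwes)) if cwes else None,
        })
    return labelled

rows = combine_labels(
    {"Log.d(tag, password);": "CWE-532"},
    {"Log.d(tag, password);": "CWE-532", "new Random();": "CWE-330"},
    ["Log.d(tag, password);", "new Random();", "int x = 1;"],
)

In this toy example, the third line is retained as a non-vulnerable sample, which is how the dataset also accumulates negative examples.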
3.3 Generating Processed Dataset
(Preprocessing)
During this stage, various preprocessing steps were carried out. First, user-defined string values were replaced with user_str, since typical user-defined string values do not have a significant impact on vulnerabilities (Hanif and Maffeis, 2022). However, string values that included IP addresses and encryption algorithms like AES, SHA-1, and MD5 were not replaced, since they may cause vulnerabilities such as CWE-200 and CWE-327, which involve exposing sensitive information to unauthorised parties and using insecure cryptographic algorithms. Next, all comments were replaced with //user_comment, since language compilers ignore comments. Finally, duplicates were removed based on the processed code and the vulnerability status.
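A minimal Python sketch of these preprocessing rules is given below; the replacement tokens follow the description above, while the regular expressions and the row structure are assumptions rather than the exact LVDAndro implementation.

import re

KEEP_PATTERN = re.compile(r"\d{1,3}(\.\d{1,3}){3}|AES|SHA-1|MD5", re.IGNORECASE)
STRING_PATTERN = re.compile(r'"[^"]*"')
COMMENT_PATTERN = re.compile(r"//.*$")

def preprocess_line(code):
    # Comments are ignored by compilers, so replace them with a fixed token.
    code = COMMENT_PATTERN.sub("//user_comment", code)
    # Replace user-defined strings unless they hint at IPs or weak crypto,
    # which matter for vulnerabilities such as CWE-200 and CWE-327.
    def repl(match):
        return match.group(0) if KEEP_PATTERN.search(match.group(0)) else '"user_str"'
    return STRING_PATTERN.sub(repl, code)

def deduplicate(rows):
    # Keep one entry per (processed code, vulnerability status) pair.
    seen, unique = set(), []
    for row in rows:
        key = (row["processed_code"], row["vulnerability_status"])
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique

print(preprocess_line('String url = "http://10.0.0.1/api"; // backend endpoint'))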
4 RESULTING DATASET
A sequence of datasets was created when generating
the LVDAndro dataset. This section discusses the
characteristics of these resulting datasets.
4.1 Different Datasets
The LVDAndro datasets were produced by scan-
ning real-world Android applications. Dataset 01
was compiled by including all the popular open-
source APKs and their related Android projects in
FossDroid, leading to a total of 511 Apps. An-
other dataset, named Dataset 02, was composed
of 5,503 open-source APKs and their associated
projects, scanned from all the listed apps in FossDroid
across 17 different categories such as Internet, Sys-
tems, Games, and Multimedia. Furthermore, Dataset
03 was formed by scanning 15,021 APKs from Foss-
Droid, AndroVul, and Android malware repositories.
This dataset includes scanned source code from both
open-source and closed-source applications, consist-
ing of 23 different CWE ID labels. If AI models need
to be developed based on the types of apps, the three
variations of datasets can be used. However, if there
is no such requirement, Dataset 03 can be used to per-
form an extensive analysis and build more accurate
models since it contains a large number of labelled
source code examples. A summary of the LVDAndro
datasets is provided in Table 1.
After processing, nine sub-datasets were created for each of Dataset 01 and Dataset 02. These sub-datasets were
generated using three different scanning approaches:
MobSF scanner, Qark scanner, and the proposed com-
bined scanner, to compare their effectiveness. Within
each method, three sub-datasets were generated using
only APKs, only Android projects, and both APKs
and Android projects. Dataset 03, which was gen-
erated using only APKs and the combined approach,
produced one dataset.
Figure 3 displays the distribution of vulnerable
and non-vulnerable code samples across the datasets
in LVDAndro. Observing the data, it is evident that
the count of non-vulnerable source code samples is
generally greater than the vulnerable source code
sample count. Since the datasets were created using
actual applications, it is possible that they contain a
significant proportion of non-vulnerable code.
4.2 Statistics of Datasets
Table 2 presents the fields included in the LVDAndro dataset. While the processed code, vulnerability status, and CWE-IDs are necessary for detecting vulnerabilities, other fields can also provide additional information for prediction.
Figure 3: Vulnerable and non-vulnerable code samples distribution in each dataset (in APK Combined Approach).
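As a usage example, the released dataset can be loaded into a dataframe and reduced to the fields highlighted above; the file name and column names in this sketch are assumptions, so the released file layout in the GitHub repository should be checked.

import pandas as pd

# Hypothetical file and column names; adjust to the released CSV layout.
df = pd.read_csv("LVDAndro_Dataset_03.csv")
features = df[["processed_code", "vulnerability_status", "CWE_ID"]]

print(features["vulnerability_status"].value_counts())      # class balance
print(features["CWE_ID"].dropna().value_counts().head())    # most frequent CWEs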
Table 3 classifies the CWE-IDs based on their
likelihood of exploitation, and Figure 4 shows the
distribution of CWE-IDs in LVDAndro Dataset 03.
CWE-532 has a large number of code examples, as
it is common to write information to log files for de-
bugging purposes. However, these logs may also con-
tain sensitive information, which can be accidentally
written by the developer. CWE-312 also has a signif-
icant number of code examples, as many developers
tend to write sensitive information in cleartext. Most
of the other CWE categories have an even distribution
of code examples, whereas categories like CWE-299,
CWE-502, and CWE-599 have fewer examples due
to their complexity and difficulty in finding relevant
instances in the context of Android source code.
Figure 4: CWE-ID distribution in dataset 03.
Figure 5 depicts the distribution of CWE-IDs in
Dataset 03 based on their CWE likelihood of exploita-
tion values. As the dataset consists of 95% vulnerable
code examples for both high and medium exploitable
CWE-IDs, it is expected to be highly effective in de-
tecting vulnerabilities.
5 DATASET USAGE
This section outlines the proof-of-concept concerning utilising LVDAndro to train machine learning models for detecting vulnerabilities in Android source code.
Table 1: Summary of the LVDAndro datasets.
Dataset 01 | Created: Mar-2022 | Code samples: 1,020,134 | Vulnerable: 765,101 | Non-vulnerable: 255,034 | Vul:Non-vul ratio: 1:3 | CWE-IDs: 22 | Description: Created using 511 open-source apps; 9 sub-datasets scanned with MobSF, Qark and Combined scanners (3 by scanning only APKs, 3 by scanning only source files, 3 by scanning both APKs and source files).
Dataset 02 | Created: Jun-2022 | Code samples: 14,228,925 | Vulnerable: 10,529,405 | Non-vulnerable: 3,699,521 | Vul:Non-vul ratio: 7:9 | CWE-IDs: 23 | Description: Created using 5,503 open-source apps; 9 sub-datasets scanned with MobSF, Qark and Combined scanners (3 by scanning only APKs, 3 by scanning only source files, 3 by scanning both APKs and source files).
Dataset 03 | Created: Dec-2022 | Code samples: 21,289,029 | Vulnerable: 14,689,432 | Non-vulnerable: 6,599,597 | Vul:Non-vul ratio: 9:11 | CWE-IDs: 23 | Description: Created using 15,021 apps; 1 dataset scanned with the combined scanner using both open-source and closed-source apps from various sources.
Table 2: Fields in LVDAndro.
Field Name | Description
Index | Auto-generated identifier
Code | Original source code line
Processed code | Source code line after preprocessing
Vulnerability status | Vulnerable (1) or Non-vulnerable (0)
Category | Category of the vulnerability
Severity | Severity of the vulnerability
Type | Type of the vulnerability
Pattern | Pattern of the vulnerable code
Description | Description of the vulnerability
CWE ID | CWE-ID of the vulnerability
CWE Desc | Description of the vulnerable class
CVSS | Common Vulnerability Scoring System
OWASP Mobile | Open Web Application Security Project (OWASP) details for mobile apps
OWASP MASVS | OWASP Mobile Application Security Verification Standard details
Reference | CWE reference URL for the vulnerability
Figure 5: CWE distribution based on the likelihood of ex-
ploit.
It shows that by training an AutoML model on the LVDAndro dataset, it is possible to accurately detect and classify different types of vulnerabilities in Android source code.
Table 3: Available CWE-IDs in LVDAndro.
CWE-ID | Likelihood of Exploit | CWE Description
CWE-79 | High | Improper Neutralisation of Input During Web Page Generation ('Cross-site Scripting')
CWE-89 | High | Improper Neutralisation of Special Elements used in an SQL Command ('SQL Injection')
CWE-200 | High | Exposure of Sensitive Information to an Unauthorised Actor
CWE-250 | Medium | Execution with Unnecessary Privileges
CWE-276 | Medium | Incorrect Default Permissions
CWE-295 | High | Improper Certificate Validation
CWE-297 | High | Improper Validation of Certificate with Host Mismatch
CWE-299 | Medium | Improper Check for Certificate Revocation
CWE-312 | Medium | Cleartext Storage of Sensitive Information
CWE-327 | High | Use of a Broken or Risky Cryptographic Algorithm
CWE-330 | High | Use of Insufficiently Random Values
CWE-502 | Medium | Deserialisation of Untrusted Data
CWE-532 | Medium | Insertion of Sensitive Information into Log File
CWE-599 | High | Missing Validation of OpenSSL Certificate
CWE-649 | High | Reliance on Obfuscation or Encryption of Security-Relevant Inputs without Integrity Checking
CWE-676 | High | Use of Potentially Dangerous Function
CWE-749 | Low | Exposed Dangerous Method or Function
CWE-919 | Medium | Weaknesses in Mobile Applications
CWE-921 | Medium | Storage of Sensitive Data in a Mechanism without Access Control
CWE-925 | Medium | Improper Verification of Intent by Broadcast Receiver
CWE-926 | High | Improper Export of Android Application Components
CWE-927 | High | Use of Implicit Intent for Sensitive Communication
CWE-939 | High | Improper Authorisation in Handler for Custom URL Scheme
5.1 Training AutoML Models
In this section, the performance of AutoML models
trained on LVDAndro datasets for detecting vulnera-
ble code lines (binary classification), and for detecting
CWE-IDs (multi-class classification), are compared.
To handle the data imbalance issue, the data were re-
sampled, and the dataset was split into an 80:20 ratio
for training and testing. The resulting performance
metrics are presented in Table 4 and Table 5 for bi-
nary and multi-class classification, respectively, and
are categorised by dataset.
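The experiment can be approximated with the sketch below, which uses scikit-learn as a stand-in for the AutoML search: processed code lines are vectorised with TF-IDF, split 80:20, and a small set of candidate classifiers is compared on accuracy and F1-Score. The column names, the TF-IDF features and the use of class weighting in place of the paper's resampling step are assumptions.

import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline

df = pd.read_csv("LVDAndro_Dataset_03.csv")          # hypothetical file name
X, y = df["processed_code"].astype(str), df["vulnerability_status"]

# 80:20 split, stratified so both classes appear in the train and test sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

candidates = {
    "RF": RandomForestClassifier(n_estimators=100, class_weight="balanced"),
    "MLP": MLPClassifier(hidden_layer_sizes=(64,), max_iter=200),
}
for name, clf in candidates.items():
    model = make_pipeline(TfidfVectorizer(token_pattern=r"\S+"), clf)
    model.fit(X_train, y_train)
    preds = model.predict(X_test)
    print(name, accuracy_score(y_test, preds), f1_score(y_test, preds))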
Table 4: Performance comparison of AutoML models in
binary classification.
Sub dataset Name Binary Classification
Accuracy F1-Score Top Classifier
Dataset 01
APKs Qark 91% 0.90 RF
Source Qark 91% 0.90 RF
All Qark 91% 0.90 MLP
APKs MobSF 91% 0.90 RF
Source MobSF 91% 0.90 SVC
All MobSF 91% 0.90 MLP
APKs Combined 92% 0.91 MLP
Source Combined 92% 0.90 MLP
All Combined 92% 0.90 MLP
Dataset 02
APKs Combined 93% 0.92 RF
Source Combined 93% 0.91 RF
All Combined 93% 0.91 RF
Dataset 03
APKs Combined 94% 0.94 RF
According to Table 4 and Table 5, it can be ob-
served that the combined approach yielded better re-
sults when using APKs, source files, and both for
models in Dataset 01. As a result, only the com-
bined approach was used to train AutoML models in
Dataset 02. When training with Dataset 02, it was
discovered that the APKs combined approach outperformed the source combined and all combined approaches.
Table 5: Performance comparison of AutoML models in
multi-class classification.
Sub dataset Name Multi-class Classification
Accuracy F1-Score Top Classifier
Dataset 01
APKs Qark 91% 0.82 RF
Source Qark 91% 0.81 RF
All Qark 91% 0.81 RF
APKs MobSF 91% 0.84 RF
Source MobSF 91% 0.83 RF
All MobSF 91% 0.83 RF
APKs Combined 92% 0.88 RF
Source Combined 92% 0.84 RF
All Combined 92% 0.86 RF
Dataset 02
APKs Combined 93% 0.91 RF
Source Combined 93% 0.85 RF
All Combined 93% 0.87 RF
Dataset 03
APKs Combined 94% 0.93 MLP
Therefore, only APKs were used for
scanning in Dataset 03. Furthermore, using multiple
sources to download APKs could potentially reduce
bias and impact the overall performance. Increasing
the dataset size resulted in a continuous improvement
in F1-Scores for both binary and multi-class classi-
fications. Minimising false positives and false nega-
tives is crucial to enhance the efficiency of any ML-
based solution, with minimising false negatives being
more critical in this problem. To accomplish this, sev-
eral measures, such as improving data quality during
preprocessing and training, were taken to reduce both
types of false alarms.
5.2 AutoML Model Comparison
An API was developed to detect vulnerable code lines and the associated CWE-ID using an AutoML model trained on the LVDAndro dataset. The API re-
quires source code lines as input, and in this experi-
ment, it was tested on a set of 3,312 source code lines
(unseen data) comprising both vulnerable examples
from the CWE repository and non-vulnerable exam-
ples from real applications. Subsequently, an APK
was created by incorporating the same set of 3,312
source code lines, and the APK was scanned using
MobSF and Qark Scanners. The accuracies of the
three approaches are reported in Table 6.
Table 6: Accuracy comparison of proposed ML model with
MobSF and Qark.
Approach Accuracy
MobSF 91%
Qark 89%
Proposed Approach 94%
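A prediction service of the kind described above could look like the following Flask sketch, which loads a binary model and a CWE multi-class model and exposes a /predict endpoint accepting code lines; the endpoint name, payload format and model files are illustrative assumptions, as the paper does not specify the API's implementation details.

import joblib
from flask import Flask, jsonify, request

app = Flask(__name__)
# Assumed to be scikit-learn pipelines that accept raw code strings.
binary_model = joblib.load("binary_model.joblib")      # vulnerable or not
multiclass_model = joblib.load("cwe_model.joblib")     # predicts the CWE-ID

@app.route("/predict", methods=["POST"])
def predict():
    payload = request.get_json(silent=True) or {}
    results = []
    for line in payload.get("code_lines", []):
        vulnerable = int(binary_model.predict([line])[0])
        cwe = multiclass_model.predict([line])[0] if vulnerable else None
        results.append({"code": line, "vulnerable": vulnerable, "cwe_id": cwe})
    return jsonify(results)

if __name__ == "__main__":
    app.run(port=5000)

A client would then POST a JSON body such as {"code_lines": ["Log.d(tag, password);"]} and receive a per-line vulnerability status and CWE-ID.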
The detection techniques of MobSF and Qark rely
on signatures, which are known for producing a high
number of false negatives, while maintaining accu-
racy in terms of true positives. To overcome this
limitation, the proposed ML-based technique trained
on LVDAndro can be applied, as it has learned from
the strengths of both scanners to make it more ro-
bust. With an accuracy of 94%, the proposed ML
model could accurately predict the vulnerability as-
sociated with tested code. This test was performed
using unseen source code, demonstrating that the
proposed method can detect vulnerabilities in new
APKs with high accuracy, confirming the hypothe-
sis (PoC). By incorporating additional scanners into
the pipeline and expanding the dataset with regular
updates that include data related to novel vulnerabil-
ities, the proposed method’s accuracy can be further
improved. Additionally, the size and quality of the
labelled dataset can also be increased.
6 CONCLUSION AND FUTURE
WORKS
When developing Android mobile applications, it is essential to adopt security-focused practices from the early stages of the development cycle, and automated tool support is valuable in doing so. One way to support app developers in identifying source code vulnerabilities is by apply-
ing AI methods. This study presents a dataset called
LVDAndro, which contains over 20 million distinct
source code samples, labelled based on CWE-IDs, for
identifying Android source code vulnerabilities. The
dataset can be used to train machine learning mod-
els to predict vulnerabilities, achieving 94% accuracy
in binary and multi-class classification, with 0.94 and
0.93 F1-Scores, respectively. The dataset is available
on GitHub, and efforts are underway to expand it and increase its sample size for deeper learning
models. The addition of more scanners can further
increase the model’s accuracy. Adopting security-
focused practices and receiving automated tool sup-
port is important for developing secure Android apps.
REFERENCES
Allix, K., Bissyandé, T. F., Klein, J., and Le Traon, Y.
(2016). Androzoo: Collecting millions of android
apps for the research community. In Proceedings of
the 13th International Conference on Mining Software
Repositories, MSR ’16, page 468–471, New York,
NY, USA. ACM.
Bagheri, H., Kang, E., Malek, S., and Jackson, D. (2018). A
formal approach for detection of security flaws in the
android permission system. Formal Aspects of Com-
puting, 30(5):525–544.
Bagheri, H., Sadeghi, A., Garcia, J., and Malek, S. (2015).
Covert: Compositional analysis of android inter-app
permission leakage. IEEE transactions on Software
Engineering, 41(9):866–886.
Calzavara, S., Grishchenko, I., and Maffei, M. (2016).
Horndroid: Practical and sound static analysis of an-
droid applications by smt solving. In 2016 IEEE Euro-
pean Symposium on Security and Privacy (EuroS&P),
pages 47–62, Saarbruecken, Germany. IEEE.
Challande, A., David, R., and Renault, G. (2022). Build-
ing a commit-level dataset of real-world vulnerabili-
ties. In Proceedings of the Twelfth ACM Conference
on Data and Application Security and Privacy, CO-
DASPY ’22, page 101–106, New York, USA. ACM.
Gajrani, J., Tripathi, M., Laxmi, V., Somani, G., Zemmari,
A., and Gaur, M. S. (2020). Vulvet: Vetting of vulner-
abilities in android apps to thwart exploitation. Digital
Threats: Research and Practice, 1(2):1–25.
Ghaffarian, S. M. and Shahriari, H. R. (2017). Software
vulnerability analysis and discovery using machine-
learning and data-mining techniques: A survey. ACM
Comput. Surv., 50(4).
Goaër, O. L. (2020). Enforcing green code with android
lint. In Proceedings of the 35th IEEE/ACM Interna-
tional Conference on Automated Software Engineer-
ing Workshops, ASE ’20, page 85–90, New York, NY,
USA. ACM.
Hanif, H. and Maffeis, S. (2022). Vulberta: Simplified
source code pre-training for vulnerability detection. In
2022 International Joint Conference on Neural Net-
works (IJCNN), pages 1–8.
Mitra, J. and Ranganath, V.-P. (2017). Ghera: A repository
of android app vulnerability benchmarks. In Proceed-
ings of the 13th International Conference on Predic-
tive Models and Data Analytics in Software Engineer-
ing, PROMISE, page 43–52, New York, NY, USA.
ACM.
Namrud, Z., Kpodjedo, S., and Talhi, C. (2019). Androvul:
A repository for android security vulnerabilities. In
Proceedings of the 29th Annual International Confer-
ence on Computer Science and Software Engineering,
CASCON ’19, page 64–71, USA. IBM Corp.
Qin, J., Zhang, H., Guo, J., Wang, S., Wen, Q., and Shi,
Y. (2020). Vulnerability detection on android apps–
inspired by case study on vulnerability related with
web functions. IEEE Access, 8:106437–106451.
Senanayake, J., Kalutarage, H., and Al-Kadri, M. O.
(2021). Android mobile malware detection using ma-
chine learning: A systematic review. Electronics,
10(13):1606.
Senanayake, J., Kalutarage, H., Al-Kadri, M. O., Petrovski,
A., and Piras, L. (2022). Developing secured android
applications by mitigating code vulnerabilities with
machine learning. In Proceedings of the 2022 ACM
on Asia Conference on Computer and Communica-
tions Security, ASIA CCS ’22, page 1255–1257, New
York, NY, USA. ACM.
Senanayake, J., Kalutarage, H., Al-Kadri, M. O., Petrovski,
A., and Piras, L. (2023). Android source code vulner-
ability detection: A systematic literature review. ACM
Comput. Surv., 55(9).
Shezan, F. H., Afroze, S. F., and Iqbal, A. (2017). Vulner-
ability detection in recent android apps: An empiri-
cal study. In 2017 International Conference on Net-
working, Systems and Security (NSysS), pages 55–63,
Dhaka, Bangladesh. IEEE.
Simonin, D. (2023). Fossdroid. https://fossdroid.com/. Ac-
cessed: 2023-01-02.
Souppaya, M., Scarfone, K., and Dodson, D. (2021). Secure
software development framework: Mitigating the risk
of software vulnerabilities. Technical report, NIST.
Statcounter (2023). Mobile operating system mar-
ket share worldwide. https://gs.statcounter.com/
os-market-share/mobile/worldwide/. Accessed:
2023-01-02.
Statista (2023). Average number of new an-
droid app releases via google play per month.
https://www.statista.com/statistics/1020956/
android-app-releases-worldwide/. Accessed:
2023-02-02.