cesses, we store the applications in a Cassandra
NoSQL database (Apache (2008)) and, before
downloading a new one, we check that the application
is absent from the database.
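The check-before-download step can be sketched as follows. This is a minimal illustration, not the crawler's actual code: the class InMemoryAppStore is a hypothetical in-memory stand-in for the Cassandra table, the function download_if_absent and the fetch callback are invented for the example, and keying applications by package name is an assumption, since the text does not say which identifier is checked.

```python
from typing import Callable, Dict


class InMemoryAppStore:
    """Hypothetical stand-in for the database table of known applications."""

    def __init__(self) -> None:
        self._apps: Dict[str, bytes] = {}

    def contains(self, app_id: str) -> bool:
        return app_id in self._apps

    def save(self, app_id: str, apk: bytes) -> None:
        self._apps[app_id] = apk


def download_if_absent(
    store: InMemoryAppStore,
    app_id: str,
    fetch: Callable[[str], bytes],
) -> bool:
    """Download and store an application only if it is not already known.

    Returns True if a download happened, False if it was skipped.
    """
    if store.contains(app_id):
        return False
    store.save(app_id, fetch(app_id))
    return True
```

Note that this check-then-save sequence is not atomic. With several crawler processes sharing one database, two of them could pass the check at the same time, so a real deployment would likely need a conditional insert on the database side.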
The crawler has been used for two purposes. First,
in October 2016, it was configured not to download
applications, but only to find them; it generated a
list of more than 2 million applications. Second, in
January 2017, we used it to download and extract the
features (see Section 3) of 3,650 applications, which
constitute our sample for this study. For this second
use, we ran it on a laptop with a 2-core, 4-thread
Intel CPU, and the process took 2 weeks (depending on
the application, extracting the features can take up
to 15 min, while downloading takes less than one
minute).
3 FEATURE ANALYSIS
3.1 Extraction
We perform static analysis and therefore extract all
the features that may be useful for classification
directly from the APK file. To reverse-engineer
applications, we use Androguard (Bachmann (2015)), a
set of tools written in Python. Below, we describe the
types of features we extract and why. Our static
analysis follows a pattern similar to the one
presented by P. Irolla and E. Filiol at the Black Hat
Asia conference (2015) (Irolla and Filiol (2015)).
• General Information:
This information is not used directly in the
analysis, but rather to identify and contextualize
each application. We extract it mainly from the
application manifest (Android (2017a)).
First, we retrieve the application's common name
(e.g., Facebook Messenger), the package name
(com.facebook.orca), and the version number.
Together, these are supposed to identify the
application, but since an Android application is easy
to repackage, we use the SHA-256 hash of the APK
instead.
We also extract the certificate used to sign the
application (to be published on the Play Store, an
application must be signed). From this certificate,
we can extract information about the developer.
We also extract the SDK information (Android (2012))
(minimum/maximum/target version), the intents for the
statically defined receivers (communication between
applications), the intents for the activities
(starting a foreground process inside an application),
and the intents for the services (background
processes). Finally, we retrieve all the URLs
statically defined in the application.
• Classification Information:
These are the features on which the detection is
based. They are retrieved by decompiling the .dex
files in the application. These files store the
Dalvik bytecode (compiled from Java) that is executed
by Dalvik (or the Android Runtime), the Android
virtual machine (Android (2017b)). We chose these
features because, based on other researchers' work,
they seemed promising for static analysis (see
G. Canfora and Visaggio (2015b); G. Canfora and
Visaggio (2015a); D. Arp and Rieck (2014)).
Most of the information we extract is based on the
opcodes themselves (the instructions executed by
Dalvik), without their operands.
The first feature we use is opcode frequencies
(G. Canfora and Visaggio (2015b)). As a more
representative value, we also store trigram
frequencies (an N-gram is a sequence of N adjacent
opcodes) (G. Canfora and Visaggio (2015a)).
In Figure 3, the opcodes are (invoke-static,
move-result-object, if-eqz, invoke-direct,
return-object, const/4, goto), and the corresponding
trigrams would be
([invoke-static:move-result-object:if-eqz],
[move-result-object:if-eqz:invoke-direct],
[if-eqz:invoke-direct:return-object],
[invoke-direct:return-object:const/4],
[return-object:const/4:goto]).
To use certain Android API methods, Android
applications have to request specific permissions
(network use, Bluetooth, access to user information,
...). Malware also has to request these permissions,
and some permissions give access to more potentially
malicious behaviour; permissions are therefore a good
way to analyze applications (D. Arp and Rieck (2014)).
Finally, we extract the Android API call sequence.
The API is a software interface (a set of functions)
used by developers to perform device-related tasks
(sending SMS, manipulating the UI, etc.). Some of
these tasks can be used maliciously. These calls were
also found to be relevant for detecting Android
malware (D. Arp and Rieck (2014)). For Figure 3, the
API calls would be api function call1, api function
call2, as well as the constructor of api type1.
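Two of the steps described in the list above lend themselves to a short illustration: identifying an APK by its SHA-256 hash, and counting opcodes and opcode trigrams. The sketch below uses only the Python standard library and the opcode sequence quoted from Figure 3. The function names (sha256_of_file, ngrams) are hypothetical and are not part of Androguard, and the sketch stores raw counts, since the text does not say whether the frequencies are normalized.

```python
import hashlib
from collections import Counter
from typing import List, Tuple


def sha256_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def ngrams(opcodes: List[str], n: int = 3) -> List[Tuple[str, ...]]:
    """Return every run of n adjacent opcodes, in order of appearance."""
    return [tuple(opcodes[i:i + n]) for i in range(len(opcodes) - n + 1)]


# Opcode sequence given in the text for Figure 3 (operands dropped).
FIGURE_3_OPCODES = [
    "invoke-static",
    "move-result-object",
    "if-eqz",
    "invoke-direct",
    "return-object",
    "const/4",
    "goto",
]

# Raw counts; dividing by the total would give relative frequencies.
OPCODE_COUNTS = Counter(FIGURE_3_OPCODES)
TRIGRAMS = ngrams(FIGURE_3_OPCODES)
TRIGRAM_COUNTS = Counter(TRIGRAMS)
```

A sequence of 7 opcodes yields 7 - 3 + 1 = 5 trigrams, which matches the five trigrams listed for Figure 3. In a real pipeline the opcode list would come from the disassembled .dex files rather than from a hard-coded constant.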
We extracted these features from a collection of
applications. There is a total of 228 Dalvik opcodes.
The applications contained on average 80,000 to
100,000 opcodes, but a few reached more