Table 5: Classification results.
Test set                          Correct classification
leave-one-out malware clusters    59 / 65 (90.77%)
malware sequences                 654 / 654 (100%)
benign sequences                  499 / 521 (95.77%)
overall                           1212 / 1240 (97.74%)
deviations. We tested all benign sequences and all
malware sequences that did not already belong
to a cluster. The results are presented in Table 5.
A sequence is considered correctly classified if it is
predicted to belong to the right cluster (for leave-one-out
malware sequences) or predicted not to belong to any
cluster (for all other sequences). The overall accuracy of
97.74% is close to the state of the art (98%
(Deshotels et al., 2014), 99% (Aafer et al., 2013),
97.3-99% (Yerima et al., 2015)). The limitation is that
we could not build a consistent Γ-embedding DAG for
the whole dataset. The DAG creation or clustering
process needs improvement before the Γ-embedding DAG
can be used more widely. However, the results show that
this mathematical object (the Γ-embedding DAG) is
useful for production applications.
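The correctness criterion and the figures in Table 5 can be checked with a few lines of Python. This is only a sketch: `is_correct`, `accuracy` and `table5` are hypothetical names introduced for illustration, the cluster prediction itself is not modelled, and the counts are copied from Table 5.

```python
from typing import Optional


def is_correct(predicted: Optional[str], expected: Optional[str]) -> bool:
    """Correctness criterion as described in the text.

    `expected` is the true cluster of a leave-one-out malware sequence,
    or None for a sequence that should not match any cluster.
    `predicted` is the cluster returned by the classifier, or None.
    """
    return predicted == expected


def accuracy(correct: int, total: int) -> float:
    """Percentage of correctly classified sequences."""
    return 100.0 * correct / total


# (correct, total) per test set, copied from Table 5.
table5 = {
    "leave-one-out malware clusters": (59, 65),
    "malware sequences": (654, 654),
    "benign sequences": (499, 521),
}

overall_correct = sum(c for c, _ in table5.values())
overall_total = sum(t for _, t in table5.values())
overall_accuracy = accuracy(overall_correct, overall_total)

for name, (c, t) in table5.items():
    print(f"{name}: {c} / {t} ({accuracy(c, t):.2f}%)")
print(f"overall: {overall_correct} / {overall_total} ({overall_accuracy:.2f}%)")
```

The row counts sum to 1212 / 1240, which matches the overall line of the table. With standard rounding the benign row prints 95.78%, so the 95.77% in Table 5 appears to be truncated rather than rounded.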
This article contributes to the state of the art by defining,
formalizing and constructing a new representation
for the common subsequences of a sequence set,
called the Γ-embedding DAG. We showed that the Γ-embedding DAG
contains all the information about the common subsequences
and expresses it in a very compact form. We designed
an algorithm that builds this structure
for sequences without intra-repetitions. For
other sequences, we designed an algorithm that
constructs a structure close to the solution.
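To make the notion of "common subsequences of a sequence set" concrete, the following sketch tests whether a candidate is a subsequence of every sequence in a set. It does not build the Γ-embedding DAG. All function names and the example API call names are hypothetical, and `has_intra_repetition` assumes that an intra-repetition means a symbol occurring more than once within a single sequence.

```python
from typing import Hashable, Iterable, Sequence


def is_subsequence(candidate: Sequence[Hashable],
                   sequence: Iterable[Hashable]) -> bool:
    """True if `candidate` can be obtained from `sequence` by deleting items."""
    it = iter(sequence)
    # `item in it` consumes the iterator up to the first match,
    # so the relative order of the items is respected.
    return all(item in it for item in candidate)


def is_common_subsequence(candidate: Sequence[Hashable],
                          sequences: Iterable[Sequence[Hashable]]) -> bool:
    """True if `candidate` is a subsequence of every sequence in the set."""
    return all(is_subsequence(candidate, s) for s in sequences)


def has_intra_repetition(sequence: Sequence[Hashable]) -> bool:
    """Assumed meaning: some symbol occurs more than once in the sequence."""
    return len(set(sequence)) != len(sequence)


# Hypothetical API call traces, for illustration only.
traces = [
    ["open", "read", "send", "close"],
    ["open", "send", "read", "close"],
]

common = is_common_subsequence(["open", "read", "close"], traces)
not_common = is_common_subsequence(["read", "send"], traces)
repeated = has_intra_repetition(["open", "read", "read", "close"])
```

Here `common` is true because `open`, `read`, `close` appear in that order in both traces, while `not_common` is false because the second trace has `send` before `read`.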
We assessed its utility for classification heuristics.
With simple metrics, we reached 97.74% accuracy in
singling out clustered malware from other applications
using the sequences of Android API calls executed
during dynamic analysis. While this result competes
with state-of-the-art machine-learning malware detection,
it also shows that the Γ-embedding DAG conveys
enough information for classification. Exploiting
this representation for data mining requires
further research.
