Table 5: Classification results.

Test set                          Correct classification
leave-one-out malware clusters    59 / 65 (90.77%)
malware sequences                 654 / 654 (100%)
benign sequences                  499 / 521 (95.77%)
overall                           1212 / 1240 (97.74%)

deviations. We tested all benign sequences and all
malware sequences that did not already belong to a
cluster. The results are presented in Table 5.
A sequence is considered correctly classified if it is
predicted to belong to the right cluster, for the
leave-one-out malware sequences, or if it is predicted
not to belong to any cluster, for the rest of the
sequences. The overall accuracy of 97.74% is close to
the state of the art (98% (Deshotels et al., 2014), 99%
(Aafer et al., 2013), 97.3-99% (Yerima et al., 2015)).
The limitation is that
we could not build a consistent Γ-embedding DAG for
the whole dataset. The process of DAG creation or
clustering needs improvement to enable wider use of
the Γ-embedding DAG. However, the result shows that
this mathematical object (the Γ-embedding DAG) is
useful for production applications.
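
The correctness criterion and the totals of Table 5 can be
checked with a minimal sketch. The counts are copied from
Table 5; the function names, and the use of None to mean
"belongs to no cluster", are assumptions made for this
illustration and are not taken from the article's
implementation.

```python
from typing import Dict, Optional, Tuple

# (correct, total) counts copied from Table 5.
TABLE_5: Dict[str, Tuple[int, int]] = {
    "leave-one-out malware clusters": (59, 65),
    "malware sequences": (654, 654),
    "benign sequences": (499, 521),
}


def is_correct(predicted: Optional[str], expected: Optional[str]) -> bool:
    """Correctness rule described in the text (hypothetical encoding).

    `expected` is the cluster label of a leave-one-out malware sequence,
    or None for a sequence that should not belong to any cluster.
    `predicted` uses the same convention.
    """
    return predicted == expected


def accuracy_percent(correct: int, total: int) -> float:
    """Share of correctly classified sequences, as a percentage."""
    return 100.0 * correct / total


def overall(counts: Dict[str, Tuple[int, int]]) -> Tuple[int, int]:
    """Sum the correct and total counts over all rows."""
    correct = sum(c for c, _ in counts.values())
    total = sum(t for _, t in counts.values())
    return correct, total


if __name__ == "__main__":
    for name, (c, t) in TABLE_5.items():
        print(f"{name}: {c} / {t} ({accuracy_percent(c, t):.2f}%)")
    c, t = overall(TABLE_5)
    print(f"overall: {c} / {t} ({accuracy_percent(c, t):.2f}%)")
```

The three rows sum to the overall 1212 / 1240 reported in
Table 5. With two-decimal rounding the benign row comes out
as 95.78%, so the 95.77% in the table appears to be
truncated rather than rounded.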
9 CONCLUSION
This article contributes to the state of the art by
defining, formalizing and constructing a new
representation for the common subsequences of a sequence
set, called the AΓ DAG. We showed that the AΓ DAG
contains all information about the common subsequences
and expresses it in a very compact form. We succeeded
in designing an algorithm that builds this construction
for sequences without intra-repetitions. For other
sequences, we designed an algorithm that constructs a
structure close to the solution.
We assessed its utility for classification heuristics.
With simple metrics, we reached 97.74% accuracy in
singling out clustered malware from other applications,
using the sequences of Android API calls they executed
during dynamic analysis. This result is competitive
with state-of-the-art malware detection based on machine
learning, and it shows that the Γ-embedding DAG conveys
enough information for classification. Exploiting this
representation for data mining needs further research.
REFERENCES
Aafer, Y., Du, W., and Yin, H. (2013). Droidapiminer:
Mining api-level features for robust malware detection
in android. pages 86–103. DOI
10.1007/978-3-319-04283-1_6.
Aha, D. W., Kibler, D., and Albert, M. K. (1991).
Instance-based learning algorithms. Machine learn-
ing, 6(1):37–66.
Aho, A. V., Garey, M. R., and Ullman, J. D.
(1972). The transitive reduction of a directed graph.
SIAM Journal on Computing, 1(2):131–137. DOI
10.7146/math.scand.a-10849.
Arp, D., Spreitzenbarth, M., Hubner, M., Gascon, H.,
Rieck, K., and Siemens, C. (2014). Drebin: Effective
and explainable detection of android malware in your
pocket. 14:23–26. DOI 10.14722/ndss.2014.23247.
Cheatham, M. and Hitzler, P. (2013). String similarity met-
rics for ontology alignment. pages 294–309.
Daciuk, J., Mihov, S., Watson, B. W., and Watson,
R. E. (2000). Incremental construction of minimal
acyclic finite-state automata. Computational linguis-
tics, 26(1):3–16. DOI 10.3115/1611533.1611538.
D’Angelo, J. P. and West, D. B. (2000). Mathemati-
cal Thinking: Problem-Solving and Proofs, 2nd ed.
Prentice-Hall. DOI 10.4324/9781315044613.
Deshotels, L., Notani, V., and Lakhotia, A. (2014). Droi-
dlegacy: Automated familial classification of android
malware. page 3. DOI 10.1145/2556464.2556467.
Hirschberg, D. S. (1975). A linear space algorithm for
computing maximal common subsequences. Com-
munications of the ACM, 18(6):341–343. DOI
10.1145/360825.360861.
Irolla, P. and Dey, A. (2018). The duplication issue within
the drebin dataset. Journal of Computer Virology and
Hacking Techniques, pages 1–5.
Irolla, P. and Filiol, E. (2017). Glassbox: Dynamic analy-
sis platform for malware android applications on real
devices. ForSE. DOI 10.5220/0006094006100621.
Lin, Y.-D., Lai, Y.-C., Lu, C.-N., Hsu, P.-K., and Lee, C.-
Y. (2015). Three-phase behavior-based detection and
classification of known and unknown malware. Secu-
rity and Communication Networks, 8(11):2004–2015.
DOI 10.1002/sec.1148.
Mount, D. W. (2004). Bioinformatics: Sequence and
Genome Analysis.
Shen, F., Del Vecchio, J., Mohaisen, A., Ko, S., and Ziarek,
L. (2018). Android malware detection using complex-
flows. IEEE Transactions on Mobile Computing.
Yerima, S. Y., Sezer, S., and Muttik, I. (2015). High accu-
racy android malware detection using ensemble learn-
ing. IET Information Security, 9(6):313–320. DOI
10.1049/iet-ifs.2014.0099.
ForSE 2019 - 3rd International Workshop on FORmal methods for Security Engineering