Figure 17: Comparison of robust hashing approaches.
mewhat better, but our best robust hashing results are
comparable. With respect to robust hashing, we find
that these techniques perform better than the SVM on
most families, but a few malware families are extremely
difficult to classify. In contrast, the SVM yields
more consistent results over the 25 families.
In spite of the slightly lower accuracy, there are
some potential advantages to robust hashing. Perhaps
the biggest of these is that it is easy to add new
families as they appear: we simply generate a syndrome
(which, in effect, defines a cluster) for the new family,
and the remainder of the robust hashing classification
model is unchanged. In contrast, for an SVM, we would
need to retrain the model each time a new family is
added. For large problems, the SVM training cost, in
terms of both time and computational resources, would
be substantial.
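The incremental-update property can be illustrated with a minimal sketch. It assumes, purely for illustration, that each family is summarized by a single integer-encoded reference hash (standing in for the syndrome) and that a sample is assigned to the family at minimum Hamming distance. The class name, method names, and bit patterns are hypothetical and are not taken from our implementation.

```python
# Illustrative sketch only: the per-family "reference hash" stands in for the
# syndrome described in the text; it is not the construction used in the paper.

def hamming_distance(a, b):
    """Number of bit positions in which two integer-encoded hashes differ."""
    return bin(a ^ b).count("1")


class HashClassifier:
    """Minimum-distance classifier holding one reference hash per family."""

    def __init__(self):
        self.references = {}

    def add_family(self, name, reference_hash):
        # Adding a family only creates its own entry; nothing is retrained.
        self.references[name] = reference_hash

    def classify(self, sample_hash):
        # Pick the family whose reference hash is closest in Hamming distance.
        return min(
            self.references,
            key=lambda fam: hamming_distance(self.references[fam], sample_hash),
        )


clf = HashClassifier()
clf.add_family("family_a", 0b11110000)
clf.add_family("family_b", 0b00001111)
snapshot = dict(clf.references)

# A new family appears later: only one new entry is written.
clf.add_family("family_c", 0b10101010)
unchanged = all(clf.references[k] == v for k, v in snapshot.items())

print(unchanged, clf.classify(0b10101000))
```

The existing entries are untouched when the third family is added, which is the contrast with an SVM that would have to be refit on the enlarged training set.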
It appears that malware analysis is a novel application
of robust hashing. It is also worth noting that robust
hashing is a fairly general and somewhat amorphous
concept, so a multitude of variations can be considered.
In addition to testing some of the many possible forms
of robust hashing, future work could include an analysis
of additional features. We could also obtain a more
fine-grained view by analyzing each section of an
executable (e.g., .text, .rdata, and so on) separately.
In addition, hybrid classification techniques that
combine robust hashing with any of a variety of machine
learning techniques would be an interesting topic for
further research. For example, we could apply various
robust hashing techniques to a variety of image features,
then apply a machine learning classifier to the results
of these robust hashing algorithms.
Within the robust hashing paradigm, it would be
worthwhile to experiment with more sophisticated coding
techniques, such as Reed-Muller codes or trellis-coded
modulation (TCM). Another type of hybrid model would
use non-coding-based compression strategies within a
robust hashing scheme. Techniques such as K-means, EM
clustering, k-nearest neighbor, and Gaussian mixture
models, for example, would fit naturally within a robust
hashing framework.
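For a linear block code, the compression step amounts to mapping a binary vector to its syndrome under the code's parity-check matrix, so swapping in a different code changes only that matrix. The following minimal sketch uses the (7,4) Hamming code as a small, well-known stand-in (it is not a Reed-Muller code or TCM). The 7-bit input vectors and the helper names are assumptions made for illustration.

```python
# Parity-check matrix of the (7,4) Hamming code: column i (1-indexed) is the
# binary expansion of i. Used here only as a small example of a linear code.
H = [
    [0, 0, 0, 1, 1, 1, 1],
    [0, 1, 1, 0, 0, 1, 1],
    [1, 0, 1, 0, 1, 0, 1],
]


def syndrome(bits):
    """Return H * bits (mod 2) as a tuple of 3 bits."""
    if len(bits) != 7:
        raise ValueError("expected a 7-bit vector")
    return tuple(sum(h * b for h, b in zip(row, bits)) % 2 for row in H)


def xor(u, v):
    """Bitwise sum (mod 2) of two equal-length bit vectors."""
    return [a ^ b for a, b in zip(u, v)]


codeword = [1, 1, 1, 0, 0, 0, 0]   # a valid codeword: its syndrome is zero
x = [1, 0, 1, 1, 0, 0, 1]          # hypothetical binarized feature vector
y = xor(x, codeword)               # differs from x by a codeword

# x and y lie in the same coset, so they compress to the same 3-bit syndrome.
print(syndrome(codeword), syndrome(x), syndrome(y))
```

Vectors that differ by a codeword share a syndrome, which is the sense in which a syndrome defines a cluster: seven input bits are reduced to three, and every vector in the same coset receives the same label.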
BASS 2018 - International Workshop on Behavioral Analysis for System Security