used throughout this paper) produces a digest 70
characters in length. The TLSH hash digest has the
property that two similar inputs produce similar
digests, based on statistical features of the input
bytes (Oliver & Hagen, 2021). The hash digest is the
concatenation of a digest header and a digest body.
The following steps are involved in computing the
standard TLSH hash:
• All 3-grams from a sliding window of 5 bytes
are used to compute an array of bucket counts,
which are used to form the digest body.
• From the bucket counts computed above, the
three quartiles (referred to as q1, q2, and q3
respectively) are calculated.
• The digest body is constructed from the array of
bucket counts relative to the quartile values,
using two bits per bucket across 128 buckets to
form a 32-byte digest.
• The digest header is composed of a checksum,
the logarithm of the byte string length, and a
compact representation of the histogram of
bucket counts using the ratios between the
quartile points, q1:q3 and q2:q3.
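The digest-body construction described above can be sketched as follows. This is a simplified illustration only: the quartile computation and bit packing are approximations of the real TLSH internals, and `encode_body` is a hypothetical helper name, not the library's API.

```python
def encode_body(bucket_counts):
    """Encode 128 bucket counts into a 32-byte digest body (sketch).

    Each bucket contributes a 2-bit code based on which quartile its
    count falls into; four codes pack into one byte.
    """
    s = sorted(bucket_counts)
    n = len(s)
    # approximate quartile points q1, q2, q3
    q1, q2, q3 = s[n // 4], s[n // 2], s[(3 * n) // 4]
    body = bytearray()
    byte = 0
    for i, count in enumerate(bucket_counts):
        code = 0 if count <= q1 else 1 if count <= q2 else 2 if count <= q3 else 3
        byte = (byte << 2) | code   # pack the 2-bit code
        if i % 4 == 3:              # four codes per output byte
            body.append(byte)
            byte = 0
    return bytes(body)

# 128 buckets -> 32-byte digest body
print(len(encode_body(list(range(128)))))  # 32
```

This shows why two bits per bucket over 128 buckets yields exactly the 32-byte body described above.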
Two TLSH hash digests are compared using the
TLSH distance. A TLSH distance of zero indicates
that the files are likely identical, and larger scores
indicate greater degrees of dissimilarity (see the
original paper for details on the computation of the
distance).
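A simplified sketch of how a distance can be computed over two digest bodies is shown below. The real TLSH distance also includes header terms; the 6-for-3 penalty follows the published scheme, but `body_distance` is an illustrative name under these assumptions, not the library API.

```python
def body_distance(body1: bytes, body2: bytes) -> int:
    """Illustrative distance between two digest bodies (sketch).

    Each byte holds four 2-bit bucket codes; the distance sums the
    per-bucket code differences, penalizing the maximal difference
    of 3 as 6.
    """
    total = 0
    for b1, b2 in zip(body1, body2):
        for shift in (6, 4, 2, 0):          # unpack four 2-bit codes
            c1 = (b1 >> shift) & 0x3
            c2 = (b2 >> shift) & 0x3
            d = abs(c1 - c2)
            total += 6 if d == 3 else d
    return total

# identical bodies have distance zero
print(body_distance(b"\x1b" * 32, b"\x1b" * 32))  # 0
```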
2.2 LZJD
The Lempel-Ziv Jaccard Distance (LZJD) algorithm
was designed by Edward Raff and Charles Nicholas
(Raff & Nicholas, 2017). The inspiration for LZJD
comes from the Normalized Compression Distance,
which measures the ability to compress two inputs
into a similar output using the same compression
technique. This approach has a long history of use in
data mining, and LZJD applies it to determine
malware similarity. First, the Lempel-Ziv Set
algorithm converts a byte sequence into a set of
previously seen byte sub-sequences. Initially, the set
is empty, and the following process is repeated from
the beginning of the byte stream until the end of the
stream is reached: beginning with a sub-sequence of
length 1, if the current sub-sequence has not been
seen, it is added to the set, the pointer moves to the
end of the current sub-sequence, and the desired
sub-sequence length is reset to one. If the
sub-sequence has been seen before, the desired
sub-sequence length is increased by one to
incorporate the next byte.
LZJD compares only a small portion of each set to
speed up the process further: it approximates the
distance by using min-hashing to create a compact
representation of the input string, which reduces the
time and memory required to compute LZJD.
Selecting the k = 1024 minimum hashes from the set
yields an approximation error of around 3%. The
procedure to compare byte sequences is as follows:
• Convert byte sequence B into a set of unique
sub-sequences using the Lempel-Ziv Set
algorithm.
• Hash these unique sub-sequences into a set C of
integers via hash functions.
• Sort the set of integers and keep the k = 1024
smallest values.
• Calculate the Jaccard distance between the two
sets of smallest values:

Approximate LZJD(B_1, B_2) ≈ 1 − J(C_1, C_2)
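The four steps above can be sketched end-to-end as follows. This is a self-contained illustration only: MD5 is an arbitrary choice of hash function here, and `lzjd_distance` is a hypothetical name, not the authors' reference implementation.

```python
import hashlib

def lz_set(data: bytes) -> set:
    """Lempel-Ziv Set: collect previously unseen byte sub-sequences."""
    seen = set()
    start, length = 0, 1
    while start + length <= len(data):
        sub = data[start:start + length]
        if sub not in seen:
            seen.add(sub)
            start, length = start + length, 1
        else:
            length += 1
    return seen

def lzjd_distance(b1: bytes, b2: bytes, k: int = 1024) -> float:
    """Approximate LZJD: min-hash each LZ set, then Jaccard distance."""
    def signature(data):
        # hash each sub-sequence to an integer and keep the k smallest
        hashes = sorted(int.from_bytes(hashlib.md5(s).digest()[:8], "big")
                        for s in lz_set(data))
        return set(hashes[:k])
    c1, c2 = signature(b1), signature(b2)
    return 1.0 - len(c1 & c2) / len(c1 | c2)

print(lzjd_distance(b"hello world" * 10, b"hello world" * 10))  # 0.0
```

Identical inputs produce identical signatures and thus a distance of zero, matching the interpretation of the TLSH and LZJD scores given in the text.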
We can interpret the LZJD score as a rough measure
of byte similarity. For example, consider two inputs
A and B. A score of 0.75 means that, of all the
sub-sequences in LZSet(A) and LZSet(B), 75% of
them can be found in both files. This can be loosely
interpreted as saying that A and B share 75% of their
byte strings (Raff & Nicholas, 2017).
3 EXPERIMENTAL SETUP
Dataset. Our evaluation uses real malware files
obtained from the MalwareBazaar website
(MalwareBazaar, n.d.). The 5138 Windows-based
malware samples from this source belong to 20
families. The training set uses files collected in April
2022, while the testing set consists of samples from
May 1st, 2022.
Classifier. We employ the Nearest Neighbors
classifier with KD Tree indexing. The principle is to
identify a predefined number of training samples
closest to the new point and predict its label from
theirs. This paper uses radius-based neighbor
learning instead of k-nearest neighbor learning: the
number of training samples considered depends on
the local density of points within a fixed radius. Any
metric can be used for the distance, and standard
Euclidean distance is the most common; in this case,
we use the TLSH and LZJD distance scores as the
distance metric.
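The radius-based prediction rule can be sketched with a plug-in distance function. This is a minimal stdlib illustration; `radius_predict` is a hypothetical helper, and in the paper the distance would be the TLSH or LZJD score rather than the toy absolute difference used in the usage line below.

```python
from collections import Counter

def radius_predict(query, train, radius, dist):
    """Predict a label by majority vote over all training samples
    within `radius` of the query under the distance function `dist`."""
    labels = [label for sample, label in train
              if dist(query, sample) <= radius]
    if not labels:
        return None  # no neighbors inside the radius
    return Counter(labels).most_common(1)[0][0]

# toy usage: integer "samples" with absolute difference as the metric
train = [(4, "famA"), (6, "famA"), (100, "famB")]
print(radius_predict(5, train, radius=2, dist=lambda a, b: abs(a - b)))  # famA
```

Because the vote is restricted to neighbors inside the radius, the number of contributing samples adapts to the local density of points, as described above.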