tive is to make an exact match, which is the case for
this paper, but there also exist hash variants that
perform fuzzy matching, such as ssdeep (Kornblum, 2006),
with variations (Baier and Breitinger, 2011).
When performing positive matching, the precomputed
hash set contains hashes of files that are of particular
interest to the examiner. Locating any occurrence of
these files on the media under examination is thus the
purpose of these kinds of examinations. One example
where this methodology is commonly used is
investigations related to Child Sexual Abuse (CSA)
material. By performing locating hashing against a hash
set of known illegal material, such material can easily
be found on the storage media.
Negative matching is instead used to remove files
from further examination. In this case, the hash set
contains hash values of files known to be benign in
relation to the examination conducted, such as
unmodified application and operating system files. One
example where this approach can be used is the
examination of infections by new malware. In these
examinations it is necessary to work broadly, as the
malware could have modified a number of files on the
storage media in various ways to perform the spreading,
hiding, and anti-forensics functionalities that have
been designed into it. By using excluding hashing, files
that are known to be unmodified can be safely excluded
from further investigation, thus considerably decreasing
the effort required.
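The two matching modes described above can be sketched as follows. This is a minimal illustration, not the tooling used in this paper: the function names are hypothetical, and SHA-1 is an assumed choice of hash algorithm, since the text does not prescribe one.

```python
import hashlib


def file_digest(path, algorithm="sha1", chunk_size=1 << 20):
    """Hash a file in chunks so large files need not fit in memory.

    The algorithm is an assumption; any exact-match hash works the same way.
    """
    h = hashlib.new(algorithm)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def positive_match(paths, known_hashes):
    """Locating hashing: keep the files whose hash IS in the hash set."""
    return [p for p in paths if file_digest(p) in known_hashes]


def negative_match(paths, known_hashes):
    """Excluding hashing: keep the files whose hash is NOT in the hash set."""
    return [p for p in paths if file_digest(p) not in known_hashes]
```

With positive matching the returned list holds the files of interest; with negative matching it holds the files that still require examination after the known-benign ones have been removed.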
File size information is a potentially useful piece
of information to have in a hash set in addition to the
hash value. Two files with identical content necessarily
have the same size, so it is only necessary to compute
hashes for files on the storage media whose file size is
identical to a file size that exists in the hash set.
Depending on the size distribution of the files in the
hash set and of the files on the storage media, a smaller
or larger fraction of the files on the storage media can
be skipped without having to compute a hash value for
them. This contributes to a corresponding decrease in the
amount of time needed to process all the information on
the storage media. Time is saved both by not having to
read the file contents and compute the hash, and by
avoiding a seek operation to the location on the storage
media where the file is located.
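The size check described above can be sketched as a prefilter in front of the hash computation. This is an illustrative sketch under stated assumptions: the hash set is assumed to be available as (size, digest) pairs, SHA-1 is an assumed algorithm, and the function name is hypothetical.

```python
import hashlib
import os


def match_with_size_prefilter(paths, hash_set):
    """Positive matching that hashes only files whose size is in the hash set.

    hash_set: iterable of (size_in_bytes, hex_digest) pairs (assumed layout).
    Returns (matches, hashed_count), where hashed_count is the number of
    files whose contents actually had to be read and hashed.
    """
    entries = set(hash_set)
    known_sizes = {size for size, _ in entries}
    matches, hashed_count = [], 0
    for path in paths:
        size = os.path.getsize(path)  # file system metadata only
        if size not in known_sizes:
            continue  # cannot match any entry: skip the read and the hash
        hashed_count += 1
        with open(path, "rb") as f:
            # Whole-file read keeps the sketch short; chunk it for large files.
            digest = hashlib.sha1(f.read()).hexdigest()
        if (size, digest) in entries:
            matches.append(path)
    return matches, hashed_count
```

Comparing `hashed_count` with the total number of files gives the fraction of files for which reading and hashing was avoided.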
The purpose of the work reported here is to provide
some empirically based intuition on the order of
magnitude of the improvements that can be obtained by
using file size side information when performing hash
matching. To make these examinations, a number of
evaluation data sets were used, which are described in
the next section.
3 EVALUATION DATA SETS
To perform the evaluation, five different file size data
sets from different sources were used. These data sets
fall into two categories. The first category is hash
data sets, which provide file size information for files
that are used to create hash sets. The second category
is scan data sets, which reflect the actual contents of
a number of storage devices. In an investigation, hash
sets are used when examining files on storage devices.
The file sizes on a device follow a particular
distribution, such as the distributions exemplified by
the scan data sets. Some general statistics on the data
sets are provided in Table 1.
Table 1: Data set characteristics.

Data set     Number of files   Unique sizes   Total file size (GB)
Hash sets
  CSA                 180057          88498                    255
  NSRL              22502929         896382                   5382
Scan sets
  PC                 9984693         188545                   1140
  GOVDOCS             986278         340955                    466
  RDC                2689123         149012                   4930
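As a rough reading aid for Table 1, the share of unique file sizes per data set can be derived directly from the first two columns. The sketch below only restates the table values; the helper name is hypothetical and the ratio is not a metric defined in this paper.

```python
# (number of files, unique file sizes) per data set, copied from Table 1.
TABLE_1 = {
    "CSA": (180057, 88498),
    "NSRL": (22502929, 896382),
    "PC": (9984693, 188545),
    "GOVDOCS": (986278, 340955),
    "RDC": (2689123, 149012),
}


def unique_size_fraction(n_files, n_unique_sizes):
    """Unique file sizes divided by the number of files in the data set."""
    return n_unique_sizes / n_files


for name, (n_files, n_unique) in TABLE_1.items():
    print(f"{name:8s} {unique_size_fraction(n_files, n_unique):.3f}")
```

A low fraction means that many files share the same size, so a set of unique sizes is considerably smaller than the file collection it was derived from.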
3.1 Hash Data Sets
CSA Data Set
This file size data set was obtained from a law
enforcement organization in a European country. The
organization maintains its own collection of files that
are known to be illicit Child Sexual Abuse (CSA)
material according to national legal rules. This
collection is used to create hash sets employed to
perform positive, or locating, hashing when new incoming
material is examined. At the date of data collection,
the collection consisted of 180057 files. Since a number
of those files have exactly the same size, the number of
unique file sizes is lower (88498). These unique file
sizes make up the CSA data set, for which the unique file
size distribution is shown in Figure 1.
The left histogram shows the file size distribution
in the range 0 to 100,000 bytes, with each bar
representing the number of unique file sizes in a
particular interval. For example, the interval 0 to
1,000 bytes has quite a small number of unique file
sizes, only around 100. Furthermore, we can see that in
the range from approximately 10,000 to 40,000 bytes the
number of unique file sizes is very high. Following the
downward slope, we can see that around a file size of
70,000 bytes the number of unique file sizes is around
500.
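A histogram of this kind can be computed along the following lines. This is a sketch only: the bin width of 1,000 bytes is an assumption inferred from the "0 to 1000 bytes" example, and the function name is hypothetical.

```python
from collections import Counter


def unique_size_histogram(file_sizes, bin_width=1000, upper=100_000):
    """Count unique file sizes per bin over the range [0, upper).

    bin_width=1000 is an assumed value, not one stated for Figure 1.
    Returns a list with one count per bin, lowest bin first.
    """
    unique_sizes = set(file_sizes)  # duplicate sizes count only once
    counts = Counter(s // bin_width for s in unique_sizes if 0 <= s < upper)
    return [counts.get(i, 0) for i in range(upper // bin_width)]
```

Because file sizes are whole numbers of bytes, a bin of width 1,000 can hold at most 1,000 unique sizes, which bounds the height of every bar under this assumed bin width.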
SECRYPT2012-InternationalConferenceonSecurityandCryptography