In summary, the implementation of the Fingerprint
Generator must have three properties: identification
ability, small size, and fast processing speed.
The solution we choose is xxHash (Collet,
2016). xxHash is an extremely fast non-cryptographic
hash algorithm that works at speeds close to RAM
limits. It is widely used by software such as
ArangoDB, LZ4, and TeamViewer. Moreover, it
passes the SMHasher (Appleby, 2012) test suite,
which evaluates the collision, dispersion,
and randomness qualities of hash functions.
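As an illustration of the fingerprint computation, the sketch below (assuming the Python xxhash binding is installed; the record content is a made-up example) produces 32-bit and 64-bit fingerprints for one record.

import xxhash

# A hypothetical GPS record (taxi id, timestamp, longitude, latitude).
record = b"366,2008-02-02 13:30:44,116.50,39.90"

# 32-bit and 64-bit fingerprints of the same record.
fp32 = xxhash.xxh32(record).intdigest()   # 4-byte fingerprint
fp64 = xxhash.xxh64(record).intdigest()   # 8-byte fingerprint

print(f"xxh32: {fp32:#010x}")
print(f"xxh64: {fp64:#018x}")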
Although xxHash is powerful and passes the
SMHasher test suite, its 32-bit version
can still produce collisions. Here we provide a simple test to
measure the collision rate of 32-bit xxHash on real-world
data. The data come from a GPS trajectory dataset (Yuan,
2011) that contains one-week trajectories of 10,357
taxis. The dataset contains about 15
million points, and the total distance of the trajectories
reaches 9 million kilometers. We preprocess the raw
data to filter it and gather it
into a handled dataset. The file size of the handled
dataset is about 410 MB. Figure 3 illustrates the
repetition rates of the handled dataset with two hash
functions, SHA-1 and xxHash.
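A minimal sketch of such a repetition-rate test is given below; it assumes the handled dataset is stored as one record per line, and the file path is hypothetical.

import hashlib
import xxhash

def repetition_rate(path, fingerprint):
    """Fraction of records whose fingerprint has already been seen."""
    seen, total, repeated = set(), 0, 0
    with open(path, "rb") as f:
        for record in f:
            total += 1
            fp = fingerprint(record)
            if fp in seen:
                repeated += 1
            else:
                seen.add(fp)
    return repeated / total

# Hypothetical path to the handled GPS dataset (one record per line).
PATH = "handled_gps_dataset.txt"

for name, fn in [
    ("SHA-1",    lambda r: hashlib.sha1(r).digest()),
    ("xxHash64", lambda r: xxhash.xxh64(r).intdigest()),
    ("xxHash32", lambda r: xxhash.xxh32(r).intdigest()),
]:
    print(name, repetition_rate(PATH, fn))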
We can see from Figure 3 that the repetition rate
in the first row, obtained with the 160-bit SHA-1
function, serves as the ground truth. The repetition rate of xxHash32
is about 0.1% higher than that of SHA-1 in the Hash Map
field, which has no size restriction. This 0.1%
difference means that 32-bit xxHash produces
collisions in this simple test. In contrast, xxHash64 has
the same repetition rate as SHA-1. The collision
rate of xxHash64 is lower than that of xxHash32, but
xxHash64 also has a higher cost in our scheme because of its longer hash
value. Even though xxHash32 carries a
risk of collision, we still prefer it. Two
reasons mitigate the influence of collisions. The
first is their low probability: we consider
that a 0.1% deviation does not affect the result much,
and this error can also be handled in the
computing phase by additional operations. The second is
that our hash map is implemented as an LRU hash map, whose
size limitation not only prevents excessive
memory consumption but also reduces the occurrence of collisions,
at the extra cost of a slightly lower repetition rate;
after the least recently
used data blocks are discarded, existing collisions are
very likely to be eliminated. In summary, the
defect of xxHash32 in this scheme is negligible.
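The paper's LRU hash map implementation is not listed here; a minimal sketch of one capacity-bounded variant, keyed by fingerprint and built on Python's OrderedDict, is:

from collections import OrderedDict

class LRUHashMap:
    """Fingerprint -> data-block map that evicts the least recently
    used entry once the capacity limit is reached."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._map = OrderedDict()

    def get(self, fingerprint):
        block = self._map.get(fingerprint)
        if block is not None:
            self._map.move_to_end(fingerprint)  # mark as most recently used
        return block

    def put(self, fingerprint, block):
        if fingerprint in self._map:
            self._map.move_to_end(fingerprint)
        self._map[fingerprint] = block
        if len(self._map) > self.capacity:
            self._map.popitem(last=False)       # discard least recently used

When a collision does slip in, the colliding entry is eventually evicted together with other stale blocks, which is why the capacity limit also lowers the practical collision rate.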
The memory size of the LRU Cache Map depends on
two factors: the size of the hash value and its
capacity parameter. Table 1 shows that a
standard hash map stores all fingerprints and data
blocks but runs out of memory (OOM), which is why we
pick the LRU hash map. The average record size in
the dataset is about 25 bytes, and xxHash32 yields
the smallest memory size for the LRU hash map.
Figure 3: Repetition rate and LRU Cache Map analysis.
Table 1: Memory size of each data structure.
            Hash Map   LRU-10^3   LRU-10^4   LRU-10^5   LRU-10^6
SHA-1       OOM        50 KB      500 KB     5 MB       50 MB
xxHash64    OOM        35 KB      350 KB     3.5 MB     35 MB
xxHash32    OOM        30 KB      300 KB     3 MB       30 MB
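The per-entry costs behind Table 1 can be approximated by a rough model that stores one fingerprint plus one roughly 25-byte record per entry and ignores container overhead; the sketch below prints estimates that are broadly consistent with the table.

# Rough per-entry cost: fingerprint size + average record size (~25 bytes).
RECORD_SIZE = 25
FINGERPRINT_SIZE = {"SHA-1": 20, "xxHash64": 8, "xxHash32": 4}

for name, fp_size in FINGERPRINT_SIZE.items():
    for capacity in (10**3, 10**4, 10**5, 10**6):
        total_bytes = (fp_size + RECORD_SIZE) * capacity
        print(f"{name:8s} LRU-{capacity}: ~{total_bytes / 1000:,.0f} KB")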
2.4 Data Chunk Preprocess
In file synchronization systems, the content
difference between the local node and the remote
node is usually small, so synchronization methods
focus on how to find the differing parts between two
files. Note that the data generated by sensors in a
time interval arrives record by record. For instance,
in the GPS dataset the average record size is about
25 bytes. In contrast, the block-size parameter s in
Rsync is at least 300 bytes, not to mention that the
average block size in LBFS is 8 KB. Therefore, a
fine-grained chunking method is essential for our
work.
A data block in our scheme corresponds to a record that
a sensor generates in a time interval. Spatial
dependence causes a cluster of neighbouring sensors to
detect similar values, and temporal dependence causes
consecutive records from the same sensor to vary smoothly.
Therefore, we split the raw data so as to obtain as many
duplicated records as possible.
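A minimal sketch of this record-level chunking and deduplication, assuming newline-delimited records and using a plain set of fingerprints for brevity (the actual scheme would use a splitter specific to the sensor format), is:

import xxhash

def chunk_records(raw: bytes):
    """Split raw sensor data into per-record blocks (newline-delimited here)."""
    return [line for line in raw.split(b"\n") if line]

def deduplicate(records, seen_fingerprints):
    """Return only the records whose fingerprint has not been seen before."""
    new_records = []
    for record in records:
        fp = xxhash.xxh32(record).intdigest()
        if fp not in seen_fingerprints:
            seen_fingerprints.add(fp)
            new_records.append(record)
    return new_records

# Hypothetical raw chunk from a cluster head: the third record repeats the first.
raw = b"366,116.50,39.90\n367,116.51,39.91\n366,116.50,39.90\n"
seen = set()
print(deduplicate(chunk_records(raw), seen))   # only the unique records remain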
In a sensor network, a cluster head collects real-
time data from many sensors. There is so much noise
that the probability of identifying the duplicated
parts is low. To identify the differences, we require
Figure 3 data (repetition rate per hash function and data structure):
            LRU-10^3   LRU-10^4   LRU-10^5   LRU-10^6   Hash Map
SHA-1       28.12%     28.57%     28.88%     28.88%     29.10%
xxHash64    28.32%     28.77%     28.88%     28.88%     29.10%
xxHash32    28.32%     28.78%     28.89%     28.89%     29.20%