for fast searching inside this data. Thus we need to
shrink the data in the chunk table even more.
4.2 Chunk as its Hash
We propose storing the chunk identification not as the
exact set of word identifications, but as a hash value
computed from the words themselves. This lowers the
number of bits needed to express the chunk identification.
Moreover, by using different hash functions we can even
choose the number of bits used for the chunk ID. In other
words, we can select various levels of trade-off between
the data size and the accuracy of the data (the probability
of hash collisions).
The particular hash function does not matter; we can, for
example, take the highest n bits of MD5(chunk). As for the
value of n, we have tried 24 and 28 bits. Note that the
total number of different chunks in our data set is between
2^28 and 2^29, so a 24-bit hash cannot even assign distinct
values to all chunks, and some collisions are unavoidable.
The results were interesting: with a 24-bit hash value, the
absolute difference between the computed and exact
similarities was up to 5 %, but only for documents whose
similarity was already at most 5 %. So we got only a few
false positives, and only for document pairs which were
already different enough. For n = 28, the absolute
difference was at most 1 %.
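As an illustration, the following is a minimal C sketch of such a hash function, assuming OpenSSL's MD5 and a chunk represented as a plain string; the helper name chunk_hash is ours:

#include <stdint.h>
#include <string.h>
#include <openssl/md5.h>

/* Map a chunk to an n-bit chunk ID by taking the highest
 * n bits of MD5(chunk).  Assumes 1 <= n <= 32. */
static uint32_t chunk_hash(const char *chunk, unsigned n)
{
    unsigned char digest[MD5_DIGEST_LENGTH];
    MD5((const unsigned char *)chunk, strlen(chunk), digest);

    /* Interpret the first four digest bytes as a big-endian
     * integer and keep only the top n bits. */
    uint32_t top = ((uint32_t)digest[0] << 24) |
                   ((uint32_t)digest[1] << 16) |
                   ((uint32_t)digest[2] <<  8) |
                    (uint32_t)digest[3];
    return top >> (32 - n);
}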
Should exact results be needed, we can use this approach
as an upper estimate of the similarity, and compute the
exact similarities only for document pairs preselected by
this algorithm, and only once the user actually views these
documents (i.e. without precomputing the exact values).
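A possible shape of this two-stage evaluation, sketched in C with hypothetical function names and both similarity functions assumed to exist elsewhere:

/* The hash-based similarity serves as an upper estimate, so pairs
 * below the threshold can be rejected without the exact computation. */
double similarity_hashed(int doc_a, int doc_b);  /* fast upper estimate */
double similarity_exact(int doc_a, int doc_b);   /* expensive, on demand */

double similarity_when_viewed(int doc_a, int doc_b, double threshold)
{
    double upper = similarity_hashed(doc_a, doc_b);
    if (upper < threshold)
        return upper;                       /* pair cannot be similar enough */
    return similarity_exact(doc_a, doc_b);  /* only now pay the full cost */
}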
Also note that using a hash of the words themselves
removes the need for unique word ID numbers. The
dictionary table can then be turned into a set (i.e. we no
longer look up a word ID, but only ask whether the word is
present in the dictionary or not). This can lower the
resource requirements of the dictionary table, although
this reduction is not significant in the whole picture.
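The dictionary interface thus shrinks from an ID lookup to a membership test; the C declarations below are a hypothetical contrast, not the actual implementation:

#include <stdbool.h>
#include <stdint.h>

typedef struct dict dict_t;  /* opaque dictionary structure */

/* Before: every word needs a unique ID to build chunk identifications. */
uint32_t dict_word_id(const dict_t *d, const char *word);

/* After hashing chunks directly: a membership test suffices, so any
 * set structure (hash set, trie, ...) can back the dictionary table. */
bool dict_contains(const dict_t *d, const char *word);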
4.3 Data Structure
The hash function we use has the range of values from 0 to
2^n − 1 for some n. Unlike the database approach, we do not
actually need the whole chunk table to be searchable both
by chunk ID and by document ID. In fact, we only need one
of these two directions: to discover documents similar to a
given one, we split the new document into chunks and then
look up in which other documents those chunks appear. So in
database terms, we only need the index mapping a chunk ID
to the list of document IDs.
Figure 1: Data structure mapping chunk ID to the document
IDs (an offset array, indexed by hash values 0 to 2^n − 1,
pointing into a document ID array).
The proposed data structure for this task is shown in
Figure 1. It contains two arrays:
• The array of document IDs (the rightmost one in
Figure 1). This is an array of values of the
“document ID” data type. It contains the list of
documents in which the chunk with ID 0 appears, then the
list of documents in which the chunk with ID 1 appears,
and so on. The size of this array is approximately
sizeof(document id) multiplied by the total number of
chunk occurrences in all documents. For 600,000,000
chunks and three bytes per document ID, this is about
1.7 GB. There is nothing simple we can do to reduce the
size of this array.
• The array of offsets (the leftmost one in Figure 1).
This array describes where in the array of document IDs
we should look when we want to find all documents in
which a given chunk occurs. Entry i of this array gives
the offset of the first document ID for the chunk with
hash value i, and entry i + 1 gives the offset just past
the last document ID for that chunk. It is an array of
the integer data type, indexed by all possible values of
the chunk ID. So for a 24-bit hash value space and 4-byte
integers, this array takes 2^24 · 4 bytes, i.e. 64 MB,
and for a 28-bit hash value space it takes 1 GB. The size
of this array is therefore proportional to the number of
possible hash values, i.e. it grows exponentially with
the number of bits of the hash value.
For example, in Figure 1, the chunk with hash value 0
occurs in documents 5431 and 9123, the chunk with hash
value 1 does not occur anywhere in the whole data set, and
the chunk with hash value 2 is in documents 14, 5013, 8550,
and possibly others. The 2^n-th entry is used to terminate
the array of document IDs.
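A lookup in this structure then costs only two array reads. The following C sketch assumes the two arrays of Figure 1 are already built; the type and function names are ours, and document IDs are widened from three bytes to a 4-byte integer for simplicity:

#include <stdint.h>

typedef uint32_t doc_id_t;  /* three bytes in the text; widened here */

/* Return a pointer to the document IDs for the chunk with hash value h
 * and store their count; the slice is doc_ids[offsets[h] .. offsets[h+1]-1].
 * The extra 2^n-th offset entry terminates the last slice. */
static const doc_id_t *docs_for_chunk(const uint32_t *offsets,
                                      const doc_id_t *doc_ids,
                                      uint32_t h, uint32_t *count)
{
    *count = offsets[h + 1] - offsets[h];
    return doc_ids + offsets[h];
}

Since offsets[h + 1] − offsets[h] directly yields the list length, no per-chunk length fields or terminators are needed beyond that final entry.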