for fast searching inside this data. Thus we need to
shrink the data in the chunk table even more.
4.2 Chunk as its Hash
We propose storing the chunk identification not as the
exact set of word identifications, but as a hash value
computed from the words themselves. This lowers the
number of bits needed to express the chunk identification.
Moreover, by using different hash functions we can even
choose the number of bits used for the chunk ID. In other
words, we can select various levels of trade-off between
the data size and the accuracy of the data (the probability
of hash collisions).
The particular hash function does not matter; we can, for
example, take the highest n bits of MD5(chunk). As for the
value of n, we have tried 24 and 28 bits. Note that the
total number of different chunks in our data set is between
2^28 and 2^29, so a 24-bit hash cannot even assign distinct
values to all chunks, and some collisions are unavoidable.
The results were interesting: with a 24-bit hash value, the
absolute difference between the computed and exact
similarities was up to 5 %, but only for documents whose
similarity was already at most 5 %. So we got only a few
false positives, and only for document pairs which were
already different enough. For n = 28, the absolute
difference was at most 1 %.
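As an illustration, the following is a minimal C sketch of such a hash function, assuming OpenSSL's MD5 and a chunk represented as a plain string; the helper name chunk_hash is ours:

#include <stdint.h>
#include <string.h>
#include <openssl/md5.h>

/* Map a chunk to an n-bit chunk ID by taking the highest
 * n bits of MD5(chunk).  Assumes 1 <= n <= 32. */
static uint32_t chunk_hash(const char *chunk, unsigned n)
{
    unsigned char digest[MD5_DIGEST_LENGTH];
    MD5((const unsigned char *)chunk, strlen(chunk), digest);

    /* Interpret the first four digest bytes as a big-endian
     * integer and keep only the top n bits. */
    uint32_t top = ((uint32_t)digest[0] << 24) |
                   ((uint32_t)digest[1] << 16) |
                   ((uint32_t)digest[2] <<  8) |
                    (uint32_t)digest[3];
    return top >> (32 - n);
}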
Should exact results be needed, we can use this approach
as an upper estimate of the similarity, and compute the
exact similarities only for document pairs preselected by
this algorithm, and only once the user actually views these
documents (i.e. without precomputing the exact values).
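A possible shape of this two-stage evaluation, sketched in C with hypothetical function names and both similarity functions assumed to exist elsewhere:

/* The hash-based similarity serves as an upper estimate, so pairs
 * below the threshold can be rejected without the exact computation. */
double similarity_hashed(int doc_a, int doc_b);  /* fast upper estimate */
double similarity_exact(int doc_a, int doc_b);   /* expensive, on demand */

double similarity_when_viewed(int doc_a, int doc_b, double threshold)
{
    double upper = similarity_hashed(doc_a, doc_b);
    if (upper < threshold)
        return upper;                       /* pair cannot be similar enough */
    return similarity_exact(doc_a, doc_b);  /* only now pay the full cost */
}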
Also note that using a hash of the words themselves
removes the need for unique word ID numbers. The
dictionary table can then be turned into a set (i.e. we no
longer look up a word ID, but only ask whether the word is
present in the dictionary or not). This can lower the
resource requirements of the dictionary table, although
this reduction is not significant in the whole picture.
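The dictionary interface thus shrinks from an ID lookup to a membership test; the C declarations below are a hypothetical contrast, not the actual implementation:

#include <stdbool.h>
#include <stdint.h>

typedef struct dict dict_t;  /* opaque dictionary structure */

/* Before: every word needs a unique ID to build chunk identifications. */
uint32_t dict_word_id(const dict_t *d, const char *word);

/* After hashing chunks directly: a membership test suffices, so any
 * set structure (hash set, trie, ...) can back the dictionary table. */
bool dict_contains(const dict_t *d, const char *word);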
4.3 Data Structure
The hash function we use has the range of values from 0 to
2^n − 1 for some n. Unlike the database approach, we do not
actually need the whole chunk table to be searchable both
by chunk ID and by document ID. In fact, we only need one
of these two directions: to discover documents similar to a
given one, we split the new document into chunks and then
look up in which other documents those chunks appear. So in
database terms, we only need the index mapping a chunk ID
to the list of document IDs.
Figure 1: Data structure mapping chunk ID to the document
IDs (an offset array, indexed by hash values 0 to 2^n − 1,
pointing into a document ID array).
The proposed data structure for this task is shown in
Figure 1. It contains two arrays:
• The array of document IDs (the rightmost one in
Figure 1). This is an array of values of the
“document ID” data type. It contains the list of
documents in which the chunk with ID 0 appears, then the
list of documents in which the chunk with ID 1 appears,
and so on. The size of this array is approximately
sizeof(document id) multiplied by the total number of
chunk occurrences in all documents. For 600,000,000
chunks and three bytes per document ID, this is about
1.7 GB. There is nothing simple we can do to reduce the
size of this array.
• The array of offsets (the leftmost one in Figure 1).
This array describes where in the array of document IDs
we should look when we want to find all documents in
which a given chunk occurs. Entry i of this array gives
the offset of the first document ID for the chunk with
hash value i, and entry i + 1 gives the offset just past
the last document ID for that chunk. It is an array of
the integer data type, indexed by all possible values of
the chunk ID. So for a 24-bit hash value space and 4-byte
integers, this array takes 2^24 · 4 bytes, i.e. 64 MB,
and for a 28-bit hash value space it takes 1 GB. The size
of this array is therefore proportional to the number of
possible hash values, i.e. it grows exponentially with
the number of bits of the hash value.
For example, in Figure 1, the chunk with hash value 0
occurs in documents 5431 and 9123, the chunk with hash
value 1 does not occur anywhere in the whole data set, and
the chunk with hash value 2 is in documents 14, 5013, 8550,
and possibly others. The 2^n-th entry is used to terminate
the array of document IDs.
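A lookup in this structure then costs only two array reads. The following C sketch assumes the two arrays of Figure 1 are already built; the type and function names are ours, and document IDs are widened from three bytes to a 4-byte integer for simplicity:

#include <stdint.h>

typedef uint32_t doc_id_t;  /* three bytes in the text; widened here */

/* Return a pointer to the document IDs for the chunk with hash value h
 * and store their count; the slice is doc_ids[offsets[h] .. offsets[h+1]-1].
 * The extra 2^n-th offset entry terminates the last slice. */
static const doc_id_t *docs_for_chunk(const uint32_t *offsets,
                                      const doc_id_t *doc_ids,
                                      uint32_t h, uint32_t *count)
{
    *count = offsets[h + 1] - offsets[h];
    return doc_ids + offsets[h];
}

Since offsets[h + 1] − offsets[h] directly yields the list length, no per-chunk length fields or terminators are needed beyond that final entry.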