macros and variables; L3: change the position of
variables; L4: replace a function call with the function
body, which reduces the number of functions; L5: modify
program statements, for example changing i++ to i += 1;
L6: modify the program control logic.
2.2 Code Similarity Definition
Code similarity measures the plagiarism relationship
between two programs: the higher the similarity, the
greater the possibility of plagiarism, and a similarity
of 100% means a complete copy. T. Yamamoto,
M. Matsushita, T. Kamiya and K. Inoue (J. A.
W. Faidhi and S. K. Robinson, 1987) give a definition
of the similarity of two software systems. For two
software systems A and B, A consists of the elements
a1, a2, a3, ..., am, represented by the set {a1, a2, a3,
..., am}. Similarly, B consists of the elements b1, b2,
b3, ..., bn, represented by the set {b1, b2, b3, ..., bn}.
Here, a1, a2, a3, ..., am and b1, b2, b3, ..., bn can be
the files or the lines of the programs in software
systems A and B. Suppose we are able to compute the
matching between ai and bj (1 <= i <= m, 1 <= j <= n).
The set of all matched pairs (ai, bj) is denoted by Rs,
and the similarity S is defined as follows:
$$S(A, B) = \frac{|\{a_i \mid (a_i, b_j) \in R_s\}| + |\{b_j \mid (a_i, b_j) \in R_s\}|}{|A| + |B|} \qquad (1)$$
As shown in Eq. (1), the similarity between A and B is
a ratio: the number of elements of A and of B that take
part in a match in Rs, divided by the sum of the sizes
of A and B. The fewer elements Rs matches, the smaller
S is; if Rs is the empty set, then S = 0. When A and B
are the same, S = 1.
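As an illustration of Eq. (1), the following is a minimal sketch rather than the implementation used in this paper. It assumes that the elements are program lines, that the lines within each system are distinct, and that two lines match only when they are equal. The names `similarity` and `exact_matches` are hypothetical.

```python
def similarity(a, b, matches):
    """Eq. (1): (matched elements of A + matched elements of B) / (|A| + |B|).

    a, b    -- collections of elements (e.g. files or lines), assumed distinct
    matches -- iterable of (a_i, b_j) pairs, i.e. the relation Rs
    """
    a, b = set(a), set(b)
    matches = list(matches)  # allow generators; we iterate twice below
    if not a and not b:
        return 0.0  # assumption: two empty systems get similarity 0
    matched_a = {x for x, _ in matches if x in a}
    matched_b = {y for _, y in matches if y in b}
    return (len(matched_a) + len(matched_b)) / (len(a) + len(b))


def exact_matches(a, b):
    """Hypothetical matcher: two elements match when they are equal."""
    return {(x, y) for x in a for y in b if x == y}


A = ["int i;", "i++;", "return i;"]
B = ["int i;", "i += 1;", "return i;", "// done"]
print(similarity(A, B, exact_matches(A, B)))  # 2 + 2 matched out of 3 + 4
```

With an empty Rs the function returns 0, and comparing a system with itself returns 1, in line with the two boundary cases described above.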
3 CODE SIMILARITY DETECTION METHOD
Code similarity detection methods fall into two
categories: attribute-counting techniques and
structure-metric techniques. The attribute-counting
technique was the first to be proposed and used in code
detection. Program code has measurable features, such
as the number of lines of code, the number of variables,
the number of operators, the number of control
conditions and the number of loops. The
attribute-counting technique counts these attributes
for each program. Different programs generally yield
different attribute statistics, while the same or
similar programs should yield similar statistics. Verco
and Wise showed that an anti-plagiarism detection
system based on attribute counting works well only when
the two programs are identical or very similar; it
fails when a programmer with even a little experience
makes several modifications to the source code. The
structure-metric method determines whether two programs
are similar by comparing their structural information.
It is easy for a programmer to change the attributes of
a program, but the structure of the program is very
difficult to change without rewriting it, at which
point the result can no longer be called plagiarism.
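The attribute-counting idea can be sketched as follows. This is only an illustration under stated assumptions: the function name `attribute_profile`, the choice of four attributes and the regular expressions for C-like source text are all hypothetical and are not taken from any system described here.

```python
import re

# Illustrative operator pattern for C-like code (two-character operators first).
OPERATOR_RE = re.compile(r"\+\+|--|[-+*/%]=|[=!<>]=|&&|\|\||[-+*/%=<>!]")


def attribute_profile(source):
    """Count a few surface attributes of C-like source text."""
    lines = [ln for ln in source.splitlines() if ln.strip()]
    return {
        "lines": len(lines),
        "operators": len(OPERATOR_RE.findall(source)),
        "conditions": len(re.findall(r"\b(?:if|switch)\b", source)),
        "loops": len(re.findall(r"\b(?:for|while)\b", source)),
    }


original = (
    "int sum = 0;\n"
    "for (int i = 0; i < n; i++) {\n"
    "    if (a[i] > 0) sum += a[i];\n"
    "}\n"
)
# Same program with every variable renamed.
renamed = (
    "int total = 0;\n"
    "for (int k = 0; k < n; k++) {\n"
    "    if (v[k] > 0) total += v[k];\n"
    "}\n"
)
print(attribute_profile(original) == attribute_profile(renamed))
```

Renaming variables leaves this profile unchanged, so such a copy is caught. Edits that add lines or rewrite statements shift the counts, which is the weakness noted above.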
At present, the structure-metric method based on
string matching is used in most anti-plagiarism
systems. This method has two key points. The first is
how to analyze the structure of the program code and
convert the code into a string. The second is how to
choose a string matching algorithm to compute the
similarity.
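The first key point, converting code into a string, might look like the sketch below. It is a simplified assumption rather than the tokenizer of any particular system: the keyword subset, the token names ID and NUM, and the function `to_token_string` are hypothetical, and comments and string literals are not handled.

```python
import re

KEYWORDS = {"int", "for", "while", "if", "else", "return"}  # illustrative subset
# Identifiers, integers, two-character operators, then any other single symbol.
TOKEN_RE = re.compile(r"[A-Za-z_]\w*|\d+|\+\+|--|[-+*/%=!<>]=|&&|\|\||\S")


def to_token_string(source):
    """Map C-like source text to a string of structural tokens."""
    tokens = []
    for lexeme in TOKEN_RE.findall(source):
        if lexeme in KEYWORDS:
            tokens.append(lexeme.upper())   # keep keywords, they carry structure
        elif re.match(r"[A-Za-z_]", lexeme):
            tokens.append("ID")             # every identifier becomes the same token
        elif lexeme.isdigit():
            tokens.append("NUM")            # every integer literal becomes the same token
        else:
            tokens.append(lexeme)           # operators and punctuation stay as-is
    return " ".join(tokens)


print(to_token_string("int sum = 0;"))  # INT ID = NUM ;
```

Because every identifier maps to the same token, renaming variables does not change the resulting string, so the string matching step compares structure rather than names.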
3.1 Code Plagiarism Description
The string matching algorithm in an anti-plagiarism
detection system is used to calculate the similarity of
program code. Plagiarism here refers to situations in
which students simply modify some of the variables or
change the position of some functions, so the string
matching algorithm must be able to detect these cases.
It must find the longest matches, because some short
matches exist even when there is no plagiarism. String
matching for plagiarism detection does not simply find
the position of a pattern string; it finds the set of
all exact matches between the two strings. The ratio of
the total length of the exactly matched substrings to
the total length of the strings can then be used to
determine the level of similarity. The algorithm should
also mark the locations of the longest matches to
facilitate detection. There are many effective string
matching algorithms, such as LCS (longest common
substring), Levenshtein distance (Michael Gilleland,
2007), the Heckel algorithm (Michael J. Wise, 1992),
dynamic programming and GST. Given the features of an
anti-plagiarism system, in this paper we use the GST
algorithm, which has better accuracy. The steps of GST
are as follows:
Step 1: Define MIN_MATCH_LEN, which should be equal to
or greater than 1. Initialize TILES as empty. S and T
are unmarked by default, which means that matching has
not started yet.
Step 2: Find one or more maximal matching strings.
Initialize the maximum matching length max_match to
MIN_MATCH_LEN; set matches