A Similarity Detection Platform for Programming Learning

Yuanyuan Li, Yu Sheng, Lei Xiao and Fu Wang

School of Information Science and Engineering, Central South University, Lushan South Road, Changsha, China

Keywords: Similarity Detection, Structure-Metric, GST Algorithm, Sub-Graph Isomorphism.

Abstract: Code similarity detection has been studied for several decades, and the prevailing methods fall into two categories: attribute-counting and structure-metric. Because attribute-counting is effective only against full replication, mature systems usually use the GST string matching algorithm to compare code structure. However, the accuracy of GST is vulnerable to interference in code similarity detection. This paper presents a code similarity detection method that combines string matching and sub-graph isomorphism. The similarity is first calculated with the GST algorithm; according to that similarity, the system then determines whether further processing with the sub-graph isomorphism algorithm is required. Extensive experimental results illustrate that our method significantly enhances both the efficiency of string matching and the accuracy of code similarity detection.

1 INTRODUCTION

Information technology is increasingly combined with education and applied to modern teaching. For the programming courses in Computer Science and Technology, we developed a learning platform to help students improve their programming skills and to help teachers improve their teaching efficiency. With the platform, teachers can assign and check homework, publish course news and organize examinations online, and students can complete their tasks online. As a result, plagiarism becomes a serious problem for teachers. Checking each program manually costs much time and effort, and if students modify a copied program slightly, checking becomes even more difficult.

To address this issue, code plagiarism checking has been studied widely, focusing mainly on how to compute the similarity of two programs and determine whether plagiarism exists. The attribute-counting method for checking code plagiarism was first put forward by Halstead (M. H. Halstead, 1977), and the use of structure-metric techniques to calculate code similarity was presented by Verco and Wise (K. L. Verco and M. J. Wise, 1996). Our investigation found that most mature anti-plagiarism systems adopt a string matching method to compare code structure (Donaldson et al., 1981; G. Whale, 1990; D. Gitchell and N. Tran, 1999; Michael J. Wise, 2003). Systems based on this method run efficiently and are easy to implement; their disadvantage is that they cannot accurately detect more complex forms of copying. This paper studies the related algorithms and techniques and designs a similarity detection method that combines a string matching algorithm with a subgraph isomorphism algorithm.

2 CODE SIMILARITY DETECTION OVERVIEW

Code similarity is the degree of similarity between one program and another.

2.1 Code Plagiarism Description

A programming language course is highly practical, and extensive programming exercises are necessary to improve students' programming ability. However, some students copy other students' source code, or simply change the names of variables or functions, which constitutes plagiarism. Plagiarism wastes teachers' effort and does nothing to improve the students' programming ability. Faidhi and Robinson (J. A. W. Faidhi and S. K. Robinson, 1987) divided code plagiarism into seven levels. L0: make no modifications to the source code; L1: modify only the source code comments; L2: modify identifiers in the source code, such as the names of functions, macros and variables; L3: change the position of variables; L4: replace a function call with the function body, reducing the number of functions; L5: modify program statements, for example changing i++ to i += 1; L6: modify the program control logic.

Li Y., Sheng Y., Xiao L. and Wang F.
A Similarity Detection Platform for Programming Learning.
DOI: 10.5220/0005490304800485
In Proceedings of the 7th International Conference on Computer Supported Education (CSEDU-2015), pages 480-485
ISBN: 978-989-758-107-6
© 2015 SCITEPRESS (Science and Technology Publications, Lda.)

2.2 Code Similarity Definition

The plagiarism relationship between two programs is measured by code similarity: the higher the similarity, the greater the possibility of plagiarism, and a similarity of 100% means complete copying. T. Yamamoto, M. Matsushita, T. Kamiya and K. Inoue (J. A. W. Faidhi and S. K. Robinson, 1987) give a definition of the similarity of two software systems. For two software systems A and B, A consists of the elements a1, a2, a3, ..., am, represented by the set {a1, a2, a3, ..., am}. Similarly, B consists of the elements b1, b2, b3, ..., bn, represented by the set {b1, b2, b3, ..., bn}. Here, the elements can be the files or the lines of the programs in software systems A and B. Suppose we are able to calculate the matching between ai and bj (1 <= i <= m, 1 <= j <= n). The set of all matched pairs (ai, bj) is denoted by Rs, and the similarity S is defined as follows:

$$S(A, B) = \frac{\left|\{a_i \mid (a_i, b_j) \in R_s\}\right| + \left|\{b_j \mid (a_i, b_j) \in R_s\}\right|}{|A| + |B|} \qquad (1)$$

As shown in Eq. (1), the similarity between A and B is a ratio: the number of elements of A that appear in Rs plus the number of elements of B that appear in Rs, divided by the sum of the sizes of A and B. The smaller Rs is, the smaller S becomes; if Rs is the empty set, then S = 0. When A and B are the same, S = 1.
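As a rough illustration of Eq. (1), the following Python sketch computes S from two lists of elements and a set of matched index pairs. The names `similarity` and `exact_line_matches` are hypothetical, and treating elements as lines that match when they are equal after whitespace stripping is an assumption made only for this example; the definition itself leaves the matching method open.

```python
def similarity(a_elems, b_elems, matches):
    """Eq. (1): matched elements of A plus matched elements of B, over |A| + |B|.

    `matches` is the set Rs, given as (i, j) index pairs into a_elems and b_elems.
    """
    matched_a = {i for i, _ in matches}   # elements of A that occur in Rs
    matched_b = {j for _, j in matches}   # elements of B that occur in Rs
    total = len(a_elems) + len(b_elems)
    return (len(matched_a) + len(matched_b)) / total if total else 0.0


def exact_line_matches(a_elems, b_elems):
    """Assumed matcher: pair elements that are equal after stripping whitespace."""
    return {
        (i, j)
        for i, a in enumerate(a_elems)
        for j, b in enumerate(b_elems)
        if a.strip() == b.strip()
    }


if __name__ == "__main__":
    a = ["int i;", "int sum = 0;", "sum += i;"]
    b = ["int sum = 0;", "int k;", "sum += i;", "return sum;"]
    rs = exact_line_matches(a, b)
    print(similarity(a, b, rs))  # 4/7, about 0.571
```

Two lines of A and two lines of B take part in a match, so S = (2 + 2) / (3 + 4) = 4/7.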

3 CODE SIMILARITY DETECTION METHOD

Code similarity detection methods are divided into two categories: the attribute-counting technique and the structure-metric technique. The attribute-counting technique was the first to be proposed and used in code detection. Program code has measurable features, such as the number of lines of code, the number of variables and operators, the number of control conditions and the number of loops. The attribute-counting technique counts these attributes for each program. Different programs generally yield different attribute statistics, while the same or similar programs should yield similar statistics. Verco and Wise showed that an anti-plagiarism system based on the attribute-counting technique works well only when the two programs are identical or very similar; it fails against programmers with even a little experience, who can make several modifications to the source code. The structure-metric method determines whether two programs are similar by comparing their structural information. For a programmer it is easy to change the attributes of a program but very difficult to change its structure; a copy whose structure has been thoroughly changed can hardly be called plagiarism.

At present, the structure-metric method based on a string matching algorithm is used in most anti-plagiarism systems. This method has two key points. The first is how to analyze the structure of the program code and convert the code into a string. The second is how to choose a string matching algorithm to compute the similarity.

3.1 String Matching Algorithm

The string matching algorithm in an anti-plagiarism detection system is used to calculate the similarity of program code. Plagiarism here refers to situations in which students simply modify some of the variables or change the position of some functions, so the string matching algorithm must be able to detect these cases. It must be able to find the longest matches, because some short matches exist even when there is no plagiarism. String matching for plagiarism detection does not simply find the position of a pattern string; it finds the set of all exact matches between the two strings, and the proportion of the total size of the exactly matched substrings to the total size of the strings can be used to determine the level of similarity. The algorithm should also mark the locations of the longest matches to facilitate inspection. There are many effective string matching algorithms, such as LCS (longest common substring), Levenshtein distance (Michael Gilleland, 2007), the Heckel algorithm (Michael J. Wise, 1992), dynamic programming and GST. Given the requirements of an anti-plagiarism system, in this paper we use the GST algorithm, which has better accuracy. The GST process is as follows:

Step 1: Define MIN_MATCH_LEN, which should be greater than or equal to 1. Initialize the set tiles as empty. All characters of S and T are unmarked by default, which means that matching has not started yet.

Step 2: Find one or more maximal matching strings. Initialize the maximum matching length max_match to MIN_MATCH_LEN and set matches to empty. Then scan S and T and compare their unmarked characters. If two characters are equal, increase the matching length len, and continue until the characters differ or a marked character is reached. If len equals max_match, we have found a new match of size max_match, so we add it to matches. If len is greater than max_match, the matches collected so far are not the longest common strings, so we reset matches to empty, add the current match to matches, and set max_match to len. Repeat this process until every pair of unmarked starting positions in S and T has been compared.

Step 3: If the set matches generated in Step 2 is not empty, add its items to the set tiles and mark the corresponding characters in S and T.

Step 4: Repeat Steps 2 and 3 until Step 2 finds no new match; the algorithm then ends.
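The four steps can be sketched in Python as follows. This is a minimal brute-force version for illustration, not the implementation used in the system: the function names, the rule for skipping a match that overlaps a tile placed earlier in the same round, and the normalisation 2 × covered / (|S| + |T|) used to turn tiles into a similarity are all assumptions of this sketch.

```python
def greedy_string_tiling(s, t, min_match_len=2):
    """Return tiles as (start_in_s, start_in_t, length) for sequences s and t."""
    if min_match_len < 1:
        raise ValueError("min_match_len must be at least 1")
    marked_s = [False] * len(s)   # Step 1: nothing is marked yet
    marked_t = [False] * len(t)
    tiles = []
    while True:
        # Step 2: collect all longest matches among unmarked positions.
        max_match = min_match_len
        matches = []
        for i in range(len(s)):
            if marked_s[i]:
                continue
            for j in range(len(t)):
                if marked_t[j]:
                    continue
                k = 0
                while (i + k < len(s) and j + k < len(t)
                       and s[i + k] == t[j + k]
                       and not marked_s[i + k] and not marked_t[j + k]):
                    k += 1
                if k == max_match:
                    matches.append((i, j, k))
                elif k > max_match:
                    matches = [(i, j, k)]   # a longer match replaces the old set
                    max_match = k
        if not matches:
            break   # Step 4: no new match, so the algorithm ends
        # Step 3: turn matches into tiles and mark the covered positions.
        for i, j, k in matches:
            if any(marked_s[i:i + k]) or any(marked_t[j:j + k]):
                continue   # overlaps a tile placed earlier in this round
            for d in range(k):
                marked_s[i + d] = True
                marked_t[j + d] = True
            tiles.append((i, j, k))
    return tiles


def gst_similarity(s, t, min_match_len=2):
    """Assumed normalisation: share of both sequences covered by tiles."""
    total = len(s) + len(t)
    if total == 0:
        return 0.0
    covered = sum(k for _, _, k in greedy_string_tiling(s, t, min_match_len))
    return 2 * covered / total


if __name__ == "__main__":
    # Swapping two halves of a string does not lower the similarity.
    print(greedy_string_tiling("abcdefgh", "efghabcd"))  # [(0, 4, 4), (4, 0, 4)]
    print(gst_similarity("abcdefgh", "efghabcd"))        # 1.0
```

The example shows why tiling suits plagiarism detection: moving a block of code changes where a tile lies but not how much of the two sequences is covered.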

3.2 Subgraph Isomorphism Algorithm

The subgraph isomorphism problem is NP-hard (M. Garey and D. Johnson, 1979). However, several decades of study have shown that some optimized algorithms are relatively fast in practice, such as the backtracking search algorithm (Evgeny B. Krissinel and Kim Henrick, 2004) proposed by E. B. Krissinel at the University of Cambridge, which is slow only in rare cases. The process of the backtracking search algorithm is as follows:

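function BackTrace() {
    // Stop when the current mapping cannot be extended any further.
    if !Extendable(queue)
        return
    end if
    node vi = PickVertex()         // next pattern-graph vertex to match
    X = GetMatchedNodes(vi)        // candidate target-graph vertices for vi
    for all ui in X
        map.put(vi, ui)            // tentatively map vi to ui
        if Validate() then
            Marked(vi)
            Marked(ui)
            n = n > map.size() ? n : map.size()   // largest mapping found so far
            UpdateQueue(vi)
            BackTrace()
        else
            map.remove(vi, ui)     // reject the pair and try the next candidate
        end if
    end for
}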
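To make the idea of the pseudocode concrete, here is a small Python sketch of a backtracking search for a maximum common subgraph of two labelled directed graphs. It is a simplified exhaustive search with one pruning bound, not the vertex ordering or pruning of the Krissinel and Henrick algorithm, and the graph representation (a dict of node labels plus a set of directed edges) and all function names are assumptions made for this example.

```python
def max_common_subgraph(p_labels, p_edges, t_labels, t_edges):
    """Return the largest label- and edge-preserving mapping pattern -> target."""
    p_nodes = list(p_labels)
    best = {}

    def consistent(mapping, v, u):
        # Labels must agree, and edges to already mapped nodes must correspond.
        if p_labels[v] != t_labels[u]:
            return False
        for pv, tu in mapping.items():
            if ((pv, v) in p_edges) != ((tu, u) in t_edges):
                return False
            if ((v, pv) in p_edges) != ((u, tu) in t_edges):
                return False
        return True

    def backtrack(idx, mapping, used):
        nonlocal best
        if len(mapping) > len(best):
            best = dict(mapping)          # record the largest mapping so far
        if idx == len(p_nodes):
            return
        # Prune: even matching every remaining node cannot beat the best.
        if len(mapping) + (len(p_nodes) - idx) <= len(best):
            return
        v = p_nodes[idx]
        for u in t_labels:
            if u not in used and consistent(mapping, v, u):
                mapping[v] = u
                used.add(u)
                backtrack(idx + 1, mapping, used)
                del mapping[v]            # undo the choice before trying the next
                used.remove(u)
        backtrack(idx + 1, mapping, used)  # also try leaving v unmatched

    backtrack(0, {}, set())
    return best


if __name__ == "__main__":
    # Hypothetical graphs; node labels follow the statement types of a PDG.
    p_labels = {1: "Declare", 2: "Assign", 3: "Control", 4: "Assign"}
    p_edges = {(1, 2), (2, 3), (3, 4)}
    t_labels = {"a": "Declare", "b": "Assign", "c": "Control",
                "d": "Assign", "e": "Call"}
    t_edges = {("a", "b"), ("b", "c"), ("c", "d"), ("d", "e")}
    mapping = max_common_subgraph(p_labels, p_edges, t_labels, t_edges)
    print(mapping)                       # {1: 'a', 2: 'b', 3: 'c', 4: 'd'}
    print(len(mapping) / len(p_labels))  # 1.0
```

Unlike the pseudocode, this sketch undoes each tentative pair after the recursive call returns, so that all alternatives are explored; its running time is exponential in the worst case, which is acceptable only for small graphs.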

4 THE REALIZATION OF THE SIMILARITY DETECTION

The code similarity detection designed in this paper works as shown in Fig. 1. The program codes A and B are preprocessed, and their similarity is calculated with the GST algorithm. According to that similarity, the system determines whether further processing with the subgraph isomorphism algorithm is required. Finally, the algorithm returns the similarity.

Figure 1: Similarity detection process.

4.1 Preprocessing

To make program code readable, programmers usually include comments and text prompts in it, and a plagiarist can simply modify these comments or prompts. The preprocessing module filters out such irrelevant information and thereby defeats plagiarism level L1. It first scans the source code and deletes the comments, empty lines and text prompts. Besides this information, the preprocessing module should also delete the header file inclusions, for two reasons. First, a copier can add many irrelevant header files to confuse similarity detection. Second, many programs include mostly the same header files; in C, for example, programmers commonly use header files such as stdio.h, stdlib.h, file.h and math.h. Since the header files can be the same even when there is no plagiarism, header file information is not needed in similarity detection.
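A preprocessing step of this kind might look like the following Python sketch for C source code. It is an illustration under stated assumptions rather than the module used in the system: the function name `preprocess` is hypothetical, "text prompts" are interpreted as the contents of string literals, and only `#include` directives are treated as header file information.

```python
import re

# One pattern for literals and comments, so that "//" inside a string literal
# is not mistaken for the start of a comment.
_LITERAL_OR_COMMENT = re.compile(
    r'"(?:\\.|[^"\\])*"'     # string literal
    r"|'(?:\\.|[^'\\])*'"    # character literal
    r"|/\*.*?\*/"            # block comment
    r"|//[^\n]*",            # line comment
    re.DOTALL,
)


def preprocess(source):
    """Strip comments, prompt text, #include lines and empty lines."""

    def replace(match):
        text = match.group(0)
        if text.startswith("/"):   # a comment: replace it with a space
            return " "
        if text.startswith('"'):   # a string literal: blank out the prompt text
            return '""'
        return text                # keep character literals unchanged

    stripped = _LITERAL_OR_COMMENT.sub(replace, source)
    kept = []
    for line in stripped.splitlines():
        line = line.strip()
        if not line or re.match(r"#\s*include\b", line):
            continue               # drop empty lines and header inclusions
        kept.append(line)
    return "\n".join(kept)


if __name__ == "__main__":
    code = (
        "#include <stdio.h>\n"
        "/* sum */\n"
        "int main() {\n"
        "  // prompt\n"
        '  printf("Enter n: ");\n'
        "\n"
        "  return 0; /* done */\n"
        "}\n"
    )
    print(preprocess(code))
```

For the sample input, the output keeps only the four lines `int main() {`, `printf("");`, `return 0;` and `}`.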

4.2 GST Algorithm Implementation

1. Tokenization

Tokenization means converting the source code into a token string by lexical analysis, which facilitates string matching. In our implementation, tokenization turns the program code into intermediate objects called LangGrammerElem. LangGrammerElem has four types: SINGLE, METHOD, LOOP and CONTROL. SINGLE elements are single-line elements, such as variable declarations, assignments and function calls. METHOD elements are functions, including the main function. LOOP elements are for, while, do-while and other loop statements. CONTROL elements are if, else, switch and other branch control statements. LangGrammerElem is a recursive structure because METHOD, LOOP and CONTROL elements may contain more than one SINGLE element. Table 1 shows an example of the tokenization of a C program.

Table 1: Example of Tokenization.

Line  C Source Code                Tokenization
1     void main()                  MAIN{
2     {
3     int number[20],n,m,i;        DECLARE DECLARE DECLARE DECLARE
4     scanf("%d",&n);              METHOD_CALL
5     scanf("%d",&m);              METHOD_CALL
6     for(i=0;i<n;i++)             FOR{
7     scanf("%d,",&number[i]);     METHOD_CALL }
8     move(number,n,m);            METHOD_CALL
9     for(i=0;i<n-1;i++)           FOR{
10    printf("%d ",number[i]);     METHOD_CALL }
11    printf("%d",number[n-1]);    METHOD_CALL
12    }                            }

2. Similarity Calculation

The similarity is computed from the result of the GST string matching algorithm. GST uses the brute-force (BF) algorithm to compare the two strings; it can be optimized with the KMP algorithm, or GST can be replaced with the RKR-GST algorithm, which performs better and is based on the well-known Karp-Rabin string matching algorithm (Michael J. Wise, 1993).

4.3 Decision-making Process

Using the GST algorithm described above, we obtain a similarity for two programs. To get a more precise result, we design a decision-making module that decides whether the subgraph isomorphism algorithm should also be used. In our system, users can configure two similarity thresholds, max and min; for example, max = 0.9 and min = 0.5. If the similarity obtained by the GST algorithm is greater than max (0.9), we conclude that there is plagiarism; if it is less than min (0.5), we assume that there is no plagiarism. If the similarity lies between min and max, plagiarism is suspected, and further detection with the subgraph isomorphism algorithm is necessary. The thresholds are set with two goals in mind: on the one hand, we should ensure the accuracy of plagiarism detection; on the other hand, we should keep the system efficient. For example, if min is too high, plagiarized pairs that GST scores low are dismissed without further checking and accuracy decreases; if min is too low, too many pairs are passed to the slower subgraph isomorphism algorithm and efficiency is reduced. In general, max = 0.9 and min = 0.5 give a reasonable balance between the accuracy and the efficiency of the system.
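The decision rule can be summarised in a few lines of Python. This is a sketch of the logic described above, not the system's code: the function name `decide`, the returned stage labels and the use of a zero-argument callable for the subgraph isomorphism step (so that the expensive computation runs only when needed) are assumptions of this example.

```python
def decide(gst_sim, subgraph_sim_fn, max_threshold=0.9, min_threshold=0.5):
    """Return (similarity, stage) following the two-threshold rule.

    gst_sim          similarity already computed by the GST stage
    subgraph_sim_fn  zero-argument callable that runs the subgraph
                     isomorphism stage and returns its similarity
    """
    if not 0.0 <= min_threshold <= max_threshold <= 1.0:
        raise ValueError("require 0 <= min_threshold <= max_threshold <= 1")
    if gst_sim > max_threshold:
        return gst_sim, "plagiarism (GST)"
    if gst_sim < min_threshold:
        return gst_sim, "no plagiarism (GST)"
    # Suspected case: only now pay for the subgraph isomorphism stage.
    return subgraph_sim_fn(), "suspected (subgraph isomorphism)"


if __name__ == "__main__":
    print(decide(0.95, lambda: 0.0))  # (0.95, 'plagiarism (GST)')
    print(decide(0.70, lambda: 0.8))  # (0.8, 'suspected (subgraph isomorphism)')
    print(decide(0.30, lambda: 0.0))  # (0.3, 'no plagiarism (GST)')
```

Passing the second stage as a callable reflects the efficiency argument in the text: pairs decided by GST alone never trigger the subgraph isomorphism computation.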

4.4 Subgraph Isomorphism Algorithm Implementation

The implementation of the subgraph isomorphism algorithm has two parts. The first converts the structure of the program into a dependency graph. The second is the subgraph isomorphism calculation.

1. The Dependency Graph Generation

No matter how plagiarists modify a program, they will not change its output; if the results changed, the plagiarism would be pointless. The output is determined by the program's data and its structure. The program's data, namely its variables, can be assigned a value directly, or indirectly from another variable, which creates a dependency relationship between the variables. The structure of a program consists of sequential, branching and loop constructs. A program dependence graph (PDG) can fully represent the data and the structure of a program, and therefore we use the PDG in our system. In a PDG, a node represents a program statement, and an edge represents a data dependency or a control dependency.

Table 2: Types of program dependence graph node.

Type     Description
Declare  Declaration of variables
Assign   Assignment of variables, such as =, +=, ++, --
Control  if, else, while, for, do-while, switch
Jump     goto, break
Call     Function call
Return   return
Case     case or default in switch

The nodes in a PDG are divided into the types shown in Table 2. PDG edges are divided into two types: control dependency edges and data dependency edges. A control dependency edge represents a controlling relationship, such as control by if, else or while. A data dependency edge indicates that there is a data dependency between two nodes.

The following program computes a sum.

int i;
int sum = 0;
for (i = 0; i <= 100; i++) {
    sum += i;   /* add the loop counter to the running total */
}

And Fig. 2 shows the PDG after conversion.

Figure 2: Program Dependence Graph.

2. Similarity Calculation

Applying the subgraph isomorphism algorithm to the two PDGs yields their maximum common subgraph. We then calculate the ratio of the number of nodes in the maximum common subgraph (denoted T) to the number of nodes in the pattern graph (denoted P). The similarity is sim = |T|/|P|.

5 EXPERIMENTAL ANALYSIS

In this study, we tested our system with two groups of programs. The first group contains five questions taken from the programming language platform, each of which was completed independently by nine students. For each question there are therefore 9 source codes, which we compared in pairs. That is, each question requires 9 × (9 - 1) / 2 = 36 comparisons, and the five questions generate 36 × 5 = 180 comparisons in total. In our experiments we set the plagiarism thresholds to max = 0.9 and min = 0.5. The detection results are shown in Fig. 3.

Figure 3: Experimental results.

The second group contains only one question, namely "Find all the amicable numbers less than 3000", whose source code was then modified in a plagiarizing way by each of 12 students. We gave the students ten kinds of modification as references: (1) copy the original program completely; (2) modify the comments; (3) alter the program format and blank lines; (4) change the names of variables; (5) adjust the location of code statements; (6) adjust the position of variable declarations; (7) change the location of operands or operators in expressions; (8) change data types; (9) add redundant code; (10) replace a control structure with an equivalent one. In the end, only 2 of the 12 plagiarized samples had a similarity below 0.9; 4 codes were decided by the GST similarity calculation alone, and the remaining 8 codes were tested with the subgraph isomorphism algorithm. We found that the subgraph isomorphism algorithm was more accurate than GST for modifications of program structure.

Similarity calculation based on string matching is widely used in plagiarism detection systems. However, because of the characteristics of string matching algorithms, it cannot effectively detect plagiarism when large amounts of useless code are added or code positions are changed. JPlag tried to achieve high detection accuracy for structural modifications with a string matching algorithm, but did not succeed. From the experiments above, we conclude that similarity calculation based on the subgraph isomorphism algorithm is more accurate than that based on a string matching algorithm. A string matching algorithm needs to find a set of matches whose lengths are at least the minimum match length, and reducing the minimum match length increases the similarity reported by the GST algorithm. But if the minimum match length is too small, code that was not copied will also be suspected of plagiarism. Our system uses both GST and the subgraph isomorphism algorithm to calculate the similarity, achieving better accuracy than JPlag and similar systems for most copying techniques, with efficiency close to that of the string matching algorithm.

6 CONCLUSIONS

In this paper, we applied two classic algorithms, the well-known GST string matching algorithm and the subgraph isomorphism algorithm, in a practical similarity detection system. The detection process is completed in four steps. We tested our system in two experiments, in which the program source codes were submitted by real students. The first result shows that the code similarity detection system runs faster, with low accuracy. The second result demonstrates that the system can detect all the plagiarism levels defined by Faidhi and Robinson with a high accuracy of nearly 90%.

ACKNOWLEDGEMENTS

This work is supported by the National Natural Science Foundation of China (61202494) and the teaching reform project of Hunan Province.

REFERENCES

John L. Donaldson, Ann-Marie Lancaster and Paula H. Sposato. A Plagiarism Detection System. Twelfth SIGCSE Technical Symposium, St. Louis, Missouri, 1981:21-25.

G. Whale. Identification of Program Similarity in Large Populations [J]. The Computer Journal, 1990, 33(2):140-146.

D. Gitchell and N. Tran. Sim: A Utility for Detecting Similarity in Computer Programs [C]. In Proceedings of the 30th SIGCSE Technical Symposium, March 1999.

Michael J. Wise. YAP3: Improved Detection of Similarities in Computer Program and other Texts [J]. Department of Computer Science, University of Sydney, 2003.

M. H. Halstead. Elements of Software Science [J]. Elsevier computer science library, New York, 1977 (17):5-7.

K. L. Verco and M. J. Wise. Software for Detecting Suspected Plagiarism: Comparing Structure and Attribute-Counting Systems [J]. Computer Science, University of Sydney, 1996:3-5.

J. A. W. Faidhi and S. K. Robinson. An Empirical Approach for Detecting Program Similarity within a University Programming Environment [J]. Computers and Education, 1987, 11(1):1-19.

Michael Gilleland. Levenshtein Distance, in Three Flavors [J]. http://www.Merriampark.com/ld.htm. 2007-4-18.

Michael J. Wise. Detection of Similarities in Student Programs: YAP'ing May Be Preferable to Plague'ing [J]. SIGCSE Technical Symposium, Kansas City, USA, March 5-6, 1992:268-271.

Evgeny B. Krissinel and Kim Henrick. Common subgraph isomorphism detection by backtracking search [J]. Software-Practice and Experience, 2004(34):591-607 (DOI: 10.1002/spe.588).

Michael J. Wise. String Similarity Via Greedy String Tiling and Running Karp Rabin Matching [J]. Department of Computer Science, University of Sydney, December 1993.

M. Garey and D. Johnson. Computers and Intractability: A Guide to the Theory of NP-Completeness [J]. Freeman, 1979.

ASimilarityDetectionPlatformforProgrammingLearning

485