JSIMIL

A Java Bytecode Clone Detector

Luis Quesada, Fernando Berzal and Juan Carlos Cubero

Department of Computer Science and Artificial Intelligence, CITIC, University of Granada, Granada 18071, Spain

Keywords:

Java bytecode, Clone detection, Metrics, Hierarchical matching.

Abstract:

We present JSimil, a code clone detector that uses a novel algorithm to detect similarities in sets of Java programs at the bytecode level. The proposed technique emphasizes scalability and efficiency. It also supports customization through profiles that allow the user to specify matching rules, system behavior, pruning thresholds, and output details. Experimental results reveal that JSimil outperforms existing systems. It is even able to spot similarities when complex code obfuscation techniques have been applied.

1 INTRODUCTION

Code clone detection can expose many interesting features that are deeply embedded in program source code: common clusters of statements, perfect or almost perfect code matches, limits of the programming language and derived workarounds, coding style or lack thereof, and programming antipatterns.

Furthermore, tracking down duplicated code clusters has many commercial applications: it can be used to prove plagiarism in legal disputes over intellectual property rights or software patents (Belkhouche et al., 2004), to help in automatic refactoring or code maintenance (Tairas, 2008), and to compare student assignments (Cosma and Joy, 2006).

Traditional techniques include string matching, whose variants are implemented by YAP3 (Wise, 1996) and Baker and Manber's tool (Baker and Manber, 1998); string hash matching, as in Moss (Schleimer et al., 2003); token-based matching, as implemented by JPlag (Prechelt et al., 2000), CP-Miner (Li et al., 2006), and CCFinder (Kamiya et al., 2002); and control-flow graph analysis and its variants, such as the enhanced CFGs used by JDiff (Apiwattanapong et al., 2007).

More recent techniques compare program dependence graphs (Krinke, 2001), as in GPLAG (Liu et al., 2006); analyze simplifications of program behavior, either in the form of program slices (Weiser, 1981) or by treating programs as black boxes, as implemented by Semantic Diff (Jackson and Ladd, 1994); or calculate and compare metrics (Dunsmore, 1984).

These code clone detectors are hard-coded and offer no customization, so the results they produce can be inaccurate outside the specific situations they were designed for. In contrast, JSimil can be customized through profiles to fit different needs.

2 JSIMIL

JSimil is a code clone detector that performs a heuristic matching of program hierarchies at the bytecode level.

JSimil results, which consist of nested element matches, are generated by matching the classes, the methods, and the basic blocks within Java software. Several metrics, such as the number of different kinds of instruction, are calculated for every basic block.
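As an illustration of per-block metrics, the following minimal sketch counts how many instructions of each kind appear in a basic block given as a list of JVM mnemonics. The grouping into kinds (`KINDS`), the helper names, and the reading of the metric as one count per kind are assumptions made for this example; the text does not list the metrics JSimil actually computes.

```python
from collections import Counter

# Hypothetical grouping of JVM opcode mnemonics into instruction kinds.
# These categories are an assumption, not JSimil's actual metric set.
KINDS = {
    "load": ("iload", "aload", "lload", "fload", "dload", "getfield", "getstatic"),
    "store": ("istore", "astore", "lstore", "fstore", "dstore", "putfield", "putstatic"),
    "arithmetic": ("iadd", "isub", "imul", "idiv", "irem", "iinc"),
    "branch": ("goto", "ifeq", "ifne", "if_icmpge", "if_icmplt"),
    "invoke": ("invokevirtual", "invokestatic", "invokespecial", "invokeinterface"),
    "return": ("return", "ireturn", "areturn"),
}
METRIC_ORDER = list(KINDS) + ["other"]


def kind_of(mnemonic):
    """Map a mnemonic such as 'iload_1' to its (hypothetical) kind."""
    base = mnemonic
    head, sep, tail = mnemonic.rpartition("_")
    if sep and tail.isdigit():
        base = head  # strip the short-form index: iload_1 -> iload
    for kind, names in KINDS.items():
        if base in names:
            return kind
    return "other"


def block_metrics(mnemonics):
    """Return one instruction count per kind, in METRIC_ORDER."""
    counts = Counter(kind_of(m) for m in mnemonics)
    return [counts.get(kind, 0) for kind in METRIC_ORDER]


block = ["iload_1", "iload_2", "iadd", "istore_3", "iload_3", "ireturn"]
print(dict(zip(METRIC_ORDER, block_metrics(block))))
```

A vector of this form is the kind of leaf-level data that block matching can then compare.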

JSimil input consists of a configuration file, which specifies the paths to the compiler and the disassembler and options about their usage; a profile, which determines the system behavior, thresholds, pruning rules, and output detail; and the data input, which contains the source code (.java) files, bytecode (.jar or .class) files, or a mix of them, of the programs to compare. The output is a browsable hierarchy of matches across programs and their elements.

2.1 Hierarchical Matching

Matching is done at the block level, the method level, the class level, and the program level. Customized profiles determine how the system performs the matching.

Quesada L., Berzal F. and Carlos Cubero J. (2010). JSIMIL - A Java Bytecode Clone Detector. In Proceedings of the 5th International Conference on Software and Data Technologies, pages 333-336. DOI: 10.5220/0003013403330336. SciTePress.

2.1.1 Block Matching

The normalized similarity $S_b$ for any two blocks ($b_1$ and $b_2$) is calculated as follows:

$$S_b(b_1, b_2) = 1 - \frac{\sum_{i=0}^{M-1} |b_1(i) - b_2(i)| \, w(i)}{\sum_{i=0}^{M-1} \max\{b_1(i), b_2(i)\} \, w(i)} \tag{1}$$

where $M$ is the number of metrics computed for the blocks, $b_x(i)$ is the value of the $i$th measure for block $b_x$, and $w(i)$ is the weight for the $i$th measure as defined in the profile.

It should be noted that $S_b(b_1, b_2) = S_b(b_2, b_1)$.
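The block similarity of Equation (1) can be sketched directly as a weighted ratio of sums. The function name, the list-based representation of metric vectors, and the handling of two all-zero blocks are assumptions for this example; metrics and weights are assumed to be non-negative.

```python
def block_similarity(bx, by, weights):
    """Weighted similarity of two metric vectors, following Equation (1)."""
    if not (len(bx) == len(by) == len(weights)):
        raise ValueError("metric vectors and weights must have the same length")
    num = sum(abs(x - y) * w for x, y, w in zip(bx, by, weights))
    den = sum(max(x, y) * w for x, y, w in zip(bx, by, weights))
    if den == 0:
        # Assumption: two blocks with no weighted metrics count as identical.
        return 1.0
    return 1.0 - num / den


b1 = [3, 1, 1, 0, 0, 1]
b2 = [2, 1, 1, 1, 0, 1]
w = [1, 1, 1, 1, 1, 1]
print(block_similarity(b1, b2, w))  # 1 - 2/7
```

Because both the absolute difference and the maximum are symmetric in their arguments, swapping the two blocks leaves the result unchanged.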

2.1.2 Method Matching

The normalized similarity $S_m$ for any two methods is calculated by choosing one of them ($m_1$) according to profile parameters and method size, ordering its blocks by decreasing size, and matching them with the ones in the other method ($m_2$) by applying the following expression:

$$S_m(m_1, m_2) = \sum_{i=0}^{B-1} S_b(b_i(m_1), \mathrm{bm}(b_i(m_1), m_2)) \cdot s(b_i(m_1)) \tag{2}$$

where $B$ is the number of blocks contained in $m_1$; $b_i(m_1)$ is the $i$th block in the method $m_1$, ordered by decreasing size; $\mathrm{bm}(b_i(m_1), m_2)$ is the block contained in $m_2$ that best matches the block $b_i$ in $m_1$; and $s(b_i(m_1))$ is the number of instructions of the block $b_i$ in $m_1$.

The profile parameters also determine which method matchings are tried and which blocks are actually matched, according to similarity thresholds, element sizes, and already matched elements, among other factors.

In general, $S_m(m_1, m_2) \neq S_m(m_2, m_1)$ because the best match for $b_i(m_1)$ might be $b_j(m_2)$ but the best match for $b_j(m_2)$ might not be $b_i(m_1)$.
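The method matching step can be sketched as a greedy pairing of blocks, largest first. Several points here are assumptions rather than details given in the text: a block's size is taken as the sum of its metric counts, each block of the second method can be used only once, a pair is kept only when its similarity exceeds `min_similarity`, and the weighted sum is divided by the total size of the first method so that the result lies in [0, 1].

```python
def block_similarity(bx, by, weights):
    """Block similarity as in Equation (1); 1.0 when both blocks are empty."""
    num = sum(abs(x - y) * w for x, y, w in zip(bx, by, weights))
    den = sum(max(x, y) * w for x, y, w in zip(bx, by, weights))
    return 1.0 if den == 0 else 1.0 - num / den


def method_similarity(m1, m2, weights, min_similarity=0.0):
    """Greedy, size-weighted matching of the blocks of m1 against m2.

    m1 and m2 are lists of metric vectors (one per basic block).
    """
    total = sum(sum(block) for block in m1)  # assumed size: sum of counts
    if total == 0:
        return 0.0
    free = list(range(len(m2)))  # indices of still-unmatched blocks of m2
    score = 0.0
    for block in sorted(m1, key=sum, reverse=True):  # decreasing size
        if not free:
            break
        best = max(free, key=lambda j: block_similarity(block, m2[j], weights))
        s = block_similarity(block, m2[best], weights)
        if s > min_similarity:
            score += s * sum(block)
            free.remove(best)
    return score / total  # normalization by total size is an assumption


w = [1, 1]
small = [[4, 0]]
large = [[4, 0], [0, 3]]
print(method_similarity(small, large, w))  # every block of `small` is covered
print(method_similarity(large, small, w))  # only 4 of 7 instructions are covered
```

In this sketch the two directions differ because the method chosen as the starting point determines both which blocks must be covered and the size used for weighting.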

2.1.3 Class Matching

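The normalized similarity $S_c$ for any two classes is calculated like the $S_m$ method similarity, just by replacing blocks ($b$) with methods ($m$) and methods ($m$) with classes ($c$). Formally, with the sum running over the methods $m_i(c_1)$ of $c_1$ ordered by decreasing size:

$$S_c(c_1, c_2) = \sum_{i=0} S_m(m_i(c_1), \mathrm{bm}(m_i(c_1), c_2)) \cdot s(m_i(c_1)) \tag{3}$$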

Two classes are matched only if their similarity is higher than a user-defined profile threshold. When a class is not matched, its methods ascend through the hierarchy and the program temporarily becomes their parent, so they can be matched in the program matching step.

It should be noted that, in general, $S_c(c_1, c_2) \neq S_c(c_2, c_1)$, for the same reason that $S_m(m_1, m_2) \neq S_m(m_2, m_1)$.

2.1.4 Program Matching

The normalized similarity $S_p$ for any two programs is calculated by choosing one of them ($p_1$) according to its size and the profile parameters, ordering its classes and still-unmatched methods by decreasing size, and matching them with the ones in the other program ($p_2$) by applying the following expression:

$$S_p(p_1, p_2) = \sum_{i=0}^{C-1} S_c(c_i(p_1), \mathrm{bm}(c_i(p_1), p_2)) \cdot s(c_i(p_1)) + \sum_{i=0}^{U-1} S_m(u_i(p_1), \mathrm{bm}(u_i(p_1), p_2)) \cdot s(u_i(p_1)) \tag{4}$$

where $C$ is the number of classes contained in $p_1$; $c_i(p_1)$ is the $i$th class in the program $p_1$, ordered by decreasing size; $\mathrm{bm}(c_i(p_1), p_2)$ is the class contained in $p_2$ that best matches the class $c_i$ of $p_1$; $s(c_i(p_1))$ is the number of instructions in class $c_i$ in $p_1$; $U$ is the number of still-unmatched methods contained in $p_1$; $u_i(p_1)$ is the $i$th unmatched method in the program $p_1$, ordered by decreasing size; $\mathrm{bm}(u_i(p_1), p_2)$ is the unmatched method contained in $p_2$ that best matches the method $u_i$ in $p_1$; and $s(u_i(p_1))$ is the number of instructions of the still-unmatched method $u_i$ in $p_1$.

It should be noted that, in general, $S_p(p_1, p_2) \neq S_p(p_2, p_1)$, for the same reason that $S_m(m_1, m_2) \neq S_m(m_2, m_1)$ and $S_c(c_1, c_2) \neq S_c(c_2, c_1)$.
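The whole hierarchy can be sketched with one recursive routine, including the step in which the methods of unmatched classes ascend to the program level. This is a simplified illustration under stated assumptions: nodes are nested lists (program, class, method, block), metrics are unweighted, element size is the sum of metric counts, matching is greedy and one-to-one, and scores are divided by the total size of the first program. All names, including `class_threshold`, are hypothetical.

```python
def is_leaf(node):
    """A leaf is a block: a flat list of metric counts."""
    return not node or isinstance(node[0], (int, float))


def size(node):
    """Assumed instruction count: sum of metric counts below this node."""
    return sum(node) if is_leaf(node) else sum(size(child) for child in node)


def leaf_similarity(bx, by):
    """Unweighted version of Equation (1)."""
    num = sum(abs(x - y) for x, y in zip(bx, by))
    den = sum(max(x, y) for x, y in zip(bx, by))
    return 1.0 if den == 0 else 1.0 - num / den


def greedy_match(xs, ys, threshold=0.0):
    """Pair each x (largest first) with its best still-free y.

    Returns (matched pairs as (x, similarity), unmatched xs, unmatched ys).
    """
    free = list(ys)
    pairs, unmatched = [], []
    for x in sorted(xs, key=size, reverse=True):
        best = max(free, key=lambda y: similarity(x, y), default=None)
        s = similarity(x, best) if best is not None else 0.0
        if best is not None and s > threshold:
            free.remove(best)
            pairs.append((x, s))
        else:
            unmatched.append(x)
    return pairs, unmatched, free


def similarity(a, b):
    """Block, method, or class similarity, depending on the nesting depth."""
    if is_leaf(a):
        return leaf_similarity(a, b)
    total = size(a)
    pairs, _, _ = greedy_match(a, b)
    return sum(s * size(x) for x, s in pairs) / total if total else 0.0


def program_similarity(p1, p2, class_threshold=0.5):
    """Match classes first, then the methods of classes left unmatched."""
    total = size(p1)
    class_pairs, lost1, lost2 = greedy_match(p1, p2, class_threshold)
    orphans1 = [m for c in lost1 for m in c]  # methods ascend to the program
    orphans2 = [m for c in lost2 for m in c]
    method_pairs, _, _ = greedy_match(orphans1, orphans2)
    score = sum(s * size(x) for x, s in class_pairs + method_pairs)
    return score / total if total else 0.0


# One class with two methods versus the same two methods split into two classes.
p1 = [[[[4, 0]], [[0, 2]]]]
p2 = [[[[4, 0]]], [[[0, 2]]]]
print(program_similarity(p1, p2, class_threshold=0.8))
print(program_similarity(p1, p2, class_threshold=0.5))
```

With the stricter threshold no class pair is accepted, so both methods are matched at the program level and the split is fully recovered; with the looser threshold the class is paired with its larger fragment and the remaining method stays uncovered.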

2.2 JSimil Proﬁles

Profiles are the main novelty of the proposed clone detector and they are, indeed, key to JSimil's flexibility.

Profiles contain parameters that allow the user to fine-tune the behavior of the system by adjusting matching thresholds and rules, output detail level, pruning thresholds, and metric weights.

The JSimil distribution package includes the JSimil Profile Manager, a tool for designing profiles that offers contextual help and shows several profile properties, both as normalized numeric values and as a Kiviat diagram called the profile fingerprint (Figure 1):

ICSOFT 2010 - 5th International Conference on Software and Data Technologies


Figure 1: Fingerprints for default, student assignment plagiarism detection, and exhaustive profiles.

Feature                 JPlag  Moss  GPLAG  CP-Miner  CCFinder  Semantic Diff  JDiff  Baker and Manber's  JSimil

Applications
  Plagiarism            ✓      ✓     ✓      ✗         ✗         ✗              ✗      ✗                   ✓
  Copy-and-pasted code  ✗      ✗     ✗      ✓         ✓         ✗              ✗      ✓                   ✓
  Diff                  ✗      ✗     ✗      ✗         ✗         ✓              ✓      ✗                   ✓

Supported inputs
  Source code (java)    ✓      ✓     ✓      ✓         ✓         ✓              ✓      ✗                   ✓
  Bytecode (jar/class)  ✗      ✗     ✗      ✗         ✗         ✓              ✗      ✓                   ✓
  Mixed                 ✗      ✗     ✗      ✗         ✗         ✓              ✗      ✗                   ✓

Supported outputs
  Plain text            ✗      ✓     ✗      ✓         ✓         ✓              ✓      ✓                   ✓
  HTML                  ✓      ✗     ✓      ✗         ✗         ✗              ✗      ✗                   ✓
  XML                   ✗      ✗     ✗      ✗         ✗         ✗              ✗      ✗                   ✓
  Diff-like             ✗      ✗     ✗      ✗         ✗         ✓              ✓      ✓                   ✓

Robustness
  Textual changes       ✓      ✓     ✓      ✗         ✓         ✓              ✓      ✓                   ✓
  Code reordering       ✗      ✗     ✓      ✗         ✗         ✓              ✗      ✗                   ✓
  Code insertion        ✗      ✗     ✓      ✗         ✗         ✓              ✗      ✗                   ✓

Optimizations
  Parallelized          ✗      ✗     ✗      ✗         ✗         ✗              ✗      ✗                   ✓
  Supports pruning      ✗      ✗     ✓      ✗         ✗         ✗              ✗      ✗                   ✓
  Hierarchical matching ✗      ✗     ✗      ✗         ✗         ✗              ✗      ✗                   ✓

Customization
  System behavior       ✗      ✗     ✗      ✗         ✗         ✗              ✗      ✗                   ✓
  Matching rules        ✗      ✗     ✗      ✗         ✗         ✗              ✗      ✗                   ✓
  Output detail         ✗      ✗     ✗      ✗         ✗         ✗              ✗      ✗                   ✓
  Pruning thresholds    ✗      ✗     ✓      ✗         ✗         ✗              ✗      ✗                   ✓

Figure 2: Qualitative comparison of existing clone detectors and JSimil.

• Speed (SP) measures the number of options in use that help reduce processing time.
• Detail (DE) measures the amount of data that will be generated.
• Sensibility (SE) measures how much partial results may influence global results.
• Precision (PR) measures how small the margin of error will be.
• Specialization (SC) measures the number of options with non-standard values.
• Assimilation (AS) measures the amount of previous knowledge provided by the user to obtain better results.

These properties correspond to the visible characteristics of the system behavior the user will be most concerned about.

3 EXPERIMENTAL COMPARISON

A table summarizing the capabilities of similarity detection systems is shown in Figure 2.

JSimil obtained more accurate results than the other systems in a comparison of 222 student assignments that was performed in order to detect plagiarism. JSimil matched the 222 student assignments, of 200KB of source code each on average, among themselves (a total of 24753 program-program matches) in 30 seconds using JSimil's plagiarism detection profile, which corresponds to roughly 1.2 milliseconds per match on a mid-range Intel Quad Core personal computer.
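The reported figures can be checked with a few lines of arithmetic. The helper name `pair_count` is hypothetical. The observation that 24753 equals n(n + 1)/2 for n = 222, which is the number of unordered pairs when each assignment is also paired with itself, is an inference from the numbers and not something the text states.

```python
def pair_count(n, include_self=False):
    """Number of unordered pairs among n items, optionally with self-pairs."""
    return n * (n + 1) // 2 if include_self else n * (n - 1) // 2


n = 222
matches = pair_count(n, include_self=True)  # equals the reported total
ms_per_match = 30.0 / matches * 1000.0      # 30 seconds over all matches
print(matches, pair_count(n), round(ms_per_match, 2))
```

The per-match time comes out slightly above one millisecond, in line with the order of magnitude quoted above.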

4 CONCLUSIONS

We have described a Java code clone detection system that uses a novel hierarchical matching technique. It solves issues that affect similar existing systems and offers advantages over them, such as support for all Java constructs, the possibility of comparing programs when only bytecode is available, browsable results, a parallelized implementation, and pruning of non-significant matches to reduce processing time.

The profiles allow users to adjust system behavior, so that unwanted features can be disabled and the algorithm can be tuned.

The proposed algorithm is not sensitive to common obfuscation and plagiarism-concealing techniques that other systems are sensitive to. In fact, the experimental results revealed that JSimil outperforms existing systems, as it is able to detect similarities they cannot.

Currently, JSimil matches hierarchies of Java bytecode whose leaf nodes are sets of metrics computed from basic blocks. JSimil can be extended by giving the system a description of the hierarchy to be used, so that data from different sources can be matched using the same profiles. This will give researchers the possibility of developing general profiles (for example, for difference detection or similarity detection) that would define how to match any kind of hierarchical data.

REFERENCES

Apiwattanapong, T., Orso, A., and Harrold, M. J. (2007). JDiff: a differencing technique and tool for object-oriented programs. Automated Software Engineering, 14(1):3–36.

Baker, B. S. and Manber, U. (1998). Deducing similarities in Java sources from bytecodes. In Proc. of Usenix Annual Technical Conference, pages 179–190.

Belkhouche, B., Nix, A., and Hassell, J. (2004). Plagiarism detection in software designs. In Proc. of the 42nd Annual Southeast Regional Conference, pages 207–211.

Cosma, G. and Joy, M. (2006). Source-code plagiarism: a UK academic perspective. Technical Report 422, University of Warwick.

Dunsmore, H. E. (1984). Software metrics: an overview of an evolving methodology. Information Processing and Management, 20(1-2):183–192.

Jackson, D. and Ladd, D. A. (1994). Semantic Diff: a tool for summarizing the effects of modifications. In Proc. of the International Conference on Software Maintenance, pages 243–252.

Kamiya, T., Kusumoto, S., and Inoue, K. (2002). CCFinder: a multilinguistic token-based code clone detection system for large scale source code. IEEE Transactions on Software Engineering, 28(7):654–670.

Krinke, J. (2001). Identifying similar code with program dependence graphs. In Proc. of the 8th Working Conference on Reverse Engineering, pages 301–309.

Li, Z., Lu, S., Myagmar, S., and Zhou, Y. (2006). CP-Miner: Finding copy-paste and related bugs in large-scale software code. IEEE Transactions on Software Engineering, 32(2):176–192.

Liu, C., Chen, C., and Han, J. (2006). GPLAG: Detection of software plagiarism by program dependence graph analysis. In Proc. of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 872–881.

Prechelt, L., Malpohl, G., and Philippsen, M. (2000). JPlag: Finding plagiarism among a set of programs. Technical Report 2000-1, University of Karlsruhe.

Schleimer, S., Wilkerson, D. S., and Aiken, A. (2003). Winnowing: Local algorithms for document fingerprinting. In Proc. of the 22nd ACM SIGMOD International Conference on Management of Data, pages 76–85.

Tairas, R. (2008). Clone maintenance through analysis and refactoring. In Proc. of the 2008 Foundations of Software Engineering Doctoral Symposium, pages 29–32.

Weiser, M. (1981). Program slicing. In Proc. of the 5th International Conference on Software Engineering, pages 439–449.

Wise, M. J. (1996). YAP3: Improved detection of similarities in computer program and other texts. In Proc. of the 27th SIGCSE Technical Symposium on Computer Science Education, pages 130–134.
