assignments that was performed in order to detect plagiarism.
JSimil matched the 222 student assignments, each averaging
200 KB of source code, against one another (a total of 24753
program-program matches) in 30 seconds using the JSimil
plagiarism detection profile. This corresponds to roughly
1.2 milliseconds per match on a mid-range Intel quad-core
personal computer.
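The figures above can be checked with a few lines of arithmetic. The sketch below is only an illustration: the function names are hypothetical, and reading 24753 as n(n+1)/2 (every unordered pair plus each program matched against itself) is an assumption inferred from the numbers, not something stated in the text.

```python
def matches_including_self(n: int) -> int:
    """Unordered pairs drawn from n programs, self-pairs included (assumed reading)."""
    return n * (n + 1) // 2


def matches_excluding_self(n: int) -> int:
    """Unordered pairs of distinct programs, shown for comparison."""
    return n * (n - 1) // 2


programs = 222            # assignments in the experiment
elapsed_seconds = 30.0    # reported total matching time

total_matches = matches_including_self(programs)        # 24753, as reported
ms_per_match = 1000.0 * elapsed_seconds / total_matches  # about 1.21 ms

print(f"{total_matches} matches, {ms_per_match:.2f} ms per match")
```

Counting only distinct pairs would give 24531 matches, so the reported total is consistent with self-matches being included.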
4 CONCLUSIONS
We have described a Java code clone detection system that
uses a novel hierarchical matching technique. The technique
addresses issues that affect similar existing systems and
offers several advantages over them: support for all Java
constructs, the ability to compare programs when only
bytecode is available, result browsing, a parallelized
implementation, and pruning of non-significant matches to
reduce processing time.
Profiles let users adjust system behavior: unwanted features
can be removed and the algorithm can be adjusted.
The proposed algorithm is not sensitive to the common
obfuscation and plagiarism-concealing techniques that affect
other systems. The experimental results showed that JSimil
outperforms existing systems because it detects similarities
they cannot.
Currently, JSimil matches hierarchies of Java bytecode whose
leaf nodes are sets of metrics computed from basic blocks.
JSimil can be extended to match data from other sources with
the same profiles by giving the system a description of the
hierarchy to be used. This would allow researchers to develop
general profiles (for example, for difference detection or
similarity detection) that define how to match any kind of
hierarchical data.
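To make the idea of matching generic hierarchies concrete, here is a minimal sketch. It is not JSimil's algorithm: the `Node` and `Profile` types, the greedy pairing of children, the leaf similarity formula, and the `prune_below` threshold are all assumptions chosen for illustration. Metrics are assumed to be non-negative counts.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Node:
    """Generic hierarchy node (e.g. program > method > basic block)."""
    name: str
    children: List["Node"] = field(default_factory=list)
    metrics: Dict[str, float] = field(default_factory=dict)  # leaves only


@dataclass
class Profile:
    """Hypothetical profile: child pairs scoring below the threshold are ignored."""
    prune_below: float = 0.5


def leaf_similarity(a: Node, b: Node) -> float:
    """Average per-metric agreement in [0, 1]; assumes non-negative metrics."""
    keys = set(a.metrics) | set(b.metrics)
    if not keys:
        return 1.0
    total = 0.0
    for key in keys:
        x, y = a.metrics.get(key, 0.0), b.metrics.get(key, 0.0)
        largest = max(abs(x), abs(y))
        total += 1.0 if largest == 0 else 1.0 - abs(x - y) / largest
    return total / len(keys)


def similarity(a: Node, b: Node, profile: Profile) -> float:
    """Score two subtrees by greedily pairing their most similar children."""
    if not a.children and not b.children:
        return leaf_similarity(a, b)
    if not a.children or not b.children:
        return 0.0  # a leaf never matches an inner node in this sketch
    scored = sorted(
        ((similarity(ca, cb, profile), i, j)
         for i, ca in enumerate(a.children)
         for j, cb in enumerate(b.children)),
        reverse=True,
    )
    used_a, used_b, total = set(), set(), 0.0
    for score, i, j in scored:
        if score < profile.prune_below:
            break  # remaining pairs are treated as non-significant
        if i not in used_a and j not in used_b:
            used_a.add(i)
            used_b.add(j)
            total += score
    return total / max(len(a.children), len(b.children))


def block(name: str, instructions: float, branches: float) -> Node:
    """Build a leaf holding two illustrative basic-block metrics."""
    return Node(name, metrics={"instructions": instructions, "branches": branches})


original = Node("P", [Node("m1", [block("b1", 10, 2), block("b2", 4, 0)]),
                      Node("m2", [block("b3", 7, 1)])])
reordered = Node("Q", [Node("n2", [block("b3", 7, 1)]),
                       Node("n1", [block("b2", 4, 0), block("b1", 10, 2)])])

print(similarity(original, reordered, Profile()))
```

Because children are paired by score rather than by position or name, renaming and reordering methods or blocks leaves the score at 1.0 in this example. Note that this sketch still scores every child pair before discarding the weak ones, so it only illustrates how pruning changes the result, not how it saves processing time.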
ICSOFT 2010 - 5th International Conference on Software and Data Technologies