explained. Note that only the order of matching and
non-matching citations in the document that is
chunked determines which chunks are formed for
that document. The citation sequence of the other
document has no influence on the chunk formation.
For document A in Figure 5, the algorithm starts
forming the first chunk, shown in red, by including
the matching citations #1, #2 and #3, because no
non-matching citations separate those matching
citations. The first chunk contains three matching
citations at this point. Therefore, the algorithm will
include the next matching citation in the sequence if
three or fewer non-matching citations separate it from
the last matching citation in the chunk (#3). Citation
#4 fulfills this condition and is thus included in the first
chunk, as is citation #5, which directly succeeds #4.
At this point, the first chunk contains five matching
citations. The next matching citation in the sequence
(#6) is separated from the last matching citation in
the first chunk (#5) by six non-matching citations. In
other words, the number of non-matching citations
in between the matching citations #5 and #6 is larger
than the number of matching citations in the first
chunk. Therefore, the algorithm finalizes the first
chunk and includes citations #6 and #7 in a second
chunk, shown in green, thereby completing the
processing of document A. By processing the
citation sequence of document B in the same
manner, the algorithm forms two chunks for
document B, although the order of matching
citations in documents A and B differs.
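The chunk formation walked through above can be sketched in a few lines of Python. This is a minimal illustration, not the CitePlag implementation: the function name form_chunks, the use of "x" for non-matching citations, and the example sequence are assumptions made here. The inclusion rule (a gap of at most as many non-matching citations as the chunk already holds matching ones) is inferred from the worked example, and the gap of three before citation #4 is chosen only to exercise the boundary of that rule.

```python
def form_chunks(sequence, non_match="x"):
    """Group the matching citations of ONE document into chunks.

    sequence:  citations in reading order; `non_match` marks citations
               that are not shared with the other document.
    Assumed rule: the next matching citation joins the current chunk if
    the number of non-matching citations since the last matching one is
    at most the number of matching citations already in the chunk.
    """
    chunks = []
    current = []
    gap = 0  # non-matching citations seen since the last matching one
    for citation in sequence:
        if citation == non_match:
            gap += 1
            continue
        if current and gap > len(current):
            # Gap is too large: finalize the chunk and start a new one.
            chunks.append(current)
            current = []
        current.append(citation)
        gap = 0
    if current:
        chunks.append(current)
    return chunks


# Hypothetical sequence modeled on the walk-through for document A.
doc_a = [1, 2, 3, "x", "x", "x", 4, 5] + ["x"] * 6 + [6, 7]
chunks_a = form_chunks(doc_a)
print(chunks_a)  # [[1, 2, 3, 4, 5], [6, 7]]
```

Only the sequence of the document being chunked is passed in, which mirrors the statement that the other document's citation order has no influence on chunk formation.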
In the following comparison step, the Citation
Chunking algorithm compares all chunks formed for
both documents with each other. The chunks with
the highest overlap in matching citations are stored
as a citation pattern match. For the example shown
in Figure 5, the algorithm stores a pattern match of
length four between the red chunks and a pattern
match of length two for the green chunks.
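A minimal sketch of the comparison step follows. It assumes that "overlap" means the number of shared citations regardless of their order, and that each chunk of document A is paired with the single chunk of document B that shares the most citations. The text does not specify how ties or one-to-one pairing are handled, so the function best_chunk_matches and the example chunks for document B are hypothetical. The chunks are chosen to reproduce the match lengths reported for Figure 5 and are not taken from the figure.

```python
def best_chunk_matches(chunks_a, chunks_b):
    """Pair each chunk of document A with its best-overlapping chunk of B.

    Returns a list of (chunk_a, chunk_b, overlap) tuples, where overlap is
    the number of citations the two chunks have in common.
    """
    matches = []
    for chunk_a in chunks_a:
        best, best_overlap = None, 0
        for chunk_b in chunks_b:
            overlap = len(set(chunk_a) & set(chunk_b))
            if overlap > best_overlap:  # ties keep the first chunk found
                best, best_overlap = chunk_b, overlap
        if best is not None:
            matches.append((chunk_a, best, best_overlap))
    return matches


# Hypothetical chunks; document B lists the shared citations in another order.
chunks_a = [[1, 2, 3, 4, 5], [6, 7]]
chunks_b = [[3, 1, 2, 5], [4, 7, 6]]
pattern_lengths = [m[2] for m in best_chunk_matches(chunks_a, chunks_b)]
print(pattern_lengths)  # [4, 2]
```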
Citation Chunking aims to uncover potential
cases in which text segments or logical structures
have been copied or were influenced by another text.
The chunking strategy implemented in the CitePlag
prototype allows for sporadic non-shared citations
that may have been inserted to make the resulting
text appear more “genuine”. By allowing an
increasing number of non-shared citations within a
chunk, given that a certain number of shared
citations have already been included, the Citation
Chunking algorithm can also detect potential
plagiarism cases where text segments and citations
from different sources were copied and interwoven
(shake&paste plagiarism).
The second image from the left in Figure 4
shows the citation patterns identified in the instance
of cross-language plagiarism from the Guttenberg
thesis using the Citation Chunking algorithm. The
substantial overlap in citations, which was already
apparent by visualizing Bibliographic Coupling
relations, is also reflected by the numerous and
densely linked citation chunks. Visualizing the
patterns returned by the Citation Chunking
algorithm reveals a number of clusters pointing to
similar text segments. Within individual clusters,
lines connecting matching citations are mostly
parallel, with only a few overlaps. The pattern suggests
that the selection and placement of citations in
numerous well-defined segments of the Guttenberg
thesis is highly similar to the source document.
4.3 Greedy Citation Tiling
The Greedy Citation Tiling (GCT) algorithm
identifies all individually longest citation patterns
that consist entirely of matching citations in the
exact same order. Individually longest patterns refer
to sequences of matching citations in the same order
that cannot be extended to the left or right without
encountering a citation that is not shared by both
documents being compared. Such individually longest
matches are called citation tiles.
Figure 6 illustrates the formation of Greedy
Citation Tiles. Using the notation introduced in
Figure 5, Arabic numerals represent matching
citations, and the letter x denotes non-matching
citations. Colored highlights with Roman numerals
represent citation tiles. Each citation tile is stored as
a numeric triplet, shown at the bottom of Figure 6. The first
element of the triplet indicates the starting position
of the citation tile in document A, the second
element denotes the starting position in document B
and the third element corresponds to the length of
the tile.
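The tiling and the triplet representation can be illustrated with a simplified greedy procedure: repeatedly take the longest run of still-unused matching citations that appears in the same order in both documents, record it as a triplet, and mark its citations as used. The sketch below is an assumption-laden illustration rather than the CitePlag code. The function name greedy_citation_tiles, the 1-based positions, the retention of tiles of length one, and the example sequences are all choices made here.

```python
def greedy_citation_tiles(doc_a, doc_b, non_match="x"):
    """Return citation tiles as (start_in_a, start_in_b, length) triplets.

    Both documents are lists of citations in reading order; shared sources
    carry the same identifier and `non_match` marks unshared citations.
    Positions are counted from 1 (an assumption for this sketch).
    """
    used_a = [False] * len(doc_a)
    used_b = [False] * len(doc_b)
    tiles = []
    while True:
        best_len, best_i, best_j = 0, 0, 0
        for i in range(len(doc_a)):
            for j in range(len(doc_b)):
                k = 0
                # Extend the run while citations match and are still free.
                while (i + k < len(doc_a) and j + k < len(doc_b)
                       and doc_a[i + k] == doc_b[j + k]
                       and doc_a[i + k] != non_match
                       and not used_a[i + k] and not used_b[j + k]):
                    k += 1
                if k > best_len:
                    best_len, best_i, best_j = k, i, j
        if best_len == 0:
            break  # no unused matching citations remain
        for k in range(best_len):
            used_a[best_i + k] = used_b[best_j + k] = True
        tiles.append((best_i + 1, best_j + 1, best_len))
    return sorted(tiles)


# Hypothetical sequences, not the ones shown in Figure 6.
doc_a = [1, 2, 3, "x", 4, 5, "x", 6]
doc_b = [4, 5, "x", 1, 2, 3, 6]
tiles = greedy_citation_tiles(doc_a, doc_b)
print(tiles)  # [(1, 4, 3), (5, 1, 2), (8, 7, 1)]
```

The first triplet reads as follows: a tile of length three starts at position 1 in document A and at position 4 in document B. A minimum tile length could be added as a filter if single shared citations should not count as tiles.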
Figure 6: Greedy Citation Tiling schematic concept.
Finding many or long matching citation tiles is
rarely a coincidence, and can thus be a strong
indicator of plagiarism. In Figure 4, the third image
from the left shows the visualization of citation tiles
for the longest instance of cross-language plagiarism
in the Guttenberg thesis. Numerous citation tiles up
to a length of five citations were identified in this