Lastly, we compare our method with a citation-based one calculated by counting the same cited documents:

cbm(q, d) = |citing(q) ∩ citing(d)| (24)
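As a rough illustration (the helper below and its inputs are hypothetical, not the paper's code), this baseline reduces to a set intersection over the documents associated with q and d:

def cbm(citing_q, citing_d):
    """Citation-based baseline: number of cited documents shared by q and d.

    citing_q, citing_d: sets of document identifiers for the query
    document q and the candidate document d (illustrative inputs).
    """
    return len(set(citing_q) & set(citing_d))

# Example: q and d share two entries, so cbm(q, d) = 2.
print(cbm({"doc1", "doc2", "doc3"}, {"doc2", "doc3", "doc5"}))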
4.5 Evaluation Results
In this section, we discuss the results from the ranking and the classification scenarios. We also discuss the error and success cases that occurred in the classification scenario.
Table 2 shows the results from the ranking scenario for R, P, and F1. According to the R scores, the baseline methods bowm, cbm, and cnc achieved significantly high scores. This happened because, during the dataset creation, (Pertile et al., 2016) pooled the document pairs that have significant content similarity, i.e., the top-30 document pairs, to be annotated. Hence, improving on their performance in this scenario is quite difficult.
Among the baselines, cnc achieved the best R performance at cut-off=30, while our method achieved the same performance as cnc in R, P, and F1 at every cut-off. At cut-off=30, our method and cnc improved the R scores by retrieving one and three more plagiarized documents than bowm and cbm, respectively.
According to the MAP scores shown in Table 3 at cut-off=30, cnc is the best among the baseline methods, while ours achieves the best performance of all the methods. Our method improved the ranking position of one source document compared with the baseline methods bowm and cnc: the rank of this source document in bowm, cnc, and our method is 32, 30, and 24, respectively.
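For reference, the MAP scores in Table 3 follow the usual mean-average-precision formulation. A minimal sketch of the per-query average precision at a cut-off (a textbook definition, not necessarily the paper's exact implementation) is:

def average_precision(ranked_ids, relevant_ids, cutoff=30):
    """Average precision of one query document at a given cut-off.

    ranked_ids: candidate source documents ordered by similarity score.
    relevant_ids: the annotated (plagiarized-from) source documents.
    MAP is the mean of this value over all query documents.
    """
    hits, score = 0, 0.0
    for rank, doc_id in enumerate(ranked_ids[:cutoff], start=1):
        if doc_id in relevant_ids:
            hits += 1
            score += hits / rank
    return score / len(relevant_ids) if relevant_ids else 0.0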
In this evaluation, we found that the best λ for cnc was .2. We also found that the optimal α and β for our method were between .1 and .3, and between .2 and .4, respectively. These results suggest that the content similarity of non-citing sentences should be prioritized, but the similarity of citation behavior should not be ignored. Additionally, using citing sentences as extra content for the cited document is useful.
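Purely as an illustration of how such weights could enter a linear combination (the actual combination is defined earlier in the paper; which component each weight multiplies here is an assumption made for the example), one might write:

def combined_similarity(sim_cb, sim_add, sim_nc, alpha=0.2, beta=0.3):
    # Hypothetical linear mix: alpha weights the citation-behavior
    # similarity (optimal range ~.1-.3 in this evaluation), beta weights
    # the citing-sentence similarity (~.2-.4), and the remaining weight
    # goes to the non-citing-sentence similarity, which receives the
    # largest share, in line with the observation above.
    return alpha * sim_cb + beta * sim_add + (1.0 - alpha - beta) * sim_nc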
Table 4 shows the MAP scores for each document similarity method that we proposed. Among them, sim_nc achieves the best MAP score, while sim_add achieves the lowest. The reason sim_add has the lowest MAP score is that some source documents (20 of 40) do not have any sentences citing them, or their citing sentences were not extracted or identified.
In the classification scenario, we used the SVM (Support Vector Machine) algorithm (scikit-learn, http://scikit-learn.org/) to perform this task. Thus, α, β, and λ are decided automatically by the SVM. We performed stratified 10-fold cross-validation and also searched for the optimal SVM parameters, i.e., the type of kernel, C, and γ.
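A minimal sketch of this setup with scikit-learn, assuming the similarity scores are precomputed features for each document pair and the plagiarized/non-plagiarized label is binary (the feature layout and placeholder data are assumptions, not the paper's code):

import numpy as np
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.svm import SVC

# X: one row per document pair, e.g. [sim_cb, sim_nc, sim_add];
# y: 1 for plagiarized pairs, 0 otherwise (placeholders below).
X = np.random.rand(93, 3)
y = np.random.randint(0, 2, 93)

# Grid search over kernel type, C, and gamma with stratified 10-fold CV.
param_grid = {
    "kernel": ["linear", "rbf"],
    "C": [0.1, 1, 10, 100],
    "gamma": ["scale", 0.01, 0.1, 1],
}
search = GridSearchCV(
    SVC(), param_grid,
    cv=StratifiedKFold(n_splits=10, shuffle=True, random_state=0),
    scoring="f1",
)
search.fit(X, y)
print(search.best_params_, search.best_score_)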
Since the method of (Lopez, 2009) failed to extract some citing sentences and/or identify the cited documents in the ranking scenario, we manually performed these tasks on both positive and negative pairs for the classification scenario. Thus, we could provide the ideal situation for all the document similarity methods except sim_add, since it was not possible to do these tasks manually on all documents in the collection.
Table 5 shows the evaluation results in the classification scenario. Our similarity of citation behavior (sim_cb) scored .3466, .17, .4228, and .338 higher than cbm in P, R, F1, and A, respectively. These results indicate that citing sentences should be considered when comparing citations or reference lists.
Since we could not provide the ideal situation for sim_add, i.e., only 31 of the 93 document pairs have candidate source documents with sentences citing them, its performance is the lowest among our three document similarities in P, R, F1, and A. Despite this, sim_add is still useful in the situation where the content of candidate source documents is not available in the document collection.
The combination of sim_cb and sim_nc also scored .0433, .045, .0446, and .0424 higher than cnc in terms of P, R, F1, and A, respectively. These results suggest that citation anchors should also be considered when comparing citing sentences. Additionally, this combination performed .0655, .0501, and .0433 higher than bowm in P, F1, and A, respectively. This indicates that citing and non-citing sentences should be distinguished when comparing documents.
According to the F1 and A scores, our method (sim_cb, sim_nc, sim_add) is the best one. It performed .0501, .4228, and .0585 higher than the baseline methods bowm, cbm, and cnc in F1, respectively. This indicates that our document similarity methods complement each other. In addition, our method also produced the fewest FN and FP compared with the baseline methods, according to Table 6.
Our method produced three FN and six FP according to Table 6. The three FN occurred because the similar text fragments between these pairs are short: they share fewer than three citing sentences and only a few non-citing ones. Thus, detecting plagiarism with such small textual overlap remains difficult when documents are long.
The FP, on the other hand, occurred because a few shared citing sentences contain multiple citation anchors. Typically, these citing sentences only list