Paraphrase Detection based on Vector Space Model: A Study of

Utilization of Semantic Network for Improving Information

Nurwati,Yudi Santoso,Krisna Adiyarta

Universitas Budiluhur

Keywords: Paraphrasing, Vector Space Model, Precision and Recall

Abstract: Paraphrasing if seen in plain view, does not look like it, so we need a technique or model that can measure

the level of similarity between documents that will compare these documents. Vector Space Model is a

standard approach model used to find similarities between documents. This study aims to find a system

model that can be applied to paraphrase detection applications by utilizing semantic information as a tool

that is integrated with the Vector Space Model. This study will use prototyping research strategies. The

approach taken in conducting this investigation is to compare the performance of the system prototype

developed according to the research hypothesis with a standard prototype that is built according to a

standard framework. In its investigation, this research will use Confusion matrix as the most popular tool in

evaluating system performance using accuracy performance criteria, namely Precision and Recall. In this

way, it is expected that the semantic network model, data structure model, and algorithm that can be

integrated with the vector space model to produce a paraphrase detection system that has a perfect

performance is expected.

1 INTRODUCTION

Paraphrasing is a linguistic term which means re-

expressing a concept in another way in the same

language, but without changing its meaning.

Paraphrasing gives the author the possibility to

emphasize somewhat differently from the original

author.

The rapid development of technology makes it

easy for information users to find and find the

information needed. The internet is one of the most

widely used technology products for information

seekers by providing sources of information from

the authors themselves as well as duplicating with or

without including the original authors. The ease of

getting information and documents through internet

media creates new problems because it turns out that

documents are still found without mentioning the

source of the document's author. It was not known

intentionally or accidentally. Posts or pieces of

writing taken from other people's writings,

intentionally or unintentionally, if not correctly and

adequately referenced, can be categorized as

plagiarism, according to (Isa et al. 2014).

The task of identifying paraphrases has become

mainstream in the research area for natural language

processing. The success of application development

that utilizes semantic similarities is very dependent

on the ability of the system (algorithm) to determine

whether or not there is a semantic relationship

between two words or terms. In this study we see the

problem of paraphrasing in two forms, say A and B,

it is viewed as semantic quantification, the

relationship between two texts, for example, to what

extent text A has the same meaning as text B

(paraphrase relationship) or the extent of text A part

of semantic text B (entailment relationship).

Considering this fact, the formulation of the problem

from this study is It is difficult to build a perfect

paraphrase detection system that can detect text by

considering the physics of the two texts.

This study focuses on the use of semantic

network, which is used as a tool to represent

knowledge. This study focuses on the use of

semantic networks which are used as a tool to

represent paraphrase knowledge as a tool to improve

the performance of paraphrase detection systems.

Then the results of the vector space model with the

semantic network are expected to produce a

paraphrase detection system that is better than the

existing one.

172

Nurwati, ., Santoso, Y. and Adiyarta, K.

Paraphrase Detection based on Vector Space Model: A Study of Utilization of Semantic Network for Improving Information.

DOI: 10.5220/0008931301720176

In Proceedings of the 1st International Conference on IT, Communication and Technology for Better Life (ICT4BL 2019), pages 172-176

ISBN: 978-989-758-429-9

2. METHODOLOGY

In this detection, paper paraphrase uses a vector

space model-based method. The initial stage

describes the initial work in terms of conceptual

aspects. This includes conducting a literature review

related to research fields such as the definition of a

text similarity detection system and Semantic

Network. The initial stage describes the initial work

in terms of conceptual aspects. This includes

conducting a literature review related to research

fields such as the definition of a text similarity

detection system and Semantic Network. Some of

the processes that go through are the tokenization

process (this process has gone through a case-

folding process), then the filtering process is

continued. The process is continued with the

stemming process to get the basic syllables using the

Porter algorithm. The next step is the VSM (Vector

Space Model) Construction process then compares

the similarity of meaning (semantic measurement).

The general architecture of the proposed method is

shown in Figure 1.

The data tested is the Microsoft Research (MSR)

paraphrase data set taken in Dolan et al. 2004 was

quoted from (Daniel Ramage, Anna N. Rafferty

2009). Microsoft Paraphrase (MSR) paraphrase is a

collection of 5801 pairs of sentences collected

automatically from newswire over 18 months. Each

pair is written by two judges with their binary / not

one of the sentence sentences that are valid from the

other. Annotator is asked for the value of each

sentence pair quite accordingly. The Interannotator

Agreement is 83%. However, 67% of the couples

succeeded in paraphrasing, as well as the literature

did not reflect the scarcity of paraphrasing. Data sets

are present in pre-split to 4076 pair training and

1725 test pairs. Mihalcea et al. 2006 quoted from

(Daniel Ramage, Anna N. Rafferty 2009) give

results for several steps that underlie lexical-

semantic linkages. This is further divided into

action-based actions (using Latent Semantic

Analysis based on Landauer et al., 1998 quoted from

(Daniel Ramage, Anna N. Rafferty 2009) and

exchange of knowledge and knowledge-based

resources driven by WordNet. In this unsupervised

experimental setting, we consider using only the

threshold similarity values of our system and from

the Mihalcea algorithm to determine paraphrasing or

non-paraphrase considerations. For consistency with

research we

use a threshold of 0.5 to measure whether the

pair of sentences tested have the same meaning or

not.

prototype development

initial research (literature study) and

hypothesis development

document input

tokenisation

process

filtering process

stemming process

vector space

model process

system simulation and evaluation

the results of the Vector space

model process are tested again

using Confusion matrix

analysis of results

draw a conclusion

Figure 1 The general architecture of the proposed method

3. RESULTS

a. Experiment Standard Approach

Pre-processing

Tokenizing or tokenization is cutting each word in a

sentence or parsing by using a space as a delimiter

that will produce a word token. In the tokenization

stage, it also removes punctuation or special

characters (Bania Amburika, Yulison Herry

Chrisnanto 2016).

Figure 2 to display the process.

Paraphrase Detection based on Vector Space Model: A Study of Utilization of Semantic Network for Improving Information

173

Figure 2 Tokenisation process

Stemming Process

Stemming is returning words obtained from the

filtering results to its basic form, eliminating the

initial affixes (prefix) and final affixes (suffix) so

that the necessary words (Bania Amburika, Yulison

Herry Chrisnanto 2016) are obtained. For this stage

of stemming, we use the stemming of the Porter

algorithm. In figure 4 to show the process.

Figure 4 Stemming Process

b. TFIDF

calculations and similarity calculations between

documents using a vector space model. TF is a

simple weighting where it is crucial whether or not

a word is assumed to be proportional to the

number of occurrences of the word in the

document, while IDF is a weighting that measures

how important a word is in a document when

viewed globally in all documents (Bania

Amburika, Yulison Herry Chrisnanto 2016). After

going through the TFIDF calculation, proceed with

the vector space model process in figure 5. Similar

results cosine similarity use vector space model

0.766.

ICT4BL 2019 - International Conference on IT, Communication and Technology for Better Life

174

Figure 5 Similar results with the VSM formula

c. Similarity Results with the Semantic Network

method

The results of similarities use the Vector Space

Model with the Semantic Network method in

figure 6. Similar cosine similarity uses semantic

network method 0.766

Figure 6 Similar results with the Semantic Network method

d. Confusion Matrix table Next is the confusion matrix table for the similarity

of sentences using Semantic Network figure 7.

Figure 7 confusion matrix table for the similarity of sentences using a semantic network

Paraphrase Detection based on Vector Space Model: A Study of Utilization of Semantic Network for Improving Information

175

4. CONCLUSIONS AND

DISCUSSION

The results and analysis of word similarity by

comparing the Vector Space Model method with

Semantic Network in table 1.

Table 1 comparing the Vector Space Model method with

Semantic Network

Method Cosine Similarity

Vector Space Model 0,766

Semantic Network 0,766

From the table above, the Vector Space Model

and Semantic Network methods have the same

value. This is because the words in the Microsoft

Research Paraphrase Corpus (MSRP) data as

standard datasets are not all available in the Prolog

dataset (datasets containing words that have similar

meanings). In the prologue data set, if there is a

word id, there is a meaningful code in the MSRP

dataset, but it turns out that many do not have the

code. If there is no meaning code in the prologue

dataset (the similarity of the meaning of the word),

there is no word found. Results and analysis of

accurate measurement of precision and recall Vector

Space Models with Semantic Network at table 2.

Table 2. Results and analysis of accurate measurement of

precision and recall Vector Space Models with Semantic

Network

Method Precision Recall Accuracy

Vector Space

Model

0,71 0,90 0,67

Semantic

Network

0,71 0,91 0,67

The value of precision or measurement value of

the quality of how useful the search system is in

both the Vector Space Model and Semantic Network

measurements is equally worth 0.71. This means that

both in the Vector Space Model and Semantic

Network method the ability to recall information is

useful because it is 71%.

The recall value of the quality of how complete

the search system displays the relevant results in

either the Vector Space Model is 0.90, and the

Semantic Network is 0.91. This means that both in

the Vector Space Model and Semantic Network

method the ability to recalculate the acquisition

value is challenging to measure because the number

of all relevant documents in the database is huge,

which is around 90%.

Accuracy values are defined as the level of

closeness between predictive values , and the actual

value of Space Space Model and Semantic Network

are both worth 0.67 or 67%.

REFERENCES

Bania Amburika, Yulison Herry Chrisnanto, W.U., 2016.

Teknik Vector Space Model (VSM) Dalam Penentuan

Penanganan Dampak Game Online Pada Anak.

publikasiilmiah.unwahas.ac.id. Available at:

https://www.google.com/url?sa=t&rct=j&q=&esrc=s&

source=web&cd=1&cad=rja&uact=8&ved=2ahUKEw

juyNeFyf7gAhXu4nMBHbk0B-

wQFjAAegQIBBAC&url=https://publikasiilmiah.unw

ahas.ac.id/index.php/PROSIDING_SNST_FT/article/v

iewFile/1512/1595&usg=AOvVaw0.

Daniel Ramage, Anna N. Rafferty, and C.D.M., 2009.

Random Walks for Text Semantic Similarity.

Proceeding TextGraphs-4 Proceedings of the 2009

Workshop on Graph-based Methods for Natural

Language Processing, pp.23–31. Available at:

https://dl.acm.org/citation.cfm?id=1708131.

Isa, T.M. et al., 2014. Mengukur Tingkat Kesamaan

Paragraf Menggunakan Vector Space Model untuk

Mendeteksi Plagiarisme. Available at:

https://rp2u.unsyiah.ac.idindex.php/welcome/prosesD

ownload/588/5.

ICT4BL 2019 - International Conference on IT, Communication and Technology for Better Life

176