WEB AUTHENTIC AND SIMILAR TEXTS DETECTION USING
AR DIGITAL SIGNATURE
Marios Poulos, Nikos Skiadopoulos and George Bokos
Laboratory of Information Technology, Department of Archives and Library Sciences, Ionian University, Corfu, Greece
Keywords: Data mining, AR model, Semantic web, Information retrieval.
Abstract: In this paper, we propose a new identification technique based on an AR model, with a complexity of O(n) in a web setting, with the aim of creating a unique serial number for texts and detecting authentic or similar texts. For this purpose, we used a 15th-order autoregressive (AR) model, and for the identification procedure we employed the cross-correlation algorithm. Empirical investigation showed that the proposed method may serve as an accurate method for identifying identical, similar, or conceptually different texts. This unique identification method for texts, in combination with SCI and DOI, may be the solution to many problems that the information society faces, such as plagiarism and clone detection, copyright-related issues, and tracking, as well as many facets of the education process, such as lesson planning and student evaluation. The advantages of the exported serial number are obvious, and we aim to highlight them while discussing its combination with DOI. Finally, this method may be used by the information services sector and the publishing industry for standard serial-number definition and identification, as a copyright management system, or both.
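As a minimal sketch of the pipeline outlined in the abstract, the following Python fragment encodes a text as a numeric signal, fits a 15th-order AR model by least squares, and compares two coefficient vectors through the peak of their normalised cross-correlation; the character-code encoding, the least-squares fit, and the use of the correlation peak as the score are our own illustrative assumptions, not the paper's exact implementation.

import numpy as np

def text_to_signal(text):
    # Assumption: encode the text as a numeric signal via character codes.
    return np.array([ord(c) for c in text], dtype=float)

def ar_coefficients(signal, order=15):
    # Fit an AR(order) model by least squares: x[t] is predicted from x[t-1]..x[t-order].
    signal = signal - signal.mean()
    X = np.column_stack([signal[order - i:len(signal) - i] for i in range(1, order + 1)])
    y = signal[order:]
    coeffs, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coeffs

def similarity(sig_a, sig_b):
    # Peak of the normalised cross-correlation between two coefficient vectors.
    a = (sig_a - sig_a.mean()) / (sig_a.std() * len(sig_a))
    b = (sig_b - sig_b.mean()) / sig_b.std()
    return float(np.max(np.correlate(a, b, mode="full")))

doc1 = "this is a sample text about autoregressive signatures"
doc2 = "this is a sample text about auto-regressive signatures"
print(similarity(ar_coefficients(text_to_signal(doc1)),
                 ar_coefficients(text_to_signal(doc2))))

In this sketch, the 15 AR coefficients play the role of the text's serial number, and a correlation peak close to 1 indicates identical or very similar texts.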
1 INTRODUCTION
A challenging issue arising from the enormous increase of data and the requirement of data integration from multiple sources is finding near-duplicate records efficiently. Near-duplicate records exhibit high similarity to each other; however, they are not bitwise identical. There are many causes for the existence of near-duplicate data: typographical errors; versioned, mirrored, or plagiarized documents; multiple representations of the same physical object; spam emails generated from the same template; etc. (Xiao et al., 2008). In recent years, many systems have been developed to solve these problems. Furthermore, because the Internet is highly dynamic, articles are often published and, after a short period, removed from their URL location. This phenomenon often leads to plagiarism practices. For this problem, Phelps and
Wilensky (2000) propose a less burdensome solution: computing a lexical signature for each document, i.e., a string of about five key identifying words in the document. However, while this idea seems quite practical, the calculation is complex, and the observed complexity is O(n²) (Klein & Nelson, 2008), where n is the number of compared characters. The behaviour of such an algorithm depends on the intention of the search. In further detail, algorithms weighted for Term Frequency (TF: “how often does this word appear in this document?”) were better at finding related pages, but the exact page would not always be in the top N results. Algorithms weighted for Inverse Document Frequency (IDF: “in how many documents does this word appear?”) were better at finding the exact page, but were susceptible to small frequency changes in the document, such as a corrected spelling (Klein & Nelson, 2008).
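The lexical-signature idea can be illustrated with a short Python sketch that picks the top terms by a TF-IDF weight; the tokenisation, the five-term signature length, and the toy corpus below are our own assumptions, not the implementation of Phelps and Wilensky or of Klein and Nelson.

import math
import re
from collections import Counter

def tokenize(text):
    # Lowercase word tokens; a simplistic stand-in for real preprocessing.
    return re.findall(r"[a-z]+", text.lower())

def lexical_signature(doc, corpus, k=5):
    # Select the k terms with the highest TF-IDF weight as the document's signature.
    tf = Counter(tokenize(doc))
    df = Counter(term for d in corpus for term in set(tokenize(d)))
    n_docs = len(corpus)
    weights = {t: tf[t] * math.log((1 + n_docs) / (1 + df[t])) for t in tf}
    return sorted(weights, key=weights.get, reverse=True)[:k]

corpus = ["the cat sat on the mat",
          "near duplicate detection in large web collections",
          "detection of similar texts with autoregressive signatures"]
print(lexical_signature(corpus[1], corpus))

Weighting only by TF would favour related pages, while weighting only by IDF would favour the exact page, which is the trade-off noted above.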
A common statistical approach is to construct text vectors based on values describing the text, such as word frequencies or compression metrics (Lukashenko et al., 2007). Based on statistical measures, each document can be described with so-called fingerprints, where n-grams are hashed and a subset of the hashes is selected as the fingerprints (Lukashenko et al., 2007). In brief, the above techniques can be roughly grouped into two categories: attribute-counting systems and structure-metric systems (Chen et al., 2004).
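A brief Python sketch of the n-gram fingerprinting idea follows; the 5-character n-grams, the MD5 hash, the modulo-based selection rule, and the Jaccard comparison are illustrative assumptions rather than the specific scheme of Lukashenko et al.

import hashlib

def char_ngrams(text, n=5):
    # Overlapping character n-grams of the whitespace-normalised, lowercased text.
    text = " ".join(text.lower().split())
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def fingerprints(text, n=5, modulus=4):
    # Hash every n-gram and keep the hashes equal to 0 mod `modulus` as the fingerprint set.
    hashes = (int(hashlib.md5(g.encode()).hexdigest(), 16) for g in char_ngrams(text, n))
    return {h for h in hashes if h % modulus == 0}

def resemblance(a, b):
    # Jaccard overlap of the two fingerprint sets as a crude pairwise similarity score.
    fa, fb = fingerprints(a), fingerprints(b)
    return len(fa & fb) / len(fa | fb) if fa | fb else 0.0

print(resemblance("near duplicate records exhibit high similarity",
                  "near-duplicate records exhibit high similarity to each other"))

Comparing every pair of documents with such a score is what makes these approaches expensive, as discussed next.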
However, such approaches cause many problems, because the pairwise comparison increases the