purpose). This will be used to evaluate the overall
accuracy of the approach. The overall classification
results on the MRPC test set are reported in
Table 2.
Table 1: Average similarity score for paraphrase and
non-paraphrase cases in the MRPC dataset.
Method               Paraphrase   Non-paraphrase
Our method           88%          46%
Quantification (2)   64%          33%
WebJaccard           57%          41%
Tf-Idf Cosine Sim    42%          24%
Method [5]           57%          36%
Table 2: Overall classification accuracy on the MRPC test
set.
Method               Accuracy rate
Our method           84%
Quantification (2)   64%
WebJaccard           53%
Tf-Idf Cosine Sim    58%
Method [5]           71%
The results in Table 1 and Table 2 attest to the
usefulness of the proposed approach, which combines
a Wikipedia-based measure, WordNet-based semantic
similarity, and a double-checking model applied to
the top extracted snippets of the queries in order to
infer an enhanced similarity measure. On the MRPC
test set, the proposed method reaches 84% accuracy,
against 71% for the best of the compared methods
(Table 2). Future work involves studying the algebraic
and asymptotic properties of the proposed measure, as
well as testing on alternative corpora. In particular,
expression (8) will require further refinement when
false negatives are dominant in the dataset.
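The two quantities reported in Table 1 and Table 2 can be illustrated with a minimal sketch. It is not the evaluation code used in the paper: the function names, the toy scores, and the decision threshold of 0.5 are all assumptions made for illustration, and the similarity scores are taken as already computed by some measure.

```python
from statistics import mean


def class_averages(scores, labels):
    """Mean similarity of paraphrase (label 1) and non-paraphrase (label 0) pairs."""
    paraphrase = [s for s, y in zip(scores, labels) if y == 1]
    non_paraphrase = [s for s, y in zip(scores, labels) if y == 0]
    return mean(paraphrase), mean(non_paraphrase)


def accuracy(scores, labels, threshold=0.5):
    """Fraction of pairs whose thresholded score matches the gold label.

    The threshold value is a hypothetical choice, not one taken from the paper.
    """
    predictions = [1 if s >= threshold else 0 for s in scores]
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)


# Made-up similarity scores and gold labels, for illustration only.
toy_scores = [0.9, 0.8, 0.4, 0.3, 0.6]
toy_labels = [1, 1, 0, 0, 0]

avg_paraphrase, avg_non_paraphrase = class_averages(toy_scores, toy_labels)
toy_accuracy = accuracy(toy_scores, toy_labels)
```

A wide gap between the two class averages suggests that a single threshold can separate the classes, which is why the average scores and the accuracy rate are reported together.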
5 CONCLUSION
This paper contributes to ongoing research on
efficient tools for paraphrase detection. It advocates
a web-based approach in which the snippets returned
by the search are analyzed using a WordNet-based
semantic measure and a Wikipedia-based normalized
distance measure. The proposal is designed to
accommodate a prudent-attitude-like reasoning. Tests
on the Microsoft Research Paraphrase Corpus have
shown good results relative to some state-of-the-art
approaches. Although the complexity of web search
outcomes is well documented, the proposal opens new
ways to exploit the timely availability of search
results by examining the similarity of the search
outcomes regardless of the accuracy of any single
search result.
REFERENCES
D. Bollegala, Y. Matsuo, and M. Ishizuka, 2007.
Measuring semantic similarity between words using
web search engines. In Proc. of WWW ’07,
pp. 757–766.
H. Chen, M. Lin, and Y. Wei, 2006. Novel association
measures using web search with double checking. In
Proc. of the COLING/ACL 2006, pp. 1009–1016.
R. Cilibrasi and P. Vitanyi, 2007. The Google similarity
distance. IEEE Transactions on Knowledge and Data
Engineering, vol. 19, no. 3, pp. 370–383.
R. Collobert and J. Weston, 2008. A unified architecture for
natural language processing: deep neural networks with
multitask learning. In ICML.
B. Dolan and C. Brockett, 2005. Automatically
constructing a corpus of sentential paraphrases. In The
3rd International Workshop on Paraphrasing
(IWP2005).
B. Dolan, C. Quirk, and C. Brockett, 2004. Unsupervised
Construction of Large Paraphrase Corpora: Exploiting
Massively Parallel News Sources. In COLING ’04:
Proceedings of the 20th International Conference on
Computational Linguistics, p. 350, Morristown, NJ,
USA. Association for Computational Linguistics.
C. Fellbaum, 1998. WordNet – An Electronic Lexical
Database, MIT Press.
A. Fernando and M. Stevenson, 2008. A semantic similarity
approach to paraphrase detection. In Proceedings of the
11th Annual Research Colloquium of the UK Special
Interest Group for Computational Linguistics.
A. Islam and D. Inkpen, 2007. Semantic similarity of short
texts. In Proceedings of the International Conference on
Recent Advances in Natural Language Processing
(RANLP 2007), Borovets, Bulgaria, pp. 291–297.
R. Mihalcea, C. Corley, and C. Strapparava, 2006.
Corpus-based and knowledge-based measures of text
semantic similarity. In Proceedings of the National
Conference on Artificial Intelligence (AAAI 2006),
Boston, Massachusetts, pp. 775–780.
M. Sahami and T. Heilman, 2006. A web-based kernel
function for measuring the similarity of short text
snippets. In Proc. of 15th International World Wide
Web Conference.
Z. Wu and M. Palmer, 1994. Verb semantics and lexical
selection. In 32nd Annual Meeting of the Association
for Computational Linguistics, pages 133–138, New
Mexico State University, Las Cruces, New Mexico.
Y. Zhang and J. Patrick, 2005. Paraphrase identification by
text canonicalization. In Proceedings of the
Australasian Language Technology Workshop 2005,
pages 160–166, Sydney, Australia, December.