analyze the relationship between the similarity of short text units, the Soft Cosine measure cannot be discarded and has great potential. The Soft Cosine measure is model-resistant, i.e., regardless of the input DL model and its dimensionality, it yields nearly identical results (at least on the CS and MSRP corpora).
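For concreteness, here is a minimal NumPy sketch of the Soft Cosine measure as defined by Charlet and Damnati (2017); the toy vocabulary and the term-similarity values in S are invented purely for illustration:

    import numpy as np

    def soft_cosine(a, b, S):
        # Soft Cosine generalizes cosine similarity with a term-term
        # similarity matrix S, so texts sharing no exact words can
        # still score as similar.
        num = a @ S @ b
        den = np.sqrt(a @ S @ a) * np.sqrt(b @ S @ b)
        return num / den

    # Toy vocabulary: ["play", "game", "weather"]; the entries s_ij
    # could come from word-embedding cosines of any DL model.
    S = np.array([[1.0, 0.7, 0.0],
                  [0.7, 1.0, 0.0],
                  [0.0, 0.0, 1.0]])
    a = np.array([1.0, 0.0, 0.0])  # bag-of-words for "play"
    b = np.array([0.0, 1.0, 0.0])  # bag-of-words for "game"
    print(soft_cosine(a, b, S))    # 0.7, where plain cosine gives 0.0

Because the embedding model enters only through the entries of S, the measure is largely insulated from the choice of model, which is consistent with the model-resistance observed above.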
In models that create embeddings for individual words but not for whole texts, vector representations of texts have to be obtained as a linear combination of the word vectors. Averaging the word vectors proved to be better than summing them. Similarly, normalizing the resulting document vectors gives better results.
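A minimal sketch of this composition step follows; the word vectors here are hypothetical 3-dimensional examples standing in for the output of any word-level DL model:

    import numpy as np

    def doc_vector(tokens, word_vectors):
        # Average the vectors of the known words (averaging worked
        # better than summing), then L2-normalize the result.
        vecs = [word_vectors[t] for t in tokens if t in word_vectors]
        v = np.mean(vecs, axis=0)
        return v / np.linalg.norm(v)

    # Hypothetical toy vectors; real models use hundreds of dimensions.
    wv = {"cats":  np.array([0.9, 0.1, 0.0]),
          "sleep": np.array([0.2, 0.8, 0.1])}
    print(doc_vector(["cats", "sleep"], wv))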
8 DISCUSSION
For further studies, a well-annotated paraphrase corpus is required. The authors of this paper will prepare such a corpus, containing 100 documents and their paraphrases, and make it publicly available. The two major existing corpora, MSRP and Webis, are partially mis-annotated, because part of the annotation was done through Amazon Mechanical Turk and was not done well; this can easily be verified by inspecting the official results (human evaluations) and comparing them with the evaluated text pairs. Furthermore, the authors plan to adapt the SemEval-2014 Task 3 Cross-Level Semantic Similarity corpus and carry out experiments on it, as well as on the P4PIN corpus pointed out by the reviewer.
The paper proposed one possible equation for calculating a near-optimal number of dimensions for DL models. But can it be improved? With new corpora, it will be possible to check it further and then modify it. It is certain that the dimensionality of a model depends on the size of the input corpus on which it is trained. We will repeat the experiments using the equation
Dim = log₂(NUW) ∙ log₂(ND)    (9)
where NUW is the number of unique words and ND is the number of documents in the corpus. The reasoning behind the equation is as follows: (a) all unique words must be encoded as unique binary records, which requires log₂(NUW) bits (the first part of the equation), and (b) each word can appear in many contexts, and the number of contexts grows with the size of the corpus (the second part of the equation).
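For illustration, a short computation with equation (9); the corpus sizes below are hypothetical, not taken from our experiments:

    import math

    def near_optimal_dim(nuw, nd):
        # Equation (9): log2(NUW) bits to uniquely encode each word,
        # scaled by log2(ND) to account for the contexts a word can
        # occupy across the corpus.
        return math.log2(nuw) * math.log2(nd)

    # Hypothetical corpus: 50,000 unique words, 10,000 documents.
    print(round(near_optimal_dim(50_000, 10_000)))  # ~207 dimensions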
Future experiments could be repeated with a larger observation window when training DL models; part-of-speech tags could also be used, with nouns and verbs being the most promising.
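As a sketch of how the observation window would be widened, using gensim's Word2Vec (gensim 4.x API; the tiny corpus and parameter values are illustrative only):

    from gensim.models import Word2Vec

    # Hypothetical tokenized corpus; window=10 widens the observation
    # window from gensim's default of 5.
    sentences = [["cats", "sleep", "on", "warm", "windowsills"],
                 ["dogs", "play", "in", "the", "park"]]
    model = Word2Vec(sentences, vector_size=200, window=10, min_count=1)
    print(model.wv["cats"].shape)  # (200,)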
Experiments should also be carried out with other DL models, such as Doc2Vec, GloVe, USE, ELMo, and BERT. Only two DL models were used in this article, because too much data would have made it difficult to draw conclusions, given that the emphasis of the article was on the impact of the various similarity/distance measures on the results.
ACKNOWLEDGEMENTS
This research was funded by the University of Rijeka
grant number uniri-drustv-18-38.
REFERENCES
Bojanowski, P., Grave, E., Joulin, A., & Mikolov, T.
(2016). Enriching Word Vectors with Subword
Information. Computing Research Repository,
arXiv:1607.04606. http://arxiv.org/abs/1607.04606
Burrows, S., Potthast, M., & Stein, B. (2013). Paraphrase
acquisition via crowdsourcing and machine learning.
ACM Transactions on Intelligent Systems and
Technology, 4(3), 1. https://doi.org/10/gbdd2k
Charlet, D., & Damnati, G. (2017). SimBow at SemEval-
2017 Task 3: Soft-Cosine Semantic Similarity between
Questions for Community Question Answering.
Proceedings of the 11th International Workshop on
Semantic Evaluation (SemEval-2017), 315–319.
https://doi.org/10/gjvjk5
Clough, P., & Stevenson, M. (2009). Creating a Corpus of
Plagiarised Academic Texts. Proceedings of the Corpus
Linguistics Conference, January 2009.
Dolan, B., Brockett, C., & Quirk, C. (2005). Microsoft
Research Paraphrase Corpus. Microsoft Research.
https://www.microsoft.com/en-ca/download/details.aspx?id=52398
Feng, F., Yang, Y., Cer, D., Arivazhagan, N., & Wang, W.
(2020). Language-agnostic BERT Sentence
Embedding. http://arxiv.org/abs/2007.01852
Harispe, S., Ranwez, S., Janaqi, S., & Montmain, J. (2015).
Semantic Similarity from Natural Language and
Ontology Analysis. Synthesis Lectures on Human
Language Technologies, 8(1), 1–254.
https://doi.org/10/gc3jtd
Harispe, S., Ranwez, S., Janaqi, S., & Montmain, J. (2016).
Semantic Measures for the Comparison of Units of
Language, Concepts or Instances from Text and
Knowledge Base Analysis. ArXiv:1310.1285 [Cs].
http://arxiv.org/abs/1310.1285
Jurgens, D., Pilehvar, M. T., & Navigli, R. (2014).
SemEval-2014 Task 3: Cross-Level Semantic
Similarity. SemEval@COLING, 17–26.
Luu, V.-T., Forestier, G., Weber, J., Bourgeois, P., Djelil,
F., & Muller, P.-A. (2020). A review of alignment
based similarity measures for web usage mining.