reflect the supposed high relevance of these words in
the sentences, or both cases occur.
These poor results from BETO could be due to several factors acting together, although the main reason is likely that BETO is a pre-trained model: it has not been fine-tuned for any particular downstream task, and in particular not for text similarity calculation. In fact, the model used is "BETO-base," i.e., BETO in its base form. Addressing this would require specific training of BETO for the text similarity calculation task.
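As an illustration, a minimal fine-tuning sketch is shown below. It assumes the publicly available BETO checkpoint on Hugging Face (dccuchile/bert-base-spanish-wwm-cased), the sentence-transformers library, and a small, hypothetical set of Spanish sentence pairs labelled with similarity scores; it is not the configuration used in this work.

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses, models

# Wrap the pre-trained BETO checkpoint (assumed Hugging Face id) with a pooling layer.
bert = models.Transformer("dccuchile/bert-base-spanish-wwm-cased", max_seq_length=128)
pooling = models.Pooling(bert.get_word_embedding_dimension())
model = SentenceTransformer(modules=[bert, pooling])

# Hypothetical training pairs: (sentence A, sentence B, gold similarity in [0, 1]).
train_examples = [
    InputExample(texts=["El perro corre por el parque.",
                        "Un perro corre en el parque."], label=0.9),
    InputExample(texts=["El perro corre por el parque.",
                        "La bolsa cerró ayer a la baja."], label=0.1),
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)

# Cosine-similarity regression loss, a common choice for semantic textual similarity.
train_loss = losses.CosineSimilarityLoss(model)
model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1, warmup_steps=10)
```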
Moreover, if the same word appears in both sentences and is neither a stop-word nor a punctuation mark, its Word similarity (the objective evaluation of the word pair) will be equal to 1, and the pair will top the list of word pairs with the highest similarity, regardless of the model used.
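A minimal sketch of how such a rule can be expressed is given below; the helper names embed, cos_sim, stopwords and punctuation are hypothetical and stand in for whichever embedding model and linguistic resources are being compared.

```python
def word_pair_similarity(lemma_a, lemma_b, embed, cos_sim, stopwords, punctuation):
    """Similarity of a pair of lemmas, with an exact-match shortcut."""
    # Identical content lemmas are maximally similar by definition,
    # independently of the embedding model used.
    if lemma_a == lemma_b and lemma_a not in stopwords and lemma_a not in punctuation:
        return 1.0
    # Otherwise, fall back to the cosine similarity of the model's embeddings.
    return cos_sim(embed(lemma_a), embed(lemma_b))
```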
Finally, during the lemmatization process with spaCy, the word 'perturbadora' becomes 'perturbadoro', a non-existent word in Spanish (the correct form would be 'perturbador'). This does not pose a practical problem, since the similarity calculation of the method is performed on the lemmas of the words and not on the words themselves. Obtaining fully appropriate results would nevertheless require fine-tuning of the lemmatization process.
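A minimal sketch of one way to patch such cases is shown below. It assumes the es_core_news_sm spaCy pipeline and a hypothetical table of manual lemma corrections applied after spaCy; it is not the procedure used in this work.

```python
import spacy

# Assumes the small Spanish pipeline is installed:
#   python -m spacy download es_core_news_sm
nlp = spacy.load("es_core_news_sm")

# Hypothetical post-hoc corrections for lemmas that spaCy gets wrong.
LEMMA_FIXES = {"perturbadoro": "perturbador"}

def lemmatize(sentence):
    """Return the content lemmas of a sentence, with manual corrections applied."""
    doc = nlp(sentence)
    return [LEMMA_FIXES.get(tok.lemma_, tok.lemma_)
            for tok in doc if not tok.is_stop and not tok.is_punct]

print(lemmatize("Una noticia perturbadora"))  # e.g. ['noticia', 'perturbador']
```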
5 CONCLUSIONS
This work has highlighted the power of Natural
Language Processing in developing a method for
analyzing the explainability of similarity between
short texts in Spanish.
Although the method was originally intended primarily for the academic field, in view of the tests carried out and the results obtained, the proposed method for sentence similarity calculation has proven effective not only in this field but also in many other areas of Natural Language Processing.
A method has been proposed to analyze the explainability of similarity between short texts in Spanish, building on an evaluation of the existing technologies that enabled its development. From the comparison of four NLP models, we conclude that models trained for specific tasks return better results in those tasks than models trained on a corpus with a more general purpose. There is also evidence to suggest that the dimensionality of the embeddings may affect the quality of the results, with a directly proportional relationship between the number of dimensions and the results obtained.
In addition, the comparison of the quality of the results has shown that objective assessment alone is not sufficient: human inspection is necessary to identify the model that best supports the explainability of the similarity calculation.
Upon completing this research, new possibilities open up for future development of methods and systems to explain the similarity between short texts in Spanish: manual validation by experts to assess the quality of the results of the proposed method, and an expanded scope of the experiment (the experiment conducted in this study considered three specific pairs of sentences and four NLP models based on Google BERT, returning the top k=5 pairs of words with the highest similarity). Future work should include a larger number of sentence pairs, i.e., a more extensive corpus that covers a broader spectrum of the language; testing other NLP models, whether based on BERT or not, and even architectures not based on Transformers; and comparing the results with other similarities and distances, such as Jaro-Winkler and Levenshtein, as well as with alternative metrics and algorithms such as BLEU (Papineni et al., 2002) and ROUGE (Lin, 2004). However, it should be noted that such comparisons fall outside the scope of the present position paper.
Finally, the development of an interactive web application that allows the user to input two sentences and returns the explanation of their degree of similarity based on the most similar words would increase the corpus size, allow user feedback to be collected (e.g., through icons or rating buttons), and help democratize the value of Artificial Intelligence.
ACKNOWLEDGEMENTS
This research is supported by the UNIR project MLX-
PRECARM.
REFERENCES
Burrows, S., Gurevych, I., & Stein, B. (2015). The eras and
trends of automatic short answer grading. International
Journal of Artificial Intelligence in Education, 25(1),
60–117. https://doi.org/10.1007/s40593-014-0026-8
Lin, C.-Y. (2004). ROUGE: A package for automatic evaluation of summaries. Text Summarization Branches Out (ACL 2004 Workshop), 74–81.
Malkiel, I., Ginzburg, D., Barkan, O., Caciularu, A., Weill,
J., & Koenigstein, N. (2022). Interpreting BERT-based