5.2 Evaluation
In this experiment, 15 fields could be considered relevant, that is, they represent the same information. A total of 14 were retrieved, according to Table 2. The equivalence Tags ⇒ tags was not identified.
The recall and precision rates are shown in Table 4, together with the number of terms retrieved, relevant terms retrieved and relevant terms, denoted by Ret, Rel Ret and Rel respectively.
Table 4: Recall and precision values.
Ret    Rel Ret    Rel    Recall    Precision
14     12         15     80%       85.71%
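The rates in Table 4 follow directly from the three counts: recall is Rel Ret divided by Rel, and precision is Rel Ret divided by Ret. As a quick check, here is a minimal Python sketch; the function and parameter names are illustrative and are not taken from the original implementation.

```python
def recall_precision(retrieved, relevant_retrieved, relevant):
    """Return (recall, precision) as fractions in [0, 1]."""
    recall = relevant_retrieved / relevant      # share of relevant terms that were found
    precision = relevant_retrieved / retrieved  # share of retrieved terms that are relevant
    return recall, precision


# Counts reported in Table 4: Ret = 14, Rel Ret = 12, Rel = 15.
recall, precision = recall_precision(retrieved=14, relevant_retrieved=12, relevant=15)
print(f"recall = {recall:.2%}, precision = {precision:.2%}")
# recall = 80.00%, precision = 85.71%
```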
Among the terms retrieved, two were identified incorrectly: Publication ⇒ Issue and Hour ⇒ Minute. This happened because the semantic metric returned high values for these pairs. Some of the difficulties encountered in correctly identifying equivalent terms are due to particularities of the chosen techniques.
Some field names are not valid words in the language and therefore cannot be analyzed semantically, nor can they be reduced to a stem by Porter’s algorithm. However, they are always analyzed for character variation, with an appropriate threshold for each case. Thus, given the rates obtained, the precision of the process is considered good.
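The character-variation fallback described above can be illustrated with an edit-distance comparison against a threshold. The following is only a sketch under stated assumptions: the normalization by the longer string and the threshold value of 0.8 are hypothetical choices, not values reported in this paper.

```python
def levenshtein(a, b):
    """Edit distance: minimum number of insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute (or keep when equal)
            ))
        previous = current
    return previous[-1]


def char_similarity(a, b):
    """Normalize the distance to [0, 1]; 1.0 means identical strings."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


THRESHOLD = 0.8  # hypothetical value, chosen only for illustration


def is_match(a, b, threshold=THRESHOLD):
    """True when the two field names are similar enough at character level."""
    return char_similarity(a, b) >= threshold


print(is_match("author", "authors"))  # True: one edit over seven characters
print(is_match("Hour", "Minute"))     # False: the spellings are too different
```

Because this comparison needs neither a dictionary nor a stemmer, it remains applicable to field names that are not valid words.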
6 CONCLUSION
The main contribution is a process for extracting conceptual schemas that takes as input a collection of documents in JSON format. These files may be stored in a NoSQL database or on the Web in general.
In order to exploit the flexibility of schemas, the process aims to identify equivalences between fields that represent the same information but are written differently, including in integration scenarios, whether for lack of standardization or through misunderstanding. To this end, it uses similarity techniques that cover similar spelling, synonyms and equivalence of word stems. The process is applied both between documents and within a document, generating relationships of type 1 : M or 0 : M, since in a NoSQL model these are indicated by an entity nested in a root element.
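To make the nested-entity idea concrete, the sketch below scans a few JSON documents and reports each field that holds a list of objects as a candidate relationship. The rule used to choose between 1 : M and 0 : M (minimum cardinality 1 only when every document populates the field) is an assumption made for this example, as are the function name and the sample documents.

```python
import json


def nested_relationships(documents):
    """Classify fields holding lists of objects as '1:M' or '0:M'."""
    counts = {}  # field name -> documents where the field is a non-empty list of objects
    for doc in documents:
        for field, value in doc.items():
            is_object_list = (
                isinstance(value, list)
                and len(value) > 0
                and all(isinstance(item, dict) for item in value)
            )
            if is_object_list:
                counts[field] = counts.get(field, 0) + 1
    # Assumption: the minimum cardinality is 1 only if every document populates the field.
    return {
        field: "1:M" if n == len(documents) else "0:M"
        for field, n in counts.items()
    }


# Hypothetical sample documents.
docs = [
    json.loads('{"title": "A", "authors": [{"name": "X"}], "comments": [{"text": "ok"}]}'),
    json.loads('{"title": "B", "authors": [{"name": "Y"}, {"name": "Z"}]}'),
]
print(nested_relationships(docs))
# {'authors': '1:M', 'comments': '0:M'}
```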
The tests indicate that the process covers a greater number of field-name variations while maintaining good rates of recall and precision. Inconsistencies occurred for words that have the same spelling but different meanings, and because of issues with the WordNet library.
The extraction of a unified schema can also be useful in future work to allow queries to be submitted against it, since a mapping indicates which other terms each consolidated term refers to and points to the origin documents of the corresponding terms. Future work could also investigate the use of algorithms to deal with homonyms.
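One possible shape for such a mapping is a dictionary from each consolidated term to the original terms and the documents where they occur. The sketch below is an illustration only: the structure, the function names and the sample file names are hypothetical and do not describe the implementation used in this work.

```python
from collections import defaultdict


def build_mapping(occurrences):
    """Build consolidated term -> original term -> set of origin documents.

    `occurrences` is an iterable of (consolidated, original, document_id) triples.
    """
    mapping = defaultdict(lambda: defaultdict(set))
    for consolidated, original, doc_id in occurrences:
        mapping[consolidated][original].add(doc_id)
    return mapping


def resolve(mapping, consolidated):
    """Return {original term: sorted document ids} for a consolidated term."""
    return {
        original: sorted(doc_ids)
        for original, doc_ids in mapping.get(consolidated, {}).items()
    }


# Hypothetical occurrences of two equivalent field names.
occurrences = [
    ("author", "author", "doc1.json"),
    ("author", "writer", "doc2.json"),
    ("author", "author", "doc3.json"),
]
mapping = build_mapping(occurrences)
print(resolve(mapping, "author"))
# {'author': ['doc1.json', 'doc3.json'], 'writer': ['doc2.json']}
```

A query on the consolidated term could then be rewritten once per original term and sent only to the documents listed for it.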
With the growth in the volume of data and the popularization of semi-structured data such as JSON, it is necessary to pay attention to schemas so that applications can be developed that access the data in a coherent way. This proposal differs by exploring the flexibility of schemas, identifying fields that are equivalent in terms of synonyms, word stems and characters, and generating a unified schema.
REFERENCES
Benson, S. R. (2014). Polymorphic data modeling. Master’s thesis, Georgia Southern University.
Blaselbauer, V. M. and Josko, J. M. B. (2020). JsonGlue: A hybrid matcher for JSON schema matching. Proceedings of the Brazilian Symposium on Databases.
Goessner, S. (2007). JSONPath - XPath for JSON. http://goessner.net/articles/JsonPath/. Accessed in November 2016.
Jovanovic, V. and Benson, S. (2013). Aggregate data modeling style. SAIS 2013, pages 70–75.
Kettouch, M. S., Luca, C., Hobbs, M., and Dascalu, S. (2017). Using semantic similarity for schema matching of semi-structured and linked data. In 2017 Internet Technologies and Applications (ITA), pages 128–133. IEEE.
Klettke, M., Störl, U., Scherzinger, S., and Regensburg, O. (2015). Schema extraction and structural outlier detection for JSON-based NoSQL data stores. In BTW, volume 2105, pages 425–444.
Levenshtein, V. I. (1966). Binary codes capable of correcting deletions, insertions and reversals. In Soviet Physics Doklady, volume 10, page 707.
Lin, D. (1998). An information-theoretic definition of similarity. In ICML, volume 98, pages 296–304. Citeseer.
Miller, G. A. (1995). WordNet: a lexical database for English. Communications of the ACM, 38(11):39–41.
Porter, M. (2006). An algorithm for suffix stripping. Program: electronic library and information systems, 40(3):211–218. https://doi.org/10.1108/00330330610681286.
Ruiz, D. S., Morales, S. F., and Molina, J. G. (2015). Inferring versioned schemas from NoSQL databases and its applications. In International Conference on Conceptual Modeling, pages 467–480. Springer.
Varga, V., Jánosi-Rancz, K. T., and Kálmán, B. (2016). Conceptual design of document NoSQL database with formal concept analysis. Acta Polytechnica Hungarica, 13(2).
A Text Similarity-based Process for Extracting JSON Conceptual Schemas