A Text Similarity-based Process for Extracting JSON Conceptual

Schemas

Fhabiana T. Machado, Deise Saccol, Eduardo Piveta, Renata Padilha and Ezequiel Ribeiro

Federal University of Santa Maria, Santa Maria, Brazil

Keywords:

JSON, Schema Extraction, Information Integration.

Abstract:

NoSQL (Not Only SQL) document-oriented databases stand out because of the need for scalability. This

storage model promises ﬂexibility in documents, using ﬁles and data sources in JSON (JavaScript Object

Notation) format. It also allows documents within the same collection to have different ﬁelds. Such differences

occur in database integration scenarios. When the user needs to access different datasources in an uniﬁed

way, it can be troublesome, as there is no standardization in the structures. In this sense, this work presents a

process for conceptual schema extraction in JSON datasets. Our proposal analyzes ﬁelds representing the same

information, but written differently. In the context of this work, differences in writing are related to treatment

of synonyms and character. To perform this analysis, techniques such as character-based and knowledge-based

similarity functions, as well as stemming are used. Therefore, we specify a process to extract the implicit

schema present in these data sources, applying different textual equivalence techniques in ﬁeld names. We

applied the process in an experiment from the scientiﬁc publications domain, correctly identifying 80% of

the equivalent terms. This process outputs an uniﬁed conceptual schema and the respective mappings for the

equivalent terms contributing to the schema integration’s problem.

1 INTRODUCTION

Due to the increasing volume of data generated

by various applications, NoSQL document-oriented

databases models were created. Their main charac-

teristics are schemaless and there are no complex re-

lationships.

Despite the allowed schema ﬂexibility, it is a

misconception to state that a schema does not ex-

ist. When using an application to access a NoSQL

database, it is assumed that certain ﬁelds exist with a

certain meaning and type. In this sense, there is an

implicit database schema: a set of assumptions about

the data structure in the code that manipulates it.

This storage model allows, for example, that a

ﬁeld can be present in some documents and in oth-

ers not, or that there are ﬁelds with distinct names,

including between documents belonging to the same

collection. In this way, there may exist different ﬁelds

of the same domain that represent the same infor-

mation, as occurs in integration scenarios of JSON

datasets, as presented at Figure 1.

The example in Figure 1 points to some informa-

tion with the same meaning, but represented quite dif-

ferently: (1) scores, line 4A, and score, line 4B,

Figure 1: Motivating example.

present only one difference in the plural. This type

of change can occur with other sufﬁxes and language

preﬁxes; (2) class id, line 3A, and id class, line

3B. The difference is that their terms are written in

reverse order, that is, one with the word “id” at the

beginning and another at the end of the word; (3)

student, line 2A, and learner, line 2B, however,

have different spellings, i.e., with synonymous words.

Other works on schema extraction in JSON data

sources, described in Section 3, are not concerned

with different spellings for equivalent ﬁelds or not

produces a conceptual schema. This issue become

important once a document oriented NoSQL database

has a ﬂexible schema and could not have standardized

dataset.

Therefore, the purpose of this work is to extract

264

Machado, F., Saccol, D., Piveta, E., Padilha, R. and Ribeiro, E.

A Text Similarity-based Process for Extracting JSON Conceptual Schemas.

DOI: 10.5220/0010475102640271

In Proceedings of the 23rd International Conference on Enterprise Information Systems (ICEIS 2021) - Volume 1, pages 264-271

ISBN: 978-989-758-509-8

the implicit schema in JSON data sources, identify-

ing equivalences between ﬁelds that are written dif-

ferently but representing the same information. In the

context of this work, being equivalent covers linguis-

tic similarity like characters, word stems and semantic

approach like synonyms resulting a uniﬁed concep-

tual scheme and mappings. This work it is validated

through an experiment that has as input documents

from digital libraries belonging to the domain of sci-

entiﬁc publications. At the end, recall and precision

indexes are discussed.

This paper is organized as follows. Section 2 de-

scribes the background. Section 3 related works. Sec-

tion 4 deﬁnes the process for extracting conceptual

schema from JSON datasets. Section 5 speciﬁes an

experiment. Finally, Section 6 presents the conclu-

sions.

2 BACKGROUND

Schemas in NoSQL Datastores. The concepts of

entity, attribute and relationship have some speciﬁci-

ties: an entity can be an abstraction of the real world

(Varga et al., 2016), as well as correspond to a JSON

document or object. An attribute can be represented

by ﬁelds of mono or multivalued arrays. NoSQL

databases have no explicit relationships; they use a

nested entity model, inferring implicit relationships.

Conceptual representation for NoSQL. The

IDEF1X (Integrated DEFinition for Information

Modelling) notation with adaptations characterize the

modeling suitable for NoSQL models (Benson, 2014)

(Jovanovic and Benson, 2013). This is done through

rules for representation. Its model is distinguished by

the fact that entities derive from a root.

Figure 2: Example of notation for aggregate-oriented data

models.

First item in Figure 2 points to a built-in aggregate

of one-to-many 1 : M relationship. The second item

illustrate cardinality relationships 0 : 1. The “REFI”

operator is positioned in the entity where only one

reference value will be “copied”, that is, referenced in

the other entity, thus generating a kind of connection

between the entities.

Techniques for Identiﬁcation of Textual Equiva-

lence. The following measures of text similarity cal-

culation are quite distinct, and, in general, calculate a

score in the range {0,1} for a pair of values.

String-based similarity measures operate on se-

quences and character composition, comparing the

distance between two texts. The Levenshtein mea-

sure (Levenshtein, 1966) was chosen, as it is widely

used. The process, however, can be used with other

character-based similarity measures.

Knowledge-based measure is based on the simi-

larity degree according to the information meaning in

a semantic network. One of the most popular tools in

this regard is Wordnet (Miller, 1995). In this work,

we use the Lin (Lin, 1998) measure, which applies

the information content of two nodes using the LCS

(Lowest Common Subsumer).

Although stemming is not a measure of similarity,

the Porter Stemming Algorithm (Porter, 2006) is ap-

plied through the equivalence of their stem indicating

equivalence or not.

3 RELATED WORK

Related works on schema extraction in JSON data

sources are concerned with version inference (Ruiz

et al., 2015), extraction of a schema with cardinalities

in a proper representation format (Klettke et al., 2015)

and a unique semantic approach (Kettouch et al.,

2017). However, they do not address the issue of dif-

ferent spellings for equivalent ﬁelds. This issue be-

come important in integration scenarios, scheme ﬂex-

ibility or for lack of standardization. In (Blaselbauer

and Josko, 2020) work, a linguistic approach is used

but generates only similarity graphics.

The work methodology uses a four-step process

that starts with JSON documents. Searches for ﬁelds

that have different spelling, but represent the same

information in the schema. This is accomplished

through techniques that identify equivalence in ﬁeld

names, generating as output a uniﬁed conceptual

representation of the implicit schema present in the

database.

4 A PROCESS FOR SCHEMA

EXTRACTION

In order to access JSON documents in a uniﬁed way,

we propose a process to schema extraction analyzing

textual similarity of its ﬁelds. The process is appli-

cable to documents from the same domain to identify

A Text Similarity-based Process for Extracting JSON Conceptual Schemas

265

ﬁelds representing the same information but written

differently.

As input, the process receives documents stored in

NoSQL in JSON format. As output, it generates the

conceptual representation of an uniﬁed schema.

Some deﬁnitions are given as follow:

Deﬁnition 1. Structure. In a json format Json :=

[key

: value

,...], structure refers to the ﬁelds of a

JSON document where Struc := [key

,key

,...], in the

sense of differentiating what is not a value.

Deﬁnition 2. Schema. Schema has a more compre-

hensive sense than structure, as it is usually related to

entities, relationships and attributes.

Deﬁnition 3. Delimiters. A set of JSON grammar

symbols that mark the beginning and the end of an

object or array, this is Delimiter := ([,], {,}).

Deﬁnition 3. Block. A block is a set of attributes.

The extraction process process is divided into four

main steps.

The pre-processing in Section 4.1, stage initially

aim at preparing the ﬁles, eliminating the data, keep-

ing only the ﬁelds. After pre-processing, the word

set receives similarity techniques: knowledge based

string, character based string and stemming. Its aim

is identifying equivalences in ﬁelds that represent the

same information but are written differently. This is

the similarity analysis step, Section 4.2.

After the similarity analysis, the next step is the

identiﬁcation of equivalences, Section 4.3. This

phase is responsible for analyzing the results of the

resulting measures, testing and inferring the equiva-

lence between the terms. Finally, a visual representa-

tion of the conceptual schema is generated along with

a table representing the mappings of the terms that

have been consolidated. This is the step of structure

representation in Section 4.4.

4.1 Pre-processing

The pre-processing step, shown in Figure 3, prepares

the documents for the following steps. All the ﬁles

in the collection are traversed by joining the ﬁelds in

a single ﬁle. The repeated terms are eliminated by

keeping only the distinct ﬁelds. This process aims to

decrease the number of comparisons, improving efﬁ-

ciency.

The extract ﬁelds activity is divided into two

steps. The ﬁrst one separates only the ﬁelds, that

is, the [key

,key

,...] portion of the Json := [key

value

,...]. The second one merges the ﬁelds into a

single ﬁle. The source document is stored in a list of

references to enable future mappings. The following

step is described:

Figure 3: Pre-processing step.

Input. An existing JSON document collection be-

longing to the same domain. Each document must

be validated by the grammar.

Extract Fields. This activity has the purpose of

traversing the documents by separating only the dis-

tinct ﬁelds, merging into a single ﬁle and saving a list

of references.

The extract ﬁelds activity is detailed in other sub-

activities:

Separate Fields from the Data. This activity pro-

cesses the documents keeping only the ﬁeld names

and the delimiter symbols doc

[key|delimiters]. This

is done for each document, generating a unique text

ﬁle. At this stage no comparison is made.

Merge Structure. In this step, the intention is to

avoid element redundancy [key

,key

,..], to re-

duce the number of entries for analysis of textual sim-

ilarity. Thus, it performs tests and comparisons to

merge and to maintain only the distinct ﬁelds. In this

process, the source reference of the different ﬁelds is

saved value

:= [doc

,doc

Output. The ﬁrst output artifact is a text-format

ﬁle containing a single document representing the

collection called the general structural document

Out

:= doc

[key|delimiters],doc

[key|delimiters],...

This should have, besides the delimiters that will as-

sist in the future steps, the ﬁelds with distinct names.

The hierarchy remains the same.

The second artifact is a list that contains the term

and the reference to which original documents it con-

tains Out

:= distinct

value

[doc

,doc

],...

4.2 Similarity Analysis

The purpose of the similarity analysis step, indicated

in Figure 4 is to apply different techniques of text

similarity analysis to identify words that represent the

same information, but have different spellings.

There are applied three different techniques to

each pair of words that compose the general structural

document.

Input. General structural document.

Similarity Analysis. In this activity, the input ﬁle is

copied and serves as the initial artifact for the applica-

ICEIS 2021 - 23rd International Conference on Enterprise Information Systems

266

Figure 4: Similarity analysis step.

tion of the three different techniques. These are exe-

cuted in parallel. The input passes through each tech-

nique and generates result matrices with values indi-

cating the similarity. The comparisons are made in

the form all with all, for earch pair [key

,key

] ∈ Out

assuming that all can be equivalent.

The similarity analysis is detailed in other sub-

activities:

Analyze stem of the word. This sub-activity refers

to the technique of Stemming, that is, stem extrac-

tor. It is executed using the Porter Algorithm (Porter,

2006). By applying rules to remove sufﬁxes and pre-

ﬁxes from the English language, the result is a radical

one. This is not necessarily a valid word and may

have no meaning.

In the case of the stemming technique, the stem

of the words stem key

and stem key

are compared

to deﬁne equivalence. If it is identical, the result will

be 1, as shown by Equation 1, otherwise it will be

0. Thus, in this technique, the resulting values will

always be integers 0 or 1.

Stem

sim

: (stem key

== stem key

) −→ 1 (1)

Analyze Character-based Similarity. It consists of ap-

plying some similarity technique. The Levenshtein

(Levenshtein, 1966) function was chosen because it

was applied to short texts, through the minimum num-

ber of operations required to transform one string into

another and is represented by Equation 2.

Lev

sim

: (key

,key

) −→ {0,1} (2)

Analyze Knowledge-based Similarity. It targets the

semantic focus. The implementation of the Wordnet

Similarity for Java is used applying the semantic mea-

sure Lin (Lin, 1998) that is based on information con-

tent and represented by Equation 3.

Lin

sim

: (key

,key

) −→ {0,1} (3)

In this case, the word indicated by key

must be valid

in Wordnet, otherwise the result will be assigned a

zero value.

Output. Three result arrays are generated with inte-

ger values in the range from zero to one. Each ar-

ray contains information about the result of a given

similarity function for each pair of words. The cor-

responding output Out

:= M

stem

lev

lin

is repre-

sented in an array whose main diagonal will always

be one. This also has the characteristic of being sym-

metrical both below and above the main diagonal.

After the end of the similarity analysis step, the

three generated arrays follow to identify equivalences.

4.3 Identiﬁcation of Equivalence

The step of identifying equivalences, shown in Fig-

ure 5, is decisive in the schema uniﬁcation process.

Figure 5: Identiﬁcation of equivalence step.

Its purpose is to perform tests and comparisons to de-

ﬁne the equivalence between ﬁelds. It uses the values

resulting from each measure applied in the previous

step.

Input. Matrices of results of applied techniques.

Identify Equivalence. In this activity, calculations

and tests are performed from a threshold. Afterward,

these results are synthesized in a uniﬁed structure. In

this same process are stored the consolidated ﬁelds

and to which others correspond.

The identify equivalence activity is detailed in

other sub-activities:

Calculate Equivalence. This sub-activity repre-

sented by Algorithm 1, is performed for each pair

values. Tests, comparisons and calculations are per-

formed and generate a single matrix M

res

where

res

∈ {0,1}.

Algorithm 1: Calculate equivalence.

input : M

stem

lev

lin

output: M

res

1 for e

i j

∈ M

stem

and M

lev

and , M

lin

2 if e

stem

or e

lev

or e

lin

== 1 then e

res

= 1 ;

3 if e

stem

or e

lev

or e

lin

== 0 then e

res

= 0 ;

4 if e

stem

== 0 and e

lin

== 0 then

5 if e

lin

> T

then e

res

= 1 ;

6 end

7 if e avg

res

> T

then e

res

= 1 ;

8 end

This sub-activity is based on the following premises:

• When a word is not valid according the the gram-

mar, it will not have a value in the radical and syn-

onym measures, that is, Stem

sim

: (a, b) = 0 and

A Text Similarity-based Process for Extracting JSON Conceptual Schemas

267

Lin

sim

: (a,b) = 0. In this case, the threshold T

deﬁned to the character measure.

• When the three measures have values in the

{0,1} interval, the equivalence is given through

a weighted average Avg with threshold T

where

weights are an arbitrary choice.

• For the deﬁnition of the threshold, the following

test was performed: given a set of 14 word pairs,

where 10 are equivalent and 4 are not, the recall

and precision indices were calculated for combi-

nations of the thresholds T

and T

, according to

Table 1.

Table 1: Threshold deﬁnition tests.

Recall Precision

0,75 0,50 0,636 0,700

0,70 0,50 0,909 0,714

0,70 0,45 0,909 0,667

0,60 0,50 1,18 0,591

Thus the deﬁned threshold were T

= 0,70 and

= 0,50.

• When the weighted average is applied, it is rep-

resented by the Equation 4. If Avg ≥ T

then its

considered equivalent.

Avg = (1 ∗ Stem

sim

+ 2 ∗ Lev

sim

+ 3 ∗ Lin

sim

)/6 (4)

Consolidate Structure. This sub-activity aims to gen-

erate a single listing of ﬁelds unifying those that were

considered equivalent in the previous step.

It has as input a M

res

analyzing each pair of

words (key

,key

) corresponding to the element e

res

∈

{0,1}. When e

res

= 1 the key

is kept and the map-

ping is saved. The element to be maintained is chosen

arbitrarily, being considered the ﬁrst occurrence.

This sub-activity generates a second list of refer-

ences with the terms and their equivalents Out

key

⇐⇒ key

Rebuild Structure. This sub-activity rearranges the

consolidated terms into a single document, called uni-

ﬁed structure, using Out

. This sub-activity is based

on the following premises:

• The uniﬁed structure is built from a collection

document considered base.

• The largest document belonging to the entry col-

lection is chosen as the basis, as it contains the

largest number of ﬁelds.

• Base document delimiters are kept/added in the

uniﬁed structure and ﬁeld values are ignored.

From the base document, each term is analyzed and

replaced by its equivalent if necessary. Terms that

have been consolidated but are not present in the

base document are added to the end of the structure.

Its generates the uniﬁed structure, that is, Out

uniq doc[keys, delimiters].

Output. The activity identify equivalence consists of

more extensive and represents the core of the work.

Two outputs are generated: Out

- a second list of

references containing equivalences between ﬁelds and

Out

- a uniﬁed structure.

After this step, it is necessary to generate a visual

conceptual representation, as well as to organize the

mappings.

4.4 Structure Representation

The structure representation step, shown in Figure 6,

aims to generate a visual notation and the mappings.

The ﬁrst is accomplished through the uniﬁed struc-

ture by applying conversion rules. The second gener-

ates the mappings of the consolidated terms to their

equivalents and their source documents.

Figure 6: Structure representation step.

Input. Uniﬁed structure, that is Out

and references

lists corresponding to the Out

and Out

Generate Representation. This activity produces the

conceptual schema and the mappings of the consoli-

dated terms to the equivalents identiﬁed in the pro-

cess. The generate representation activity is detailed

in other sub-activities:

Generate Mappings. This sub-activity aims to con-

sult both lists of references to infer new equivalences

between the ﬁelds, generating a ﬁnal table with the

consolidated term path represented by the JSONPath

notation (Goessner, 2007), the other matching terms

and occurrences in documents.

This activity is based on the following premises:

• The consolidated term path column is generated

from the uniﬁed structure.

• The matching terms column is generated based on

Out

and are checked new equivalences like A =

B,B = C ⇒ A = C

• The occurrences in documents column is gener-

ated based on Out

Adapt to Aggregate Notation. This sub-activity create

the representation through IDEF1X notation adapted

ICEIS 2021 - 23rd International Conference on Enterprise Information Systems

268

to the NoSQL model. It applies the deﬁned rules of

conversion, described in Section 4.4.1, in a text ﬁle

that corresponds to the uniﬁed structure.

An algorithm, represented by Algorithm 2, similar

to a parser is applied whose delimiters indicate the

transformation into a class or attribute, for example.

Algorithm 2: Adapt to aggregate notation.

input : Uniq doc[key,delimiters] and Rules

output: Schema

1 for event ∈ U niq doc do

2 switch event do

3 case { do R1 ⇒ new class;

4 case key

do R2 ⇒ attribute;

5 case [ do R3 ⇒ new class with one

attribute;

6 case [{ do R4 ⇒ new class;

7 end

8 end

This sub-activity is based on the following premises:

• Some rules have been deﬁned for converting the

uniﬁed structure into a conceptual scheme.

• The ﬁrst object in uniﬁed structure is considered

root entity.

• The delimiters assist in the assembling the con-

ceptual scheme and also indicate the type of rela-

tionship.

4.4.1 Rules

This new artifact, also represented by Algorithm 2,

presents some rules deﬁned during the work, to gener-

ate the conceptual schema and a visual representation.

These are used in the identiﬁcation of the blocks in

objects or arrays that help in the deﬁnition of classes,

attributes and relationships.

The following are brieﬂy described:

• The R1 Rule is applied to generate a new class

with relationship 0 : 1. This rule occurs when

identifying ‘{’, i. e., the beginning of a JSON ob-

ject.

• The R2 Rule transforms ﬁelds into corresponding

attributes of a class.

• The R3 Rule is applied to generate a new class

with relationship O : M. This class contains only

a single attribute represented by att. Occurs when

was found ‘[’, i.e., an array with enumeration of

items inside is opened.

• The R4 Rule also generates a new class with rela-

tionship 0 : M. Occurs when identifying ‘{[’, i.e.,

the beginning of an object with many arrays.

Figure 7: Example of applying the rules.

To illustrate the application of the rules, the Figure 7

is displayed that demonstrates each case described.

Output. This activity ends the extraction process

generating the two ﬁnal artifacts: the mappings and

the uniﬁed conceptual schema.

5 RESULTS

This section presents the extraction process in an ex-

periment and the results found. The application do-

main is related to scientiﬁc publications; entries are

exported references from academic libraries ﬁles in

JSON format. The execution of this extraction pro-

cess treats ﬁelds as a whole.

Evaluation. Precision and recall measures are used.

The recall is the proportion of the total of similar ex-

isting pair that appears in the ﬁnal result. On the other

hand, precision indicates the proportion of pairs of

values correctly identiﬁed as similar that appear in the

result.

5.1 Execution of the Extraction Process

The process consists of executing the sequence

of sub-activities presented in the four steps: pre-

processing, similarity analysis, equivalence identiﬁ-

cation and structure representation.

A implementation was developed in Java language

with libraries such as Simmetrics-core and Wordnet.

The project, executable and test ﬁles are available on

GitHub

. It accept as input, the folder containing the

JSON documents and the thresholds deﬁnition. Out-

puts a list of references, that is Out

and a matrix re-

sults M

res

. The process is terminated manually and

produces a visual representation.

The extraction process is based on the following

premises:

• It is considered that the ﬁelds in general mode can

be equivalent to any other, regardless of the hier-

archy level that they are and that represent infor-

mation of the same context.

https://github.com/ftmachado/schema-similarity

A Text Similarity-based Process for Extracting JSON Conceptual Schemas

269

Figure 8: Conceptual schema extracted by the process.

• The data hierarchy in this case is not being con-

sidered, as it would be necessary to have a domain

expert to assist the matching.

This experiment was carried out with the extraction of

50 ﬁles from each library, namely: Bibsonomy, DBLP

and Pubmed. These inputs and the thresholds were

submitted to our implementation, that shows which

ﬁelds are considered equivalent.

In the pre-processing step, the ﬁelds are sepa-

rated and consolidated into a single text ﬁle, called the

general structural document. Also generate the list of

references that contains the ﬁelds and the reference of

which documents occur.

Thus, in the similarity analysis step, matrices are

generated for each analysis. This step results in three

matrices with the results of each ﬁeld pair textual sim-

ilarity technique applied.

The next step of the process is identify equiva-

lences. It calculates equivalences resulting an matrix

with zero or one values and a second list of references

containing terms that were considered equivalent.

Table 2: Output list of references 2.

types = type Publication = Issue Tag = label

count = number Hour = Minute User = user

author = authors author = Author title = Title

editor = editors volume = Volume year = Year

journal = Journal number = Issue

Table 2 is the output of list of references. The sub-

activities of consolidating structure and rebuilding the

structure are performed manually, where the source

document containing the largest number of ﬁelds is

chosen for base.

Thus, the structure representation phase gener-

ates the two ﬁnal artifacts of the extraction process:

the mappings and the conceptual schema.

Table 3: Mapping result example.

Consolidated term path Matching term Occurr

$..pubmedarticle. Doc

medlinecitation. Year Doc

datecreated.year Doc

The mappings, exempliﬁed in Table 3 present the con-

solidated term path, the matching term, and the occur-

rence in original dataset.

The conceptual schema, shown in Figure 8, is

genereated based on the deﬁned rules starting from

a root element identiﬁed by the ROOT label, in this

case, the term pubmedarticle.

In Figure 8, the entities at the right side of the

root term represent nested arrays with relationships

of type 1 : M, identiﬁed by EMBED, and 1 : 1 indicated

by REFI. Fields that are not found in the structure of

the chosen base document are considered as a sepa-

rate block, ‘entity1’, at the left of the root element.

ICEIS 2021 - 23rd International Conference on Enterprise Information Systems

270

5.2 Evaluation

In this experiment, 15 ﬁelds could be considered as

relevant, that is, that represent the same information.

A total of 14 was retrieved according to Table 2. The

unidentiﬁed terms would be Tags ⇒ tags.

The revocation and precision rates are shown in

Table 4 together with the number of terms retrieved,

terms relevant retrieved and terms relevant repre-

sented by Ret, Rel Ret and Rel respectively.

Table 4: Recall and precision values.

Ret Rel Ret Rel Recall Precision

14 12 15 80% 85,71%

Among the terms retrieved, two were identiﬁed in-

correctly: Publication ⇒ Issue and Hour ⇒ Minute.

This happened because the semantic metric found

high values for these cases. Some difﬁculties encoun-

tered in correctly identifying the equivalent terms are

due to particularities of the chosen techniques.

Some words are not valid in the language and

therefore can not be analyzed semantically, just as

they can not be radical in Porter’s algorithm. How-

ever, they will always be analyzed for the variation of

characters having an appropriate threshold for each

case. Thus, from the indexes found, the process pre-

cision is considered good.

6 CONCLUSION

The main contribution concerns the process for ex-

tracting conceptual schemas that has as input a collec-

tion of documents in JSON format. These ﬁles may

be stored in a NoSQL database or in the Web in gen-

eral.

In order to exploit the ﬂexibility of schemas, the

process aims to identify equivalences in the ﬁelds that

are written differently or at integration scenarios, ei-

ther for lack of standardization or for misunderstand-

ing, but that represent the same information. In this

way, it uses similarity techniques that cover simi-

lar spelling, synonyms and radical equivalence of the

word. The process is applied between documents and

within the document, generating relationships of type

1 : M or 0 : M, once in an NoSQL model these are

indicated about a entity nested in a root element.

The tests indicate that the process has more scope

as a greater number of variations, maintaining good

rates of revocation and precision. Inconsistencies oc-

curred in cases of words that even have the same

spelling, have different meanings, or questions of the

Wordnet library.

The extraction of a uniﬁed schema can also be

useful in future work to allow the submission of

queries about it, since a mapping indicates to which

other terms that consolidated term refers and points

the respective origin documents of the corresponding

terms. A future solution could be investigating the use

of algorithms to deal with homonyms.

With the growth in the volume of data and the

popularization of the data of the mono structured as

JSON, it is need to be concerned about schemes so

that it can develop applications that access them in a

coherent way. This proposal differs by exploring the

ﬂexibility of schemas, identifying equivalent ﬁelds in

terms of synonymous, word radical and character gen-

erating a uniﬁed schema.

REFERENCES

Benson, S. R. (2014). Polymorphic data modeling. Master’s

thesis, Georgia Southern University.

Blaselbauer, V. M. and Josko, J. M. B. (2020). Jsonglue: A

hybrid matcher for json schema matching. Proceed-

ings of the Brazilian Symposium on Databases.

Goessner, S. (2007). Jsonpath - xpath for json.

http://goessner.net/articles/JsonPath/. Acessed in

2016, November.

Jovanovic, V. and Benson, S. (2013). Aggregate data mod-

eling style. SAIS 2013, pages 70–75.

Kettouch, M. S., Luca, C., Hobbs, M., and Dascalu, S.

(2017). Using semantic similarity for schema match-

ing of semi-structured and linked data. In 2017 Inter-

net technologies and applications (ITA), pages 128–

133. IEEE.

Klettke, M., St

orl, U., Scherzinger, S., and Regensburg, O.

(2015). Schema extraction and structural outlier de-

tection for json-based nosql data stores. In BTW, vol-

ume 2105, pages 425–444.

Levenshtein, V. I. (1966). Binary codes capable of cor-

recting deletions, insertions and reversals. In Soviet

physics doklady, volume 10, page 707.

Lin, D. (1998). An information-theoretic deﬁnition of sim-

ilarity. In ICML, volume 98, pages 296–304. Citeseer.

Miller, G. A. (1995). Wordnet: a lexical database for en-

glish. Communications of the ACM, 38(11):39–41.

Porter, M. (2006). An algorithm for sufﬁx

stripping. Program: electronic library

and information systems, 40(3):211–218.

https://doi.org/10.1108/00330330610681286.

Ruiz, D. S., Morales, S. F., and Molina, J. G. (2015). Infer-

ring versioned schemas from nosql databases and its

applications. In International Conference on Concep-

tual Modeling, pages 467–480. Springer.

Varga, V., J

anosi-Rancz, K. T., and K

alm

an, B. (2016).

Conceptual design of document nosql database with

formal concept analysis. Acta Polytechnica Hungar-

ica, 13(2).

A Text Similarity-based Process for Extracting JSON Conceptual Schemas

271