Querying Heterogeneous Document Stores

Hamdi Ben Hamadou, Faiza Ghozzi, André Péninou and Olivier Teste

Université de Toulouse, UT3, IRIT (CNRS/UMR 5505), Toulouse, France

Université de Sfax, ISIMS-MIRACL, Sfax, Tunisie

Université de Toulouse, UT2J, IRIT (CNRS/UMR 5505), Toulouse, France

Keywords:

Information Systems, Document Stores, Structural Heterogeneity, Schema-independent Querying.

Abstract:

NoSQL document stores offer support to store documents described using various structures. Hence, the

user has to formulate queries using the possible representations of the desired information from different

schemas. In this paper, we propose a novel approach that enables querying operators over a collection of

documents with structural heterogeneity. Our work introduces an automatic query rewriting mechanism based

on combinations of elementary operators: project, restrict and aggregate. We generate a custom dictionary that

tracks all representations for attributes used in the documents. Finally, we discuss the results of our approach

with a series of experiments.

1 INTRODUCTION

Document-oriented stores are becoming very popular because of their simple and efficient ways to manage large semi-structured data sets. Each record, usually formatted in JSON, is stored inside a document in a schema-less fashion. A collection thus groups a heterogeneous set of documents for which no common schema is required. Although this flexibility is very powerful at loading time, the resulting heterogeneity presents a serious issue during the querying phase. Indeed, in order to obtain relevant results, users have to be aware of all existing schemas while formulating their queries and have to combine all the schemas in complex queries. Three classes of heterogeneity can be considered in the context of document stores (Shvaiko and Euzenat, 2005):

• Structural heterogeneity points to the different structures that exist in documents. The main issue is the existence of several paths to access the same attribute; e.g., the position of an attribute denoted “name” may not be the same in two documents (nested, flat).

• Syntactic heterogeneity exists when different attributes refer to the same concept; e.g., the attribute “name” may be denoted “name,” “names” or “first name” in different documents.

• Semantic heterogeneity exists when the same attribute refers to different concepts; e.g., the attribute “name” may designate a “person name”, an “animal name” or a “disease name” depending on the documents.

In this paper, we focus on the structural heterogeneity issue in document stores.

Example. We use the example collection of Figure 1, composed of five documents describing authors and some of their publications. Documents are described using JavaScript Object Notation (Bourhis et al., 2017).

Let us suppose we are interested in collecting information related to the “name of authors” and their publications. The query will be formulated over the attributes “name” and “title”. Any user may expect results for the five authors of the example (except perhaps for “paul verlaine”) and possibly five titles. If we look at Figure 1, the attribute “name” does not present any problem since it is always in the same position in the five documents. However, the attribute “title” may cause some issues because of its various structural positions within the documents. To reach the attribute “title”, various paths exist in the different document schemas: “title,” “book.title,” “artwork.1.title” and “artwork.2.title” (here “.1.” and “.2.” stand for the indexes in the array “artwork”).

When using the MongoDB document store, we can formulate the query db.C.find( {}, {"name" : 1,

Ben Hamadou, H., Ghozzi, F., Péninou, A. and Teste, O.

Querying Heterogeneous Document Stores.

DOI: 10.5220/0006777800580068

In Proceedings of the 20th International Conference on Enterprise Information Systems (ICEIS 2018), pages 58-68

ISBN: 978-989-758-298-1

[ { name:"victor hugo",
    title:"les miserables",
    year:1862
  },
  { name:"honore de balzac",
    book:{title:"le pere Goriot",
          year:1835
    }
  },
  { name:"paul verlaine",
    birthyear:1844
  },
  { name:"charles baudelaire",
    artwork:[
      {title:"les fleurs du mal",
       year:1857
      },
      {title:"le spleen de Paris",
       year:1855
      }
    ]
  },
  { name:"pierre de ronsard",
    title:"les amours",
    year:1557
  }
]

Figure 1: Collection “C” of ﬁve example documents.

"title" : 1}). Executing such a query returns the following incomplete set of documents because of the structural heterogeneity of the attribute “title”:

[ { name:"victor hugo",

title:"les miserables" },

{ name:"honore de balzac" },

{ name:"paul verlaine" },

{ name:"charles baudelaire" },

{ name:"pierre de ronsard",

title:"les amours" } ]

If we formulate an alternative query that matches another path of “title”, db.C.find( {}, {"name" : 1, "book.title" : 1}), the following incomplete set of documents is returned:

[ { name:"victor hugo" },
  { name:"honore de balzac",
    book:{title:"le pere Goriot"} },
  { name:"paul verlaine" },
  { name:"charles baudelaire" },
  { name:"pierre de ronsard" } ]

We can notice that each query returns only a part of the expected result (author/title pairs) while also returning redundant, incomplete documents. Moreover, without any other query, these two incomplete results can lead the user to conclude that “charles baudelaire” has no publication in the collection, which is not true.
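The effect described above can be reproduced without a database. The following Python sketch is a simplified stand-in for a single-path lookup, not MongoDB's implementation; the helper names `lookup` and `authors_with` are hypothetical, and only plain object nesting is followed.

```python
# Collection C of Figure 1, written as Python dictionaries.
C = [
    {"name": "victor hugo", "title": "les miserables", "year": 1862},
    {"name": "honore de balzac",
     "book": {"title": "le pere Goriot", "year": 1835}},
    {"name": "paul verlaine", "birthyear": 1844},
    {"name": "charles baudelaire",
     "artwork": [{"title": "les fleurs du mal", "year": 1857},
                 {"title": "le spleen de Paris", "year": 1855}]},
    {"name": "pierre de ronsard", "title": "les amours", "year": 1557},
]


def lookup(doc, path):
    """Follow one dotted path through nested objects; None when absent."""
    current = doc
    for step in path.split("."):
        if not isinstance(current, dict) or step not in current:
            return None
        current = current[step]
    return current


def authors_with(path):
    """Names of the authors for which the given path yields a value."""
    return [d["name"] for d in C if lookup(d, path) is not None]


# One query per path: each one only sees part of the collection.
found_by_title = authors_with("title")
found_by_book_title = authors_with("book.title")
print(found_by_title)
print(found_by_book_title)
```

Neither lookup reaches the titles stored under “artwork”, which is why a user relying on these two queries alone would miss the publications of “charles baudelaire”.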

In the literature, two approaches are developed to deal with structural heterogeneity. The data integration approach consists in transforming data according to a unified schema to form a homogeneous collection (Tahara et al., 2014). The automated schema discovery approach provides the various schemas to users (Wang et al., 2015). Data integration may be a time-consuming task because it requires defining a mapping for every schema variation, while automated schema discovery requires that users handle many structures and manage heterogeneity by themselves.

Our approach is designed to resolve these issues. It lets the user query a collection using one schema of some documents, and our system transparently rewrites the user query to take into account all existing schemas. We develop a system that we call EasyQ (Easy Query for NoSQL databases), which provides schema-independent querying over heterogeneous documents describing a given entity in document-oriented stores. We opt for a solution based on virtual data integration in which we introduce a data dictionary that runs in a transparent way and hides the complexity of building the expected queries (Yang et al., 2015).

This paper is organized as follows. Section 2 reviews the most relevant works that deal with querying heterogeneous documents. Section 3 explains the proposed approach and its formalization. Section 4 presents our first experiments and the time/size cost of our approach with respect to the size of collections and the variety of schemas.

2 RELATED WORK

The problem of querying heterogeneous data is an active research domain studied in several contexts such as data lakes (Hai et al., 2016), federated databases (Sheth and Larson, 1990), data integration, and schema matching (Rahm and Bernstein, 2001). We classify the state-of-the-art works into four main categories according to the solution given to handle the heterogeneity problems.

Schema Integration. The schema integration process is performed as an intermediary step to facilitate the query execution. In their survey paper, (Rahm and Bernstein, 2001) presented the state-of-the-art techniques used to automate the schema integration process. Matching techniques can cover schemas or even instances. Traditionally, lexical matches are used to handle the syntactic heterogeneity. Furthermore, thesauri and dictionaries are used to perform semantic matching. The schema integration techniques may lead to data duplication and to the possible loss of the initial underlying data structure, which may be impossible or unacceptable when legacy applications must be supported. Let us notice that we built our schema-independent querying upon the ideas developed in schema-level matching techniques.

Physical Re-factorization. Several works have been conducted to enable querying over semi-structured data without any prior schema validation or restriction. Generally, they propose to flatten XML or JSON data into a relational form (Chasseur et al., 2013), (Tahara et al., 2014), (DiScala and Abadi, 2016). SQL queries are formulated based on relational views built on top of the inferred data structures. This strategy implies performing heavy physical re-factorization. Hence, this process requires additional resources, such as an external relational database, and extra effort to learn the unified inferred relational schema. Users dealing with those systems have to learn new schemas every time they change the workload or new data are inserted (or updated) in the collection, because the relational views and the stored columns must be re-generated after every change.

Schema Discovery. Other works propose to infer implicit schemas from semi-structured documents. The idea is to give an overview of the different elements present in the integrated data (Baazizi et al., 2017), (Ruiz et al., 2015). In (Wang et al., 2015) the authors propose summarizing all document schemas under a skeleton to discover the existence of fields or sub-schemas inside the collection. In (Herrero et al., 2016) the authors suggest extracting collection structures to help developers while designing their applications. The heterogeneity problem here is detected when the same attribute is represented differently (different type, different position inside documents). Schema inference methods are useful for the user to have an overview of the data and to take the necessary measures and decisions during the application design phase. The limitation of such a logical view is the need for manual processing while building the desired queries, by including the desired attributes and their possible navigational paths. In that case, the user is aware of the data structures but is still required to manage heterogeneity.

Querying Techniques. Other works suggest resolving the heterogeneity problem by working on the query side. Query rewriting (Papakonstantinou and Vassalos, 1999) is a strategy that rewrites an input query into several derivations to overcome the heterogeneity. The majority of works are designed in the context of relational databases, where heterogeneity is usually restricted to the lexical level only. Given the hierarchical nature of semi-structured data (XML, JSON documents), identifying similar attributes is insufficient to resolve the problem of querying documents with structural heterogeneity. To this end, keyword querying has been adopted in the context of XML (Lin et al., 2017). The process of answering a keyword query on XML data starts by identifying the existence of the keywords within the documents, without the need to know the underlying schemas. The problem is that the results do not consider the heterogeneity in terms of attributes, but assume that if the keyword is found, the document is adequate and has to be returned to the user. Other alternatives for finding different navigational paths leading to the same attribute are supported by (Clark et al., 1999) and (Boag et al., 2002). Only the structural heterogeneity is partially addressed: there is always a need to know the underlying document structures and to learn a complex query language. In addition, these solutions are not built to run over large-scale data. We notice the same limitations with JSONiq (Florescu and Fourny, 2013), the extension to XQuery designed to deal with large-scale semi-structured data.

This paper takes these ideas one step further by introducing a schema-independent querying approach that is built over the native operators supported by document stores. We believe that, in collections of heterogeneous documents describing a given entity, the heterogeneity of documents can be handled via the query rewriting mechanisms introduced in this paper. Our approach is performed in a transparent way over the initial document structures. There is no need to perform heavy transformations nor to use further auxiliary systems.

3 EASY QUERY FOR NoSQL DOCUMENT STORES

EasyQ is a tool that facilitates exploratory querying of a document store, without requiring the user to know the entire data structures of the documents.

Figure 2 gives a high-level view of our engine, divided into two parts: a dictionary builder and a query rewriting engine. To ensure an efficient

ICEIS 2018 - 20th International Conference on Enterprise Information Systems

query enrichment, we introduce EasyQ at an early stage, during the data loading phase, in order to generate and materialize a dictionary containing all the different navigational paths for all attributes. In general, the dictionary is updated each time a document is updated, removed or inserted in the collection. At the querying stage, EasyQ takes as input the user query, called Q, formulated over fields and/or sub-paths, and the desired collection. The EasyQ rewriting engine reads from the dictionary and produces an enriched query supported by the underlying document store, called Qext. Finally, the document store returns the results to the user.

Figure 2: EasyQ architecture: data structure extractor and

query rewriting engine.

In the remainder of this section, we describe the

formal data model and the extended query process.

3.1 Formal Data Model

Usually, a document-oriented store is modelled as a

collection of JSON documents.

Definition 1 (Collection). A collection C is defined as a set of documents:

$C = \{d_1, \dots, d_c\}$

Definition 2 (Document). A document $d_i$, $\forall i \in [1, c]$, is defined as a (key, value) pair:

$d_i = (k_i, v_i)$

• $k_i$ is a key that identifies the document;

• $v_i = \{a_{i,1} : v_{i,1}, \dots, a_{i,n_i} : v_{i,n_i}\}$ is the document value. The document value $v_i$ is defined as an object composed of a set of $(a_{i,j}, v_{i,j})$ pairs, where each $a_{i,j}$ is a string called attribute and each $v_{i,j}$ is the value, which can be atomic (numeric, string, boolean, null) or complex (object, array).

An atomic value is defined as follows:

• $v_{i,j} = n$ if $n \in \mathbb{N}^{*}$, the set of numeric values;

• $v_{i,j} = \text{“s”}$ if “s” is a string formatted in Unicode characters of $\Sigma^{*}$;

• $v_{i,j} = b$ if $b \in \mathbb{B}$, the set of boolean values $\mathbb{B} = \{true, false\}$;

• $v_{i,j} = \bot$ is a null value.

A complex value is defined as follows:

• $v_{i,j} = \{a_{i,j,1} : v_{i,j,1}, \dots, a_{i,j,n_{i,j}} : v_{i,j,n_{i,j}}\}$ is an object value where $a_{i,j,k}$ are strings formatted in Unicode characters of $\Sigma^{*}$, called attributes, and $v_{i,j,k}$ are values. This is a recursive definition identical to the document value;

• $v_{i,j} = [v_{i,j,1}, \dots, v_{i,j,n_{i,j}}]$ is an array of values.

When a document value $v_{i,j}$ is an object or an array, its inner values $v_{i,j,k}$ can themselves be complex values, which allows for different nesting levels. To cope with nested documents and navigate through schemas, we adopt the classical navigational path notation (Bourhis et al., 2017).

Definition 3 (Schema). A schema, denoted $s_i$, inferred from the document value $\{a_{i,1} : v_{i,1}, \dots, a_{i,n_i} : v_{i,n_i}\}$ is defined as a set of paths:

$s_i = \{p_1, \dots, p_{m_i}\}$

Each $p_k$ is a path derived from the document value. For multiple nesting levels, the path is extracted recursively to find the absolute navigational path starting from the root to the atomic value that can be found in the document hierarchy.

A schema $s_i$ of document $d_i$ is formally defined as follows, $\forall j \in [1..n_i]$:

• if $v_{i,j}$ is atomic, $s_i = s_i \cup \{a_{i,j}\}$;

• if $v_{i,j}$ is an object, $s_i = s_i \cup \{a_{i,j}\} \cup \{\bigcup_{p \in s_{i,j}} a_{i,j}.p\}$ where $s_{i,j}$ is the schema of $v_{i,j}$;

• if $v_{i,j}$ is an array, $s_i = s_i \cup \{a_{i,j}\} \cup \bigcup_{k=1}^{n_{i,j}} \left( \{a_{i,j}.k\} \cup \{\bigcup_{p \in s_{i,j,k}} a_{i,j}.k.p\} \right)$ where $s_{i,j,k}$ is the schema of the $k$-th value from the array $v_{i,j}$.

Example. Let us consider the collection $C = \{d_1, \dots, d_5\}$ composed of the documents introduced in Section 1, Figure 1. The underlying schemas of the documents are described as follows:

s1 = { name, title, year }
s2 = { name, book, book.title, book.year }
s3 = { name, birthyear }
s4 = { name, artwork, artwork.1, artwork.2
     , artwork.1.title, artwork.1.year
     , artwork.2.title, artwork.2.year }
s5 = { name, title, year }


We can notice that the attribute “book” from document $d_2$ is an object in which the attributes “title” and “year” are nested. This leads to handling two different navigational paths, “book.title” and “book.year”. We can also notice that the attribute “artwork” in document $d_4$ is an array composed of two sub-documents having the following sub-schemas:

s4.1 = { title, year }
s4.2 = { title, year }

Thus, that leads us to add to the dictionary the four aforementioned paths starting from “artwork”.
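The recursive schema definition can be sketched in a few lines of Python. This is an illustrative reading of Definition 3, not the EasyQ implementation; the function name `extract_schema` is hypothetical, array positions are numbered from 1 as in the example, and arrays nested directly inside arrays are not expanded.

```python
def extract_schema(value, prefix=""):
    """Return the set of navigational paths of an object value."""
    paths = set()
    for attribute, v in value.items():
        path = prefix + attribute
        paths.add(path)                       # every attribute is itself a path
        if isinstance(v, dict):               # object: prefix the nested schema
            paths |= extract_schema(v, path + ".")
        elif isinstance(v, list):             # array: one path per position
            for k, item in enumerate(v, start=1):
                item_path = f"{path}.{k}"
                paths.add(item_path)
                if isinstance(item, dict):
                    paths |= extract_schema(item, item_path + ".")
    return paths


# Documents d2 and d4 of Figure 1.
d2 = {"name": "honore de balzac",
      "book": {"title": "le pere Goriot", "year": 1835}}
d4 = {"name": "charles baudelaire",
      "artwork": [{"title": "les fleurs du mal", "year": 1857},
                  {"title": "le spleen de Paris", "year": 1855}]}

s2 = extract_schema(d2)
s4 = extract_schema(d4)
print(sorted(s2))
print(sorted(s4))
```

Applied to $d_2$ and $d_4$, the sketch yields the sets s2 and s4 listed in the example.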

Definition 4 (Collection Schema). The schema $S_C$ inferred from collection C is defined as follows:

$S_C = \bigcup_{i=1}^{c} s_i$

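Definition 5 (Dictionary). The dictionary $dict_C$ of a collection C is defined by a set of pairs:

$dict_C = \{(p_k, \triangle_{p_k})\}$

• $p_k \in S_C$;

• $\triangle_{p_k} = \{p_k\} \cup \bigcup_{\forall p_l \in S_C} \{p_l.p_k \mid p_l.p_k \in S_C\}$. For each path $p_k$, $\triangle_{p_k}$ is the set of paths leading to $p_k$.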

Example. The dictionary $dict_C$ constructed from the collection C of Figure 1 is defined hereafter. Each dictionary entry $p_k$ refers to the set of all extracted navigational paths $\triangle_{p_k}$. For example, the entry “year” refers to all navigational paths {year, book.year, artwork.1.year, artwork.2.year} leading to the attribute “year”.

{

(name, {name}),

(title, {title, book.title,

artwork.1.title, artwork.2.title}),

(year, {year, book.year,

artwork.1.year, artwork.2.year}),

(book, {book}),

(book.title, {book.title}),

(book.year, {book.year}),

(birthyear, {birthyear}),

(artwork, {artwork}),

(artwork.1, {artwork.1}),

(artwork.1.title, {artwork.1.title}),

(artwork.1.year, {artwork.1.year}),

(artwork.2, {artwork.2}),

(artwork.2.title, {artwork.2.title}),

(artwork.2.year, {artwork.2.year})

}
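As an illustration, the dictionary above can be derived from the five schemas of the example. The Python sketch below is one possible reading of Definitions 4 and 5, under the assumption that a path “leads to” an entry when it equals the entry or ends with a dot followed by the entry; the name `build_dictionary` is hypothetical.

```python
# Schemas s1..s5 of the example collection.
schemas = [
    {"name", "title", "year"},
    {"name", "book", "book.title", "book.year"},
    {"name", "birthyear"},
    {"name", "artwork", "artwork.1", "artwork.2",
     "artwork.1.title", "artwork.1.year",
     "artwork.2.title", "artwork.2.year"},
    {"name", "title", "year"},
]


def build_dictionary(schemas):
    """Map every path of the collection schema to the paths leading to it."""
    collection_schema = set().union(*schemas)   # union of all document schemas
    return {
        p: {q for q in collection_schema if q == p or q.endswith("." + p)}
        for p in collection_schema
    }


dictionary = build_dictionary(schemas)
print(sorted(dictionary["title"]))
print(sorted(dictionary["year"]))
```

Testing the suffix with a leading dot keeps “birthyear” out of the entry “year”, which matches the listing above.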

3.2 Querying Heterogeneous Document Stores

The querying process is supported by a set of elementary operators. These operators are expressed by native MongoDB query commands such as “find” or “aggregate”.

3.2.1 Kernel of Operators

The queries are defined according to combinations of elementary operators. The set of operators forms a kernel, which is denoted K. For now, this kernel is composed of three operators: projection, restriction (or selection) and aggregation. Each elementary operator is unary; we call $C_{in}$ the queried collection, and $C_{out}$ the resulting collection.

Definition 6 (Kernel). The kernel K is a minimal closed set composed of the following unary operators:

$K = \{\pi, \sigma, \gamma\}$

• $\pi_A(C_{in}) = C_{out}$ is a project operator, which consists in restricting each document schema $s_i$ to a subset of attributes $A \subseteq S_{C_{in}}$.

• $\sigma_p(C_{in}) = C_{out}$ is a restrict operator, which consists in selecting documents from $C_{in}$ satisfying the predicate p. A simple predicate is expressed by $a_k\ \omega_k\ v_k$ where $a_k \in S_{C_{in}}$ is an attribute, $\omega_k \in \{=, >, <, \neq, \geq, \leq\}$ is a comparison operator, and $v_k$ is a value. It is possible to combine predicates by the logical connectors $\{\vee, \wedge, \neg\}$. We suppose that the predicate is defined as, or normalized to, a conjunctive normal form: $\bigwedge_k \left( \bigvee_l a_{k,l}\ \omega_{k,l}\ v_{k,l} \right)$.

• ${}_{G}\gamma_{F}(C_{in}) = C_{out}$ is an aggregate operator, which consists of aggregating each group of documents having the same values for $G \subseteq S_{C_{in}}$ and calculating the aggregate values, $F = \{ f(a_k) \mid f \in \{Sum, Max, Min, Avg, Count\} \wedge a_k \in S_{C_{in}} \wedge a_k \notin G \}$.

Definition 7 (Query). A query Q is formulated by composing the previous unary operators as follows:

$Q = q_1 \circ \dots \circ q_r (C)$

where $\forall i \in [1, r]$, $q_i \in K$.


Example. Let us consider the collection C of Figure 1. We propose hereafter several examples of queries; let us keep in mind that structural heterogeneity exists in C and that those queries are not expected to deal with the heterogeneity.

• “Search for the list of authors’ names and their publications”

$\pi_{name,title}(C) =$

[ { name:"victor hugo",

title:"les miserables" },

{ name:"honore de balzac" },

{ name:"paul verlaine" },

{ name:"charles baudelaire" },

{ name:"pierre de ronsard",

title:"les amours" } ]

• “Search for the titles of the publications of Pierre de Ronsard and Charles Baudelaire”

$\pi_{name,title}(\sigma_{name=\text{“Charles Baudelaire”} \,\vee\, name=\text{“Pierre de Ronsard”}}(C)) =$

[ { name:"charles baudelaire" },

{ name:"pierre de ronsard",

title:"les amours" } ]

• “Search for the number of publications for each author”

${}_{name}\gamma_{count(title)}(C) =$

[ { name:"victor hugo", count:1 },

{ name:"honore de balzac", count:0 },

{ name:"paul verlaine", count:0 },

{ name:"charles baudelaire", count:0 },

{ name:"pierre de ronsard", count:1 } ]

As mentioned above, due to the structural heterogeneity of the attribute “title”, we notice that these queries do not give relevant results according to the stored documents. To obtain relevant results, users would have to write complex queries taking into account the various schemas.

3.2.2 Query Extension Process

Dealing with a collection of heterogeneous documents complicates the process of expressing queries. Most NoSQL systems do not natively support querying heterogeneous documents. For instance, the “find” operator, as well as the “aggregate” pipeline operator of MongoDB, is not able to automatically recognize the numerous structures of the queried collection. More precisely, the result does not include values from navigational paths that are not explicitly included in the query.

Our approach aims at enabling a transparent querying process on a collection of heterogeneous documents via an automatic query rewriting process. It employs the materialized dictionary to enrich the initial query by including the different navigational paths that lead to the desired attributes. It is described in Algorithm 1, and its parts are detailed hereafter:

• In case of projection, the list of projected attributes A is extended by the various navigational paths $\triangle_{a_k}$ for each attribute $a_k \in A$; the underlying idea is to ask the DBMS to search for all possible existing paths for the attributes.

• In case of restriction, the conjunctive normal form of the predicate p is enriched by the set of extended disjunctions built from the navigational paths $\triangle_{a_{k,l}}$ for each attribute $a_{k,l}$ of the predicate; the underlying idea is to ask the DBMS to test all possible existing paths for the attributes.

• In case of aggregation, the operation is extended using two operations: an added projection to deal with the heterogeneity of attributes, and a classical aggregation to perform the computation. The list of attributes G is extended by the various navigational paths $\triangle_{a_k}$ for each attribute $a_k \in G$. Each path is renamed according to the attribute $a_k$ given in the aggregation; in Algorithm 1, we note the rename operation $a_k \Leftarrow \triangle_{a_k}$. An equivalent projection is made for all attributes of F. Then the actual, classical aggregation can be done. The underlying idea is to ask the DBMS to “flatten” all possible heterogeneous paths of the attributes in G and F, in order to be able to group documents on the same value of a same attribute and to calculate the aggregated value on a same attribute. Let us notice that such operations are done by the DBMS, often in pipeline mode, and are not a physical factorization (nor a physical flattening).
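The projection and restriction cases can be sketched as plain data transformations. The Python sketch below is illustrative only: the representation of a predicate as a list of clauses of (attribute, operator, value) triples, and the names `extend_projection` and `extend_restriction`, are assumptions made for this example, and the paths are emitted exactly as stored in the dictionary without translating them to the syntax of a particular store.

```python
# Excerpt of the dictionary of the example collection.
dictionary = {
    "name": ["name"],
    "title": ["title", "book.title", "artwork.1.title", "artwork.2.title"],
    "year": ["year", "book.year", "artwork.1.year", "artwork.2.year"],
}


def extend_projection(attributes, dictionary):
    """Replace each projected attribute by all of its navigational paths."""
    extended = []
    for a in attributes:
        for path in dictionary.get(a, [a]):
            if path not in extended:          # avoid projecting a path twice
                extended.append(path)
    return extended


def extend_restriction(cnf, dictionary):
    """Extend a predicate in conjunctive normal form.

    cnf is a list of clauses (joined by AND); each clause is a list of
    (attribute, operator, value) triples (joined by OR). Every triple is
    expanded into one triple per navigational path, inside the same clause.
    """
    return [
        [(path, op, value)
         for (a, op, value) in clause
         for path in dictionary.get(a, [a])]
        for clause in cnf
    ]


projection = extend_projection(["name", "title"], dictionary)
restriction = extend_restriction([[("year", ">", 1850)]], dictionary)
print(projection)
print(restriction)
```

Because the added paths stay inside the original disjunction, the extended predicate remains in conjunctive normal form.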

Example. Let us consider the previous example queries of Section 3.2.1.

• The query rewriting engine rewrites the query $\pi_{name,title}(C)$. For each projected field (respectively “name” and “title”), the process consults the dictionary and extracts all the possible navigational paths (respectively $\triangle_{name} = \{name\}$ and $\triangle_{title} = \{title, book.title, artwork.1.title, artwork.2.title\}$). The projection query is then rewritten as

$\pi_{name,title,book.title,artwork.1.title,artwork.2.title}(C) =$

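[ { name:"victor hugo",
    title:"les miserables" },
  { name:"honore de balzac",
    book:{title:"le pere Goriot" } },
  { name:"paul verlaine" },
  { name:"charles baudelaire",
    artwork:[
      {title:"les fleurs du mal"},
      {title:"le spleen de Paris"}] },
  { name:"pierre de ronsard",
    title:"les amours" } ]

Algorithm 1: Automatic extension of the initial user query.

input: $Q$
output: $Q_{ext}$
$Q_{ext} \leftarrow id$ // identity
foreach $q_i \in Q$ do
  switch $q_i$ do
    case $\pi_A$: // projection
      $A_{ext} \leftarrow \bigcup_{\forall a_k \in A} \triangle_{a_k}$
      $Q_{ext} \leftarrow Q_{ext} \circ \pi_{A_{ext}}$
    end
    case $\sigma_{Norm_p}$: // restriction
      $P_{ext} \leftarrow \bigwedge_k \left( \bigvee_l \left( \bigvee_{\forall a_j \in \triangle_{a_{k,l}}} a_j\ \omega_{k,l}\ v_{k,l} \right) \right)$
      $Q_{ext} \leftarrow Q_{ext} \circ \sigma_{P_{ext}}$
    end
    case ${}_{G}\gamma_{F}$: // aggregation
      $Q_{ext} \leftarrow Q_{ext} \circ \left( \pi_{\forall a_k \in G:\ a_k \Leftarrow (\triangle_{a_k}),\ \forall f(a_l) \in F:\ a_l \Leftarrow (\triangle_{a_l})} \circ {}_{G}\gamma_{F} \right)$
    end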

• Our rewriting engine extends the query $\pi_{name,title}(\sigma_{name=\text{“Charles Baudelaire”} \,\vee\, name=\text{“Pierre de Ronsard”}}(C))$ with the dictionary entries in the same way as the previous query. The projected attributes are extended as for the previous query. Next, the process continues with the selection. The selection is rewritten by extending the normal form of its predicate; since the attribute “name” has only one structural form, the predicate is not rewritten. The composed query is then rewritten as

$\pi_{name,title,book.title,artwork.1.title,artwork.2.title}(\sigma_{name=\text{“Charles Baudelaire”} \,\vee\, name=\text{“Pierre de Ronsard”}}(C)) =$

[ { name:"charles baudelaire",

artwork:[

{title:"les fleurs du mal"},

{title:"le spleen de Paris"}] },

{ name:"pierre de ronsard",

title:"les amours" } ]

• Our rewriting engine transforms the query ${}_{name}\gamma_{count(title)}(C)$ to rename the different heterogeneous paths. The query is then rewritten as the composed query

${}_{name}\gamma_{count(title)}(\pi_{name:(name \Leftarrow (name)),\ title:(title \Leftarrow (title|book.title|artwork.1.title|artwork.2.title))}(C)) =$

[ { name:"victor hugo", count:1 },

{ name:"honore de balzac", count:1 },

{ name:"paul verlaine", count:0 },

{ name:"charles baudelaire", count:2 },

{ name:"pierre de ronsard", count:1 } ]
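The two-step rewriting of the aggregation can be emulated in plain Python to check the counts above. This sketch does not call a document store; the helper names `resolve`, `project_renamed` and `count_titles` are hypothetical, and numeric path steps are interpreted as 1-based array positions, following the notation of the paper.

```python
# Collection C of Figure 1.
C = [
    {"name": "victor hugo", "title": "les miserables", "year": 1862},
    {"name": "honore de balzac",
     "book": {"title": "le pere Goriot", "year": 1835}},
    {"name": "paul verlaine", "birthyear": 1844},
    {"name": "charles baudelaire",
     "artwork": [{"title": "les fleurs du mal", "year": 1857},
                 {"title": "le spleen de Paris", "year": 1855}]},
    {"name": "pierre de ronsard", "title": "les amours", "year": 1557},
]

TITLE_PATHS = ["title", "book.title", "artwork.1.title", "artwork.2.title"]


def resolve(doc, path):
    """Follow a dotted path; numeric steps are 1-based array positions."""
    current = doc
    for step in path.split("."):
        if isinstance(current, dict) and step in current:
            current = current[step]
        elif (isinstance(current, list) and step.isdigit()
              and 1 <= int(step) <= len(current)):
            current = current[int(step) - 1]
        else:
            return None                       # path absent from this document
    return current


def project_renamed(doc):
    """Added projection: gather every heterogeneous path under 'title'."""
    values = [resolve(doc, p) for p in TITLE_PATHS]
    return {"name": doc["name"],
            "title": [v for v in values if v is not None]}


def count_titles(collection):
    """Classical aggregation: group by name and count the renamed titles."""
    counts = {}
    for doc in map(project_renamed, collection):
        counts[doc["name"]] = counts.get(doc["name"], 0) + len(doc["title"])
    return [{"name": n, "count": c} for n, c in counts.items()]


result = count_titles(C)
print(result)
```

The renaming step is what lets the grouping see the two titles of “charles baudelaire” as values of the same attribute.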

4 EXPERIMENTS

The overall goal of the following experiments is to study whether the rewriting process is acceptable along several dimensions: the cost/overhead for query evaluation, the size of the dictionary and the time cost of building it, and the number of possible schemas that EasyQ can deal with. The purpose of the first experiments in this section is to study the scale effects on the rewritten queries with respect to two main factors: the size of the queried collection and the heterogeneity levels. In addition, we study their effects on the dictionary. We choose MongoDB to store the different datasets and the dictionary, and to run the rewritten queries.

Let us notice that, using MongoDB or any other classical document store, the rewriting process is compatible with the underlying DBMS engine. Indeed, during any query evaluation, if a path is not present in a document, it is simply ignored. Thus, the following rules are applied during query evaluation:

• Projection: for each document, any non-present projection path is ignored and only those really existing in the document are retrieved.

• Restriction: for each document, any non-present enriched path is ignored since it has been included in a disjunctive form. If no path is found in the document, the condition is evaluated to false.

• Aggregation: the same rule applies as for projection since we use this operator for rewriting purposes; grouping and aggregation computing are the classical ones.

Experimental Protocol. All experiments in this paper were implemented in Python and ran on a server with an Intel I5 (3.4 GHz, 4 cores), 16 GB RAM and CentOS 7.0. We repeated each experiment 5 times and we report the mean values. The details of the dataset and the queries are presented in the remainder of this section.


Dataset. In this experimental evaluation, we employ synthetic datasets with various schemas and volumes. All datasets are generated from an initial flat collection of documents that describe films published by IMDB. To this end, we inject structural heterogeneity by introducing new grouping fields. We nest the initial attributes inside these new groups. The values of those fields are randomly chosen from the original film collection. To add more complexity, we can set the nesting level used for each generated structure. We built our custom data generator, which allows us to define several parameters such as the number of schemas to produce in the collection and the percentage of presence of every generated schema. For each schema, we can adjust the number of grouping objects. By grouping object, we mean a compound field in which we nest a subset of the document.

Let us notice that every dataset can be generated in two versions: the generated heterogeneous one and an equivalent flat one. The flat dataset contains the data from the heterogeneous one, in which each document is flattened to its leaf attributes. So, each document value existing in the heterogeneous dataset also exists in the flat one. This allows us to compare queries over heterogeneous data and equivalent homogeneous data (flat documents in our case) since, if they are relevant, they should return the same number of documents and the same values. We are currently working on the free online delivery of the datasets and of the dataset generator; for the moment, you can ask the authors or visit their websites.
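The relation between the two dataset versions can be illustrated by a small flattening routine. This Python sketch is not the authors' generator: the function name `flatten`, the grouping field names and the sample values are hypothetical, and it assumes that grouping fields are nested objects and that leaf attribute names are unique within a document.

```python
def flatten(document):
    """Return a flat document keeping only leaf attributes and their values."""
    flat = {}
    for attribute, value in document.items():
        if isinstance(value, dict):
            flat.update(flatten(value))   # drop the grouping field, keep its leaves
        else:
            flat[attribute] = value
    return flat


# Hypothetical heterogeneous document with two nested grouping fields.
nested = {
    "group_1": {
        "director_name": "A. Director",
        "group_2": {"gross": 150000, "duration": 120},
    },
    "country": "France",
}

flat = flatten(nested)
print(flat)
```

Every value of the nested document is still present in the flat one, which is the property that makes the two versions comparable.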

Queries. We define two queries composed of 2 and 8 predicates for the projection and selection operators, and we use all possible comparison operators on different data types. For each query, we generate two versions: the conjunctive form of the predicates in queries $Q_1$ and $Q_3$, and the disjunctive form in queries $Q_2$ and $Q_4$. Moreover, we introduce two other queries, $A_1$ and $A_2$, to study the aggregation operator.

• $Q_1$ & $Q_2$ select all documents where the “director name” of the film starts with the letter “A” and/or the film got as “gross” more than 100 K.

• $Q_3$ & $Q_4$ select all documents where the “director name” of the film starts with the letter “A” and/or the film got as “gross” more than 100 K and/or the “duration” of the film does not exceed 200 minutes and/or the “title year” is less than the year 1950 and/or the “country” of the film is known and/or the film “language” is “English” and/or the film got “IMDB score” more less than 4 and/or the “number of Facebook likes” is greater than 500.

• $A_1$ groups documents by “country” and “language” and then aggregates by the function “Max” over the “film score”.

• $A_2$ groups documents by “director name” and “year” and then aggregates by the function “Sum” over the “revenue”.

Footnote: www.omdbapi.com (source of the IMDB film collection)

Table 1: Settings of the generated dataset for rewritten queries evaluation.

Setting                      Value
# schemas                    10
# groups per schema          {5,6,1,3,4,2,7,2,1,3}
Nesting levels per schema    {4,2,6,1,5,7,2,8,3,4}
% schema presence            10%
# attributes per schema      Random
# attributes per group       Random

Scale Effects on the Rewritten Queries. In this test series, we study the effects of scale on the rewritten queries. We define three contexts in which we run the queries defined above. The order of query execution is randomized to prevent the document store from reusing cache mechanisms. Here, we describe the different execution contexts:

• “QBase” denotes the initial user query (one of the queries defined above), executed over the homogeneous version of the dataset. The purpose of this first context is to study the native behaviour of the document store. We use it as the baseline for our experiments.

• “QRewritten” denotes the query “QBase” rewritten by our approach and executed over the heterogeneous version of the dataset. As mentioned above, the two datasets are considered “equivalent”, so “QRewritten” is expected to return the same number of documents (and the same content) as “QBase”. This is the case in all the following experiments.

• “QAccumulated” denotes the set of equivalent queries formulated on each possible schema of the collection. In our case, it is made of 10 separate queries since we are dealing with collections having ten schemas. It is executed over the heterogeneous version of the dataset. For the experiments, we wrote these queries “by hand”.
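The difference between the three contexts can be sketched as follows. This is a minimal illustration under stated assumptions, not the EasyQ implementation: the two schemas, their paths and the `rewrite` helper are hypothetical, and each query node is assumed to carry exactly one key.

```python
# Hypothetical schemas: attribute name -> path of that attribute in the schema.
schemas = [
    {"director_name": "director_name", "gross": "gross"},
    {"director_name": "crew.director_name", "gross": "info.gross"},
]

# Dictionary: attribute name -> every path observed across the schemas.
dictionary = {}
for schema in schemas:
    for attribute, path in schema.items():
        dictionary.setdefault(attribute, []).append(path)


def rewrite(query, paths_for):
    """Replace each attribute predicate by a disjunction over its paths."""
    (key, cond), = query.items()  # assumption: one key per query node
    if key in ("$and", "$or"):
        return {key: [rewrite(sub, paths_for) for sub in cond]}
    return {"$or": [{path: cond} for path in paths_for(key)]}


# QBase: the user query, written against attribute names only.
q_base = {"$and": [{"director_name": {"$regex": "^A"}},
                   {"gross": {"$gt": 100_000}}]}

# QRewritten: a single query covering all paths known to the dictionary.
q_rewritten = rewrite(q_base, lambda attr: dictionary[attr])

# QAccumulated: one query per schema, each using that schema's own paths.
q_accumulated = [rewrite(q_base, lambda attr, s=schema: [s[attr]])
                 for schema in schemas]

print(len(q_accumulated))  # 2
```

With ten schemas, `q_accumulated` would hold the ten separate queries described above, while `q_rewritten` stays a single query whose predicates each expand into ten alternative paths.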

Table 1 presents the characteristics of the datasets used for this first category of experiments. Note that each attribute is present in ten different schemas at different nesting levels.


[Figure 3: six plots of execution time in s (y axis) against collection size in GB (x axis: 10 GB, 50 GB, 100 GB) for the series QRewritten, QBase and QAccumulated. Panels: Q1 (conjunctive with 2 attributes), Q2 (disjunctive with 2 attributes), Q3 (conjunctive with 8 attributes), Q4 (disjunctive with 8 attributes), aggregation query A1, aggregation query A2.]

Figure 3: Collection size effect on the rewritten queries compared to classical ones.

Figure 3 shows our first results. In each plot, the x axis is the collection size (GB) and the y axis is the query execution time (s); the blue curve refers to “QRewritten”, the green one to “QBase”, and the red one to “QAccumulated”, that is, the sum of the evaluation times of the ten sub-queries.

As shown in Figure 3, the behaviour of our rewritten query is similar to that of the baseline. Both “QRewritten” and “QBase” have execution times that grow linearly with the collection size, while the accumulated query “QAccumulated” seems to exhibit exponential time costs. We can also notice that the execution time of our solution is less than twice that of the baseline query (e.g., for the disjunctive form). Moreover, we measure an overall overhead that does not exceed 1.5 times in the different projection and selection queries.

The same behaviour is also observed for the aggregation queries, for which only “QRewritten” and “QBase” are presented. The rewriting of aggregation uses the “aggregate” pipeline operator of MongoDB. Notably, despite the necessary insertion of two projections in the pipeline (cf. Algorithm 1 and the explanations in Section 3.2.2), the execution time overhead remains low.
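To make the shape of such a pipeline concrete, the sketch below builds an aggregation pipeline for a query like A1 with one projection before the grouping stage and one after it. It is one plausible reading of the "two projections", not a transcription of Algorithm 1: the dictionary contents, the helper names and the use of nested `$ifNull` expressions to unify the alternative paths are all assumptions.

```python
# Hypothetical dictionary: attribute name -> alternative paths in the collection.
dictionary = {
    "country": ["country", "info.country"],
    "language": ["language", "info.language"],
    "score": ["score", "rating.score"],
}


def coalesce(paths):
    """Nested $ifNull: the first path that holds a value supplies it."""
    expr = None
    for path in reversed(paths):
        expr = {"$ifNull": ["$" + path, expr]}
    return expr


def rewrite_aggregation(group_by, measure, func, dictionary):
    """Build a $project / $group / $project pipeline over unified attributes."""
    unify = {attr: coalesce(dictionary[attr]) for attr in group_by + [measure]}
    return [
        # First projection: bring every attribute to a single top-level name.
        {"$project": unify},
        # Grouping and aggregation on the unified names.
        {"$group": {"_id": {attr: "$" + attr for attr in group_by},
                    "result": {func: "$" + measure}}},
        # Second projection: expose the grouping keys and the result.
        {"$project": {"_id": 0, "group": "$_id", "result": 1}},
    ]


pipeline = rewrite_aggregation(["country", "language"], "score", "$max", dictionary)
for stage in pipeline:
    print(stage)
```

The resulting list could be passed to the `aggregate` operator; the extra stages only rename and select fields, which is consistent with the low overhead reported above.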

For all queries, despite the fact that each attribute has been replaced by ten possible paths, the execution time overhead remains quite low. We believe that this overhead is acceptable since we avoid the extra cost of refactoring the underlying data structures. Unlike the baseline, our synthetic dataset contains different grouping objects with varying nesting levels. The rewritten query therefore contains several navigational paths that are processed by the native query engine of MongoDB to find matches in each visited document of the collection. Finally, note that the aggregation rewriting allows performing complex computations that are particularly time-consuming and error-prone when done “by hand”.

Heterogeneity Effects on the Dictionary and the Query Build Time. With this series of experiments, we try to push the dictionary and the query rewriting engine to their limits. To that end, we generated a heterogeneous synthetic collection of 1 GB. We use the initial 28 attributes from the IMDB flat films collection. The custom collections are generated in such a way that each schema inside a document is composed of two grouping objects with no further nesting levels. We generated collections having 10, 100, 1k, 3k and 5k schemas. For this experiment, we use the query Q introduced earlier in this section. We present the dictionary size and the time needed to build the rewritten query of Q in Table 2.

ICEIS 2018 - 20th International Conference on Enterprise Information Systems

Table 2: Data diversity effects on query rewriting time and dictionary size.

# of schemas    Query rewriting time (s)    Dictionary size
10              0.0005                      40 KB
100             0.0025                      74 KB
1 K             0.139                       2 MB
3 K             0.6                         7.2 MB
5 K             1.52                        12 MB

It is notable that the time to build the rewritten query is very low: it stays below two seconds even when 5k distinct schemas exist in the collection. In addition, it is possible to construct a dictionary over a highly heterogeneous collection of documents; here, our dictionary supports up to 5k distinct schemas. The resulting size of the materialized dictionary is very promising since it does not require significant storage space. Furthermore, we believe that the short time spent building the rewritten query is another advantage of our solution. When rewriting the queries, we try to find distinct navigational paths for eight predicates. With 5k paths for each query predicate, these experiments show that we are able to generate a selection query with 40k navigational paths expressed in disjunctive form.
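The figure of 40k navigational paths follows directly from 8 predicates × 5k paths per predicate. The sketch below reproduces only this count on a synthetic dictionary; the attribute names, the path naming scheme and the placeholder `$exists` predicate are assumptions, and the measured time depends on the machine, so it is not comparable to Table 2.

```python
import time


def synthetic_dictionary(attributes, n_schemas):
    """One distinct path per attribute and schema (hypothetical naming)."""
    return {a: [f"g{i}.{a}" for i in range(n_schemas)] for a in attributes}


attributes = [f"attr{k}" for k in range(8)]          # eight query predicates
dictionary = synthetic_dictionary(attributes, 5000)  # 5k schemas

start = time.perf_counter()
# Each predicate expands into a disjunction over all of its paths.
rewritten = {"$or": [
    {"$or": [{path: {"$exists": True}} for path in dictionary[a]]}
    for a in attributes
]}
elapsed = time.perf_counter() - start

n_paths = sum(len(branch["$or"]) for branch in rewritten["$or"])
print(n_paths)  # 40000
print(f"built in {elapsed:.4f} s")
```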

5 CONCLUSION

In this paper, we provide a novel approach for querying heterogeneous documents describing a given entity over document-oriented data stores. Our objective is to allow users to perform their queries with minimal knowledge of the data schemas. Our tool EasyQ is based on two main principles. The first one is a dictionary that contains all possible paths for a given field. The second one is a rewriting module that modifies the user query to match all field paths existing in the dictionary. Our approach is a syntactic manipulation of queries. Therefore, it is grounded on a strong assumption: the collection describes homogeneous entities, i.e., a field has the same meaning in all document schemas. If this assumption is not guaranteed, users may face irrelevant or incoherent results.
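The first principle can be illustrated with a minimal sketch of how such a dictionary might be collected from a set of documents. It is an assumption-laden simplification rather than the EasyQ implementation: only leaf fields of nested objects are recorded, arrays are treated as plain values, and the function names and sample documents are hypothetical.

```python
def leaf_paths(document, prefix=""):
    """Yield (field name, dotted path) for every non-object value."""
    for key, value in document.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            yield from leaf_paths(value, path)
        else:
            yield key, path


def build_dictionary(documents):
    """Map each field name to the distinct paths under which it appears."""
    dictionary = {}
    for document in documents:
        for field, path in leaf_paths(document):
            paths = dictionary.setdefault(field, [])
            if path not in paths:
                paths.append(path)
    return dictionary


# Three made-up documents describing the same kind of entity.
docs = [
    {"title": "t1", "gross": 1},
    {"title": "t2", "info": {"gross": 2}},
    {"film": {"title": "t3", "details": {"gross": 3}}},
]
print(build_dictionary(docs))
# {'title': ['title', 'film.title'], 'gross': ['gross', 'info.gross', 'film.details.gross']}
```

Because the mapping is keyed by field name alone, it relies on the assumption stated above: a field with the same name must have the same meaning in every schema.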

We conducted experiments to compare the execution time of basic MongoDB queries with that of the rewritten queries proposed by our approach. The experiments vary two primary parameters: the size of the dataset and the structural heterogeneity inside a collection. Results show that the cost of executing the rewritten queries proposed in this paper is higher than that of executing basic user queries. This overhead is due to the combination of multiple access paths to a queried field. Nevertheless, it is negligible when compared to the execution of separate queries for each path. An interesting advantage of EasyQ is that each time a query is evaluated, it is first rewritten according to the dictionary, which is updated online. Therefore, the query always automatically deals with all existing schemas.

These first results are very encouraging for continuing this line of research, but they need to be strengthened. Short-term perspectives are to continue the evaluations and to identify the limitations regarding the number of paths and fields in the same query and regarding time cost. More experiments remain to be performed on larger “real data” datasets. Another perspective is to study in depth the process of building the dictionary in real applications, in parallel with collection updates and querying.

Finally, a long-term perspective is to enhance querying over a collection of documents presenting several levels of heterogeneity, i.e., structural as well as syntactic and semantic heterogeneities.

REFERENCES

Baazizi, M.-A., Lahmar, H. B., Colazzo, D., Ghelli, G., and Sartiani, C. (2017). Schema inference for massive JSON datasets. In EDBT.

Boag, S., Chamberlin, D., Fernández, M. F., Florescu, D., Robie, J., Siméon, J., and Stefanescu, M. (2002). XQuery 1.0: An XML query language.

Bourhis, P., Reutter, J. L., Suárez, F., and Vrgoč, D. (2017). JSON: data model, query languages and schema specification. In Proceedings of the 36th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems, pages 123–135. ACM.

Chasseur, C., Li, Y., and Patel, J. M. (2013). Enabling JSON document stores in relational systems. In WebDB, volume 13, pages 14–15.

Clark, J., DeRose, S., et al. (1999). XML path language (XPath) version 1.0.

DiScala, M. and Abadi, D. J. (2016). Automatic generation of normalized relational schemas from nested key-value data. In Proceedings of the 2016 International Conference on Management of Data, pages 295–310. ACM.

Florescu, D. and Fourny, G. (2013). JSONiq: The history of a query language. IEEE Internet Computing, 17(5):86–90.

Hai, R., Geisler, S., and Quix, C. (2016). Constance: An intelligent data lake system. In Proceedings of the 2016 International Conference on Management of Data, pages 2097–2100. ACM.

Herrero, V., Abelló, A., and Romero, O. (2016). NoSQL design for analytical workloads: variability matters. In ER 2016, Gifu, Japan, November 14-17, 2016, Proceedings 35, pages 50–64. Springer.

Lin, C., Wang, J., and Rong, C. (2017). Towards heterogeneous keyword search. In Proceedings of the ACM Turing 50th Celebration Conference-China, page 46. ACM.


Papakonstantinou, Y. and Vassalos, V. (1999). Query rewriting for semistructured data. In ACM SIGMOD Record, volume 28, pages 455–466. ACM.

Rahm, E. and Bernstein, P. A. (2001). A survey of approaches to automatic schema matching. The VLDB Journal, 10(4):334–350.

Ruiz, D. S., Morales, S. F., and Molina, J. G. (2015). Inferring versioned schemas from NoSQL databases and its applications. In International Conference on Conceptual Modeling, pages 467–480. Springer.

Sheth, A. P. and Larson, J. A. (1990). Federated database systems for managing distributed, heterogeneous, and autonomous databases. ACM Computing Surveys (CSUR), 22(3):183–236.

Shvaiko, P. and Euzenat, J. (2005). A survey of schema-based matching approaches. Journal on Data Semantics IV, pages 146–171.

Tahara, D., Diamond, T., and Abadi, D. J. (2014). Sinew: a SQL system for multi-structured data. In Proceedings of the 2014 ACM SIGMOD, pages 815–826. ACM.

Wang, L., Zhang, S., Shi, J., Jiao, L., and Hassanzadeh (2015). Schema management for document stores. Proceedings of the VLDB Endowment, 8(9):922–933.

Yang, Y., Sun, Y., Tang, J., Ma, B., and Li, J. (2015). Entity matching across heterogeneous sources. In Proceedings of the 21th ACM SIGKDD, pages 1395–1404. ACM.
