Table 2: Data diversity effects on query rewriting time and dictionary size.
# of schemas   Query rewriting time (s)   Dictionary size
10             0.0005                     40 KB
100            0.0025                     74 KB
1 K            0.139                      2 MB
3 K            0.6                        7.2 MB
5 K            1.52                       12 MB
our dictionary can support up to 5k distinct schemas. The resulting size of the materialized dictionary is very promising, since it does not require significant storage space (12 MB for 5k schemas). Furthermore, we believe that the time spent building the rewritten query (1.52 s for 5k schemas) is also encouraging and represents another advantage of our solution. When rewriting the queries, we try to find distinct navigational paths for eight predicates. With 5k paths for each query predicate, these experiments show that we are able to generate a selection query with 40k navigational paths expressed in disjunctive form.
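
The arithmetic behind the 40k figure can be illustrated with a minimal sketch. It assumes that the rewritten selection is a conjunction of the user's predicates in which each predicate is replaced by a disjunction over its known paths, written here with MongoDB's `$and` and `$or` filter operators. The function names, field names, and path layout are hypothetical and are not taken from EasyQ.

```python
from typing import Any, Dict, List


def rewrite_selection(predicates: Dict[str, Any],
                      dictionary: Dict[str, List[str]]) -> Dict[str, Any]:
    """Expand each (field, value) predicate into a disjunction over all paths
    recorded for that field (assumed rewriting scheme, for illustration)."""
    clauses = []
    for field, value in predicates.items():
        # Fall back to the field name itself if the dictionary has no entry.
        paths = dictionary.get(field, [field])
        clauses.append({"$or": [{path: value} for path in paths]})
    return {"$and": clauses}


def count_paths(query: Dict[str, Any]) -> int:
    """Count the navigational paths appearing in the rewritten query."""
    return sum(len(clause["$or"]) for clause in query["$and"])


# Synthetic dictionary: 8 fields, each reachable through 5,000 distinct paths.
fields = [f"f{i}" for i in range(8)]
dictionary = {f: [f"s{j}.{f}" for j in range(5000)] for f in fields}

# One equality predicate per field, as in the eight-predicate setting.
predicates = {f: 1 for f in fields}

query = rewrite_selection(predicates, dictionary)
total_paths = count_paths(query)
print(total_paths)  # 40000
```

Each of the eight predicates contributes one disjunction of 5,000 paths, which gives the 40,000 paths reported above.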
5 CONCLUSION
In this paper, we provide a novel approach for querying heterogeneous documents describing a given entity over document-oriented data stores. Our objective is to allow users to formulate their queries with minimal knowledge of the data schemas. Our tool, EasyQ, is based on two main principles. The first is a dictionary that contains all possible paths for a given field. The second is a rewriting module that modifies the user query to match all field paths existing in the dictionary. Our approach is a syntactic manipulation of queries. It therefore rests on a strong assumption: the collection describes homogeneous entities, i.e., a field has the same meaning in all document schemas. If this assumption does not hold, users may face irrelevant or incoherent results.
We conducted experiments to compare the execution time of basic MongoDB queries with that of the rewritten queries produced by our approach, varying two primary parameters: the size of the dataset and the structural heterogeneity inside a collection. Results show that executing the rewritten queries costs more than executing the basic user queries. This overhead is due to the combination of multiple access paths to a queried field. Nevertheless, it is negligible when compared to the execution of separate queries for each path. Note that an interesting advantage of EasyQ is that each time a query is evaluated, it is first rewritten according to the dictionary, which is updated online. Therefore, the query always automatically deals with all existing schemas.
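
The interplay between an online dictionary and evaluation-time rewriting can be sketched with a small in-memory toy. It is not EasyQ and does not use MongoDB: the `ToyStore` class, its methods, and the sample documents are hypothetical, and arrays are treated as leaf values for simplicity.

```python
from typing import Any, Dict, Iterator, List, Set, Tuple

_MISSING = object()  # sentinel for "path not present in this document"


def leaf_paths(doc: Dict[str, Any], prefix: str = "") -> Iterator[Tuple[str, str]]:
    """Yield (field name, dotted path) for every leaf field of a nested document."""
    for key, value in doc.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            yield from leaf_paths(value, path)
        else:
            yield key, path


def get_path(doc: Dict[str, Any], path: str) -> Any:
    """Follow a dotted path; return _MISSING when the path is absent."""
    current: Any = doc
    for part in path.split("."):
        if not isinstance(current, dict) or part not in current:
            return _MISSING
        current = current[part]
    return current


class ToyStore:
    """In-memory collection whose dictionary is updated on every insert."""

    def __init__(self) -> None:
        self.docs: List[Dict[str, Any]] = []
        self.dictionary: Dict[str, Set[str]] = {}

    def insert(self, doc: Dict[str, Any]) -> None:
        self.docs.append(doc)
        for field, path in leaf_paths(doc):
            self.dictionary.setdefault(field, set()).add(path)

    def find(self, field: str, value: Any) -> List[Dict[str, Any]]:
        # The predicate is rewritten at evaluation time, so it uses every
        # path known to the dictionary at this moment.
        paths = self.dictionary.get(field, {field})
        return [d for d in self.docs
                if any(get_path(d, p) == value for p in paths)]


store = ToyStore()
store.insert({"title": "Heat", "year": 1995})
before = store.find("year", 1995)

# A document with a new schema arrives; the user query itself is unchanged.
store.insert({"title": "Ronin", "info": {"year": 1995}})
after = store.find("year", 1995)

print(len(before), len(after))  # 1 2
```

The same user predicate on `year` returns the second document as soon as its schema is registered, because the rewriting step reads the dictionary at evaluation time rather than when the query was first written.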
These first results are encouraging and motivate us to continue this line of research, but they need to be strengthened. Short-term perspectives are to continue the evaluations and to identify the limits of the approach regarding the number of paths and fields in the same query and regarding time cost. More experiments remain to be performed on larger real-world datasets. Another perspective is to study in depth the dictionary-building process in real applications, in parallel with collection updates and querying.
Finally, a long-term perspective is to enhance querying over a collection of documents presenting several levels of heterogeneity, i.e., structural as well as syntactic and semantic heterogeneities.