maintains the catalog information by running
probing queries such as
db.collection.count()
to keep database statistics up to date, e.g. the cardinalities of
document collections. Similarly to the Derby
wrapper, it also provides information about available
indexes on document attributes.
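The catalog maintenance described above can be illustrated with a minimal sketch. It is not the wrapper's actual code: the names CollectionStats, FakeCollection and refresh_catalog are hypothetical, and an in-memory class stands in for a real document collection so that the example runs without a database. Its count() method plays the role of the probing query db.collection.count().

```python
from dataclasses import dataclass, field


@dataclass
class CollectionStats:
    """Catalog entry for one document collection (hypothetical structure)."""
    cardinality: int = 0
    indexed_attributes: set = field(default_factory=set)


class FakeCollection:
    """In-memory stand-in for a document collection, used only in this sketch."""

    def __init__(self, docs, indexes):
        self._docs = list(docs)
        self._indexes = set(indexes)

    def count(self):
        # Plays the role of the probing query db.collection.count().
        return len(self._docs)

    def index_attributes(self):
        # Names of the document attributes that have an index.
        return set(self._indexes)


def refresh_catalog(collections):
    """Probe every collection and return freshly collected statistics."""
    catalog = {}
    for name, coll in collections.items():
        catalog[name] = CollectionStats(
            cardinality=coll.count(),
            indexed_attributes=coll.index_attributes(),
        )
    return catalog


collections = {
    "publications": FakeCollection(
        docs=[{"id": 1, "author": "a"}, {"id": 2, "author": "b"}],
        indexes=["id"],
    )
}
catalog = refresh_catalog(collections)
```

A query optimizer could then consult catalog["publications"].cardinality when estimating costs, and the indexed attributes when deciding whether a pushed-down selection can use an index.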
The MFR wrapper implements an MFR planner
that optimizes MFR expressions in accordance with
any pushed-down selections. The wrapper uses
Spark’s Python API, and thus translates each
transformation into a Python lambda function. It
also accepts raw Python lambda functions as
transformation definitions. The wrapper executes the
dynamically built Python code using Python's
dynamic evaluation capabilities, by means of the
eval()
function. Then, it transforms the resulting RDD into
a Spark DataFrame.
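The eval()-based execution of dynamically built transformations can be sketched as follows. This is an illustrative example under stated assumptions, not the wrapper's implementation: build_pipeline and run_pipeline are hypothetical names, a plain Python list stands in for the Spark RDD so that the example runs without a Spark installation, and the final conversion to a DataFrame is omitted.

```python
def build_pipeline(steps):
    """Compile (operation, lambda source) pairs into callables using eval()."""
    compiled = []
    for op, source in steps:
        # eval() turns the dynamically built source text into a function object.
        # Only use this with trusted input: eval() executes arbitrary code.
        fn = eval(source)
        if not callable(fn):
            raise TypeError(f"{source!r} does not evaluate to a callable")
        compiled.append((op, fn))
    return compiled


def run_pipeline(data, compiled):
    """Apply the compiled steps to a list that stands in for an RDD."""
    for op, fn in compiled:
        if op == "map":
            data = [fn(x) for x in data]
        elif op == "filter":
            data = [x for x in data if fn(x)]
        elif op == "flat_map":
            data = [y for x in data for y in fn(x)]
        else:
            raise ValueError(f"unsupported operation: {op}")
    return data


# Transformations given as raw Python lambda source text.
steps = [
    ("flat_map", "lambda line: line.split()"),
    ("filter", "lambda word: len(word) > 3"),
    ("map", "lambda word: (word, 1)"),
]
result = run_pipeline(["big data query", "data store"], build_pipeline(steps))
```

With a real Spark context, each step would instead call the corresponding RDD method (for example map, filter or flatMap) with the evaluated function.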
5 CONCLUSIONS
In this paper, we presented CloudMdsQL, a common
language for querying and integrating data from
heterogeneous cloud data stores, and the
implementation of its query engine. By combining
the expressivity of functional languages and the
manipulability of declarative relational languages, it
stands in “the golden mean” between the two major
categories of query languages with respect to the
problem of unifying a diverse set of data
management systems. CloudMdsQL satisfies all the
legacy requirements for a common query language,
namely: support of nested queries across data stores,
data-metadata transformations, schema
independence, and optimizability. In addition, it
allows embedded invocations of each data store’s
native query interface, in order to exploit the full
power of the data stores’ query mechanisms.
The architecture of the CloudMdsQL query engine is
fully distributed, so that query engine nodes can
communicate directly with each other by
exchanging code (query plans) and data. Thus, the
query engine does not follow the traditional
mediator/wrapper architectural model where
mediator and wrappers are centralized. This
distributed architecture yields important
optimization opportunities, e.g. minimizing data
transfers by moving the smallest intermediate data
for subsequent processing by one particular node.
The wrappers are designed to be transparent, making
the heterogeneity explicit in the query in favor of
preserving the expressivity of local data stores’
query languages. CloudMdsQL sticks to the
relational data model, because of its intuitive data
representation, wide acceptance and ability to
integrate datasets by applying joins, unions and
other relational algebra operations.
The CloudMdsQL query engine has been
validated (Kolev et al., 2015; Bondiombouy et al.,
2015) with four different database management
systems – Sparksee (a graph database with a Python
API), Derby (a relational database accessed through
its JDBC driver), MongoDB (a document database
with a Java API) and Apache Spark (a parallel
framework processing distributed data stored in
HDFS, accessed through the Apache Spark API). The
experiments evaluated the impact of
the applied optimization techniques on the overall
query execution performance (Kolev et al., 2015;
Bondiombouy et al., 2015).
REFERENCES
Armbrust, M., Xin, R., Lian, C., Huai, Y., Liu, D.,
Bradley, J., Meng, X., Kaftan, T., Franklin, M.,
Ghodsi, A., Zaharia, M. 2015. Spark SQL: Relational
Data Processing in Spark. In ACM SIGMOD (2015),
1383-1394.
Bondiombouy, C., Kolev, B., Levchenko, O., Valduriez,
P. 2015. Integrating Big Data and Relational Data
with a Functional SQL-like Query Language. Int.
Conf. on Databases and Expert Systems Applications
(DEXA) (2015), 170-185.
CoherentPaaS, http://coherentpaas.eu (2013).
DeWitt, D., Halverson, A., Nehme, R., Shankar, S.,
Aguilar-Saborit, J., Avanes, A., Flasza, M., Gramling,
J. 2013. Split Query Processing in Polybase. In ACM
SIGMOD (2013), 1255-1266.
Duggan, J., Elmore, A. J., Stonebraker, M., Balazinska,
M., Howe, B., Kepner, J., Madden, S., Maier, D.,
Mattson, T., Zdonik, S. 2015. The BigDAWG
Polystore System. SIGMOD Rec. 44, 2 (August 2015),
11-16.
Kolev, B., Valduriez, P., Bondiombouy, C., Jiménez-Peris,
R., Pau, R., Pereira, J. 2015. CloudMdsQL: Querying
Heterogeneous Cloud Data Stores with a Common
Language. Distributed and Parallel Databases, pp. 1-41,
http://hal-lirmm.ccsd.cnrs.fr/lirmm-01184016.
LeFevre, J., Sankaranarayanan, J., Hacıgümüs, H.,
Tatemura, J., Polyzotis, N., Carey, M. 2014. MISO:
Souping Up Big Data Query Processing with a
Multistore System. In ACM SIGMOD (2014), 1591-1602.
Özsu, T., Valduriez, P. 2011. Principles of Distributed
Database Systems – Third Edition. Springer, 850
pages.