Database Architectures: Current State and Development
Jaroslav Pokorný¹ and Karel Richta²
¹Department of Software Engineering, Charles University, Prague, Czech Republic
²Department of Computers, Czech Technical University, Prague, Czech Republic
Keywords: Database Architectures, Distributed Databases, Scalable Databases, MapReduce, Big Data Management Systems, NoSQL Databases, NewSQL Databases, SQL-on-Hadoop, Big Data, Big Analytics.
Abstract: The paper briefly presents the history and development of database management tools in the last decade. The movement towards higher database performance and database scalability is discussed in the context of practical requirements. These include Big Data and Big Analytics as driving forces that, together with progress in hardware development, led to new DBMS architectures. We describe and evaluate these architectures mainly in terms of their scalability. We also focus on the usability of these architectures, which depends strongly on the application environment. Finally, we mention more complex software stacks containing tools for the management of real-time analysis and intelligent processes.
1 INTRODUCTION
In 2007 we published a paper (Pokorný, 2007) about database architectures and their relationship to the requirements of practice. In some sense, the development of database architectures has always reflected requirements coming from practice. For example, earlier Database Management Systems (DBMSs) were focused mainly on OLTP applications typical of business environments. However, the requirements have also changed with the progress in hardware development, and they sometimes go hand in hand with theoretical progress. Today, hardware technology makes it possible to scale to higher data volumes and workloads, often very specialized, in a way that was not possible in the mid-2000s. As a consequence, some new database architectures have emerged. Some of them are scalable, with technical parameters appropriate for Big Data storage and processing as well as for cloud-hosted database systems. In other words, they reflect the challenge of providing efficient support for Big Volumes of data in data-intensive high-performance computing (HPC) environments. Moreover, the architectures of traditional OLTP databases have also proven obsolete. Already in 2007, (Stonebraker et al., 2007) described an in-memory relational DBMS (RDBMS) outperforming traditional DBMSs by nearly two orders of magnitude. We could also observe that high-performance application requirements have shifted from transaction processing and data warehousing to requirements posed by Web and business intelligence applications.
This paper aims to discuss the basic characteristics and recent advancements of these technologies, illustrate the strengths and weaknesses of each of them, and present some opportunities for future work. In particular, we describe the movement in the development of database architectures towards scalable architectures. Obviously, this movement is related to the advent of the Big Data era, where data volume, velocity, and variety significantly influence data processing. Although Big Data can be viewed from various perspectives and in various dimensions, e.g., economic, legal, organizational, and technological ones, we address only the last one, i.e., Big Data storage and processing. Big Data processing falls into two categories: Big Analytics, and online read-write access to large volumes of rather simple data. This leads to the development of different types of tools and associated architectures.
In Section 2, we describe the history of DBMS architectures, building on the work (Pokorný, 2007). Section 3 is devoted to scalable architectures, namely NoSQL databases, Hadoop and MapReduce, Big Data Management Systems, NewSQL DBMSs, NoSQL databases with ACID transactions, and SQL-on-Hadoop systems. Section 4 summarizes these approaches and concludes the paper.
2 DBMS ARCHITECTURES – A HISTORY
In this section we review two main phases of database evolution which influenced the development of database software. We start with the universal architecture consisting of a hierarchy of layers, where layer k is defined in terms of layer k-1. The proposed layers are not fine-grained enough to map them exactly to DBMS components, but they became the basis for the implementation of all traditional DBMSs in the past. We also describe the development of special-purpose database servers, which were a response to some inadequate properties of the universal architecture. Here we will often use the more common short term database instead of DBMS.
2.1 Universal Architecture
The concept of a multi-layered architecture consisting of five layers L1, ..., L5 was proposed in (Härder and Reuter, 1983). There, L1 ensures file management and L5 enables access to data with a high-level language, typically SQL. The middle layers use auxiliary data structures for mapping higher-level objects to simpler ones in a hierarchical way. In other words, they transform logical data (tables, rows) and SQL operations into physical records and sequences of simple access operations over them. A typical property of the universal architecture is that users can see only the outermost layer. In practice, the number of layers is often reduced, e.g. to three, due to techniques enabling more effective performance (Härder, 2005a).
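To make the layering concrete, the following minimal Python sketch (our own illustration, not the code of any particular DBMS) reduces the hierarchy to three layers: each layer is implemented purely in terms of the interface of the layer below it, and only the topmost, set-oriented interface is visible to users.

```python
# A minimal sketch (illustrative only) of a reduced three-layer architecture:
# each layer is defined solely in terms of the layer below it.

class FileLayer:                          # ~L1: raw page storage
    def __init__(self):
        self.pages = {}                   # page id -> list of stored records

    def read(self, page_id):
        return self.pages.get(page_id, [])

    def write(self, page_id, payload):
        self.pages[page_id] = payload

class RecordLayer:                        # ~middle layers: map rows onto pages
    def __init__(self, files):
        self.files = files

    def insert(self, table, row):
        self.files.write(table, self.files.read(table) + [row])

    def scan(self, table):
        return iter(self.files.read(table))

class QueryLayer:                         # ~L5: set-oriented, SQL-like access
    def __init__(self, records):
        self.records = records

    def select(self, table, predicate):
        return [r for r in self.records.scan(table) if predicate(r)]

db = QueryLayer(RecordLayer(FileLayer()))
db.records.insert("emp", {"name": "Novak", "dept": "KSI"})
print(db.select("emp", lambda r: r["dept"] == "KSI"))
```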
The same model can be used in the case of distributed databases for every node of a network, together with a connection layer responsible for communication, adaptation, or mediation services. A typical shared-nothing parallel RDBMS can also be described by this architecture; layers L1-L4 are usually present on each machine in a cluster.
The extension of database requirements to data types other than relational, e.g. VITA (video, image, text, and audio), led to solutions in which at most the common layer L1 is shared; the other layers had to be implemented for each type separately. Solutions with universal servers or universal DBMSs, so popular in the 90s, were based on adding loosely coupled additional modules (components) for each new data type. Their data models had some object-orientation features, which were not yet standardized. Vendors of leading DBMSs called these components extenders, data blades, and cartridges, respectively. Recall that, e.g., spatial and text components had their background in the software of specialized vendors, built on a file system equipped with sophisticated functionality for processing spatial objects and texts, respectively. The integration of such components with relational engines presented a problem, and the effectiveness of the resulting system was never fully resolved. Any efficient solution of optimization problems usually resulted in a modification of the DBMS kernel, which was very expensive, time-consuming, and error-prone.
Development of the L5 layer during the 90s culminated in the specification of the so-called object-relational (OR) data model. Its standardization in SQL started with the version SQL:1999. Tables can have structured rows; their columns can even be of user-defined types or of new built-in data types. Relations can be sets of objects (rows) linked via their IDs. For new built-in data types there is a standardized set of predicates and functions for manipulating their instances. Although the ORDB technology is already available in all the major RDBMS products, its industrial adoption rate is not very high due to its complexity.
2.2 Special Purpose Database Servers
In 2005, (Stonebraker and Cetintemel, 2005) and other scientists claimed that the "one size fits all" model of DBMS had ended, and raised the need to develop new DBMS architectures resembling separate database servers tailored to the requirements of particular application types. Not only functionality was considered, but also the way data is stored. Candidates for special-purpose database servers have been found in the following application areas:
- data warehousing and OLAP,
- XML processing,
- data stream processing,
- text retrieval,
- processing scientific data.
We also consider mobile and embedded DBMSs. These application areas produce large and complex datasets that require more advanced database support than that offered by a universal DBMS.
2.2.1 Data Warehousing and OLAP
Data warehouses are without doubt among the oldest special-purpose databases. Based principally on RDBMS techniques, a data warehouse architecture has to include explicit specialized support for a specialized logical model (denormalized, multidimensional), for historical, summarized, and consolidated data, for operations for ad hoc analytical queries, and for efficient access and implementation methods for such operations. Optimizations using column stores, pipelined operations, and vectorised operations are used here to take advantage of commodity server hardware. For example, column stores use "once-a-column" style processing, which is aimed at better I/O and cache efficiency. Sybase IQ (today in Version 16) belongs to the pioneering implementations of column stores. Past experience with column stores emphasizes their high performance and scalability for complex queries against large data sets. That is why variants of column stores have become a popular platform for managing and analysing Big Data using SQL-based analytic methods.
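The following Python sketch (a simplified illustration, not the implementation of any of the products above) contrasts the two layouts: with one contiguous array per attribute, an aggregate query touches only the columns it references.

```python
# A minimal sketch of "once-a-column" processing: an aggregate reads only
# the attributes it needs, unlike a row store that reads whole rows.

rows = [  # row-store layout: one record per sale
    {"region": "EU", "amount": 120.0},
    {"region": "US", "amount": 80.0},
    {"region": "EU", "amount": 40.0},
]

# Column-store layout: one contiguous array per attribute.
columns = {
    "region": [r["region"] for r in rows],
    "amount": [r["amount"] for r in rows],
}

# SELECT SUM(amount) WHERE region = 'EU' scans just two arrays, which is
# cache-friendly and avoids I/O on all unused attributes.
total = sum(amount for region, amount in
            zip(columns["region"], columns["amount"]) if region == "EU")
print(total)  # 160.0
```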
Oracle offers a remarkable solution for data warehouses, the so-called Oracle Exadata Database Machine (https://www.oracle.com/engineered-systems/exadata/index.html, 2015). It is a complete, optimized hardware and software solution that delivers extreme performance and database consolidation for data warehousing and reporting systems. It uses a very efficient hybrid columnar compression.
Today, the column-oriented HP Vertica Analytic Database with massively parallel processing (MPP) and a shared-nothing architecture is among the most popular (Lamb et al., 2012). This tool is cluster-based and integrated with Hadoop. Its SQL dialect has many built-in analytic capabilities.
An example of progress in OLAP database technology is the standalone OLAP server Essbase (Anantapantula and Gomez, 2009), which stores data in array-like structures, where the dimensions of the array represent columns of the underlying tables and the values of the cells represent precomputed aggregates over the data.
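The idea can be sketched as follows (a toy illustration of array-style OLAP storage, with invented dimensions and figures): dimension members index the cells, and the cells hold precomputed aggregates, so an analytical query becomes a direct lookup instead of a scan.

```python
# A minimal sketch of array-like OLAP storage: cells indexed by dimension
# members hold precomputed aggregates over the fact data.

from collections import defaultdict

facts = [("2014", "EU", 100), ("2014", "US", 70), ("2015", "EU", 130)]

cube = defaultdict(float)            # (year, region) -> precomputed SUM
for year, region, amount in facts:
    cube[(year, region)] += amount

# An OLAP query is now a cell lookup rather than a scan of the fact table.
print(cube[("2014", "EU")])          # 100.0
```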
2.2.2 XML Processing
In the late 90s, XML started to become more popular for representing semi-structured data. New XML query languages and XML DBMSs appeared. A significant factor in storing XML data in a database is whether the data is data-centric or document-centric.
One possibility for storing XML data is an XML-enabled database, which means mapping (shredding) the XML documents into the data structures of an existing RDBMS. Experiments show that such a database is most feasible if only simple XPath operations are used, or if the applications are designed to work directly against the underlying relational schema. Data-centric documents using XML as a data transport mechanism are suitable for this approach.
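Shredding can be sketched as follows (the document, element, and table names are purely illustrative): a data-centric document is mapped to rows of a relational table, after which simple XPath-like lookups become plain SQL.

```python
# A minimal sketch of "shredding" a data-centric XML document into
# relational rows (SQLite stands in for the RDBMS).

import sqlite3
import xml.etree.ElementTree as ET

doc = """<orders>
  <order id="1"><customer>Novak</customer><total>99.5</total></order>
  <order id="2"><customer>Svoboda</customer><total>15.0</total></order>
</orders>"""

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, customer TEXT, total REAL)")

for order in ET.fromstring(doc).findall("order"):
    conn.execute("INSERT INTO orders VALUES (?, ?, ?)",
                 (int(order.get("id")),
                  order.findtext("customer"),
                  float(order.findtext("total"))))

# A simple XPath-like lookup now maps to plain SQL over the shredded schema.
print(conn.execute("SELECT customer FROM orders WHERE total > 50").fetchall())
```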
A more advanced solution was to develop a DBMS with native XML storage (a native XML database, or NXD). These databases are suitable for document-centric XML data. An implementation of NXD was a challenge in the 2000s both for developers and for researchers of DBMSs. In database architectures, NXDs provide a nice example of when a DBMS needs a separate engine. We can distinguish three main approaches to NXD implementations:
- an NXD DBMS as a separate engine (Tamino XML Server, X-Hive/DB, Xindice, eXist, etc.),
- adding native XML storage to an RDBMS (e.g., XML Data Synthesis by Oracle),
- hybrid solutions, i.e. an RDBMS that natively stores and natively processes XML data (e.g., DB2 9 pureXML, Oracle 11g with more storage models for XML data, SQL Server 2012). These approaches also include the possibility to parse and shred the XML documents into an XML-data-driven relational schema.
An advantage of the latter is the possibility to mix XML with relational data and/or with other types of data (e.g. textual, RDF) within one DBMS. XML data can be accessed via the combined use of SQL/XML and the full XQuery language. While critical data remains in a relational format, the data that does not fit the relational data model is stored natively in XML. Härder shows in (Härder, 2005b) how the layered architecture described in Section 2.1 can be used to implement an NXD DBMS.
Unfortunately, it seems that in recent years research in the XML database area has not been too intensive and that XML databases have become rather a niche topic. The more lightweight, bandwidth-non-intensive JSON (JavaScript Object Notation) is now emerging as a preferred format in web-centric, so-called NoSQL databases (Section 3.1). Even PostgreSQL has a JSON data type from version 9.2.
2.2.3 Data Stream Processing
Data streams occur in many modern applications, e.g., network traffic analysis, collecting and analysing records of transactions, sensor networks, applications exploiting RFID tags, telephone calls, health care applications, financial applications, Web logs, click-streams, etc. Data arrives as continuous streams, indexed in the time dimension, to be filtered for individual (possibly mobile) users. Applications require near real-time querying and analyses. These requirements justify the existence of special DBMSs, because relational ones are not effective in this area.
Special-purpose Data Stream Management Systems (DSMSs) have been implemented to deal with these issues. These systems are not based primarily on loading data into a database, except transiently for the duration of certain operations. For example, data streams containing RFID tags generated from events obtained by RFID readers are filtered, aggregated, transformed, and harmonized so that the events associated with them can be monitored in real time. Event processing is data-centric; it requires in-memory processing. Associated query languages are based on SQL, e.g., StreamSQL (www.streambase.com/developers/docs/latest/streamsql/, 2004). Queries executed continuously over the data passed to the system are called continuous queries. Typical of DSMSs are window operators that select only a part of the stream according to fixed parameters, such as the size and bounds of the window. For example, in sliding windows, both bounds move.
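A continuous query with a sliding window can be sketched in Python as follows (our simplified illustration of the operator, not StreamSQL): the window retains only the events of the last width time units, and the aggregate is re-evaluated as both bounds move with each arriving event.

```python
# A minimal sketch of a continuous query: a time-based sliding window whose
# aggregate (here an average) is re-emitted for every arriving event.

from collections import deque

def sliding_avg(stream, width):
    window = deque()                      # (timestamp, value) pairs
    for ts, value in stream:
        window.append((ts, value))
        while window and window[0][0] <= ts - width:
            window.popleft()              # expire events that left the window
        yield ts, sum(v for _, v in window) / len(window)

events = [(1, 10.0), (2, 12.0), (5, 30.0), (6, 20.0)]
for ts, avg in sliding_avg(events, width=3):
    print(ts, avg)
```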
The pioneering STREAM project (http://ilpubs.stanford.edu:8090/641/, 2004) started in 2002 and officially wound down in 2006. A number of other DSMSs have appeared since then; for example, Odysseus (http://odysseus.informatik.uni-oldenburg.de/, 2007) belongs to this category. Many big vendors, such as Microsoft, IBM, or Oracle, have their own data stream management solution. However, to this day there is neither a common standard for data stream query languages nor an agreement on a common set of operators and their semantics.
Although these systems have proven to be an optimal solution for on-the-fly analysis of data streams, they cannot perform complex reasoning tasks.
2.2.4 Text Retrieval
Early full-text search engines did not have many database features. They provided typical operations like index creation, full-text search, and index update. Again, some hybrid solutions are preferred today. Some projects use a current DBMS as a backend to existing full-text search engines; for example, the index file is stored in a relational database.
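This hybrid idea can be sketched as follows (illustrative only): a Lucene-style inverted index whose postings are kept in a relational table, so that a full-text lookup becomes an ordinary query over that relation.

```python
# A minimal sketch of an inverted index stored in a relational backend:
# the postings relation maps each term to the documents containing it.

import sqlite3

docs = {1: "databases scale out", 2: "scale up with memory"}

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE postings (term TEXT, doc_id INTEGER)")
for doc_id, text in docs.items():
    for term in set(text.split()):       # a real engine also stems, ranks, ...
        conn.execute("INSERT INTO postings VALUES (?, ?)", (term, doc_id))

# A full-text lookup becomes a selection (and, for phrases, joins) over it.
print(conn.execute("SELECT doc_id FROM postings WHERE term = ?",
                   ("scale",)).fetchall())   # both documents match
```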
A significant representative of this development is the search engine Apache Lucene (now in Version 4.7.1) (http://lucene.apache.org/, 2011). The associated open-source enterprise search platform Apache Solr is now integrated with Lucene (Lucene/Solr). Its major features include full-text search, hit highlighting, faceted search, dynamic clustering, database integration, and rich document (e.g., Word, PDF) handling.
Based on the SQL/MM standard from 2003, we can find text retrieval features in SQL (similarly to spatial objects and still images), e.g., in the RDBMS PostgreSQL. But performance tests (Lütolf, 2012) have shown that full-text search queries with a large result set are fast with Apache Lucene/Solr but very slow with PostgreSQL.
2.2.5 Processing Scientific Data
A special category of database servers is used for scientific data storage and processing. Biomedical sciences, astronomy, etc., use huge data repositories, often organized in grid or cloud architectures. However, the use of commercial cloud services raises issues of governance, cost-effectiveness, trust, and quality of service. Consequently, some frameworks use a hybrid cloud, combining internal institutional storage, cloud storage, and cloud-based preservation services into a single integrated repository infrastructure. It seems that universal and hybrid servers are effective especially when the data requirements can be simply decomposed into relatively independent parts evaluated separately in the database kernel and in the module built for the given special data type.
A notable representative of this data processing category is the DBMS SciDB, proposed in 2009. Now, SciDB is a DBMS optimized for multidimensional data management of Big Data and for so-called Big Analytics (Stonebraker et al., 2013). The data structures of SciDB include arrays and vectors as first-class objects with built-in optimized operations. SciDB is also usable for geospatial, financial, and industrial applications.
2.2.6 Mobile and Embedded DBMSs
Embedded DBMSs are a special case of embedded applications. Typically, they are single-application DBMSs that are not shared with other users, their management is automatic, and their functions are substantially limited. Their self-management includes at least backups, error recovery, and reorganization of tables and indices. Such applications often run on special devices, mostly mobile, where the client and server have wireless connections. We then talk about mobile and embedded DBMSs.
A sufficient motivation for the movement to this new kind of data management are applications such as healthcare, insurance, or field services, which use mobile devices. In this case, data resides in backend databases and/or generally somewhere on the Web, or is accessible, e.g., in the form of cloud computing. In many applications, databases are needed in a very restricted version directly on mobile devices, e.g., on sensors. An important property of these architectures is synchronization with the backend data source. Most current mobile DBMSs provide only limited prepackaged SQL functions for the mobile application.
Typical commercial solutions for mobile and embedded DBMSs include, e.g., Sybase's UltraLite, IBM's Mobile Database, SQL Server Express, MS Pocket Access, SQL Server Compact, and Oracle Lite.
Now mobile and embedded DBMSs typically:
- use a component approach, i.e. the possibility to configure the database functionality according to application requirements and thereby minimize the size of the application software;
- are often in-memory databases, even without any persistence requirements, which means they use special query techniques and indexing methods.
3 SCALABLE DATABASES
The authors of (Abiteboul et al., 2005) emphasized two main driving forces in the database area: the Internet, and particular sciences such as physics, biology, medicine, and engineering. Said in today's words, this means a shift towards Big Data. The former led to the development of Web databases, the latter to scientific databases that today require not only relevant data management (see Section 2.2.5) but also tools for advanced analytics. Similar requirements occur in today's Big Data and Big Analytics trends. In the course of the last eight years, the classical DBMS architecture has been challenged by a variety of these requirements and changes, i.e. data volume and heterogeneity, scalability, and functional extensions in data processing.
It is typical that traditional distributed DBMSs are not appropriate for these purposes. There are many reasons for this, e.g.:
- database administration may be complex (e.g. design, recovery),
- distributed schema management,
- distributed query management,
- synchronous distributed concurrency control (the 2PC protocol) decreases update performance.
DBMS architectures suitable for Big Data are mostly built on new hardware. A rather traditional solution is based on a single server with a very large memory and a multi-core multiprocessor. A popular infrastructure for Big Data is an HPC cluster (a.k.a. supercomputer). Some RDBMS installations use SSD data storage, which is 100 times faster in random data access than the best disks; such installations are able to process petabytes of data. An HPC cluster, as well as a grid, is appropriate especially for Big Analytics in the context of scientific data. We can scale such databases vertically, i.e. scale up, by adding new resources to a single server in a system. This approach to data scalability is appropriate for corporate cloud computing (Zhao et al., 2014).
Architectures suitable rather for customer cloud computing scale DBMSs across multiple machines. This technique, scale-out, uses the well-known mechanism called database sharding, which breaks a database into multiple shared-nothing groups of rows (in the case of tabular data) and spreads them across a number of distributed servers, as the sketch below illustrates. Shards are not necessarily disjoint. Each server acts as the single source of a data subset. Sharding is just another name for horizontal data partitioning. However, database sharding reminiscent of classical distributed databases cannot provide high scalability at large scale, due to the inherent complexity of the interface and of the ACID guarantee mechanisms. This influenced the development of so-called NoSQL databases (Cattell, 2010).
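Hash-based sharding can be sketched as follows (our illustration; real systems add replication, routing metadata, and shard rebalancing): each shard is a shared-nothing key-value store, and a router hashes each key to locate the single owning shard.

```python
# A minimal sketch of hash-based sharding: get/put are routed to the one
# shard that owns the key, so shards share neither data nor state.

import hashlib

class Shard:
    def __init__(self):
        self.data = {}                    # this shard's partition of the keys

class ShardedStore:
    def __init__(self, n_shards):
        self.shards = [Shard() for _ in range(n_shards)]

    def _owner(self, key):
        digest = hashlib.md5(key.encode()).hexdigest()
        return self.shards[int(digest, 16) % len(self.shards)]

    def put(self, key, value):
        self._owner(key).data[key] = value

    def get(self, key):
        return self._owner(key).data.get(key)

store = ShardedStore(n_shards=4)
store.put("user:42", {"name": "Richta"})
print(store.get("user:42"))
```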
3.1 NoSQL Databases
A new generation of distributed database products labelled NoSQL has emerged since 2004. Admittedly, the term NoSQL is misleading: though not based on the relational data model, some of these products offer a subset of SQL data access capabilities. To be able to scale out, NoSQL architectures differ from RDBMSs in many key design aspects (Pokorný, 2013):
- a simplified data model,
- database design that is rather query-driven,
- integrity constraints are not supported,
- there is no standard query language,
- unneeded complexity is reduced (a simple API, weakened ACID semantics, simple get, put, and delete operations).
Most NoSQL databases have been designed to query high data volumes and provide little or no support for traditional OLTP based on ACID properties. Indeed, the CAP theorem (Brewer, 2000) has shown that a distributed database system can choose at most two out of three properties: Consistency, Availability, and tolerance to Partitions. Preferring P in our unreliable Internet environment, NoSQL databases then support either A or C; for example, the Cassandra and HBase databases have chosen AP and CP, respectively. Cassandra also uses tuneable consistency, enabling different degrees of balance between C, A, and P.
In practice, strict consistency is mostly relaxed to so-called eventual consistency, which emphasizes that after a certain time period the data store comes to a consistent state.
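Tuneable consistency can be sketched with read and write quorums (a simplified single-process illustration of the idea, not Cassandra's implementation): with N replicas, a write waits for W acknowledgements and a read consults R replicas; choosing R + W > N yields strong consistency, while smaller values trade consistency for availability and latency.

```python
# A minimal sketch of quorum-based tuneable consistency over N replicas.

class Replica:
    def __init__(self):
        self.store = {}                      # key -> (version, value)

def write(replicas, key, value, version, w):
    acks = 0
    for rep in replicas:                     # real systems contact replicas
        rep.store[key] = (version, value)    # in parallel over the network
        acks += 1
        if acks >= w:                        # succeed after W acks; the rest
            break                            # would complete asynchronously
    return acks >= w

def read(replicas, key, r):
    answers = [rep.store.get(key, (0, None)) for rep in replicas[:r]]
    return max(answers)[1]                   # newest of the R versions wins

replicas = [Replica() for _ in range(3)]     # N = 3
write(replicas, "k", "v1", version=1, w=2)   # W = 2
print(read(replicas, "k", r=2))              # R = 2; R + W > N, so "v1" is seen
```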
There are many NoSQL databases. Some well-known lists of them, e.g. (http://nosql-database.org/, 2015), also include XML, object-oriented, and graph databases and others in this category. Today, the NoSQL category is represented mainly by key-value stores and their somewhat more complex column-oriented and document-oriented variants. Recall that the column-oriented data model uses the column as a basic term and is implemented similarly to relational column stores (see Section 2.2.1). Management of graphs is possible with graph databases (see, e.g., Neo4j (http://neo4j.com/, 2015)).
Every NoSQL database has some special features and functionality which make the decision to use it in an application different. The diversity of query tools in these databases is also very high, and it seems that it will be very difficult to develop a single standard for all categories (Bach and Werner, 2014). A system administrator must thus carefully consider which type of database best suits the user's needs before committing to one implementation or another. Today, NoSQL databases evolve and take on more traditional DBMS-like features. However, their ad-hoc designs prevent their wider adaptability and extensibility.
A special problem is the quality and usability of these engines. Partially interesting information can be obtained from the DB-Engines Ranking (http://db-engines.com/en/ranking, 2015), where the score of a DBMS product expresses its popularity. Considering NoSQL, the document store MongoDB, the column-oriented Cassandra, and the key-value store Redis appear among the top ten rated database engines in April 2015. Concerning the application level, NoSQL systems are recommended for newly developed applications, particularly for storing and processing Big Data, but not for migrating existing applications written on top of traditional RDBMSs.
An important point is that the term NoSQL has come to categorize a set of databases that are more different than they are alike. Today, scaling is certainly the first requirement, but modern databases must meet more; at the least, they must:
- adapt to change, e.g. mine new data sources and data types without restructuring the database,
- be able to offer tools for formulating rich queries, creating indexes, and searching over multi-structured and quickly changing data.
Unfortunately, only some NoSQL databases meet all three requirements.
3.2 Hadoop and MapReduce
Many NoSQL databases are based on the Hadoop Distributed File System (HDFS) (http://hadoop.apache.org/docs/r0.18.0/hdfs_design.pdf, 2007), which is a part of the so-called Hadoop software stack. This stack enables data to be accessed by three different sets of tools in particular layers, which distinguishes it from the universal DBMS architecture with only an SQL API in the outermost layer. The NoSQL HBase is available as a key-value layer with Get/Put operations as input. The Hadoop MapReduce (M/R) system server in the middle layer enables the creation of M/R jobs. Finally, the high-level languages HiveQL, Pig Latin, and Jaql are at the disposal of some users at the outermost layer. HiveQL is an SQL-like language; Jaql is a declarative scripting language for analysing large semi-structured datasets. Pig Latin is not declarative; its programs are series of assignments similar to an execution plan for relational operations in an RDBMS.
On the analytics side, M/R emerged as the platform for all analytics needs of the enterprise, i.e. as an effective tool to pre-process unstructured and semi-structured data sources such as images, text, raw logs, XML/JSON objects, etc. Special attention belongs to the above-mentioned HiveQL. It is originally a part of the infrastructure (a data warehousing application) Hive (https://hive.apache.org, 2011), which is the first SQL-on-Hadoop solution (see Section 3.6), providing an SQL-like interface over the underlying M/R. Hive converts a HiveQL query into a sequence of M/R jobs.
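The model underlying such a translation can be sketched in a few lines of Python (a toy illustration of the M/R programming model, not Hadoop code): map emits intermediate key-value pairs, the framework sorts and groups them by key (the shuffle), and reduce aggregates each group. The word count below corresponds to a HiveQL query of the form SELECT word, COUNT(*) ... GROUP BY word.

```python
# A minimal sketch of the MapReduce model: map -> shuffle -> reduce.

from itertools import groupby

def map_phase(records):
    for line in records:
        for word in line.split():
            yield word, 1                 # emit an intermediate (key, value)

def shuffle(pairs):
    return groupby(sorted(pairs), key=lambda kv: kv[0])  # group by key

def reduce_phase(grouped):
    for key, group in grouped:
        yield key, sum(value for _, value in group)      # aggregate the group

lines = ["big data", "big analytics"]
print(dict(reduce_phase(shuffle(map_phase(lines)))))
# {'analytics': 1, 'big': 2, 'data': 1}
```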
Not everything about M/R is perfect. For example, its implementation is sub-optimal; it uses brute force instead of indexing. Despite the critique of many aspects of M/R, there are many approaches to its improvement (Doulkeridis and Nørvåg, 2014). For example, it is often emphasized that one main limitation of the M/R framework is that it does not support joining multiple datasets in one task. However, this can still be achieved with additional M/R steps (Zhao et al., 2014), as sketched below. An approach extending M/R to support real-time analysis was introduced at Facebook (Borthakur et al., 2011).
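A reduce-side join, the extra M/R step alluded to above, can be sketched as follows (a toy illustration; the datasets are invented): mappers tag every record with its source relation and emit the join key, and the reducer pairs the records of the two sources that share a key.

```python
# A minimal sketch of a reduce-side join of two datasets on a common key.

from collections import defaultdict

users  = [(1, "Pokorny"), (2, "Richta")]
orders = [(1, "book"), (1, "pen"), (2, "laptop")]

def map_join(users, orders):
    for uid, name in users:
        yield uid, ("U", name)            # tag with the originating dataset
    for uid, item in orders:
        yield uid, ("O", item)

def reduce_join(pairs):
    groups = defaultdict(list)
    for key, tagged in pairs:             # stands in for the shuffle step
        groups[key].append(tagged)
    for key, tagged in groups.items():
        names = [v for tag, v in tagged if tag == "U"]
        items = [v for tag, v in tagged if tag == "O"]
        for name in names:                # cross product within the key group
            for item in items:
                yield key, name, item

print(list(reduce_join(map_join(users, orders))))
```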
On the other hand, rapid implementation of the majority of data discovery and data science attempts requires strong support for SQL with, e.g., the embedded analytical capabilities normally available in an MPP-based analytic database.
Another solution is offered by so-called Hadoop-relational hybrids. For example, the above-mentioned column-oriented HP Vertica Analytic Database is cluster-based and integrated with Hadoop, and its SQL dialect has many built-in analytic capabilities.
3.3 Big Data Management Systems
The Hadoop software stack dominates in industry today, but it is not very practical for Big Analytics. To address the challenges of Big Analytics, a new generation of scalable data management technologies has emerged in recent years. A Big Data Management System (BDMS) is a highly scalable platform which supports the requirements of Big Data processing. A BDMS is now considered a new component of the more general information architecture of an enterprise into which Big Data solutions should be integrated.
Example: Considering the Hadoop software stack as a first-generation BDMS architecture, a representative of a second-generation BDMS is the ASTERIX system (Vinayak et al., 2012). It is fully parallel, able to store, access, index, query, analyse, and publish very large quantities of semi-structured data (represented in the JSON format). The ASTERIX architecture (see Table 1) is also a software stack, with more layers for data access and with some new platforms. For example, Pregel (Malewicz et al., 2010) is a system for large-scale graph processing on distributed clusters of commodity machines; a sketch of its vertex-centric model follows below. The Algebricks algebra layer is independent of a data model and is therefore able to support high-level data languages like HiveQL and Piglet (a subset of Pig Latin in the Hadoop software stack) and others. The partitioned parallel platform Hyracks for data-intensive computing allows users to express a computation as a directed acyclic graph of data operators and connectors. IMRU provides a general framework for parallelizing a large class of machine learning algorithms, including batch learning, based on Hyracks.
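The flavour of the vertex-centric model can be given by a small sketch (a toy illustration of the programming model, not Pregel or Pregelix code): in each superstep every vertex processes its incoming messages, updates its value, and messages its neighbours, until no messages remain. The example propagates the maximum vertex value through a graph.

```python
# A minimal sketch of Pregel-style supersteps: message passing between
# vertices until the maximum value has reached every vertex.

edges = {1: [2], 2: [1, 3], 3: [2]}      # adjacency lists
value = {1: 3, 2: 6, 3: 2}               # initial vertex values

inbox = {v: [] for v in edges}           # superstep 0: everyone broadcasts
for v in edges:
    for n in edges[v]:
        inbox[n].append(value[v])

changed = True
while changed:                           # one loop iteration = one superstep
    changed = False
    outbox = {v: [] for v in edges}
    for v, msgs in inbox.items():
        best = max(msgs, default=value[v])
        if best > value[v]:              # adopt a larger value and spread it
            value[v] = best
            changed = True
            for n in edges[v]:
                outbox[n].append(best)
    inbox = outbox

print(value)                             # {1: 6, 2: 6, 3: 6}
```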
Microsoft also developed a Big Data software stack targeted at Big Analytics (Chaiken et al., 2008). Data is modelled by relations with a schema. A declarative and extensible scripting language called SCOPE (Structured Computations Optimized for Parallel Execution) is a clone of SQL. The lower layers of the stack contain a distributed platform, Cosmos, designed to run on large clusters consisting of thousands of commodity servers.
There are other software stacks addressing Big Data challenges, e.g. the Berkeley Data Analytics Stack, the Stratosphere stack, etc.
Finally, note that managing and analysing non-relational Big Data does not always require NoSQL, a BDMS, or even any DBMS at all. Non-DBMS data platforms such as the low-level HDFS can be sufficient, e.g. for batch analysis.
3.4 NewSQL Databases
NewSQL is a subcategory of RDBMSs preserving the SQL language and ACID properties. These tools achieve high performance and scalability through architectural redesigns that take better advantage of modern hardware platforms, such as shared-nothing clusters of many-core machines with large or non-volatile in-memory storage. Their architectures provide much higher per-node performance than traditional RDBMSs.
Obviously, the NewSQL architectures can differ significantly. Some of them are really new, e.g., VoltDB (http://voltdb.com/, 2015), Clustrix (www.clustrix.com/, 2015), NuoDB (www.nuodb.com/, 2015), and Spanner (Corbett et al., 2012); some are improved versions of MySQL. These DBMSs try to be usable for applications already written for an earlier generation of RDBMSs. However, Spanner uses a slightly different relational data model, enabling the creation of hierarchies of tables. Often, the SQL features are not so strict; e.g., SQL in VoltDB does not use NOT in the WHERE clause. The horizontal scalability of NewSQL databases is high; they are appropriate for data volumes up to the petabyte level. An excellent review of the four NewSQL systems mentioned above and of a number of NoSQL DBMSs is offered by the paper (Grolinger et al., 2013).
High performance is often achieved by an in-memory processing approach. Thanks to considerable technological advances during the last 30 years, such DBMSs have finally become available as commercial products. For example, the above-mentioned VoltDB is an in-memory, parallel DBMS optimized for OLTP applications. It is a highly distributed relational database that runs on a cluster of shared-nothing, in-memory executor nodes.
Table 1: The ASTERIX software stack.

Level of abstraction           Data processing
L5    non-procedural access    Asterix QL (ASTERIX DBMS); HiveQL, Piglet, and other HLL compilers; M/R jobs; Pregel jobs
L2-L4 algebraic approach       Algebricks algebra layer; Hadoop M/R compatibility; Pregelix; IMRU; Hyracks jobs
L1    file management          Hyracks data-parallel platform

The fundamental problem with in-memory DBMSs, however, is that their improved performance is only achievable when the database is smaller than the amount of physical memory available in the system. To overcome the restriction that all data must fit in main memory, a new technique called anti-caching (DeBrabant et al., 2013) was proposed for H-Store (the academic version of VoltDB). An anti-caching DBMS reverses the traditional hierarchy of disk-based systems: all data initially resides in memory, and when memory is exhausted, the least recently accessed records are collected and written to disk.
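The policy can be sketched as follows (our simplified illustration of the anti-caching idea, not H-Store code; a dictionary stands in for the disk-resident store): all tuples start in memory, the coldest are evicted to disk when a memory budget is exceeded, and an access to an evicted tuple faults it back in.

```python
# A minimal sketch of anti-caching: memory is primary, disk holds evictions.

from collections import OrderedDict

class AntiCachingStore:
    def __init__(self, memory_budget):
        self.memory = OrderedDict()       # hot tuples in LRU order
        self.disk = {}                    # evicted ("anti-cached") tuples
        self.budget = memory_budget

    def put(self, key, tuple_):
        self.memory[key] = tuple_
        self.memory.move_to_end(key)      # mark as most recently used
        while len(self.memory) > self.budget:
            cold_key, cold = self.memory.popitem(last=False)
            self.disk[cold_key] = cold    # write the coldest tuple out

    def get(self, key):
        if key in self.disk:              # fault the tuple back into memory
            self.put(key, self.disk.pop(key))
        else:
            self.memory.move_to_end(key)
        return self.memory[key]

db = AntiCachingStore(memory_budget=2)
for i in range(3):
    db.put(i, {"id": i})
print(sorted(db.disk))                    # [0] - least recently used, on disk
print(db.get(0)["id"])                    # 0  - transparently restored
```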
3.5 NoSQL Databases with ACID Transactions
A notable new generation of NoSQL databases enables ACID transactions. Many NoSQL designers are therefore exploring a return to transactions with ACID properties as the preferred means of managing concurrency for a broad range of applications. Using tailored optimizations, designers are finding that implementing ACID transactions need not sacrifice scalability, fault-tolerance, or performance. Such NoSQL databases are also called Enterprise NoSQL. They are database tools that have the ability to handle the volume, variety, and velocity of data like all NoSQL solutions, and have the features necessary to run inside a business environment.
These DBMSs:
- maintain a distributed design, fault tolerance, easy scaling, and a simple, flexible base data model,
- extend the base data models of NoSQL,
- have monitoring and performance tools,
- are CP systems with global transactions.
A good example of such a DBMS is the key-value store FoundationDB (https://foundationdb.com/, 2015), with scalability and fault tolerance (and an SQL layer). Oracle NoSQL Database provides ACID-compliant transactions for full CRUD (create, read, update, delete) operations, with adjustable durability and consistency guarantees. Spanner, mentioned in Section 3.4, is also a NoSQL database which can be considered NewSQL as well.
MarkLogic (Cromwell, 2013) is a document database that can store XML, JSON, text, and large binaries such as PDFs and Microsoft Office documents. MarkLogic runs on HDFS and has full-text search built directly into the database kernel. MarkLogic is a true transactional database. Similarly, the distributed graph database OrientDB also guarantees ACID properties.
3.6 SQL-on-Hadoop Systems
Although the Hadoop software stack now seems sufficiently mature for some applications, we have seen that there are still possibilities to improve and optimize the tools based on it. This concerns especially SQL-on-Hadoop systems, which evolve towards more database-like architectures, mainly towards SQL.
Obviously, Hive, mentioned in Section 3.2, belongs to this category. Technologies such as Hive are designed for batch queries on Hadoop, providing a declarative abstraction layer (HiveQL) which uses the M/R processing framework in the background. Hive is used primarily for queries over very large data sets and for large ETL jobs.
Cloudera Impala (www.cloudera.com/content/cloudera/en/products-and-services/cdh/impala.html, 2015) processes SQL with a specialized (Google-inspired) SQL engine on top of a Hadoop cluster; it is designed as an MPP SQL query engine that runs natively in Hadoop. Impala provides interactive query capabilities that enable traditional business intelligence and analytics on Hadoop-scale datasets.
Splice Machine (Splice Machine, 2015) is a Hadoop RDBMS. It is tightly integrated with Hadoop, using HBase and HDFS as the storage level. Splice Machine supports real-time ACID transactions.
The Vertica database mentioned in Section 3.2 is often categorized as a Hadoop-relational hybrid because it can work together with Hadoop.
DatabaseArchitectures:CurrentStateandDevelopment
159
Table 2: New database architectures in the last 15 years.

Milestone  Category  Subcategory         Representatives
2009       NoSQL     Key-value           Redis
                     Column-oriented     Cassandra
                     Document-oriented   MongoDB
                     Graph databases     Neo4j
2005       BDMS      1st generation      Hadoop software stack
2010                 2nd generation      ASTERIX software stack
2011       NewSQL    General purpose     NuoDB, VoltDB, Clustrix
                     Google's hybrids    Spanner
                     Hadoop-relational   Vertica
                     SQL-on-Hadoop       Hive, Impala, Presto, Splice
                     NoSQL with ACID     FoundationDB, MarkLogic
There is also a technology of "connectors". In this architecture, Hadoop and a DBMS product are connected in a simple way so that data can be passed back and forth between the two systems.
As a representative of this approach we can mention Presto (http://prestodb.io/, 2013), an open-source distributed SQL query engine for running interactive analytic queries. Presto supports HiveQL as well. A user also has Hive/Hadoop, Cassandra, and TPC-H connectors at their disposal that provide data to queries; the last connector dynamically generates data that can be used for experimenting with and testing Presto. Unlike Hive, Presto and Impala follow the distributed query engine design inspired by Google Dremel (Melnik et al., 2010), a query system for the analysis of read-only nested data.
4 CONCLUSIONS
In summary, the current approaches to DBMS architecture turn towards:
- improvements of present database architectures,
- radically new database architecture designs.
The former includes challenges posed by new hardware possibilities and by Big Analytics, i.e. some low-level data processing algorithms have to be changed. This also influences approaches to query optimization. Still, (Zhao et al., 2014) believe that it is unlikely that MapReduce will completely replace DBMSs, even for data warehousing applications.
Both SQL-on-Hadoop systems and NoSQL DBMSs aimed at Big Data management require much more care in the design and tuning of the application environment. It is necessary to select a database product according to the role it is intended to play and the data over which it will work. This causes the higher complexity of the new architectures, particularly of their hybrid variants, which are becoming dominant. We have mentioned some query tools of today's new DBMSs, which are partially SQL-like. Using multiple data manipulation languages in one hybrid platform is certainly possible, but also complicated; an abstraction/virtualization layer would be useful. However, the concept of a unifying language is not yet a reality. The database history of the last 15 years is summarized in Table 2.
We have discussed horizontal layers coming from special software stacks. In a real Big Data environment it is necessary to explicitly consider application layers as well. They are vertical and include at least universal information management, real-time analytics, and intelligent processes, which are so important to most organizations today. Behind them we can find data flows between particular data stores. However, all this complicates the design of enterprise information systems and creates particular challenges for the research and development not only of new database architectures.
ACKNOWLEDGEMENTS
This research has been partially supported by GACR grant No. P103/13/08195S and partially by the Avast Foundation.
REFERENCES
Anantapantula, S., Gomez, J.-S., 2009. Oracle Essbase 9 Implementation Guide, Packt Publishing, Birmingham.
Abiteboul, S., Agrawal, R., Bernstein, Ph., Carey, M., Ceri, S., et al., 2005. The Lowell Database Research Self-Assessment. Comm. of the ACM, 48(5), pp. 111-118.
Bach, M., Werner, A., 2014. Standardization of NoSQL
Database Languages. In BDAS 2014, 10th International
Conference, Ustron, Poland, pp. 50-60.
DATA2015-4thInternationalConferenceonDataManagementTechnologiesandApplications
160
Borthakur, D., Gray, J., Sarma, J. S., Muthukkaruppan, K.,
Spiegelberg, N., et al., 2011. Apache Hadoop goes real-
time at Facebook. In ACM SIGMOD Int. Conf. on
Management of Data, Athens, pp. 1071-1080.
Brewer, E., 2000. Towards robust distributed systems. Invited talk at PODC 2000, Portland, Oregon.
Cattell, R., 2010. Scalable SQL and NoSQL Data Stores. SIGMOD Record, 39(4), pp. 12-27.
Chaiken, R., Jenkins, B., Larson, Per-Åke, Ramsey, B.,
Shakib, D., et al., 2008. SCOPE: Easy and Efficient
Parallel Processing of Massive Data Sets. In VLDB ’08
– Conference proceedings, Auckland.
Corbett, J. C., Dean, J., Epstein, M., Fikes, A., et al., 2012. Spanner: Google's Globally-Distributed Database. In OSDI 2012, 10th Symposium on Operating System Design and Implementation - Conference Proceedings, Hollywood, pp. 1-14.
Cromwell S., 2013. Top 5: MarkLogic topped last week's
Silicon Valley fundings. Silicon Valley Business
Journal, April 15, 2013.
DeBrabant, J., Pavlo, A., Tu, S., Stonebraker, M., and
Zdonik, S., 2013. Anti-Caching: A New Approach to
Database Management System Architecture. In VLDB
Endowment – Conference Proceedings, 6(14), pp.
1942-1953.
Doulkeridis, C. and Nørvåg, K., 2014. A survey of large-scale analytical query processing in MapReduce. The VLDB Journal, 23, pp. 355-380.
Grolinger, K., Higashino, W. A., Tiwari, A., and Capretz, M. A. M., 2013. Data management in cloud environments: NoSQL and NewSQL data stores. Journal of Cloud Computing: Advances, Systems and Applications, 2:22.
Härder, T., Reuter, A., 1983. Concepts for Implementing a Centralized Database Management System. In Int. Computing Symposium on Application Systems Development - Conference Proceedings, Nürnberg, pp. 28-60.
Härder, T., 2005a. DBMS Architecture – the Layer Model and its Evolution. Datenbank-Spektrum, 13, pp. 45-57.
Härder, T., 2005b. XML Databases and Beyond – Plenty of Architectural Challenges Ahead. In Proceedings of ADBIS 2005, LNCS 3631, Springer, pp. 1-16.
Lamb, A., Fuller, M., Varadarajan, R., Tran, N., Vandiver, B., et al., 2012. The Vertica Analytic Database: C-Store 7 Years Later. Proceedings of the VLDB Endowment, 5(12), pp. 1790-1801.
Lütolf, S., 2012. PostgreSQL Full Text Search - An Introduction and a Performance Comparison with Apache Lucene/Solr. Seminar Thesis, Chur. (http://wiki.hsr.ch/Datenbanken/files/Full_Text_Search_in_PostgreSQL_Luetolf_Paper_final.pdf), accessed May 2015.
Malewicz, G., Austern, M. H., Bik, A. J. C., Dehnert, J. C., Horn, I., Leiser, N., and Czajkowski, G., 2010. Pregel: a system for large-scale graph processing. In SIGMOD '10 - Int. Conf. on Management of Data, Indianapolis, pp. 135-146.
Melnik, S., Gubarev, A., Long, J. J., Romer, G., et al., 2010. Dremel: Interactive Analysis of Web-Scale Datasets. In VLDB 2010 - 36th Int. Conf. on Very Large Data Bases, pp. 330-339.
Pokorný, J., 2007. Database Architectures: Current Trends
and Their Relationships to Requirements of Practice. In
Advances in Information Systems Development: New
Methods and Practice for the Networked Society.
Information Systems Development Series, Springer
Science+Business Media, New York, pp. 267–277.
Pokorný, J., 2013. NoSQL databases: a step to database scalability in Web environment. International Journal of Web Information Systems, 9(1), pp. 69-82.
Splice Machine (2015) White Paper: Splice Machine: The
Only Hadoop RDBMS [online]. Available from:
http://www.splicemachine.com/ [Accessed: 15th
January, 2015].
Stonebraker, M., Cetintemel, U., 2005. "One Size Fits All": An Idea Whose Time Has Come and Gone. In ICDE '05 - 21st Int. Conf. on Data Engineering, pp. 2-11.
Stonebraker, M., Madden, S., Abadi, D. J., Harizopoulos,
S., Hachem, N., et al., 2007. The End of an
Architectural Era (It’s Time for a Complete Rewrite).
In VLDB 2007 – Conference Proceedings, ACM, pp.
1150-1160.
Stonebraker, M., Brown, P., Zhang, D., and Becla, J., 2013. SciDB: A Database Management System for Applications with Complex Analytics. Computing in Science and Engineering, 15(3), pp. 54-62.
Vinayak, R., Borkar, V., Carey, M. J., and Li, Ch., 2012. Big data platforms: what's next? ACM XRDS (Crossroads), 19(1), pp. 44-49.
Zhao, L., Sakr, S., Liu, A., and Bouguettaya, A., 2014. Cloud Data Management. Springer.
DatabaseArchitectures:CurrentStateandDevelopment
161