by the size of the database.
The remainder of this work is organized as
follows. We begin to describe in Section 2 the
Cassandra database used in these experiments, and
in Section 3 we explain and characterize the YCSB
benchmark. Section 4 presents our experimental
evaluation and lastly section 5 presents our
conclusions and future work.
2 CASSANDRA
NoSQL databases were created primarily to address
issues with web applications that need to operate
with enormous loads of data as well as being able to
scale without difficulty. Cassandra is a Column
Family NoSQL database that is designed to solve the
challenges associated with massive scalability. It can
support a very high update throughput while
delivering low latency. Cassandra is very similar to
the usual relational model, made of columns and
rows. The main difference is the stored data, which
can be structured, semi-structured or unstructured.
When it comes to storage in clusters, all of the
data is stored in distributed fashion over all nodes of
the cluster. When a node is added or removed, all of
its data is automatically distributed over other
available nodes, and a failing node will be replaced
instantly. Because of this, it is no longer necessary to
calculate and assign data to each node. Cassandra’s
architecture is known to be peer-to-peer (partitions
tasks or workloads between peers equally) and
overcomes master-slave limitations such as high
availability and massive scalability. Data is
replicated over multiple nodes in the cluster. Failed
nodes are detected by gossip protocols (peer-to-peer
communication protocol in which nodes periodically
exchange state information about themselves and
about other nodes they know about) and those nodes
can be replaced instantly (Cooper et al., 2010).
In Cassandra, data is indexed by a key that is of
the type String. This key represents a line where data
is found, and in each row the data is divided into
columns and column families. Each column in
Cassandra has a name, a value and a timestamp.
Both the value and the timestamp are provided by
the client application when data is inserted. Besides
the normal typed columns, another kind of column
exists, they are known as the super columns. The
thing that differentiates these columns from the
others is the fact that instead of having objects as
values, they have other columns as values.
In order to group columns, Cassandra has a
concept known as: Column Families, which is very
similar to relational database tables. Unlike columns,
the Column Families are not dynamic and must be
previously declared in a configuration file. They are
the unit of abstraction containing keyed rows which
group together columns and super columns of highly
structured data. Column families have no defined
schema of column names and types supported.
Similarly to columns, there is also the Super
Column Family, which is a Column family that just
contains Super columns. It is useful for modeling
complex data types such as addresses and other
simple data structures.
Lastly, column families are grouped into
Keyspaces. These Keyspaces can be compared to
Schemas in a relational database.
Cassandra was designed to handle large amounts
of data spread across many commodity servers.
Cassandra provides high availability through a
symmetric architecture that contains no single point
of failure and replicates data across nodes.
Cassandra’s architecture is a combination of
Google’s Big-Table (Chang et al., 2008) and
Amazon’s Dynamo (DeCandia et al., 2007). It is a
peer-to-peer model, which makes it tolerant against
single points of failure and provides horizontal
scalability. Each node exchanges information across
the cluster every second. A sequentially written
commit log on each node captures write activity to
ensure data durability (Datastax, 2014). Data is then
indexed and written to an in-memory structure called
memtable, which resembles a write-back cache.
Once the memory structure is full, the data is written
to disk in an SSTable (sorted string table) data file (a
file of key/value string pairs, sorted by keys). All
writes are automatically partitioned and replicated
throughout the cluster. When a read or write request
is made, any node in the cluster is able to handle it.
Through the key, the node that answered the
requisition can know which node possesses data
information.
3 YAHOO CLOUD SERVING
BENCHMARK (YCSB)
The Yahoo! Cloud Serving Benchmark (YCSB) is
one of the most used benchmarks, and provides
benchmarking for the bases of comparison between
NoSQL systems. The YCSB Client can be used to
benchmark new database systems by writing a new
class to implement the following methods (Cooper et
al., 2010): read, insert, update, delete and scan.
These operations represent the standard CRUD
operations: Create, Read, Update, and Delete.