Reads: We note that KVFS’s throughput exceeds
HDFS’s throughput when the number of mappers
is small. From Figure 4(b) we observe that when
the number of mappers ranges from 1 to 4, KVFS’s
read throughput is almost twice that of
HDFS. This improvement is due to KVFS’s use of
asynchronous I/O via ZeroMQ (Hintjens, 2013) for
data requests. As the number of mappers increases,
HDFS’s and KVFS’s throughput become almost equal
as the benefits of asynchronous I/O diminish. The
maximum throughput achieved for the 8-GBytes dataset
is 1170 MBytes/s for KVFS and 1200 MBytes/s for
HDFS. For the 96-GBytes dataset, the maximum
throughput is 390 MBytes/s for KVFS and 490 MBytes/s
for HDFS, both with 4 mappers. This difference in
performance is due to HDFS’s more efficient cache
policy. In future versions we plan to improve
the caching policy, and we believe that, after tuning,
KVFS could outperform the native HDFS setup.
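To make the mechanism concrete, the sketch below (in Java, using the JeroMQ binding of ZeroMQ; the endpoint, message format, and pipelining depth are illustrative assumptions rather than the actual KVFS protocol) shows how a client can keep several block requests in flight on one DEALER socket instead of blocking on each request:

    import org.zeromq.SocketType;
    import org.zeromq.ZContext;
    import org.zeromq.ZMQ;

    // Hypothetical sketch of asynchronous block fetching over ZeroMQ;
    // a ROUTER-based server is assumed at the other end.
    public class AsyncBlockReader {
        public static void main(String[] args) {
            try (ZContext ctx = new ZContext()) {
                ZMQ.Socket sock = ctx.createSocket(SocketType.DEALER);
                sock.connect("tcp://kvfs-server:5555"); // hypothetical endpoint

                int inflight = 8; // requests kept outstanding at once
                for (int block = 0; block < inflight; block++) {
                    // Issue all requests up front, without waiting for
                    // replies, so the network and the server stay busy.
                    sock.send("GET /data/file1/block-" + block);
                }
                for (int i = 0; i < inflight; i++) {
                    byte[] reply = sock.recv(); // replies arrive as completed
                    // ... hand the block in 'reply' to the mapper ...
                }
            }
        }
    }

With few mappers this pipelining hides per-request latency that a synchronous client would pay for every block; with many mappers the aggregate request stream already keeps the servers busy, which is consistent with the diminishing benefit noted above.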
For both reads and writes, our results are bounded by
the throughput of the SSDs for the large dataset (96-
GBytes) and by the network speed for the small
dataset (8-GBytes). In both cases we come close to
the maximum achievable throughput.
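As a rough sanity check of these ceilings (the link and device speeds below are our own assumptions for illustration; they are not stated in the text), a 10 GbE network tops out near the small-dataset numbers, while typical SSD sequential-read rates are in the range of the large-dataset numbers:

    // Hypothetical hardware figures; adjust to the actual testbed.
    public class BottleneckCheck {
        public static void main(String[] args) {
            double networkMBps = 10_000.0 / 8.0; // assumed 10 GbE link, ~1250 MBytes/s
            double ssdMBps     = 500.0;          // assumed SSD sequential-read rate

            // Small (8-GBytes) dataset: network-bound.
            System.out.println("small-dataset ceiling: " + networkMBps + " MBytes/s");
            // Large (96-GBytes) dataset: device-bound.
            System.out.println("large-dataset ceiling: "
                    + Math.min(networkMBps, ssdMBps) + " MBytes/s");
        }
    }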
5 RELATED WORK
Similar to our approach, the Cassandra File System
(CFS) runs on top of Cassandra (Lakshman and Ma-
lik, 2010). CFS aims to offer a file abstraction over
the Cassandra key-value store as an alternative to
HDFS. Cassandra does not require HDFS but runs
over a local filesystem. Therefore in a Cassandra in-
stallation there is no global file-based abstraction over
the data. Our motivation differs in that we are
interested in exploring the use of key-value stores as
the lowest layer in the storage stack. Although we are
currently using HBase, our goal is to eventually replace
HBase with a key-value store that runs directly on top
of the storage devices, without the use of a local or
a distributed filesystem. This motivation is similar to
the Kinetic approach (Kinetic, 2016), which, however,
aims at providing a key-value API at the device level
and then building the rest of the storage stack on top of
this abstraction.
Over the last few years there has been a lot of work
on both DFSs (Depardon et al., 2013; Thanh et al.,
2008) and NoSQL stores (Abramova et al., 2014;
Klein et al., 2015), approached from different points of
view. Next, we briefly discuss related work in DFSs
and NoSQL stores.
DFSs have been an active area of research for many
years due to their importance for scalable data access. Tra-
ditionally, DFSs have strived to scale with the number
of nodes and storage devices, and to eliminate syn-
chronization, network, and device overheads, with-
out compromising the richness of semantics (ideally
offering a POSIX compliant abstraction). Several
DFSs are available and in use today, including Lus-
tre, BeeGFS, OrangeFS, GPFS, GlusterFS, and sev-
eral other commercial and proprietary systems. Pro-
viding a scalable file abstraction while maintaining
traditional semantics and generality has proven to be
a difficult problem. As an alternative, object stores,
such as Ceph (Weil et al., 2006), draw a different bal-
ance between semantics and scalability.
With recent advances in data processing, an important
realization has been that in several domains it
suffices to offer simpler
APIs and to design systems for restricted operating
points. For instance, HDFS uses very large blocks
(e.g. 64 MBytes), which dramatically simplifies meta-
data management. In addition, it does not allow up-
dates, which simplifies synchronization and recovery.
Finally, it does not achieve parallelism for a single
file, since each large chunk is stored in a single node,
simplifying data distribution and recovery. Given
these design decisions, HDFS and similar DFSs are
efficient for read-mostly, sequential workloads, with
large requests.
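As a back-of-the-envelope illustration of the metadata savings (the figures are ours, not from the paper), counting one tracked entry per block of a 96-GBytes file gives:

    // Why large blocks simplify metadata: one tracked entry per block.
    public class BlockMath {
        public static void main(String[] args) {
            long fileSize  = 96L * 1024 * 1024 * 1024; // 96-GBytes file
            long hdfsBlock = 64L * 1024 * 1024;        // 64-MBytes HDFS-style block
            long fsBlock   = 4L * 1024;                // 4-KBytes local-FS-style block

            System.out.println(fileSize / hdfsBlock);  // 1,536 entries
            System.out.println(fileSize / fsBlock);    // 25,165,824 entries
        }
    }

Four orders of magnitude fewer entries is what makes it practical for HDFS to keep all block metadata in the memory of a single namenode.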
NoSQL stores have been used to fill the need
for fine-grain lookups and the ability to scan data in
a sorted manner. NoSQL stores can be categorized in
four groups: key-value DBs (LevelDB, RocksDB,
Silo), document DBs (MongoDB), column-family
stores (HBase, Cassandra), and graph DBs (Neo4j,
Sparksee).
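The two access patterns named above, fine-grain point lookups and sorted range scans, look as follows with the stock HBase 2.x client (the table name and key layout are hypothetical and do not reflect the KVFS schema):

    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.ResultScanner;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;

    public class LookupAndScan {
        public static void main(String[] args) throws Exception {
            try (Connection conn =
                     ConnectionFactory.createConnection(HBaseConfiguration.create());
                 Table table = conn.getTable(TableName.valueOf("files"))) {

                // Fine-grain point lookup of a single row by key.
                Result one = table.get(new Get(Bytes.toBytes("/dir/a/part-0007")));
                System.out.println(one.isEmpty() ? "miss" : "hit");

                // Sorted range scan: rows return in key order, so a key range
                // maps naturally onto, e.g., a directory listing.
                Scan scan = new Scan()
                        .withStartRow(Bytes.toBytes("/dir/a/"))
                        .withStopRow(Bytes.toBytes("/dir/a0")); // '0' sorts just after '/'
                try (ResultScanner rows = table.getScanner(scan)) {
                    for (Result row : rows) {
                        System.out.println(Bytes.toString(row.getRow()));
                    }
                }
            }
        }
    }

Because rows are stored sorted by key, the scan returns them in order without any client-side sorting, which is the primitive that DFS-style metadata operations need.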
Such data-oriented (rather than device-oriented)
approaches to storage and access bear a lot of merit
because they strike yet another balance between
semantics and scalability. To date, these approaches
have become popular in data processing frameworks;
however, they have seen little application in more
general-purpose storage.
We foresee that, as our understanding of key-value
stores and their requirements and efficiency improves,
they will play an important role in the general-purpose
storage stack, beyond data processing frameworks.
6 CONCLUSIONS
In this paper we explore how to provide an HDFS ab-
straction over NoSQL stores. We map the file abstrac-
tion of HDFS to a table-based schema provided by HBase.