KVFS: An HDFS Library over NoSQL Databases

Emmanouil Pavlidakis, Stelios Mavridis, Giorgos Saloustros, Angelos Bilas

Abstract

Recently, NoSQL stores, such as HBase, have gained acceptance and popularity due to their ability to scale out and run queries over large amounts of data. NoSQL stores typically arrange data in tables of (key, value) pairs and support a few simple operations: get, insert, delete, and scan. Despite its simplicity, this API has proven to be extremely powerful. Today, most data analytics frameworks use distributed file systems (DFS) for storing and accessing data. HDFS has emerged as the most popular choice due to its scalability. In this paper we explore how popular NoSQL stores, such as HBase, can provide an HDFS scale-out file system abstraction. We show how to design an HDFS-compliant filesystem on top of a key-value store. We implement our design as a user-space library (KVFS) that provides an HDFS filesystem over an HBase key-value store. KVFS is designed to run Hadoop-style analytics such as MapReduce, Hive, Pig, and Mahout over NoSQL stores without the use of HDFS. We perform a preliminary evaluation of KVFS against a native HDFS setup using DFSIO with a varying number of threads. Our results show that providing a filesystem API over a key-value store is a promising direction: read and write throughput of KVFS and HDFS is identical for both big and small datasets. For both HDFS and KVFS, throughput is limited by the network for small datasets and by device I/O for bigger datasets.
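
The idea of layering file operations over the four key-value operations named in the abstract (get, insert, delete, scan) can be illustrated with a minimal sketch. It is not the KVFS design, which the abstract does not detail, and it does not use the HBase API. The in-memory `KVStore`, the `ToyFS` class, the one-row-per-block layout, and the `path + "\x00" + block index` key format are all assumptions made for illustration only.

```python
import bisect


class KVStore:
    """Toy in-memory sorted table exposing get/insert/delete/scan (hypothetical)."""

    def __init__(self):
        self._keys = []   # keys kept in sorted order so scan can do range lookups
        self._data = {}   # key -> value

    def insert(self, key, value):
        if key not in self._data:
            bisect.insort(self._keys, key)
        self._data[key] = value

    def get(self, key):
        return self._data.get(key)

    def delete(self, key):
        if key in self._data:
            del self._data[key]
            self._keys.remove(key)

    def scan(self, start, stop):
        """Yield (key, value) pairs with start <= key < stop, in key order."""
        lo = bisect.bisect_left(self._keys, start)
        hi = bisect.bisect_left(self._keys, stop)
        for key in self._keys[lo:hi]:
            yield key, self._data[key]


class ToyFS:
    """Hypothetical file layer: each fixed-size block of a file is one row."""

    def __init__(self, store, block_size=4):
        self.store = store
        self.block_size = block_size

    def _key(self, path, index):
        # Assumed key layout: a zero-padded block index keeps the blocks
        # of one file adjacent and in order under a range scan.
        return f"{path}\x00{index:08d}"

    def _range(self, path):
        # All block keys of `path` fall in [path + "\x00", path + "\x01").
        return f"{path}\x00", f"{path}\x01"

    def write(self, path, data):
        self.delete(path)  # overwrite semantics: drop any old blocks first
        for offset in range(0, len(data), self.block_size):
            block = data[offset:offset + self.block_size]
            self.store.insert(self._key(path, offset // self.block_size), block)

    def read(self, path):
        return b"".join(value for _, value in self.store.scan(*self._range(path)))

    def delete(self, path):
        for key, _ in list(self.store.scan(*self._range(path))):
            self.store.delete(key)
```

In this sketch a file read becomes a single range scan and a file delete becomes a scan followed by per-row deletes. A real store would also need file metadata, directories, and concurrency control, which are omitted here.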



Paper Citation


in Harvard Style

Pavlidakis E., Mavridis S., Saloustros G. and Bilas A. (2016). KVFS: An HDFS Library over NoSQL Databases. In Proceedings of the 6th International Conference on Cloud Computing and Services Science - Volume 1: DataDiversityConvergence (CLOSER 2016), ISBN 978-989-758-182-3, pages 360-367. DOI: 10.5220/0005924003600367


in Bibtex Style

@conference{datadiversityconvergence16,
author={Emmanouil Pavlidakis and Stelios Mavridis and Giorgos Saloustros and Angelos Bilas},
title={KVFS: An HDFS Library over NoSQL Databases},
booktitle={Proceedings of the 6th International Conference on Cloud Computing and Services Science - Volume 1: DataDiversityConvergence, (CLOSER 2016)},
year={2016},
pages={360-367},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0005924003600367},
isbn={978-989-758-182-3},
}


in EndNote Style

TY - CONF
JO - Proceedings of the 6th International Conference on Cloud Computing and Services Science - Volume 1: DataDiversityConvergence, (CLOSER 2016)
TI - KVFS: An HDFS Library over NoSQL Databases
SN - 978-989-758-182-3
AU - Pavlidakis E.
AU - Mavridis S.
AU - Saloustros G.
AU - Bilas A.
PY - 2016
SP - 360
EP - 367
DO - 10.5220/0005924003600367