Highly Scalable Sort-merge Join Algorithm for RDF Querying

Zbyněk Falt, Miroslav Čermák, Filip Zavoral

Abstract

In this paper, we introduce a highly scalable sort-merge join algorithm for RDF databases. The algorithm is designed especially for streaming systems; besides task and data parallelism, it also tries to exploit pipeline parallelism to increase its scalability. Additionally, we focus on handling skewed data correctly and efficiently; the algorithm scales well regardless of the data distribution.


Paper Citation


in Harvard Style

Falt Z., Čermák M. and Zavoral F. (2013). Highly Scalable Sort-merge Join Algorithm for RDF Querying. In Proceedings of the 2nd International Conference on Data Technologies and Applications - Volume 1: DATA, ISBN 978-989-8565-67-9, pages 293-300. DOI: 10.5220/0004489702930300


in Bibtex Style

@conference{data13,
author={Zbyněk Falt and Miroslav Čermák and Filip Zavoral},
title={Highly Scalable Sort-merge Join Algorithm for RDF Querying},
booktitle={Proceedings of the 2nd International Conference on Data Technologies and Applications - Volume 1: DATA},
year={2013},
pages={293-300},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0004489702930300},
isbn={978-989-8565-67-9},
}


in EndNote Style

TY - CONF
JO - Proceedings of the 2nd International Conference on Data Technologies and Applications - Volume 1: DATA
TI - Highly Scalable Sort-merge Join Algorithm for RDF Querying
SN - 978-989-8565-67-9
AU - Falt Z.
AU - Čermák M.
AU - Zavoral F.
PY - 2013
SP - 293
EP - 300
DO - 10.5220/0004489702930300