Highly Scalable Sort-merge Join Algorithm for RDF Querying
Zbyněk Falt, Miroslav Čermák, Filip Zavoral
2013
Abstract
In this paper, we introduce a highly scalable sort-merge join algorithm for RDF databases. The algorithm is designed especially for streaming systems; besides task and data parallelism, it also tries to exploit pipeline parallelism to increase its scalability. We also focus on handling skewed data correctly and efficiently, so that the algorithm scales well regardless of the data distribution.
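For readers unfamiliar with the underlying operation, the following is a minimal sequential sketch of a plain sort-merge join. It is not the parallel, streaming algorithm the paper introduces. The function name `sort_merge_join` and the `(key, value)` row layout are assumptions made for illustration; in an RDF setting the key would stand for the binding of a shared variable. Runs of equal keys are paired as a cross product, which is the case that makes skewed data expensive.

```python
from operator import itemgetter


def sort_merge_join(left, right):
    """Yield (key, left_value, right_value) for every pair of rows with equal keys.

    Illustrative only: `left` and `right` are iterables of (key, value) pairs.
    """
    # Sort phase: order both inputs by the join key.
    left = sorted(left, key=itemgetter(0))
    right = sorted(right, key=itemgetter(0))

    # Merge phase: advance two cursors over the sorted inputs.
    i = j = 0
    while i < len(left) and j < len(right):
        lkey, rkey = left[i][0], right[j][0]
        if lkey < rkey:
            i += 1
        elif lkey > rkey:
            j += 1
        else:
            # Find the end of the run of equal keys on each side.
            i_end = i
            while i_end < len(left) and left[i_end][0] == lkey:
                i_end += 1
            j_end = j
            while j_end < len(right) and right[j_end][0] == lkey:
                j_end += 1
            # Emit the cross product of the two runs; a heavily repeated
            # key (skew) makes this step dominate the running time.
            for _, lval in left[i:i_end]:
                for _, rval in right[j:j_end]:
                    yield (lkey, lval, rval)
            i, j = i_end, j_end
```

Because the function is a generator, results are produced one at a time during the merge phase, although the sort phase here still consumes both inputs completely before any output appears.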
Paper Citation
in Harvard Style
Falt Z., Čermák M. and Zavoral F. (2013). Highly Scalable Sort-merge Join Algorithm for RDF Querying. In Proceedings of the 2nd International Conference on Data Technologies and Applications - Volume 1: DATA, ISBN 978-989-8565-67-9, pages 293-300. DOI: 10.5220/0004489702930300
in Bibtex Style
@conference{data13,
author={Zbyněk Falt and Miroslav Čermák and Filip Zavoral},
title={Highly Scalable Sort-merge Join Algorithm for RDF Querying},
booktitle={Proceedings of the 2nd International Conference on Data Technologies and Applications - Volume 1: DATA},
year={2013},
pages={293-300},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0004489702930300},
isbn={978-989-8565-67-9},
}
in EndNote Style
TY - CONF
JO - Proceedings of the 2nd International Conference on Data Technologies and Applications - Volume 1: DATA
TI - Highly Scalable Sort-merge Join Algorithm for RDF Querying
SN - 978-989-8565-67-9
AU - Falt Z.
AU - Čermák M.
AU - Zavoral F.
PY - 2013
SP - 293
EP - 300
DO - 10.5220/0004489702930300