5.2 Impact of Query Accelerator
In our query accelerator, multiple threads are used to
execute the sub-query jobs in parallel. Our tests
show that it improves performance by almost 30% on
average. In addition, the number of jobs directly
influences the performance improvement, so the four
queries mentioned in 5.1 improve by different
amounts. Query 11 and query 6, which both have 4
sub-query jobs, would be expected to show the same
improvement, and a higher one than queries 1 and 14.
However, the impact of join jobs cannot be ignored,
because they reduce performance. Query 11 therefore
shows a smaller improvement than query 6, because it
contains more join jobs than query 6.
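The parallel execution described in this section can be sketched as follows. This is a minimal illustration, not the accelerator's actual code: it assumes that each sub-query job is an independent callable, and the names `run_sub_query_jobs` and `make_job` are hypothetical. Join jobs, which depend on the output of other jobs, are not modelled.

```python
from concurrent.futures import ThreadPoolExecutor


def run_sub_query_jobs(jobs, max_workers=4):
    """Run independent sub-query jobs in parallel threads.

    Results are returned in the same order as the input jobs.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(job) for job in jobs]
        return [future.result() for future in futures]


def make_job(name, rows):
    # Hypothetical stand-in for submitting one job and waiting for it:
    # here the "job" only counts the rows it was given.
    def job():
        return name, len(rows)
    return job


if __name__ == "__main__":
    jobs = [make_job("sub_query_%d" % i, list(range(i * 10))) for i in range(1, 5)]
    print(run_sub_query_jobs(jobs))
```

With one thread per sub-query job, the elapsed time is bounded by the slowest job rather than by the sum of all jobs, which is consistent with the observation that queries with more sub-query jobs benefit more.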
5.3 Impact of File Splitting Handler
We investigate the performance with and without the
file splitting handler using a dataset that includes
500 universities. In the experiments, we query data
items located at different positions in the dataset.
The results show that the closer the data is to the
end of the dataset, the lower the performance,
because more split files need to be read. The
average improvement rate reaches 74.53%.
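The effect of data position can be illustrated with a small sketch. It rests on two assumptions that the text does not state explicitly: that split files are scanned in order until the requested data is found, and that the improvement rate is the relative reduction in query time. All function names are hypothetical.

```python
def split_records(records, split_size):
    """Cut the dataset into consecutive split files of at most split_size records."""
    return [records[i:i + split_size] for i in range(0, len(records), split_size)]


def splits_read_until_found(splits, target):
    """Scan split files in order; return how many were read when target is found."""
    for count, split in enumerate(splits, start=1):
        if target in split:
            return count
    return len(splits)  # target absent: every split file was read


def improvement_rate(time_without, time_with):
    """Assumed definition: time saved, as a percentage of the time without the handler."""
    return 100.0 * (time_without - time_with) / time_without


if __name__ == "__main__":
    splits = split_records(list(range(100)), 10)
    for target in (5, 55, 95):
        print(target, splits_read_until_found(splits, target))
```

Under these assumptions, a record near the start of the dataset is found after reading one split file, while a record near the end requires reading almost all of them.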
6 CONCLUSIONS AND FUTURE EXTENSIONS
In this paper we address the problem of querying
RDF data from large repositories and investigate
its performance. To improve performance, we
explored several optimizations: the file splitting
handler, the query accelerator and the job reducing
handler.
In conclusion, the work has produced encouraging
results: an improvement of almost 70% according to
our experiments. Owing to these optimizations, the
query time also shows a sublinear speedup as the
number of datasets increases.
However, the performance of queries is not yet
competitive because too many jobs are created,
especially for SPARQL queries that involve wide
hierarchy information or inference.
In future work, we will therefore focus on the
pre-processor module to reduce the number of jobs
as much as possible. One interesting approach would
be to analyse each sub-query before starting a job:
sub-queries that read the same input file should be
combined into one job. With this approach, however,
we must take care of the output and find a way to
divide it into different files.
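The proposed job combination can be sketched as follows. This is only an illustration of the idea, not an implemented component: sub-queries are modelled as dictionaries with hypothetical keys `name`, `input` and `match`, and the separate output files are represented by one in-memory list per sub-query.

```python
from collections import defaultdict


def group_by_input(sub_queries):
    """Group sub-queries by the input file they read: one job per input file."""
    groups = defaultdict(list)
    for query in sub_queries:
        groups[query["input"]].append(query)
    return dict(groups)


def run_combined_job(lines, queries):
    """Read the input once and route each matching line to its sub-query's output."""
    outputs = {query["name"]: [] for query in queries}
    for line in lines:
        for query in queries:
            if query["match"](line):
                outputs[query["name"]].append(line)
    return outputs


if __name__ == "__main__":
    sub_queries = [
        {"name": "q1", "input": "type.txt", "match": lambda line: "Student" in line},
        {"name": "q2", "input": "type.txt", "match": lambda line: "Professor" in line},
        {"name": "q3", "input": "advisor.txt", "match": lambda line: True},
    ]
    groups = group_by_input(sub_queries)
    print(len(sub_queries), "sub-queries ->", len(groups), "jobs")
```

Here three sub-queries collapse into two jobs because two of them share an input file. In Hadoop itself, a facility such as `MultipleOutputs` could plausibly be used to write the per-sub-query results to different files, although we have not evaluated this.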