ticular, in this work we faced the challenge of parallelizing
many parts of the prototype by applying the Map-Reduce
approach, in order to evaluate the feasibility of this Big Data
approach in the area of Big Open Data. Specifically, the
Hammer prototype implements the retrieval technique
presented in (Pelucchi et al., 2017). We demonstrated that
adopting modern standard technology specifically designed
for big data management, such as Apache Hadoop and
MongoDB, can be effective in this context, in particular when
the number of nodes in the Apache Hadoop ecosystem is
increased, even though the retrieval technique produces a
large number of rewritten queries (neighbour queries) and
several parts of the prototype are currently not optimized.
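As a minimal sketch of the Map-Reduce pattern referred to above (not the Hammer implementation), the following Python fragment evaluates a set of neighbour queries over items with a map step and a reduce step. The query representation (a data set name plus one field/value condition), the function names, and the sample data are all hypothetical and serve only as an illustration.

```python
from collections import defaultdict
from functools import reduce

# Assumption: a neighbour query is a (data_set, field, value) triple.
# Assumption: an item is a dict with a "data_set" name and a "fields" dict.

def map_item(item, queries):
    """Map step: emit (data_set, item) if the item satisfies any query."""
    for data_set, field, value in queries:
        if item["data_set"] == data_set and item["fields"].get(field) == value:
            yield data_set, item
            break  # emit each item at most once

def reduce_pairs(acc, pair):
    """Reduce step: group the fields of matching items by data set."""
    data_set, item = pair
    acc[data_set].append(item["fields"])
    return acc

def run(items, queries):
    """Apply the map step to every item, then fold the emitted pairs."""
    pairs = (pair for item in items for pair in map_item(item, queries))
    return dict(reduce(reduce_pairs, pairs, defaultdict(list)))

# Hypothetical sample data.
ITEMS = [
    {"data_set": "schools", "fields": {"city": "Bergamo", "name": "A"}},
    {"data_set": "schools", "fields": {"city": "Milano", "name": "B"}},
    {"data_set": "hospitals", "fields": {"town": "Bergamo", "name": "C"}},
]
QUERIES = [("schools", "city", "Bergamo"), ("hospitals", "town", "Bergamo")]

result = run(ITEMS, QUERIES)
```

Because the map step looks at one item at a time, the items could in principle be partitioned across nodes and mapped independently, which is the property that makes adding nodes worthwhile.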
As far as the effectiveness of the query technique is
concerned, i.e., its capability to retrieve what users want,
we refer to our previous work (Pelucchi et al., 2017), in
which the technique was extensively introduced and
evaluated on a corpus of open data sets. In that paper we
showed that the technique is effective, in particular in
comparison with a tool like Apache Solr, which is a
stand-alone search engine. We found that, although Apache
Solr behaves quite well, our technique is better at focusing
on the data sets of interest; furthermore, it extracts only the
items of interest (which Apache Solr does not do in a classic
configuration).
In future work, we will optimize the implementation of
many components of the Hammer prototype, in order to
achieve near-real-time response times. In particular, we plan
to replace the native Hadoop implementation with Spark on a
Hadoop cluster to obtain a dramatic improvement in
performance (according to (Zaharia et al., 2010), Spark can be
10x faster than Hadoop).
Finally, we will extend queries to provide complex
features such as joins and spatial joins (Bordogna and Psaila,
2004) of retrieved data sets. We are also considering the
concept of query disambiguation in order to improve the
generation of neighbour queries, taking (Bordogna et al.,
2012) as a starting point. Furthermore, we think that
post-processing of results is necessary, in particular when
thousands of items are retrieved; useful operators could be
defined, similar to those introduced in (Bordogna et al.,
2008).
Similarly, adopting NoSQL databases for persistent
storage of retrieved results could be useful, since the Hammer
prototype provides collections of heterogeneous JSON
objects, possibly geo-referenced. A promising direction is to
integrate the concept of blind querying and the Hammer
engine as part of the J-CO-QL query language (Bordogna et
al., 2017), which is able to query heterogeneous collections of
possibly geo-tagged JSON objects, providing high-level
operators that natively deal with spatial representation and
properties.
REFERENCES
Bordogna, G., Campi, A., Psaila, G., and Ronchi, S. (2008).
A language for manipulating clustered web documents
results. In Proceedings of the 17th ACM conference on
Information and knowledge management, pages 23–32. ACM.
Bordogna, G., Campi, A., Psaila, G., and Ronchi, S. (2012).
Disambiguated query suggestions and personalized
content-similarity and novelty ranking of clustered results
to optimize web searches. Information Processing &
Management, 48(3):419–437.
Bordogna, G., Capelli, S., and Psaila, G. (2017). A big geo
data query framework to correlate open data with social
network geotagged posts. In International Conference on
Geographic Information Science, pages 185–203. Springer,
Cham.
Bordogna, G. and Psaila, G. (2004). Fuzzy-spatial SQL. In
International Conference on Flexible Query Answering
Systems, pages 307–319. Springer.
Braunschweig, K., Eberius, J., Thiele, M., and Lehner, W.
(2012). The state of open data. In WWW2012.
Carrara, W., Chan, W. S., Fischer, S., and van Steenbergen,
E. (2015). Creating Value through Open Data. European
Union.
Cohen, W. W. (1998). Integration of heterogeneous
databases without common domains using queries
based on textual similarity. In ACM SIGMOD Record,
volume 27, pages 201–212. ACM.
Cukier, K. (2010). Data, data everywhere: A special report
on managing information. Economist Newspaper.
Davies, T. G., Rahman, I. A., Lautenschlager, S.,
Cunningham, J. A., Asher, R. J., Barrett, P. M., Bates, K. T.,
Bengtson, S., Benson, R. B., Boyer, D. M., et al.
(2017). Open data and digital morphology. In Proc. R.
Soc. B, volume 284, page 20170194. The Royal Society.
Dean, J. and Ghemawat, S. (2008). MapReduce: simplified
data processing on large clusters. Communications of
the ACM, 51(1):107–113.
Dean, J. and Ghemawat, S. (2010). MapReduce: a flexible
data processing tool. Communications of the ACM,
53(1):72–77.
Kononenko, O., Baysal, O., Holmes, R., and Godfrey,
M. (2014). Mining modern repositories with Elasticsearch.
In MSR. June 29-30 2014, Hyderabad, India.
Liu, J., Dong, X., and Halevy, A. Y. (2006). Answering
structured queries on unstructured data. In WebDB.
2006, Chicago, Illinois, USA, volume 6, pages 25–30.
Citeseer.