building large-scale search engines that are capable
of running in the Cloud. We restrict ourselves to
open-source libraries, including Lucene, Solr, mongoDB,
Cassandra, and Hadoop. We explicitly do not add any
implementation other than publicly available components.
We investigate each variation in terms of scalability
through data partitioning, redundancy through replication,
and consistency, either through the NoSQL databases or
through open-source synchronization libraries such as
Zookeeper. The ease of management of the multi-node cluster
is also an important issue in our evaluation. Performance
plays a major part in our analysis. We build a
benchmarking platform on top of the systems under
investigation. For each variation, we construct a
small and a large cluster. In our experiments, we
measure the indexing speed, the search time, and the
throughput of the searching threads. The results of the
experiments show that Solr and Hadoop provide the best
tradeoff in terms of scalability, stability, and
manageability. Search engines based on NoSQL databases
offer either superior indexing speed or fast search times.
Unfortunately, their integration implementations suffer
from instability.
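To make the measured quantities concrete, the following minimal sketch shows one way to obtain a mean search time and the throughput of a pool of searching threads. It is an illustration only and not the benchmarking platform used in this work: the names run_search_benchmark and fake_search are hypothetical, and fake_search merely sleeps in place of a real request to the system under test.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def run_search_benchmark(search_fn, queries, num_threads):
    """Run every query once, spread over num_threads worker threads.

    Returns (mean_search_time_seconds, throughput_queries_per_second).
    search_fn is any callable taking one query; it is an assumption of
    this sketch, not an API of the systems discussed in the paper.
    """
    queries = list(queries)
    if not queries:
        raise ValueError("at least one query is required")

    def timed_search(query):
        # Time a single search call in isolation.
        start = time.perf_counter()
        search_fn(query)
        return time.perf_counter() - start

    # Wall-clock time over the whole run gives the throughput.
    wall_start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        search_times = list(pool.map(timed_search, queries))
    wall_elapsed = time.perf_counter() - wall_start

    mean_search_time = sum(search_times) / len(search_times)
    throughput = len(search_times) / wall_elapsed
    return mean_search_time, throughput


def fake_search(query):
    # Hypothetical stand-in for a real search request; sleeps to simulate work.
    time.sleep(0.001)
    return []


if __name__ == "__main__":
    sample_queries = ["term%d" % i for i in range(100)]
    for threads in (1, 4):
        mean_time, qps = run_search_benchmark(fake_search, sample_queries, threads)
        print("threads=%d mean_search_time=%.4fs throughput=%.1f q/s"
              % (threads, mean_time, qps))
```

Search time is taken per request, whereas throughput is derived from the wall-clock time of the whole run, so the two can diverge as the number of threads grows.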
In the future, we plan to contribute to LuMongo
by fixing its memory leak. Another useful contribution
would be extending Solandra to support SolrCloud
instead of a single Solr instance. With this in place,
the owner of a large-scale search engine could choose
between the Hadoop infrastructure and a NoSQL cluster
installation, depending on what is available in their
environment and on their expertise.
ACKNOWLEDGEMENTS
We would like to thank the Bibliotheca Alexandrina
for providing us with the necessary hardware for
conducting the benchmarking experiments.
REFERENCES
Akioka, S. and Muraoka, Y., 2010. HPC Benchmarks on
Amazon EC2, Proceedings of the IEEE 24th International
Conference on Advanced Information Networking and
Applications Workshops (WAINA).
Bojanova, I. and Samba, A., 2011. Analysis of Cloud
Computing Delivery Architecture Models, IEEE
Workshops of International Conference on Advanced
Information Networking and Applications (WAINA).
Blur, n.d., Apache Blur (Incubating) Home,
https://incubator.apache.org/blur/, retrieved July
2015.
Brewer, E., 2000. Towards Robust Distributed Systems.
ACM Symposium on Principles of Distributed Computing.
Cutting, D. and Pedersen, J., 1990. Optimizations for
Dynamic Inverted Index Maintenance, Proceedings of
SIGIR ’90.
DB-Engines, n.d., Knowledge Base of Relational and
NoSQL Database Management Systems,
http://db-engines.com/en/ranking, retrieved July 2015.
Dean, J. and Ghemawat, S., 2008. MapReduce: simplified
data processing on large clusters. Communications of
the ACM. 51, 1, 107–113.
Edlich, S., Friedland, A., Hampe, J., Brauer, B., 2010.
NoSQL: Introduction to the World of non-relational
Web 2.0 Databases (In German) NoSQL: Einstieg in
die Welt nichtrelationaler Web 2.0 Datenbanken,
Hanser Verlag.
Internet Archive BA, n.d., Internet Archive at Bibliotheca
Alexandrina,
http://www.bibalex.org/en/project/details?documentid=283,
retrieved July 2015.
Karambelkar, H.V., 2015. Scaling Big Data with Hadoop
and Solr, Packt Publishing, 2nd Edition.
Katta, n.d., http://katta.sourceforge.net/, retrieved July
2015.
Khare, R. et al., 2004. Nutch: A flexible and scalable
open-source web search engine. Technical Report, Oregon
State University. 1, 32–32.
Kuc, R. and Rogozinski, M., 2015. Mastering
Elasticsearch, Packt Publishing, 2nd Edition.
Lakshman, A. and Malik, P., 2010. Cassandra: a
decentralized structured storage system. SIGOPS Operating
Systems Review, 44(2):35–40.
Lucene - Index File Formats, n.d.,
https://lucene.apache.org/core/3_0_3/fileformats.html,
retrieved July 2015.
Lucene - Class LockFactory, n.d.,
http://lucene.apache.org/core/4_8_0/core/org/apache/lucene/store/LockFactory.html,
retrieved July 2015.
LuMongo, n.d., LuMongo Realtime Time Distributed
Search, http://lumongo.org/, retrieved July 2015.
McCandless, M., Hatcher, E., and Gospodnetić, O., 2010.
Lucene in Action, Manning, 2nd Edition.
Nagi, K., 2007. Bringing Information Retrieval Back To
Database Management Systems, Proceedings of
IKE'07, International Conference on Information and
Knowledge Engineering.
Neo4j, n.d., http://www.neo4j.org, retrieved July 2015.
Pessach, Y., 2013. Distributed Storage: Concepts,
Algorithms, and Implementations, CreateSpace Independent
Publishing Platform.
Plugge, E., Hawkins, D., and Membrey, P., 2010. The
Definitive Guide to mongoDB: The NoSQL Database
for Cloud and Desktop Computing, Apress.
Rabl, T. et al., 2012. Solving big data challenges for
enterprise application performance management,
Proceedings of the VLDB Endowment, Volume 5, Issue 12,
pp. 1724–1735.
Redis, n.d., http://redis.io/, retrieved July 2015.
Solr, n.d., Solr - Apache Lucene - The Apache Software