building large-scale search engines that are capable
of running in the Cloud. We restrict ourselves to
open-source libraries, including Lucene, Solr,
mongoDB, Cassandra, and Hadoop. We explicitly do not
add any implementation beyond publicly available
components. We investigate each variation in terms of
scalability through data partitioning, redundancy
through replication, and consistency, achieved either
through the NoSQL databases or through open-source
synchronization libraries such as Zookeeper.
The ease of management of the multi-node cluster is
also an important issue in our evaluation. Performance
plays a major part in our analysis. We build a
benchmarking platform on top of the systems under
investigation. For each variation, we construct a
small and a large cluster. In our experiments, we
measure the speed of indexing, the search time, and
the throughput of the searching threads. The results
of the experiments show that Solr and Hadoop provide
the best tradeoff in terms of scalability, stability
and manageability. Search engines based on NoSQL
databases offer either superior indexing speed or
fast searching times. Unfortunately, their
integration implementations suffer from instability.
In the future, we plan to contribute to LuMongo
by fixing its memory leak problem. Another useful
contribution would be extending Solandra to
support SolrCloud instead of a single Solr instance.
With both in place, the owner of a large-scale search
engine could choose between the Hadoop infrastructure
and a NoSQL cluster installation, depending on what
is available in their environment and on their
expertise.
ACKNOWLEDGEMENTS 
We would like to thank the Bibliotheca Alexandrina
for providing us with the necessary hardware for
conducting the benchmarking experiments.
REFERENCES 
Akioka, S. and Muraoka, Y., 2010. HPC Benchmarks on
Amazon EC2, Proceedings of the IEEE 24th International
Conference on Advanced Information Networking and
Applications Workshops (WAINA).
Bojanova, I. and Samba, A., 2011. Analysis of Cloud 
Computing Delivery Architecture Models, IEEE 
Workshops of International Conference on Advanced 
Information Networking and Applications (WAINA). 
Blur, n.d., Apache Blur (Incubating) Home,
https://incubator.apache.org/blur/, retrieved July 2015.
Brewer, E., 2000. Towards Robust Distributed Systems.
ACM Symposium on Principles of Distributed Computing.
Cutting, D. and Pedersen, J., 1990. Optimizations for 
Dynamic Inverted Index Maintenance, Proceedings of 
SIGIR ’90. 
DB-Engines, n.d., Knowledge Base of Relational and
NoSQL Database Management Systems,
http://db-engines.com/en/ranking, retrieved July 2015.
Dean, J. and Ghemawat, S., 2008. MapReduce: simplified 
data processing on large clusters. Communications of 
the ACM. 51, 1, 107–113. 
Edlich, S., Friedland, A., Hampe, J., Brauer, B., 2010. 
NoSQL: Introduction to the World of non-relational 
Web 2.0 Databases (In German) NoSQL: Einstieg in 
die Welt nichtrelationaler Web 2.0 Datenbanken, 
Hanser Verlag. 
Internet Archive BA, n.d., Internet Archive at Bibliotheca
Alexandrina,
http://www.bibalex.org/en/project/details?documentid=283,
retrieved July 2015.
Karambelkar, H.V., 2015. Scaling Big Data with Hadoop
and Solr, Packt Publishing, 2nd Edition.
Katta, n.d., http://katta.sourceforge.net/, retrieved July 
2015. 
Khare, R. et al., 2004. Nutch: A flexible and scalable
open-source web search engine. Technical Report
Oregon State University. 1, 32–32.
Kuc, R. and Rogozinski, M., 2015. Mastering
Elasticsearch, Packt Publishing, 2nd Edition.
Lakshman, A. and Malik, P., 2010. Cassandra: a
decentralized structured storage system. SIGOPS
Operating Systems Review, 44(2):35–40.
Lucene - Index File Formats, n.d.,
https://lucene.apache.org/core/3_0_3/fileformats.html,
retrieved July 2015.
Lucene - Class LockFactory, n.d.,
http://lucene.apache.org/core/4_8_0/core/org/apache/lucene/store/LockFactory.html,
retrieved July 2015.
LuMongo, n.d., LuMongo Realtime Time Distributed 
Search, http://lumongo.org/, retrieved July 2015. 
McCandless, M., Hatcher, E., and Gospodnetić, O., 2010.
Lucene in Action, Manning, 2nd Edition.
Nagi, K., 2007. Bringing Information Retrieval Back To 
Database Management Systems, Proceedings of 
IKE'07, International Conference on Information and 
Knowledge Engineering. 
Neo4j, n.d., http://www.neo4j.org, retrieved July 2015. 
Pessach, Y., 2013. Distributed Storage: Concepts,
Algorithms, and Implementations, CreateSpace
Independent Publishing Platform.
Plugge, E., Hawkins, D., and Membrey, P., 2010. The 
Definitive Guide to mongoDB: The NoSQL Database 
for Cloud and Desktop Computing, Apress. 
Rabl, T. et al., 2012. Solving big data challenges for
enterprise application performance management,
Proceedings of the VLDB Endowment, Volume 5 Issue 12,
pp 1724-1735.
Redis, n.d., http://redis.io/, retrieved July 2015.
Solr, n.d., Solr - Apache Lucene - The Apache Software