Bringing Search Engines to the Cloud using Open Source Components

Khaled Nagi

2015

Abstract

Search engines are now used for intelligent analytics over petabytes of data. With Lucene at the heart of the vast majority of information retrieval systems, several attempts have been made to bring it to the cloud in order to scale to big data. These efforts include implementing scalable distribution of the search indices over the file system, storing them in NoSQL databases, and porting them to inherently distributed ecosystems such as Hadoop. We evaluate the existing efforts in terms of distribution, high availability, fault tolerance, manageability, and high performance. We believe that search indexing for big data can only be supported through common open-source technology deployed on standard cloud platforms such as Amazon EC2 and Microsoft Azure. For each approach, we build a benchmarking system by indexing the whole Wikipedia content and submitting hundreds of simultaneous search requests. We measure the performance of both indexing and searching operations. We simulate node failures and monitor the recoverability of the system. We show that a system built on top of Solr and Hadoop has the best stability and manageability, while systems based on NoSQL databases present an attractive alternative in terms of performance.
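
The search side of such a benchmark, many simultaneous requests with per-request timing, can be illustrated with a minimal sketch. This is not the authors' harness: `run_search_benchmark`, `search_fn`, and `fake_search` are hypothetical names introduced here, and the backend is a stand-in that merely sleeps. A real run would replace `fake_search` with a client call to the system under test.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from statistics import mean


def run_search_benchmark(search_fn, queries, concurrency=100):
    """Submit all queries through a thread pool and time each request.

    search_fn is a hypothetical callable (query -> results) standing in for
    whichever search backend is under test; it is not an API from the paper.
    """
    def timed(query):
        start = time.perf_counter()
        try:
            search_fn(query)
            ok = True
        except Exception:
            ok = False  # count failures, e.g. while a node is down
        return time.perf_counter() - start, ok

    wall_start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        results = list(pool.map(timed, queries))
    wall = time.perf_counter() - wall_start

    # Latency statistics cover successful requests only.
    latencies = [latency for latency, ok in results if ok]
    return {
        "requests": len(results),
        "failures": sum(1 for _, ok in results if not ok),
        "mean_latency_s": mean(latencies) if latencies else None,
        "max_latency_s": max(latencies) if latencies else None,
        "throughput_rps": len(results) / wall if wall > 0 else None,
    }


def fake_search(query):
    """Stand-in backend (assumption): sleeps briefly, fails on an empty query."""
    if not query:
        raise ValueError("empty query")
    time.sleep(0.001)
    return [query.upper()]


if __name__ == "__main__":
    stats = run_search_benchmark(fake_search, ["lucene", "solr", ""] * 50, concurrency=20)
    print(stats)
```

Counting failures separately from latencies is a deliberate choice in this sketch: it allows the same loop to be run while nodes are taken down, so that recoverability shows up as a failure count rather than as distorted timings.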


Paper Citation


in Harvard Style

Nagi K. (2015). Bringing Search Engines to the Cloud using Open Source Components. In Proceedings of the 7th International Joint Conference on Knowledge Discovery, Knowledge Engineering and Knowledge Management - Volume 1: KDIR, (IC3K 2015) ISBN 978-989-758-158-8, pages 116-126. DOI: 10.5220/0005632701160126


in Bibtex Style

@conference{kdir15,
author={Khaled Nagi},
title={Bringing Search Engines to the Cloud using Open Source Components},
booktitle={Proceedings of the 7th International Joint Conference on Knowledge Discovery, Knowledge Engineering and Knowledge Management - Volume 1: KDIR, (IC3K 2015)},
year={2015},
pages={116-126},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0005632701160126},
isbn={978-989-758-158-8},
}


in EndNote Style

TY - CONF
JO - Proceedings of the 7th International Joint Conference on Knowledge Discovery, Knowledge Engineering and Knowledge Management - Volume 1: KDIR, (IC3K 2015)
TI - Bringing Search Engines to the Cloud using Open Source Components
SN - 978-989-758-158-8
AU - Nagi K.
PY - 2015
SP - 116
EP - 126
DO - 10.5220/0005632701160126