Author:
Khaled Nagi
Affiliation:
Faculty of Engineering, Alexandria University, Egypt
Keyword(s):
Search Engine, Scalability, Fault Tolerance, Open-Source, Lucene, Solr, NoSQL, Hadoop.
Related Ontology Subjects/Areas/Topics:
Artificial Intelligence; Business Analytics; Business Intelligence Applications; Data Analytics; Data Engineering; Information Extraction; Knowledge Discovery and Information Retrieval; Knowledge-Based Systems; Symbolic Systems
Abstract:
Search engines are now also used for intelligent analytics over petabytes of data. With
Lucene at the heart of the vast majority of information retrieval systems, several attempts have been made to
bring it to the cloud in order to scale to big data. Efforts include implementing scalable distribution of the
search indices over the file system, storing them in NoSQL databases, and porting them to inherently distributed
ecosystems, such as Hadoop. We evaluate the existing efforts in terms of distribution, high availability,
fault tolerance, manageability, and high performance. We believe that search
indexing capabilities for big data can only be supported through common open-source technology
deployed on standard cloud platforms such as Amazon EC2 and Microsoft Azure. For each approach,
we build a benchmarking system by indexing the whole Wikipedia content and submitting hundreds of simultaneous
search requests. We measure the performance of both indexing and searching operations. We
simulate node failures and monitor the recoverability of the system. We show that a system built on top of
Solr and Hadoop has the best stability and manageability, while systems based on NoSQL databases present
an attractive alternative in terms of performance.