Bringing Search Engines to the Cloud using Open Source
Components
Khaled Nagi
Dept. of Computer and Systems Engineering, Faculty of Engineering, Alexandria University, Alexandria, Egypt
Keywords: Search Engine, Scalability, Fault Tolerance, Open-Source, Lucene, Solr, NoSQL, Hadoop.
Abstract: The usage of search engines is nowadays extended to intelligent analytics of petabytes of data. With
Lucene being at the heart of the vast majority of information retrieval systems, several attempts have been made
to bring it to the cloud in order to scale to big data. Efforts include implementing scalable distribution of the
search indices over the file system, storing them in NoSQL databases, and porting them to inherently distributed
ecosystems, such as Hadoop. We evaluate the existing efforts in terms of distribution, high availability, fault
tolerance, manageability, and high performance. We believe that supporting search indexing capabilities for big
data can only be achieved through the use of common open-source technology deployed on standard cloud
platforms such as Amazon EC2, Microsoft Azure, etc. For each approach, we build a benchmarking system by
indexing the whole Wikipedia content and submitting hundreds of simultaneous search requests. We measure
the performance of both indexing and searching operations. We simulate node failures and monitor the
recoverability of the system. We show that a system built on top of Solr and Hadoop has the best stability and
manageability, while systems based on NoSQL databases present an attractive alternative in terms of performance.
1 INTRODUCTION
Since Doug Cutting originally wrote Lucene
(McCandless et al., 2010) in 1999, after a long series
of scientific publications dating back to 1990 (Cutting
and Pedersen, 1990), it has emerged as the standard
full-text search engine in the open-source community.
Several other open-source projects, such as Solr
(Smiley et al., 2015) and Elasticsearch (Kuc and
Rogozinski, 2015), are built on top of Lucene and
offer extended search facilities, such as faceted
navigation, hit highlighting, auto-suggest, and
geospatial search.
Now, search engines are required to do intelligent
analytics of petabytes of data. Back in 2007, the
first attempts (Nagi, 2007) were made to provide
scalable, robust, and distributed search engines by
porting the core Lucene storage classes to run on
relational database management systems. With the
emergence of NoSQL database management systems
and inherently distributed ecosystems, such as
Hadoop, many open-source prototypes and
implementations nowadays attempt to support the
necessary features for any large-scale cloud-based
implementation of a search engine (Karambelkar, 2015).
In this work, we investigate the most prominent
publicly available implementations. We believe that
the key to the success of any large-scale search
engine will remain the same as that of the original
Lucene, which is openness. In our work, we
explicitly refrain from adding any customized
implementation to the off-the-shelf open-source
components. We apply only the tweaks supplied by the
official performance tuning recommendations from
the providers.
Our contribution is the independent evaluation of
the existing approaches in terms of support for
distribution, in which data partitioning and
replication, while maintaining consistency, play a
major role. We always investigate the effect of node
failures, since almost all popular modern cloud
providers, such as Amazon EC2 (Akioka and Muraoka,
2010) and Microsoft Azure (Bojanova and Samba, 2011),
are built on commodity hardware. Furthermore, we take
into consideration the ease of management of the
cluster. However, our main focus is the evaluation of
the indexing and searching performance of these
systems.
The rest of the paper is organized as follows. In
Section 2, the features desired in a distributed,
highly scalable search engine are presented, together with a brief
background of the technologies in use. Section 3
gives a detailed description of the systems under
investigation. In Section 4, the performance
evaluation is presented. Section 5 concludes the
paper and presents an outlook on our future work in
this area.
2 BACKGROUND AND RELATED
WORK
The following features are the key to the success of
any cloud-based large-scale search engine:
Partitioning (Sharding): This means splitting the index
into several independent sections. Each section
can be viewed as a separate index and is indexed
independently. A query is answered by processing
it at the shards in question before the result is
consolidated and returned to the user.
Replication: It provides redundancy and increases
data availability. With multiple copies of
data on different servers, replication protects an
index from the loss of a single node. In some
cases, replication can be used to increase read
capacity.
Consistency: A newly indexed document is not
necessarily made available to the next search
request. However, the index data structure must be
consistent under whatever storage model is used to
store it. Taking a deeper look into the structure of
Lucene (“Lucene - Index File Formats”, n.d.), for
example, the content of one internal block is
dependent on the content of another. Consistency
between these blocks must be guaranteed at all
times, whereas consistency across the independent
shards is not required.
Fault-tolerance: It means the absence of any
Single Point of Failure (SPoF). Most modern
clouds are based on commodity hardware. The
temporary absence of a node is expected to occur
at any point in time. This should never lead to
the failure of the whole search engine.
Manageability: A cloud-based search engine is
spread across several dozens of servers. The
administration of these servers and the services
deployed on them must be easy: either through a
Command Line Interface (CLI), a programmatically
embeddable interface, e.g., JMX, or most preferably
via web administration consoles.
High Performance: Cloud-based search engines
should be capable of indexing the shards in
parallel. They should also process hundreds of
search queries in parallel with a reasonable
response time (e.g., under 3 seconds).
2.1 Lucene-based Search Engines
A full text search index is an efficient cross-
reference lookup data structure. Usually, a variation
of the well-known inverted index structure is used
(Cutting and Pedersen, 1990).
The indexing process begins with collecting the
available set of documents by the data gatherer. The
parser converts them to a stream of plain text. In the
analysis phase, the stream of data is tokenized
according to predefined delimiters and a number of
operations are performed on the tokens, e.g., the
removal of all stop words and the reduction of the
words to their roots to enable phonetic searches.
The searching process begins with parsing the
user query. The tokens have to be analyzed by the
same analyzer used for indexing. Then, the index is
traversed for possible matches. The fuzzy query
processor is responsible for defining the match
criteria and the score of the hit.
Lucene (McCandless et al., 2010) is at the heart
of almost every full-text search engine. It provides
several useful features, such as ranked searching,
fielded searching, and sorting. Searching is done
through several query types, including phrase
queries, wildcard queries, proximity queries, and
range queries. It allows for simultaneous indexing
and searching by implementing a simple pessimistic
locking algorithm (“Lucene - Class LockFactory”, n.d.).
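As a concrete illustration of this flow, the following minimal sketch indexes one document and runs a single-term query against it using the plain Lucene API. A Lucene 5.x style API is assumed here; the field names, sample text, and the in-memory directory are arbitrary illustration choices and not part of the systems evaluated in this paper.

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;

public class LuceneSketch {
    public static void main(String[] args) throws Exception {
        StandardAnalyzer analyzer = new StandardAnalyzer();
        Directory directory = new RAMDirectory(); // in-memory store; a disk-based directory works the same way

        // Indexing: the analyzer tokenizes the content stream before it is written to the index.
        IndexWriter writer = new IndexWriter(directory, new IndexWriterConfig(analyzer));
        Document doc = new Document();
        doc.add(new TextField("title", "Alexandria", Field.Store.YES));
        doc.add(new TextField("content", "The Bibliotheca Alexandrina archives the web.", Field.Store.YES));
        writer.addDocument(doc);
        writer.close();

        // Searching: the user query is parsed with the same analyzer, then the index is traversed.
        IndexSearcher searcher = new IndexSearcher(DirectoryReader.open(directory));
        Query query = new QueryParser("content", analyzer).parse("archives");
        for (ScoreDoc hit : searcher.search(query, 50).scoreDocs) {
            System.out.println(searcher.doc(hit.doc).get("title") + " score=" + hit.score);
        }
    }
}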
An important internal feature of Lucene is that it
uses a configurable storage engine. In its standard
release, it comes with a codec to store the index on
disk or maintain it in memory for smaller indices.
The internal structure of the index file is public
and is platform independent (“Lucene - Index File
Formats”, n.d.). This ensures its portability. Back in
2007, this concept was used to store the index
efficiently in Relational Database Management
Systems (Nagi, 2007). The same technique is used
today to store the index in NoSQL databases, such as
Cassandra (Lakshman and Malik, 2010) and mongoDB
(Plugge et al., 2010).
Apache Solr (Smiley et al., 2015) is built on top
of Lucene. It is a web application that can be
deployed in any servlet container. It adds the
following functionality to Lucene:
XML/HTTP and JSON APIs
Hit highlighting
Faceted search and filtering
Geospatial search
Caching
Near real-time searching of newly indexed
documents.
Web administration interface
SolrCloud (Smiley et al., 2015) was released in
2012. It is an extension to Solr that allows for both
sharding and replication. The management of this
distribution is seamlessly integrated into an intuitive
web administration console. Figure 1 illustrates the
configuration of one of our setups in the web
administration console.
Figure 1: Screenshot of the web administration console.
Elasticsearch (Kuc and Rogozinski, 2015)
evolved almost in parallel to Solr and SolrCloud.
Both offer the same set of features. Both are very
performant. Both are open-source and use different
combinations of open-source libraries. At their heart,
both have Lucene. In general, Solr seems to be
slightly more popular than Elasticsearch, whereas
Elasticsearch is expanding more in the direction of
data analytics.
2.2 NoSQL Databases
The main strength of NoSQL databases comes from
their ability to manage extremely large volumes of
data. For this type of application, ACID transaction
properties are too restrictive. More relaxed models,
such as the CAP theorem and eventual consistency,
emerged (Brewer, 2000). The CAP theorem states that
any large-scale distributed DBMS can guarantee only
two of three aspects: Consistency, Availability, and
Partition tolerance. In order to resolve the conflicts
implied by the CAP theorem, the BASE consistency
model (Basically Available, Soft state, Eventually
consistent) was defined for modern applications
(Brewer, 2000). This principle goes well with
information retrieval systems, where intelligent
search results are more important than consistent ones.
A good overview of existing NoSQL database
management systems can be found in (Edlich et al.,
2010). Mainly, NoSQL database systems fall into
four categories:
graph databases,
key-value systems,
column-family systems, and
document stores.
Graph databases concentrate on providing new
algorithms for storing and processing very large and
distributed graphs. They are often faster for
associative data sets. They can scale more naturally
to large data sets as they do not require expensive
join operations. Neo4j (“Neo4j”, n.d.) is a typical
example of a graph database.
Key-value systems use associative arrays (maps)
as their fundamental data structure. More complicated
data structures are often implemented on top of
the maps. Redis (“Redis”, n.d.) is a good example of
a basic key-value system.
The data model of column-family systems provides
a structured key-value store where columns are
added only to specified keys. Different keys can
have different numbers of columns in any given
family. A prominent member of the column-family
stores is Cassandra (Lakshman and Malik, 2010).
Apache Cassandra is a second-generation distributed
key-value store, originally developed at Facebook.
It is designed to handle very large amounts of data
spread across many commodity servers without a
single point of failure. Replication is done even
across multiple data centers. Nodes can be added to
the cluster without downtime.
Document-oriented databases are also a subclass
of key-value stores. The difference lies in the way
the data is processed. A document-oriented system
relies on the internal structure of the document in
order to extract metadata that the database engine
uses for further optimization. Document databases
are schemaless and store all related information
together. Documents are addressed in the database
via a unique key. Typically, the database constructs
an index on the key and all kinds of metadata.
mongoDB (Plugge et al., 2010), first developed in
2007, is considered to be the most popular NoSQL
database nowadays (“DB-Engines”, n.d.). mongoDB
provides high availability with replica sets.
In all attempts to store Lucene index files in
NoSQL databases, the contributors take the logical
index file as the starting point. The set of logical
files is broken into logical blocks that are stored
in the database. It is therefore clear that plain
key-value data stores and graph databases are not
suitable for storing a Lucene index. On the other
hand, document stores, such as mongoDB, are ideal stores for
Lucene indices. One Lucene logical file maps easily
to a mongoDB document. Similarly, the Lucene
logical directory (files) is mapped to a Cassandra
column family (rows); this mapping is captured by
an inherited implementation of the abstract Lucene
Directory class. The files of the directory are
broken down into blocks whose sizes are capped.
Each block is stored as the value of a column in the
corresponding row.
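To make the mapping concrete, the following sketch shows the block-splitting idea in isolation. It is an illustration only, not the actual Solandra or LuMongo code; the BlockStore interface, the block size, and all names are assumptions made for this example.

import java.util.Arrays;

public class BlockedFileStore {

    // Hypothetical abstraction over the NoSQL store: one put per (file, block number) key,
    // e.g., one column per block in Cassandra or one document per block in mongoDB.
    public interface BlockStore {
        void putBlock(String fileName, int blockNumber, byte[] block);
    }

    private static final int BLOCK_SIZE = 16 * 1024; // assumed cap on the size of a block

    // Breaks one logical Lucene file into capped blocks and stores each block separately.
    public static void writeFile(BlockStore store, String fileName, byte[] contents) {
        int blockNumber = 0;
        for (int offset = 0; offset < contents.length; offset += BLOCK_SIZE) {
            int end = Math.min(offset + BLOCK_SIZE, contents.length);
            store.putBlock(fileName, blockNumber++, Arrays.copyOfRange(contents, offset, end));
        }
    }
}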
2.3 Inherently Distributed Ecosystems
After the release of (Dean and Ghemawat, 2008),
Doug Cutting worked on a Java-based MapReduce
implementation to solve scalability issues in Nutch
(Khare et al., 2004), an open-source web crawler
project that feeds search engines with content. This
was the base for the Hadoop open-source project,
which became a top-level Apache Foundation project.
Currently, the main Hadoop project includes these
modules:
Hadoop Common: It supports the other Hadoop
modules.
Hadoop Distributed File System (HDFS): A dis-
tributed file system.
Hadoop YARN: A job scheduler and cluster
resource manager.
Hadoop MapReduce: A YARN-based system for
parallel processing of large data sets.
Each Hadoop task (map or reduce) works on a
small subset of the data it has been assigned so that
the load is spread across the cluster. The map tasks
generally load, parse, transform, and filter data. Each
reduce task is responsible for handling a subset of
the map task output. Intermediate data is then copied
from mapper tasks by the reducer tasks in order to
group and aggregate the data. It is definitely
appealing to use the MapReduce framework in order to
construct the Lucene index using several nodes of a
Hadoop cluster.
The input to a MapReduce job is a set of files
that are spread over the Hadoop Distributed File
System (HDFS). At the end of the MapReduce
operations, the data is written back to HDFS. HDFS
is a distributed, scalable, and portable file system.
A Hadoop cluster has one namenode and a set of
datanodes. Each datanode serves up blocks of data
over the network using a block protocol. HDFS
achieves reliability by replicating the data across
multiple hosts. Hadoop recommends a replication
factor of 3. Since the release of Hadoop 2.0 in 2012,
several high-availability capabilities, such as
automatic fail-over of the namenode, have been
implemented. This way, HDFS comes with no single
point of failure. HDFS was designed for mostly
immutable files (Pessach, 2013) and may not be
suitable for systems requiring concurrent write
operations. Since the default storage codec for Solr
is append-only, it matches HDFS well. With the
extreme scalability, robustness, and wide adoption of
Hadoop clusters, HDFS offers an excellent store for
Solr in cloud-based environments.
Additionally, there are three ecosystems that can
be used in building distributed search engines: Katta,
Blur and Storm.
Katta (“Katta”, n.d.) brings Apache Hadoop and
Solr together. It brings search across a completely
distributed MapReduce-based cluster. Katta is an
open-source project that uses the underlying Hadoop
HDFS for storing the indices and providing access to
them. Unfortunately, the development of Katta has
been stopped. The main reason is the inclusion of
several of the Katta features within the SolrCloud
project.
Apache Blur (“Blur”, n.d.) is a distributed search
engine that can work with Apache Hadoop. It differs
from traditional big data systems in that it provides
a relational data model-like storage on top of HDFS.
Apache Blur does not use Apache Solr; however, it
consumes the Apache Lucene APIs. Blur provides
data indexing using MapReduce and advanced search
features, such as faceted search, fuzzy search,
pagination, and wildcard search. The Blur shard
server is responsible for managing shards. For
synchronization, it uses Apache ZooKeeper
(“ZooKeeper”, n.d.). Blur is still in the Apache
Incubator. The current release, version 0.2.3, works
with Hadoop 1.x and has not been validated against
the scalability features that come with Hadoop 2.x.
The third project, Storm (“Storm”, n.d.), is also in
its incubator state at Apache. Storm is a real-time
distributed computation framework. It processes huge
amounts of data in real time. Apache Storm processes
massive streams of data in a distributed manner, so
it would be a perfect candidate to build Lucene
indices over large repositories of documents once it
reaches the release state. Apache Storm uses the
concepts of spouts and bolts. Spouts are data inputs;
this is where data arrives in the Storm cluster. Bolts
process the streams that get piped into them. They can
be fed data from spouts or other bolts. The bolts can
form a chain of processing, with each bolt performing
a unit task in a concept similar to MapReduce.
3 SYSTEMS UNDER
INVESTIGATION
3.1 Solr on Cassandra
Solandra is an open-source project that uses
Cassandra instead of the operating system's file
system for storing indices in the Lucene index format
(“Lucene - Index File Formats”, n.d.). The project is
very stable. Unfortunately, the last commit dates back
to 2010. The current Solandra version available for
download uses Apache Solr 3.4 and Cassandra 0.8.6.
That is why any installation would use Solr and not
SolrCloud. The details of the Cassandra-based
distributed data storage are completely hidden behind
the CassandraDirectory class and its associated
classes. Solandra uses its own index reader called
SolandraIndexReaderFactory, which overrides the
default index reader.
Under Solandra, Solr and Cassandra both run
within the same JVM. However, with a slight
reconfiguration, we run a separate Cassandra cluster
instead. In the small implementation, the Cassandra
cluster spreads over 3 nodes; in the larger one, it
spreads over 7 nodes, as illustrated in Figure 2.
Figure 2: Our Solandra installation.
On Cassandra, each node exchanges information
across the cluster every second. A sequentially
written commit log on each node captures write
activity to ensure data durability. Data is then
indexed and written to an in-memory structure. Once
the memory structure is full, the data is written to
disk in an SSTable data file. All writes are
automatically partitioned and replicated throughout
the cluster. A cluster is arranged as a ring of nodes.
Clients send read/write requests to any node in the
ring; that node takes on the role of coordinator and
forwards the request to the node responsible for
servicing it. A partitioner decides which nodes store
which rows.
This way, both sharding and replication are
automatically made available by Cassandra. Cassandra
also guarantees the consistency of the blocks read by
its various nodes. Although fault-tolerance is a
strong feature of Cassandra, Solr itself is the single
point of failure in this implementation, due to the
absence of integration with SolrCloud. Unfortunately,
Solandra does not support the administration console
of Solr. The only management option is through the
Cassandra CLI.
3.2 Lucene on mongoDB
Another open-source NoSQL-based project is
LuMongo (“LuMongo”, n.d.). LuMongo provides the
flexibility and power of Lucene queries with the
scalability and ease of use of mongoDB. All data in
LuMongo is stored in mongoDB, including indices
and documents. mongoDB can inherently be sharded
and replicated. LuMongo itself operates as a cluster.
On error, clients can fail over to another cluster
node. Nodes in the cluster can be added and removed
dynamically through a simple CLI command. The CLI
can also query the health status of the cluster, list
available indices, get their document counts, submit
simple queries, and fetch documents.
LuMongo indices are broken down into shards
called segments. Each segment is an independent
index. A hash of the document's unique identifier
determines which segment a document's indexed
fields will be stored in. As illustrated in Figure 3,
the segments are stored in a 3-shard, 3-replica
mongoDB cluster for the small setup and in a cluster
with 7 shards and 3 replicas for the larger setup,
matching the number of LuMongo servers, which is 3
and 7, respectively.
Figure 3: Our LuMongo implementation.
In this setup, sharding is implemented in both
LuMongo and mongoDB. mongoDB takes care of
partitioning seamlessly. mongoDB guarantees the
consistency of the index store, while LuMongo
guarantees the consistency of the search result.
There is no single point of failure in either mongoDB
or LuMongo.
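The segment selection described above can be pictured with a small sketch. This is an illustration of the hashing idea only, not LuMongo's actual code; the class and method names are invented for this example.

public class SegmentRouter {

    private final int numberOfSegments;

    public SegmentRouter(int numberOfSegments) {
        this.numberOfSegments = numberOfSegments;
    }

    // Maps a document's unique identifier onto one of the segments (shards).
    public int segmentFor(String uniqueId) {
        // floorMod keeps the result non-negative even for negative hash codes
        return Math.floorMod(uniqueId.hashCode(), numberOfSegments);
    }
}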
3.3 SolrCloud
SolrCloud (Smiley et al., 2015) consists of a cluster
of Solr nodes. Each node runs one or more collections.
A collection holds one or more shards. Each shard
can be replicated among the nodes. Apache
ZooKeeper (“ZooKeeper”, n.d.) is responsible for
maintaining coordination among the various nodes. It
provides load-balancing and failover to the Solr
cluster. Synchronization of the status information of
the nodes is done in memory for speed and is
persisted to disk at fixed checkpoints. Additionally,
ZooKeeper maintains configuration information of
the index, such as schema information and Solr
configuration parameters. Usually, there is more
than one ZooKeeper node for redundancy. Together,
they form a ZooKeeper ensemble. When the cluster is
started, one of the ZooKeeper nodes is elected as a
leader. The same occurs for Solr: there is a leader
responsible for each shard.
SolrCloud distributes search across multiple
shards transparently. The request gets executed on
the leaders of all shards involved. Search is possible
in near real time, i.e., shortly after a document is
committed. Figure 4 illustrates our small cluster
implementation. We build the cluster using a
ZooKeeper ensemble consisting of 3 nodes. We
install 3 SolrCloud instances on three different
machines, define 3 shards, and replicate them 3 times.
Figure 4: Our SolrCloud Implementation.
In the larger cluster, we extend the ZooKeeper
ensemble to span 7 machines. We use 7 SolrCloud
instances to host 7 shards while keeping the
replication factor at 3.
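For illustration, the following sketch queries such a cluster through the SolrJ 5.x client, which discovers the Solr nodes via the ZooKeeper ensemble. The ZooKeeper addresses, the collection name "wikipedia", and the field names are assumptions made for this example, not the actual configuration used in our experiments.

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrDocument;

public class SolrCloudQuerySketch {
    public static void main(String[] args) throws Exception {
        // Connect via the ZooKeeper ensemble; the client routes requests to the shard leaders.
        CloudSolrClient client = new CloudSolrClient("zk1:2181,zk2:2181,zk3:2181");
        client.setDefaultCollection("wikipedia");

        SolrQuery query = new SolrQuery("content:cloud");
        query.setRows(50); // maximum number of fetched hits, as in our benchmark

        QueryResponse response = client.query(query);
        System.out.println("hits: " + response.getResults().getNumFound());
        for (SolrDocument doc : response.getResults()) {
            System.out.println(doc.getFieldValue("title"));
        }
        client.close();
    }
}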
3.4 SolrCloud on Hadoop
Building SolrCloud on Hadoop is an extension of the
implementation described in Section 3.3. The same
ZooKeeper ensemble and SolrCloud instances are
used. Solr is then configured to read and write
indices in HDFS by using the HdfsDirectoryFactory
and a lock type based on HDFS. Both come with the
current stable version of Solr (“Solr”, n.d.),
version 5.2.1. Figure 5 illustrates our small cluster
implementation. We leave replication to HDFS. We set
the replication factor on HDFS to 3 to be consistent
with the rest of the setups. For the small cluster, we
also use a 3-node Hadoop installation. For the large
cluster, we use a 7-node cluster.
Figure 5: Our SolrCloud implementation over Hadoop.
Solr provides indexing using MapReduce in two
ways. In the first way, the indexing is done on the
map side (“Solr-1045”, n.d.). Each Apache Hadoop
mapper transforms the input records into a set of
(key, value) pairs, which are then transformed into
SolrInputDocuments. The mapper task then creates an
index from the SolrInputDocuments. The reducer
performs de-duplication of the different indices
and merges them if needed. In the second way, the
indices are generated in the reduce phase (“Solr-
1301”, n.d.). Once the indices are created using
either way, they can be loaded by SolrCloud from
HDFS and used in searching. We use the first way
and employ 20 nodes in the indexing process.
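As a rough illustration of the map-side data flow (not the actual SOLR-1045 contrib code), the following mapper sketch assumes each input record is one tab-separated "title TAB content" line of the parsed Wikipedia dump and emits the (key, value) pair from which a SolrInputDocument, and in turn a shard index, is later built.

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class WikipediaIndexMapper extends Mapper<LongWritable, Text, Text, Text> {

    @Override
    protected void map(LongWritable offset, Text record, Context context)
            throws IOException, InterruptedException {
        // Assumed record layout: "title \t content"; malformed records are skipped.
        String[] parts = record.toString().split("\t", 2);
        if (parts.length < 2) {
            return;
        }
        // key = page title, value = page content; downstream, this pair becomes a
        // SolrInputDocument with title and content fields, from which the index is built.
        context.write(new Text(parts[0]), new Text(parts[1]));
    }
}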
3.5 Functional Comparison
Table 1 summarizes the functional differences be-
tween all 4 systems under investigation.
Table 1: Functional Comparison of the systems under investigation.

                 Solr on         Lucene on       SolrCloud              SolrCloud
                 Cassandra       mongoDB                                on Hadoop
Sharding         done by         done by         done by Solr           done by Solr
                 Cassandra       mongoDB
Replication      done by         done by         synchronization on     done by HDFS
                 Cassandra       mongoDB         the file system level
                                                 under the coordination
                                                 of ZooKeeper
Consistency      guaranteed by   guaranteed by   done by Solr and       guaranteed by
                 Cassandra       LuMongo and     managed by ZooKeeper   HDFS, Solr, and
                                 mongoDB                                ZooKeeper
Fault-tolerance  Solr is SPoF    no SPoF         no SPoF                no SPoF
Manageability    CLI             CLI             web console            web consoles for
                                                                        Solr and Hadoop
4 BENCHMARKING
In order to evaluate the performance of the various
search engine clusters under investigation, we
build a full-text search engine of the English
Wikipedia (“Wikipedia-dumps”, n.d.). The index is
built over 49 GB of textual content. We develop a
benchmarking platform on top of each search engine
under investigation, as illustrated in Figure 6.
The searching workload generator composes
queries of single terms, which are randomly extracted
from a long list of common English words. It
submits them in parallel to the application. The
indexing workload generator parses the Wikipedia
dump and sends the page title, the content, and other
attributes, such as the timestamp and revision number,
to the benchmarking platform workers, which in turn
pass them to the search engine cluster to be indexed.
The benchmarking platform manages two connection
pools of worker threads. The first pool consists
of several hundred searching worker threads
that process the search queries coming from the
searching workload generator. The second pool
consists of index-inserting worker threads that
process the updated content coming from the indexing
workload generator. Both worker types submit
their requests over HTTP to the search engine cluster
under investigation. The performance of the system,
including that of the search engine cluster, is
monitored using the performance monitor unit.
Figure 6: Components of the benchmarking platform.
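A condensed sketch of the searching worker pool is given below. It only illustrates the measurement loop; the endpoint URL, term list, pool size, and query count are placeholders rather than the actual values or code used in our platform.

import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.net.URLEncoder;
import java.util.Random;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

public class SearchWorkloadSketch {

    private static final String[] TERMS = {"cloud", "library", "engine", "archive"};
    private static final String ENDPOINT = "http://solr-node:8983/solr/wikipedia/select?rows=50&q=content:";

    public static void main(String[] args) throws Exception {
        int poolSize = 64; // varied between 32 and 320 in our experiments
        ExecutorService pool = Executors.newFixedThreadPool(poolSize);
        AtomicLong totalMillis = new AtomicLong();
        AtomicLong requests = new AtomicLong();
        Random random = new Random();

        for (int i = 0; i < poolSize; i++) {
            pool.submit(() -> {
                for (int q = 0; q < 1000; q++) { // each worker issues a fixed number of queries
                    try {
                        String term = TERMS[random.nextInt(TERMS.length)];
                        long start = System.nanoTime();
                        HttpURLConnection conn = (HttpURLConnection)
                                new URL(ENDPOINT + URLEncoder.encode(term, "UTF-8")).openConnection();
                        try (InputStream in = conn.getInputStream()) {
                            while (in.read() != -1) {
                                // drain the full response, i.e., the hits and their contents
                            }
                        }
                        totalMillis.addAndGet((System.nanoTime() - start) / 1_000_000);
                        requests.incrementAndGet();
                    } catch (Exception e) {
                        e.printStackTrace();
                    }
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(15, TimeUnit.MINUTES); // each experiment runs for 15 minutes
        System.out.println("average response time (ms): " + totalMillis.get() / Math.max(1, requests.get()));
    }
}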
4.1 Input Parameters and Performance
Metrics
We choose the maximum number of fetched hits to
be 50. This is a realistic assumption taking into
consideration that no more than 25 hits are usually
displayed on a web page. We choose to read the
content of these 50 hits, and not only the title, while
fetching the result-set. This exaggerated
implementation is intended to artificially stress test
the search engine clusters under investigation. The
number of search threads is varied from 32 to 320 to
match the size of the connection pool of searching
worker threads. In case of high load, the workload
generator distributes its search threads over 4
physical machines to avoid the requests being
throttled by the hosting client. Due to locking
restrictions inherent in Lucene, we restrict our
experiments to a maximum of one indexing worker per
node in the search engine cluster.
In all our experiments, we monitor the response
time of the search operations from the moment of
submitting the request until receiving the overall
result. We also monitor the system throughput in
terms of:
searches per second, and
index inserts per hour.
Additionally, the performance monitor constantly
tracks the CPU and memory usage of the machines
running the search engine cluster.
4.2 System Configuration
In order to neutralize the effect of using virtualized
nodes in global cloud data centers, such as Amazon
EC2 or Microsoft Azure, we conduct our experiments
in an isolated cluster available at the Internet
Archive of the Bibliotheca Alexandrina (“Internet
Archive BA”, n.d.). The Bibliotheca Alexandrina
possesses a huge dedicated computer center for
archiving the Internet, digitizing material at the
Bibliotheca Alexandrina, and hosting other digital
collections.
The Internet Archive at the Bibliotheca
Alexandrina has about 35 racks; each rack comprises
30 to 40 nodes and a gigabit switch connecting them.
The 35 racks are also connected with a gigabit
switch. The nodes are based on commodity servers
with a total capacity of 7000 TB.
The Bibliotheca Alexandrina dedicated one rack
with 20 nodes to our research for approximately one
month. The nodes are connected with a gigabit switch
and are isolated from the activities of the Internet
Archive during the period of our experiments. Each
node has an Intel i5 CPU at 2.6 GHz, 8 GB of RAM, and
4 SATA hard disks of 3 TB each.
For each search engine cluster, we construct a
small version and a larger one, as described in
Section 3. The small cluster consists of three nodes,
each containing a shard (a portion of the index),
while the larger one is built over 7 nodes. All
installations have a replication factor of 3.
4.3 Indexing
Indexing speed varies largely with the number of
nodes involved in the index building operation.
Lucene, and hence Solr, employs a pessimistic locking
mechanism while inserting data into the index.
This locking mechanism is kept in all backend
implementations. From our current experiments and
from previous ones (Nagi, 2007), we conclude that
there is no benefit in having more than one indexing
thread per Lucene index (or Solr shard).
This means that increasing the number of shards,
each with its dedicated Lucene/Solr indexing thread,
yields a proportional increase in indexing speed. The
increase is also linear for all systems under
investigation. In other words, the indexing speed of
a 3-node cluster is 3 times that of a cluster
consisting of a single node. Correspondingly, the
indexing speed of a 7-node cluster is 2.3 times that
of a cluster consisting of 3 nodes. A clear winner in
this contest is SolrCloud on Hadoop, which employs
MapReduce in indexing. Using all 20 nodes available
in the MapReduce operation increases the speed by a
factor of 18. A minimal overhead is spent later on
merging the indices onto 3 and 7 nodes, respectively.
In order to normalize the comparison between all
systems, we plot in Figure 7 the throughput of one
indexing thread on a 3-shard, 3-replica cluster.
These numbers are roughly multiplied by the number of
nodes involved to get the overall indexing speed.
Figure 7: Normalized indexing speed.
On the normalized scale, the NoSQL backends give
very different results. Cassandra has by far the
fastest insertion rate (60% faster than SolrCloud).
This experiment confirms the results reported by
(Rabl et al., 2012), which show the high throughput
of Cassandra compared to other NoSQL databases. On
the other hand, the mongoDB-based storage is the
slowest. SolrCloud gives very good results on the
file system. The overhead of storing on HDFS is about
26%, which is very acceptable taking into
consideration the advantages of storing data on
Hadoop clusters in cloud environments and the huge
speed-ups due to the use of MapReduce in indexing.
4.4 Searching
Searching is more important than indexing. We
repeat the search experiments with the number of
search threads varying from 32 to 320. The duration
of each experiment is set to 15 minutes to eliminate
any transient effect.
The set of experiments is repeated for both the
small cluster and the large cluster. The response time
for the small cluster is illustrated in Figure 8 and
for the large cluster in Figure 9. The throughput in
terms of the number of searches per second versus the number of
searching threads is plotted in Figure 10 for the
small cluster and in Figure 11 for the larger one.
The bad news is that the response time of the
single Solr instance on the Cassandra cluster is far
higher than that of the other systems (>10 seconds),
so we omit its values from the plots for both
clusters. The same applies to its throughput, which
was much lower than that of its counterparts
(<50 searches/second). Again, this matches the
findings in (Rabl et al., 2012), where the high
throughput of Cassandra comes at the cost of read
latency.
The good news is that the response time of the
other systems is well below the usual 3-second
threshold tolerated by a searching user. The maximum
search time measured is below 1.8 seconds on the
small cluster and below 1.4 seconds on the larger
cluster. The curves also show that the response time
of the larger cluster is better than that of the
smaller cluster under all settings. This means that
the performance of the system is enhanced by
increasing the number of nodes; the system has not
yet reached saturation.
The figures also illustrate the impact of HDFS
on the response time and the overall throughput of
the search. Although the search time increases by
almost 40% and the throughput is almost halved, the
absolute values remain far below the user threshold
of 3 seconds: the hits and the contents of each hit
for a result-set of size 50 are retrieved in less
than 2 seconds.
Another important remark is that the performance
of all systems degrades gracefully with the increase
of the workload, except for LuMongo. Under heavy
workloads (192 threads for the small cluster and 288
for the large cluster), LuMongo runs out of heap
memory. We track the problem down to the fetching of
the document contents after the document ids are
returned from the search engine. There is a small
memory leak in LuMongo that causes the searches to
abort under heavy loads. Once this is solved in
future releases of LuMongo, LuMongo will be a very
attractive choice given its superior response time,
illustrated in Figure 8 and Figure 9.
Figure 8: Search time on the small cluster.
Figure 9: Search time on the large cluster.
The throughput curves in Figure 10 and Figure 11
illustrate that the throughput saturates after a
certain number of concurrent search threads. In the
small cluster (Figure 10), the three setups saturate
at 64 concurrent threads. In the large cluster
(Figure 11), this number increases to 128.
Figure 10: Throughput of the small cluster.
Figure 11: Throughput of the large cluster.
5 CONCLUSION AND FUTURE
WORK
In this paper, we investigate the available options for
building large-scale search engines that are capable
of running in the cloud. We restrict ourselves to
open-source libraries, including Lucene, Solr,
mongoDB, Cassandra, and Hadoop. We explicitly do not
add extra implementation other than publicly
available components. We investigate each variation
in terms of scalability through data partitioning,
redundancy through replication, and consistency
either through the NoSQL databases or through
open-source synchronization libraries, such as
ZooKeeper. The ease of management of the multi-node
cluster is also an important issue in our evaluation.
Performance plays a major part in our analysis. We
build a benchmarking platform on top of the systems
under investigation. For each variation, we construct
a small and a large cluster. In our experiments, we
measure the speed of indexing as well as the search
time and the throughput of the searching threads. The
results of the experiments show that Solr and Hadoop
provide the best tradeoff in terms of scalability,
stability, and manageability. Search engines based on
NoSQL databases offer either superior indexing speed
or fast searching times. Unfortunately, they suffer
from stability issues in their integration
implementations.
In the future, we plan to contribute to LuMongo
by fixing its memory leak problem. A good
contribution would also be the extension of Solandra
to support SolrCloud instead of a single Solr
instance. Having done this, the owner of a
large-scale search engine would have the choice
between using the Hadoop infrastructure or a NoSQL
cluster installation, depending on the availability
in his/her environment and his/her expertise.
ACKNOWLEDGEMENTS
We would like to thank the Bibliotheca Alexandrina
for providing us with the necessary hardware for
conducting the benchmarking experiments.
REFERENCES
Akioka, S. and Muraoka, Y., 2010. HPC Benchmarks on
Amazon EC2, Proceedings of the IEEE 24th International
Conference on Advanced Information Networking and
Applications Workshops (WAINA).
Bojanova, I. and Samba, A., 2011. Analysis of Cloud
Computing Delivery Architecture Models, IEEE
Workshops of International Conference on Advanced
Information Networking and Applications (WAINA).
Blur, n.d., Apache Blur (Incubating) Home,
https://incubator.apache.org/blur/, retrieved July
2015.
Brewer, E., 2000. Towards Robust Distributed Systems.
ACM Symposium on Principles of Distributed Compu-
ting.
Cutting, D. and Pedersen, J., 1990. Optimizations for
Dynamic Inverted Index Maintenance, Proceedings of
SIGIR ’90.
DB-Engines, n.d., Knowledge Base of Relational and
NoSQL Database Management Systems, http://db-
engines.com/en/ranking, retrieved July 2015.
Dean, J. and Ghemawat, S., 2008. MapReduce: simplified
data processing on large clusters. Communications of
the ACM. 51, 1, 107–113.
Edlich, S., Friedland, A., Hampe, J., Brauer, B., 2010.
NoSQL: Introduction to the World of non-relational
Web 2.0 Databases (In German) NoSQL: Einstieg in
die Welt nichtrelationaler Web 2.0 Datenbanken,
Hanser Verlag.
Internet Archive BA, n.d., Internet Archive at Bibliotheca
Alexandrina, http://www.bibalex.org/en/project/
details?documentid=283, retrieved July 2015.
Karambelkar, H.V., 2015. Scaling Big Data with Hadoop
and Solr, Packt Publishing, 2nd Edition.
Katta, n.d., http://katta.sourceforge.net/, retrieved July
2015.
Khare, R. et al., 2004: Nutch: A flexible and scalable
open-source web search engine. Technical Report Or-
egon State University. 1, 32–32.
Kuc, R. and Rogozinski, M., 2015. Mastering
Elasticsearch, Packt Publishing, 2nd Edition.
Lakshman, A. and Malik, P., 2010. Cassandra: a decentral-
ized structured storage system. SIGOPS Operating
Systems Review, 44(2):35–40.
Lucene - Index File Formats, n.d. https://lucene.apache.
org/core/3_0_3/fileformats.html, retrieved July 2015.
Lucene - Class LockFactory, n.d., http://lucene.apache.
org/core/4_8_0/core/org/apache/lucene/store/LockFa
ctory.html, retrieved July 2015.
LuMongo, n.d., LuMongo Realtime Distributed
Search, http://lumongo.org/, retrieved July 2015.
McCandless, M., Hatcher, E., and Gospodnetić, O., 2010.
Lucene in Action, Manning, 2nd Edition.
Nagi, K., 2007. Bringing Information Retrieval Back To
Database Management Systems, Proceedings of
IKE'07, International Conference on Information and
Knowledge Engineering.
Neo4j, n.d., http://www.neo4j.org, retrieved July 2015.
Pessach, Y., 2013. Distributed Storage: Concepts, Algo-
rithms, and Implementations, CreateSpace Independ-
ent Publishing Platform.
Plugge, E., Hawkins, D., and Membrey, P., 2010. The
Definitive Guide to mongoDB: The NoSQL Database
for Cloud and Desktop Computing, Apress.
Rabl, T. et al., 2012. Solving big data challenges for en-
terprise application performance management, Pro-
ceedings of the VLDB Endowment, Volume 5 Issue 12,
pp 1724-1735.
Redis, n.d., http://redis.io/, retrieved July 2015.
Solr, n.d., Solr - Apache Lucene - The Apache Software
Foundation! http://lucene.apache.org/solr/, retrieved
July 2015.
Solr-1045, n.d., Build Solr index using Hadoop MapRe-
duce, https://issues.apache.org/jira/browse/SOLR-
1045, retrieved July 2015.
Solr-1301, n.d., Add a Solr contrib that allows for building
Solr indices via Hadoop's Map-Reduce., https://issues.
apache.org/jira/browse/SOLR-1301, retrieved July
2015.
Smiley, D., Pugh, E., Parisa, K., and Mitchell, M.,
2015. Apache Solr Enterprise Search Server, Packt
Publishing, 3rd Edition.
Storm, n.d., Storm - The Apache Software Foundation,
https://storm.apache.org/, retrieved July 2015.
Wikipedia-dumps, n.d., Wikipedia article dump,
https://dumps.wikimedia.org/enwiki/, retrieved July
2015.
ZooKeeper, n.d., Apache Zookeeper, https://zookeeper.
apache.org/, retrieved July 2015.