We built our prototype using powerful tools that
are either open source or freely available for research
purposes.
First of all, we used HBase as the basic tool for
data management, as it is highly scalable and fault-tolerant.
HBase is an open-source, non-relational,
distributed database system, modeled after Google's
BigTable and developed as part of the Apache Software
Foundation's Apache Hadoop project. Moreover, since
it is not tied to a fixed schema, we can
add new columns (attributes) to predefined column family
tables without any interruption in service delivery.
The latter feature has been crucial for data enrichment,
since mining results may affect only a limited
portion of the overall database.
HBase also enables fast reading of data, thus making
the querying phase quite efficient. Unfortunately, for
full-text search, response times are still too high,
limiting the usability of the system; therefore, we
implemented a specialized index using Solr, an
open-source enterprise search platform from the
Apache Lucene project, whose major features include
full-text search, hit highlighting, faceted search, real-time
indexing, dynamic clustering, database integration,
NoSQL features and rich document (e.g., Word,
PDF) handling.
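To illustrate the schemaless enrichment described above, the following Python sketch models an HBase row as a map from "family:qualifier" to values (the row layout and field names are hypothetical, not the prototype's actual code), showing how a mining step can add a new qualifier to a subset of rows only, with no schema migration:

```python
# Hypothetical sketch: an HBase row is modeled as a dict from
# "family:qualifier" to values, so an enrichment step can add a new
# qualifier to selected rows only, without touching the others.

def enrich(rows, row_keys, qualifier, compute_value):
    """Add family:qualifier only to the rows listed in row_keys."""
    for key in row_keys:
        row = rows[key]
        row[qualifier] = compute_value(row)

# Two hotel rows; only one is touched by the mining step.
rows = {
    "hotel-001": {"info:name": "Hotel Aurora", "info:city": "Rome"},
    "hotel-002": {"info:name": "Casa Bella", "info:city": "Milan"},
}
enrich(rows, ["hotel-001"], "mining:sentiment", lambda r: "positive")

print(rows["hotel-001"]["mining:sentiment"])   # positive
print("mining:sentiment" in rows["hotel-002"])  # False
```

The untouched rows simply lack the new qualifier, mirroring HBase's sparse storage of column families.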
Solr's support for faceted navigation
turned out to be very useful for our purposes:
facets model data dimensions along which
users can drill down or roll up. Furthermore,
facets allow suggestions to users based on previously
performed searches. We fully exploited the SolrCloud
release, which is highly scalable and fault-tolerant
and supports distributed indexes,
as it relies on HDFS as its file system. We used
ZooKeeper (another software project of the Apache
Software Foundation) to provide distributed configuration
and synchronization services, as well as a
naming registry for large distributed systems. We point
out that, since HBase uses the same services, the overall
architecture turned out to be rather powerful and
flexible.
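A faceted drill-down of this kind can be expressed through Solr's standard query parameters; the sketch below (collection and field names are illustrative assumptions, not taken from the prototype) builds such a request URL with Python's standard library:

```python
from urllib.parse import urlencode

# Illustrative sketch: build a Solr select URL that requests facet counts
# on a field and, when the user drills down on a facet value, adds a
# filter query (fq) restricting the results to that value.
def facet_query(base_url, collection, text, facet_field, drill_down=None):
    params = [
        ("q", text),
        ("facet", "true"),
        ("facet.field", facet_field),
    ]
    if drill_down is not None:  # user clicked a facet value
        params.append(("fq", f"{facet_field}:{drill_down}"))
    return f"{base_url}/{collection}/select?{urlencode(params)}"

url = facet_query("http://localhost:8983/solr", "hotels",
                  "spa", "city", drill_down="Rome")
print(url)
```

Rolling up simply means dropping the `fq` parameter again, which is why facets map so naturally onto iterative search refinement.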
As we want to achieve full integration between
HBase and SolrCloud, we need to take multivalued
attribute management into account. Indeed, SolrCloud
provides native support for multivalued storage,
whereas in HBase we need to suitably pre-process
such attributes (e.g., by adding a column suffix). As an example,
consider a hotel having several email contacts. Using
HBase we can model this as follows:
columnFamily_anagrafic_info[email_1:<value_1>, email_2:<value_2>, ..., email_n:<value_n>].
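The column-suffix convention shown above can be produced mechanically; the following Python fragment (the family and attribute names are illustrative) flattens a list of values into numbered HBase column qualifiers:

```python
# Sketch of the pre-processing step described above: HBase has no native
# multivalued cells, so a list of values is flattened into numbered
# qualifiers email_1, email_2, ... within a column family.
def to_suffixed_columns(family, attribute, values):
    return {
        f"{family}:{attribute}_{i}": v
        for i, v in enumerate(values, start=1)
    }

cols = to_suffixed_columns("columnFamily_anagrafic_info", "email",
                           ["front@hotel.example", "booking@hotel.example"])
print(cols)
```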
By contrast, SolrCloud allows the definition
of a multivalued field as follows:
<field name="email" type="string" indexed="true" multiValued="true" stored="true"/>
In order to guarantee full integration of both systems,
we need to provide a mapping between them;
this can be performed with Morphline, a
command-based framework that simplifies data
preparation for Apache Hadoop workloads. The configuration
file (named morphline.conf) contains
commands like the one reported in the following:
extractHBaseCells {
  mappings : [
    {
      inputColumn : "columnFamily_anagrafic_info:email"
      outputField : "email"
      type : string
      source : value
    }
  ]
}
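Conceptually, this mapping applies a cell-to-document transformation to every HBase row; the Python sketch below models that transformation (the row contents are illustrative, and this simplified model ignores Morphline details such as type conversion):

```python
# Illustrative model of what the extractHBaseCells mapping does: cells
# whose qualifier matches the configured input column become values of
# the configured output field in the resulting Solr document.
def extract_cells(row, input_column_prefix, output_field):
    doc = {}
    values = [
        v for qualifier, v in row.items()
        if qualifier.startswith(input_column_prefix)
    ]
    if values:
        doc[output_field] = values  # multivalued Solr field
    return doc

row = {
    "columnFamily_anagrafic_info:email_1": "front@hotel.example",
    "columnFamily_anagrafic_info:email_2": "booking@hotel.example",
    "columnFamily_anagrafic_info:city": "Rome",
}
doc = extract_cells(row, "columnFamily_anagrafic_info:email", "email")
print(doc)
```

Note how the suffixed HBase qualifiers collapse back into a single multivalued Solr field, reconciling the two storage models.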
In order to speed up index construction, we exploit
MapReduce, which allows batch construction of
the overall index by accessing all the nodes in the cluster.
However, in some application scenarios (e.g.,
monitoring systems) we need (near) real-time indexing;
this can be achieved with the Lily HBase Indexer, which provides
the ability to quickly and easily search any content
stored in HBase by indexing HBase rows into Solr,
without writing a line of code. Indeed, it is fully compliant
with several ingestion tools such as Flume,
a distributed Apache service for efficiently collecting,
aggregating and moving large amounts of streaming
data, so that data are available for searching
immediately after their insertion into the data storage
layer.
2 BACKGROUND ON COMPLEX
SEARCHING
A typical example of a system devoted to complex data
querying is the search engine. The results
returned by the engine cannot be considered
merely a custom map built from the query results:
based on them, additional knowledge about the data
being queried can be learnt by iterative refinement of
search dimensions and parameters, as reported in Figure 1.
In this process, the type of search being performed
has to be taken into account. Indeed, there
is a big difference between the simple search for
well-defined terms and dynamic learning through
exploratory search. Obviously enough, in the first
case a search engine such as Google is able to give
Figure 1: Learning By Results.
DATA 2015 - 4th International Conference on Data Management Technologies and Applications