Desing of Full-text Search for Database and Linkedin Social

Network in Electrophysiology

Jan Štěbeták, Roman Mouček and Jan Koreň

Department of Computer Science and Engineering, University of West Bohemia, Univerzitní 8, Pilsen, Czech Republic

Keywords: Neuroinformatics, Electroencephalography, Event-related Potentials, Information Retrieval, Full-text

Search, Index Design.

Abstract: EEG/ERP (electroencephalography, event-related potential) laboratories produce experimental data and

metadata. Authors’ research group has contributed to the building of a neuroinformatics infrastructure by

developing and integrating data management and analytic tools for EEG/ERP research - the EEG/ERP

Portal. With the development of the Portal and the increasing amount of data/metadata, a proper full text

search mechanism for efficient information retrieval is necessary to improve the user experience. The

presented solution combines search over data/metadata stored in an electrophysiological database and in the

LinkedIn social network. Open source search engines, criteria, suitable engine selection, and index design

are presented. Integration of the full-text solution to the EEG/ERP Portal is described.

1 INTRODUCTION

Accessing required information from a large set of

data in a quick and user-friendly manner is no longer

an unachievable goal. Advancements in the field of

information retrieval in the last few decades have

made its applications very common. Full-text search,

as one of such applications, has in fact become an

essential part of everyday’s life in a modern society.

Our research group specializes in the research of

brain activity; especially attention of drivers is

investigated. We widely use the methods of

electroencephalography (EEG) and event-related

potentials (ERP). EEG/ERP experiments usually

take long time and produce a lot of data.

As the members of the XXX National Node of

International Neuroinformatics Coordinating Facility

(INCF, 2011) we participate in definition and

development of standardized formats for

electrophysiology research. Our efforts, following

INCF recommendations (Pelt and Horn, 2007)

resulted in the central custom solution - the

EEG/ERP Portal. This portal serves as a managing

application for storing and managing experimental

data and metadata.

In this paper we briefly present the common

architecture of full-text search engines. Specific

open-source full-text search engines are also

described. Then the basic requirements and selection

of a suitable full-text search engine are introduced.

The next section is focused on creating a document

model for indexed data and metadata stored in

EEG/ERP Portal and articles or discussion in the

LinkedIn social network. Since a model of our

electrophysiological database is specific, an

identification of domain entities had to be made first.

We combine indexing data from both sources (the

database and the LinkedIn) into a common index.

The presented solution of full-text search is

integrated into the EEG/ERP Portal. This solution

will enable users of the EEG/ERP Portal to retrieve

information from database and the LinkedIn in one

place.

2 STATE OF THE ART

This section briefly describes the architecture of full-

text search engines, open source engines and the

EEG/ERP Portal.

2.1 Architecture of Full-text Search

Engines

A full-text search engines comprise several steps in

order to provide a user with search results to a given

238

Št

ebeták J., Mou

cek R. and Kore

n J..

Desing of Full-text Search for Database and Linkedin Social Network in Electrophysiology.

DOI: 10.5220/0004748602380243

In Proceedings of the International Conference on Health Informatics (HEALTHINF-2014), pages 238-243

ISBN: 978-989-758-010-9

 2014 SCITEPRESS (Science and Technology Publications, Lda.)

query. Query is in this case a text phrase, optionally

enriched by special operators which serve for

refining the query. It is a user of the search system

who comes up with the query, expecting that the

system will fulfill his/her information need by

returning relevant search results (Koren, 2013).

2.2 Available search engines

Indri (Indri, 2013) is an academic C++ based text

search engine developed at the University of Massa-

chusetts and is a part of the Lemur Project. Its API

is accessible also from other languages such as Java

or C#. In the technical paper (Turtle, Hedge and

Rowe, 2012) it was shown that Indri is achievers in

comparison with Lucene even slightly better results

in terms of retrieval effectiveness for short queries,

index size and performance.

Lucene (Lucene, 2013) is an open source Java-

based search library. It can enrich applications by its

ability to index data and search over them.

According to (Middleton and Baeza-Yates, 2007),

it belongs to the fastest engines and its performance

is high especially when querying one- or two-word

phrases. Its small size of index it creates is also a

plus when the lack of memory could be an issue.

Solr (Solr, 2013) is an open source search server

built on top of Lucene which runs within a servlet

container, such as Jetty or Tomcat. Because it can be

considered as a Lucene extension, most of the

Lucene terminology is used for Solr as well. Solr

extends Lucene by providing many useful features

related to full-text search, e.g. keyword highlighting,

faceted search, or rich document. Since Solr runs as

a separate process, it communicates with

applications via HTTP requests, which represent

query data, and HTTP responses, representing

search results found in the index, by exposing its

REST-like API.

Sphinx (Sphinx, 2013) is an open source search

server distributed under GPL license. It consists of

an indexing tool, which is also referred to as indexer,

and a searching daemon. Sphinx is written in C++

and is especially designed for indexing database

systems (it integrates well with MySQL). Among its

strengths we can name quick installation and

deployment and a perfect online documentation for

an open source project.

2.3 EEG/ERP Portal

The purpose of the EEG/ERP Portal (Jezek and

Moucek, 2010) (Fig. 1) is to serve as a managing

tool for EEG/ERP experiments that enables to share

and interchange stored experiments (data, metadata,

experimental scenarios, etc.) among interested

laboratories.The EEG/ERP Portal is developed as a

standalone product running on servers in our

department. The usage of the Portal does not require

any software installation, only a web browser.

Data and metadata are stored in the Oracle

relational database. Access to the database is

ensured with the Hibernate framework (Hibernate,

2013). Each database entity is presented by the Java

POJO class. The model of the database will be

described in Section 4.

The EEG/ERP Portal is registered in the

LinkedIn social network. Articles, news, and

discussion are stored in the database and also in the

LinkedIn, where we created a research group. Posts

in the LinkedIn are not stored in our database.

Currently, we design a full-text search solution,

which allows searching information in the database

and also in the LinkedIn. That enables users of the

EEG/ERP Portal to retrieve information from both

sources (database and LinkedIn) in one place.

Figure 1: EEG/ERP Portal.

3 SELECTION OF SUITABLE

FULL-TEXT SEARCH ENGINE

Based on our expectations (such as speed or

independence of data sources) and used technologies

by developing the EEG/ERP Portal, the selection of

full-text search engine is restricted by the following

criteria (Koren, 2013):

 Speed - the speeds of compared search engines

differ only slightly and the observable

differences depend on a specific use case.

Therefore, all listed search engines can be

considered as a suitable solution.

DesingofFull-textSearchforDatabaseandLinkedinSocialNetworkinElectrophysiology

239

 Integration with the EEG/ERP Portal - The

EEG/ERP Portal is based on Java technologies

and this is why search engines providing Java

API are easier to be integrated to the working

infrastructure and therefore preferred.

 Other Features and Extension – Since we offer

comfort to users of the EEG/ERP Portal, it is

necessary to have a set of built-in features such

as result highlighting, faceted search, synonym

search, etc.

 Independence of Data Sources - The chosen

search engine must be able to accept data from

various sources and not to be limited only to

one specific data source, such as relational

database. The reason behind this is the

mentioned need to index LinkedIn articles as

well as to enable further possible indexing

scenarios in the future, such as indexing .pdf or

XML files.

 Independence of other Technologies - This

criterion means that the search engine should

not rely on a specific technology to be used.

Dependence of Hibernate Search on Hibernate

or heavy orientation of Sphinx on MySQL may

serve as the examples of the search engines

which perform well if certain conditions are

met, but cannot run or do not perform well if

not.

 Community - Numerous and active developer

community also plays a big role in the final

choice. The bigger community around the

search engine is, the higher is the chance that

the engine development will not stop early,

new features will be introduced and found bugs

will be resolved quickly.

The search engines were evaluated based on

these criteria. Since it was stated for the speed

criterion that its differences for the discussed search

engines were not significant, it is not included in the

final evaluation.

Note that the last criterion, community cannot be

evaluated by an exact manner. This criterion

involves the following four points: the size of

mailing lists, the number of search results about the

search engines found on Google, the number of

related blog posts as well as the number of posts on

specialized websites.

According to the evaluation, the Lucene based

full-text search Solr was chosen.

4 INDEX DESIGN

The full-text search engines create an index

document that ensures faster search through source

texts or databases. This section is focused on

identifying domain entities and design of a common

index structure for the electrophysiological database

that the EEG/ERP Portal uses and the LinkedIn

social network.

4.1 Identification of Domain Entities

Data and metadata are logically stored into tables.

Unfortunately, our database contains 71 tables.

Therefore, we do not attach our data model as a

figure. Instead of the data model, we describe core

data that we store. There are tables containing core

data. These tables represent domain entities. Other

tables extend core data. In relation to POJO classes,

the domain entities are represented by parent classes

in the full-text search context. There exist one-to-

one or one-to-many associations between parent

POJO classes and child classes.

The following list shows domain entities. It also

captures relations between parent and child classes

(Koren, 2013):

 Article - Articles are represented by the Article

class. It contains the title and text string fields

that hold values of article title and its text,

respectively. Articles and news are stored in

the database and also in the LinkedIn social

network. Articles can be commented on, so

each article may have one or more article

comments associated. Text of the comments

themselves is contained in the ArticleComment

instances.

 Experiment - This class has the field

environmentNote to describe the experiment

environment. Apart from this field, it also

refers to related Weather, Disease, DataFile,

Hardware and Software objects that include

information about weather, diseases of a tested

subject, related data files, and used hardware

and software during an experiment,

respectively.

 Person - This class contains a person’s name,

surname and a note about the person. Although

there are also associations to other classes, it is

not necessary to include them for indexing and

searching purposes.

 Research group - This class keeps a name, title

and description of a research group and as in

the case of the Person class, its structure

HEALTHINF2014-InternationalConferenceonHealthInformatics

240

matches the structure of its domain entity, so

no other referenced class instances are needed.

 Scenario - This class contains its name, title

and description. The class itself corresponds to

its domain entity.

The full-text search functionality is oriented on

searching and displaying information about the

domain entities mentioned above.

Many POJO classes possess the title and

description fields. It is also worth noticing that the

fields description, text, and note contain a text

information and have a similar meaning. On the

other hand, there are several class fields that are

distinctive only for a few or for a single domain

entity such as the temperature field in the

Experiment class.

4.2 Ensuring Document Uniqueness

When storing different types of documents (entities)

under a single index, a problem of document

uniqueness arises. Although a relational database

record can be uniquely identified by its primary key

and a source table, its information about the source

table is lost after its corresponding Solr document is

created. Hence, the original uniqueness is lost as

there might be documents created from records

coming from different tables and having the same id.

The following fields related to unique

identification of documents were added in the Solr

schema file:

 Id - The purpose of this field is to keep an

original ID value.

 Class - Identifies a type of indexed document

(entity). In combination with id value it

provides an identification of a specific object.

 Uuid - This field serves as a unique identifier

for each document in the index. It consists of

two concatenated parts. The first part is the

class name of an indexed object; the second

part is the ID value of the object. For example,

the uuid value of an article with ID 20 is

cz.zcu.kiv.eegdatabase.data.pojo.Article20.

4.3 Proposed Index Structure

Based on the structure of the database, object

hierarchy, and the fields contained in these objects,

the following index fields were configured:

 Title - It is a title of a parent object.

 Text - It is a longer textual sequence such as

description or a note of parent object. It also

includes text of articles or discussion stored in

the database or the LinkedIn within our

research group.

 Name - It contains a person’s name.

 Source - It determines if the object’s source is

the database or LinkedIn (in case of LinkedIn

articles).

 Temperature - It stores a temperature during an

experiment

 Param_datatype – A model of our database

allows adding additional information about

core tables. This field represents a data type or

value of an additional parameter of a parent

POJO class.

 File_mimetype - It stores mimetype values of

stored data file of parent POJOs. In this case, it

concerns the Scenario class.

 Child_title - It is a multi-valued field which

stores all titles of child objects.

 Child_text - This field is analogous to the text

field. The difference is that it is a multi-valued

field and stores all textual values of child

objects.

 Child_File_mimetype - It is used for the same

reasons as the File_mimetype field, except that

it is used for child object values (e.g. DataFile

class).

5 INTEGRATION INTO EEG/ERP

PORTAL

This section describes necessary implementation

steps to integrate the Solr based full-text search into

the EEG/ERP Portal.

To separate common functionality of indexing

from the specialized step during which a Solr

document is built, the Template Method design

pattern (Gamma, et al. 1994) was used. By applying

this design pattern, the class hierarchy depicted in

Figure 2 was implemented.

The parent abstract class Indexer takes care of

performing the generic steps only by providing an

HttpSolrServer instance.

PojoIndexer class is responsible for indexing

data from the database. We created a set of

annotations, which POJO classes are marked with.

The annotation @Indexed means that the marked

class should be indexed. The annotation @SolrField

marks an attribute of a class and set the relationship

with belonging field in the index schema. The

annotation @SolrId marks an attribute representing

unique identifier of indexed class. Indexing using

annotation interface works in cooperation with Java

reflection.

DesingofFull-textSearchforDatabaseandLinkedinSocialNetworkinElectrophysiology

241

Figure 2: UML diagram of Indexer classs (Koren, 2013).

LinkedInIndexer class is responsible for indexing

articles, news and, discussion from the LinkedIn

social network within our research group. We use

LinkedIn REST api via Spring Social.

The following example shows the usage of field

selectors in the REST call. The REST call in the

listing gets full information about twenty latest

LinkedIn articles published in the EEG/ERP Portal

group. Note the group-id placeholder which is later

substituted by the real value of the EEG/ERP Portal

group ID.

http://api.linkedin.com/v1/groups/{grou

p-id}/ posts: (creation -timestamp,

title, summary, id, creator:(first -

name ,last -name))? count=20& start =0&

order=recency"

Data stored in the database are indexed on

change. However, LinkedIn articles are independent

of the EEG/ERP Portal and cannot be indexed on

change. Therefore, a periodical indexing was

created. In Figure 3, the created user interface is

shown.

5.1 Addition of a New Entity

The presented index schema is dependent on the

database structure. It usually happens that the

database is modified, new entities are added. When a

new entity is added, two situations arise:

 A new entity includes metadata that are already

covered by the index schema (e.g. titles, texts,

notes, or descriptions). In this case, no change

of the index schema is required.

 A new entity includes new type of metadata. In

this case, it is necessary to add this attribute as

a field into the index schema.

6 RESULTS

We implemented and successfully integrated a Solr

based solution of full-text search into the EEG/ERP

Portal.

The Solr server runs at the address

147.228.63.134/solr.

We created several JUnit tests. These tests covers

use cases such as searching a string in the database

or LinkedIn and returning whole record, indexing

and unindexing (when data is removed from the

database) of records, or verification of proper

indexing (Koren, 2013).

Currently, the database includes approximately

200 experiments. Time requested to display full-text

search results is 1s.

7 CONCLUSIONS

We created a research group in LinkedIn. This group

is connected to the EEG/ERP Portal. Users of our

Portal can contribute to the Portal or the LinkedIn.

Therefore, we presented solution combines full-text

search data/metadata from an electrophysiological

database and the LinkedIn social network. Based on

requirements described in Section 3, the full-text

search engine Solr was chosen.

The index schema presented in Section 4 is

designed according to our database. However, new

entities can be easily added. If a new entity includes

type of metadata that are already included in the

index schema, no change of the index schema is

HEALTHINF2014-InternationalConferenceonHealthInformatics

242

Figure 3: Full-text search user interface.

required. Otherwise, the new attribute can be simply

added into the schema.

Our solution puts together search over data and

metadata from the database and news and discussion

from the LinkedIn within our research group.

Integration to the EEG/ERP Portal allows users

searching from both sources in one place.

Our future work will focus on improvement of

the full-text search solution. It includes defining

synonyms of electrophysiology research and

integrating them into the full-text search solution in

the EEG/ERP Portal.

ACKNOWLEDGEMENTS

The work was supported by the UWB grant SGS-

2013-039 Methods and Applications of Bio- and

Medical Informatics.

REFERENCES

International Neuroinformatics Coordinating Facility

(INCF, 2011),

http://www.incf.org/about/what-is-neuroinformatics

Pelt, J. van, Horn J. van, 2007, Workshop report. 1st INCF

Workshop on Sustainability of Neuroscience

Databases. Stockholm.

Middleton, C., Baeza-Yates, R., 2007, “A comparison of

open source search engines,” tech. rep., 10.

http://wrg.upf.edu/WRG/dctos/Middleton-Baeza.pdf

Indri homepage (2013), http://

www.lemurproject.org/indri.php

Turtle, H., Hegde, Y., Rowe, S. A., 2012, “Yet another

comparison of lucene and indri performance”, in

Proceedings of the SIGIR 2012 Workshop on Open

Source Information Retrieval, vol. 46, pp. 64–67.

Lucene homepage (2013), http://lucene.apache.org/

Sphinx homepage (2013), http://sphinxsearch.com

Solr homepage (2013), http://lucene.apache.org/solr

Gamma E., Helm, R., Johnson, R., Vlissides, J., 1994,

Design Patterns: Elements of Reusable Object-

Oriented Software. Addison-Wesley Professional, 1

ed.

Hibernate (2013), http://www.hibernate.org/

Jezek, P., Moucek, R., 2010, “System for storage and

management of EEG/ERP experiments – generation

on ontology” Databases and Information System

Integration vol. 1, Madeira: SciTePress, ICEIS.

Koren, J., 2013 “Fulltext search in the database and in

text of social netwoks” Master Thesis, Pilsen.

DesingofFull-textSearchforDatabaseandLinkedinSocialNetworkinElectrophysiology

243