LINK INTEGRATOR

A Link-based Data Integration Architecture

Pedro Lopes, Joel Arrais and José Luís Oliveira

Universidade de Aveiro, DETI/IEETA, Campus Universitário de Santiago, 3810 – 193 Aveiro, Portugal

Keywords: Data integration, Link integration, Service integration, Web application, Web2.0, Lsdb.

Abstract: The evolution of the World Wide Web has created a great opportunity for data production and for the

construction of public repositories that can be accessed all over the world. However, as our ability to

generate new data grows, there is a dramatic increase in the need for its efficient integration and access to

all the dispersed data. In specific fields such as biology and biomedicine, data integration challenges are

even more complex. The amount of raw data, the possible data associations, the diversity of concepts and

data formats, and the demand for information quality assurance are just a few issues that hinder the

development of a general proposal and solid solutions. In this article we describe a lightweight information

integration architecture that is capable of unifying, in a single access point, several heterogeneous

bioinformatics data sources. The model is based on web crawling that automatically collects keywords

related with biological concepts that are previously defined in a navigation protocol. This crawling phase

allows the construction of a link-based integration mechanism that conducts users to the right source of

information, keeping the original interfaces of available information and maintaining the credits of original

data providers.

1 INTRODUCTION

World Wide Web and its associated technologies are

evolving rapidly as is the ability to develop and

deploy customized solutions for users. It is

becoming easier over time to create novel

applications with attractive interfaces and advanced

features. However, this evolution leads to several

problems. The major problem is in finding

information with certified quality: being extremely

easy to deploy new content online, the amount of

incorrect and invalid information is growing. This

process also leverages content heterogeneity. The

same concept can be stored in different media

formats or database models and accessed by

different kinds of methods. Data integration

architectures propose a single unifying framework

that can connect to several distinct data sources and

offer their content to users through web applications,

remote services or any other kind of data exchange

API.

Despite the advances in application development

and web standards, integrating information is still a

challenging task. This difficulty can be mostly seen

in life sciences. Biology and biomedicine are using

information technologies to solve their problems or

share new discoveries. The Human Genome Project

– HGP (Collins et al., 1998) – required novel

software applications that could help in solving

biological problems faster, ease biologists’ everyday

tasks, aid in the publication of relevant scientific

discoveries and promote researchers’

communication and cooperation. This exponential

necessity empowered a new field of research

denominated bioinformatics, requiring combined

efforts from biologists, statisticians, and software

engineers. The success of HGP brought about a new

wave of research projects (Adams et al., 1991,

Cotton et al., 2008) that resulted in a dramatic

increase of information available online.

With the tremendous amount of available data,

integration is the most common issue when

developing new solutions in the area of

bioinformatics. Applications like UniProt (Bairoch

et al., 2005) and Ensembl (Hubbard et al., 2002) or

warehouses like NCBI (Edgar et al., 2002, Pruitt and

Maglott, 2001) aim to integrate biological and

biomedical data that is already available. The

applied data integration strategies are suitable when

the purpose is to cover large data sources that use

standard data transfer models. However, when the

274

Lopes P., Arrais J. and Luís Oliveira J. (2009).

LINK INTEGRATOR - A Link-based Data Integration Architecture.

In Proceedings of the International Conference on Knowledge Discovery and Information Retrieval, pages 274-277

DOI: 10.5220/0002291702740277

 SciTePress

data is presented in pure HTML, REST web

services, CSV sheets or plain text, the integration

process can be quite complex. Along with this

problem there is the fact that when small

applications are integrated in this environment, the

content authorship is lost. This means that the

original researcher, who made a great effort to

publish his work online, will be hidden behind a

small link contained in a simple list among similar

links. The architecture described in this paper solves

both these problems. We propose an integration

model that can easily integrate information by

storing its URL and displaying it inside a centralized

application. With this, the data integration problem

is partially solved and the original application is

presented and extended to users.

The following section debates Link Integrator

organization and section 3 describe a real world

implementation scenario. Finally, section 4 contains

some final remarks about our research.

2 LINK INTEGRATOR

To deal with the presented integration issues, one

can adopt several distinct strategies for data

integration. These approaches differ mostly on the

amount and kind of data that is merged in the central

database. Different architectures will also generate a

different impact on the application performance and

efficiency.

Warehouse (Polyzotis et al., 2008, Reddy et al.,

2009, Zhu et al., 2008) solutions consist in

replicating integrated data sources in a single

physical location with a unifying data model.

Mediator-based solutions (Haas et al., 2001) rely on

a middleware layer for the creation of a proxy

between the client and the original integrated

servers. Link-based integration (Maglott et al., 2007,

Oliveira et al., 2004) strategies simply list URLs

linking the original data sources. Arrais work (Arrais

et al., 2007) presents a solid analysis on these

strategies, their main advantages and disadvantages,

resulting in an optimal hybrid solution.

In addition to these strategies, current trends

involve developments in the meta-applications area.

Mashups (Belleau et al., 2008) and workflows (Oinn

et al., 2004) have a growing popularity among data

and service integration architectures. When the goal

is to offer real-time web-based dynamic integration,

with an increased user-control, Lopes’ work (Lopes

et al., 2008) presents a valuable solution that can be

adapted to several scenarios.

We have chosen to design a solution based on

link integration scenarios. Our choice is mostly due

to the fact that the biological content available

online is dynamically changing and evolving swiftly.

2.1 Architecture

Link Integrator architecture metaphor relies on

typical three-layer architecture. The proposed

architecture divides the application structure: data

access, application logic and presentation layer.

The data access layer is completely independent

from our application. It represents the external data

sources that are accessed by originally integrated

applications. The application logic layer is crucial in

our system: externally it deals with the

communications with the integrated applications;

internally it deals with the application execution

cycle. The presentation layer is where any user in his

web browser can access a single unifying interface,

designed to be attractive and fulfilling high quality

usability patterns.

It is important to highlight that our system does

not have any kind of communication or data

exchange with the original data sources. This type of

activity would breach the initial system

requirements, transforming the integration model

from link-based to mediator-based.

2.2 Execution

Link Integrator is composed of a map, containing

navigational information that the crawler will read to

configure its processing cycle; a topic driven web

crawler capable of parsing and processing the URLs

it gathers and a relational database to store the

information gathered by the crawler.

In this system, the most important part is the web

crawler. The map contains mostly URL addresses

and regular expressions that are used by the crawler

to retrieve, parse and process web documents. The

crawling results are triplets containing an association

between: an Entity, which is the generic category

where the found URLs will fit; a Class that

represents the container where the link belongs; a

link to the Application, the URL, which is the main

result of the crawling cycle. Both the Entities and

the Classes are defined in the map; the crawler only

finishes the triplet by associating the URL with the

Class and with the Entity.

The system operation cycle – Figure 1 – is

simple. Initially, the system reads a predefined

navigation map containing Entities, Classes, URLs

and regular expressions to find other URLs. The

crawler will then parse and process the web pages

generating sets of results – URL-Class-Entity triplets

– that will be stored in containers. These containers

LINK INTEGRATOR - A Link-based Data Integration Architecture

275

Figure 2: DiseaseCard interface with Link Integrator information.

Figure 1: Link Integrator execution workflow.

can be serialized in XML and passed to another

application or, as we do, stored in a relational

database. After this crawling and retrieving phase,

the application engine generates, from the data

stored in the relational database, a tree that is

presented to the user in the application’s main

workspace.

3 DISEASECARD

DiseaseCard is a link-based integration application

with the main purpose of connecting and presenting

disease related information spread throughout the

Internet. It was first launched in 2005 and has grown

to a mature state where up to 2000 rare diseases

stored in the Online Mendelian Inheritance in Man

(Hamosh et al., 2005) database are covered.

When DiseaseCard was created, web

applications where mostly based on static HTML

layouts to display content. Along with Internet

evolution, web application complexity has also

evolved: Web2.0 dynamic applications and web

service responses cannot be parsed by traditional

web crawlers.

Locus specific databases contain information

that is as important for researchers as the

information already gathered in DiseaseCard. These

databases store information about genetic variations

and relating this information with other genetics

marks can offer newer insights on rare disease

studies. Joining DiseaseCard paradigm with amount

of information contained in LSDBs we have the

perfect test bed for Link Integrator. Adding LSDBs

to DiseaseCard will increase the added value it

represents in the biomedical community and will

validate the goals we set when developing our

system. The new on development version of

DiseaseCard – Figure 2 – will be supported by the

Link Integrator engine that can join information

from several LSBDs applications like LOVD

(Fokkema et al., 2005) or UMD (Al and Junien,

2000). The current portal is available at

www.diseasecard.org.

4 CONCLUSIONS

Taking into account the amount of data available

after the Human Genome Project, the area of life

sciences is probably the discipline most affected by

information quantity, diversity and heterogeneity.

Recently, an area of growing importance is related to

the human variome. Variation studies often result in

KDIR 2009 - International Conference on Knowledge Discovery and Information Retrieval

276

new web applications denominated locus specific

databases and they usually contain information

about sequence variations among individuals for a

particular gene. In addition, content ownership and

its growing importance is gaining relevance. Despite

the fact that for regular end-users, access to

scientific content is easier when provided by a

centralized service, researchers who want to publish

their work are almost obliged to create their own

applications if they want to keep the authorship of

their work visible.

The described architecture and application intend

to overcome these problems with three key features

for both users and researchers. First, integration is

based on simple Internet URLs that are parsed and

processed to gather the most significant information.

This means that developers will not have to make

any changes to the application core and that we are

able to integrate any URL-accessible content.

Secondly, the original applications will be shown

inside our application. Thus, the content owners will

not be shown as a link but as part of a complete

application. Finally, external applications can be

extended inside our system: information exchanges,

text-mining and other user customization features

can be developed to enhance the original

applications.

ACKNOWLEDGEMENTS

The research leading to these results has received

funding from the European Community's Seventh

Framework Programme (FP7/2007-2013) under

grant agreement nº 200754 - the GEN2PHEN

project.

REFERENCES

Adams, M. D., Kelley, J. M., et al. (1991) Complementary

DNA sequencing: expressed sequence tags and human

genome project. Science, 252, 1651-1656.

Al, B. E. T. & Junien, C. (2000) UMD (Universal

Mutation Database): A Generic Software to Build and

Analyze Locus-Specific Databases. Human Mutation,

94.

Arrais, J., Santos, B., et al. (2007) GeneBrowser: an

approach for integration and functional classification

of genomic data. Journal of Integrative

Bioinformatics, 4.

Bairoch, A., Apweiler, R., et al. (2005) The Universal

Protein Resource (UniProt). Nucleic Acids Research,

33, 0-159.

Belleau, F., Nolin, M.-A., et al. (2008) Bio2RDF: Towards

a mashup to build bioinformatics knowledge systems.

Journal of Biomedical Informatics, 41, 706-716.

Collins, F. S., Patrinos, A., et al. (1998) New Goals for the

U.S. Human Genome Project: 1998-2003. Science,

282, 682-689.

Cotton, R. G. H., Auerbach, A. D., et al. (2008)

GENETICS: The Human Variome Project. Science,

322, 861-862.

Edgar, R., Domrachev, M. & Lash, A. E. (2002) Gene

Expression Omnibus: NCBI gene expression and

hybridization array data repository. Nucleic Acids

Research, 30, 207-210.

Fokkema, I. F., Den Dunnen, J. T. & Taschner, P. E.

(2005) LOVD: easy creation of a locus-specific

sequence variation database using an "LSDB-in-a-

box" approach. Human Mutation, 26, 63-68.

Haas, L. M., Schwarz, P. M., et al. (2001) DiscoveryLink:

A system for integrated access to life sciences data

sources. IBM Systems Journal, 40, 489-511.

Hamosh, A., Scott, A. F., et al. (2005) Online Mendelian

Inheritance in Man (OMIM), a knowledgebase of

human genes and genetic disorders. Nucleic Acids

Research, 33, 514-517.

Hubbard, T., Barker, D., et al. (2002) The Ensembl

genome database project. Nucleic Acids Research, 30,

38-41.

Lopes, P., Arrais, J. & Oliveira, J. L. (2008) Dynamic

Service Integration using Web-based Workflows.

Proceedings of the 10th Internation Conference on

Information Integration and Web Applications &

Services. Linz, Austria, Association for Computer

Machinery.

Maglott, D., Ostell, J., et al. (2007) Entrez Gene: gene-

centered information at NCBI. Nucleic Acids

Research, 35.

Oinn, T., Addis, M., et al. (2004) Taverna: a tool for the

composition and enactment of bioinformatics

workflows. Bioinformatics, 20, 3045-3054.

Oliveira, J. L., Dias, G. M. S., et al. (2004) DiseaseCard:

A Web-based Tool for the Collaborative Integration of

Genetic and Medical Information. Proceedings of the

5th International Symposium on Biological and

Medical Data Analysis, ISBMDA 2004. Barcelona,

Spain, Springer.

Polyzotis, N., Skiadopoulos, S., et al. (2008) Meshing

Streaming Updates with Persistent Data in an Active

Data Warehouse. Knowledge and Data Engineering,

IEEE Transactions on, 20, 976-991.

Pruitt, K. D. & Maglott, D. R. (2001) RefSeq and

LocusLink: NCBI gene-centered resources. Nucleic

Acids Research, 29, 137-140.

Reddy, S. S. S., Reddy, L. S. S., et al. (2009) Advanced

Techniques for Scientific Data Warehouses. Advanced

Computer Control, 2009. ICACC '09. International

Conference on.

Zhu, Y., An, L. & Liu, S. (2008) Data Updating and Query

in Real-Time Data Warehouse System. Computer

Science and Software Engineering, 2008 International

Conference on.

LINK INTEGRATOR - A Link-based Data Integration Architecture

277