ONTOLOGY BASED INTEGRATION OF DISTRIBUTED AND

HETEROGENEOUS DATA SOURCES IN ACGT

Luis Martín, Alberto Anguita, Víctor Maojo

Biomedical Informatics Group, Artificial Intelligence Laboratory

School of Computer Science, Universidad Politécnica de Madrid, Campus de Montegancedo S/N

28660 Boadilla del Monte, Madrid, Spain

Erwin Bonsma, Anca Bucur, Jeroen Vrijnsen

Phillips Research, Healthcare System Architecture, High Tech Campus 37, 5656 AE Eindhoven, The Netherlands

Mathias Brochhausen, Christian Cocos, Holger Stenzhorn

IFOMIS, Universität des Saarlandes, Postfach 151150, 66041 Saarbrücken, Germany

Manolis Tsiknakis, Martin Doerr, Haridimos Kondylakis

Institute of Computer Science

Foundation for Research and Technology - Hellas, GR-71110 Heraklion, Crete, Greece

Keywords: Ontology-based Biomedical Database Integration, Semantic Mediation, Ontologies, Post-genomic Clinical

Trials, Service Oriented Architectures.

Abstract: In this work, we describe the set of tools comprising the Data Access Infrastructure within Advancing

Clinico-genomic Trials on Cancer (ACGT), a R&D Project funded in part by the European. This

infrastructure aims at improving Post-genomic clinical trials by providing seamless access to integrated

clinical, genetic, and image databases. A data access layer, based on OGSA-DAI, has been developed in

order to cope with syntactic heterogeneities in databases. The semantic problems present in data sources

with different nature are tackled by two core tools, namely the Semantic Mediator and the Master Ontology

on Cancer. The ontology is used as a common framework for semantics, modelling the domain and acting as

giving support to homogenization. SPARQL has been selected as query language for the Data Access

Services and the Mediator. Two experiments have been carried out in order to test the suitability of the

selected approach, integrating clinical and DICOM image databases.

1 INTRODUCTION

Data integration across heterogeneous data sources

and data aggregation across different aspects of the

biomedical spectrum is at the centre of current

biopharmaceutical R&D. A technological

infrastructure supporting such a knowledge

discovery process should, ideally, allow for:

 Data to be searched, queried, extracted,

integrated and shared in a scientifically and

semantically consistent manner across

heterogeneous sources, both public and

proprietary, ranging from chemical structures

and omics to clinical trials data;

 Discovery and invocation of scientific tools

that are shared by the community, rather than

repeatedly developed by each and every

organisation that needs to analyse their data

and

 Both the sharing of tools, and their

integration as modules in a generic

framework, applied to relevant dynamic

301

Martín L., Anguita A., Maojo V., Bonsma E., Bucur A., Vrijnsen J., Brochhausen M., Cocos C., Stenzhorn H., Tsiknakis M., Doerr M. and Kondylakis H.

(2008).

ONTOLOGY BASED INTEGRATION OF DISTRIBUTED AND HETEROGENEOUS DATA SOURCES IN ACGT.

In Proceedings of the First International Conference on Health Informatics, pages 301-306

 SciTePress

datasets. We refer to this process as

“discovery driven scientific workflows” which

ideally would also execute fast and in an

unsupervised manner.

Needless to say that our current inability to

efficiently share data and tools, in a secure and

efficient way, is severely hampering the research

process. The objective of the Advancing Clinico-

Genomic Trials on Cancer (ACGT) project is to

contribute to the resolution of these problems

through the development of a unified technological

infrastructure which will facilitate the seamless and

secure access and analysis, of multi-level clinical

and genomic data enriched with high-performing

knowledge discovery operations and services in

support of multi-centric, postgenomic clinical trials.

Integrated access to heterogeneous biomedical

data is at the core of the problems that need to be

resolved. This paper presents the main

methodological and technological challenges

addressed in the implementation of an ontology-

based data integration architecture within the context

of the ACGT project. Emphasis is given to the

description of the ACGT Data Access Architecture

which is comprised by a set of key services, namely

the ACGT-Data Access Services, the ACGT-

Semantic Mediator, and the ACGT-Master

Ontology, as well as additional dedicated tools.

While the first two services provide the means to

resolve syntactic and semantic heterogeneities when

accessing integrated databases, the latter acts as a

core resource supporting the data integration

process.

2 BACKGROUND

Database Integration aims at facilitating users in

querying sets of heterogeneous sources of

information in an intuitive and transparent way. The

research community has been dealing with different

kinds of methods during the last decade, namely

Data Warehousing (Kimball, 1996), Federated

Database Systems (Sheth, 1990), Mediator-based

approaches (Wiederhold, 1992), and other hybrid

approaches. From the technical point of view, three

categories can be differentiated, namely data

translation, query translation and information

linkage.

In Data Translation, data from the different

databases are integrated in a centralized repository.

Before the integration, these data must be modified

in order to fit the requirements of the unified

schema—the central repository has its own schema

different from the ones belonging to the underlying

databases. The most popular example of a DT-based

technology is Data Warehousing, which is now in its

industrial exploitation phase.

By contrast, Query Translation does not perform

actual integration of data, but transformation of a

query when it is launched. A mediation software

offers a representation of a virtually integrated set of

databases to the users. The user is able to build and

launch a query based on this representation. The

mediator receives the query and transforms it into a

set of dedicated sub-queries for the underlying

databases. After their actual execution in the

corresponding databases, the results are integrated

by the mediator software to be presented to the user.

On the other hand, Information Linkage just defines

cross-reference links between databases to perform

database integration. Some examples of usage of IL

are MEDLINE, GENBANK, OMIM, and the World

Wide Web itself.

There exist two main ways to deal with Query

Translation: Global as View and Local as View. In

both approaches the system has a description of the

domain. In Local as View, views representing the

databases are described using the knowledge

contained in the global schema. In Local as View no

additional work apart from defining a single view is

necessary when a new database needs to be

integrated in the system. However, translation of

queries becomes leads to performance problems

(Abiteboul, 1998) (Ullman, 1997). Conversely, in

Global as View (Cali, 2001) a global model is built

using information from the underlying databases and

from the domain model. Query translation in Global

as View is straightforward, since the links are

actually stored in the schema, but it needs of a global

revision when new sources are added.

During the last years, ontologies have been used

as global domain models in database integration,

obtaining promising results, mainly in the fields of

biomedicine and bioinformatics. Biomedical

ontologies have adopted the role of domain

homogenizing tools in the last decade. We

distinguish three major classes of biomedical

ontologies: Generic Medical Ontologies—dealing

with the entire domain of Medicine—, Specific

Medical Ontologies—describing a single domain

within Medicine—, and Specific Biomedical

Ontologies—supporting a specific biomedical

domain. Some examples of these three categories

are:

 Generic Medical Ontologies: SNOMED CT

(SNOMED, 2007), UMLS (Lindberg, 1990).

HEALTHINF 2008 - International Conference on Health Informatics

302

and GALEN (GALEN, 2007), HL7 RIM

(HL7, 2007). Both SNOMED CT and UMLS

have been proved to be theoretically unsound

(Ceusters, 2003). HL7 RIM, even though

widely used, has been subjected to a number

of criticisms that also question its theoretical

soundness (Smith, 2006).

 Specific Medical Ontologies: The

Foundational Model of Anatomy (FMA,

2007) is a highly stabile and rigorously

developed ontology. One system frequently

mentioned when talking about the state of the

art in ontology-based cancer research and

management is caCORE (caCORE, 2007), a

highly developed environment making use of

UMLS and NCI Thesaurus and Metathesaurus

representations (NCI, 2007).

 Specific biomedical Ontologies: Gene

Ontology (GO, 2007) is one example. Other

examples can be found at the OBO Foundry

(OBO, 2007). There is a high number of

Specific Biomedical Ontologies, and they

follow a variety of different standards. This

shows the importance of quality assessment in

ontology development within this domain.

The following section describes in detail the

database access architecture adopted to integrate

clinical trials databases including image information.

3 THE ACGT DATA ACCESS

INFRASTRUCTURE

The ACGT platform is comprised by a set of

services and resources supporting the different needs

of clinicians and researchers involved in a post-

genomic clinical trial. The ACGT platform

architecture follows a layer based design, as can be

seen in Figure 1.

Figure 1: The ACGT platform architecture.

The ACGT data access infrastructure forms part of

this architecture. This infrastructure is comprised by

three core resources, together with other satellite

tools that give support to the complete data access

task. These core resources are, namely: the ACGT

Master Ontology on Cancer (ACGT-MO), the

ACGT Data Access Services (ACGT-DAS) and the

ACGT Semantic Mediator (ACTG-SM). Figure 2

shows the architecture of the ACGT Data Access

Infrastructure.

Figure 2: The ACGT data access infrastructure.

The next sections give a detailed description of these

three components.

3.1 ACGT-MO

ACGT deals with the integration of data from a

variety of heterogeneous sources. There exists a lack

of standardization among data from different clinical

trials, which leads to a loss in the possible

knowledge exchanging power. Ontology based data

management becomes then a major advantage in the

way to achieve consistency in data collection and

processing policies.

The ACGT-MO employs the resources of a Top

Level Ontology, called Basic Formal Ontology

(BFO, 2007). This choice is based on its proven high

applicability to the biomedical field (Grenon, 2004).

The ACGT Master Ontology inherits BFO’s

foundational principles:

 realism

 perspectivalism

 fallibilism

 adequatism

Figure 3 shows the BFO structure.

ONTOLOGY BASED INTEGRATION OF DISTRIBUTED AND HETEROGENEOUS DATA SOURCES IN ACGT

303

Figure 3: The Basic Formal Ontology.

The ACGT-MO has been developed using the

OWL-DL language, achieving the maximum level of

expressivity to describe the domain of post-genomic

clinical trials on cancer. For its development and

mantainance, the Protégé editor (Protégé, 2007) has

been used.

The ACGT-MO basically contains two sets of

elements, namely i) Classes and ii) Properties. The

former group contains the concepts of the ontology

(the so-called universals) structured in a taxonomy

using is_a type relations to establish links between

classes—e.g., CanonicalBodySubstance is_a

BodySubstance. The latter represents the set of

relations connecting the classes of the taxonomy. In

order to fit the requirements of data integration in

biomedical reality, and to express the truths of

Medicine and Biology, a wide variety of relations

(besides from mere is_a) has been included. A few

examples of structure in the tree of relations are

hasBloodPressure is a child of hasPressure which, in

turn, is a child of hasMagnitude, or hasFunction is a

child of implements. An important part of the

relations list has been imported from the Open

Source Relation Ontology (RO) (RO, 2007).

3.2 ACGT-DAS

The ACGT-DAS provide a means to solve syntactic

heterogeneities—i.e. they provide uniform data

access interface. ACGT-DAS are required also to

export the data schema of each individual source, in

order to aid the clients in building queries.

The ACGT-DAS offer a web service interface.

They have been implemented using the Open Grid

Services Architecture Data Access and Integration

(OGSA-DAI) services (Antonioletti, 2005)

SPARQL (SPARQL, 2007) has been chosen as

the query language. More expressive than its

predecessor RDQL, the language used by an early

version of the mediator, SPARQL offers new

features, becoming an intermediate level (in terms of

expression) language, appropriate for being used as

common query language. It is less expressive than

Structured Query Language (SQL), due to the lack

of support of any form of aggregation. SQL is a

relational specific language, so it cannot be used as

common language by the ACGT-DAS (mainly

because of the selected query translation approach).

On the other hand, it is more expressive than

DICOM.

Figure 4: Example of DICOM query in SPARQL.

A relational database and a DICOM wrapper

have been developed so far. We used D2RQMap

(Bizer, 2004) for the implementation of the

relational databases wrapper. This was a

straightforward process, due to the technology

choices. The development of the wrapper for

querying DICOM image databases was not as direct.

DICOM uses a four-level hierarchical information

model, not a classical relational model. This

structure caused difficulties in the query

transformation process, since SPARQL is more

expressive than DICOM, so the final queries may

not be able to represent the view that is expressed in

the original one. Figure 4 shows an example of a

DICOM query expressed in SPARQL. A set of

special functionalities had to be implemented to

support retrieving of DICOM images as well.

3.3 ACGT-SM

The ACGT-SM aims at solving the semantic

heterogeneities present in databases to support

interoperability and integration. The ACGT-SM is

supported by a set of satellite tools, like the mapping

tool and the data cleaning module among others.

The approach selected to perform database

integration has been Local as View. This decision is

based on the nature of data in the biomedical

HEALTHINF 2008 - International Conference on Health Informatics

304

domain, and more concretely in post-genomic

clinical trials. Local as View based techniques

require less amount of effort when the structure of

data sources changes or when new ones need to be

integrated in the system. However, as stated before,

some performance issues are associated to this kind

of approaches. In order to overcome these

difficulties, the domain representing the integrated

set of databases is constrained. This restriction is

based on requirements specified by the end users.

Semantic heterogeneities are tackled following

an ontology based approach. The ACGT-MO acts as

a semantic framework supporting homogenization.

In Local as View, the ACGT-MO acts as global

schema. The goal of this global schema is twofold:

1) provides a means to build the local views of the

underlying databases, and 2) represents the set of

queries that can be formulated by the users.

SPARQL has been selected as query language

for the ACGT-SM. As said previously, SPARQL is

used as query language by the ACGT-DAS. This

homogeneity allows saving time and memory

resources. When a query is launched through the

ACGT-SM, the software divides it into a set of

dedicated queries for the underlying data access

services, wrapping the actual databases. No interface

is needed to translate these queries, given that the

same query language is used by the ACGT-SM and

the ACGT-DAS.

The results of a query are returned by the

wrappers in XML SPARQL Result format. The

ACGT-SM builds an integrated set of results as an

ontology instance file. These instances are

represented using the OWL ontology description

language. Other formats, such as CSV (Comma

Separated Variables) are supported to fit the

requirements of the Data Analysis Tools in ACGT.

Parallel to the ACGT-SM, an API for creating

mappings between the ACGT-MO and RDF

Schemas of data sources to be integrated has been

developed. This API offers a flexible and generic

approach for creating mappings, and is based on

path mapping. Paths from the ACGT-MO are

mapped to paths from an RDF Schema, providing a

way to translate queries to the ACGT-MO—which

are basically sets of paths—into queries to the RDF

Schema. A graphical interface on top of this API is

being developed as well.

3 EXPERIMENTS AND RESULTS

We have tested the first version of our tools in a case

study including a set of three different sources,

including two clinical relational databases—SIOP

nephroblastoma database and TOP breast cancer

database— and a DICOM repository of images. We

carried out two experiments integrating DICOM

source and each one of the clinical trials databases.

Our tool integrated the sources successfully, and the

generated schemas were validated by experts in the

domain.

The experiments were performed using a

dedicated web interface. This interface was built for

demonstration purposes within the ACGT project,

but it is not the final query interface—this interface

is now in its design phase, and is going to be able to

support more complex queries. Figure 5 shows the

result of the execution of a query combining SIOP

and DICOM data.

Figure 5: Instances retrieved: SIOP and DICOM

integration.

As can be seen in Figure 5, the user can request

the DICOM studies related to the patient selected by

clicking on the relation button.

The experiments showed the suitability of the

approach adopted to cope with data access and

integration in the post-genomic clinical trials

domain. This prototype was developed using Java

and HTML languages.

4 CONCLUSIONS

In this work, we present a set of tools to provide data

access, allowing seamless access and integration of

heterogeneous databases. To this end, a clinical trials

on cancer domain ontology has been developed, the

ACGT-MO, and two core services to overcome

syntactic and semantic heterogeneities, namely the

ACGT-DAS and the ACGT-SM.

The ACGT-MO covers the domain of clinical

trials on cancer, and has been built using Clinical

Report Forms from SIOP and TOP trials. The

ONTOLOGY BASED INTEGRATION OF DISTRIBUTED AND HETEROGENEOUS DATA SOURCES IN ACGT

305

ACGT-MO follows the recommendations of the

OBO Foundry.

The ACGT-DAS resolve syntactic

heterogeneities present in disparate sources of

information. They provide a homogenous query

language, SPARQL, and a web service interface

developed using the OGSA-DAI middleware.

The ACGT-SM is able to process user queries

formulated by means of a global model—i.e. the

ACGT-MO—, and to retrieve information from a set

of integrated heterogeneous databases. The ACGT-

SM is supported by a set of satellite tools tackling

with problems such as mapping and instance

homogenization.

The results obtained in the carried out

experiments prove that this approach can properly

integrate relational and image databases.

In the second phase of the project, we plan to

add new types of sources, such as public web

databases, different file formats—e.g. plain text,

Excel spreadsheets, XML, etc— and microarray

data.

ACKNOWLEDGEMENTS

The authors would like to thank all members of the

ACGT consortium who are actively contributing to

addressing the R&D challenges faced. The ACGT

project (FP6-2005-IST-026996) is partly funded by

the EC and the authors are grateful for this support.

REFERENCES

Abiteboul S., Duschka O., 1998. Complexity of answering

queries using materialized views. In Proc. of the 17th

ACM SIGACT SIGMOD SIGART Symp. on Principles

of Database Systems (PODS’98), pages 254-265.

Antonioletti, M., et al. 2005. The Design and

Implementation of Grid Database Services in OGSA-

DAI- In: Concurrency and Computation: Practice and

Experience, Volume 17, Issue 2-4, Pages 357-376.

BFO, The Basic Formal Ontology. Available at:

http://www.ifomis.uni-saarland.de/bfo [13 oct 2007]

Bizer, C., Seaborne, A. 2004. D2RQ – Treating Non-RDF

Databases as Virtual RDF Graphs. In: Proc. of the 3rd

International Semantic Web Conference (ISWC2004),

Hiroshima, Japan. Poster presentation.

caCORE: A Common Framework for Cancer Data

Management. Available at:

http://ncicb.nci.nih.gov/infrastructure/cacore_overvie

w [13 oct 2007]

Cali, A., De Giacomo, G., Lenzerini, M., 2001. Models for

information integration: Turning local-as-view into

global-as-view. In: Proc. of Int. Workshop on

Foundations of Models for Information Integration

(10th Workshop in the series Foundations of Models

and Languages for Data and Objects).

Ceusters W, Smith B, Kumar A, Dhaen C. 2003. Mistakes

in Medical Ontologies: Where Do They Come From

and How Can They Be Detected?. in: Pisanelli DM

(ed.): Ontologies in Medicine: Proceedings of the

Workshop on Medical Ontologies, Rome, October

2003. IOS Press, Amsterdam.

FMA, The Foundational Model of Anatomy. Available at:

http://fme.biostr.washington.edu:8089/FME/index.htm

l [13 oct 2007]

GALEN, Common Reference Model. Available at:

http://www.opengalen.org [13 oct 2007]

GO, Gene Ontology project. Available at:

http://www.geneontology.org [13 oct 2007]

Grenon, P., Smith, B., Goldberg, L., “Biodynamic

Ontology. 2004. Applying BFO in the Biomedical

Domain,” in: Ontologies in Medicine, D. M. Pisanelli,

Ed., Amsterdam: IOS Press, pp. 20-38.

HL7, Reference Information Model. Available at:

http://www.hl7.org/Library/data-

model/RIM/modelpage_mem.htm. [13 oct 2007]

Kimball R., 1996. The Data Warehouse Toolkit: Practical

Techniques for Building Dimensional Data

Warehouses, John Wiley.

Lindberg C. The Unified Medical Language System

(UMLS) of the National Library of Medicine. Journal

of the American Medical Record Association 1990;

61(5):40-2.

NCI Enterprise Vocabulary Services. Available at:

http://www.nci.nih.gov/cancerinfo/terminologyresourc

es [13 oct 2007]

OBO Foundry ontologies. Available at:

http://obofoundry.org [13 oct 2007]

Protégé, The Ontology Editor and Knowledge Acquisition

System. Available at: http://protege.stanford.edu [13

oct 2007]

RO, OBO Relation Ontology. Available at:

http://www.obofoundry.org/ro [13 oct 2007]

Sheth A. P., Larson J. A., 1990. Federated Database

Systems for Managing Distributed, Heterogeneous,

and Autonomous Databases, ACM Computing

Surveys, 22(3): pp.183-236.

Smith, B, Ceusters, W. HL7 RIM: An Incoherent

Standard, Stud Health Technol. Inform. 2006; 124:

133–38

SNOMED Clinical Terms Core Content. Available at:

http://www.snomed.org/snomedct/coreterms.html [13

oct 2007]

SPARQL Query Language for RDF. Available at:

http://www.w3.org/TR/rdf-sparql-query/ [13 oct 2007]

Ullman J.D., 1997. Information integration using logical

views. In Proc. of the 6th Int. Conf. on Database

Theory (ICDT’97), volume 1186 of Lecture Notes in

Computer Science, pages 19-40. Springer-Verlag.

Wiederhold G., 1992. Mediators in the Architecture of

Future Information Systems, IEEE Computer, 25(3):

pp. 38-49.

HEALTHINF 2008 - International Conference on Health Informatics

306