SEMANTIC INTEGRATION OF DISPARATE DATA SOURCES IN
THE COG PROJECT
Jos de Bruijn
Digital Enterprise Research Institute
University of Innsbruck, Technikerstrasse 13, A-6020, Innsbruck, Austria
Keywords:
Semantic Information Integration, Semantic Information Management, COG Project, Enterprise Application
Integration.
Abstract:
We present a novel approach to the integration of structured information sources in enterprises, based on Se-
mantic Web technology. The semantic information integration approach presented in this paper was applied
in the COG project. We describe Unicorn’s Semantic Information Management along with the Unicorn Workbench tool, a component of the Unicorn System, and how they were applied in the project to
solve the information integration problem. We used the Semantic Information Management Methodology and
the Unicorn Workbench tool to create an Information Model (an ontology) based on data schemas taken from
the automotive industry. We map these data schemas to the Information Model in order to make the meaning
of the concepts in the data schemas explicit and relate them to each other, thereby creating an information
architecture that provides a unified view of the data sources in the organization.
1 INTRODUCTION
Information integration across application boundaries
or even across company boundaries is a topic of interest
to many enterprises; according to some studies, up to
30% of future IT spending will go into Enterprise
Application Integration (EAI) (David Sink in
InformationWeek, May 2002; InfoWorld survey of 500 IT
leaders, January 2002). There are several reasons for
the increased need to integrate existing applications
(Fensel, 2003): company mergers require integration of
software infrastructure; new software systems need to
be integrated with legacy systems; there are no
integrated optimal solutions for all the needs of a
company; and new protocols and standards continue to
emerge and evolve, with which companies need to comply
in order to enable cooperation with other companies.
In the Corporate Ontology Grid (COG) project we
aim to overcome the problems of semantic heterogeneity
between data sources by semantically integrating the
sources using a central Information Model (i.e. an
ontology). We built the Information Model using
existing applications, data sources (assets), and input
from domain experts. We then created a mapping between
each data asset and the central model, thereby
assigning a well-understood meaning to the concepts in
each asset. These mappings enable us to use the
Information Model to discover the location of
information throughout the data sources in the
enterprise. Furthermore, because the mappings are
created in a formal way, they enable us to
automatically generate transformations between
different sources.
In section 2 we present the Unicorn Workbench
tool, which we used to solve the information integra-
tion problem in the COG project. Then, we present
the information integration problem and the solution
for the COG project in section 3. We then mention
some related work in section 4 and finally provide
some conclusions.
2 INTRODUCING THE UNICORN
WORKBENCH
The Unicorn Workbench, a Java-based tool created
by Unicorn, was built to support Unicorn's Semantic
Information Management (SIM) (Schreiber, 2003)
and to enable SIM implementations in enterprises.
All phases (except the first) in the SIM Methodology
are to some extent supported by the Unicorn tool.
[Figure 1: Semantic Information Management — the Unicorn Workbench comprises a Model Layer (with editing, transformation/query, and data discovery functions) mapped through an External Asset Layer to data sources such as RDBMSs and XML documents.]

The basic concept in Unicorn is the Unicorn
Project, which consists of the Information Model, the
schemas belonging to the external assets (the data
sources), the transformations, the queries, and the de-
scriptors (meta-data for human readers) for all these
concepts.
The architecture of Unicorn consists of two main
layers (see figure 1), namely:
The External Assets layer contains the mappings to
all the (disparate) data sources. All kinds of data
schemas can be imported into a Unicorn project,
as long as there is a parser for them. Currently
supported formats are relational database schemas
and XML schemas, as well as COBOL Copybooks.
New parsers can be written using the Asset API (see
the sketch after this list).
The Model Layer contains the ontology (also called
Information Model). The ontology contains all the
packages, classes, properties, and business rules for
describing the meaning of the data residing in the
external assets.
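As an illustration of the parser idea mentioned in the External Assets item above, the following Python sketch shows what a minimal schema parser could look like. All names here (SchemaParser, Composite, Atom, CsvHeaderParser) are invented for this example; they do not reflect the actual, proprietary Asset API.

```python
# Illustrative sketch only: the real Unicorn Asset API is proprietary,
# so every name below (SchemaParser, Composite, Atom, CsvHeaderParser)
# is hypothetical and invented for this example.
from dataclasses import dataclass, field
from typing import List


@dataclass
class Atom:
    """An atomic field in a source schema (e.g. a table column)."""
    name: str
    datatype: str


@dataclass
class Composite:
    """A composite in a source schema (e.g. a table or complex type)."""
    name: str
    atoms: List[Atom] = field(default_factory=list)


class SchemaParser:
    """Interface a new parser would implement for one source format."""

    def parse(self, source: str) -> List[Composite]:
        raise NotImplementedError


class CsvHeaderParser(SchemaParser):
    """Toy parser: derives one composite from a CSV header line."""

    def parse(self, source: str) -> List[Composite]:
        header = source.splitlines()[0]
        atoms = [Atom(col.strip(), "string") for col in header.split(",")]
        return [Composite("csv_record", atoms)]


print(CsvHeaderParser().parse("id,name,price\n1,bolt,0.10"))
```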
The Model layer describes the meaning of the data
and the External Asset layer describes the location of
the data. In order to make the ontology active, the
Unicorn Workbench provides three functions for the
user. The first is an editing function, used to create
and maintain the ontology and the mappings to the
different data sources. The second is the data
discovery function, which the user can employ to
discover the location of data residing in the disparate
data sources, using the ontological model in Unicorn.
Finally, there is a transformation and querying
function, with which the user can create
transformations of instances between different data
sources and issue queries against the ontology. The
queries are syntactically translated to the query
language of the data source and semantically translated
(i.e. the query is automatically rephrased using terms
from the external asset) to be used with the data
schema of the source.
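To make the idea of this two-step translation concrete, here is a minimal Python sketch of semantic query rewriting; the mapping table, property names, and query shape are invented for the example and do not represent Unicorn's internal machinery.

```python
# Minimal illustration of semantic query rewriting; the mapping table,
# property names, and query shape are invented, not taken from Unicorn.

# Mapping from ontology-level property paths to columns of one asset.
mapping = {
    "Vehicle.model": "VEHICLES.MDL_CODE",
    "Vehicle.testResult": "TEST_RESULTS.OUTCOME",
}


def rewrite(ontology_query: dict) -> str:
    """Rephrase an ontology-level selection in the asset's own terms."""
    table, column = mapping[ontology_query["property"]].split(".")
    value = ontology_query["value"]
    return f"SELECT * FROM {table} WHERE {column} = '{value}'"


# Ask in ontology terms; receive a query in source-schema terms.
print(rewrite({"property": "Vehicle.model", "value": "Punto"}))
# -> SELECT * FROM VEHICLES WHERE MDL_CODE = 'Punto'
```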
[Figure 2: The Semantic Information Management Methodology — Gather Requirements, Collect Metadata, Construct Ontology, Rationalize (create mappings), Publish/Deploy, Utilize.]
Besides these external assets, other Unicorn
ontologies and ERWin (an Entity-Relationship Diagram
editor, see http://www3.ca.com/Solutions/Product.asp?ID=260)
models can be directly imported into the Model Layer of
the project (the ontology). In this way, existing
logical data models and component ontologies can be
leveraged in the project. Using component ontologies
decreases the development effort of the project and
makes it possible to build on proven models.
2.1 Ontology Engineering In The
Unicorn Workbench
In the COG project, we followed the Semantic Infor-
mation Management (SIM) Methodology (Schreiber,
2003) for the creation of the ontology and the map-
ping of the disparate data sources. The SIM Method-
ology (see Figure 2) consists of six steps:
1. Gather requirements. Requirements for the infor-
mation architecture are collected and the scope of
the project is established.
2. Collect Metadata. All data assets relevant to the
project are catalogued and the metadata (i.e. data
schemas and existing conceptual models) are im-
ported into the Unicorn Workbench.
3. Construct Information Model. Using the imported
metadata, the ontology is created through a pro-
cess of reverse engineering and/or manual identi-
fication of classes, properties, and business rules in
the source schemas.
4. Rationalize. In the rationalization phase, the map-
pings between the data schemas and the ontology
are created. If the model needs to be refined, there
will be an iteration step back to phase three. In
general, when creating mappings from the compos-
ites in the external assets to the ontology, missing
classes, properties, and business rules are discov-
ered, which necessitates many iterations between
phases three and four to complete the model
and the mappings.
5. Publish/Deploy. The Information Model, along
with the mappings, is published to relevant stake-
holders and the information model along with the
transformations between the data sources is de-
ployed to the runtime architecture.
6. Utilize. Processes need to be created to ensure
maintenance of the architecture. Changes in the
data sources need to be reflected in the ontology,
mappings need to be updated according to these
changes, etc . . .
The implementation of phases two through six of the
methodology is facilitated by the Unicorn Workbench
tool.
This paper focuses on the third and fourth phases of
the methodology, their support in the Unicorn tool, and
our experiences with the methodology and the tool
in the COG project. We first describe the Unicorn
Workbench tool and then our experiences in the COG
project.
2.2 Data Schema Integration In The
Unicorn Workbench
During the metadata collection phase, a number of
data schemas were imported into the Unicorn project.
These schemas were used in the construction phase as
an aid in constructing the ontology. The source
schemas now have to be mapped to the ontology in order
to give meaning to the data residing in these sources,
to enable locating data, and to enable data
transformation between the disparate sources and the
issuing of queries against the sources. Just as in the
construction phase, it is very important to involve
the domain experts in the mapping (rationalization)
phase.
During the rationalization phase, the Unicorn
Workbench aids the user in creating the mappings
between the data assets and the ontology. One
possibility is to view the mappings from the viewpoint
of the data assets and, in this way, decide for each
type and property to which class or property it should
be mapped, and then create the mapping. Another
possibility is to use the Data Discovery feature to
drill down the class hierarchy and find out which
mappings currently exist for the classes in the
ontology. If a class is identified that requires
further mapping, the designer can switch back to the
view of the desired data asset and create the mappings.
If, during the mapping phase, the designer discovers
missing classes, properties, or business rules, the
designer iterates back to the construction phase (phase
three) to add the necessary concepts to the model. Our
experience from the COG project is that it is mainly
missing properties and business rules that are
discovered in the rationalization phase, and not so
much missing classes.
[Figure 3: Coarse mapping in the Unicorn Workbench]

[Figure 4: Detailed mapping in the Unicorn Workbench]

Mappings are created in two stages. First, the
Coarse Mapping (see Figure 3) is created, linking
composites in the data sources (e.g. tables, complex
types) to classes in the ontology. Then the Detailed
Mapping (see Figure 4) is created, linking atoms from
the source schemas to the properties in the ontology.
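As a minimal sketch of these two stages (with invented names, not Unicorn's internal model), a coarse mapping can be represented as a composite-to-class pair and a detailed mapping as a dictionary from atoms to properties:

```python
# Illustrative data structures for the two mapping stages; the names
# are invented for this sketch and are not Unicorn's internal model.
from dataclasses import dataclass
from typing import Dict


@dataclass
class CoarseMapping:
    composite: str  # table or complex type in the external asset
    klass: str      # class in the ontology


@dataclass
class DetailedMapping:
    coarse: CoarseMapping
    atom_to_property: Dict[str, str]  # atom name -> ontology property


coarse = CoarseMapping(composite="TEST_REQUESTS", klass="TestRequest")
detailed = DetailedMapping(
    coarse=coarse,
    atom_to_property={"REQ_ID": "identifier", "REQ_DATE": "submissionDate"},
)
print(detailed)
```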
Conditions can be specified on the values of the
atoms of a composite in the source schema. In this way,
certain groups of instances can be mapped to different
classes in the ontology. These conditional instance
mapping groups are, however, not required to be
mutually exclusive. Conditional mapping can be applied
not only to classes (global conditional mapping), but
also to individual properties (local conditional
mapping), as mappings are eventually created on the
property level, during the detailed mapping. The other
way around is also possible: different composites can
be mapped to the same class.
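The following Python sketch illustrates global conditional mapping on invented data: a condition over an atom's value routes instances of one composite to different ontology classes, and, as noted above, the groups need not be mutually exclusive:

```python
# Sketch of conditional (global) instance mapping on invented data:
# rows of one composite map to different classes based on an atom value.
rows = [
    {"DOC_TYPE": "plan", "DOC_ID": 1},
    {"DOC_TYPE": "result", "DOC_ID": 2},
]

# Each rule pairs a predicate over the row with a target ontology class.
conditional_mapping = [
    (lambda r: r["DOC_TYPE"] == "plan", "TestPlan"),
    (lambda r: r["DOC_TYPE"] == "result", "TestResult"),
]

for row in rows:
    # The groups may overlap, so a row can map to more than one class.
    classes = [cls for pred, cls in conditional_mapping if pred(row)]
    print(row["DOC_ID"], "->", classes)
```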
The mappings in the detailed mapping stage are
usually made between atoms and direct properties. A
direct property is a regular property of the class
concerned. It is, however, also possible to create
mappings to indirect properties. An indirect property
is not a property of the class concerned, but a direct
or indirect property of the class that is the type of
one of its direct properties (i.e. an indirect property
is a property of a related class). It is therefore
possible to map to an indirect property at an arbitrary
depth.
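A small sketch (with invented classes and properties) shows why indirect properties are reachable at arbitrary depth: an indirect property is simply a path of direct properties followed through related classes:

```python
# Sketch: an indirect property is a path of direct properties followed
# through related classes. All class and property names are invented.
schema = {
    "TestRequest": {"plan": "TestPlan"},       # direct property
    "TestPlan": {"laboratory": "Laboratory"},  # direct property
    "Laboratory": {"name": "string"},
}


def resolve(klass: str, path: list) -> str:
    """Follow a property path; each hop moves to the property's type."""
    for prop in path:
        klass = schema[klass][prop]
    return klass


# "plan.laboratory.name" is an indirect property of TestRequest.
print(resolve("TestRequest", ["plan", "laboratory", "name"]))  # string
```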
When instances of different composite types in the
source schema need to be mapped to a single class in
the ontology, it is possible to create a mapping view.
Such a mapping view can contain joins over different
composites within one external asset. A join can be
defined over the composites using an existing foreign
key or an "implicit" foreign key created in Unicorn.
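As an illustration (with invented tables and keys), a mapping view behaves like a join over two composites of one asset, so that the joined instances can be mapped to a single class:

```python
# Sketch of a mapping view (invented tables): a join over two composites
# of one external asset, so the joined rows can be mapped to one class.
tests = [{"TEST_ID": 1, "LAB_ID": 10}]
labs = [{"LAB_ID": 10, "LAB_NAME": "CRF Torino"}]

view = [
    {**t, **l}
    for t in tests
    for l in labs
    if t["LAB_ID"] == l["LAB_ID"]  # join on the (possibly implicit) key
]
print(view)  # [{'TEST_ID': 1, 'LAB_ID': 10, 'LAB_NAME': 'CRF Torino'}]
```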
Foreign keys in the database or “implicit” foreign
keys are used either to indicate a simple relationship
with another class or to indicate inheritance. In the
former case, the foreign key is mapped to a property
whose type is the class representing the composite
that the foreign key targets. In the latter case, the
foreign key is mapped directly to the inheritance
relationship. This latter case applies when the
extension of the composite (i.e. the set of instances
described by the composite) is a subset of the
extension of another composite and this relationship
is made explicit in the database using a foreign key.
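A sketch of the distinction, using invented tables: when the extension linked by the foreign key is a subset of the referenced extension, the key can be mapped to inheritance; otherwise it is mapped to a reference property:

```python
# Sketch with invented tables: deciding whether a foreign key encodes a
# plain relationship or an inheritance relationship between composites.
vehicles = {1, 2, 3, 4}  # primary keys of a VEHICLES composite
prototypes = {2, 4}      # PROTOTYPES keys, which reference VEHICLES

# If the composite's extension is a subset of the referenced extension,
# the foreign key can be mapped to inheritance (Prototype is-a Vehicle).
if prototypes <= vehicles:
    print("map the foreign key to an inheritance relationship")
else:
    print("map the foreign key to a reference property")
```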
When mapping external assets to the central ontol-
ogy, it is possible to use so-called subtype mapping.
An atom in a composite in the external asset can be
mapped to a subclass of the type of the property in
the ontology. This is of course a valid mapping, be-
cause a subclass of the original type is also a valid
type for the property.
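The following sketch (with an invented hierarchy) shows the check that makes subtype mapping sound: any subclass of the property's declared type is also a valid type for values mapped to that property:

```python
# Sketch with an invented hierarchy: subtype mapping is valid because a
# subclass instance is also an instance of the property's declared type.
subclass_of = {"PrototypeVehicle": "Vehicle", "Vehicle": "Thing"}


def is_subclass(klass, ancestor):
    """Walk up the hierarchy to check the subclass relationship."""
    while klass is not None:
        if klass == ancestor:
            return True
        klass = subclass_of.get(klass)
    return False


# An atom may be mapped to a property of declared type Vehicle even if
# its instances are PrototypeVehicle: the mapping remains well-typed.
print(is_subclass("PrototypeVehicle", "Vehicle"))  # -> True
```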
3 INFORMATION INTEGRATION
IN THE COG PROJECT
In the COG project we used external assets provided
by CRF (Centro Ricerche Fiat). The source schemas
are taken from real-life data sources currently in use
by CRF and sources to be implemented at CRF in the
future. The goal of the project is to implement a single
integrated (semantic) information architecture for the
various information sources provided by CRF in order
to show the applicability of using ontologies for infor-
mation integration in industry. The sources include
relational databases, XML data sources and PDF and
spreadsheet documents. These PDF documents are
accessed using the LiveLink document management
system, which in turn has an XML interface. For
the integration of the Excel spreadsheet documents,
a special parser was written using the Asset API.
3.1 The Information Integration
Problem In The COG Project
There were five main data sources (see also Figure 5)
provided by CRF to be integrated in the COG project.
These sources consist of three relational databases,
namely CATnet, PSI, and WelcomHome, one XML
data source, namely LiveLink, and a collection of
Excel spreadsheet documents.
WelcomHome is an application used for project
management (it supports the Microsoft Project appli-
cation). LiveLink is a knowledge and document man-
agement system. CATnet and PSI are applications de-
veloped in-house at CRF to support the vehicle devel-
opment and testing process.
[Figure 5: COG logical architecture — the data sources (PSI, CATnet, LiveLink, Excel, and WelcomHome with MS-Project) are integrated through the COG Information Model, with a web front end for query, metadata exploration, display, and reporting.]

CATnet has been developed to support the entire
testing process, which is a major part of the FIAT
vehicle (prototype) development process and essential
for ensuring the quality of the products. Test requests
that are linked to test plans are submitted to the CAT-
net system. These test plans support the planning of
the execution of the tests, which is related to the vehi-
cle development phase, the test laboratory, etc. . .
The test plans are linked to the so-called “technical
memory”, where the technical procedures for test ex-
ecution are specified. This technical memory consists
of PDF documents accessible through the LiveLink
system. After the test executions, the test results are
stored in the CATnet database for later retrieval. This
retrieval of test results is critical in order to assure
quality of the products. Whenever a customer complaint
is received or a defect in a vehicle is detected, it
has to be possible to easily retrieve the results of
the tests performed on the automobile component
concerned.
To facilitate this retrieval process, there is a cus-
tomer perspective (also called the ‘Voice of Cus-
tomer’ or VoC) tree for each vehicle model in the PSI
system. The customer perspective trees are used to
locate the appropriate tests that should have been per-
formed on a particular vehicle system. The marketing
manager can drill down the tree to locate the specific
tests that have been performed, after which the test
results can be retrieved from the CATnet system.
3.2 Solving The Integration Problem In
COG Using Semantic Information
Management
During the development of the ontology in the COG
project, some shortcomings in the Unicorn tool became
apparent. These shortcomings were fed into the
requirements
for the new version of the Unicorn Workbench. Dur-
ing the COG project, a new version of Unicorn (v2.5)
was released that overcomes most shortcomings in
the old version identified during the development for
COG. Examples of features developed especially for
the COG project are the mapping to subtypes, speci-
fication of inverse properties, and mapping of foreign
keys to inheritance relationships.
Another important development for the new ver-
sion of Unicorn that has proven very useful for the
COG project was the Asset API. Using this API it was
possible to develop parsers for the integration of Ex-
cel documents and Microsoft Access databases (i.e.
PSI) that can be used for importing data schemas from
these platforms. Without this Asset API and these
parsers, these sources could not have been integrated.
Based on the data sources provided by CRF and
the interviews with domain experts at CRF, four main
subject areas for the Information Model were identi-
fied, namely:
Motor vehicle development. This area covers the
motor vehicles themselves, such as automobile
parts and vehicle systems.
Project management. This area covers the project
management capabilities present in WelcomHome.
Testing Management. This area covers the management
of the execution of the tests. Testing guidelines,
test plans, testing laboratories, and so on are the
components of this area.
Test requests. This area covers the customer per-
spective tree, standards, test requests, test results,
etc . . .
For each of these subject areas a package in the
ontology was created, after which the classes,
properties, and business rules were created
corresponding to the composites in the source schemas.
Interviews with domain experts, as well as working with
the applications currently in use (mainly PSI and
CATnet), served as the main input for the ontology
development process. This was the third phase (the
construction phase) in the SIM methodology.
The next step (phase four - rationalization) was to
map the source schemas to the ontology. This map-
ping was done using the mapping functionality of the
Unicorn tool. During this mapping phase, many iter-
ations back to phase three were necessary. It turned
out that the classes in the ontology had been identi-
fied correctly, but still many properties and business
rules had to be added or changed while trying to map
the external assets.
4 RELATED WORK
The MOMIS (Bergamaschi et al., 2001) approach
is a semi-automatic approach to schema integration,
developed at the University of Modena, Italy. The
approach has not been used in any major industrial
projects and is mainly an academic research activity.
Any data source can be connected to the architecture,
as long as an ODL_I3 wrapper is created. MOMIS
has a single mediator, which provides a global data
schema and query interface to the user.
InfoSleuth (Fowler et al., 1999) is a multi-agent
system for semantic interoperability between
heterogeneous data sources. Agents are used for query
and instance transformations between data schemas. An
agent is aware of its own ontology and the mapping
between that ontology and the data schema; it is aware
of the shared ontologies; and it can map its ontology
to those of other agents. InfoSleuth uses several
shared ontologies, made available through the ontology
agents. Individual data sources have (through the
resource agents) a mapping to these shared ontologies.
The shared ontologies are linked together through
one-to-one ontology mappings. Note that the user agents
use the shared ontologies as their vocabulary; local
ontologies are only maintained by the resource agents.
PROMPT (Noy and Musen, 2000) provides a
semi-automatic approach to ontology merging, not
specifically data schema integration. In the ontology
merging in PROMPT, it is assumed that the original
ontologies no longer exist after the merged ontology
has been created. Therefore, there are no mappings
between the original and the merged ontologies, as
there are in most data schema integration solutions.
ONION (Mitra and Wiederhold, 2001) takes a cen-
tralized, hierarchical approach to ontology mapping,
where the user views the (global) articulation ontolo-
gies. The source ontologies are mapped to each other
via articulation ontologies that are in turn used by the
user to express queries. The articulation ontologies
are organized in a tree structure. An articulation on-
tology used for the mapping of two source ontologies
can in turn be one of the sources for another articula-
tion ontology. The creation of a hierarchy can be seen
as a form of ontology clustering. But while (Visser
and Tamma, 1999) take a top-down approach (first
the root application ontology is specified, then child
ontologies are created as necessary), ONION takes
a bottom-up approach in the creation of the articula-
tion ontologies; furthermore, there is no defined root
ontology for the cluster.
5 CONCLUSION
In this paper we have outlined the information
integration problems we were facing in the COG project
and described a solution based on the Semantic
Information Management (SIM) Methodology (Schreiber,
2003) and the Unicorn Workbench tool. We have shown
how SIM, together with the Unicorn Workbench, was used
in the COG project and what its role is in the overall
COG architecture.
Many problems in the construction of the Infor-
mation Model and the mapping to the disparate data
schemas were caused by the poor understanding of
the source data schemas. The data schemas contained
concepts in the Italian language, while the ontology
engineering and mapping were done by non-Italian
speakers. What further complicated the matter was the
fact that the users who worked with the existing
applications were not experts on the database schemas
being used, which made the mapping a hard problem. It
turned out that the only way for the ontology engineer
to construct the ontology was to examine the
applications together with the end-users, which was a
tedious job.
These problems indicate the necessity of a central
Information Architecture, through which the nature of
the data residing throughout the organization can be
understood.
Much of the mentioned related work consists of
academic research prototypes. The Unicorn Work-
bench tool, along with the Semantic Information
Management architecture, has proven itself in many
projects in an industrial setting.
Many of the mentioned approaches take a semi-
automatic approach to the data schema (or ontology)
mapping. The mentioned approaches all use ontol-
ogy mapping or ontology merging as a basis. The
Unicorn Workbench does not map ontologies explic-
itly, but is specialized in the mapping of database (and
several other types of) schemas to a central ontology
and provides an integration platform for data sources
throughout the enterprise.
Major limitations of the current Unicorn Workbench
tool are the lack of support for semi-automatic
mapping, as in PROMPT (Noy and Musen, 2000) and
Chimæra (McGuinness et al., 2000), and the lack of
support for the integration of ontologies. The tool
supports only the integration of data sources into one
ontology, and does not support the mapping between
several ontologies in different organizations or
organizational units. The existence of only one
ontology can lead to several problems, as pointed out
in (Visser and Cui, 1998) and (Uschold, 2000).
ACKNOWLEDGEMENTS
The research presented in this paper was funded by
the COG project, under contract number IST-2001-
38491, http://www.cogproject.org/. Some materials
presented in this paper are the copyright of Unicorn
Solutions, Inc. and are used with permission.
We would like to acknowledge all partners in the
COG project and all people in DERI who have looked
at earlier versions and provided useful feedback.
REFERENCES
Bergamaschi, S., Castano, S., Beneventano, D., and
Vincini, M. (2001). Semantic integration of hetero-
geneous information sources. Special Issue on Intelli-
gent Information Integration, Data & Knowledge En-
gineering, 36(1):215–249.
Fensel, D. (2003). Ontologies: Silver Bullet for Knowledge
Management and Electronic Commerce, 2nd edition.
Springer-Verlag, Berlin.
Fowler, J., Nodine, M., Perry, B., and Bargmeyer, B. (1999).
Agent-based semantic interoperability in InfoSleuth.
SIGMOD Record, 28(1).
McGuinness, D., Fikes, R., Rice, J., and Wilder, S. (2000).
An environment for merging and testing large ontolo-
gies. In Proc. 7th Intl. Conf. On Principles of Knowl-
edge Representation and Reasoning (KR2000), Col-
orado, USA.
Mitra, P. and Wiederhold, G. (2001). An algebra for se-
mantic interoperability of information sources. In
IEEE International Conference on Bioinformatics and
Biomedical Engineering, pages 174–182.
Noy, N. F. and Musen, M. A. (2000). PROMPT: Algorithm
and tool for automated ontology merging and align-
ment. In Proc. 17th Natl. Conf. On Artificial Intelli-
gence (AAAI2000), Austin, Texas, USA.
Schreiber, Z. (2003). Semantic information management:
Solving the enterprise data problem. To be found on
the http://www.unicorn.com/ website.
Uschold, M. (2000). Creating, integrating, and maintain-
ing local and global ontologies. In Proceedings of
the First Workshop on Ontology Learning (OL-2000)
in conjunction with the 14th European Conference on
Artificial Intelligence (ECAI-2000), Berlin, Germany.
Visser, P. and Cui, Z. (1998). On accepting heterogeneous
ontologies in distributed architectures. In Proceedings
of the ECAI98 workshop on applications of ontologies
and problem-solving methods, Brighton, UK.
Visser, P. and Tamma, V. (1999). An experience with
ontology clustering for information integration. In
Proceedings of the IJCAI-99 Workshop on Intelligent
Information Integration in conjunction with the Six-
teenth International Joint Conference on Artificial In-
telligence, Stockholm, Sweden.