CHANGE MANAGEMENT IN DATA INTEGRATION SYSTEMS
Rahee Ghurbhurn, Philippe Beaune
Génie Industriel et Informatique, Ecole des Mines de St Etienne
158 cours Fauriel 42000 St Etienne, France
Hugues Solignac
STMicroelectronics, zi Peynier Rousset 13790 Rousset
Keywords: Data Integration, Change management, Multiagent Systems, Ontology.
Abstract: In this paper, we present a flexible architecture allowing applications and functional users to access
heterogeneous distributed data sources. Our proposition is based on a multi-agent architecture and a domain
knowledge model. The objective of such an architecture is to introduce some flexibility in the information
systems architecture. This flexibility can be in terms of the ease to add or remove existing/new applications
but also the ease to retrieve knowledge without having to know the underlying data sources structures. We
propose to model the domain knowledge with the help of one or several ontologies and to use a multi-agent
architecture maintain such a representation and to perform data retrieval tasks. The proposed architecture
acts as a single point of entry to existing data sources. We therefore hide the heterogeneity allowing users
and applications to retrieve data without being hindered by changes in these data sources.
1 INTRODUCTION
Due to economic globalisation, organizations now
evolve in a highly competitive environment where
having the right information at the right time is the
key of success. In the 90s, companies have invested
massively in information systems so as to be able to
store and analyse data through the use of data
warehouses. The result was one or more information
systems for each department. To derive higher value
added from the stored data, departments started to
share the information found in their systems.
Organization wide data sharing was rationalized
through the use of Enterprise Application Integration
solutions. From a point to point architecture we
obtained an organized global architecture composed
of heterogeneous applications communicating
through message buses. The problem is that in an
economy where the key word is flexibility the
information system rationalization created a rigid
architecture.
Indeed, the tight coupling of applications and
data sources makes it difficult to adapt them to
technological, organizational or economical
changes. Application or data source evolution are
difficult to realize as it impact the whole system.
Moreover, users having access to several data
sources owned by different departments may find it
difficult to identify the appropriate data for their
analysis. Indeed, domain concepts may be found in
several data sources but their semantic may differ
from a domain to another. In some cases it may be
interesting to use data collected by another
department to make an analysis more complete.
Our proposal consists in modelling different user
domain knowledge and associating them to their
corresponding data found in heterogeneous
distributed data sources. The user domain
knowledge is expressed in the form of one or several
ontologies (Gruber, 1993) that can be shared across
the organization or between several organizations.
To take into consideration the changing aspect of an
organization’s environment and the distributed
nature of data sources, the knowledge representation
is manipulated and maintained by a Multi-agent
system (MAS)(Nwana, 1996; Sycara et al., 1996).
In this paper, we are going to consider our
motivations and existing approaches in section 2.
Section 3 deals with our proposition in terms of
knowledge representation and in terms of MAS
268
Ghurbhurn R., Beaune P. and Solignac H. (2007).
CHANGE MANAGEMENT IN DATA INTEGRATION SYSTEMS.
In Proceedings of the Ninth International Conference on Enterprise Information Systems - DISI, pages 268-273
DOI: 10.5220/0002376102680273
Copyright
c
SciTePress
architecture. Lastly section 4 presents a brief
conclusion and some perspectives.
2 MOTIVATIONS AND EXISTING
APPROACHES
Our work is being carried out in an industrial
environment composed of several independent
heterogeneous and distributed data sources. These
data sources are accessed by several applications for
production analysis, production planning and follow
up, decisional and reporting activities. Some
applications are old legacy applications with direct
access to the data sources other communicate
through message buses and other by means of flat
files. These applications and data sources belong to
different services and are therefore independent.
This implies that the data sources can be modified at
any time, by a service, without any notification to
the others.
Our objective is to design and implement a
mechanism that allow functional user to explore the
existing knowledge found in different data sources
and retrieve the associated data without having to
know the underlying data sources’ structures. This
data exploration and retrieval should be possible
even if changes have been operated on the data
sources. Moreover, we also consider the fact that if
the data sources are changed the applications
accessing them are also impacted. We propose to
isolate these applications from this type of changes
by creating a single point of entry common to all
data sources.
The problem of integrating distributed data
sources has been addressed by several research
communities namely the database, artificial
intelligence and the knowledge representation
community. Several solutions have been proposed.
Federated databases (Busse, 1999) consist in
defining a canonical schema to map the data sources
logical schemas. Federated data sources create a
tightly coupled global information system and it is
therefore difficult to make each components of the
system evolve without impacting the global
structure.
Multi-database query languages allows data
sources to be loosely integrated as there is no global
schema, but the main problem is that the semantic
heterogeneity is not dealt with by the system but by
the users. Indeed the user has to know the physical
and logical characteristics of each data sources to be
able to take into account the data heterogeneity.
There are several implementations of multi-database
query languages namely MSQL (Litwin et al.,
1989), SchemaSQL (Lakshmanan et al., 2001) and
FRAQL (Sattler and Saake, 2000). These languages
are all SQL extensions.
Data integration through mediation (Wiederhold,
1992; Levy, 1999) consists in defining for each data
source a local schema describing the content of these
data sources. These local schemas are then mapped
to a global schema that describes the relationship
between the content of the local schemas. The
mapping between the local and the global schema
can be specified by either the global-as-view or the
local-as-view approach. This approach allows
loosely coupled data integration in the sense that the
data sources remain independent but it creates a tight
coupling between the local and global schemas.
Indeed when ever a data source changes the
mappings have to be updated and this may prove to
be a tedious task. Some projects based on the
mediator approach are SIMS (Arens et al., 1993),
TSIMMIS (Chawathe et al., 1994),
Infosleuth(Bayardo et al., 1997). Recently two new
approaches global-local as view and both-as-view
have been proposed by (Friedman, 1999) and
(McBrien and Poulovassilis, 2003) respectively.
These approaches aim at combining the advantages
of global-as-view and local-as-view. That is
combining the ease of transforming queries and the
ease of adding new data sources.
Ontology driven integration consists in defining
a local ontology for each data sources and link them
either by defining a global ontology containing
concepts that subsumes the local concepts or by
defining inter-ontology mappings. Ontology driven
integration presents the same inconvenient as the
mediation approach. There is a tight coupling
between the local schemas and the global schema.
Moreover, it may be difficult to reconcile different
ontologies to establish inter-ontology mappings. The
main projects base on ontology driven integration
are OBSERVER (Mena et al., 2000) and KRAFT
(Peerce et al., 2000).
All the above mentioned approaches use
database views to enable data integration via a
global schema. According to us these solutions are
difficult to deploy in an environment where the data
sources are autonomous. By autonomous we mean
that each company or service administers its own
data sources and can therefore modify their structure
without any notification. Moreover, database view
administration may be a complex task if the
modelled domain evolves rapidly. Indeed changes in
the domain may require the deletion of attributes or
CHANGE MANAGEMENT IN DATA INTEGRATION SYSTEMS
269
relations in the data sources invalidating the views
used for data integration (Bellahsene, 2002, Amy et
al, 2002).
Another point is that the above mentioned
approaches do not provide any backup mechanism to
allow for query answering if one data source is
offline. Indeed to the best of our knowledge we did
not find any mechanism that redirect query to
backup data sources if one them fails.
Lastly, these solutions have been built for human
use and do not take into account the fact that some
planning or decisional applications need to access
these data. In fact when ever a data source is
modified, it impacts not only the users but also the
applications using the data source. The applications
have to be updated either by updating the queries the
application uses or by updating the source code for
legacy applications.
3 PROPOSITION
Our proposition consists in expressing the content of
the data sources in terms of users’ domain
knowledge. This knowledge representation takes the
form of one or several ontologies, each representing
a different domain, expressed in OWL DL. The
ontology does not only express the relation between
concepts that are present in the data sources but also
their localization. That allow us to reuse a same data
source in different domain ontologies. These
relationships are used to replace views and are used
to reformulated user queries. In the following we are
going to explain our knowledge modelling and why
do we need a MAS to maintain it.
3.1 Knowledge Modelling
As we said previously, our knowledge representation
concern domain knowledge. The main reason is that
modelling domain knowledge offers greater stability
as compared to data source logical models.
Moreover, logical schema presents several
mechanisms used for data storage and retrieval
optimization and therefore not embodying any
domain knowledge that may interest functional
users.
Thus we are going to describe concepts like
“integrated circuit”, “equipment”, “production
engineer” and for each of them, their associated
properties. For example, an equipment can have as
properties, a serial number, a localization, a date of
purchase, a production load and so on. After having
identified the concepts and their properties, their
semantic is specified by defining the appropriate
relationships between each concept.
In this form, the ontology allows users to browse
through the knowledge contained in the data sources
but it does not allow them to retrieve any associated
data. We therefore have to add some description
about their localization. For the time being, we
consider only databases but our model can easily be
extended to the description of any other data source
like XML files or object databases.
To have a complete description of a concept, we
consider that properties can be expressed as a
function of one or several data source attributes
found in the same or different data sources. For
example, a production engineer and an equipment
maintenance engineer will be interested in different
aspects of an equipment. But both of them would be
interested in indicators coming from another domain
like the production load (production domain) or the
maintenance actions performed on an equipment
(maintenance domain). Indeed the maintenance
engineer may use the production load indicator to
plan and determine future maintenance actions. The
production engineer will perhaps want to know the
recent maintenance action performed in case high
production loss. The problem is that this information
may not be found in the same data source. For these
reasons we associate to a property one or more data
sources. Our proposal is not based on views, the
queries over the data sources are formed at runtime
based on the selected concepts and properties. The
figure below is a UML representation of the basic
concepts of our ontology.
Figure 1: Ontology model.
ICEIS 2007 - International Conference on Enterprise Information Systems
270
Figure 1. represents the UML model
corresponding to the structure of our ontology. A
concept is composed of properties and relations.
Each property can have one or more attributes
referencing the corresponding data found in the data
sources. Relations define the type of relationship
existing between two concepts.
Having linked the knowledge representation to
the data sources, users can now browse through the
knowledge contained in the data sources and ask the
system to retrieve the desired data. But before being
able to present the retrieve data to the users, we have
to firstly convert the queries expressed in terms of
domain knowledge into queries understandable by
the data sources. Domain queries are composed of
concepts, properties and constraints on selected
concepts and properties. The query transformation
mechanism is explained in section 3.2.2. Secondly,
we have to check for the data structure consistency
and harmonize them. Another issue is the coherence
of the ontology with the data sources. As we said
earlier, we assume that the data sources are
independent and can be modified at any time. In this
case, we must be able to detect the data source
changes, evaluate the impact on the ontology, make
the necessary updates and deal with queries that
have been issued just before the data source change.
In the next section, we propose to address all
these issues through a MAS. The latter will have as
main objective to convert queries expressed in terms
of domain knowledge into queries understandable by
data sources, retrieve the corresponding data, and
harmonize them before presenting them to the users.
The MAS must also be able to detect any change in
the database structure, evaluate its impact on the
ontology and take corrective measures to keep the
representation coherent and deal with issued queries.
3.2 Multi-agent Architecture
In the past years, MAS have been used in a variety
of applications like distributed computations,
simulation, computer games, information gathering,
data mining, production and supply chain planning.
The use of MAS paradigm is motivated by the
various interesting properties it presents.
Software agents are autonomous, this property
allows them to act on behalf of the user. They can
contain some level of intelligence, which is either
hard coded through fixed rules or acquired using
learning engines. This so-called intelligence allow
them to interact and adapt to changes in their
environment.
Agents are able to communicate with users, other
systems and other agents as required.
Communication is an important characteristic, in the
sense that it allows several agents, having no
complete information, to cooperate and execute
complex tasks or achieve complex objectives.
The motivations behind the use of MAS as a
potential solution to our distributed and
heterogeneous data retrieval problem are based on
organisational and architectural aspects.
Organisational aspects include the facts that MAS
can easily auto-adapt to changing environment and
re-synchronise the elements composing the system.
Architectural aspects concern the robustness and
flexibility in terms of administration or operation.
That is MAS can be partly administered or updated
without having to switch the whole system offline.
The MAS architecture that we propose has three
basic functions; ontology construction assistance,
information retrieval and ontology coherence
maintenance.
3.2.1 MAS and Ontology Construction
Assistance
The data sources being distributed, we decided to
affect a resource agent to each data source. After
having identified the concepts of a user domain and
their corresponding attributes in the data sources, the
system administrator can ask the agents to retrieve
the corresponding meta-data to instantiate our
“database” and “sources” concept in our ontology.
The association of the domain concept to the sources
is done manually by the administrator. The
administrator must also define the links that exists
between different data sources. These links will be
used when converting ontological queries into SQL
queries and when fusing the result sets
corresponding to the queries. The instantiation and
mapping is done through the use of an ontology
agent.
3.2.2 MAS and Information Retrieval
The information retrieval mechanism, that we
propose, does not use views for query reformulation
and checks for the availability of the data sources
before retrieving data. If the data source is not
available the mechanism is able to determine the
accessibility of the data sources participating in the
query and retranslate the user query based on backup
data sources.
Once the ontology built, it allows functional users to
formulate queries in terms of domain knowledge to
retrieve data without having to know the data
CHANGE MANAGEMENT IN DATA INTEGRATION SYSTEMS
271
sources’ underlying structures. The main problem is
that these domain knowledge queries are not
understood by the data sources. The ontology agent
translates domain knowledge queries into queries
understandable by data sources by using the
information found in the ontology. The steps in our
query conversion mechanism are:
From the selected concepts and properties find
the corresponding data source attributes
From the relevant attributes find the relevant
data sources and data sources’ tables
Compose the queries for each data source
From the relevant sources find the applicable
links between them, from the ontology and
automatically include the necessary attributes
and tables in the appropriate queries (defined
in the previous step)
Query the data sources
As we previously said, we stored information
regarding the data sources for each data source
attribute added to the ontology. From this
information we can find out to which table of which
data source the attribute belongs. This information
allows us to fill the SELECT and FROM part of an
SQL query. Concerning the WHERE part and more
precisely the join clauses, it can be built from a
directed graph.
This graph is built by using the meta-information
regarding the attributes forming the SELECT part of
the query. The nodes of the graph represent the data
source tables forming the WHERE clause of the
query. The join clauses are determined by finding a
path from one node to another. If several paths are
found we choose the one that minimises the join
cost.
An ontological query can be broken into several
SQL queries concerning several data sources. The
links, between the data sources, defined by the
administrator are used to automatically include
additional attributes and tables to the queries
obtained in the previous step.
The queries are addressed to their respective data
sources. On receiving the queries each resource
agent checks their consistency, that is if the elements
composing the queries are found in the data source.
If an inconsistency is found, the resource agent
notifies the ontology agent who asks the other
resource agent to cancel the data source querying
and restarts the query reformulation steps but this
time using backup attributes found in the ontology.
If the data sources are successfully queried, the
result is sent to the users via the task agent
responsible of data harmonization. If the data source
querying is unsuccessful then an alert is sent to the
administrator and a notification sent to the user.
As with any data repository, query response time
must be as short as possible. To reduce the query
response time, we defined a conversion matrix
composed of all the properties contained in our
ontology and their corresponding attributes found in
the data sources. Their equivalence is specified by a
“1” at the intersection of the row and a column
respectively. Therefore whenever a query translation
has to be performed, no inference is made on the
ontology; instead the conversion matrix is used,
speeding up the translation.
As compared to approaches based on views the
query transformation is done at runtime and
therefore only existing attributes and online data
sources are selected. Moreover the burden of views
administration and maintenance is considerably
reduced. Indeed, the only views created are views
generally created by each local data source
administrator to speed up query answering. These
views can therefore be administered locally without
any impact on the global data integration system.
3.2.3 MAS and Ontology Coherence
As we mentioned in section 2, we consider that the
applications and the data sources are independent
and can therefore be modified at any time by the
system administrator. This poses serious problems as
regards to the coherence of our ontologies with the
data sources. Indeed, if the system administrator
removes some attributes, tables or even a data base
without any notification the users will not be
informed and will continue to send queries,
containing the removed attributes, to the data
sources.
We propose to use the resource agents to monitor
data sources and automatically detect any change,
operated by the administrators, in the data sources’
structures. When a change occurs, a notification is
sent to the ontology agent. Using the conversion
matrix, the latter is able to identify the impacted
concepts and evaluate the impact on the ontology.
The impact is measured by determining the number
of relations that the impacted concept has. After
having determined the impact of the changes on the
ontology, the ontology agent can either, in case of
attributes deletion, prevent the users from selecting
the properties of the concepts that have changed or
inform the system administrator that new data have
been added to the data sources. The administrator
can then decide if the new data should be associated
to a concept or not.
ICEIS 2007 - International Conference on Enterprise Information Systems
272
Moreover, our proposition must also deal with
queries that have been issued by user unaware of the
changes. In this case, the system tries to return a
partial answer and an explanation for the partial
result.
4 CONCLUSION
We presented in this paper our current work on a
MAS for information systems semantic
interoperability. The aims of such an architecture is,
firstly, to be able to explore and share knowledge
present in heterogeneous distributed data sources
within and between organization. Secondly isolate
users and applications from changes in the data
sources while allowing them to retrieve any data
from any data sources at any time. We have partly
implemented our MAS architecture and developed a
small ontology describing the knowledge contained
in three data sources. Some future perspectives
would be to develop an database annotation
mechanism that allows our agents to automatically
add new data sources to the ontology
REFERENCES
Gruber, T.R, 1993. Toward principles for the design of
ontologies used for knowledge sharing. In Formal
Ontology in Conceptual Analysis and Knowledge
Representation, Kluwer Academic Publishers. The
Netherlands.
Nwana, H.S. 1996, Software Agents: An Overview. In
Knowledge Engineering Review, vol. 11, pages 205-
244
Sycara, K., Pannu, A., Williamson, M., Zeng, D. and
Decker, K. 1996, Distributed Intelligent Agents. In
Intelligent Systems and their Applications, vol. 11,
pages 36-46. IEEE Expert.
Busse, S., Kutsche, R., Leser, U. and Weber, H., 1999,
Federated Information Systems: Concepts,
Terminology and Architectures, In Technische
Universitat Berlin., Technical Report .
Litwin, W., Abdellatif, A., Zeroual, A., Nicolas, B., and
Vigier, P. 1989, MSQL: a multidatabase language. In
Information sciences, vol. 49, pages 59-101. Elsevier
Science
Lakshmanan, L.V.S., Sadri, F. and Subramanian, S.N.,
2001 SchemaSQL: An extension to SQL for
multidatabase interoperability. In ACM Transactions
on Database Systems, vol. 26, pages 476-519. ACM
Press
Sattler, K., Conrad, S. and Saake, G. 2000. Adding
Conflict Resolution Features to a Query Language for
Database Federations. In Proc. 3nd Int. Workshop on
Engineering Federated Information Systems Akadem.
Verlagsgesellschaft
Wiederhold, G. 1992. Mediators in the Architecture of
Future Information Systems. In Computer, vol. 25,
pages 38-49, IEEE Computer Society Press
Levy, A. 1999. Combining Artificial Intelligence and
Databases for Data Integration. In Lecture Notes in
Computer Science. Springer.
Arens, Y., Chee, C.Y., Hsu, C. and Knoblock, C.A. 1993
Retrieving and Integrating Data from Multiple
Information Sources. In International Journal of
Cooperative Information Systems.
Chawathe, S., Garcia-Molina, H., Hammer, J., Ireland, K.,
Papakonstantinou, Y., Ullman, J.D. and Widom, J.,
1994. The TSIMMIS Project: Integration of
heterogeneous information sources. In Proceedings
10th Anniversary Meeting of the Information
Processing Society of Japan
Bayardo, R. J., Bohrer, W., Brice, R., Cichocki, A.,
Fowler, J., Helal, A., Kashyap, V., Ksiezyk, T.,
Martin, G., Nodine, M., Rashid, M., Rusinkiewicz, M.,
Shea, R., Unnikrishnan, C., Unruh, A., Woelk, D.,
1997. InfoSleuth: agent-based semantic integration of
information in open and dynamic environments. In
SIGMOD '97: Proceedings of the 1997 ACM
SIGMOD international conference on Management of
data.
Mena, E., Illarramendi, A., Kashyap V. and Sheth, A.P,
2000 OBSERVER: An Approach for Query
Processing in Global Information Systems Based on
Interoperation Across Pre-Existing Ontologies. In
Distributed and Parallel Databases, vol. 8, pages 223-
271. Springer
Preece, A.D., Hui, K., Gray, W.A., Marti, P., Bench-
Capon T.J.M., Jones, D.M., and Cui, Z. 1999 The
KRAFT architecture for knowledge fusion and
transformation. In Expert Systems, Springer, Berlin
Bellahsene, Z. 2002. Schema evolution in data
warehouses. Knowledge Information System vol. 4,
pages 283-304
Lee, A. J., Nica, A., and Rundensteiner, E. A. 2002. The
EVE Approach: View Synchronization in Dynamic
Distributed Environments. IEEE Transactions on
Knowledge and Data Engineering, vol 14, issue 5,
pages 931-954
Miller, L., Seaborne, A. and Reggiori, A., 2002 Three
Implementations of SquishQL, a Simple RDF Query
Language. In ISWC '02: Proceedings of the First
International Semantic Web Conference on The
Semantic Web.
McBrien, P. and Poulovassilis, A., 2003, Data integration
by bi-directional schema transformation rules. In 19th
International Conference on Data Engineering.
Friedman, M. and Levy A. and Millstein T. 1999,
Navigational plans for data integration. In Proceedings
of the sixteenth national conference on Artificial
intelligence and the eleventh Innovative applications
of artificial intelligence conference innovative
applications of artificial intelligence, American
Association for Artificial Intelligence.
CHANGE MANAGEMENT IN DATA INTEGRATION SYSTEMS
273