metadata do not always follow a standard and cannot be
automatically analyzed; consequently, integration
activities will depend on users either to compare
concepts or to validate correspondences between
schemas. Resolving naming conflicts is not an
automatic task, being semi-automatic at best (Kent,
1998).
Semantically rich conceptual models are the
basis for semantic data integration. Although
conceptual models have been discussed and studied
for over thirty years, very little has been said about
the modeling process. The creation of such a model
implies that the designer has to acquire the concepts
of a universe of discourse, which requires a method.
Also, conceptual models must be represented by means
of an ontological language whose constructs must be
sufficient to semantically describe all the existing
concepts (Lopes et al., 2009).
Data related to concept identification and schema
must be described in the metaschema. The concept
schema is the concept structure, which must be an
XML schema that describes all types, attributes and
constraints that define the concept. In other words,
the concept schema is the canonical concept model
for the organization, which is the basis for solving
structure conflicts.
Comparison of Schemas. This activity aims at
providing the baseline for resolving structure
conflicts. The definition of the relations between
schemas and concepts is the central step of this
activity. Each identified concept must be mapped to
at least one local schema; the relation between a
concept and a local schema is specified through a
query defined in a language known by the data
source to which the schema is linked (SQL for
relational databases or XPath for XML files). All
defined mappings are stored in the metadata base.
The query must access data that is mapped in the
canonical concept model. For instance, to retrieve
data about concept “c1”, which is in a local schema
“s1” related to a PostgreSQL data source “ds1”,
the following query can be used:
SELECT * FROM t1
“t1” is the table in which the data about concept
“c1” is stored in schema “s1”; the “*” represents the
set of attributes that comply with the elements
described for the concept “c1” in its canonical
schema.
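The mapping between a concept and a local schema can be sketched as a small registry plus a function that runs the stored queries. This is only an illustration under stated assumptions: the names ConceptMapping, METADATA_BASE, open_source and fetch_concept are hypothetical, the columns of “t1” are invented sample data, and an in-memory SQLite database stands in for the PostgreSQL data source “ds1”.

```python
import sqlite3
from dataclasses import dataclass


@dataclass(frozen=True)
class ConceptMapping:
    concept: str      # concept identifier, e.g. "c1"
    schema: str       # local schema, e.g. "s1"
    data_source: str  # data source the schema is linked to, e.g. "ds1"
    query: str        # query written in the data source's own language


# Hypothetical metadata base: each concept has at least one mapping.
METADATA_BASE = {
    "c1": [ConceptMapping("c1", "s1", "ds1", "SELECT * FROM t1")],
}


def open_source(name: str) -> sqlite3.Connection:
    """Stand-in for connecting to a data source (SQLite here, not PostgreSQL)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE t1 (id INTEGER, name TEXT)")  # invented columns
    conn.executemany("INSERT INTO t1 VALUES (?, ?)", [(1, "alpha"), (2, "beta")])
    return conn


def fetch_concept(concept: str) -> list:
    """Run every query mapped to the concept and collect the returned rows."""
    rows = []
    for mapping in METADATA_BASE[concept]:
        conn = open_source(mapping.data_source)
        try:
            rows.extend(conn.execute(mapping.query).fetchall())
        finally:
            conn.close()
    return rows
```

Calling fetch_concept("c1") looks up the mapping for “c1” and executes its query against the stand-in source; a concept mapped to several local schemas would simply have several entries in its list.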
An example using a complex concept could be a
query for an address, which would access more than
one table in the schema, such as:
SELECT e.logradouro, e.numero, c.cidade, u.uf
FROM endereco e, cidade c, uf u
WHERE e.codigoCidade = c.codigo
AND e.codigoUf = u.codigo
When defining the relation between the concept and
the local schema, the attributes defined in the
canonical schema must be mapped to the values
returned by the query. Establishing this relation
resolves part of the structure conflicts mentioned
above.
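The attribute mapping for the address query can be sketched as follows. This is a minimal illustration, not the authors' implementation: the canonical attribute names (street, number, city, state) are assumptions, the mapping is positional (it follows the order of the SELECT list), and an in-memory SQLite database with invented sample rows stands in for the real data source.

```python
import sqlite3

ADDRESS_QUERY = """
SELECT e.logradouro, e.numero, c.cidade, u.uf
FROM endereco e, cidade c, uf u
WHERE e.codigoCidade = c.codigo AND e.codigoUf = u.codigo
"""

# Hypothetical canonical attributes, listed in the same order as the
# columns returned by ADDRESS_QUERY (logradouro, numero, cidade, uf).
CANONICAL_ATTRIBUTES = ("street", "number", "city", "state")


def to_canonical(row: tuple) -> dict:
    """Map one query result row to the canonical attribute names."""
    return dict(zip(CANONICAL_ATTRIBUTES, row))


def demo() -> list:
    """Build a tiny sample database and return the address in canonical form."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE uf (codigo INTEGER, uf TEXT);
        CREATE TABLE cidade (codigo INTEGER, cidade TEXT);
        CREATE TABLE endereco (logradouro TEXT, numero TEXT,
                               codigoCidade INTEGER, codigoUf INTEGER);
        INSERT INTO uf VALUES (1, 'RJ');
        INSERT INTO cidade VALUES (10, 'Rio de Janeiro');
        INSERT INTO endereco VALUES ('Av. Brasil', '100', 10, 1);
    """)
    try:
        return [to_canonical(row) for row in conn.execute(ADDRESS_QUERY)]
    finally:
        conn.close()
```

Because the requester only sees the canonical names, a second local schema with different table and column names could be mapped to the same concept by supplying its own query with the columns in the same order.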
The proposed approach adds a new step
(Infrastructure implementation) after the
Comparison of schemas step. In the Infrastructure
implementation step, data integration services should
be implemented.
The information described in the metaschema is
the basis for the execution of the next two steps:
conforming the schemas, and merging and
restructuring.
Conforming the Schemas. In this activity, type, key
and scale conflicts are resolved, and the integrated
schema is built. When the concept service receives a
new data request, it contacts the metadata service to
verify which data services must be called; it then
accesses the appropriate data services and queries
the concept data. Data services then access the
metadata services to check for information about
connections to the data sources. Finally, the data
services query the data sources, get the requested
data and return them to the concept service. The
concept service then calls the integration service
responsible for the conforming step, which unifies
the data and returns them to the concept service.
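The call sequence described above can be sketched in a few functions. This is a simplified illustration under stated assumptions: the concept name "income", the service names, the sample records and the layout of METADATA are all hypothetical; the data services are plain functions returning fixed records instead of querying real sources; and the lookup of connection information in the metadata service is omitted.

```python
# Hypothetical metadata: which data services serve a concept, and how each
# source's records relate to the canonical key name and value scale.
METADATA = {
    "income": [
        {"service": "office_a", "key": "id", "scale": 1.0},       # plain units
        {"service": "office_b", "key": "code", "scale": 1000.0},  # thousands
    ],
}

# Stand-ins for data services; real ones would query their data sources.
DATA_SERVICES = {
    "office_a": lambda: [{"id": "7", "value": "2500"}],
    "office_b": lambda: [{"code": 7, "value": 2.5}],
}


def conform(record: dict, entry: dict) -> dict:
    """Integration-service step: unify key name, value type and scale."""
    return {
        "key": int(record[entry["key"]]),                   # key conflict
        "value": float(record["value"]) * entry["scale"],   # type and scale
        "source": entry["service"],
    }


def concept_service(concept: str) -> list:
    """Ask the metadata which services to call, call them, conform the data."""
    conformed = []
    for entry in METADATA[concept]:
        for record in DATA_SERVICES[entry["service"]]():
            conformed.append(conform(record, entry))
    return conformed
```

After this step both sample records carry the same key (7) and the same value (2500.0), so they can be compared directly in the merging step.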
Merging and Restructuring. In this activity, the
concept service calls the integration services which
will merge the data, based on the quality criteria
defined in the metaschema, and return them to the
concept service, which returns the integrated data,
formatted according to the concept schema, to the
requester.
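One possible reading of merging by quality criteria is sketched below. The paper does not specify the form of these criteria, so the per-source score in QUALITY, the source names and the record layout are assumptions made only for illustration: when several conformed records share a key, the record from the highest-scoring source is kept.

```python
# Hypothetical quality criteria: a higher score marks a more trusted source.
QUALITY = {"office_a": 0.9, "office_b": 0.6}


def merge(records: list) -> list:
    """Keep, for each key, the record from the source with the best score."""
    best = {}
    for rec in records:
        current = best.get(rec["key"])
        if current is None or QUALITY[rec["source"]] > QUALITY[current["source"]]:
            best[rec["key"]] = rec
    # Return the integrated data ordered by key, without the source field.
    return [{"key": key, "value": rec["value"]} for key, rec in sorted(best.items())]
```

For example, if office_a and office_b both report key 7, the value from office_a is returned, while a key reported only by office_b is passed through unchanged.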
4 CASE STUDY
The scenario for the case study is the Brazilian
government census bureau, IBGE (Brazilian
Institute for Geography and Statistics), in which a
large number of heterogeneous data sources are
geographically distributed and frequently
exchanged among the foundation’s offices. This
environment is ideal for the deployment and study of
the proposed solution. The study started with the
evaluation of some already modeled business
processes; the processes for data validation and
dissemination used during the 2000 Brazilian
census were selected. The choice was based on the
A SERVICE-BASED APPROACH FOR DATA INTEGRATION BASED ON BUSINESS PROCESS MODELS