input ontology into an E-connection with the largest
possible number of connected knowledge bases.
• Clustering Strategy: This approach has been
proposed by (Pei et al., 2006). First, schemas are
clustered based on their contextual similarity.
Second, attributes of the schemas that are in the
same schema cluster are clustered to find attribute
correspondences between these schemas. Third,
attributes are clustered across different schema
clusters using statistical information gleaned from
the existing attribute clusters to find attribute
correspondences between more schemas.
• Statistical Strategy: This approach has been
introduced by (He et al., 2003) and (He et al., 2004)
with the MGS (for hypothesis modeling, generation, and
selection) and DCM (Dual Correlation Mining)
frameworks. The MGS framework is an approach for
global evaluation, building upon the hypothesis of
the existence of a hidden schema model that
probabilistically generates the schemas we observed.
This evaluation estimates all possible “models,”
where a model expresses all attribute matchings.
Nevertheless, this approach does not take complex
mappings into consideration. The DCM framework
has been proposed for local evaluation, based on the
observation that co-occurrence patterns across
schemas often reveal the complex relationships of
attributes. However, these approaches suffer from
noisy data. HSM (Holistic Schema Matching) and
PSM (Parallel Schema Matching) have been
proposed by (Su et al., 2006) to find matching
attributes across a set of Web database schemas of
the same domain. HSM integrates several steps,
including matching score calculation, which measures
the probability of two attributes being synonymous,
and grouping score calculation, which estimates
whether two attributes are grouping attributes. PSM forms
parallel schemas by comparing two schemas and
deleting their common attributes. HSM and PSM are
based purely on the occurrence patterns of attributes
and require neither domain knowledge nor user
interaction.
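The parallel-schema step of PSM described above can be illustrated with a minimal sketch. It assumes that a schema is simply a list of attribute names compared by exact string equality; the function name parallel_schemas is hypothetical and is not taken from (Su et al., 2006).

```python
def parallel_schemas(schema_a, schema_b):
    """Compare two schemas and delete their common attributes.

    Assumption: a schema is a list of attribute names, and two
    attributes are "common" when their names are identical strings.
    """
    common = set(schema_a) & set(schema_b)
    rest_a = [attr for attr in schema_a if attr not in common]
    rest_b = [attr for attr in schema_b if attr not in common]
    return rest_a, rest_b


# Two book-search interfaces of the same domain (illustrative data).
left, right = parallel_schemas(
    ["title", "author", "isbn"],
    ["title", "writer", "isbn"],
)
```

After the shared attributes are removed, the remaining attributes (here author and writer) are the ones left to be compared as possible synonyms.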
In our work, we propose a decomposition
approach which divides XML schemas into small
sub-schemas using linguistic and tree
mining techniques. Our approach is similar to the
fragmentation strategy. The main difference lies in
the way intra-schema structures, called shared
sub-structures in COMA++ (Do and Rahm, 2007),
are found. More precisely, our approach extends
the fragmentation method to find inter-schema
structures automatically and is applied to
several schemas at once.
3 XML SCHEMAS DECOMPOSITION APPROACH
We propose a decomposition approach, as a
pre-matching phase, which breaks down large XML
schemas into smaller sub-schemas to improve the
performance of large schema matching. Our
approach identifies and extracts common structures
between and within XML schemas (inter- and
intra-schema) and finds the sub-schema candidates for
matching.
As illustrated in Figure 3, our proposed approach
is composed of three phases: (1) converting XML
schemas into trees, (2) identifying and mining frequent
sub-trees, and (3) finding relevant frequent sub-trees.
Our approach is based on the following
observations and assumptions: a) large-scale schemas
are varied and voluminous, b) schemas in the
same domain contain the same domain concepts, and
c) within one schema, several sub-schemas are
redundant.
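To make phase (2) concrete, the following is a minimal sketch of how repeated structures could be counted within and across schemas. It is a strong simplification and not the mining technique used in this work: it assumes that trees are (label, children) tuples, considers only complete sub-trees with identical labels, and uses a hypothetical min_support threshold.

```python
from collections import Counter


def canonical(tree):
    """Encode a (label, children) tree so that child order is irrelevant."""
    label, children = tree
    return "(" + label + "".join(sorted(canonical(c) for c in children)) + ")"


def subtrees(tree):
    """Yield every complete sub-tree that has at least one child."""
    label, children = tree
    if children:
        yield tree
    for child in children:
        yield from subtrees(child)


def frequent_subtrees(schemas, min_support=2):
    """Count sub-tree occurrences within and across all schema trees."""
    counts = Counter()
    for tree in schemas:
        counts.update(canonical(s) for s in subtrees(tree))
    return {enc: n for enc, n in counts.items() if n >= min_support}


address = ("Address", [("Street", []), ("City", [])])
order = ("Order", [("ShipTo", [address]), ("BillTo", [address])])
customer = ("Customer", [("Address", [("City", []), ("Street", [])])])
shared = frequent_subtrees([order, customer])
```

In this illustrative data, the Address structure occurs twice inside one schema (intra-schema) and once in another (inter-schema), so it is the only sub-tree that reaches the threshold.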
We discuss in this section the different phases of
the decomposition approach.
Figure 3: Decomposition approach.
3.1 XML Trees: From Schemas to Trees
The goal of this initial phase is to transform XML
schemas into trees and to find linguistic relations
between elements. This improves the
decomposition by considering not only elements with
exactly the same labels but also linguistically
similar elements.
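As an illustration of why linguistic relations matter, the sketch below normalizes element labels so that differently written labels can be treated as the same concept. The tokenization rule and the small SYNONYMS table are assumptions made for this example only; a real system would rely on a proper linguistic resource.

```python
import re

# Hypothetical token-level synonym and abbreviation table.
SYNONYMS = {"addr": "address", "zip": "postal", "post": "postal"}


def normalize(label):
    """Split a label into tokens, lowercase them, and map known synonyms."""
    tokens = re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+", label)
    words = [SYNONYMS.get(t.lower(), t.lower()) for t in tokens]
    return " ".join(words)


def linguistically_similar(label_a, label_b):
    """Treat two labels as similar when their normalized forms are equal."""
    return normalize(label_a) == normalize(label_b)
```

With such a normalization, labels like ZIP_Code and postCode map to the same form and can be treated as one label during decomposition.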
We first need to parse the XML schemas and
transform them into trees. The main feature of
these large schemas is that they contain referential
constraints, which makes parsing them a
difficult exercise. To cope with these constraints, we
duplicate the segment that they refer to in order to
resolve their multiple contexts. We note that most previous
matching systems focused on simple schemas without
referential elements.
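The duplication of referenced segments can be sketched as follows. This is a minimal illustration, not the parser used in this work: it assumes that references are expressed as ref attributes pointing to global element declarations, ignores attributes and type references, and cuts recursive references to avoid infinite expansion. The names build_tree, child_declarations, and EXAMPLE_XSD are hypothetical.

```python
import xml.etree.ElementTree as ET

XS = "{http://www.w3.org/2001/XMLSchema}"


def child_declarations(node):
    """Yield element declarations nested under node, skipping wrappers
    such as complexType and sequence."""
    for child in node:
        if child.tag == XS + "element":
            yield child
        else:
            yield from child_declarations(child)


def build_tree(xsd_text, root_name):
    """Return a (label, children) tree in which every referenced
    global element is copied into each context that refers to it."""
    schema = ET.fromstring(xsd_text)
    global_elements = {e.get("name"): e for e in schema.findall(XS + "element")}

    def expand(decl, open_refs):
        ref = decl.get("ref")
        if ref is not None:
            name = ref.split(":")[-1]
            if name in open_refs:  # recursive reference: stop here
                return (name, [])
            decl = global_elements[name]
            open_refs = open_refs | {name}
        children = [expand(c, open_refs) for c in child_declarations(decl)]
        return (decl.get("name"), children)

    return expand(global_elements[root_name], frozenset({root_name}))


EXAMPLE_XSD = """
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="Address">
    <xs:complexType><xs:sequence>
      <xs:element name="Street"/>
      <xs:element name="City"/>
    </xs:sequence></xs:complexType>
  </xs:element>
  <xs:element name="Order">
    <xs:complexType><xs:sequence>
      <xs:element name="ShipTo">
        <xs:complexType><xs:sequence>
          <xs:element ref="Address"/>
        </xs:sequence></xs:complexType>
      </xs:element>
      <xs:element name="BillTo">
        <xs:complexType><xs:sequence>
          <xs:element ref="Address"/>
        </xs:sequence></xs:complexType>
      </xs:element>
    </xs:sequence></xs:complexType>
  </xs:element>
</xs:schema>
"""

order_tree = build_tree(EXAMPLE_XSD, "Order")
```

In the resulting tree, the Address segment appears once under ShipTo and once under BillTo, so each copy keeps its own context.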
An XML schema is then modeled as a labeled
unordered rooted tree. Each element or attribute of