Bayesian Networks for Matcher Composition in Automatic Schema

Matching

Daniel Nikovski

, Alan Esenther

, Xiang Ye

, Mitsuteru Shiba

and Shigenobu Takayama

Mitsubishi Electric Research Laboratories, 201 Broadway, Cambridge, MA 02139, U.S.A.

Mitsubishi Electric Corporation, 5-1-1 Ofuna, Kamakura, Kanagawa 247-8501, Japan

Keywords:

Data Integration, Virtual Databases, Uncertain Schema Matching.

Abstract:

We propose a method for accurate combining of evidence supplied by multiple individual matchers regarding

whether two data schema elements match (refer to the same object or concept), or not, in the ﬁeld of automatic

schema matching. The method uses a Bayesian network to model correctly the statistical correlations between

the similarity values produced by individual matchers that use the same or similar information, in order to

avoid overconﬁdence in match probability estimates and improve the accuracy of matching. Experimental

results under several testing protocols suggest that the matching accuracy of the Bayesian composite matcher

can signiﬁcantly exceed that of the individual component matchers.

1 INTRODUCTION

The problem of automatic schema matching (ASM)

between two or more database schemas arises in

many applications, such as data migration, when one

database has to be incorporated into another, virtual

databases, where a single interface is used to access

multiple databases, and data analysis, when multiple

databases are stored in a data warehouse with a single

schema. When two database schemas that describe

the same problem domain are given (e.g. purchase

orders, real-estate listings, books, etc.), the objec-

tive of an automatic schema matching (ASM) method

is to discover which pairs of elements from the two

schemas are likely to match, that is, likely to refer to

the same entity (e.g. shipping address, house price,

book title, etc.), and possibly to also estimate the con-

ﬁdence of such a match.

The ASM problem is usually very difﬁcult,

because when database designers create database

schemas, they rarely provide full and unambiguous

information about what individual schema elements

represent. Even if any such information exists, it is

usually not meant for computer processing. Rather,

database designers usually choose suitable words or

abbreviations for the names of data elements, so as to

facilitate future maintenance of the data schemas by

themselves or other humans. Because of this common

practice, lexical analysis of the names of data ele-

ments could be an effective approach to ASM. For ex-

ample, the names “Street”, “Str”, and “StreetName”

can be recognized to refer to a street, possibly in an

address, and lexical analysis by string matching can

reveal this similarity. A different type of information

that might be useful for ASM is the structure of the

data schemas, if present. In many cases, schemas are

not represented by a ﬂat list of element names, but the

elements are organized in a hierarchy. For example,

the element “CustomerName” might have three sub-

elements, “FirstName”, “MiddleInitial”, and “Fami-

lyName”. Using such structural information is an-

other approach to ASM. Many more approaches exist:

for example, when the actual values of two database

ﬁelds come from the same statistical distribution (e.g.,

over names, numbers, etc.), this can serve as evidence

that the corresponding schema elements match. Dic-

tionaries, thesauri, and other auxiliary data sources

have been used for ASM purposes, too (Rahm and

Bernstein, 2001).

Due to the difﬁculty of the problem, no single

method has been shown to perform best on all ASM

tasks. This has led to the idea that multiple basic

matchers of the types described above can be used to-

gether in a composite matcher (Do and Rahm, 2002;

Tang and Li, 2006). The purpose of the composite

matcher is to combine the output of the individual

matchers and arrive at a more accurate set of likely

matches. In most cases, the output of an individual

matcher k for a given pair of elements S

and S

is a similarity value v

in the interval [0, 1], where

Nikovski D., Esenther A., Ye X., Shiba M. and Takayama S..

Bayesian Networks for Matcher Composition in Automatic Schema Matching.

DOI: 10.5220/0004001500480055

In Proceedings of the 14th International Conference on Enterprise Information Systems (ICEIS-2012), pages 48-55

ISBN: 978-989-8565-10-5

 2012 SCITEPRESS (Science and Technology Publications, Lda.)

= 0 means no similarity, and v

= 1 means full con-

ﬁdence that the two elements match. When given a

library of K different individual matchers, the objec-

tive, then, is to ﬁnd a composite similarity measure v

that is a function of the individual outputs v

, k = 1, K.

Several methods for combining similarity values

have been proposed. The LSD system (Doan et al.,

2003) uses machine learning techniques to estimate

weighting coefﬁcients w

such that the ﬁnal similarity

measure v is a weighted average of the individual sim-

ilarity measures: v =

∑

k=1

. The COMA system

(Do and Rahm, 2002) extends this approach with the

minimum and maximum operators: v

min

= min

and v

max

= max

Although experimental results suggest that these

methods for combining similarity values lead to

matching accuracy that is higher than that of the ac-

curacy of the individual matchers, it can be recog-

nized that they are speciﬁc approaches to the funda-

mental problem of combining evidence from multi-

ple sources (in this case, multiple individual match-

ers), and make very speciﬁc assumptions about the

statistical structure of the evidence. These assump-

tions might or might not be warranted in practice. We

propose a general method for correct modeling of any

kind of statistical structure in the evidence, based on

Bayesian networks and probabilistic reasoning, and a

statistically grounded method for composing matcher

evidence using these Bayesian networks.

2 BAYESIAN NETWORKS FOR

COMBINING OUTPUTS OF

MULTIPLE SCHEMA

MATCHERS

When combining evidence from multiple sources, one

of the major problems and causes for errors is the im-

proper modeling of correlation and other forms of sta-

tistical dependence between variables in the problem

domain. For example, when two very similar match-

ers k and l are applied to an ASM problem, their out-

puts v

and v

will be highly correlated — when v

high, then v

will be high, too, and vice versa. For ex-

ample, a lexical matcher based on edit (Levenshtein)

distance would assign a medium-level similarity to

the pair of element names “Street” and “State”; simi-

larly, a lexical matcher based on the Jaccard distance

between the sets of letters in the two elements would

assign such similarity to the pair. For another pair of

elements, for example “Street” and “Address1”, both

lexical matchers would compute low similarity, be-

casue in this case similarity cannot be established on

the basis of string matching. In either case, not only

is the computed similarity misleading as regards to

the correct match, but both matchers provide the same

kind of evidence (both positive or both negative), so

its (in this case, harmful) inﬂuence is reinforced. If

a weighted sum of the two similarity values is used,

the same evidence will be counted twice, in practice,

which will result in a phenomenon known as over-

conﬁdence. One of the matchers is almost redundant,

and including it in the composition process might ac-

tually decrease the accuracy of matching. This effect

has been observed in other ﬁelds where evidence has

to be combined, such as medical diagnosis, and one

possible tool for handling it has been belief reason-

ing in Bayesian networks. Our method for combining

matcher output is based on such a network.

2.1 Representation

A Bayesian network (BN) is a probabilistic graphical

model that represents a set of random variables and

their conditional dependencies by means of a directed

acyclic graph (DAG). An edge in the DAG between

two nodes signiﬁes that the variable Y corresponding

to the child node is statistically conditionally depen-

dent on the variable X corresponding to the parent

node. This dependence is expressed in a conditional

probability table (CPT) stored in the child node for

Y . If X ∈ Par(Y ), where Par(Y ) is the set of par-

ent nodes of Y , this table contains probability entries

Pr(Y = y|Par(Y ) = z) for every possible combina-

tion of values x that X can take on and conﬁgurations

(sets of values) z that the variables in Par(X) can take

on. Likewise, when there is no direct edge between

two nodes, they are assumed to be conditionally inde-

pendent given their parents. In particular, when two

nodes have a common parent, but no edge between

them, they are assumed to be conditionally indepen-

dent given the value of their parent. The presence (or

absence) of edges in the DAG of a Bayesian network

is a way to express the statistical dependence (corre-

lation) between variables.

A Bayesian network to be used for combining out-

puts of individual matchers in an ASM task is shown

in Figure 1. Its DAG is a tree of depth four, with

some additional edges between some of the nodes.

The meaning of the nodes is as follows:

1. At the ﬁrst (top) level, the root node corresponds

to a Boolean variable signifying whether two

schema elements match. This is the ﬁnal hypoth-

esis that has to be evaluated.

2. The nodes at the second level of the trees repre-

sent independent ways in which the two element

BayesianNetworksforMatcherCompositioninAutomaticSchemaMatching

Overall Match

LM2

Lexical

Match

Structural

Matcher

Synonym

Matcher

Instance

Match

LM1

IM1

IM2

V V

Figure 1: A Bayesian network for combining the output of

multiple individual matchers.

names can match (lexical, structural, instance-

based, etc.). It is expected that these variables

are largely uncorrelated, because they use dif-

ferent information to test for possible matches.

They also each correspond to clusters of individ-

ual matchers whose output is correlated. In Fig-

ure 1, one cluster represents the hypothesis that

the two elements match lexically, and the other

cluster represents the hypothesis that the instances

(values) of the two elements in their respective

databases match.

3. The nodes at the third level of the tree are also

Boolean and represent the individual hypothesis

that the two elements match, according to a single

matcher. In Figure 1, these include two lexical

matchers LM1 and LM2, one structural matcher,

one synonym matcher, and two instance matchers

IM1 and IM2.

4. The leaves of the tree, at the fourth level, represent

the similarity values V

, k = 1,K of the individ-

ual matchers whose outputs have to be combined

(in this case, for the sake of illustration, K = 6).

These variables are continuous, and their possible

values are the real numbers v

The overall structure of the BN expresses the

understanding that when two elements match (or

don’t), the outputs of the structural matcher, syn-

onym matcher, the lexical match variable, and the

instance match variable will be statistically indepen-

dent. This is what is to be expected on a matching

task, because these matchers all use different infor-

mation from the two data schemas in order to com-

pute an estimate about whether the elements match.

However, the outputs of the two lexical matchers LM1

and LM2 would be correlated, as expected if they

use the same information (the names of the two el-

ements). That is why there exists an edge between

nodes LM1 and LM2. Similarly, the output of the

two instance matchers would be correlated, too, be-

cause they would both use the same information to

base their estimates on (namely, the contents of the

two corresponding database ﬁelds). Accordingly, an

edge between nodes IM1 and IM2 reﬂects this depen-

dency. This structure of the BN, then, corresponds to

our understanding of which matchers produce highly

correlated outputs, and which ones are statistically in-

dependent.

2.2 Parameter Estimation

In addition to the graph of the BN, if the network is

to be used for inference, the parameters in its CPTs

have to be speciﬁed, too. This can be done by means

of labeled cases, where pairs e

= (S

) of el-

ements S

and S

, l = 1, ..., N have been run

through all K matchers, to produce the corresponding

similarity values v

l,k

, l = 1, .. ., N, k = 1, ...,K, and

the correct labeling for some or all of the remaining

Boolean variables has been supplied, too.

If labels for all Boolean variables have been sup-

plied, then the estimation of the probabilities in

the CPTs of the Boolean nodes could be reduced

to frequency counting. That is, the entry Pr(Y =

y|Par(Y ) = z) is equal to the ratio of the number of

cases when Y had a speciﬁc value y (either True or

False) and the parents Par(Y ) of Y were in conﬁgura-

tion z, and the number of times the parents of Y were

in conﬁguration z (regardless of the value of Y ). For

the continuous nodes V

, a suitable parametric model

for the similarity values must be chosen. One possible

model is a normal (Gaussian) distribution with mean

µand variance σ

. Then, two separate normal distri-

butions N(µ

k,+

,σ

k,+

) and N(µ

k,−

,σ

k,−

) are estimated

for positive (matching) and negative (non-matching)

cases (pairs of elements), respectively. The mean µ

k,+

is the average of the similarity values v

k,i

of all data

cases where the parent node X

of V

has been labeled

with value True. The parameter σ

k,+

is the sampled

standard deviation of these cases. Analogously, the

parameters µ

k,−

and σ

k,−

are the sample mean and

standard deviation of v

k,i

over all cases when the par-

ent node X

has been labeled with the value False.

It is also possible to estimate the parameters in the

CPTs when only some of the nodes have been labeled.

A typical situation arises when a human designer has

provided feedback about whether the two elements

match (that is, has assigned a Boolean value to the

root node of the BN), but has not explained why they

match (that is, whether the match is lexical, instance-

based, structural, based on a dictionary, etc.) This sit-

uation is more challenging, but as long as the graph

of the network is known and ﬁxed, it is still possible

ICEIS2012-14thInternationalConferenceonEnterpriseInformationSystems

to estimate the most likely values of the parameters in

its CPT. This problem is known as parameter learn-

ing with partially observed data in Bayesian networks,

and can be solved by means of gradient ascent in the

likelihood function or the Expectation Maximization

algorithm, among other methods (Heckerman, 2001;

Thiesson, 1995).

Assuming there is a data set Σ of N independent

training cases, the log-likelihood scoring function is

log L(Θ|Σ) =

∑

i=1

∑

l=1

log P(X

|Pa(X

), θ

where Σ denotes the training data set, Pa(X

) denotes

the parents of the node X

, i = 1, ..., M, and Θ is the

parameter vector Θ =

{

, .. ., θ

}

However, we only have partial observations,

which means that there are several hidden nodes with

no labels. For each training case, one pair of ele-

ments S

and S

is run through all K individual

matchers to produce the corresponding similarity val-

ues v

i, j,k

, and a true label of two elements matching or

not for the root node OverallMatch is provided by the

human designer. With known structure and partial ob-

servation, we can use the EM (expectation maximiza-

tion) algorithm to ﬁnd a locally optimal maximum-

likelihood estimate of the parameters (Murphy, 2003).

After learning parameters from a training data set,

each discrete node has a conditional probability ta-

ble (CPT) specifying the probability of each state of

the node given each possible combination of parents’

states.

2.3 Inference

Given the individual similarity values V

= v

, k =

1,K that have been reported by all individual match-

ers, and a full Bayesian network with CPTs estimated

from data, we can evaluate the probability that the two

elements match on the basis of all evidence, by means

of a standard computational process known as belief

updating. One possible method to perform belief up-

dating is to construct the join tree of the Bayesian net-

work, and use if for inference. This can be done by

means of a number of commercial or freely available

reasoning engines. The continuous variables V

, un-

der the chosen Gaussian parametrization, can be in-

corporated into the process of belief updating in the

form of virtual (uncertain) evidence (Pan et al., 2006).

To supply virtual evidence to a belief updating engine,

all that is needed is the likelihood ratio of the observed

values v

for the similarity value variables V

L(V

= v

)

Pr(V

= v

= T )

Pr(V

= v

= F)

N(v

|µ

k,+

,σ

k,+

)

N(v

|µ

k,−

,σ

k,−

)

where N(v|µ,σ

) is the probability that measurement

v comes from normal distribution with mean µ and

variance σ

, and X

is the parent node of V

in the

BN.

After the process of belief updating concludes, all

Boolean nodes in the network will be assigned proba-

bility values according to the observed evidence (val-

ues) v

for the similarity value variables V

. The prob-

ability of the root node is the ﬁnal estimate that the

two elements match, given the combined evidence of

the individual matchers.

3 EXPERIMENTAL RESULTS

In order to evaluate the match accuracy of any

matcher described below, we used ﬁve XML schemas

for purchase orders, CIDX, Excel, Noris, Paragon and

Apertum, kindly provided to us by the University of

Leipzig. The ﬁgure of merit for evaluation of the ac-

curacy of matching was the popular f-measure, de-

ﬁned as the harmonic mean of precision and recall,

as used in the information retrieval community. If

the number of true matches identiﬁed by the match-

ing system as such (hits) is A, the number of true

matches not identiﬁed as such (misses) is B, and the

number of cases when two elements do not match,

but the matcher incorrectly declares a match (false

positives) is C, the f-measure F can be computed as

F = 2A/(2A + B +C).

We developed 13 basic schema matchers and eval-

uated the ability of the proposed Bayesian method to

combine their outputs so as to improve the accuracy of

matching. Of these, 11 were lexical matchers: Cosi-

neSimilarity, HammingDistance, JaroMeasure, Lev-

enshteinString, BigramDistance, TrigramDistance,

QuadgramDistance, PreﬁxName, SufﬁxName, Afﬁx-

Name, SubstringDistance. One matcher, PathName,

was structural, comparing the entire paths of the two

elements in their respective XML schemas. The last

basic matcher was neither lexical nor structural: the

Synonym matcher declared a match if and only if the

two tested elements were found in a list of synonyms

relevant to the domain of purchase orders. Based on

their method of operation, the similarity values com-

puted by the 11 lexical matchers can be expected to

be highly correlated and statistically dependent; in

contrast, the synonym matcher could be expected to

produce output that is largely independent of the lexi-

cal matchers. Experimental evaluation of their pair-

wise dependence conﬁrms this intuition: Figure 2

shows the pairwise correlation between all 13 pairs of

matchers, evaluated from all pairs of elements in all

ten pairs of schemas. Clearly, all 11 lexical matchers

BayesianNetworksforMatcherCompositioninAutomaticSchemaMatching

Figure 2: Pair-wise correlations between all pairs of basic

matchers, numbered as follows: 1: Edit Distance; 2: Sub-

string Distance; 3: Bi-Gram Distance; 4: Tri-Gram Dis-

tance; 5: Quad-Gram Distance; 6: Cosine Similarity; 7:

Hamming Distance; 8: Jaro Measure; 9: Afﬁx Name; 10:

Preﬁx Name; 11: Sufﬁx Name; 12: Path Name; 13: Syn-

onym.

Figure 3: Scatter plot of similarity values computed by the

Edit Distance (LevenshteinString) and SubstringDistance

matchers. Their output is clearly correlated, resulting in a

correlation coefﬁcient of 0.9892.

are highly correlated, whereas their correlation with

the Synonym matcher is minimal. Somewhat surpris-

ingly, the structural matcher, PathName, is the least

correlated with any other matcher.

The kind of major correlation that exists between

lexical matchers is illustrated in Figure 3 that shows

a scatter plot of the similarity values computed by

the LevenshteinString (edit distance) matcher and the

SubstringDistance matcher. Their high correlation

(0.9892) makes one of them almost redundant, if the

other one is present.

Regarding the experimental evaluation of match-

ing accuracy, as with any machine learning method,

care should be given to the training and testing eval-

uation protocol, that is, which data are used for train-

ing and which data are used for testing. We used three

evaluation protocols, as described below.

3.1 Testing on Training Data set

This is the simplest evaluation protocol, where we

use the same data set for testing and training. Its

purpose is to evaluate how well we can ﬁt the train-

ing data. Under this protocol, we deﬁne ten match-

ing tasks that correspond to all possible pairs of the

ﬁve XML schemas. For each matching task (pair of

schemas), we build a dedicated Bayesian composite

matcher that is speciﬁc for this task. The same data

set, then, is used as evidence to predict the belief for

every pair of elements. This is the most lenient eval-

uation protocol, since the learning algorithm has seen

during training the data that will be used for testing.

After a similarity matrix is computed for all pairs

of elements of two database schemas, an additional

global matching step called Max1/Delta is performed

to produce the ﬁnal match decisions, based on the

understanding that most often (but not always) map-

pings between database elements are one-to-one (Do

and Rahm, 2002). Since this procedure is sensitive

to the exact value of the Delta parameter, we present

below results as a function of that paramater. After

global match decisions have been obtained, they are

compared with the ground truth, and the f-measure for

this pair of schemas is computed. These f-measures

are averaged over all pairs of tasks in the testing data

set (in this case, ten pairs of tasks), in order to arrive

at the ﬁnal overall f-measure.

Figure 4 shows a comparison between all 13 ba-

sic matchers and the Bayesian Composite Matcher

(BCM). The accuracy of the BCM reaches 0.819 and

is signiﬁcantly higher than that of any other matcher.

It is also practically constant for a wide range of the

parameter Delta. The performance of Path Name

matcher is better than other individual matchers, be-

cause it is a hybrid matcher combining two basic

match techniques.

3.2 Leave-One-Out Cross Validation

(LOOCV)

A more realistic testing protocol is under the leave-

one-out cross validation (LOOCV) method, where

training and testing data are clearly separated. Each

of the ten pairs of schemas is used for testing, using

a BCM that was learned using the other nine pairs of

schemas. The results are averaged over the ten pairs,

as follows:

1. Build training and testing data sets for 10 test

tasks. For instance, if the similarity matrix of

ICEIS2012-14thInternationalConferenceonEnterpriseInformationSystems

Figure 4: Comparison of average f-measure between the

Bayesian Composite Matcher and all other matchers.

Excel ↔ Noris is used as testing set, the training

data set for this test task is a collection of similar-

ity matrices of the remaining 9 schema pairs.

2. Learn one Bayesian composite matcher for each

task based on its training data.

3. Implement Max1/Delta selection approach on

the composite similarity matrix generated by each

Bayesian Composite Matcher.

3.3 Exclusive Leave-One-Out Cross

Validation (ExclLOOCV)

The second protocol described above still allowed

the training algorithm to see data from the pair of

schemas that would be used for testing, but not the

ground truth for their direct match. To eliminate any

exposure of the training algorithm to data that would

be used for testing, we modiﬁed the LOOCV pro-

cedure as follows. For each task, if the test pair is

A ↔ B, the training examples only come from the

three remaining schemas not involving either A nor B.

For example, if one test set is Excel ↔ Noris, it will

be tested with the Bayesian composite matcher that

has used only the following three pairs of schemas for

training: CIDX ↔ Apertum, CIDX ↔ Paragon, and

Apertum ↔ Paragon. This is the maximally realistic

testing protocol.

Figure 5 shows a comparison between the two

variants of the LOOCV evaluation protocol for the

Bayesian Composite Matcher. It can be seen that the

accuracy drops to 0.76 under usual LOOCV and 0.73

under exclusive LOOCV.

4 RELATED WORK

As mentioned in the ﬁrst section, many methods for

creating composite matchers have been tried, and this

section explains the difference between them and the

proposed approach. One major distinction between

Figure 5: Comparison of Bayesian composite matcher per-

formance under LOOCV and exclusive LOOCV testing

protocols.

these methods is whether they rely on manual tun-

ing of the composition structure and parameters, or

such parameters are estimated from a training set and

veriﬁed on an independent test set. The composi-

tion methods developed in the COMA (Do and Rahm,

2002; Do and Rahm, 2007) and GLUE (Doan et al.,

2003) systems are based on manual tuning of the

composition parameters, so comparison with learning

methods for tuning parameters is not entirely correct;

a composite matcher that is manually tuned with a

speciﬁc set of schemas in mind can certainly be ex-

pected to be more accurate than a learning matcher

that is tested under a cross-validation protocol.

Among the learning methods for composing

matchers, our approach is most similar to the one pro-

posed by Marie and Gal (Marie and Gal, 2007), who

have approached the problem from a Bayesian net-

work perspective, too, arguing that a disciplined ap-

proach to handling match uncertainty has to be ap-

plied. However, their approach is based on Naive

Bayes networks, that is, two-level Bayesian networks

with one root node that corresponds to the matching

event, and many leaf nodes that are directly children

to the root node. It can be shown that such a Naive

Bayes network has the same classiﬁcation properties

as a logistic regression model, and the decision sur-

face is linear, similar to the one used in the LSD and

GLUE systems (Doan et al., 2003; Doan et al., 2003).

In contrast, a full (non-naive) Bayesian network like

the one proposed in this paper can model arbitrary

correlations and decision surfaces.

Furthermore, the Bayesian network proposed in

this paper is also different from the Bayesian net-

work classiﬁers used in the YAM system (Duchateau

et al., 2009) in that our network includes unobserv-

able nodes corresponding to types of matchers; in

contrast, YAM employs the BayesNet classiﬁer from

the WEKA library that can learn the structure of a

fully observable network by adding and removing

edges, but cannot add unobservable nodes (Witten and

BayesianNetworksforMatcherCompositioninAutomaticSchemaMatching

Frank, 2005). Unobservable nodes corresponding to a

type of matcher (e.g. lexical, dictionary-based, struc-

tural, etc.) present a natural way of representing the

conditional dependency between multiple matchers of

the same type, because they restrict the edges of the

graph only to the nodes of the same type. In contrast,

a fully-connected BN without hidden nodes would re-

quire an exponential number of CPT parameters to be

estimated, which would make it practically impossi-

ble to collect the data necessary for estimating them.

This problem is further compounded by the continu-

ous values of the similarity values produced by basic

matchers — in fact, it is not immediately clear how

YAM would have been able to learn a fully connected

BN with 13 continuous nodes representing the simi-

larity values of each basic matcher, from the few thou-

sand examples available from the PO dataset under

the two LOOCV protocols.

On the other hand, non-linear classiﬁers such as

decision trees (Duchateau et al., 2008) can indeed

represent non-linear decision surfaces from a lim-

ited number of training examples, but are not inher-

ently probabilistic, and the binary decisions output

by them are not easy to use in the global assign-

ment process that determines the entire mapping be-

tween two schemas from the pair-wise matches be-

tween their individual elements. Other probabilistic

approaches to the automatic schema matching prob-

lem include the use of an attribute dictionary in the

AUTOMATCH system, where training examples of

matching schemas are used to compile the dictionary,

and candidate elements from new schemas are com-

pared probabilistically to the dictionary. Although

this approach does result in probabilistic estimates of

matches, the compilation of the dictionary requires

many training examples, and is best suited to do-

mains where many pairs of entire schemas have to be

matched repeatedly.

5 CONCLUSIONS AND FUTURE

WORK

We have proposed a novel method for creating com-

posite matchers for the purpose of automatic schema

matching. Its main advantage is the explicit model-

ing of the conditional statistical dependence between

the similarity values computed by individual basic

matchers. Experiments suggest that it combines suc-

cessfully the outputs of such matchers, and achieves

matching accuracy signiﬁcantly exceeding that of the

individual matchers. Furthermore, its outputs are es-

timates of the genuine probabilities of match, which

allows the application of decision-theoretic methods

for optimal judgment whether elements match, or not.

Further work will focus on leveraging the clear se-

mantics of the computed probabilities for improving

the accuracy of the global matching algorithm, as well

as on improving the computational properties of the

proposed Bayesian method.

REFERENCES

E. Rahm, P. A. Bernstein, A Survey of Approaches to Auto-

matic Schema Matching, VLDB Journal, 10:4 2001.

H. H. Do, E. Rahm, COMA - A System for Flexible Com-

bination of Schema Matching Approaches, in Pro-

ceedings of the 28th International Conference on Very

Large Data Bases (VLDB), 2002.

W. Li, C. Clifton, A Tool for Identifying Attribute Corre-

spondences in Heterogeneous Databases Using Neu-

ral Network, Journal of Data and Knowledge Engi-

neering 33: 1, 49-84, 2000.

A. Doan, P. Domingos, and A. Halevy., Learning to Match

the Schemas of Databases: A Multistrategy Approach,

Machine Learning Journal, no. 50, pp. 279–301, 2003.

S. Bergamaschi, S. Castano, M. Vincini, D. Beneventano,

Semantic Integration of Heterogeneous Information

Sources, Journal of Data and Knowledge Engineering

36: 3, 215-249, 2001.

H. H. Do, E. Rahm, Matching Large Schemas: Approaches

and Evaluation, Journal of Information Systems, Vol.

32, Issue 6, Sep. 2007.

A. H. Doan, P. Domingos, A. Halevy, Reconciling Schemas

of Disparate Data Sources: A Machine Learning Ap-

proach, SIGMOD 2001.

D. W. Embley, Multifaceted Exploitation of Metadata for

Attribute Match Discovery in Information Integration.

WIIW 2001.

D. Heckerman, A Tutorial on Learning Bayesian Networks,

Journal of Learning in Graphical Models, pp. 301-

354, 2001.

K. Murphy, An Introduction to Machine Learning and

Graphical Models, the Intel Workshop on Machine

Learning, Sep. 2003.

J. Tang, J. Z. Li, Using Bayesian Decision for Ontology

Mapping, Journal of Web Semantics, Vol. 4, Issue 4,

Dec. 2006.

Thiesson, B., Accelerated Quantiﬁcation of Bayesian Net-

works with Incomplete Data, Proceedings of the Con-

ference on Knowledge Discovery in Data, 1995, pp.

306-311.

Rong Pan, Yun Peng, Zhongli Ding, Belief Update in

Bayesian Networks Using Uncertain Evidence, 18th

IEEE International Conference on Tools with Artiﬁ-

cial Intelligence (ICTAI’06), 2006, pp.441-444.

A. Marie and A. Gal. Managing Uncertainty in Schema

Matcher Ensembles. Proceedings of the 1st Interna-

tional Conference on Scalable Uncertainty Manage-

ment. Washington, DC, October 2007, pp. 60-73.

A. H. Doan, J. Madhavan, R. Dhamankar, P. Domingos, A.

Halevy, Learning to Match Ontologies on the Seman-

ICEIS2012-14thInternationalConferenceonEnterpriseInformationSystems

tic Web, The VLDB Journal 12 (4), 2003, pp. 303-

319.

F. Duchateau, Z. Bellahsene and R. Coletta, A Flexible Ap-

proach for Planning Schema Matching Algorithms,

OTM Conferences (CooPIS), 2008, pp. 249-264.

F. Duchateau, R. Coletta, Z. Bellahsene, R. J. Miller,

Not Yet Another Matcher, Proceedings of CIKM’09,

Hong-Kong, China, November 2009, pp. 2079-2080.

Ian H. Witten, Eibe Frank, Data Mining: Practical Machine

Learning Tools and Techniques, Second Edition, Mor-

gan Kaufmann, 2005.

Berlin, J., A. Motro: Database Schema Matching Us-

ing Machine Learning with Feature Selection. CAiSE

2002, pp.452-466.

BayesianNetworksforMatcherCompositioninAutomaticSchemaMatching