Frank, 2005). Unobservable nodes corresponding to a
type of matcher (e.g. lexical, dictionary-based, struc-
tural, etc.) present a natural way of representing the
conditional dependency between multiple matchers of
the same type, because they restrict the edges of the
graph only to the nodes of the same type. In contrast,
a fully-connected BN without hidden nodes would re-
quire an exponential number of CPT parameters to be
estimated, which would make it practically impossi-
ble to collect the data necessary for estimating them.
This problem is further compounded by the continuous
nature of the similarity values produced by basic
matchers; in fact, it is not immediately clear how
YAM would have been able to learn a fully connected
BN with 13 continuous nodes representing the simi-
larity values of each basic matcher, from the few thou-
sand examples available from the PO dataset under
the two LOOCV protocols.
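To make the parameter-count argument concrete, the following minimal sketch compares the number of free CPT parameters under two assumed structures. It is an illustration only, not the network used in the experiments: it assumes that each similarity value is discretized into a small number of bins, that the match node and the hidden type nodes are binary, and that the 13 basic matchers fall into three types of sizes 5, 4 and 4 (a hypothetical grouping). The function names are likewise hypothetical.

```python
def full_bn_params(n_matchers: int, bins: int) -> int:
    """Free CPT parameters of a fully connected discrete BN (assumed structure).

    Node order: a binary match node, then the matcher nodes; every node has
    all earlier nodes as its parents.
    """
    total = 1  # P(match) for a binary match node
    for i in range(n_matchers):
        # Parents: the match node (2 states) and the i earlier matcher nodes.
        parent_configs = 2 * bins ** i
        total += parent_configs * (bins - 1)
    return total


def hidden_bn_params(group_sizes, bins: int, hidden_states: int = 2) -> int:
    """Free CPT parameters when each matcher depends only on a hidden type node.

    Assumed structure: match node -> one hidden node per matcher type ->
    the matcher nodes of that type.
    """
    total = 1  # P(match) for a binary match node
    for size in group_sizes:
        total += 2 * (hidden_states - 1)            # P(hidden type | match)
        total += size * hidden_states * (bins - 1)  # P(matcher | hidden type)
    return total


if __name__ == "__main__":
    bins = 3             # assumed discretization of each similarity value
    groups = [5, 4, 4]   # hypothetical split of 13 matchers into 3 types
    print("fully connected:", full_bn_params(sum(groups), bins))
    print("with hidden type nodes:", hidden_bn_params(groups, bins))
```

Under these assumptions, the fully connected network needs about 3.2 million free parameters with only three bins per matcher, whereas the structure with hidden type nodes needs 59. With continuous rather than discretized values the comparison changes in form, but the gap in model complexity remains.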
On the other hand, non-linear classifiers such as
decision trees (Duchateau et al., 2008) can indeed
represent non-linear decision surfaces from a limited
number of training examples, but they are not inherently
probabilistic, and the binary decisions they output
are not easy to use in the global assignment
process that determines the entire mapping between
two schemas from the pair-wise matches between
their individual elements. Other probabilistic
approaches to the automatic schema matching prob-
lem include the use of an attribute dictionary in the
AUTOMATCH system, where training examples of
matching schemas are used to compile the dictionary,
and candidate elements from new schemas are com-
pared probabilistically to the dictionary. Although
this approach does result in probabilistic estimates of
matches, the compilation of the dictionary requires
many training examples, so the approach is best suited to
domains where many pairs of entire schemas have to be
matched repeatedly.
5 CONCLUSIONS AND FUTURE WORK
We have proposed a novel method for creating com-
posite matchers for the purpose of automatic schema
matching. Its main advantage is the explicit model-
ing of the conditional statistical dependence between
the similarity values computed by individual basic
matchers. Experiments suggest that it successfully
combines the outputs of such matchers and achieves
matching accuracy significantly exceeding that of the
individual matchers. Furthermore, its outputs are
estimates of the genuine probabilities of match, which
allows the application of decision-theoretic methods
for deciding optimally whether or not elements match.
Further work will focus on leveraging the clear se-
mantics of the computed probabilities for improving
the accuracy of the global matching algorithm, as well
as on improving the computational properties of the
proposed Bayesian method.
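One way to read the decision-theoretic point above is as a minimum-expected-cost rule. The sketch below assumes that a false match and a missed match carry fixed costs; the parameters cost_false_match and cost_missed_match are hypothetical and do not come from the experiments. A pair of elements is declared a match when accepting it has a lower expected cost than rejecting it, which is equivalent to thresholding the match probability at cost_false_match / (cost_false_match + cost_missed_match).

```python
def decide_match(p_match: float,
                 cost_false_match: float = 1.0,
                 cost_missed_match: float = 1.0) -> bool:
    """Return True if declaring a match minimizes expected cost.

    p_match is the estimated probability that two elements match.
    Both cost parameters are illustrative assumptions.
    """
    if not 0.0 <= p_match <= 1.0:
        raise ValueError("p_match must lie in [0, 1]")
    # Accepting is wrong with probability (1 - p_match).
    expected_cost_accept = (1.0 - p_match) * cost_false_match
    # Rejecting is wrong with probability p_match.
    expected_cost_reject = p_match * cost_missed_match
    return expected_cost_accept < expected_cost_reject


if __name__ == "__main__":
    # Equal costs: the rule reduces to p_match > 0.5.
    print(decide_match(0.6))
    # False matches three times as costly: the threshold rises to 0.75.
    print(decide_match(0.6, cost_false_match=3.0, cost_missed_match=1.0))
```

Such a rule is meaningful only if the outputs are calibrated probabilities; a score without probabilistic semantics would require the threshold to be tuned empirically instead.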
REFERENCES
E. Rahm, P. A. Bernstein, A Survey of Approaches to
Automatic Schema Matching, VLDB Journal, 10:4, 2001.
H. H. Do, E. Rahm, COMA - A System for Flexible Com-
bination of Schema Matching Approaches, in Pro-
ceedings of the 28th International Conference on Very
Large Data Bases (VLDB), 2002.
W. Li, C. Clifton, A Tool for Identifying Attribute Corre-
spondences in Heterogeneous Databases Using Neu-
ral Network, Journal of Data and Knowledge Engi-
neering 33: 1, 49-84, 2000.
A. Doan, P. Domingos, A. Halevy, Learning to Match
the Schemas of Databases: A Multistrategy Approach,
Machine Learning Journal, no. 50, pp. 279–301, 2003.
S. Bergamaschi, S. Castano, M. Vincini, D. Beneventano,
Semantic Integration of Heterogeneous Information
Sources, Journal of Data and Knowledge Engineering
36: 3, 215-249, 2001.
H. H. Do, E. Rahm, Matching Large Schemas: Approaches
and Evaluation, Journal of Information Systems, Vol.
32, Issue 6, Sep. 2007.
A. H. Doan, P. Domingos, A. Halevy, Reconciling Schemas
of Disparate Data Sources: A Machine Learning Ap-
proach, SIGMOD 2001.
D. W. Embley, Multifaceted Exploitation of Metadata for
Attribute Match Discovery in Information Integration,
WIIW 2001.
D. Heckerman, A Tutorial on Learning Bayesian Networks,
Journal of Learning in Graphical Models, pp. 301-
354, 2001.
K. Murphy, An Introduction to Machine Learning and
Graphical Models, the Intel Workshop on Machine
Learning, Sep. 2003.
J. Tang, J. Z. Li, Using Bayesian Decision for Ontology
Mapping, Journal of Web Semantics, Vol. 4, Issue 4,
Dec. 2006.
B. Thiesson, Accelerated Quantification of Bayesian
Networks with Incomplete Data, Proceedings of the
Conference on Knowledge Discovery in Data, 1995,
pp. 306-311.
R. Pan, Y. Peng, Z. Ding, Belief Update in
Bayesian Networks Using Uncertain Evidence, 18th
IEEE International Conference on Tools with Artificial
Intelligence (ICTAI’06), 2006, pp. 441-444.
A. Marie, A. Gal, Managing Uncertainty in Schema
Matcher Ensembles, Proceedings of the 1st International
Conference on Scalable Uncertainty Management,
Washington, DC, October 2007, pp. 60-73.
A. H. Doan, J. Madhavan, R. Dhamankar, P. Domingos, A.
Halevy, Learning to Match Ontologies on the Seman-
ICEIS 2012 - 14th International Conference on Enterprise Information Systems