Multi-domain Schema Clustering and Hierarchical Mediated Schema Generation

Qizhen Huang, Chaoliang Zhong, Jun Zhang

Abstract

Data integration lets users access multiple data sources through a uniform interface. Querying is still difficult, however, when the sources span many domains: even after the sources are clustered into domains, users must write a separate query clause for each domain. Previous research has presented various data integration techniques, but nearly all of them either require the schemas to be integrated to belong to the same domain or fail to address the case in which several domains are sub-domains of a higher-level domain, where one more abstract query clause on the upper domain can replace several less abstract clauses on the lower domains. In this paper, we propose a graph-based approach for clustering schemas that ultimately exposes a hierarchical mediated schema forest to users, together with a query forwarding mechanism that transforms queries downward along the schema forest. Experimental results demonstrate that our schema clustering algorithm is effective in clustering data sources into hierarchical schemas, that queries on the mediated schemas return answers with good accuracy, and that the cost of writing query clauses is reduced without loss of query accuracy.
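The abstract does not specify the clustering algorithm, so the following is only an illustrative sketch of the general idea and not the authors' method. It assumes that each schema is a set of attribute names, that similarity is plain Jaccard overlap, and that clusters are the connected components of a thresholded similarity graph. The schema names, the thresholds, and the helpers `jaccard`, `cluster_schemas`, and `forward_query` are all hypothetical. Running the clustering at a strict and then a loose threshold gives two levels of a hierarchy, and a query on an upper-level cluster is forwarded to each member source.

```python
from itertools import combinations

def jaccard(a, b):
    """Jaccard similarity of two attribute-name sets (assumed measure)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def cluster_schemas(schemas, threshold):
    """Return connected components of the graph that links two schemas
    whenever their Jaccard similarity is at least `threshold`."""
    adj = {name: set() for name in schemas}
    for u, v in combinations(schemas, 2):
        if jaccard(schemas[u], schemas[v]) >= threshold:
            adj[u].add(v)
            adj[v].add(u)
    seen, clusters = set(), []
    for start in schemas:
        if start in seen:
            continue
        component, stack = set(), [start]
        while stack:  # depth-first traversal of one component
            node = stack.pop()
            if node not in component:
                component.add(node)
                stack.extend(adj[node] - component)
        seen |= component
        clusters.append(component)
    return clusters

def forward_query(attrs, cluster, schemas):
    """Forward a query over `attrs` to every source in `cluster`,
    keeping only the attributes that each source actually has."""
    forwarded = {}
    for name in sorted(cluster):
        kept = sorted(attrs & schemas[name])
        if kept:
            forwarded[name] = kept
    return forwarded

# Hypothetical sources: two book schemas, two movie schemas, one car schema.
schemas = {
    "books_a": {"title", "author", "price", "isbn"},
    "books_b": {"title", "author", "price", "publisher"},
    "movies_a": {"title", "director", "price", "year"},
    "movies_b": {"title", "director", "year", "rating"},
    "cars_a": {"make", "model", "mileage"},
}

sub_domains = cluster_schemas(schemas, 0.5)    # strict: books, movies, cars
upper_domains = cluster_schemas(schemas, 0.3)  # loose: books and movies merge
print(sub_domains)
print(upper_domains)
print(forward_query({"title", "price"}, upper_domains[0], schemas))
```

In this toy data the strict threshold separates books from movies, while the loose threshold merges them into one upper domain, so a single query on `title` and `price` reaches all four sources instead of being written once per sub-domain.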

Paper Citation


in Harvard Style

Huang Q., Zhong C. and Zhang J. (2014). Multi-domain Schema Clustering and Hierarchical Mediated Schema Generation. In Proceedings of the 16th International Conference on Enterprise Information Systems - Volume 1: ICEIS, ISBN 978-989-758-027-7, pages 111-118. DOI: 10.5220/0004825601110118


in Bibtex Style

@conference{iceis14,
author={Qizhen Huang and Chaoliang Zhong and Jun Zhang},
title={Multi-domain Schema Clustering and Hierarchical Mediated Schema Generation},
booktitle={Proceedings of the 16th International Conference on Enterprise Information Systems - Volume 1: ICEIS},
year={2014},
pages={111-118},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0004825601110118},
isbn={978-989-758-027-7},
}


in EndNote Style

TY - CONF
JO - Proceedings of the 16th International Conference on Enterprise Information Systems - Volume 1: ICEIS
TI - Multi-domain Schema Clustering and Hierarchical Mediated Schema Generation
SN - 978-989-758-027-7
AU - Huang Q.
AU - Zhong C.
AU - Zhang J.
PY - 2014
SP - 111
EP - 118
DO - 10.5220/0004825601110118