A FRAMEWORK FOR DATA CLEANING IN DATA WAREHOUSES

Taoxin Peng

Abstract

Achieving high data quality in data warehouses is a persistent challenge, and data cleaning is a crucial task in meeting it. A range of methods and tools has been developed for this purpose. However, at least two questions remain to be answered: how can the efficiency of data cleaning be improved, and how can its degree of automation be increased? This paper addresses these two questions by presenting a novel framework for managing data cleaning in data warehouses, which focuses on the use of data quality dimensions and decouples the cleaning process into several sub-processes. An initial test run of the processes in the framework demonstrates that the approach is efficient and scalable for data cleaning in data warehouses.

References

  1. Atre, S., 1998. Rules for data cleansing. Computerworld.
  2. Galhardas, H., Florescu, D., Shasha, D., 2001. Declarative Data Cleaning: Language, Model, and Algorithms. In Proceedings of the 27th International Conference on Very Large Databases (VLDB), Roma, Italy.
  3. Halevy, A., Rajaraman, A., Ordille, J., 2006. Data Integration: The Teenage Years. In the 32nd International Conference on Very Large Databases. Seoul, Korea.
  4. Hipp, J., Guntzer, U., Grimmer, U., 2001. Data Quality Mining. In the 6th ACM SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery.
  5. Jarke, M., Jeusfeld, M. A., Quix, C., Vassiliadis, P., 1999. Architecture and Quality in Data Warehouses: An Extended Repository Approach. Information Systems, 24(3).
  6. Luebbers, D., Grimmer, U., Jarke, M., 2003. Systematic Development of Data Mining-Based Data Quality Tools. In the 29th International Conference on Very Large Databases, Berlin, Germany.
  7. Liu, H., Shah, S., Jiang, W., 2004. On-line Outlier Detection and Data Cleaning. Computers and Chemical Engineering 28.
  8. Mecella, M., Scannapieco, M., Virgillito, A., Baldoni, R., Catarci, T., Batini, C., 2003. The DaQuinCIS Broker: Querying Data and Their Quality in Cooperative Information Systems. Journal of Data Semantics, Vol. 1, LNCS 2800.
  9. Muller, H., Freytag, J. C., 2003. Problems, Methods, and Challenges in Comprehensive Data Cleansing. Technical Report HUB-IB-164.
  10. Rahm, E., Do, H., 2000. Data Cleaning: Problems and Current Approaches. IEEE Bulletin of the Technical Committee on Data Engineering, Vol 23, No. 4.
  11. Raman, V., Hellerstein, J., 2001. Potter's Wheel: An Interactive Data Cleaning System. In the 27th International Conference on Very Large Databases. Roma, Italy.
  12. Sung, S., Li, Z., Sun, P., 2002. A Fast Filtering Scheme for Large Database Cleaning. In the 11th International Conference on Information and Knowledge Management, Virginia, USA.
  13. Winkler, W., 2003. Data Cleaning Methods. In the Conference SIGKDD, Washington DC, USA.
  14. Wang, Y., Storey, V., Firth, C., 1995. A Framework for Analysis of Data Quality Research. IEEE Transactions on Knowledge and Data Engineering, Vol. 7, No. 4.


Paper Citation


in Harvard Style

Peng T. (2008). A FRAMEWORK FOR DATA CLEANING IN DATA WAREHOUSES. In Proceedings of the Tenth International Conference on Enterprise Information Systems - Volume 1: ICEIS, ISBN 978-989-8111-36-4, pages 473-478. DOI: 10.5220/0001706004730478


in Bibtex Style

@conference{iceis08,
author={Taoxin Peng},
title={A FRAMEWORK FOR DATA CLEANING IN DATA WAREHOUSES},
booktitle={Proceedings of the Tenth International Conference on Enterprise Information Systems - Volume 1: ICEIS},
year={2008},
pages={473-478},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0001706004730478},
isbn={978-989-8111-36-4},
}


in EndNote Style

TY - CONF
JO - Proceedings of the Tenth International Conference on Enterprise Information Systems - Volume 1: ICEIS
TI - A FRAMEWORK FOR DATA CLEANING IN DATA WAREHOUSES
SN - 978-989-8111-36-4
AU - Peng T.
PY - 2008
SP - 473
EP - 478
DO - 10.5220/0001706004730478