Task Clustering on ETL Systems - A Pattern-Oriented Approach

Bruno Oliveira, Orlando Belo

Abstract

Usually, data warehouse populating processes are data-oriented workflows composed of dozens of granular tasks that are responsible for integrating data coming from different data sources. Specific subsets of these tasks can be grouped into a collection, together with their relationships, to form higher-level constructs. Increasing task granularity allows for the generalization of processes, simplifying their views and providing a means to carry expertise over to new applications. Well-proven practices can be used to describe general solutions that use basic skeletons configured and instantiated according to a set of specific integration requirements. Patterns can be applied to ETL processes, aiming not only to simplify a possible conceptual representation but also to reduce the gap that often exists between two design perspectives. In this paper, we demonstrate the feasibility and effectiveness of an ETL pattern-based approach using task clustering, analyzing a real-world ETL scenario through the definition of two commonly used clusters of tasks: a data lookup cluster and a data conciliation and integration cluster.



Paper Citation


in Harvard Style

Oliveira B. and Belo O. (2015). Task Clustering on ETL Systems - A Pattern-Oriented Approach. In Proceedings of 4th International Conference on Data Management Technologies and Applications - Volume 1: DATA, ISBN 978-989-758-103-8, pages 207-214. DOI: 10.5220/0005559302070214


in Bibtex Style

@conference{data15,
author={Bruno Oliveira and Orlando Belo},
title={Task Clustering on ETL Systems - A Pattern-Oriented Approach},
booktitle={Proceedings of 4th International Conference on Data Management Technologies and Applications - Volume 1: DATA},
year={2015},
pages={207-214},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0005559302070214},
isbn={978-989-758-103-8},
}


in EndNote Style

TY - CONF
JO - Proceedings of 4th International Conference on Data Management Technologies and Applications - Volume 1: DATA
TI - Task Clustering on ETL Systems - A Pattern-Oriented Approach
SN - 978-989-758-103-8
AU - Oliveira B.
AU - Belo O.
PY - 2015
SP - 207
EP - 214
DO - 10.5220/0005559302070214