HAMAKE: A DATA FLOW APPROACH TO DATA PROCESSING IN HADOOP

Vadim Zaliva; Vladimir Orlov

doi:10.5220/0003893804570461

2012

Abstract

Most non-trivial data processing scenarios in Hadoop involve launching more than one MapReduce job. Usually, such processing is data-driven, with the data funneled through a sequence of jobs. This processing model can be expressed in terms of dataflow programming and represented as a directed graph with datasets as vertices. Using fuzzy timestamps to detect which datasets need to be updated, we can calculate the sequence in which Hadoop jobs should be launched to bring all datasets up to date. Incremental data processing and parallel job execution fit well into this approach. These ideas inspired the creation of the hamake utility. We attempted to emphasize data, allowing the developer to formulate the problem as a data flow, in contrast to the commonly used workflow approach. The hamake language uses just two data flow operators, fold and foreach, providing a clear processing model similar to MapReduce, but at the dataset level.
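
The scheduling idea in the abstract can be illustrated with a minimal sketch. This is not hamake's implementation or syntax: the names `Task`, `is_stale`, `plan`, and the `slack` parameter are hypothetical, and treating "fuzzy" timestamps as a simple tolerance on modification times is an assumption made only for illustration. Datasets are graph vertices, each task connects its input datasets to its output datasets, and a topological traversal (Kahn's algorithm) yields an order in which only out-of-date tasks are run.

```python
from collections import deque
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    """Hypothetical job that reads `inputs` and writes `outputs` (dataset names)."""
    name: str
    inputs: tuple
    outputs: tuple


def is_stale(task, mtime, slack=0.0):
    """Return True if an output is missing or older than the newest input.

    `slack` is an assumed tolerance in seconds, standing in for a fuzzy
    timestamp comparison (for example, to absorb clock skew).
    """
    newest_input = max((mtime.get(d, 0.0) for d in task.inputs), default=0.0)
    for out in task.outputs:
        if out not in mtime or mtime[out] + slack < newest_input:
            return True
    return False


def plan(tasks, mtime, slack=0.0):
    """Return the names of tasks to run, in a valid dependency order."""
    producer = {out: t for t in tasks for out in t.outputs}
    downstream = {t.name: [] for t in tasks}
    indegree = {t.name: 0 for t in tasks}
    # Derive task-to-task edges from the datasets they share.
    for t in tasks:
        for d in t.inputs:
            if d in producer:
                downstream[producer[d].name].append(t)
                indegree[t.name] += 1

    ready = deque(t for t in tasks if indegree[t.name] == 0)
    dirty = set()   # datasets that a scheduled task will rewrite
    order = []
    visited = 0
    while ready:
        t = ready.popleft()
        visited += 1
        # Run a task if its own timestamps are stale, or if any of its
        # inputs is about to be regenerated by an upstream task.
        if is_stale(t, mtime, slack) or any(d in dirty for d in t.inputs):
            order.append(t.name)
            dirty.update(t.outputs)
        for nxt in downstream[t.name]:
            indegree[nxt.name] -= 1
            if indegree[nxt.name] == 0:
                ready.append(nxt)
    if visited != len(tasks):
        raise ValueError("dependency graph contains a cycle")
    return order


# Illustrative dataset names and timestamps (made up for this example).
tasks = [
    Task("extract", ("raw",), ("clean",)),
    Task("index", ("clean",), ("index",)),
    Task("stats", ("clean",), ("stats",)),
]
stale_run = plan(tasks, {"raw": 100, "clean": 90, "index": 95, "stats": 95})
fresh_run = plan(tasks, {"raw": 100, "clean": 110, "index": 120, "stats": 120})
print(stale_run)  # ['extract', 'index', 'stats']
print(fresh_run)  # []
```

In this sketch, "index" and "stats" do not depend on each other, so a scheduler could launch them in parallel once "extract" finishes; when every output is newer than its inputs, nothing is scheduled, which is the incremental behavior the abstract describes.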

Paper Citation

in Harvard Style

Zaliva V. and Orlov V. (2012). HAMAKE: A DATA FLOW APPROACH TO DATA PROCESSING IN HADOOP. In Proceedings of the 2nd International Conference on Cloud Computing and Services Science - Volume 1: CLOSER, ISBN 978-989-8565-05-1, pages 457-461. DOI: 10.5220/0003893804570461

in Bibtex Style

@conference{closer12,
author={Vadim Zaliva and Vladimir Orlov},
title={HAMAKE: A DATA FLOW APPROACH TO DATA PROCESSING IN HADOOP},
booktitle={Proceedings of the 2nd International Conference on Cloud Computing and Services Science - Volume 1: CLOSER},
year={2012},
pages={457-461},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0003893804570461},
isbn={978-989-8565-05-1},
}

in EndNote Style

TY - CONF
JO - Proceedings of the 2nd International Conference on Cloud Computing and Services Science - Volume 1: CLOSER
TI - HAMAKE: A DATA FLOW APPROACH TO DATA PROCESSING IN HADOOP
SN - 978-989-8565-05-1
AU - Zaliva V.
AU - Orlov V.
PY - 2012
SP - 457
EP - 461
DO - 10.5220/0003893804570461