HAMAKE: A DATA FLOW APPROACH TO DATA PROCESSING IN HADOOP
Vadim Zaliva, Vladimir Orlov
2012
Abstract
Most non-trivial data processing scenarios using Hadoop involve launching more than one MapReduce job. Usually, such processing is data-driven, with the data funneled through a sequence of jobs. The processing model can be expressed in terms of dataflow programming, represented as a directed graph with datasets as vertices. Using fuzzy timestamps to detect which datasets need to be updated, we can calculate the sequence in which Hadoop jobs should be launched to bring all datasets up to date. Incremental data processing and parallel job execution fit well into this approach. These ideas inspired the creation of the hamake utility. We attempted to emphasize data, allowing the developer to formulate the problem as a data flow, in contrast to the commonly used workflow approach. The hamake language uses just two data flow operators, fold and foreach, providing a clear processing model similar to MapReduce, but at the dataset level.
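The scheduling idea in the abstract can be illustrated with a minimal Python sketch. It is not hamake's implementation. It assumes that jobs connect input datasets to output datasets, that a topological sort (Kahn's algorithm here) gives a valid launch order, and that "fuzzy" timestamps can be approximated by a tolerance window in seconds. The function name `jobs_to_run`, the `tolerance` parameter, and the dataset and job names are all hypothetical.

```python
from collections import deque


def jobs_to_run(jobs, mtime, tolerance=0.0):
    """Return the jobs to launch, in dependency order, to refresh stale datasets.

    jobs:      dict mapping job name -> (input datasets, output datasets)
    mtime:     dict mapping dataset -> modification time; a missing key
               means the dataset does not exist yet
    tolerance: slack in seconds when comparing timestamps (an assumed
               reading of "fuzzy timestamps", not taken from the paper)
    """
    # Which job produces each dataset.
    producer = {out: name for name, (_, outs) in jobs.items() for out in outs}

    # Job B depends on job A if A produces one of B's inputs.
    deps = {
        name: {producer[i] for i in ins if i in producer}
        for name, (ins, _) in jobs.items()
    }
    dependents = {name: [] for name in jobs}
    for name, parents in deps.items():
        for parent in parents:
            dependents[parent].append(name)

    # Kahn's algorithm: repeatedly emit jobs with no unprocessed dependencies.
    indegree = {name: len(parents) for name, parents in deps.items()}
    ready = deque(sorted(n for n, k in indegree.items() if k == 0))
    order = []
    while ready:
        job = ready.popleft()
        order.append(job)
        for child in sorted(dependents[job]):
            indegree[child] -= 1
            if indegree[child] == 0:
                ready.append(child)
    if len(order) != len(jobs):
        raise ValueError("job graph contains a cycle")

    # Walk the jobs in order and keep those whose outputs are missing,
    # older than their inputs, or downstream of a job already scheduled.
    rewritten = set()
    plan = []
    for job in order:
        ins, outs = jobs[job]
        newest_in = max((mtime.get(i, float("-inf")) for i in ins),
                        default=float("-inf"))
        stale = (
            any(i in rewritten for i in ins)
            or any(o not in mtime for o in outs)
            or any(mtime[o] + tolerance < newest_in for o in outs)
        )
        if stale:
            plan.append(job)
            rewritten.update(outs)
    return plan


# Hypothetical three-job pipeline: raw -> parsed -> index -> report.
pipeline = {
    "extract": (["raw"], ["parsed"]),
    "index": (["parsed"], ["index"]),
    "report": (["parsed", "index"], ["report"]),
}
# "raw" changed after "parsed" was built, so everything downstream is stale.
print(jobs_to_run(pipeline, {"raw": 100, "parsed": 90, "index": 95, "report": 200}))
```

Only the stale part of the graph is scheduled, which is where incremental processing comes from. Jobs that become ready at the same time in the topological order do not depend on each other, so they are candidates for parallel execution.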
Paper Citation
in Harvard Style
Zaliva V. and Orlov V. (2012). HAMAKE: A DATA FLOW APPROACH TO DATA PROCESSING IN HADOOP. In Proceedings of the 2nd International Conference on Cloud Computing and Services Science - Volume 1: CLOSER, ISBN 978-989-8565-05-1, pages 457-461. DOI: 10.5220/0003893804570461
in Bibtex Style
@conference{closer12,
author={Vadim Zaliva and Vladimir Orlov},
title={HAMAKE: A DATA FLOW APPROACH TO DATA PROCESSING IN HADOOP},
booktitle={Proceedings of the 2nd International Conference on Cloud Computing and Services Science - Volume 1: CLOSER,},
year={2012},
pages={457-461},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0003893804570461},
isbn={978-989-8565-05-1},
}
in EndNote Style
TY - CONF
JO - Proceedings of the 2nd International Conference on Cloud Computing and Services Science - Volume 1: CLOSER,
TI - HAMAKE: A DATA FLOW APPROACH TO DATA PROCESSING IN HADOOP
SN - 978-989-8565-05-1
AU - Zaliva V.
AU - Orlov V.
PY - 2012
SP - 457
EP - 461
DO - 10.5220/0003893804570461