If the list of stop words has been changed,
hamake will re-run only FilterStopWords, Calcu-
lateTF, FindSimilar, and OutputResults.
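As an illustration of this incremental behaviour, the following Java sketch marks a DTR dirty whenever one of its input datasets has changed and propagates the mark through that DTR's outputs. The dataset names and the wiring between them are hypothetical and only loosely follow the example above; this is a sketch, not hamake's actual implementation.

import java.util.*;

/** Minimal sketch (not hamake's code): propagate a changed dataset downstream
 *  through a DTR graph to decide which tasks must be re-run. */
public class DirtyPropagation {
    // Hypothetical edges: dataset -> DTRs reading it, and DTR -> datasets it writes.
    static Map<String, List<String>> readers = Map.of(
            "stopwords.txt", List.of("FilterStopWords"),
            "filtered",      List.of("CalculateTF"),
            "tf",            List.of("FindSimilar"),
            "similar",       List.of("OutputResults"));
    static Map<String, List<String>> outputs = Map.of(
            "FilterStopWords", List.of("filtered"),
            "CalculateTF",     List.of("tf"),
            "FindSimilar",     List.of("similar"),
            "OutputResults",   List.of("report"));

    public static void main(String[] args) {
        Deque<String> changed = new ArrayDeque<>(List.of("stopwords.txt"));
        Set<String> rerun = new LinkedHashSet<>();
        while (!changed.isEmpty()) {
            String dataset = changed.pop();
            for (String dtr : readers.getOrDefault(dataset, List.of())) {
                if (rerun.add(dtr)) {                 // first time this DTR is marked dirty
                    changed.addAll(outputs.getOrDefault(dtr, List.of()));
                }
            }
        }
        System.out.println("Re-run: " + rerun);
        // -> Re-run: [FilterStopWords, CalculateTF, FindSimilar, OutputResults]
    }
}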
5 RELATED WORK
Several workflow engines exist for Hadoop, such as
Oozie (Yahoo!, 2010), Azkaban (Linkedin, 2010),
Cascading (Wensel, 2010), and Nova (Olston et al.,
2011). Although all of these products could be used to
solve similar problems, they differ significantly in design,
philosophy, target user profile, and usage scenarios,
which limits the usefulness of a simple, feature-wise
comparison.
The most significant difference between these engines
and hamake lies in the workflow vs. dataflow
approach. All of them use the former, explicitly specifying
dependencies between jobs. Hamake, in contrast,
uses dependencies between datasets to derive the
workflow. Both approaches have their advantages,
but for some problems the dataflow representation
used by hamake is more natural.
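To make the distinction concrete, the sketch below derives a job execution order purely from the datasets each DTR reads and writes, using a Kahn-style topological sort (Kahn, 1962). The DTR named Tokenize and the dataset names are invented for the example; the code is illustrative and not taken from hamake.

import java.util.*;

/** Sketch of deriving a job execution order from dataset dependencies alone,
 *  in the spirit of hamake's dataflow approach. */
public class DeriveOrder {
    record Dtr(String name, Set<String> inputs, Set<String> outputs) {}

    /** Kahn-style topological sort over DTRs connected through shared datasets. */
    static List<String> order(List<Dtr> dtrs) {
        Map<String, String> producer = new HashMap<>();          // dataset -> producing DTR
        for (Dtr d : dtrs) for (String out : d.outputs()) producer.put(out, d.name());

        Map<String, Set<String>> succ = new HashMap<>();
        Map<String, Integer> indegree = new HashMap<>();
        for (Dtr d : dtrs) { succ.put(d.name(), new HashSet<>()); indegree.put(d.name(), 0); }
        for (Dtr d : dtrs)
            for (String in : d.inputs()) {
                String p = producer.get(in);
                if (p != null && succ.get(p).add(d.name()))
                    indegree.merge(d.name(), 1, Integer::sum);
            }

        Deque<String> ready = new ArrayDeque<>();
        indegree.forEach((n, deg) -> { if (deg == 0) ready.add(n); });
        List<String> result = new ArrayList<>();
        while (!ready.isEmpty()) {
            String n = ready.poll();
            result.add(n);
            for (String s : succ.get(n))
                if (indegree.merge(s, -1, Integer::sum) == 0) ready.add(s);
        }
        return result;                                           // jobs in dependency order
    }

    public static void main(String[] args) {
        System.out.println(order(List.of(
                new Dtr("Tokenize",        Set.of("raw"), Set.of("tokens")),
                new Dtr("FilterStopWords", Set.of("tokens", "stopwords"), Set.of("filtered")),
                new Dtr("CalculateTF",     Set.of("filtered"), Set.of("tf")))));
        // -> [Tokenize, FilterStopWords, CalculateTF]
    }
}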
Kangaroo (Zhang et al., 2011) uses a similar
dataflow DAG processing model, but it is not integrated
with Hadoop.
6 FUTURE DIRECTIONS
One possible hamake improvement is tighter
integration with Hadoop schedulers. For example,
if the Capacity Scheduler or the Fair Scheduler is used, it
would be useful for hamake to take the capacity of
scheduler pools or queues into account in its
job scheduling algorithm.
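One very rough way to model such integration is to cap the number of jobs hamake keeps in flight for a given pool or queue, as in the sketch below. The queue-capacity figure is assumed to be obtained elsewhere; neither the class nor the lookup exists in hamake today.

import java.util.concurrent.Semaphore;

/** Illustrative sketch only: throttle concurrent job submissions to a fraction
 *  of a (hypothetically known) queue capacity. */
public class QueueThrottle {
    private final Semaphore slots;

    QueueThrottle(int queueCapacitySlots, double shareForHamake) {
        this.slots = new Semaphore(Math.max(1, (int) (queueCapacitySlots * shareForHamake)));
    }

    /** Runs a job (synchronously in this sketch), blocking while the queue share is fully used. */
    void submit(Runnable runJob) throws InterruptedException {
        slots.acquire();
        try {
            runJob.run();
        } finally {
            slots.release();
        }
    }

    public static void main(String[] args) throws InterruptedException {
        QueueThrottle throttle = new QueueThrottle(40, 0.25);   // at most 10 concurrent jobs
        for (int i = 0; i < 3; i++) {
            int job = i;
            throttle.submit(() -> System.out.println("running job " + job));
        }
    }
}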
More granular control over parallelism could be
achieved if the hamake internal dependency graph
for a foreach DTR contained individual files rather than
just filesets. For example, consider a dataflow consisting
of three filesets A, B, and C, and two foreach DTRs:
D1, mapping A to B, and D2, mapping B to C. File-level
dependencies would allow some jobs from D2 to run
without waiting for all jobs in D1 to complete.
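The sketch below illustrates the intended effect with plain Java futures: each file of A flows through its own D1 -> D2 chain, so a D2 job starts as soon as its particular input in B exists. The path-rewriting d1 and d2 functions stand in for the real MapReduce jobs and are purely hypothetical.

import java.util.List;
import java.util.concurrent.CompletableFuture;

/** Sketch (assumed names, not hamake code): with per-file dependencies, the D2
 *  task for a given file can start as soon as D1 has produced that file,
 *  instead of waiting for the whole fileset B. */
public class FileLevelPipeline {
    static String d1(String fileInA) {                // D1: maps a file of A to a file of B
        return fileInA.replace("/A/", "/B/");
    }
    static String d2(String fileInB) {                // D2: maps a file of B to a file of C
        return fileInB.replace("/B/", "/C/");
    }

    public static void main(String[] args) {
        List<String> filesetA = List.of("/A/part-0", "/A/part-1", "/A/part-2");

        // One independent D1 -> D2 chain per file: no barrier between the two DTRs.
        List<CompletableFuture<String>> chains = filesetA.stream()
                .map(f -> CompletableFuture.supplyAsync(() -> d1(f))
                                           .thenApply(FileLevelPipeline::d2))
                .toList();

        chains.forEach(c -> System.out.println(c.join()));
    }
}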
Another potential area of future extension is the
hamake dependency mechanism. The current implementation
uses a fairly simple timestamp comparison
to check whether a dependency is satisfied. This
could be generalized by allowing the user to specify custom
dependency check predicates, implemented either
as plugins, scripts (in an embedded scripting
language), or external programs. This would allow
decisions based not only on file metadata, such as
the timestamp, but also on file contents.
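A pluggable check of this kind could look roughly like the following sketch. The DependencyPredicate interface and both implementations are hypothetical, operate on local files via java.nio for brevity rather than on HDFS, and are not part of hamake.

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.util.HexFormat;

/** Hypothetical pluggable dependency check, as suggested in the text. */
interface DependencyPredicate {
    /** Returns true if the target is up to date with respect to the source. */
    boolean isSatisfied(Path source, Path target) throws IOException;
}

/** Roughly today's behaviour: compare modification timestamps. */
class TimestampPredicate implements DependencyPredicate {
    public boolean isSatisfied(Path source, Path target) throws IOException {
        return Files.exists(target)
                && !Files.getLastModifiedTime(target).toInstant()
                         .isBefore(Files.getLastModifiedTime(source).toInstant());
    }
}

/** A content-based alternative: rebuild only if the source's checksum differs
 *  from the checksum recorded next to the target. */
class ChecksumPredicate implements DependencyPredicate {
    public boolean isSatisfied(Path source, Path target) throws IOException {
        Path recorded = target.resolveSibling(target.getFileName() + ".sha256");
        if (!Files.exists(target) || !Files.exists(recorded)) return false;
        try {
            byte[] digest = MessageDigest.getInstance("SHA-256")
                                         .digest(Files.readAllBytes(source));
            return HexFormat.of().formatHex(digest).equals(Files.readString(recorded).trim());
        } catch (java.security.NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }
}

class PredicateDemo {
    public static void main(String[] args) throws IOException {
        Path src = Files.createTempFile("input", ".txt");
        Path dst = Files.createTempFile("output", ".txt");
        System.out.println("up to date: " + new TimestampPredicate().isSatisfied(src, dst));
    }
}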
Several hamake users have requested support
for iterative computations with a termination condition.
Possible use cases include fixed-point computations
and clustering or iterative regression algorithms.
Presently, embedding this kind of algorithm into
a hamake dataflow requires the generations feature
combined with external automation that invokes
hamake repeatedly until a certain exit condition is
satisfied. Hamake users would certainly benefit from
native support for this kind of dataflow.
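Such external automation typically amounts to a small driver like the sketch below, which reruns a hamake dataflow until a convergence marker appears or an iteration limit is reached. The launch command, the hamakefile name, and the marker file are assumptions for illustration only.

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

/** Sketch of the external automation described above: invoke hamake once per
 *  iteration until the dataflow reports convergence. */
public class IterativeDriver {
    public static void main(String[] args) throws IOException, InterruptedException {
        Path converged = Path.of("/tmp/kmeans/converged");  // hypothetically written by the last DTR
        int maxIterations = 20;

        for (int i = 0; i < maxIterations && !Files.exists(converged); i++) {
            // Hypothetical invocation; adjust to the actual hamake launch command.
            Process p = new ProcessBuilder(
                    "hadoop", "jar", "hamake.jar", "-f", "kmeans-iteration.xml")
                    .inheritIO()
                    .start();
            if (p.waitFor() != 0) {
                throw new IllegalStateException("hamake failed at iteration " + i);
            }
        }
        System.out.println(Files.exists(converged) ? "Converged." : "Stopped at iteration limit.");
    }
}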
REFERENCES
Bialecki, A., Cafarella, M., Cutting, D., and O’Malley,
O. (2005. Retrieved February 06, 2012). Hadoop: a
framework for running applications on large clusters
built of commodity hardware.
Dean, J. and Ghemawat, S. (2008). MapReduce: Simplified
data processing on large clusters. Communications
of the ACM, 51(1):107–114.
Gangadhar, M. (2010. Retrieved February 06, 2012).
Benchmarking and optimizing hadoop. http://
www.slideshare.net/ydn/hadoop-summit-2010-
benchmarking-and-optimizing-hadoop.
Kahn, A. (1962). Topological sorting of large networks.
Communications of the ACM, 5(11):558–562.
Linkedin (2010. Retrieved February 08, 2012). Azkaban:
Simple hadoop workflow. http://sna-projects.com/
azkaban/.
Manning, C., Raghavan, P., and Schütze, H. (2008). Intro-
duction to information retrieval. Cambridge Univer-
sity Press.
McCallum, A., Nigam, K., and Ungar, L. H. (2000). Effi-
cient Clustering of High-Dimensional Data Sets with
Application to Reference Matching. KDD ’00.
Olston, C., Chiou, G., Chitnis, L., Liu, F., Han, Y., Lars-
son, M., Neumann, A., Rao, V., Sankarasubrama-
nian, V., Seth, S., et al. (2011). Nova: continuous
pig/hadoop workflows. In Proceedings of the 2011 in-
ternational conference on Management of data, pages
1081–1090. ACM.
Orlov, V. and Bondar, A. (2011. Retrieved February
06, 2012). Hamake syntax reference. http://
code.google.com/p/hamake/wiki/HamakeFileSyntax
Reference.
Weil, K. (2010. Retrieved February 06, 2012). Hadoop
at twitter. http://engineering.twitter.com/2010/04/
hadoop-at-twitter.html.
Wensel, C. (2010. Retrieved February 08, 2012). Cascading.
http://www.cascading.org/.
Yahoo! (2010. Retrieved February 08, 2012). Oozie: Work-
flow engine for hadoop. http://yahoo.github.com/
oozie/.
Zhang, K., Chen, K., and Xue, W. (2011). Kangaroo:
Reliable execution of scientific applications with dag
programming model. In Parallel Processing Work-
shops (ICPPW), 2011 40th International Conference
on, pages 327–334. IEEE.