phase, rather than implicitly hidden in the execution
model (by caching).
Finally, major database vendors currently integrate
Hadoop into their software stack and optimize the
data transfer between the database and Hadoop, e.g.,
to call MapReduce tasks from SQL queries. Greenplum
and Oracle (Su and Swart, 2012) are two commercial
database products for analytical query processing
that support MapReduce natively in their execution
model. However, to our knowledge, they do not support
extensions based on the processing patterns described
in this paper.
6 CONCLUSIONS
In this paper we derived four basic parallel processing
patterns found in advanced analytic applications—
e.g., algorithms from the field of Data Mining
and Machine Learning—and discussed them in the
context of the classic MapReduce programming
paradigm.
We have shown that the introduced programming
skeletons based on the Worker Farm yield expres-
siveness beyond the classic MapReduce paradigm.
They allow using all four discussed processing pat-
terns within a relational database. As a consequence,
advanced analytic applications can be executed di-
rectly on business data situated in the SAP HANA
database, exploiting the parallel processing power of
the database for first-order functions and custom code
operations.
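As a rough illustration of the farm idea summarized above, the following minimal sketch shows a generic split/work/merge skeleton in plain Python. It is not the SAP HANA implementation: the function names (`worker_farm`, `split_lines`, `count_words`, `merge_counts`), the thread pool, and the word-count task are all assumptions chosen for illustration only.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def worker_farm(data, split, work, merge, workers=4):
    """Hypothetical farm skeleton: split the input, run `work` on each
    part in parallel, then merge the partial results."""
    parts = split(data)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(work, parts))
    return merge(results)


def split_lines(lines, n=2):
    # Round-robin partitioning into n parts (illustrative choice).
    return [lines[i::n] for i in range(n)]


def count_words(part):
    # Custom code executed independently by each worker.
    return Counter(word for line in part for word in line.split())


def merge_counts(counters):
    # Combine the per-worker partial counts into one result.
    total = Counter()
    for counter in counters:
        total.update(counter)
    return total


docs = ["a b a", "b c", "a c c"]
counts = worker_farm(docs, split_lines, count_words, merge_counts)
```

Because the split, work, and merge steps are passed in as ordinary functions, the same skeleton can host custom code that does not fit a fixed map and reduce signature, which is the kind of flexibility the conclusion refers to.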
In future work, we plan to investigate and evaluate
optimizations that can be applied when combining
classical database operations, such as aggregations
and joins, with parallelized custom code operations,
as well as the limitations that arise from this
combination.
REFERENCES
Dempster, A. P., Laird, N. M., and Rubin, D. B. (2008).
Maximum Likelihood from Incomplete Data via the EM
Algorithm. Journal of the Royal Statistical Society,
39(1):1–38.
Abouzeid, A., Bajda-Pawlikowski, K., Abadi, D., Silber-
schatz, A., and Rasin, A. (2009). HadoopDB: an
architectural hybrid of MapReduce and DBMS tech-
nologies for analytical workloads. Proc. VLDB En-
dow., 2(1):922–933.
Alexandrov, A., Battré, D., Ewen, S., Heimel, M., Hueske,
F., Kao, O., Markl, V., Nijkamp, E., and Warneke, D.
(2010). Massively Parallel Data Analysis with PACTs
on Nephele. PVLDB, 3(2):1625–1628.
Apache Mahout (2013). http://mahout.apache.org/.
Bu, Y., Howe, B., Balazinska, M., and Ernst, M. D. (2010).
HaLoop: Efficient Iterative Data Processing on Large
Clusters. PVLDB, 3(1):285–296.
Chu, C.-T., Kim, S. K., Lin, Y.-A., Yu, Y., Bradski, G. R.,
Ng, A. Y., and Olukotun, K. (2006). Map-Reduce for
Machine Learning on Multicore. In NIPS, pages 281–
288.
Dean, J. and Ghemawat, S. (2004). MapReduce: Simplified
Data Processing on Large Clusters. In OSDI, pages
137–150.
Dittrich, J., Quiané-Ruiz, J.-A., Jindal, A., Kargin, Y., Setty,
V., and Schad, J. (2010). Hadoop++: making a yellow
elephant run like a cheetah (without it even noticing).
Proc. VLDB Endow., 3(1-2):515–529.
Gillick, D., Faria, A., and Denero, J. (2006). MapReduce:
Distributed Computing for Machine Learning.
Große, P., Lehner, W., Weichert, T., Färber, F., and Li, W.-S.
(2011). Bridging Two Worlds with RICE Integrating R
into the SAP In-Memory Computing Engine.
PVLDB, 4(12):1307–1317.
Kaldewey, T., Shekita, E. J., and Tata, S. (2012). Clydes-
dale: structured data processing on MapReduce. In
Proc. Extending Database Technology, EDBT ’12,
pages 15–25, New York, NY, USA. ACM.
Poldner, M. and Kuchen, H. (2005). On implementing the
farm skeleton. In Proc. Workshop HLPP 2005.
R Development Core Team (2005). R: A Language and
Environment for Statistical Computing. R Foundation
for Statistical Computing, Vienna, Austria. ISBN 3-
900051-07-0.
Sikka, V., Färber, F., Lehner, W., Cha, S. K., Peh, T., and
Bornhövd, C. (2012). Efficient transaction processing
in SAP HANA database: the end of a column store
myth. In Proc. SIGMOD, SIGMOD ’12, pages 731–
742, New York, NY, USA. ACM.
Su, X. and Swart, G. (2012). Oracle in-database Hadoop:
when MapReduce meets RDBMS. In Proc. SIGMOD,
SIGMOD ’12, pages 779–790, New York, NY, USA.
ACM.
The Canadian Hansard Corpus (2001). http://www.isi.edu/
natural-language/download/hansard.
Yang, H.-c., Dasdan, A., Hsiao, R.-L., and Parker, D. S.
(2007). Map-Reduce-Merge: simplified relational
data processing on large clusters. In Proc. SIGMOD,
SIGMOD ’07, pages 1029–1040, New York, NY,
USA. ACM.