When using the first analytical view, the multiplicity view, we achieve another 25% speedup. With the fully modelled query using both the multiplicity and event count views, the execution runtime is reduced to below 3 seconds, a total speedup of 13x, showing the potential of HANA’s internal optimizer when using the modelling infrastructure and analytical views discussed in Section 4.
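To make the modelling concrete, the following SQL sketch shows one possible shape of the two views. The schema (a table particles with columns event_id and charge) is a hypothetical illustration, not the actual EPOS on HANA schema from Section 4.

    -- Multiplicity view (hypothetical schema): number of charged
    -- particles per event.
    CREATE VIEW multiplicity AS
      SELECT event_id, COUNT(*) AS n_particles
      FROM particles
      WHERE charge <> 0
      GROUP BY event_id;

    -- Event count view: total number of events, a small result that
    -- stays constant throughout the analysis.
    CREATE VIEW event_count AS
      SELECT COUNT(DISTINCT event_id) AS n_events
      FROM particles;

    -- The fully modelled query can combine both views, e.g. to
    -- normalize the multiplicity distribution by the event count.
    SELECT m.n_particles,
           CAST(COUNT(*) AS DOUBLE) / (SELECT n_events FROM event_count)
             AS frequency
    FROM multiplicity m
    GROUP BY m.n_particles;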
The result cache enables the caching of smaller intermediate results, for example the event count view, which does not change throughout the analysis. It thus reduces the execution runtime to about a third of that of step 5.
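As a rough illustration of what the cache buys (a manual stand-in for the idea, not the actual HANA result cache interface), one could materialize the small constant result once:

    -- Manual equivalent of caching the constant intermediate result
    -- (illustrative only):
    CREATE TABLE event_count_cached AS (SELECT n_events FROM event_count);
    -- Subsequent analysis queries then read a single cached row
    -- instead of rescanning the full particle table.
    SELECT n_events FROM event_count_cached;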
As a result, we achieve an overall performance improvement of more than a factor of 40 compared to the hand-crafted C++ code using the ROOT framework. Of course, hand-tuned C++ code should outperform our system, but as discussed, the usage of frameworks like ROOT inhibits common optimizations, e.g., the push-down of expressions and filters.
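The following sketch, using the same hypothetical schema as above (with an additional illustrative column eta), shows the kind of push-down a declarative engine can apply:

    -- The optimizer can evaluate the predicates directly on the
    -- column scan, before any grouping, so irrelevant rows never
    -- reach the aggregation.
    SELECT event_id, COUNT(*) AS n_particles
    FROM particles
    WHERE charge <> 0 AND ABS(eta) < 1.0
    GROUP BY event_id;
    -- A ROOT-based C++ loop, in contrast, typically deserializes every
    -- particle object first and filters afterwards in user code, an
    -- ordering the framework cannot rearrange.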
Not only is this interface easier to use, it now also allows for truly interactive and incremental analysis of the simulation and experimental data. As future work, we intend to compare our approach with direct access to the raw ROOT files as presented in (Karpathiotakis et al., 2014), e.g., using the Smart Data Access framework available in SAP HANA.
6 RELATED WORK
The increase in data size and the demand for scalable data processing in several scientific domains have led to increased attention for scientific data processing in the database community (Ailamaki et al., 2010). In that context, specialized database systems with a scientific focus have emerged and advanced over the past decade.
Array DBMS.
In many scientific domains, for example astronomical image processing, data is usually stored in structured representations such as multidimensional arrays. Since the relational data model of conventional DBMSs did not match the requirements of scientific data processing, which is often based on arrays to achieve data locality, systems like RasDaMan (Baumann et al., 1998) or SciDB (Stonebraker et al., 2009) emerged. However, particle physics data does not have an array-like structure that requires element locality, but is rather a huge set of independent measurements; hence, our analysis would not benefit from using an array-based DBMS. Although the authors of a related project come to a similar conclusion (Malon et al., 2011), there have been efforts to utilize SciDB for the ATLAS Tag Database (Malon et al., 2012).
Particle Physics and Database Systems.
The ATLAS Tag Database (Cranshaw et al., 2008; Malon et al., 2011; Malon et al., 2012) stores event-level metadata in a relational database system, so that scientists can preselect events according to their analysis criteria by relational means. Based on the preselection, the analysis is then run conventionally using C++ programs on ROOT files, but only on the selected raw data files rather than the whole set. This approach deviates significantly from our work, since in EPOS on HANA the complete analysis is performed in the DBMS, and the selection criteria are part of the main analysis.
The idea of combining customized in-situ processing on raw data files with the advantages of the columnar data processing capabilities of a modern DBMS has already been mentioned in (Malon et al., 2011). A similar – “NoDB” – approach was implemented in the work of Karpathiotakis et al. (Karpathiotakis et al., 2014), which addresses another workflow from high energy physics (the Higgs boson search). In their work, the authors propose an adaptive strategy that utilizes Just-In-Time (JIT) access paths and column shreds for query processing.
We agree with the authors that it would be infeasible to store as much as 140 PB of raw data generated at CERN in a database system, in particular with regard to main-memory DBMSs. However, the Higgs boson search query considered in (Karpathiotakis et al., 2014) resembles a highly selective exhaustive search over the vast amount of all raw data files. In contrast, our EPOS on HANA application is geared towards an incremental analysis of medium-size data sets, in the range of up to several terabytes, which are typical sizes for Monte-Carlo simulation data. We argue that the column-oriented data layout and efficient data processing capabilities of modern in-memory DBMSs make it possible to shift the actual computation into the database engine. Together with the JavaScript-based SAPUI5 framework, which allows for a flexible and extensible analysis application, our system offers the infrastructure to host the complete analysis workflow in the database, making commercial systems like SAP HANA more interesting for the high energy physics community in the future.
7 CONCLUSIONS
In this paper, we showed how an interesting application from the challenging and data-intensive domain of