Database Appliance for providing a Database as a
Service able to match the predictable performance,
robustness and trustworthiness of on-premise
architectures. This project will deliver a cloud
database appliance featuring both next-generation
hardware for in-memory computing and the software
necessary to process high-update workloads on an
operational database, perform fast analytics, and
apply data mining / machine learning techniques to a
data stream or a data lake, with robustness ensured
by redundancy and failover mechanisms.
We have tested the ActivePivot analytical
database on the hardware built during this ongoing
project. The platform we tested is a Bull Sequana
S800 with eight Intel(R) Xeon(R) Platinum 8158
processors @ 3.00 GHz (96 cores in total), each
processor constituting a NUMA node with 512 GB of
RAM (4 TB in total). For our experiment, we used
the well-known TPC-H benchmark's dbgen tool
(TPC, 2019) to generate data at two scale factors,
100 and 500, yielding respectively 107 GB and
542 GB of data. These data volumes are roughly
representative of the two smaller use case sizes cited
in Table 1, for which the platform has enough
memory to co-host the operational database and the
analytical database. The data (generated in CSV
format) was stored on the machine's local SSD. We
measured the total startup time, from launching the
command line that starts the analytical database until
the analytical database can answer queries on the full
dataset. This includes reading the files, parsing the
CSV into the actual data types, adding the data to the
datastore, and publishing it to the analytical cube.
On the 107 GB dataload, the total startup time was
6 minutes and 31 seconds, which represents an
average throughput of 2,209,997 records published
per second on that initial load. On the 542 GB
dataload, the total startup time was 29 minutes and
9 seconds, with an average throughput of 2,006,060
records published per second. The total startup time
might seem long considering that everything is local
to the machine, with no need to fetch the data over
the network. However, loading performance is
hindered by the TPC-H setting, in which all the data
resides in a single CSV file: even though the parsing
and publishing of records can happen in parallel, the
file reading itself still has to be done sequentially.
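As a minimal illustration (a sketch, not the project's actual loader, with a hypothetical `|`-separated record layout), the bottleneck can be pictured as a single sequential reader feeding parallel parser workers through a bounded queue; however many workers parse and publish, the single read cursor limits overall throughput:

```python
# Sketch: one sequential reader, several parallel parser workers.
import io
import queue
import threading

# Hypothetical sample standing in for a TPC-H table extract.
CSV = "1|alice|10.5\n2|bob|20.0\n3|carol|30.25\n"

def parse(line):
    """Turn one '|'-separated line into typed fields (id, name, amount)."""
    key, name, amount = line.rstrip("\n").split("|")
    return int(key), name, float(amount)

def load(csv_text, n_workers=2):
    q = queue.Queue(maxsize=1024)     # back-pressure between read and parse
    records, lock = [], threading.Lock()

    def worker():
        while True:
            line = q.get()
            if line is None:          # sentinel: reader is done
                return
            rec = parse(line)         # CPU-bound work happens in parallel
            with lock:
                records.append(rec)   # stands in for datastore publication

    workers = [threading.Thread(target=worker) for _ in range(n_workers)]
    for w in workers:
        w.start()
    # The read stage is inherently sequential: one file, one cursor.
    for line in io.StringIO(csv_text):
        q.put(line)
    for _ in workers:
        q.put(None)
    for w in workers:
        w.join()
    return sorted(records)

print(load(CSV))
# → [(1, 'alice', 10.5), (2, 'bob', 20.0), (3, 'carol', 30.25)]
```

Splitting the input into several files (or pushing pre-split records, as discussed below) would let the read stage itself run in parallel.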
We are currently implementing a connector
between the LeanXcale database, the operational
database selected for the CloudDBAppliance
project (CloudDBAppliance, 2018), and the
ActivePivot analytical database. We expect
throughputs to be much higher, since the data will be
pushed directly from the in-memory database:
- the data will be read from RAM, not from SSD;
- the data will already be in a binary format, removing
the need to parse CSV into actual data types;
- the data will already be broken into individual
records, removing the constraint of having to read
one single CSV file sequentially.
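To illustrate the second point, here is a hedged sketch (the record layout below is hypothetical, not LeanXcale's actual wire format) contrasting a CSV record, which must be parsed from text into typed values, with a packed binary record that only needs decoding:

```python
# Sketch: same logical record, CSV text vs. packed binary.
import struct

# Hypothetical layout: little-endian 4-byte int key, 8-byte float amount.
BINARY_LAYOUT = struct.Struct("<id")

def from_csv(line):
    """Text path: split the line, then convert each field's type."""
    key, amount = line.rstrip("\n").split("|")
    return int(key), float(amount)

def from_binary(buf):
    """Binary path: fields are already typed, just decode the bytes."""
    return BINARY_LAYOUT.unpack(buf)

csv_record = "42|99.5\n"
bin_record = BINARY_LAYOUT.pack(42, 99.5)

print(from_csv(csv_record) == from_binary(bin_record))  # → True
```

The binary path skips string splitting and per-field type conversion entirely, which is one reason a direct in-memory push should outperform CSV ingestion.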
We can envision several scenarios for making the best
use of the platform's capacities. In one scenario, the
operational database is always on, but the rest of the
machine's resources are shared among different kinds
of applications: the ActivePivot analytical database
can be started on demand to compute complex
analytical queries, then shut down after use to free
resources for other applications, such as machine
learning Spark jobs run on the data lake. In that
scenario, the main difference from using a data
warehouse built from nightly batches is that the data
will be fresher, since the application can be loaded
on demand from the latest version available in the
operational database. In another scenario, ActivePivot
is always on, receiving data updates continuously
from the operational database. This way, analytical
queries can be performed at any time, without any
loading phase, and reflect real-time changes in the
operational data. In both scenarios, the whole machine
is always up because the operational database has to
remain available. The main advantage of being in the
Cloud rather than on an on-premise machine resides
in the ability to easily switch from one configuration
to another as needs evolve, instead of having to buy a
new server. Indeed, the LeanXcale operational
database comes with data migration capabilities that
allow switching the hosting machine without
interruption of service, and restarting ActivePivot on
the new machine has been shown to take less than
half an hour.
4 CONCLUSION
While it is not feasible to try out every use case, we
believe our tests demonstrate that one of the most
demanding business cases can be addressed fully
without the need for a recurring nightly batch. If
financial risk analytics can be performed without
scheduled batches through a combination of cloud
infrastructure and in-memory computing, we are
convinced that there are few real-world business
scenarios that cannot be addressed in the same way.
Furthermore, while it has served its purpose for
decades, nightly batch processing now appears not