![](bg3.png)
theoretical foundation. Section 4 presents the three
experiments performed, along with a detailed analy-
sis of the star and flat schema design’s performance
for each selected solution. Section 5 presents a com-
parative analysis of the overall performance between
the three DW solutions. Finally, conclusion and fu-
ture work are presented in Section 6.
2 RELATED WORK
We started our literature review by searching for stud-
ies related to Data Warehousing that carried out at
least one experimental performance analysis, prefer-
ably using a benchmark specific for OLAP, and stud-
ies that analyzed different data modeling techniques
and data schemas using one or multiple relational
DBMS, NoSQL or NewSQL solutions.
We found several studies that analyze the per-
formance of a DW using relational DBMSs. The
study by (Almeida et al., 2015) performs an exper-
iment to compare two relational DBMS to process
OLAP queries using the Star Schema Benchmark
(SSB): MySQL InnoDB and Microsoft SQL Server
2012 (MSSQL), in a variety of dataset sizes: 1GB,
3GB, 6GB, 12GB and 24GB, representing a fact ta-
ble with 6M, 18M, 36M, 71M and 144M rows re-
spectively. The author concludes that MSSQL out-
performs MySQL in general, especially for the larger
datasets, while MySQL has acceptable performance
only for workloads up to 6GB, mainly as a result
of the columnstore storage in MSSQL in compari-
son to MySQL InnoDB, which uses a rowstore stor-
age. In (S Ili
´
c, 2022), the author provides a thor-
ough investigation of rowstores versus columnstores
using a variety of different RDBMS: HP Vertica, Mi-
crosoft SQL Server, Oracle, MariaDB ColumnStore
and MySQL, concluding that all queries were ex-
ecuted in an acceptable time of under 10 seconds,
except for the databases using rowstore (MySQL),
meaning columnstore-based DBMSs are highly rec-
ommended for OLAP, but rowstores are not.
We also found a few studies that analyze the per-
formance of a DW using newer technologies, such
as NewSQL, and compare it with relational DBMSs.
The work of (Murazzo et al., 2019) analyzes the per-
formance of Google Cloud Spanner, a NewSQL so-
lution, and compares it with MySQL, both running
on the Google Cloud Platform (GCP). The author
uses the year 2018 from a public historical database
of the City of Buenos Aires as a workload, contain-
ing 895,000 lines, and saves the average time to ex-
ecute three different SQL queries. The results ob-
tained showed that Cloud Spanner achieved a lower
execution time than MySQL in all scenarios and tests.
However, the author points out that for query 3 (the
most complex query, as it has a higher number of
joins), Spanner’s performance is very close to that of
MySQL, leading to a preliminary conclusion that the
greater the complexity, or the higher the number of
joins, the greater the tendency of the two DBMSs to
converge towards a greater and equivalent execution
time.
In (Oliveira, 2017), the authors conduct ex-
periments on MemSQL (currently Singlestore) and
VoltDB using the TPC-H benchmark in SF = 1 (6
million rows, 1GB of size) and on a single machine
with 40GB of RAM. The study shows that MemSQL
achieves better performance in both load time and
average query time, saving 92% of the average time
spent compared to VoltDB. According to the authors,
the reason for MemSQL’s better performance when
executing queries is due to its in-memory cache stor-
age, offering time savings for reexecutions. To an-
alyze a distributed DW, the setup of the experiment,
however, is not ideal, as only one node is used in
the cluster, and the data volume using SF = 1 is too
small a volume to simulate a realistic DW environ-
ment (data always fit in memory).
In a study by (Costa et al., 2017), the authors in-
vestigate the role of data modeling in the process-
ing times of big data DWs implemented using two
SQL-on-Hadoop systems: Hive, an open-source, dis-
tributed NoSQL DBMS by Apache, and Engine-X
(real name omitted due to licensing limitations), an
enterprise NoSQL system. The authors specifically
benchmark multidimensional star schemas and fully
denormalized tables, investigating how data denor-
malization and partitioning affects the performance
of Hive on Tez using the SSB with different Scale
Factors: SF = 10, SF = 30, SF = 100 and SF = 300
in a Hadoop cluster with 5 nodes, each with 32GB
of RAM. The authors conclude that the usage of the
star schema for DWs on Hive may not be the most
efficient design pattern: despite saving a significantly
larger amount of data (SS was 3x smaller than the FT),
Hive still favors the denormalized schema with faster
query execution, less memory requirements, and less
intensive CPU usage. This conclusion arises directly
from evaluating the effort required by the solution
to handle data redundancy in comparison to the ef-
fort needed to manage join operations in a distributed
cluster with 5 nodes. The conclusion, however, can-
not be extended to other classes of DWs, since the au-
thors only experiment with SQL-on-hadoop systems.
In a more recent study, in (Eduardo Pina, 2023)
the authors used OSSpal methodology to analyze
the performance of three NewSQL databases, Cock-
A Performance Analysis for Efficient Schema Design in Cloud-Based Distributed Data Warehouses
41