2 RELATED WORKS
Data is growing in volume at an ever-increasing rate.
In 2013, it was reported that 2.5 quintillion bytes of
data, or 2500 petabytes of data, were created each day
(Wu et al., 2013). A decade later, the Internet of
Things, mobile devices, social media, sensors and a
host of other technologies ensure that the volume of
data produced continues to grow rapidly, even within
a single field of expertise.
The volume of data is already one of several issues
that many companies and organisations must consider.
While more data, particularly good-quality, relevant
data, is beneficial, it introduces additional challenges
around processing and storage, and research and
experimentation to address these challenges are
ongoing.
As previously stated, this paper will investigate
structured (CSV) and semi-structured (JSON,
Parquet, ORC and Avro) data formats. Beyond the
structured/semi-structured distinction, there is also a
division between row-oriented (CSV, JSON and
Avro) and column-oriented (Parquet and ORC)
formats. This matters because the physical layout of
the data determines how efficiently a query can reach
the values it needs within, for example, a table.
Taking a sale or transaction as an example, a
row-oriented format stores all the data about one sale,
such as the item, the cost and the date of sale, together,
and then stores all the data for the next sale together.
A column-oriented format instead stores all the items
sold together, all the costs together, and so on. The
data describing each individual sale is still linked, but
the layout is optimised for working with columns and
subsets of columns, whereas row-oriented formats are
optimised for working with and filtering on individual
rows, or entities (Dwivedi et al., 2012). The two
layouts are sketched below.
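To make the distinction concrete, the following Python sketch lays out the same three sales row by row and column by column; the field names and values are purely illustrative and are not drawn from any of the cited works.

```python
# Illustrative only: the same sales records in a row-oriented layout
# (one record per sale, as CSV, JSON and Avro store them) and in a
# column-oriented layout (one array per field, as Parquet and ORC store them).
row_oriented = [
    {"item": "keyboard", "cost": 25.00,  "date": "2024-01-03"},
    {"item": "monitor",  "cost": 149.99, "date": "2024-01-03"},
    {"item": "mouse",    "cost": 12.50,  "date": "2024-01-04"},
]

column_oriented = {
    "item": ["keyboard", "monitor", "mouse"],
    "cost": [25.00, 149.99, 12.50],
    "date": ["2024-01-03", "2024-01-03", "2024-01-04"],
}

# Reading one complete sale touches a single record in the row layout ...
first_sale = row_oriented[0]

# ... whereas aggregating a single field in the column layout touches only
# one array and leaves the other fields untouched.
total_cost = sum(column_oriented["cost"])
```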
Because the formats differ in whether they compress
data, the size of the same data on disk varies between
them, and several studies have compared the formats
on this basis. JSON and CSV have no built-in
compression, whereas Parquet, ORC and Avro all
make use of compression.
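As an illustration of how this plays out when writing data (not taken from any of the cited studies), the PySpark sketch below writes one DataFrame in all five formats; the paths and codec choices are placeholders, and writing Avro assumes the spark-avro package is available on the classpath.

```python
# A minimal PySpark sketch of writing the same DataFrame in all five formats.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("format-compression-demo").getOrCreate()
df = spark.read.json("input/data.json")  # placeholder input path

# CSV and JSON are written as plain text: no compression unless a codec
# is explicitly requested.
df.write.mode("overwrite").csv("out/csv")
df.write.mode("overwrite").json("out/json")

# Parquet, ORC and Avro compress their data blocks; the codec can be chosen.
df.write.mode("overwrite").option("compression", "snappy").parquet("out/parquet")
df.write.mode("overwrite").option("compression", "zlib").orc("out/orc")
df.write.mode("overwrite").format("avro") \
    .option("compression", "snappy").save("out/avro")
```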
Related work shows that JSON is the largest of the
five data formats on disk, followed by CSV (Belov et
al., 2021).
Of the three compressed formats, Avro files
usually take up the most disk space (Belov et al.,
2021a, 2021b; Naidu, 2022; Plase et al., 2017),
although one study reports an anomaly in which the
Parquet data was larger than the Avro data (Abdullah
& Ahmed, 2020). ORC consistently uses the least
disk space (Belov et al., 2021a, 2021b; Pergolesi,
2019; Plase et al., 2017; Rodrigues et al., 2017); in
one case Parquet had a smaller footprint than ORC,
but the difference was minimal (Naidu, 2022).
Related works that indicate which data formats are
optimised for the functions used in the later
evaluation are now considered. What follows is a
brief summary of their findings, organised by data
format and technology.
Several studies showed that, in a Spark-on-Hadoop
environment, Parquet and ORC were the most
efficient data formats (Belov et al., 2021a, 2021b),
with Parquet outperforming ORC (Abdullah &
Ahmed, 2020; Gupta et al., 2018; Pergolesi, 2019)
and JSON consistently being the most inefficient
format (Belov et al., 2021a, 2021b).
Meanwhile, with Hive as the underlying platform,
ORC performed better than the other data formats
(Gupta et al., 2018; Naidu, 2022; Pergolesi, 2019).
Where ORC was not considered, Parquet was shown
to be a more efficient data format on Hive than Avro
(Plase et al., 2016, 2017).
Avro generally proved to be an inefficient data
format, except when used with Impala (Gupta et al.,
2018), where this combination of format and
architecture outperformed the other combinations
tested.
Finally, one study analysed Parquet and ORC
using Amazon Athena and found that ORC was more
efficient for lookup-intensive queries, whereas
Parquet was more efficient for aggregating queries
(Tran, 2019).
In comparison, this paper tests a range of
individual functions using Amazon Athena and
Amazon EMR across CSV, JSON, Parquet, ORC and
Avro, and additionally compares overall performance
using both time and monetary cost as evaluation
metrics.
3 EXPERIMENT PREPARATION
The COVID Vaccination data from Our World in
Data was chosen as the dataset on which to run the
experiments. It was downloaded in JSON form and
then modified to make it usable.
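As an indication of the kind of modification involved (the exact preprocessing steps are not reproduced here), the pandas sketch below flattens an OWID-style nested JSON file, in which each country-level record carries a list of daily entries, into a flat table; the path and the field names "country", "iso_code" and "data" are assumptions based on the dataset description rather than the actual code used.

```python
# A hedged sketch of flattening an OWID-style vaccinations JSON file with
# pandas; field names and paths are assumptions, not the actual
# preprocessing used for the experiments.
import json
import pandas as pd

with open("vaccinations.json", encoding="utf-8") as fh:  # placeholder path
    records = json.load(fh)

# Each top-level record holds country-level fields plus a nested list of
# daily observations; json_normalize expands that nested list into rows.
flat = pd.json_normalize(records, record_path="data",
                         meta=["country", "iso_code"])

flat.to_csv("vaccinations.csv", index=False)
```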
Our World in Data, the provider of the data, states
its mission as providing “research and data to
make progress against the world’s largest problems”.
The COVID Vaccinations dataset (OWID, 2024)
contains daily updates from each country regarding
vaccinations, including the total number of people
vaccinated, daily vaccinations and booster
vaccinations. There are fields for Country and ISO
Code, and a nested data structure that can contain up