Assessing the Lakehouse: Analysis, Requirements and Definition

Jan Schneider¹, Christoph Gröger², Arnold Lutsch², Holger Schwarz¹ and Bernhard Mitschang¹
¹Institute of Parallel and Distributed Systems, University of Stuttgart, Universitätsstraße 38, 70569 Stuttgart, Germany
²Robert Bosch GmbH, Borsigstraße 4, 70469 Stuttgart, Germany
{christoph.gröger, arnold.lutsch}@de.bosch.com
Keywords: Lakehouse, Data Warehouse, Data Lake, Data Management, Data Analytics.
Abstract: The digital transformation opens new opportunities for enterprises to optimize their business processes by
applying data-driven analysis techniques. For storing and organizing the required huge amounts of data, dif-
ferent types of data platforms have been employed in the past, with data warehouses and data lakes being the
most prominent ones. Since they possess rather contrary characteristics and address different types of analyt-
ics, companies typically utilize both of them, leading to complex architectures with replicated data and slow
analytical processes. To counter these issues, vendors have recently been making efforts to break the bound-
aries and to combine features of both worlds into integrated data platforms. Such systems are commonly
called lakehouses and promise to simplify enterprise analytics architectures by serving all kinds of analytical
workloads from a single platform. However, it remains unclear how lakehouses can be characterized, since
existing definitions focus almost arbitrarily on individual architectural or functional aspects and are often
driven by marketing. In this paper, we assess prevalent definitions for lakehouses and finally propose a new
definition, from which several technical requirements for lakehouses are derived. We apply these require-
ments to several popular data management tools, such as Delta Lake, Snowflake and Dremio in order to
evaluate whether they enable the construction of lakehouses.
1 INTRODUCTION
In recent years, enterprises of almost all sectors have
become subject to fundamental paradigm shifts:
Large-scale projects, such as in the scope of
Industry 4.0 (Lasi et al., 2014), are driving the digital
transformation and aim to interleave traditional
business models with digital technologies. Supported
by the recent advances and the increasing maturity of
AI (Davenport and Ronanki, 2018), data-driven anal-
ysis techniques can now be utilized to evaluate exist-
ing business processes, products and services by de-
riving insights and knowledge from collected data.
This development opens new opportunities for com-
panies to evaluate and optimize their business prac-
tices and hence gain long-term competitive ad-
vantages. For example, in manufacturing, data col-
lected along the value chain can be used to optimize
product lifecycles, taking all stages from product development to retirement into account. To
keep up with the advances in this field and to benefit
from them, enterprises need to a) collect related data,
b) store and organize the resulting huge amounts of
data in a structured manner and c) exploit the data by
applying data-driven analysis techniques. In this con-
text, data platforms play a crucial role: They make it possible to store data and associated metadata from all kinds of
sources and hence form the technical foundation for
data collection, data processing and analytics applica-
tions. While the field of data platforms has been dom-
inated by data warehouses (Inmon W. H., 2005) and
data lakes (Giebler et al., 2019) in the past, a suppos-
edly new type has recently attracted attention: So-
called lakehouses claim to combine the desirable
characteristics of data warehouses and data lakes, al-
lowing to serve all kinds of analytical workloads from
a single platform (Armbrust et al., 2021). This devel-
opment promises huge improvements regarding oper-
ational costs and the quality of analysis results, since
conventional enterprise data architectures are cur-
rently rather complex and require a) the utilization of
several types of data platforms in parallel to serve all
kinds of workloads, b) the storage of multiple copies
of the same data on different platforms, as well as c)
the implementation of error-prone and often slow data
pipelines for synchronizing the data between the plat-
forms, leading to stale or inconsistent data.
The prospect of addressing these problems re-
sulted in high expectations: According to the Gartner
Hype Cycle for Data Management (Feinberg et al.,
2022), the lakehouse vision is about to meet the peak
of expectations and will reach maturity in two to five
years. Consequently, many vendors of data manage-
ment tools try to take advantage of this trend and ex-
pand their products for common features of data
warehouses and data lakes. Since precise definitions
and distinguishing criteria are missing, it remains un-
clear how lakehouses can be characterized, which re-
quirements they must necessarily fulfil and which
data management tools enable the construction of
lakehouses. The broad usage of “lakehouse” as a mar-
keting term further blurs the boundaries.
In the paper at hand, we address these issues by
reviewing prevalent literature and definitions for
lakehouses and make the following key contributions:
- We propose a new definition that overcomes the identified issues of existing definitions,
- based on this definition, we derive eight technical requirements for lakehouses, and
- we evaluate popular data management tools, such as Delta Lake, Snowflake and Dremio, by applying the derived requirements to assess whether these tools already enable the construction of full-fledged lakehouses.
The remainder of this paper first provides background
information regarding the role of data warehouses and
data lakes in enterprise analytics architectures. Sec-
tion 3 then reviews available literature, leading to the
proposal of our definition and the derivation of tech-
nical requirements in Section 4. These are subse-
quently applied to six popular data management tools
in Section 5. Finally, conclusions regarding the inves-
tigated types of data management tools are drawn.
2 BACKGROUND
This section provides an overview of data ware-
houses and data lakes and discusses how they can be
combined in enterprise analytics architectures.
2.1 Data Warehouses and Data Lakes
Data platforms form the technical foundation for data
collection, data processing and analytics applications
(Gröger, 2022). Table 1 summarizes key properties of
common data warehouses and data lakes, the two
most prominent kinds of analytical data platforms.
Having emerged from relational databases as a more convenient solution for large-scale data analysis, data warehouses represent the more established type. They
typically allow multidimensional data modelling and
querying, guarantee ACID properties (Härder and
Reuter, 1983) and provide advanced management ca-
pabilities, such as for data governance, time travel
and zero-copy cloning (Armbrust et al., 2021). Mod-
ern data warehouses transfer these concepts to public
clouds and thus provide high scalability and reduced
operational costs. Due to their static, use-case specific
data models, data warehouses are primarily used to
answer questions that are known in advance, such as
in reporting and online analytical processing (OLAP)
workloads (Chaudhuri and Dayal, 1997), and barely
for any kinds of advanced analytics (Bose, 2009).
These limitations gave birth to the idea of data
lakes, which leverage highly scalable and low-cost
storage systems, such as the Hadoop Distributed File
System (HDFS) or cloud services, to store all kinds
of data in their raw formats as self-contained files or
objects. For this purpose, open file formats like Apache Parquet (https://parquet.apache.org) are commonly utilized, which enable direct data access for applications through the in-
terfaces of the underlying storage layer. Due to these
characteristics, data lakes provide more flexibility for
analyses than data warehouses, but at the cost of low
robustness and a lack of management features. Fur-
thermore, the business value of the stored data can
only be exploited when extensive management of
metadata is performed (Eichler et al., 2021).
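To make this notion of direct access concrete, the following minimal sketch reads such a data lake "table", i.e. a directory of Parquet files, straight through the storage interface. The bucket, path and column names are illustrative assumptions, and pyarrow is only one of several possible Parquet readers.

```python
# Hedged sketch: reading a directory of Parquet files directly via the storage
# layer; bucket, path and columns are made up for illustration.
import pyarrow.dataset as ds

# Any engine that understands the storage API and the open file format can
# read the data without going through a dedicated database server.
readings = ds.dataset("s3://example-datalake/sensor_readings/", format="parquet")
table = readings.to_table(columns=["machine_id", "ts", "temperature"])
print(table.num_rows)
```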
Table 1: Comparison of data warehouses and data lakes.

Property             Data Warehouse              Data Lake
Workloads:           Reporting, OLAP             Advanced analytics
Users:               Business users, analysts    Data scientists
Data access:         Query language, data export Direct access on storage
Data independence:   Physical, partly logical    Weak
Guarantees:          ACID                        Weak
Schema:              On-write                    On-read
Data type:           Mainly structured           All types
Addressing:          Relational                  Via metadata
Data granularity:    Aggregated                  Raw and aggregated
Data storage:        RDBMS                       Object storage
Flexibility:         Low                         High
Mgt. features:       Advanced                    Rudimentary
Analysis questions:  Known in advance            Not known in advance
In summary, it can be stated that both types of
data platforms show rather contrary properties and
hence target different fields of analytical applications.
2.2 Integration Patterns
While data warehouses and data lakes address differ-
ent analytical workloads, companies often need to
leverage both types in parallel (Gröger, 2021), e.g. for
generating business reports, feeding user recommen-
dation systems and for the training and application of
machine learning models. In industrial practice, we
currently see four common patterns for integrating the
capabilities of data warehouses and data lakes into en-
terprise analytics architectures, which are depicted in
Figure 1. In addition to these, variants also exist.
Figure 1: Integration patterns for combining data ware-
houses and data lakes in enterprise analytics architectures.
In ①, a data warehouse and a data lake are used in-
dependently. This pattern assumes that the data of
each data source is of interest for either reporting and
OLAP or advanced analytics, but not for both. Con-
sequently, each data record is only ingested into one
of both data platforms, depending on its relevance for
the different types of analytics. The pattern stands out
for its simplicity, but is strictly limited to scenarios
where the source data can be appropriately split into
disjoint subsets for both workloads, which is rarely
the case. Furthermore, it is inflexible, because the de-
cisions on how to split the source data require upfront
assumptions regarding the analysis questions.
Similarly, ② also employs an independent data
warehouse and data lake; however, the source data is
not split up anymore and instead ingested into both
data platforms where necessary. This way, relevant
data can be exploited for reporting, OLAP and ad-
vanced analytics in parallel, while other data can
solely reside on one of the platforms, depending on
the intended analyses. This approach is more flexible
than ①, but also requires the replication of data to
both platforms, which prevents the formation of a sin-
gle source of truth, provokes additional storage costs,
may cause inconsistencies between both copies of
data and requires several pipelines for data ingestion.
The 2-tier architecture in ③ appears to be the most commonly used integration pattern today
(Armbrust et al., 2021). Here, all source data is first
ingested into the data lake and subsequently prepared
for analytical evaluation. A second data pipeline then
copies or moves data from the data lake to the data
warehouse, where it can be exploited in the scope of
reporting and OLAP workloads. Optionally, another
pipeline can offload data that is no longer required by
the data warehouse back to the data lake to improve
query performance and storage costs (Oreščanin and
Hlupić, 2021). However, this pattern possesses severe
drawbacks (Armbrust et al., 2021): The additional
data pipelines increase the overall complexity of the
architecture and the required data conversions render
them error-prone. In addition, they cause additional
delays, leading to stale data in the data warehouse.
Finally, ④ represents the vision of a single data
platform that combines desirable characteristics and
features of both worlds such that all types of analyti-
cal workloads can be served. This way, no data repli-
cation or additional pipelines for transferring the data
between platforms are needed. What such a solution may look like and which requirements it must necessarily meet is discussed in the following sections.
3 RELATED WORK
First, work related to the general concepts of lake-
houses, associated technologies, implementations and
practical applications is discussed. The second part of
this section then elaborates on existing definitions for
lakehouses and shows why they are insufficient.
3.1 Conceptual Work
The term "lakehouse" was presumably used for the
first time by Alonso (Alonso, 2016), where it de-
scribes “a solution for the analytical framework in
the middle point [...] between classical [data ware-
houses] and [data lakes]” that allows to combine the
schema-on-write and schema-on-read paradigms by
using flexible schemas. However, the work does not
discuss how other properties of data warehouses and
data lakes (cf. Table 1) could be combined as well and
hence positions itself considerably far from today's
notion of a lakehouse. About four years later, the
lakehouse idea took shape with the emergence of the
open source framework Delta Lake (https://delta.io), which intends to
allow the construction of integrated data platforms
that combine characteristics of modern data lakes
with comfortable management features of traditional
data warehouses (cf. ④ in Figure 1). The underlying
concepts and technologies are explained in the ac-
companying paper by Databricks (Armbrust et al.,
2020), which refers to Delta Lake as a novel kind of
“ACID table storage layer over cloud object stores”.
The term “lakehouse” gained further popularity with
their subsequent paper (Armbrust et al., 2021), which
discusses issues of typical enterprise analytics archi-
tectures and emphasizes the benefits of an integrated
lakehouse platform in comparison to the established
2-tier architecture (cf. ③ in Figure 1). The paper focuses on Delta Lake, but also refers to similar frameworks, such as Apache Hudi (https://hudi.apache.org) and Apache Iceberg (https://iceberg.apache.org).
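As a rough illustration of this idea (not taken from the cited papers), the following sketch writes a small table with Delta Lake on top of Apache Spark. The path, column names and session options are assumptions for a local setup with the delta-spark package installed.

```python
# Illustrative sketch of an ACID table layer over file storage using Delta Lake
# with Apache Spark; paths and columns are made up.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("delta-sketch")
         .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
         .config("spark.sql.catalog.spark_catalog",
                 "org.apache.spark.sql.delta.catalog.DeltaCatalog")
         .getOrCreate())

parts = spark.createDataFrame([(1, "ok"), (2, "scrap")], ["part_id", "status"])

# Each write becomes an atomic commit in the table's transaction log, while the
# data itself remains a set of Parquet files on the underlying storage.
parts.write.format("delta").mode("overwrite").save("/tmp/lakehouse/parts")
spark.read.format("delta").load("/tmp/lakehouse/parts").show()
```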
Most of the currently available literature on lake-
houses has adopted the descriptions and elementary
concepts presented in the two previously mentioned
papers, promoting them to cornerstones for research
related to the lakehouse vision. However, also other
perspectives exist: Oreščanin and Hlupić (Oreščanin
and Hlupić, 2021; Hlupić et al., 2022) use the term
“lakehouse” to describe an architecture similar to the
2-tier approach, in which data is transferred between
a data lake and a data warehouse and propose the in-
tegration of a virtualization layer that provides uni-
form data access to the users. According to Azeroual
et al. (Azeroual et al., 2022), lakehouse-like charac-
teristics can be achieved by combining data lakes
with practices of data wrangling. Others argue that
modern, cloud-based data warehouses already repre-
sent feature-rich lakehouses, since they have adopted
common features of data lakes, including the inges-
tion of streaming data, support for semi-structured
data and means for querying data on external cloud
storages (Hansen, 2021; Eckerson, 2020). In contrast,
Inmon et al. (Inmon et al., 2021) argue that lake-
houses are always built on top of existing data lakes.
Due to the different views, Raina and Krishna-
murthy (Raina and Krishnamurthy, 2022) conclude
that the term “lakehouse” should only be used to de-
scribe the general vision of combining both worlds,
rather than to categorize individual tools. However,
we believe that lakehouses add value over prevalent
enterprise analytics architectures and indeed possess
characteristics that distinguish them from traditional
data warehouse or data lake solutions (cf. Section 4).
For the construction of lakehouses, Behm et al.
(Behm et al., 2022) propose a vectorized query engine
for the Databricks ecosystem that provides increased
performance for SQL queries on tables in open file
formats. Fourny et al. (Fourny et al., 2021) developed
a language and library which allows to define tasks
for data preparation and the management of machine
learning models on lakehouses in a declarative man-
ner. Oreščanin and Hlupić (Oreščanin and Hlupić,
2021) suggest to leverage process control frameworks
for orchestrating data flows in lakehouses.
Current proposals for the implementation of lake-
houses are often based on Delta Lake and the Data-
bricks ecosystem, like the one used by Begoli et al.
(Begoli et al., 2021) for the management of biomedi-
cal research data, or on public cloud services
(L'Esteve, 2022; Shiyal, 2021). In contrast, Tovarňák
et al. (Tovarňák et al., 2021) utilize a more diverse
technology stack, including Apache Iceberg, Apache
Spark (https://spark.apache.org), Trino (https://trino.io) and other tools for telemetry analysis.
Due to its popularity and the broad variety of per-
spectives, technologies and implementations related
to the lakehouse vision, a precise characterization is
necessary. However, the existing definition attempts
are not sufficient, as pointed out in the following.
3.2 Prevalent Lakehouse Definitions
An obvious definition for lakehouses can be derived
from the portmanteau word "lakehouse" itself: It sug-
gests the fusion of key characteristics of data ware-
houses and data lakes, some of which are listed in Ta-
ble 1, into a common architecture (cf. Shiyal, 2021;
Raina and Krishnamurthy, 2022; Alonso, 2016;
Eckerson, 2020; Hansen, 2021)). However, it remains
unclear which properties must necessarily be present
and/or whether this architecture needs to reflect a sin-
gle data platform or can also be implemented as sev-
eral tiers (cf. ③ in Figure 1).
The most often cited definition is given by Arm-
brust et al. (Armbrust et al., 2021), who define a lake-
house as “data management system based on low-cost and directly-accessible storage that also provides traditional analytical DBMS management and performance features such as ACID transactions, data versioning, auditing, indexing, caching, and query optimization”. The first two characteristics, low-cost
storage and direct data access, refer to attributes of
data lakes and have a mandatory character in this def-
inition. In contrast, the second part only provides ex-
amples for common management and performance
features of data warehouses that are also desirable for
lakehouses, but does not actually demand any of
them. Depending on the interpretation of this defini-
tion, a) platforms based on the Delta Lake framework,
b) instances of Apache Hive (https://hive.apache.org) on top of the HDFS,
c) a cloud object storage combined with an external
SQL query engine and even d) modern data ware-
houses that support tables on external storages could
be considered lakehouses. However, all of these ap-
proaches show rather different qualities, e.g. with re-
spect to read and write access, provided guarantees
and stream processing capabilities. Furthermore, this
definition does not explain why the listed properties
were selected and how they can achieve benefits over
prevalent enterprise analytics architectures.
According to Gartner (Feinberg et al., 2022), a
lakehouse represents a “converged infrastructure en-
vironment that combines the semantic flexibility of a
data lake with the production optimization and deliv-
ery of a data warehouse” and “supports the full pro-
gression of data from raw, unrefined [to] optimized
data for consumption.” This definition reflects a busi-
ness strategy perspective rather than a technical one.
"Semantic flexibility" and "production optimization
and delivery" are rather abstract properties that do not
provide a sharp outline of the lakehouse paradigm and
are difficult to verify in practice. The second part of
the definition refers to the so-called Delta architec-
ture (Leano, 2020), which is claimed to represent an
alternative to the Lambda (Warren and Marz, 2015)
and Kappa (Kreps, 2014) architectures by unifying
batch and stream processing. While we consider this
an important implication of lakehouses (cf. Section
4.3.8), Gartner remains too abstract and merely de-
scribes a process that can already be implemented on
conventional data lakes by leveraging processing en-
gines and zone models (Giebler et al., 2020).
Hansen (Hansen, 2021) defines a lakehouse as an “architectural approach for managing all [types of
data] and supporting [all] data workloads (Data
Warehouse, BI, AI/ML, and Streaming)”, which em-
phasizes the intended usage of lakehouses rather than
detailed functional characteristics. However, the def-
inition is too broad to delineate a distinct concept, as
all of the integration patterns (cf. Figure 1) can be
considered “architectural approaches” that satisfy the
two prerequisites mentioned in this definition.
Similar to Hansen, our definition (cf. Section 4.1)
also focuses on the analytical workloads lakehouses
must be able to serve, but adds further constraints that
reflect the promised benefits in comparison to cur-
rently operated enterprise analytics architectures.
4 DEFINITION AND
REQUIREMENTS FOR
LAKEHOUSES
This section proposes a definition for lakehouses that
addresses the issues of prevalent definitions as dis-
cussed in Section 3. Next, several technical require-
ments are derived that allow to verify whether given
data platforms represent full-fledged lakehouses.
4.1 Defining the Lakehouse
As shown in Table 1, most of the characteristics of
typical data warehouses and data lakes are rather con-
trary, for example with respect to data access, data in-
dependence, and the storage type. For this reason, a
straightforward merge of both concepts into one uni-
versal data platform that preserves all desirable prop-
erties is not possible. Instead, data warehouses and
data lakes typically need to give up on some of their
characteristics in order to be able to adopt features
from the respective other platform. For instance, a
data warehouse that should support direct data access
on the storage layer like data lakes, must give up its
data independence and instead utilize open file for-
mats. If it should support semi-structured data as well,
it must relax its relational design and consistency
guarantees. Similarly, a data lake that is supposed to
provide ACID properties has to limit its support for
direct data access, since read and write operations
must now be performed according to a specific proto-
col that ensures data integrity.
On the other side, the combination of both types
of data platforms can also lead to new, emergent char-
acteristics that neither typical data warehouses nor
data lakes possess. In summary, they represent the ad-
ditional value of lakehouses for enterprises in com-
parison to the other patterns sketched in Figure 1 and
may include the ability to satisfy all analytical work-
loads from a single data platform and reduced mainte-
nance efforts. However, drawbacks may emerge as
well, such as an increased risk of vendor lock-ins.
Figure 2 illustrates how the characteristics of lake-
houses can be composed. In this figure, defining lake-
houses means to decide which mandatory character-
istics the green set must include. While prevalent def-
initions do not select these in a structured manner (cf.
Section 3.2), we first shift the focus to a higher level
of abstraction and instead of individual characteristics
consider the different kinds of analytical workloads
that are expected to be executed on lakehouses. Sec-
ondly, since each type of workload is associated with
functional requirements, we use the workloads to de-
rive mandatory characteristics in a top-down manner.
Figure 2: Venn diagram illustrating how the characteristics
of lakehouses are composed.
Despite the different perspectives on the lake-
house paradigm, it appears to be common sense that
the fundamental motivation is to simplify existing en-
terprise analytics architectures and to reduce their
complexity. We believe that this vision can only be
achieved when a lakehouse consists of a single, inte-
grated platform that can run the typical workloads of
both data warehouses and data lakes, since architec-
tures with multiple data platforms a) prevent the for-
mation of a single source of truth, b) require addi-
tional data pipelines that need to be maintained, c) re-
quire additional data transformations that may cause
inconsistencies and d) tend generally to become com-
plex and error-prone, which undermines the promises
of the lakehouse paradigm. Consequently, we argue
that the 2-tier architecture (cf. ③ in Figure 1) does
not represent a lakehouse. Based on these considera-
tions, we propose the following definition:
Definition 1. A lakehouse is an integrated data plat-
form that leverages the same storage type and data
format for reporting and OLAP, data mining and ma-
chine learning, as well as streaming workloads.
Reporting and OLAP refer to the primary workload
of data warehouses, while the combination of data
mining and machine learning, as well as streaming
represent typical data lake workloads. All three work-
loads are discussed in detail in Section 4.2 and are
subsequently used to derive technical requirements.
With its additional constraints, the definition ensures
that lakehouses use the same type of storage (e.g. ob-
ject storages) and the same data format (e.g. Parquet)
to serve all of the listed workloads. As a consequence,
the data must not be replicated to different types of
storages or transformed into other formats for these
purposes. Section 4.3.1 explains this in more detail.
The definition deliberately does not make any
statement about non-functional properties of lake-
houses, since these mainly distinguish more suitable
from less suitable lakehouse systems for the respec-
tive application scenario, but have no major influence
on the underlying type of data platform.
4.2 Analytical Workload
Characteristics
This section characterizes the three analytical work-
loads that a lakehouse must be able to serve according
to our definition. Table 2 summarizes the results.
4.2.1 Reporting and OLAP
Reporting refers to the production, delivery and man-
agement of reports (Vaisman and Zimányi, 2022), i.e.
static or interactive overviews of business facts, such
as key performance indicators (KPIs) and corre-
sponding visualizations (Zheng, 2017). For the auto-
matic generation of reports, predefined queries are
typically employed and periodically executed against
the stored data (Vaisman and Zimányi, 2022).
This workload is supplemented by OLAP, which
intends to enable interactive analyses by providing
fast, intuitive, multi-user and scalable query capabili-
ties based on multidimensional data models (Pendse
and Creeth, 1995). Together, reporting and OLAP al-
low to extract descriptive statistics from the stored
data in order to support business decisions. For exam-
ple, the periodic calculation of the first pass yield
within a manufacturing company, broken down to the
level of machines and quarters, can guide decisions
regarding the acquisition and replacement of manu-
facturing machines. To enable efficient query pro-
cessing and report generation, the data must be avail-
able in structured and table-like form, which requires
the definition and enforcement of schemas to ensure
data integrity and quality. While low response times
are desirable for OLAP, batch-like processing is gen-
erally sufficient, since the performed analyses are
typically not time-critical and do not need to happen
immediately in response to external events.
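A reporting query of this kind could, for instance, resemble the following sketch, which computes a first pass yield per machine and quarter with Spark SQL functions. The input path and the columns machine_id, inspection_ts and passed_first_time are purely illustrative assumptions.

```python
# Sketch of a predefined, periodically executed reporting query; path and
# column names are assumptions.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("fpy-report").getOrCreate()
inspections = spark.read.parquet("/tmp/lakehouse/inspections")

# First pass yield per machine and quarter, a typical KPI for a business report.
report = (inspections
          .groupBy("machine_id", F.quarter("inspection_ts").alias("quarter"))
          .agg(F.avg(F.col("passed_first_time").cast("double"))
                .alias("first_pass_yield")))
report.orderBy("machine_id", "quarter").show()
```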
4.2.2 Data Mining and Machine Learning
Both data mining and machine learning are broad
sub-disciplines of the field of advanced analytics
(Bose, 2009). Data mining is the process of discover-
ing patterns and other forms of knowledge in large
data sets (Han et al., 2022). Typical data mining tech-
niques include classification approaches, clustering
analysis, association analysis and regression analysis.
The goal of machine learning is to develop learning
algorithms that are capable of building models from
data, which can then be used to generate predictions
on new observations (Zhou, 2021). Although there is
an overlap between the techniques used in data min-
ing and machine learning, the focus of data mining
lies on finding new patterns and inferring knowledge,
while machine learning tries to generalize from pat-
terns in collected data in order to generate prediction
models for unseen data. For example, in the context
of manufacturing, data mining techniques can be ap-
plied to find usage patterns for products and derive
possible design optimizations from them, whereas
machine learning may allow to create models for the
predictive maintenance of machines. Since most of
the associated techniques and algorithms are too com-
plex to be expressed using query languages and due
to the volume of data that needs to be analysed, data
mining and machine learning usually require direct
read access to the data on the storage layer. This also
provides high flexibility for data mining, as analysis
questions are often not known in advance and only
arise after the discovery of first patterns in the col-
lected data. The data that is supposed to be analysed
can be of arbitrary types, including semi-structured
and even unstructured data (Gröger et al., 2014). This
workload has no strict timing requirements, because
data mining and the training of models are rather slow
processes that involve human experimentation and
produce results that are valid until a further iteration
comes up with an updated version of the model.
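The following sketch hints at what such a workload might look like: a simple predictive-maintenance model trained by reading the stored Parquet files directly, without going through a query interface. The path, feature columns and label are illustrative assumptions.

```python
# Sketch only: training a model directly on files from the storage layer;
# columns and label are made up for illustration.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

telemetry = pd.read_parquet("/tmp/lakehouse/machine_telemetry")  # direct file access

features = telemetry[["temperature", "vibration", "pressure"]]
label = telemetry["failure_within_24h"]

model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(features, label)
print(model.score(features, label))  # in practice, evaluate on a held-out set
```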
4.2.3 Streaming
In the context of analytical workloads, streaming sub-
sumes all analysis techniques for near-real-time re-
porting and stream analytics (Kejariwal et al., 2017).
The goals of near-real-time reporting are similar to
those of batch reporting (cf. Section 4.2.1), with the
difference that the reports are usually replaced by
dashboards whose business facts and visualizations
must be updated within minutes. Due to the time-con-
suming nature of most data-driven analysis tech-
niques, results cannot be continuously re-calculated
and instead must be incrementally updated as new
data arrives. Hence, near-real-time reporting requires
different approaches than batch reporting.
Stream analytics refers to techniques for the anal-
ysis of data that arrives in continuous data streams,
including algorithms for data filtering, pattern detec-
tion and clustering (Kejariwal et al., 2017). Data for
streaming workloads is typically either structured or
semi-structured, since most streaming tools cannot
handle unstructured data well. In the scope of a man-
ufacturing company, the application of machine
learning models to arriving data for predictive
maintenance and the updating of dashboards for shop
floor operators are typical examples of streaming
workloads. Traditional data warehouses are designed
for batch processing and operations on large data vol-
umes and hence not optimized for small incremental
data changes that occur with high frequency, which
renders them unsuitable for streaming. Instead,
streaming workloads have been mainly executed on
data lakes so far, e.g. by applying the Lambda or
Kappa architecture (Giebler et al., 2021).
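As a hedged illustration of incrementally updated results, the sketch below maintains a per-machine aggregate for a dashboard with Spark Structured Streaming. The Kafka broker, topic, schema and the in-memory sink are assumptions, and the kafka source requires the corresponding Spark package.

```python
# Minimal sketch of a near-real-time dashboard feed; broker, topic and schema
# are illustrative assumptions.
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import StructType, StructField, StringType, DoubleType, TimestampType

spark = SparkSession.builder.appName("stream-sketch").getOrCreate()

schema = StructType([
    StructField("machine_id", StringType()),
    StructField("ts", TimestampType()),
    StructField("temperature", DoubleType()),
])

events = (spark.readStream.format("kafka")
          .option("kafka.bootstrap.servers", "localhost:9092")
          .option("subscribe", "machine-events")
          .load()
          .select(F.from_json(F.col("value").cast("string"), schema).alias("e"))
          .select("e.*"))

# The aggregate is maintained incrementally as new records arrive, instead of
# being recomputed from scratch for every dashboard refresh.
per_machine = events.groupBy("machine_id").agg(F.avg("temperature").alias("avg_temp"))

query = (per_machine.writeStream
         .outputMode("complete")
         .format("memory")
         .queryName("dashboard")
         .start())
```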
Table 2: Comparison of the analytical lakehouse workloads.

Characteristic      Reporting/OLAP            DM/ML                      Streaming
Analytics types:    Descriptive, diagnostic   Diagnostic, predictive,    Descriptive, diagnostic,
                                              prescriptive               predictive
Users:              Business users,           Data scientists            Operators, analysts
                    data analysts
Data access:        Via query language        Direct access on storage   Direct access on
                                                                         stream storage
Timing:             Batch                     Batch                      Near-real-time
Data types:         Structured                All types                  Structured, semi-structured
User concurrency:   High                      Low                        Low
4.3 Derived Technical Requirements
Based on the previously described analytical workloads, we identified eight technical requirements that a lakehouse must satisfy in order to comply with our lakehouse definition. Table 3 provides an overview of them and indicates how strongly they were influenced by the different workloads.

Table 3: Overview of the identified lakehouse requirements and the workloads (Reporting/OLAP, DM/ML, Streaming) from which they were mainly derived.

R1  Same storage type and data format
R2  CRUD for all types of data
R3  Relational data collections
R4  Query language
R5  Consistency guarantees
R6  Isolation and atomicity
R7  Direct read access
R8  Unified batch and stream processing
4.3.1 R1: Same Storage Type and Data
Format
Following up on our lakehouse definition, this re-
quirement demands that all data and metadata is
solely stored on a single type of highly scalable stor-
age and that all data (excluding metadata) is stored
using the same data format. Examples of storage
types include object storages, such as provided by
Amazon S3 (https://aws.amazon.com/s3/) or Azure Blob Storage (https://azure.microsoft.com/products/storage/blobs/), and highly scal-
able file systems, like the HDFS, while Parquet and
CSV represent common data formats. As the lake-
house paradigm promises to simplify enterprise ana-
lytic architectures and hence to overcome drawbacks
of the first three integration patterns (cf. Figure 1),
R1 does not allow to replicate data or metadata to dif-
ferent storage types (e.g. from object storage to
RDBMS) or to transform it to different data file for-
mats (e.g. from Parquet to CSV). However, the paral-
lel utilization of multiple storage systems of the same
type, e.g. several cloud object storages from different
providers, is not restricted, as well as the replication
of data to other storages of the same type for ensuring
availability. Furthermore, data may be replicated and
stored in different versions and with different sche-
mas on the same type of storage, which allows the im-
plementation of data processing pipelines. While all
data stored in the lakehouse must leverage the same
data format, metadata may be stored in different for-
mats, since it typically shows a lower volume and
serves different purposes. With the goals of avoiding
complex data integration tasks and keeping enterprise
analytics architectures simple and scalable, R1 is rel-
evant for all three types of analytical workloads.
4.3.2 R2: CRUD for all Types of Data
Data mining and machine learning applications are
not necessarily limited to structured data, but can also
operate on semi-structured or unstructured data (cf.
Section 4.2.2). For this reason, lakehouses must be
able to store all kinds of data, similar to data lakes.
While the possibility of writing data to storage (C of
CRUD) and retrieving the stored data (R) is crucial
for all types of analytical data platforms, it may also
be necessary to update (U) and delete (D) data, e.g.
due to changes of privacy policies or because the
stored data turns out to be erroneous and hence needs
to be repaired or removed. As a result, the ingestion,
retrieval, updating and deleting of all kinds of data
must be supported by the lakehouse at least on the
level of data collections (cf. R3). This means that the
lakehouse must at least allow creating, retrieving, updating and deleting entire data collections. R8 later refines
this requirement for stream processing.
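As a hedged example of such collection-level CRUD on structured data, the sketch below expresses the four operations as SQL on a Delta Lake table through Spark, which is one possible realization. The table name, columns and predicates are illustrative.

```python
# Hedged example of collection-level CRUD; table name and predicates are made up.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
         .config("spark.sql.catalog.spark_catalog",
                 "org.apache.spark.sql.delta.catalog.DeltaCatalog")
         .getOrCreate())

spark.sql("CREATE TABLE IF NOT EXISTS parts (part_id INT, status STRING) USING DELTA")
spark.sql("INSERT INTO parts VALUES (1, 'ok'), (2, 'scrap')")        # Create
spark.sql("SELECT * FROM parts").show()                              # Read
spark.sql("UPDATE parts SET status = 'rework' WHERE part_id = 2")    # Update
spark.sql("DELETE FROM parts WHERE part_id = 1")                     # Delete, e.g. for privacy reasons
```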
4.3.3 R3: Relational Data Collections
In data lakes, structured data is typically broken down
to multiple files and stored in open and column-ori-
ented file formats, such as Parquet. This enables more
efficient column-wise aggregations and direct data
access. However, just dumping the data as multiple
files and providing direct read access is not sufficient
for lakehouses, since especially the reporting and
OLAP workload relies on the relational processing of
data, which requires a higher degree of structure that
associates the stored files with their context. Hence,
lakehouses must provide concepts that allow composing structured data into relations on the logical level,
such that multiple files in the storage system can
jointly represent a cohesive data collection with rela-
tional properties, such as a table-like structure (cf.
(Codd, 1990)). This can be achieved e.g. by storing
and managing technical metadata that contains infor-
mation about the available relations, their column
names and the data files holding their tuples. Streaming applications can also benefit from relational
data collections, since they simplify the handling and
addressing of data sources and sinks.
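The sketch below illustrates the kind of technical metadata this may involve: a manifest that composes multiple Parquet files into one logical, relational data collection. The manifest layout is a simplified assumption and not the format of any particular framework.

```python
# Illustrative sketch of a table manifest linking data files to a logical relation.
import json
import pathlib

table_dir = pathlib.Path("/tmp/lakehouse/inspections")
table_dir.mkdir(parents=True, exist_ok=True)

manifest = {
    "table": "inspections",
    "columns": ["machine_id", "inspection_ts", "passed_first_time"],
    "data_files": ["part-0000.parquet", "part-0001.parquet"],
}
(table_dir / "_manifest.json").write_text(json.dumps(manifest, indent=2))

# A reader resolves the logical table to its current set of files via the manifest.
current = json.loads((table_dir / "_manifest.json").read_text())
print([str(table_dir / f) for f in current["data_files"]])
```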
4.3.4 R4: Query Language
To support typical OLAP tasks, a lakehouse must of-
fer at least a declarative, structured data query lan-
guage (DQL) that allows querying at least the stored structured data in a relational manner. Such a lan-
guage is necessary, because OLAP queries often have
to be created in an experimental manner and with high
frequency. Although additional language elements,
such as those of a data manipulation language (DML), would be desirable as well, these are not mandatory,
since the associated operations could also be issued in
other ways, e.g. via an API. Besides OLAP, a DQL
can also be helpful for specifying the business facts
that are supposed to be included into reports.
4.3.5 R5: Consistency Guarantees
As discussed for R3, structured data is typically
stored as data files in column-oriented formats, such
as Parquet. Since a relational data collection can con-
sist of multiple data files, it is necessary for reporting
and OLAP, but partly also for data mining, to enforce
the consistency of the data within and across these
files. Otherwise, aggregations and filter operations on
the data are not possible in a meaningful and reliable
manner. Hence, a lakehouse must provide means to
enforce the consistency of data across data collections
with respect to its structure. This can be achieved e.g.
by employing schema validation and constraint
checking. However, it is up to the implementation of
the respective lakehouse to decide whether these
guarantees should be enforced when new data is in-
gested into a data collection or when it is queried.
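One possible way to enforce such structural consistency at ingestion time is sketched below: new data is validated against the collection's expected schema before it is written. The schema, path and sample record are illustrative assumptions.

```python
# Sketch of schema validation on ingestion; schema and record are made up.
from datetime import datetime
import pathlib
import pyarrow as pa
import pyarrow.parquet as pq

expected = pa.schema([
    ("machine_id", pa.string()),
    ("inspection_ts", pa.timestamp("us")),
    ("passed_first_time", pa.bool_()),
])

def ingest(batch: pa.Table, path: str) -> None:
    # Reject writes whose structure deviates from the collection's schema.
    if not batch.schema.equals(expected):
        raise ValueError(f"schema mismatch: {batch.schema} != {expected}")
    pq.write_table(batch, path)

pathlib.Path("/tmp/lakehouse/inspections").mkdir(parents=True, exist_ok=True)
good = pa.table(
    {"machine_id": ["m1"], "inspection_ts": [datetime(2023, 1, 1)],
     "passed_first_time": [True]},
    schema=expected,
)
ingest(good, "/tmp/lakehouse/inspections/part-0002.parquet")
```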
4.3.6 R6: Isolation and Atomicity
In order to be able to run different types of workloads
and tasks in parallel, precautions must be taken to
prevent lost updates and other anomalies that can
arise during concurrent data accesses. This is espe-
cially relevant for the generation of reports and OLAP
analyses that may be executed in parallel to write and
update operations on the same data collections, but is
also a prerequisite for unified batch and stream pro-
cessing (cf. R8). Thus, a lakehouse must ensure ato-
micity and isolation (Härder and Reuter, 1983) at
least for the structured data and at least on the level of
data collections (cf. R3), such that incomplete or in-
termediate results cannot be accidentally read during
concurrently executed operations that affect the same
data collections. This can be implemented in various
ways, e.g. via serialization techniques.
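The simplified sketch below shows one such way on file storage, inspired by optimistic concurrency control: a new table version only becomes visible through the atomic, exclusive creation of the next commit file. The log layout is an illustrative assumption, not the protocol of any specific framework.

```python
# Simplified sketch of optimistic commits on file storage; log layout is assumed.
import json
import os
import pathlib

LOG = pathlib.Path("/tmp/lakehouse/parts/_log")
LOG.mkdir(parents=True, exist_ok=True)

def commit(version: int, added_files: list) -> bool:
    entry = LOG / f"{version:020d}.json"
    try:
        # O_EXCL makes the creation fail if another writer already claimed this version.
        fd = os.open(entry, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        return False  # lost the race: re-read the log, rebase the changes, retry
    with os.fdopen(fd, "w") as f:
        json.dump({"add": added_files}, f)
    return True

print(commit(0, ["part-0000.parquet"]))  # True: version 0 committed
print(commit(0, ["part-0001.parquet"]))  # False: version 0 already exists
```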
4.3.7 R7: Direct Read Access
As described in Section 4.2.2, data mining and ma-
chine learning tasks typically require direct access to
the data on the storage layer, which is naturally pro-
vided by data lakes. Similarly, also lakehouses must
allow unmediated read access to all stored data and
metadata and leverage open, standardized file for-
mats, so that the data and metadata can be accessed
directly on the storage layer without needing to export
the data. Being able to directly access the metadata is
crucial, since the metadata describes the context of
the stored data, links data files to data collections and
may be required to ensure isolation (cf. R6). While
the possibility of being able to modify the data di-
rectly on the storage layer would be desirable as well,
this is not demanded, as it would likely conflict with
R5 and R6 that typically need to enforce specific pro-
tocols for write access.
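For a Delta table such as the one from the earlier sketch, both data and metadata are plain files on storage, so the following hedged example opens them without any query engine; the path is an assumption.

```python
# Hedged sketch of direct read access to data files and transaction-log metadata.
import glob
import json
import pyarrow.parquet as pq

table_path = "/tmp/lakehouse/parts"

# Metadata: the log files list which data files belong to the current table state.
for log_file in sorted(glob.glob(f"{table_path}/_delta_log/*.json")):
    with open(log_file) as f:
        for line in f:
            print(json.loads(line).keys())

# Data: the referenced Parquet files can be read with any Parquet library.
for data_file in glob.glob(f"{table_path}/*.parquet"):
    print(pq.read_table(data_file).schema)
```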
4.3.8 R8: Unified Batch and Stream
Processing
In order to support streaming workloads, lakehouses
must be able to deal with continuous streams of data,
i.e. allow the ingestion of data from data streams into
data collections and provide stored or updated data
rapidly to stream consumers. However, since other
workloads may be executed on a lakehouse as well, it
may be desirable to combine stream processing with
batch-based processing steps. Since R6 already as-
sures isolation and atomicity, this is possible without
risking to run into concurrency anomalies and to read
intermediate results. For this reason, lakehouses pro-
vide new opportunities for breaking the boundaries
between batch and stream processing and interleaving
both techniques as needed, because data collections
can act as sinks and sources for both batch and stream
processing. However, stream processing typically re-
quires incremental changes to small batches or even
single records of data with high frequency and in
near-real-time (cf. Section 4.2.3). Hence, in sum-
mary, a lakehouse must a) support the near-real-time
execution of at least append, update and read opera-
tions on single records of structured data at a high
rate, i.e. several operations within a second, b) allow
to interleave batch processing and stream processing
tasks while ensuring data integrity in accordance with
R6 and c) be able to provide the updated contents of
data collections as data source to external batch and
stream processing tools and to ingest data from these
tools into data collections as data sink.
In our opinion, the support for external batch and
stream processing tools, such as Apache Spark for
batch processing and Spark Structured Streaming or
Apache Flink (https://flink.apache.org) for stream processing, is crucial, since
both batch and streaming tasks typically require so-
phisticated implementations with a broad range of
features and high flexibility that can barely be pro-
vided by individual platform-internal solutions.
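The sketch below illustrates this interleaving on a single data collection, here a Delta table used as streaming sink, batch source and streaming source. The rate source, paths and session options are illustrative assumptions for a local setup.

```python
# Sketch of interleaving batch and stream processing on one data collection.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
         .config("spark.sql.catalog.spark_catalog",
                 "org.apache.spark.sql.delta.catalog.DeltaCatalog")
         .getOrCreate())

events = spark.readStream.format("rate").option("rowsPerSecond", 10).load()

# Stream writer: appends small record batches to the collection in near-real-time.
stream = (events.writeStream.format("delta")
          .option("checkpointLocation", "/tmp/lakehouse/events/_checkpoint")
          .start("/tmp/lakehouse/events"))
stream.processAllAvailable()

# Batch reader: the very same collection can be queried with batch semantics ...
print(spark.read.format("delta").load("/tmp/lakehouse/events").count())

# ... and exposed again as a streaming source for downstream consumers.
downstream = spark.readStream.format("delta").load("/tmp/lakehouse/events")
```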
5 TOOL EVALUATION
We applied the previously described requirements to
six popular data management tools in order to evalu-
ate whether they enable the construction of lake-
houses that comply with our definition. While the re-
quirements could also be met by combining several dif-
ferent tools into one data platform, we considered
each of them separately, as off-the-shelf solutions are
generally preferred over custom compositions and
hence more promising for industrial application.
However, as the frameworks Delta Lake, Apache
Hudi and Apache Iceberg do not represent self-con-
tained data platforms and are instead designed as en-
hancements for existing processing engines, we eval-
uated them in combination with Apache Spark as the
hosting infrastructure. The assessment is primarily
based on the available online documentations of the
tools in their latest version at the time of evaluation,
and in some cases supplemented by insights gained
via prototyping. Table 4 summarizes the results for each tool. For those satisfying all eight requirements, we conclude that they enable the construction of lakehouses (cf. Table 4). Snowflake (https://snowflake.com) was evaluated twice, once using internal tables and once using external tables, as they show different characteristics.

Table 4: Evaluation results for six popular data management tools, listing the versions evaluated and whether they enable to build lakehouses; the assessment of the individual requirements R1 to R8 per tool is discussed in the text.

Tool                             Version   Lakehouses?
Delta Lake                       2.1.0     yes
Apache Hudi                      0.12.1    yes
Apache Iceberg                   1.0.0     yes
Snowflake with internal tables   6.31.1    no
Snowflake with external tables   6.31.1    no
Dremio                           23.0.1    no
Trino                            394       no
Based on our evaluation, we conclude that cur-
rently only Delta Lake, Apache Hudi and Apache Ice-
berg enable the construction of lakehouses. All three
frameworks operate on top of cloud object stores or
the HDFS and also use them to store metadata in the
JSON format. This metadata contains information
about the available tables, their structure and also in-
cludes a log that tracks additions and deletions of data
files in order to provide isolation and atomicity. In ad-
dition, these frameworks support SQL queries
through the hosting processing engine, allow schema
validation and offer comprehensive integration points
for common batch and stream processing tools, e.g.
Apache Spark, Apache Flink and Apache Kafka (https://kafka.apache.org).
Snowflake represents a modern type of data ware-
house and provides several features and characteris-
tics that go beyond those of traditional data ware-
houses, including cloud deployment, the separation
between compute and storage nodes and additional
management capabilities for semi-structured data
(Dageville et al., 2016). Nevertheless, Snowflake in-
ternally still relies on a proprietary, non-open format
for storing the data in order to accelerate query pro-
cessing, which prevents direct read access (cf. R7).
Snowflake supports batch processing pipelines via a
combination of periodically executed tasks and change
data capture, allows Spark batch jobs to query and
write data and can also ingest streaming data from
Kafka and Spark Structured Streaming. However, the
streaming ingestion is limited to append operations, and streaming from Snowflake tables, i.e. using them as sources for stream processing tools, is not supported at the time of evaluation.
When using external tables instead of internal
ones, the data resides on a directly accessible, third-
party cloud storage in an open format and can still be
queried like an ordinary table within Snowflake.
However, the metadata remains in Snowflake's
metadata store (cf. R1), which is an instance of FoundationDB (https://foundationdb.org) and does not provide unmediated access
(cf. R7). Furthermore, external tables are read-only
and Snowflake provides no means for inserting, up-
dating or deleting their data, which is a missing pre-
requisite for R2, R5, R6 and R8.
Dremio (https://dremio.com) is advertised as a lakehouse platform
that brings self-service analytics and data warehouse
functionality to data lakes. In contrast, Trino de-
scribes itself as a distributed SQL query engine which
allows to query large datasets that are possibly dis-
tributed over several data sources. Despite their dif-
ferent focus, Dremio and Trino show several similar-
ities from a high-level perspective: Both tools are
used on top of self-contained data platforms, such as
data lakes or even relational or NoSQL databases, and
are hence not responsible for ingesting, storing and
organizing data themselves. Instead, they are built
around an SQL-based query engine and allow to
query the data on these platforms. Although Dremio
and Trino provide a few rudimentary DML operations
for some of the supported storages, data manipulation
is generally supposed to take place on the data plat-
forms themselves. For this reason, Dremio and Trino
are neither capable of providing consistency guaran-
tees (cf. R5), nor atomicity and isolation (cf. R6) for
the underlying storage layers. Dremio stores its
metadata as key-value pairs within an instance of
RocksDB (http://rocksdb.org), which is a different type of storage than the one employed for the actual data (cf. R1) and also impedes
direct metadata access (cf. R7). With Trino, the stor-
age location of the metadata depends on the storage
systems that are used as data sources and the connect-
ors that Trino provides for them. The Hive connector (https://trino.io/docs/current/connector/hive.html) allows Trino to access data in open file for-
mats that resides on highly scalable storage systems,
such as instances of the HDFS or cloud object stores.
However, this connector also requires a Hive Metastore (https://hive.apache.org) for metadata management, which represents a
different type of storage (cf. R1) and does not provide
unmediated access (cf. R7). Since Dremio and Trino
possess little control over the data that resides on
the data platforms, they also cannot provide support
for unified batch and stream processing in accordance
with R8. Hence, we do not consider them as tools that
enable the construction of lakehouses.
6 CONCLUSIONS
In this paper, we first elaborated on the motivation for
the recently emerging lakehouse paradigm and as-
sessed different perspectives and definitions that are
available in literature. As we found the existing defi-
nitions insufficient, we proposed a new definition
based on the promises and key benefits of lakehouses
in comparison to prevalent enterprise analytics archi-
tectures. This definition allowed us to derive eight
technical requirements, which can be used to verify
whether given data platforms represent full-fledged
lakehouses. We subsequently applied these require-
ments to six popular data management tools and in-
vestigated to which degree they support the construc-
tion of lakehouses that comply with our definition.
As a result of this evaluation, we found that of the
reviewed tools, only Delta Lake, Apache Hudi and
Apache Iceberg were able to satisfy all of our require-
ments. These tools represent feature-rich frameworks
that operate on top of highly scalable and directly ac-
cessible storages and leverage additional metadata to
enhance them with lightweight data warehousing capa-
bilities. Hence, the resulting data platforms can be
considered advanced data lakes that follow the pattern
“Integrated Architecture” as shown in Figure 1 and
allow to serve all kinds of analytical workloads.
In contrast, the other assessed tools, including
Snowflake and Dremio, provide only individual en-
hancements for data lakes, such as SQL query capa-
bilities. Thus, they need to be complemented by other
frameworks in order to meet all re-
quirements and allow the construction of lakehouses.
In future work, we plan to expand our evaluation
to further data management tools, to investigate the
suitability and maturity of lakehouse concepts for in-
dustrial applications and to assess the implications of
the so-called Delta architecture.
REFERENCES
Alonso, P. J. (2016, October). SETA, a suite-independent
agile analytical framework. Master thesis, Polytechnic
Univ. of Catalonia, BarcelonaTech.
Armbrust, M., Das, T., Sun, L., & others. (2020). Delta
lake: high-performance ACID table storage over cloud
object stores. Proceedings of the VLDB Endowment,
13, 3411–3424.
Armbrust, M., Ghodsi, A., Xin, R., & others. (2021).
Lakehouse: a new generation of open platforms that
unify data warehousing and advanced analytics.
Proceedings of CIDR.
Azeroual, O., Schöpfel, J., Ivanovic, D., & others. (2022).
Combining Data Lake and Data Wrangling for
Ensuring Data Quality in CRIS. CRIS2022: 15th
International Conference on Current Research
Information Systems.
Begoli, E., Goethert, I., & Knight, K. (2021). A Lakehouse
Architecture for the Management and Analysis of
Heterogeneous Data for Biomedical Research and
Mega-biobanks. 2021 IEEE International Conference
on Big Data, (pp. 4643–4651).
Behm, A., Palkar, S., Agarwal, U., & others. (2022).
Photon: A Fast Query Engine for Lakehouse Systems.
Proceedings of the 2022 Internat. Conf. on
Management of Data, (pp. 2326–2339).
Bose, R. (2009, March). Advanced analytics: opportunities
and challenges. Industrial Management & Data
Systems, 109, 155–172.
Chaudhuri, S., & Dayal, U. (1997, March). An Overview of
Data Warehousing and OLAP Technology. SIGMOD
Rec., 26, 65–74.
Codd, E. F. (1990). The relational model for database
management: version 2. Addison-Wesley Longman
Publishing Co., Inc.
Codd, E. F., Codd, S. B., & Salley, C. T. (1993). Providing
OLAP (on-line analytical processing) to user-analysts.
An IT Mandate. White Paper. Arbor Software
Corporation, 4.
Dageville, B., Cruanes, T., Zukowski, M., Antonov, V.,
Avanes, A., Bock, J., Unterbrunner, P. (2016, June).
The Snowflake Elastic Data Warehouse. Proceedings
of the 2016 International Conference on Management
of Data. ACM.
Davenport, T. H., & Ronanki, R. (2018). Artificial
intelligence for the real world. Harvard business
review, 96, 108–116.
Eckerson, W. (2020, June 8). All Hail, the Data Lakehouse!
(If Built on a Modern Data Warehouse). Retrieved
December 8, 2022, from https://www.eckerson.com/
articles/all-hail-the-data-lakehouse-if-built-on-a-moder
n-data-warehouse
Eichler, R., Giebler, C., Gröger, C., & others. (2021).
Modeling metadata in data lakes—A generic model.
Data & Knowledge Engineering, 136, 101931.
Feinberg, D., Russom, P., & Showell, N. (2022, June).
Hype Cycle for Data Management. Gartner Inc.
Fourny, G., Dao, D., Cikis, C. B., & others. (2021).
RumbleML: program the lakehouse with JSONiq.
arXiv.
Giebler, C., Gröger, C., Hoos, E., & others. (2019).
Leveraging the data lake: Current state and challenges.
Internat. Conf. on Big Data Analytics and Knowledge
Discovery, (pp. 179–188).
Giebler, C., Gröger, C., Hoos, E., & others. (2020). A Zone
Reference Model for Enterprise-Grade Data Lake
Management.
IEEE 24th Internat. Enterprise
Distributed Object Computing Conf., (pp. 57-66).
Giebler, C., Gröger, C., Hoos, E., & others. (2021). The
Data Lake Architecture Framework. BTW 2021.
Gröger, C. (2021). There is no AI without data.
Communications of the ACM, 64, 98–108.
Gröger, C. (2022). Industrial analytics–An overview. it-
Information Technology.
Gröger, C., Schwarz, H., & Mitschang, B. (2014). The
Manufacturing Knowledge Repository. Proceedings of
the 16th International Conference on Enterprise
Information Systems, (pp. 39-51).
Han, J., Pei, J., & Tong, H. (2022). Data mining: concepts
and techniques. Morgan Kaufmann.
Hansen, J. (2021, April 1). Selling the Data Lakehouse.
Retrieved December 8, 2022, from
https://medium.com/snowflake/a9f25f67c906
Härder, T., & Reuter, A. (1983). Principles of transaction-
oriented database recovery. ACM computing surveys
(CSUR), 15, 287–317.
Hlupić, T., Oreščanin, D., Ružak, D., & others. (2022). An
Overview of Current Data Lake Architecture Models.
2022 45th Jubilee International Convention on
Information, Communication and Electronic
Technology, (pp. 1082–1087).
Inmon, B., Levins, M., & Srivastava, R. (2021, October).
Building the Data Lakehouse. TECHNICS PUBN LLC.
Inmon, W. H. (2005). Building the data warehouse. John Wiley & Sons.
Kejariwal, A., Kulkarni, S., & Ramasamy, K. (2017). Real
Time Analytics: Algorithms and Systems. arXiv.
Kreps, J. (2014, July). Questioning the Lambda
Architecture. Retrieved December 8, 2022, from
https://www.oreilly.com/radar/questioning-the-
lambda-architecture/
Lasi, H., Fettke, P., Kemper, H.-G., Feld, T., & Hoffmann,
M. (2014, June). Industry 4.0. Business & Information
Systems Engineering, 6, 239–242.
Leano, H. (2020, November). Delta vs. Lambda: Why
Simplicity Trumps Complexity for Data Pipelines.
Retrieved December 8, 2022, from
https://www.databricks.com/blog/2020/11/20/delta-vs-
lambda-why-simplicity-trumps-complexity-for-data-
pipelines.html
L'Esteve, R. (2022, July). The Azure Data Lakehouse
Toolkit. Apress.
Oreščanin, D., & Hlupić, T. (2021). Data Lakehouse - a
Novel Step in Analytics Architecture. 44th Internat.
Conv. on Information, Communication and Electronic
Technology, (pp. 1242-1246).
Pendse, N., & Creeth, R. (1995). The OLAP report.
Business Intelligence.
Raina, V., & Krishnamurthy, S. (2022). Building an
Effective Data Science Practice. Apress, Springer.
Shiyal, B. (2021, June). Beginning Azure Synapse
Analytics. Apress.
Tovarňák, D., Raček, M., & Velan, P. (2021). Cloud Native
Data Platform for Network Telemetry and Analytics.
17th Internat. Conference on Network and Service
Management, (pp. 394-396).
Vaisman, A., & Zimányi, E. (2022). Data Warehouse
Concepts. In Data Warehouse Systems: Design and
Implementation (pp. 45-74). Berlin, Heidelberg:
Springer Berlin Heidelberg.
Warren, J., & Marz, N. (2015). Big Data: Principles and
best practices of scalable realtime data systems. Simon
and Schuster.
Zheng, J. G. (2017, November). Data Visualization in
Business Intelligence. In Global Business Intelligence
(pp. 67–81). Routledge.
Zhou, Z.-H. (2021). Machine learning. Springer Nature.