Table 1: Comparison of Data Warehouses vs Data Lakes.

                   Data Warehouse                    Data Lake
  Data             Structured, processed             Raw: structured / semi-structured / unstructured
  Processing       Schema-on-write                   Schema-on-read
  Storage          Large storage required            Larger storage required (architecturally less complex);
                   (architecturally complex)         can be cheaper despite the larger amount of data stored
  Agility          Fixed structure                   Tailored, flexible structure
  Purpose of data  Fixed (BI, reporting)             Not yet determined (machine learning, data analytics)
  Target audience  Business professionals            Data scientists
auto-fill and edit check options, such as speech-to-text and OCR (optical character recognition), that computerize customers' data may introduce systematic and random inaccuracies that differ from clerk to clerk and from software tool to software tool. These errors are difficult to quantify and forestall.
To ensure that there is a centralized location serving as the single source of truth, data of different types and structures from various sources should be collected and loaded into a Data Lake. The most recent technologies yield opportunities for the application of data analytics and Data Science models. The results of running data analytic algorithms can become actionable knowledge in the clinical research environment.
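For illustration only, the following minimal Python sketch (the paths, directory layout, and ingest helper are our assumptions, not the paper's implementation) shows how heterogeneous source files could be landed unchanged in a raw zone of a Data Lake, together with simple provenance metadata:

    import datetime
    import json
    import shutil
    from pathlib import Path

    # Hypothetical landing zone of the Data Lake; sources are copied unchanged.
    RAW_ZONE = Path("datalake/raw")

    def ingest(source_file: str, source_system: str) -> Path:
        """Copy a source file verbatim into the raw zone and record provenance."""
        day = datetime.date.today().isoformat()
        target_dir = RAW_ZONE / source_system / day
        target_dir.mkdir(parents=True, exist_ok=True)
        target = target_dir / Path(source_file).name
        shutil.copy2(source_file, target)   # no cleansing, no transformation
        meta = {"source_system": source_system, "ingested_at": day,
                "original_name": Path(source_file).name}
        (target.parent / (target.name + ".meta.json")).write_text(json.dumps(meta))
        return target

Because nothing is transformed on the way in, the single source of truth retains every record exactly as it was produced by the source system.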
Generally, it is assumed that the data coming from source systems are of good quality; however, market and administrative forces have not enforced a satisfyingly high standard of data quality. The typical life history of data can be seen in Figure 1. On the left part of the diagram, the major source systems of financial data can be found. The data are represented as customers' electronic personal records, geo-codes for geographical information, and other loosely coupled data related to management and business administration.
Our paper showcases an architecture for a moderate-size insurance enterprise environment, so the Vs of Big Data (volume, velocity, variety, veracity, variability, value) are as follows: a population of a million customers may generate electronic customer records on the order of terabytes yearly. Primarily, the variability of structured, semi-structured, and unstructured data increases the complexity, and thereby the difficulty, of ensuring a single point of truth within the data collection.
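As a rough back-of-envelope check of that volume estimate, consider the following Python sketch; the per-customer record counts and record sizes are purely illustrative assumptions:

    # Back-of-envelope volume estimate (illustrative figures, not from the paper).
    customers = 1_000_000
    records_per_customer_per_year = 50   # assumed: claims, policy updates, documents
    avg_record_size_bytes = 40_000       # assumed: forms, scans, OCR output
    yearly_bytes = customers * records_per_customer_per_year * avg_record_size_bytes
    print(f"{yearly_bytes / 1e12:.1f} TB per year")   # -> 2.0 TB per year

Under these assumptions, a million customers indeed produce data on the order of terabytes yearly.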
A Data Lake, as a Big Data analytics system, allows the continuous collection of the structured and unstructured data of an organization in their original representation, without changing the original data and without data cleansing or transformation of data models, i.e., without loss of information, so the information can be analyzed with the greatest degree of freedom. Data Lake solutions focus primarily on integration and efficient data storage processes, besides providing advanced data management, data analytics, machine learning, self-service Business Intelligence (BI), and data visualization services.
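To illustrate the schema-on-read style listed in Table 1 (the field names and record shapes below are our assumptions), raw records can be stored verbatim, with a structure imposed only at analysis time:

    import json

    # Assumed raw events as they arrive from source systems; stored verbatim.
    raw_lines = [
        '{"customer": "C1", "event": "claim", "amount": "1200.50"}',
        '{"customer": "C2", "note": "phone call, no claim filed"}',  # other shape
    ]

    # Schema-on-read: structure is imposed only when the data are analyzed,
    # so records that do not fit the current question are simply skipped.
    def read_claims(lines):
        for line in lines:
            rec = json.loads(line)
            if "amount" in rec:
                yield rec["customer"], float(rec["amount"])

    print(list(read_claims(raw_lines)))   # -> [('C1', 1200.5)]

In a Data Warehouse, by contrast, the schema would be enforced on write, and the second record would have to be transformed or rejected before storage.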
A Data Lake offers services for business users through BI tools, but its typical target audience is data scientists. A Data Lake is effective for an organization where a significant part of the organizational data is structured (and interpreted, stored in several not yet reconciled sources), complemented by a large amount of unstructured data. Most importantly, the goal of the data processing is to utilize corporate data assets by exploring new contexts, typically non-repetitive research questions. Our proposed Data Lake architecture includes a Data Warehouse as well, to support the information requirements of the organization (see Table 1). Within a hybrid Data Lake that contains a robust Data Warehouse, the life cycle of data commences with transformation, cleansing, and integration. The objective of building up a Data Lake is to separate the daily operational, transactional data from the non-production data collection. Historically, Data Warehouse technology has been employed for that purpose. During the data staging phase, the data are cleansed and filtered for the target data structure within the Data Warehouse, i.e., the fact table and dimensions. The data staging phase includes data migration, data integration, translation of codes used for data representation, and transformation between database management systems. The Data Warehouse has traditionally served as the basis for data analysis. The ETL (Extract, Transform, Load) procedure is applied to feed data into the Data Warehouse. During that step, general data cleansing and transformation happen, e.g., removal of trailing and leading white spaces and of superfluous zeros, standardization of identifiers/identifying numbers, enforcing constraints on data fields, and converting imperial units into metric units (see the sketch below). While the aforementioned data transformations are carried out, relationships among entities may be dropped or harmed. Similarly, the integration of data from multiple source systems can lead to errors that are transmitted into the Data Warehouse.
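The following minimal Python sketch illustrates the kind of cleansing steps listed above; the field names, identifier formats, and constraints are illustrative assumptions rather than the paper's actual pipeline:

    # Minimal sketch of typical ETL cleansing steps (illustrative only).
    def cleanse_record(record: dict) -> dict:
        out = dict(record)
        # Remove leading/trailing white space from all string fields.
        out = {k: v.strip() if isinstance(v, str) else v for k, v in out.items()}
        # Strip superfluous leading zeros from a numeric identifier.
        if "policy_id" in out:
            out["policy_id"] = out["policy_id"].lstrip("0") or "0"
        # Standardize an identifier format (upper case, no separators).
        if "customer_id" in out:
            out["customer_id"] = out["customer_id"].replace("-", "").upper()
        # Enforce a simple constraint on a data field.
        if out.get("age") is not None and not (0 <= int(out["age"]) <= 120):
            raise ValueError(f"age out of range: {out['age']}")
        # Convert an imperial unit into metric (pounds -> kilograms).
        if "weight_lb" in out:
            out["weight_kg"] = round(float(out.pop("weight_lb")) * 0.45359237, 2)
        return out

    print(cleanse_record({"policy_id": "000123", "customer_id": "ab-42",
                          "age": "37", "weight_lb": "180"}))

Each of these steps is lossy in its own way (the popped weight_lb field, for instance, is gone after conversion), which is exactly why such transformations can drop or harm relationships among entities.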
To overcome the data quality limitations of the Data Warehouse, the idea of the Data Lake was conceptualized. The touted idea of the Data Lake is that it deposits data in their original form, i.e., the Transient and/or Raw Data (see Figure 2 and Figure 3, (LaPlante and Sharma, 2014)) contain the data after an in-