(A. Beheshti and Tabebordbar, 2018).
In (M. Wibowo and Shamsuddin, 2017), the authors propose a machine learning technique to optimize data management processes in a DL by combining data silos. This solution, intended to improve data quality, is divided into two phases. The first phase bridges the gap between the data sources, i.e., the data silos, and the DL that will manage them. In this phase, data discovery describes the data, governance captures the data using evolving metadata, and data mining builds new data models to combine with other ML processes. The second phase verifies the result, using several tools related to reporting, BI and visualization.
In the same context, (A. Farrugia and Thompson, 2016) proposes a DL management (DLM) approach that extracts metadata from the database using Social Network Analysis. (Z. Shang and Feng, 2016) proposes iFuse, a data fusion platform based on a Bayesian Graphical Model, to manage and query a DL. (I. D. Nogueira and Ea, 2018) uses a group of modeling techniques to handle schema evolution in a DL and proposes a data vault. (Sawadogo and Darmont, 2019) presents a methodological approach to build and manage a metadata system for textual documents in a DL. Also, (L. Chen and Zhuang, 2015) proposes a data model for unstructured data and the RAISE method to process it using an SQL-like query language. Hai and colleagues present an intelligent system named Constance (R. Hai and Quix, 2016). This system has been proposed as a solution to non-integrated data management systems with heterogeneous schemas, and to avoid the problem of “data swamp”. Constance was built to discover, extract and summarize the structural metadata of data sources and to annotate data and metadata with semantic information to avoid ambiguities. Finally, (M. Farid and Chu, 2016) introduces the CLAMS system, which discovers integrity constraints over raw data and metadata using the RDF model; to validate the results, this system requires human intervention.
From these previous works, we note that most researchers target a specific data type while addressing the data heterogeneity problem. In other words, before defining the two main axes of a DL, namely the data extraction and data management phases, contributors choose the targeted data type beforehand. Given the variety of today's data, which comprises structured, semi-structured and unstructured data, the proposed DL architectures and models are limited to the only type of data they explicitly target. Also missing from the existing work is the projection of these approaches onto a real case, although some of them belong to a large project based on a real case, as in (R. Hai and Quix, 2016).
To summarize, most research focuses on the management and exploration of DLs using popular techniques and tools such as machine learning, data quality and social network analysis, often with a focus on textual data. This concerns only a part of the variety of data types.
3 DATA LAKE MANAGEMENT
As stated previously, a Data Lake is a sustainable solution for companies that want to take advantage of publicly available data. However, DL solutions are hard to implement, manage and operate, especially if the targeted data sources are heterogeneous. It is therefore necessary to have an architecture that can adapt to any type of data structure or format and ensure the storage, ingestion and preparation policy.
According to the literature, a company needs a service-oriented architecture, but it is not easy to transform the entire information system into a single DL. To deal with this problem, we propose creating a dedicated interface for each data source. In addition, we believe we need to create a virtual DL with two layers (a physical layer and a logical layer) in order to conserve resources adequately.
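As an illustration of this idea, the sketch below models a dedicated per-source interface as an abstract connector in Python. It is a minimal sketch: the names SourceInterface, RestApiSource, connect and extract_metadata are our own illustrative assumptions, not part of an existing API.

    from abc import ABC, abstractmethod

    class SourceInterface(ABC):
        """Hypothetical dedicated interface wrapping one data source."""

        @abstractmethod
        def connect(self) -> None:
            """Establish the physical connection to the source."""

        @abstractmethod
        def extract_metadata(self) -> dict:
            """Return the source's metadata (type, format, location, ...)."""

    class RestApiSource(SourceInterface):
        """Example connector for an external REST API source."""

        def __init__(self, base_url: str):
            self.base_url = base_url

        def connect(self) -> None:
            # A real connector would check reachability and authentication here.
            print(f"Connecting to {self.base_url}")

        def extract_metadata(self) -> dict:
            # A real connector would inspect the resources the API exposes.
            return {"type": "api", "location": self.base_url}

Each additional kind of source (flat file, relational database, web page, etc.) would get its own subclass, so the rest of the DL never depends on the nature of a particular source.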
The question that arises now is how to link the two layers. To answer it, we propose a DL architecture covering the business perimeter; we then focus on managing the DL by grouping metadata, managing the schema, managing database access and indicating how to extract metadata from any possible source.
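To make this link concrete, the following sketch shows a logical layer that connects to each source through its physical-layer interface, then retrieves and integrates the metadata into a single store. It assumes the SourceInterface connectors sketched above; MetadataCatalog and ingest_sources are hypothetical names, and a real implementation would persist the entries in a database.

    class MetadataCatalog:
        """Hypothetical single store integrating the metadata of all sources."""

        def __init__(self):
            self.entries = []

        def register(self, source_name: str, metadata: dict) -> None:
            # Group every source's metadata in the same store.
            self.entries.append({"source": source_name, **metadata})

    def ingest_sources(sources: dict, catalog: MetadataCatalog) -> None:
        """For each source: establish the connection (physical layer),
        then retrieve and integrate its metadata (logical layer)."""
        for name, source in sources.items():
            source.connect()
            catalog.register(name, source.extract_metadata())

For example, calling ingest_sources({"crm": RestApiSource("http://example.org/api")}, MetadataCatalog()) would connect to this hypothetical API and record its metadata in the shared catalog.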
3.1 Architecture
Figure 1 below shows the architecture of our DL. It is divided into two layers:
• Physical Layer: This layer makes the physical link between the DL and the external (API, web page, etc.) and internal (databases, flat files, etc.) sources of an organization. In other words, it is responsible for establishing the connection to each source through a dedicated interface that takes the nature of each source into account.
• Logical Layer: This layer constitutes the core of our architecture and contains several functionalities. For example, as soon as the connection is established with a source, we retrieve the metadata of this source and store it in our database. After that, the integration of all metadata is performed by storing it in the same database. From this