Several approaches and techniques in the literature
manage semi-structured and structured data
(Bhroithe et al. 2020; Alloghani et al. 2019; Aftab et
al. 2020; Ouaret et al. 2019). However, these
approaches focus on only two formats (structured
and semi-structured) and do not examine
unstructured data. In addition, most of the
approaches that deal with unstructured data focus
only on textual data (Yafooz and Fahad 2018).
The goals of this paper are as follows. First, we
review the state of the art and present a
comprehensive view of data lake (DL) concepts.
Second, we introduce a new method for structuring
unstructured data, especially in the health field,
where data often come in different formats. Third,
we construct an ontology that represents our
Moroccan data lake.
The remainder of this paper is structured as follows.
In Section 2, we review the related literature. In
Section 3, we present the formalization and the data
lake architectures adopted by our approach; we then
propose a procedure for partially structuring
unstructured data sources. In Section 4, we describe
our ontology-based data lake model, which enriches
the representation of unstructured data sources. In
Section 5, we give an example case on COVID-19. In
Section 6, we present the evaluation technique and a
critical discussion of our approach. Finally, in
Section 7, we conclude the paper.
2 RELATED LITERATURE
2.1 State of the Art
The data lake is a relatively recent concept,
introduced by James Dixon as an alternative to data
marts, which store data in silos (Alserafi et al.
2016). To prevent a data lake from turning into a
data swamp, its data must be accompanied by
metadata (Sawadogo et al. 2019).
The data lake model demands that any raw data be
combined with a set of metadata. This represents a
crucial competitive differentiator for any data lake
architecture. Farrugia et al. (2016) proposed an
approach to managing data lakes based on extracting
metadata from Hive, an open-source data warehouse
system. To achieve this, they apply Social Network
Analysis techniques.
Various metadata classifications have been
introduced in the literature, and various metadata
models are used to design them. Among these models
is RDF. Its strength is its semantic richness; its
weakness is its complexity, because it cannot
sustain fast processing and analysis of
heterogeneous data.
A metadata model proposed by Oram is well-suited
for data lakes (Oram 2016). There is also the model
adopted by Zaloni (Ben Sharma 2018), considered one
of the business leaders in the data lake domain.
Zaloni uses a three-way classification of metadata:
operational, technical, and business metadata.
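
As an illustration only, the three-way classification above can be sketched as a small Python structure. The class names, the `add` method, and the sample key `format` are hypothetical and are not taken from Zaloni's model; which attribute belongs to which category is likewise an assumption made for the example.

```python
from dataclasses import dataclass, field
from enum import Enum


class MetadataKind(Enum):
    """The three metadata categories named in the text."""
    OPERATIONAL = "operational"
    TECHNICAL = "technical"
    BUSINESS = "business"


@dataclass
class MetadataRecord:
    """Hypothetical container: one key/value map per metadata category."""
    entries: dict = field(
        default_factory=lambda: {kind: {} for kind in MetadataKind}
    )

    def add(self, kind: MetadataKind, key: str, value: str) -> None:
        # Store the attribute under the chosen category.
        self.entries[kind][key] = value


record = MetadataRecord()
# Assumption for illustration: a file format is treated as technical metadata.
record.add(MetadataKind.TECHNICAL, "format", "csv")
```

Keeping one map per category makes it possible to query, for instance, only the technical metadata of a source without scanning the other two categories.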
2.2 Data Lake Definition
A data lake is described as a large-scale system or
repository that stores heterogeneous raw data; the
diversity of concepts poses a significant issue.
There is strong agreement in the literature on the
definition of data lakes: existing definitions share
the same vision, namely that a data lake is a central
repository of raw data stored in its native format.
For example, Hai et al. (2016) define data lakes as “a
megadata repository that stores data in its native
format and provides on-demand ingestion
functionality using metadata description”.
Terrizzano et al. (2015) use a definition of a data
lake provided by Madera and Laurent (2016) and assert
that “a data lake is a central repository containing
enormous amounts of raw data described by
metadata”. Thus, we observe strong agreement on the
definition of data lakes. In big data analytics, user
needs are not established at the initial design
stage. The data lake is a solution that emerged with
big data: it ingests raw data from different sources
and stores the source data in its native format. It
enables data to be processed according to diverse
specifications, provides access to ready-to-use data
for various needs, and supervises data to ensure data
governance.
3 FORMALIZATION
In this section, we describe the network model that
we use throughout this paper to manage the data lake.
The model represents a data lake as a set of data
sources:
$$DL = \{DS_1, DS_2, \dots, DS_n \mid DS = \text{Data Source}\} \qquad (1)$$
It is important to note that each data source $DS_k$
is associated with a set of metadata denoted by $M_k$.
We denote by $M_{DL}$ the set of metadata repositories
for the heterogeneous data sources stored in the lake.
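
The formalization above can be mirrored in a minimal Python sketch, given here only to make the notation concrete. It assumes that each metadata set M_k is a plain key/value map and that M_DL is simply the collection of these maps indexed by source name; the class names, the file names, and the `format` key are hypothetical and are not part of our model.

```python
from dataclasses import dataclass, field


@dataclass
class DataSource:
    """One data source DS_k together with its metadata set M_k."""
    name: str
    metadata: dict = field(default_factory=dict)  # M_k (assumed key/value map)


@dataclass
class DataLake:
    """The lake DL, modeled as a collection of data sources DS_1 ... DS_n."""
    sources: list = field(default_factory=list)

    def metadata_repository(self) -> dict:
        """M_DL: the metadata sets of all sources, keyed by source name."""
        return {ds.name: ds.metadata for ds in self.sources}


# Hypothetical sources, chosen only to show heterogeneous formats.
lake = DataLake(sources=[
    DataSource("patients.csv", {"format": "csv"}),
    DataSource("scan_001.png", {"format": "png"}),
])
m_dl = lake.metadata_repository()
```

In this sketch, a source with no known metadata still appears in `m_dl` with an empty map, which makes undocumented sources easy to detect.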