Metadata Management for Textual Documents in Data Lakes
Pegdwendé Sawadogo, Tokio Kibata, Jérôme Darmont
2019
Abstract
Data lakes have emerged as an alternative to data warehouses for the storage, exploration and analysis of big data. In a data lake, data are stored in a raw state and bear no explicit schema. Hence, an efficient metadata system is essential to prevent the data lake from turning into a so-called data swamp. Existing work on managing data lake metadata mostly focuses on structured and semi-structured data, with little research on unstructured data. Thus, we propose in this paper a methodological approach to build and manage a metadata system that is specific to textual documents in data lakes. First, we make an inventory of usual and meaningful metadata to extract. Then, we apply some specific techniques from the text mining and information retrieval domains to extract, store and reuse these metadata within the COREL research project, in order to validate our proposals.
Paper Citation
in Harvard Style
Sawadogo P., Kibata T. and Darmont J. (2019). Metadata Management for Textual Documents in Data Lakes. In Proceedings of the 21st International Conference on Enterprise Information Systems - Volume 1: ICEIS, ISBN 978-989-758-372-8, pages 72-83. DOI: 10.5220/0007706300720083
in Bibtex Style
@conference{iceis19,
author={Pegdwendé Sawadogo and Tokio Kibata and Jérôme Darmont},
title={Metadata Management for Textual Documents in Data Lakes},
booktitle={Proceedings of the 21st International Conference on Enterprise Information Systems - Volume 1: ICEIS},
year={2019},
pages={72-83},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0007706300720083},
isbn={978-989-758-372-8},
}
in EndNote Style
TY - CONF
JO - Proceedings of the 21st International Conference on Enterprise Information Systems - Volume 1: ICEIS
TI - Metadata Management for Textual Documents in Data Lakes
SN - 978-989-758-372-8
AU - Sawadogo P.
AU - Kibata T.
AU - Darmont J.
PY - 2019
SP - 72
EP - 83
DO - 10.5220/0007706300720083