BaseX GmbH (2018). BaseX – The XML Framework.
http://basex.org/.
Beheshti, A., Benatallah, B., Nouri, R., Chhieng, V. M.,
Xiong, H., and Zhao, X. (2017). CoreDB: a Data Lake
Service. In 2017 ACM on Conference on Information
and Knowledge Management (CIKM 2017), Singapore,
Singapore, ACM, pages 2451–2454.
Dixon, J. (2010). Pentaho, Hadoop, and Data Lakes.
https://jamesdixon.wordpress.com/2010/10/14/pentaho-hadoop-and-data-lakes/.
Dublin Core Metadata Initiative (2018). Dublin Core.
http://dublincore.org/.
Elastic (2018). Elasticsearch. https://www.elastic.co.
Fang, H. (2015). Managing Data Lakes in Big Data Era:
What’s a data lake and why has it became popular in
data management ecosystem. In 5th Annual IEEE
International Conference on Cyber Technology in
Automation, Control and Intelligent Systems (CYBER
2015), Shenyang, China, IEEE, pages 820–824.
Farid, M., Roatis, A., Ilyas, I. F., Hoffmann, H.-F., and
Chu, X. (2016). CLAMS: Bringing Quality to Data
Lakes. In 2016 International Conference on
Management of Data (SIGMOD 2016), San Francisco, CA,
USA, ACM, pages 2089–2092.
Farrugia, A., Claxton, R., and Thompson, S. (2016).
Towards Social Network Analytics for Understanding
and Managing Enterprise Data Lakes. In Advances
in Social Networks Analysis and Mining (ASONAM
2016), San Francisco, CA, USA, IEEE, pages 1213–1220.
Fauduet, L. and Peyrard, S. (2010). A data-first preservation
strategy: Data management in SPAR. In 7th
International Conference on Preservation of Digital Objects
(iPRES 2010), Vienna, Austria, pages 1–8.
Hai, R., Geisler, S., and Quix, C. (2017). Constance: An
Intelligent Data Lake System. In 2016 International
Conference on Management of Data (SIGMOD 2016),
San Francisco, CA, USA, ACM Digital Library, pages
2097–2100.
Halevy, A., Korn, F., Noy, N. F., Olston, C., Polyzotis,
N., Roy, S., and Whang, S. E. (2016). Managing
Google’s data lake: an overview of the GOODS
system. In 2016 International Conference on
Management of Data (SIGMOD 2016), San Francisco, CA,
USA, ACM, pages 795–806.
Hultgren, H. (2016). Data Vault modeling guide:
Introductory Guide to Data Vault Modeling. Genesee
Academy, USA.
Ibrahimov, O., Sethi, I., and Dimitrova, N. (2002). The
Performance Analysis of a Chi-square Similarity
Measure for Topic Related Clustering of Noisy
Transcripts. In 16th International Conference on Pattern
Recognition, Quebec City, Quebec, Canada, pages
285–288.
Inmon, B. (2016). Data Lake Architecture: Designing the
Data Lake and avoiding the garbage dump. Technics
Publications.
Kilgarriff, A. (2001). Comparing Corpora. International
Journal of Corpus Linguistics, 6(1):97–133.
Klettke, M., Awolin, H., Störl, U., Müller, D., and
Scherzinger, S. (2017). Uncovering the Evolution
History of Data Lakes. In 2017 IEEE International
Conference on Big Data (BIGDATA 2017), Boston, MA,
USA, pages 2462–2471.
Laskowski, N. (2016). Data lake governance: A big data do
or die.
https://searchcio.techtarget.com/feature/Data-lake-governance-A-big-data-do-or-die.
Linstedt, D. (2011). Super Charge your Data Warehouse:
Invaluable Data Modeling Rules to Implement Your
Data Vault. CreateSpace Independent Publishing.
Maccioni, A. and Torlone, R. (2017). Crossing the finish
line faster when paddling the data lake with KAYAK.
Proceedings of the VLDB Endowment, 10(12):1853–1856.
Madera, C. and Laurent, A. (2016). The next information
architecture evolution: the data lake wave. In 8th
International Conference on Management of Digital
EcoSystems (MEDES 2016), Biarritz, France, pages
174–180.
Miloslavskaya, N. and Tolstoy, A. (2016). Big Data, Fast
Data and Data Lake Concepts. In 7th Annual
International Conference on Biologically Inspired Cognitive
Architectures (BICA 2016), NY, USA, volume 88 of
Procedia Computer Science, pages 1–6.
Neo4J Inc. (2018). The Neo4j Graph Platform.
https://neo4j.com.
Nogueira, I., Romdhane, M., and Darmont, J. (2018).
Modeling Data Lake Metadata with a Data Vault. In 22nd
International Database Engineering and Applications
Symposium (IDEAS 2018), Villa San Giovanni, Italy,
pages 253–261, New York. ACM.
Pons, P. and Latapy, M. (2006). Computing Communities
in Large Networks Using Random Walks. Journal of
Graph Algorithms and Applications, 10(2):191–218.
Quix, C., Hai, R., and Vatov, I. (2016). Metadata
Extraction and Management in Data Lakes With GEMMS.
Complex Systems Informatics and Modeling
Quarterly, (9):289–293.
Stein, B. and Morrison, A. (2014). The enterprise data lake:
Better integration and deeper analytics. PWC
Technology Forecast, (1):1–9.
Suriarachchi, I. and Plale, B. (2016). Crossing Analytics
Systems: A Case for Integrated Provenance in Data
Lakes. In 12th IEEE International Conference on
e-Science (e-Science 2016), Baltimore, MD, USA,
October 23-27, 2016, pages 349–354.
Terrizzano, I., Schwarz, P., Roth, M., and Colino, J. E.
(2015). Data Wrangling: The Challenging Journey
from the Wild to the Lake. In 7th Biennial Conference
on Innovative Data Systems Research (CIDR 2015),
Asilomar, CA, USA, pages 1–9.
The Apache Software Foundation (2018). Apache Tika – a
content analysis toolkit. https://tika.apache.org/.
The Library of Congress (2017).
METS: An Overview and Tutorial.
http://www.loc.gov/standards/mets/METSOverview.v2.html.