Authors:
Dominik Kerzel 1; Sheeba Samuel 1,2 and Birgitta König-Ries 1,2
Affiliations:
1 Heinz Nixdorf Chair for Distributed Information Systems, Friedrich Schiller University Jena, Germany
2 Michael Stifel Center Jena, Friedrich Schiller University Jena, Germany
Keyword(s):
Machine Learning, Information Extraction, Provenance, Jupyter Notebook, Reproducibility.
Abstract:
Machine learning (ML) pipelines are constructed to automate every step of ML tasks, transforming raw data into engineered features, which are then used for training models. Even though ML pipelines provide benefits in terms of flexibility, extensibility, and scalability, they pose many challenges with respect to reproducibility and data dependencies. It is therefore crucial to track and manage the metadata and provenance of ML pipelines, including code, models, and data. This provenance information can be used by data scientists in developing and deploying ML models; it improves the understanding of complex ML pipelines and facilitates analyzing, debugging, and reproducing ML experiments. In this paper, we discuss ML use cases, challenges, and the design goals of an ML provenance management tool that automatically exposes this metadata. We introduce MLProvLab, a JupyterLab extension that automatically identifies the relationships between data and models in ML scripts. The tool is designed to help data scientists and ML practitioners track, capture, compare, and visualize the provenance of machine learning notebooks.
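To make the idea of identifying relationships between data and models concrete, the sketch below illustrates the kind of lightweight static analysis a notebook provenance tracker might perform: each cell is parsed with Python's ast module, and a cell is linked to the earlier cells that define the variables it reads. This is a minimal, assumed illustration; the cell contents and names (load_csv, engineer, train) are hypothetical and do not reflect MLProvLab's actual implementation or API.

import ast

def cell_dependencies(source: str):
    """Return (defined, used) variable names for one notebook cell.

    A minimal sketch of the static analysis a provenance tracker
    might perform; it ignores attribute access, imports, and scoping
    subtleties that a real tool would have to handle.
    """
    tree = ast.parse(source)
    defined, used = set(), set()
    for node in ast.walk(tree):
        if isinstance(node, ast.Name):
            if isinstance(node.ctx, ast.Store):
                defined.add(node.id)
            elif isinstance(node.ctx, ast.Load):
                used.add(node.id)
    return defined, used

# Hypothetical cells: link each cell to the earlier cells that
# define the variables it uses.
cells = [
    "data = load_csv('train.csv')",
    "features = engineer(data)",
    "model = train(features)",
]
definitions = {}  # variable name -> index of the cell that defined it
for i, src in enumerate(cells):
    defined, used = cell_dependencies(src)
    for name in used & definitions.keys():
        print(f"cell {i} depends on cell {definitions[name]} via '{name}'")
    for name in defined:
        definitions[name] = i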