Provenance-based Custom Code Generation of Machine Learning Experiments in Jupyter Notebooks

fusion-jena/MLProvCodeGen

MLProvCodeGen - Machine Learning Provenance Code Generator

GitHub Actions Status | Binder

Install

pip install MLProvCodeGen

Our goal in this research was to find out whether provenance data can be used to support the end-to-end reproducibility of machine learning experiments.

In short, provenance data is data that describes a specific data point: how, when, and by whom it was created, and by which processes (functions, methods) it was generated.

provenance data example
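To make this concrete, here is a minimal sketch of what a provenance record for a training run might look like. The field names below are purely illustrative and do not reflect MLProvCodeGen's actual provenance schema:

```python
import json

# Hypothetical provenance record for a single training run.
# Field names are illustrative only, loosely following PROV-style
# entity/activity/agent terminology -- not MLProvCodeGen's real schema.
provenance_record = {
    "entity": "trained_model",            # the data point being described
    "wasGeneratedBy": "training_run_1",   # which process produced it
    "activity": {
        "name": "training_run_1",
        "startedAt": "2021-06-01T12:00:00Z",   # when it was generated
        "parameters": {"epochs": 10, "learning_rate": 0.001},  # how
    },
    "agent": "jane_doe",                  # by whom
}

print(json.dumps(provenance_record, indent=2))
```

Capturing records like this for every step of an experiment is what makes a later one-to-one reproduction possible.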

The functionality of MLProvCodeGen can be split into two parts:

MLProvCodeGen's original purpose was to automatically generate code for training machine learning (ML) models, providing users multiple different options for machine learning tasks, datasets, model parameters, training parameters and evaluation metrics.  We then extended MLProvCodeGen to generate code according to real-world provenance data models, to automatically capture provenance data from the generated experiments, and to take provenance data files that were captured with MLProvCodeGen as input to generate one-to-one reproductions of the original experiments. MLProvCodeGen can also generate relational graphs of the captured provenance data, allowing for visual representation of the implemented experiments.

The specific use-cases for this project are twofold: 

  1. Image Classification
  • We can generate code to train an ML model on image input files to classify handwritten digits (MNIST), clothing articles (FashionMNIST), and a mix of vehicles and animals (CIFAR10).

MNIST example

  2. Multiclass Classification
  • We can generate code to train an ML model on tabular data (.csv) to classify different species of iris flowers

iris example

and also to test different models on 'toy datasets': synthetic datasets specifically designed to mimic patterns that can occur in real-world data, such as spirals.

Spiral example
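For illustration, a spiral toy dataset of this kind can be generated in a few lines. This is a generic sketch of the idea, not the code MLProvCodeGen itself emits; the function name and parameters are hypothetical:

```python
import math
import random

def make_spirals(n_per_class=100, noise=0.1, seed=0):
    """Generate two interleaved 2-D spirals, one per class label.

    Illustrative toy-dataset generator; parameter names are hypothetical.
    """
    rng = random.Random(seed)
    points, labels = [], []
    for label in (0, 1):
        for i in range(n_per_class):
            r = 5.0 * i / n_per_class                 # radius grows along the spiral
            t = 1.75 * r / 5.0 * 2 * math.pi + label * math.pi  # angle, offset per class
            x = r * math.sin(t) + rng.gauss(0, noise)  # add Gaussian noise
            y = r * math.cos(t) + rng.gauss(0, noise)
            points.append((x, y))
            labels.append(label)
    return points, labels

X, y = make_spirals()
print(len(X), sorted(set(y)))
```

Because the labels follow a known spiral pattern, such datasets make it easy to check visually whether a model has learned a non-linear decision boundary.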

How to use MLProvCodeGen

Please open MLProvCodeGen by using the Binder button at the top of this page. This opens a temporary online installation, so nothing needs to be installed locally.

The JupyterLab interface should look like this:

jupyterlab startup

Please proceed by pressing the 'MLProvCodeGen' button located in the 'other' section to open the extension.

MLProvCodeGen startup

Here is an example interface:

MLProvCodeGen_MCC_inputs

And generated notebooks look like this:

execute notebook button red

Troubleshoot

If you are seeing the frontend extension, but it is not working, check that the server extension is enabled:

jupyter server extension list

If the server extension is installed and enabled, but you are not seeing the frontend extension, check the frontend extension is installed:

jupyter labextension list

Contributing

Development install

Note: You will need NodeJS to build the extension package.

The jlpm command is JupyterLab's pinned version of yarn that is installed with JupyterLab. You may use yarn or npm in lieu of jlpm below.

# Clone the repo to your local environment
# Change directory to the MLProvCodeGen directory
# Install package in development mode
pip install -e .
# Link your development version of the extension with JupyterLab
jupyter labextension develop . --overwrite
# Rebuild extension Typescript source after making changes
jlpm run build

You can watch the source directory and run JupyterLab at the same time in different terminals to watch for changes in the extension's source and automatically rebuild the extension.

# Watch the source directory in one terminal, automatically rebuilding when needed
jlpm run watch
# Run JupyterLab in another terminal
jupyter lab

With the watch command running, every saved change will immediately be built locally and available in your running JupyterLab. Refresh JupyterLab to load the change in your browser (you may need to wait several seconds for the extension to be rebuilt).

By default, the jlpm run build command generates the source maps for this extension to make it easier to debug using the browser dev tools. To also generate source maps for the JupyterLab core extensions, you can run the following command:

jupyter lab build --minimize=False

Adding new ML experiments

The following steps must be taken to add a new ML experiment to this extension:

  1. Have an existing Python script for your machine learning experiment.
  2. Paste the code into a Jupyter notebook and split it into cells following the execution order of your experiment.
  3. Create a Jinja template for each cell and wrap if-statements around the Python code depending on which variables are important. Refer to existing modules for what the provenance data of your experiment might look like.
  4. Load the templates in a Python procedure that also creates a new notebook element and write their rendered outputs to the notebook.
  5. Extract every local variable the procedure needs from a single dictionary input.
  6. Add HTML input elements to the user interface based on your provenance variables.
  7. Combine the variable values into a JavaScript/TypeScript dictionary.
  8. Create a new server request for your module and pass the dictionary through it as “stringified” JSON data.
  9. Once the frontend, backend, and server connection work, your module has been added successfully.

Note that while these steps might seem complicated, most of them only require copy-pasting already existing code. The only new part for most users is templating with Jinja. However, Jinja has good documentation, and its syntax is simple, requiring little more than if-statements and for-loops.
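Steps 3 and 4 above can be sketched as follows. This is a simplified illustration, assuming jinja2 is installed; the template text, the variable names (dataset, test_split), and the helper function are hypothetical, and the notebook JSON is built by hand rather than with the extension's actual generation code:

```python
from jinja2 import Template  # Jinja is the templating engine the steps refer to

# Step 3: a cell template with an if-statement around code that depends
# on a provenance variable. Variable names here are illustrative only.
cell_template = Template(
    "from sklearn.model_selection import train_test_split\n"
    "{% if dataset == 'iris' %}"
    "from sklearn.datasets import load_iris\n"
    "X, y = load_iris(return_X_y=True)\n"
    "{% endif %}"
    "X_train, X_test, y_train, y_test = "
    "train_test_split(X, y, test_size={{ test_split }})"
)

def render_notebook(params):
    """Step 4 (sketched): render the template with the provenance values
    and write the result into a minimal nbformat-4 notebook structure."""
    source = cell_template.render(**params)
    return {
        "nbformat": 4,
        "nbformat_minor": 5,
        "metadata": {},
        "cells": [
            {"cell_type": "code", "source": source, "metadata": {},
             "execution_count": None, "outputs": []},
        ],
    }

nb = render_notebook({"dataset": "iris", "test_split": 0.2})
print(nb["cells"][0]["source"])
```

In the extension itself, a dictionary like the `params` argument would arrive from the frontend as "stringified" JSON (steps 7 and 8) and the rendered notebook would be saved for the user to execute.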

Uninstall

pip uninstall MLProvCodeGen
