Provenance-based Custom Code Generation of Machine Learning Experiments in Jupyter Notebooks

fusion-jena/MLProvCodeGen

MLProvCodeGen - Machine Learning Provenance Code Generator

GitHub Actions Status | Binder

Install

pip install MLProvCodeGen

Our goal in this research was to find out whether provenance data can be used to support the end-to-end reproducibility of machine learning experiments.

In short, provenance data is data that describes a specific data point: how, when, and by whom it was created, and by which processes (functions, methods) it was generated.

provenance data example
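To make this concrete, here is a minimal sketch of what a provenance record for a training run might look like. The field names below are purely illustrative and do not reflect MLProvCodeGen's actual provenance schema:

```python
import json

# Hypothetical provenance record for a single training run.
# Field names are illustrative only, loosely following PROV-style
# entity/activity/agent terminology -- not MLProvCodeGen's real schema.
provenance_record = {
    "entity": "trained_model",            # the data point being described
    "wasGeneratedBy": "training_run_1",   # which process produced it
    "activity": {
        "name": "training_run_1",
        "startedAt": "2021-06-01T12:00:00Z",   # when it was generated
        "parameters": {"epochs": 10, "learning_rate": 0.001},  # how
    },
    "agent": "jane_doe",                  # by whom
}

print(json.dumps(provenance_record, indent=2))
```

Capturing records like this for every step of an experiment is what makes a later one-to-one reproduction possible.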

The functionality of MLProvCodeGen can be split into two parts:

MLProvCodeGen's original purpose was to automatically generate code for training machine learning (ML) models, providing users multiple different options for machine learning tasks, datasets, model parameters, training parameters and evaluation metrics.  We then extended MLProvCodeGen to generate code according to real-world provenance data models, to automatically capture provenance data from the generated experiments, and to take provenance data files that were captured with MLProvCodeGen as input to generate one-to-one reproductions of the original experiments. MLProvCodeGen can also generate relational graphs of the captured provenance data, allowing for visual representation of the implemented experiments.

The specific use-cases for this project are twofold: 

  1. Image Classification
  • We can generate code to train an ML model on image input files to classify handwritten digits (MNIST), clothing articles (FashionMNIST), and a mix of vehicles and animals (CIFAR10).

MNIST example

  2. Multiclass Classification
  • We can generate code to train an ML model on tabular data (.csv) to classify different species of iris flowers

iris example

and also to test different models on 'toy datasets': synthetic datasets specifically designed to mimic patterns that can occur in real-world data, such as spirals.

Spiral example
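For illustration, a spiral toy dataset of this kind can be generated in a few lines. This is a generic sketch of the idea, not the code MLProvCodeGen itself emits; the function name and parameters are hypothetical:

```python
import math
import random

def make_spirals(n_per_class=100, noise=0.1, seed=0):
    """Generate two interleaved 2-D spirals, one per class label.

    Illustrative toy-dataset generator; parameter names are hypothetical.
    """
    rng = random.Random(seed)
    points, labels = [], []
    for label in (0, 1):
        for i in range(n_per_class):
            r = 5.0 * i / n_per_class                 # radius grows along the spiral
            t = 1.75 * r / 5.0 * 2 * math.pi + label * math.pi  # angle, offset per class
            x = r * math.sin(t) + rng.gauss(0, noise)  # add Gaussian noise
            y = r * math.cos(t) + rng.gauss(0, noise)
            points.append((x, y))
            labels.append(label)
    return points, labels

X, y = make_spirals()
print(len(X), sorted(set(y)))
```

Because the labels follow a known spiral pattern, such datasets make it easy to check visually whether a model has learned a non-linear decision boundary.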

How to use MLProvCodeGen

Please open MLProvCodeGen by using the Binder button at the top of this page. This opens a temporary online installation, so nothing needs to be installed locally.

The JupyterLab interface should look like this:

jupyterlab startup

Please proceed by pressing the 'MLProvCodeGen' button located in the 'other' section to open the extension.

MLProvCodeGen startup

Here is an example interface:

MLProvCodeGen_MCC_inputs

And generated notebooks look like this:

execute notebook button red

Troubleshoot

If you are seeing the frontend extension, but it is not working, check that the server extension is enabled:

jupyter server extension list

If the server extension is installed and enabled, but you are not seeing the frontend extension, check the frontend extension is installed:

jupyter labextension list

Contributing

Development install

Note: You will need NodeJS to build the extension package.

The jlpm command is JupyterLab's pinned version of yarn that is installed with JupyterLab. You may use yarn or npm in lieu of jlpm below.

# Clone the repo to your local environment
# Change directory to the MLProvCodeGen directory
# Install package in development mode
pip install -e .
# Link your development version of the extension with JupyterLab
jupyter labextension develop . --overwrite
# Rebuild extension Typescript source after making changes
jlpm run build

You can watch the source directory and run JupyterLab at the same time in different terminals to watch for changes in the extension's source and automatically rebuild the extension.

# Watch the source directory in one terminal, automatically rebuilding when needed
jlpm run watch
# Run JupyterLab in another terminal
jupyter lab

With the watch command running, every saved change will immediately be built locally and available in your running JupyterLab. Refresh JupyterLab to load the change in your browser (you may need to wait several seconds for the extension to be rebuilt).

By default, the jlpm run build command generates the source maps for this extension to make it easier to debug using the browser dev tools. To also generate source maps for the JupyterLab core extensions, you can run the following command:

jupyter lab build --minimize=False

Adding new ML experiments

The following steps must be taken to add a new ML experiment to this extension:

  1. Have an existing Python script for your machine learning experiment.
  2. Paste the code into a Jupyter notebook and split it into cells following the execution order of your experiment.
  3. Create a Jinja template for each cell and wrap if-statements around the Python code depending on which variables are important. Refer to existing modules for what the provenance data of your experiment might look like.
  4. Load the templates in a Python procedure that also creates a new notebook element and write their rendered outputs to the notebook.
  5. Extract every local variable the procedure needs from a single dictionary input.
  6. Add HTML input elements to the user interface based on your provenance variables.
  7. Combine the variable values into a JavaScript/TypeScript dictionary.
  8. Create a new server request for your module and pass the dictionary through it as “stringified” JSON data.
  9. Once the frontend, backend, and server connection work, your module has been added successfully.

Note that while these steps might seem complicated, most of them only require copy-pasting already existing code. The only new part for most users is templating with Jinja. However, Jinja has good documentation, and its syntax is simple, requiring little more than if-statements and for-loops.
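Steps 3 and 4 above can be sketched as follows. This is a simplified illustration, assuming jinja2 is installed; the template text, the variable names (dataset, test_split), and the helper function are hypothetical, and the notebook JSON is built by hand rather than with the extension's actual generation code:

```python
from jinja2 import Template  # Jinja is the templating engine the steps refer to

# Step 3: a cell template with an if-statement around code that depends
# on a provenance variable. Variable names here are illustrative only.
cell_template = Template(
    "from sklearn.model_selection import train_test_split\n"
    "{% if dataset == 'iris' %}"
    "from sklearn.datasets import load_iris\n"
    "X, y = load_iris(return_X_y=True)\n"
    "{% endif %}"
    "X_train, X_test, y_train, y_test = "
    "train_test_split(X, y, test_size={{ test_split }})"
)

def render_notebook(params):
    """Step 4 (sketched): render the template with the provenance values
    and write the result into a minimal nbformat-4 notebook structure."""
    source = cell_template.render(**params)
    return {
        "nbformat": 4,
        "nbformat_minor": 5,
        "metadata": {},
        "cells": [
            {"cell_type": "code", "source": source, "metadata": {},
             "execution_count": None, "outputs": []},
        ],
    }

nb = render_notebook({"dataset": "iris", "test_split": 0.2})
print(nb["cells"][0]["source"])
```

In the extension itself, a dictionary like the `params` argument would arrive from the frontend as "stringified" JSON (steps 7 and 8) and the rendered notebook would be saved for the user to execute.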

Uninstall

pip uninstall MLProvCodeGen
