mlvtools pipeline features

Goal: show how mlvtools, DVC and MLflow tracking work on a trivial case.

The pipeline converts an unreadable input into readable text. It is made up of steps that each demonstrate a specific feature:

(basic) only input and output parameters (#1, #4)
with extra parameters (#2)
with a specific DVC command (#3)

using MLflow (#5)

                                                    * dummy_pipeline_feed.txt                       
                                                    *
                                  +-------------------------------------+                                      
                                  | #1 mlvtools_step1_sanitize_data.dvc |                                      
                                  +-------------------------------------+                                      
                                                    *                                                       
                                                    *  sanitized_data.txt                                                     
                                                    *                                                       
                                    +----------------------------------+                                       
                                    | #2 mlvtools_step2_split_data.dvc |                                       
                                    +----------------------------------+                                       
                                *******                            *******                                  
         binary_data.txt   *****                                          *****     octal_data.txt                          
                       ****                                                    ****                         
   +----------------------------------------+                                +---------------------------------------+  
   | #3 mlvtools_step3_convert_binaries.dvc |                                | # 4 mlvtools_step4_convert_octals.dvc |  
   +----------------------------------------+                                +---------------------------------------+  
                                    *******                            *******                                  
            data_conv_from_bin.txt         *****                  *****       data_conv_from_octal.txt
                                                ****          ****                                              
                                        +---------------------------------+                                        
                                        | #5 mlvtools_step5_sort_data.dvc |                                        
                                        +---------------------------------+  
                                                         *
                                                         *  result.txt

The following features will be demonstrated:

reproducing a pipeline using the cache
reproducing a pipeline without the cache
modifying the input and running the whole pipeline
modifying the input and running a sub-pipeline
modifying the code
modifying a hyperparameter

Setup

Create a dummy branch
```
 git checkout -b dummy-pipeline
```

Create a dummy directory structure to store data, Jupyter notebooks, generated scripts and commands.

 mkdir -p ./dummy/pipeline/notebooks
 mkdir -p ./dummy/pipeline/steps
 mkdir -p ./dummy/dvc
 mkdir -p ./dummy/data

Create pipeline input data
```
 make dummy-input
```
Create dummy conf
```
 make dummy-conf
```

With mlvtools, repeating the output path parameters for every ipynb_to_python and gen_dvc command quickly becomes tedious. Instead, you can provide a configuration that declares the project structure and lets mlvtools generate the output paths. (For more information, see the documentation.)
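
The naming convention used throughout this tutorial (a notebook `stepN_xxx.ipynb` yields `mlvtools_stepN_xxx.py` and `mlvtools_stepN_xxx_dvc`) can be sketched as follows. This is only an illustration of how output paths can be derived from a declared project structure; `derive_output_paths` and its parameters are hypothetical names, not part of the mlvtools API.

```python
from pathlib import Path


def derive_output_paths(notebook, script_dir, dvc_dir):
    """Return (python_script_path, dvc_command_path) for a notebook.

    Hypothetical helper: it only mirrors the file names seen in this tutorial.
    """
    stem = Path(notebook).stem  # e.g. "step1_sanitize_data"
    script = Path(script_dir) / f"mlvtools_{stem}.py"
    dvc_cmd = Path(dvc_dir) / f"mlvtools_{stem}_dvc"
    return script.as_posix(), dvc_cmd.as_posix()


script, dvc_cmd = derive_output_paths(
    "./dummy/pipeline/notebooks/step1_sanitize_data.ipynb",
    "./dummy/pipeline/steps",
    "./dummy/dvc",
)
print(script)   # dummy/pipeline/steps/mlvtools_step1_sanitize_data.py
print(dvc_cmd)  # dummy/dvc/mlvtools_step1_sanitize_data_dvc
```

Declaring the two target directories once is what removes the need to pass output paths to each command.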

Initialize DVC
```
 dvc init
```

DVC works on top of git repositories. Run DVC initialization in a git repository directory to create DVC meta files.

Version setup under git

 git add ./dummy *.dvc
 git commit -m 'Dummy pipeline setup'

Step 1: Sanitize Data

This step sanitizes the pipeline input data. It is a simple step with only input and output parameters. It is based on the step1_sanitize_data.ipynb notebook.


Step Input:	`./dummy/data/dummy_pipeline_feed.txt`

Step Output:	`./dummy/data/sanitized_data.txt`

Generated files:	`./dummy/pipeline/steps/mlvtools_step1_sanitize_data.py`
	`./dummy/dvc/mlvtools_step1_sanitize_data_dvc`

Get the Pipeline Step Resources

Add the pipeline input file under DVC versioning.

 dvc add ./dummy/data/dummy_pipeline_feed.txt

Copy the step1_sanitize_data.ipynb from the resources directory to the poc project:
```
 cp ./resources/dummy/step1_sanitize_data.ipynb ./dummy/pipeline/notebooks/
```
Open the notebook, read the information about its parameters and try it out.

Generate Python Script and DVC Pipeline Step

Convert the Jupyter Notebook into a parameterized Python script using ipynb_to_python.
```
ipynb_to_python -w . -n ./dummy/pipeline/notebooks/step1_sanitize_data.ipynb
```
Check that ./dummy/pipeline/steps/mlvtools_step1_sanitize_data.py has been created.
Create a DVC command to run the Python script using DVC.
```
gen_dvc -w . -i ./dummy/pipeline/steps/mlvtools_step1_sanitize_data.py
```
Ensure ./dummy/dvc/mlvtools_step1_sanitize_data_dvc is created.

Run the Pipeline Step

Run the DVC command.

./dummy/dvc/mlvtools_step1_sanitize_data_dvc

Check that the output file has been created.
```
See ./dummy/data/sanitized_data.txt
```
Check that the DVC meta file has been created.
```
 See ./mlvtools_step1_sanitize_data.dvc
```
This file represents the pipeline step. DVC uses it to reproduce the pipeline.

Track Files Under Git Versioning

DVC meta files, notebook and generated scripts should be tracked under git.

git add *.dvc ./dummy/
git commit -m 'Dummy pipeline step1'

For the next steps we apply the same process, but the commands are grouped into a single code section for convenience.

Step 2: Split Data

This step splits data into binary and octal values. It is a simple step that takes one extra parameter, which is neither an input nor an output: the number of bits in a binary value.

This extra parameter is provided to the pipeline step using :dvc-extra in the notebook docstring.

It is based on the step2_split_data.ipynb notebook.
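
As a rough sketch of what a parameterized script with such an extra parameter could look like, the example below parses `--size-bin-data` with `argparse` and uses it to separate values. Only the `--size-bin-data` flag comes from this tutorial; the other flag names, the `is_binary` rule and the sample values are assumptions made for illustration, and the script actually generated by mlvtools may differ.

```python
import argparse


def build_parser():
    parser = argparse.ArgumentParser(description="Split data into binary and octal values")
    # Hypothetical names for the input and output parameters.
    parser.add_argument("--sanitized-data", required=True)
    parser.add_argument("--binary-file", required=True)
    parser.add_argument("--octal-file", required=True)
    # The extra parameter: number of bits in a binary value.
    parser.add_argument("--size-bin-data", type=int, required=True)
    return parser


def is_binary(value, size_bin_data):
    """Assumed rule: a binary value has exactly size_bin_data digits, all 0 or 1."""
    return len(value) == size_bin_data and set(value) <= {"0", "1"}


args = build_parser().parse_args([
    "--sanitized-data", "./dummy/data/sanitized_data.txt",
    "--binary-file", "./dummy/data/binary_data.txt",
    "--octal-file", "./dummy/data/octal_data.txt",
    "--size-bin-data", "8",
])

values = ["01001000", "151"]  # made-up sample values
binary = [v for v in values if is_binary(v, args.size_bin_data)]
octal = [v for v in values if not is_binary(v, args.size_bin_data)]
print(binary, octal)  # ['01001000'] ['151']
```

Under this assumed rule, changing the parameter value changes which values are treated as binary, which is why it affects every downstream step.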


Step Input:	`./dummy/data/sanitized_data.txt`

Step Output:	`./dummy/data/octal_data.txt`
	`./dummy/data/binary_data.txt`

Generated files:	`./dummy/pipeline/steps/mlvtools_step2_split_data.py`
	`./dummy/dvc/mlvtools_step2_split_data_dvc`

# Copy Jupyter Notebook
cp ./resources/dummy/step2_split_data.ipynb ./dummy/pipeline/notebooks/

# Generate Python script and DVC pipeline step
ipynb_to_python -w . -n ./dummy/pipeline/notebooks/step2_split_data.ipynb
gen_dvc -w . -i ./dummy/pipeline/steps/mlvtools_step2_split_data.py

# Run the pipeline step
./dummy/dvc/mlvtools_step2_split_data_dvc

# Version files
git add *.dvc ./dummy/
git commit -m 'Dummy pipeline step2'

Step 3: Convert Binary Values

This step converts binary values into the corresponding ASCII characters. It uses dvc-cmd in the docstring to write out the whole DVC command, even though that is not necessary in this case. The feature is intended for complex cases.

It is based on the step3_convert_binaries.ipynb notebook.
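
The conversion itself can be illustrated in a few lines. This is a minimal sketch, not the notebook's code: the `position=bits` line format is an assumption (loosely inferred from the `position=character` entries handled later in this tutorial), and the function names are hypothetical.

```python
def binary_to_char(bits):
    """Interpret a string of 0/1 digits as a base-2 character code."""
    return chr(int(bits, 2))


def convert_line(line):
    """Convert an assumed 'position=bits' line into 'position=character'."""
    position, bits = line.split("=", 1)
    return f"{position}={binary_to_char(bits)}"


print(convert_line("0=01001000"))  # prints "0=H" (binary 01001000 is 72)
```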


Step Input:	`./dummy/data/binary_data.txt`

Step Output:	`./dummy/data/data_conv_from_bin.txt`

Generated files:	`./dummy/pipeline/steps/mlvtools_step3_convert_binaries.py`
	`./dummy/dvc/mlvtools_step3_convert_binaries_dvc`

# Copy Jupyter Notebook
cp ./resources/dummy/step3_convert_binaries.ipynb ./dummy/pipeline/notebooks/

# Generate Python script and DVC pipeline step
ipynb_to_python -w . -n ./dummy/pipeline/notebooks/step3_convert_binaries.ipynb
gen_dvc -w . -i ./dummy/pipeline/steps/mlvtools_step3_convert_binaries.py

# Run the pipeline step
./dummy/dvc/mlvtools_step3_convert_binaries_dvc

# Version files
git add *.dvc ./dummy/
git commit -m 'Dummy pipeline step3'

Step 4: Convert Octal Values

This step converts octal values into the corresponding ASCII characters. It is analogous to step 3, but the input data are octal values rather than binary values. It is written as simply as possible, using only the basic dvc-in and dvc-out in the docstring. It is based on the step4_convert_octals.ipynb notebook.
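
The analogy with step 3 comes down to the numeric base used when decoding. The sketch below shows this with a single hypothetical `decode` helper; it is not the notebook's code and the sample values are made up.

```python
def decode(value, base):
    """Decode a digit string in the given base into one character."""
    return chr(int(value, base))


print(decode("110", 8))       # prints "H" (octal 110 is 72)
print(decode("01001000", 2))  # prints "H" (binary 01001000 is 72)
```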


Step Input:	`./dummy/data/octal_data.txt`

Step Output:	`./dummy/data/data_conv_from_octal.txt`

Generated files:	`./dummy/pipeline/steps/mlvtools_step4_convert_octals.py`
	`./dummy/dvc/mlvtools_step4_convert_octals_dvc`

# Copy Jupyter Notebook
cp ./resources/dummy/step4_convert_octals.ipynb ./dummy/pipeline/notebooks/

# Generate Python script and DVC pipeline step
ipynb_to_python -w . -n ./dummy/pipeline/notebooks/step4_convert_octals.ipynb
gen_dvc -w . -i ./dummy/pipeline/steps/mlvtools_step4_convert_octals.py

# Run the pipeline step
./dummy/dvc/mlvtools_step4_convert_octals_dvc

# Version files
git add *.dvc ./dummy/
git commit -m 'Dummy pipeline step4'

Step 5: Merge and Sort Data

This step merges the ASCII characters produced by steps 3 and 4, then sorts them. It also shows how to use MLflow through an "artificial" example.

It is based on the step5_sort_data.ipynb notebook.
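
Merging and sorting can be sketched as below, assuming each converted entry has the form `position=character` (an assumption based on the entries built in the "Modify Pipeline Code" section). The `merge_and_sort` name and the sample data are hypothetical, and MLflow tracking is left out of this sketch.

```python
def merge_and_sort(from_bin, from_octal):
    """Merge both lists of 'position=character' entries and rebuild the text."""
    merged = list(from_bin) + list(from_octal)
    # Sort on the numeric position that precedes the first '='.
    merged.sort(key=lambda entry: int(entry.split("=", 1)[0]))
    return "".join(entry.split("=", 1)[1] for entry in merged)


text = merge_and_sort(["0=H", "2=!"], ["1=i"])
print(text)  # prints "Hi!"
```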


Step Input:	`./dummy/data/data_conv_from_octal.txt`
	`./dummy/data/data_conv_from_bin.txt`

Step Output:	`./dummy/data/result.txt`

Generated files:	`./dummy/pipeline/steps/mlvtools_step5_sort_data.py`
	`./dummy/dvc/mlvtools_step5_sort_data_dvc`

# Copy Jupyter Notebook
cp ./resources/dummy/step5_sort_data.ipynb ./dummy/pipeline/notebooks/

# Generate Python script and DVC pipeline step
ipynb_to_python -w . -n ./dummy/pipeline/notebooks/step5_sort_data.ipynb
gen_dvc -w . -i ./dummy/pipeline/steps/mlvtools_step5_sort_data.py

# Run the pipeline step
./dummy/dvc/mlvtools_step5_sort_data_dvc

# Version files
git add *.dvc ./dummy/
git commit -m 'Dummy pipeline step5'

This step produces MLflow tracking metrics. To visualize them, run `mlflow ui --backend-store-uri ./dummy/data/mlflow/` then go to http://localhost:5000.

Check the Result

The pipeline input is ./dummy/data/dummy_pipeline_feed.txt and the pipeline result is ./dummy/data/result.txt.

You can check the result:

cat ./dummy/data/result.txt

It is also possible to visualize the pipeline structure:

dvc pipeline show mlvtools_step5_sort_data.dvc

OR

dvc pipeline show mlvtools_step5_sort_data.dvc --ascii

Reproduce the Pipeline

Overview

  #1 --> #2 --> #3 --> #5
           \
            --> #4 --/

Use Cache

If we try to reproduce the pipeline without any change, nothing is computed again because DVC uses a cache mechanism.

dvc repro ./mlvtools_step5_sort_data.dvc  -v

    Debug: updater is not old enough to check for updates
    Debug: Dvc file 'mlvtools_step1_sanitize_data.dvc' didn't change
    Debug: Dvc file 'mlvtools_step2_split_data.dvc' didn't change
    Debug: Dvc file 'mlvtools_step3_convert_binaries.dvc' didn't change
    Debug: Dvc file 'mlvtools_step4_convert_octals.dvc' didn't change
    Debug: Dvc file 'mlvtools_step5_sort_data.dvc' didn't change
    Pipeline is up to date. Nothing to reproduce.

No step is re-run because nothing has changed.
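
The idea behind this behaviour can be illustrated with a checksum comparison: a step is considered up to date when the files it depends on still match the checksums recorded at the last run. This is a simplified sketch, not DVC's actual implementation; the function names and the choice of SHA-256 are assumptions made for illustration.

```python
import hashlib
import tempfile
from pathlib import Path


def checksum(path):
    """Return a hex digest of the file content."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def is_up_to_date(recorded):
    """recorded maps each file path to the checksum stored at the last run."""
    return all(
        Path(path).exists() and checksum(path) == digest
        for path, digest in recorded.items()
    )


with tempfile.TemporaryDirectory() as tmp:
    data = Path(tmp) / "sanitized_data.txt"
    data.write_text("first version")
    recorded = {str(data): checksum(data)}  # what a meta file would remember

    unchanged = is_up_to_date(recorded)     # nothing modified yet
    data.write_text("second version")
    changed = is_up_to_date(recorded)       # content differs from the record

print(unchanged, changed)  # True False
```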

Without Cache

It is possible to force the pipeline to be reproduced without using the cache, with the -f option.

dvc repro ./mlvtools_step5_sort_data.dvc -v -f

Modify Pipeline Input

If the pipeline input is modified, DVC detects the change and reproduces the impacted steps.

1.Reproduce the Whole Pipeline

a. Modify input.

make dummy-input2

b. Reproduce the whole pipeline.

dvc repro [Last pipeline step]
dvc repro ./mlvtools_step5_sort_data.dvc

To reproduce the whole pipeline, pass the last pipeline step to the dvc repro command. We can see that all steps are run.

c. Check files

git status -s

 M dummy/data/dummy_pipeline_feed.txt.dvc
 M mlvtools_step1_sanitize_data.dvc
 M mlvtools_step2_split_data.dvc
 M mlvtools_step3_convert_binaries.dvc
 M mlvtools_step4_convert_octals.dvc
 M mlvtools_step5_sort_data.dvc

We can see that all DVC meta files have been modified, and the result file contains new text.

cat ./dummy/data/result.txt

   It is easy to reproduce a whole pipeline !

d. Commit the new pipeline version.

git commit -m "New pipeline version"  *.dvc **/*.dvc

e. Remove changes.

All git commands that affect commits (checkout, revert, reset, ...) can be used. Just never forget to run dvc checkout afterwards.

git revert HEAD
dvc checkout

cat ./dummy/data/result.txt

    You reached the end of this dummy pipeline, now we can try to reproduce it!

2.Reproduce a Sub-Pipeline

a. Modify input.

make dummy-input3

b. Reproduce a sub-pipeline.

dvc repro [Intermediate pipeline step]
dvc repro ./mlvtools_step3_convert_binaries.dvc

It is possible to run the dvc repro command on any pipeline step. This can be useful to compute intermediate results. You can see that only steps 1, 2 and 3 are run.

c. Complete the pipeline.

dvc repro ./mlvtools_step5_sort_data.dvc

Only steps 4 and 5 are run, because steps 1 to 3 are already up to date in the cache.

d. Check result then commit.

cat ./dummy/data/result.txt
    It is also easy to reproduce a a sub-pipeline \o/ !!!

git commit -m "New pipeline version 2"  *.dvc **/*.dvc
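
Which steps run in b. and c. above follows from the dependency graph shown in the overview. The sketch below walks that graph to list a target step together with everything it depends on. It only illustrates the idea and is not how DVC is implemented; `DEPENDS_ON` and `upstream_steps` are hypothetical names.

```python
# Dependencies of each step, taken from the overview diagram.
DEPENDS_ON = {
    "#1": [],
    "#2": ["#1"],
    "#3": ["#2"],
    "#4": ["#2"],
    "#5": ["#3", "#4"],
}


def upstream_steps(target, graph=DEPENDS_ON):
    """Return the target and all the steps it depends on, dependencies first."""
    ordered = []

    def visit(step):
        if step in ordered:
            return
        for dependency in graph[step]:
            visit(dependency)
        ordered.append(step)

    visit(target)
    return ordered


sub_pipeline = upstream_steps("#3")
full_pipeline = upstream_steps("#5")
# Steps still to run once the sub-pipeline is up to date.
remaining = [step for step in full_pipeline if step not in sub_pipeline]

print(sub_pipeline)  # ['#1', '#2', '#3']
print(remaining)     # ['#4', '#5']
```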

Modify Pipeline Code

In this example we modify the code of a Jupyter notebook. It is an algorithm change.

Algorithm change.

Open ./dummy/pipeline/notebooks/step4_convert_octals.ipynb

Create a new cell just before the one that writes the result file. Add the following code, then save.

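 str_to_add = " It is an algo change !"

 # Next free index: one past the highest index already present
 next_index = max(int(c.split('=')[0]) for c in characters) + 1

 # Append each new character as an 'index=character' entry
 for char in str_to_add:
     characters.append(f'{next_index}={char}')
     next_index += 1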

This modification appends extra text to the converted text.
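
To see the effect outside the notebook, the same logic can be run on a small stand-in list. The initial content of `characters` below is made up for the example; in the notebook it comes from the preceding cells.

```python
# Hypothetical content of `characters` just before the result is written.
characters = ["0=H", "1=i"]

str_to_add = " It is an algo change !"

# Next free index: one past the highest index already present.
next_index = max(int(entry.split("=")[0]) for entry in characters) + 1

for char in str_to_add:
    characters.append(f"{next_index}={char}")
    next_index += 1

print(characters[:4])  # ['0=H', '1=i', '2= ', '3=I']
print(characters[-1])  # 24=!
```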

Re-generate scripts and commands

 # Generate Python script and DVC pipeline step
 ipynb_to_python -w . -n ./dummy/pipeline/notebooks/step4_convert_octals.ipynb -f
 gen_dvc -w . -i ./dummy/pipeline/steps/mlvtools_step4_convert_octals.py -f

Check changes

 git status -s

  M dummy/pipeline/notebooks/step4_convert_octals.ipynb
  M dummy/pipeline/steps/mlvtools_step4_convert_octals.py

A new Python script has been generated.

The DVC command remains the same because no parameter has changed.

Reproduce pipeline

 dvc repro ./mlvtools_step5_sort_data.dvc

     Pipeline is up to date. Nothing to reproduce.

Nothing happens, because DVC tracks inputs and outputs, not the code.

Solution:

Run the modified step individually, then, once everything is OK, complete the pipeline.

     ./dummy/dvc/mlvtools_step4_convert_octals_dvc
     dvc repro ./mlvtools_step5_sort_data.dvc

Check the result

cat ./dummy/data/result.txt

   It is also easy to reproduce a a sub-pipeline \o/ !!! It is an algo change !

Modify Pipeline Hyperparameter

In this example we change the value of the parameter that describes the number of bits in a binary value. It is a dummy example to show how to reproduce a pipeline after a hyperparameter change.

Start from initial case:

git revert HEAD
dvc checkout

Hyperparameter change

Parameters are stored in DVC commands. Open ./dummy/dvc/mlvtools_step2_split_data_dvc and replace:
```
    --size-bin-data 8

 WITH

    --size-bin-data 7
```
This change can also be made in the Jupyter notebook, but it must then be followed by regenerating the script and the command.
Check changes

git status -s
```
M dummy/dvc/mlvtools_step2_split_data_dvc
```
The DVC command is modified.

The Python script is unchanged as there is no algorithm change.

Reproduce pipeline

 dvc repro ./mlvtools_step5_sort_data.dvc

     Pipeline is up to date. Nothing to reproduce.

Nothing happens: DVC tracks inputs and outputs, and the edited command file is neither.

Solution:

Run the first modified step individually, then, once everything is OK, complete the pipeline.

         ./dummy/dvc/mlvtools_step2_split_data_dvc
         dvc repro ./mlvtools_step5_sort_data.dvc

Check the result
```
cat ./dummy/data/result.txt
```

You have reached the end of this dummy tutorial. A tutorial with more realistic cases is available here.

Or go back to README
