Use Case 2: Create a New Version of a Pipeline

The aim of this use case is to show how to create a new pipeline version based on an existing one. We will reuse the pipeline created in Use Case 1: Build and Reproduce a Pipeline and change the classifier type.

The initial classifier is a neural network trained with FastText (see resources/03_Classify_text.ipynb). We want to try to improve it by using trigrams instead of unigrams (see resources/03_bis_Classify_text.ipynb).
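To make the unigram/trigram distinction concrete, here is a minimal Python sketch of word n-gram extraction. It is only an illustration: the `word_ngrams` helper and the sample sentence are invented for this example, and the sketch does not reproduce how FastText builds its features internally.

```python
def word_ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


# Hypothetical sample sentence, already tokenized on whitespace.
tokens = "the new graphics card is fast".split()

print(word_ngrams(tokens, 1))  # unigrams: one tuple per word
print(word_ngrams(tokens, 3))  # trigrams: sliding window of three words
```

Trigrams keep some local word order that unigrams discard, which is why they are worth trying here.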

To achieve that, we will use Git branches.

Requirements:

Set up the environment (see the tutorial setup)
Build the pipeline from Use Case 1: Build and Reproduce a Pipeline

Note: you can quickly build the pipeline from Use Case 1 by running `make setup` (if the setup has not been done yet) and then `make pipeline1`. Be careful: the pipeline files and DVC meta files will not be committed.

1. Create a New Branch

On the current branch we have built the first pipeline, PipelineUseCase1.dvc. We want to create a new Git branch in which to modify one step of this pipeline, which amounts to creating a new version of it.

git checkout -b tutorial_use_case_2

2. Replace the Classifier

The new classifier Jupyter Notebook is 03_bis_Classify_text.ipynb.

Input and output files must remain the same; only the 'algorithm' part changes.


| | |
| :--- | :--- |
| Step Input: | `./poc/data/data_train_tokenized.csv` |
| Step Output: | `./poc/data/fasttext_model.bin` |
| | `./poc/data/fasttext_model.vec` |
| Generated files: | `./poc/pipeline/steps/mlvtools_03_classify_text.py` |
| | `./poc/commands/dvc/mlvtools_03_classify_text_dvc` |

Replace the current classifier Jupyter Notebook with the new version

 cp ./resources/03_bis_Classify_text.ipynb ./poc/pipeline/notebooks/03_Classify_text.ipynb

Edit the notebook ./poc/pipeline/notebooks/03_Classify_text.ipynb so that it uses the correct paths (see the inputs and outputs above)

The docstring must be:

"""
 :param str input_csv_file: Path to input file
 :param str out_model_path: Path to model files
 :param float learning_rate: Learning rate
 :param int epochs: Number of epochs

 :dvc-in input_csv_file: ./poc/data/data_train_tokenized.csv
 :dvc-out out_model_path: ./poc/data/fasttext_model.bin
 :dvc-out: ./poc/data/fasttext_model.vec
 :dvc-extra: --learning-rate 0.7 --epochs 4
"""
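Before regenerating the script and the command, it can help to check that the docstring declares the expected inputs and outputs. The sketch below is a hypothetical helper written for this tutorial, not the parser that mlvtools actually uses: the regular expression and the `parse_dvc_annotations` name are assumptions, and only the three `:dvc-…` forms shown above are handled.

```python
import re

# The docstring from the notebook, copied from the tutorial text above.
DOCSTRING = """
 :param str input_csv_file: Path to input file
 :param str out_model_path: Path to model files
 :param float learning_rate: Learning rate
 :param int epochs: Number of epochs

 :dvc-in input_csv_file: ./poc/data/data_train_tokenized.csv
 :dvc-out out_model_path: ./poc/data/fasttext_model.bin
 :dvc-out: ./poc/data/fasttext_model.vec
 :dvc-extra: --learning-rate 0.7 --epochs 4
"""

# Assumed line shape: ":dvc-<kind> [param_name]: <value>".
# The parameter name is optional, as in ":dvc-out: <path>".
DVC_LINE = re.compile(r"^\s*:dvc-(in|out|extra)(?:\s+(\w+))?:\s*(.+?)\s*$")


def parse_dvc_annotations(docstring):
    """Collect (param_name, value) pairs per annotation kind."""
    found = {"in": [], "out": [], "extra": []}
    for line in docstring.splitlines():
        match = DVC_LINE.match(line)
        if match:
            kind, param, value = match.groups()
            found[kind].append((param, value))
    return found


annotations = parse_dvc_annotations(DOCSTRING)
for kind, entries in annotations.items():
    print(kind, entries)
```

If the `in` and `out` entries do not match the step input and output files listed earlier, the notebook paths still need editing.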

Commit notebook modification

 git commit -m 'Tutorial: use case 2 step 1 - Modify notebook'  ./poc/pipeline/notebooks/03_Classify_text.ipynb

Re-generate Python script

 ipynb_to_python -w . -n ./poc/pipeline/notebooks/03_Classify_text.ipynb -f

Re-generate command

 gen_dvc -w . -i ./poc/pipeline/steps/mlvtools_03_classify_text.py -f

Run the DVC step

 ./poc/commands/dvc/mlvtools_03_classify_text_dvc

DVC asks whether you want to overwrite the corresponding meta file. Answer yes.

Complete the pipeline run

See pipeline status

 dvc status

 > mlvtools_04_evaluate_model.dvc
   	deps
   		changed:  poc/data/fasttext_model.bin
 > mlvtools_04_evaluate_test_model.dvc
   	deps
   		changed:  poc/data/fasttext_model.bin

The metric files, which are generated by the evaluation steps, need to be regenerated because one of their input files (the model) has changed.

Reproduce the pipeline using cache

 dvc repro ./PipelineUseCase1.dvc -v

 Debug: updater is not old enough to check for updates
 Debug: Dvc file 'poc/data/20news-bydate_py3.pkz.dvc' didn't change
 Debug: Dvc file 'mlvtools_01_extract_dataset.dvc' didn't change
 Debug: Dvc file 'mlvtools_02_tokenize_text.dvc' didn't change
 Debug: Dvc file 'mlvtools_03_classify_text.dvc' didn't change
 Debug: Dvc file 'mlvtools_04_evaluate_model.dvc' changed
 Debug: Removing 'poc/data/metrics.txt'
 Reproducing 'mlvtools_04_evaluate_model.dvc'
 Running command:
 	poc/pipeline/steps/mlvtools_04_evaluate_model.py --data-file ./poc/data/data_train_tokenized.csv --model-file ./poc/data/fasttext_model.bin --result-file ./poc/data/metrics.txt
 Saving 'poc/data/metrics.txt' to cache '.dvc/cache'.
 Debug: Cache type 'reflink' is not supported: EOPNOTSUPP
 Created 'hardlink': .dvc/cache/2a/2818ec7cbf536a5f2057bb47e9f8f2 -> poc/data/metrics.txt
 Debug: 'mlvtools_04_evaluate_model.dvc' was reproduced
 Saving information to 'mlvtools_04_evaluate_model.dvc'.
 Debug: Dvc file 'mlvtools_02_test_tokenize_text.dvc' didn't change
 Debug: Dvc file 'mlvtools_04_evaluate_test_model.dvc' changed
 Debug: Removing 'poc/data/metrics_test.txt'
 Reproducing 'mlvtools_04_evaluate_test_model.dvc'
 Running command:
 	poc/pipeline/steps/mlvtools_04_evaluate_model.py --data-file ./poc/data/data_test_tokenized.csv --model-file ./poc/data/fasttext_model.bin --result-file ./poc/data/metrics_test.txt
 Saving 'poc/data/metrics_test.txt' to cache '.dvc/cache'.
 Created 'hardlink': .dvc/cache/a0/0ba1e87b0fb7970f9b8d6b6eafcc6c -> poc/data/metrics_test.txt
 Debug: 'mlvtools_04_evaluate_test_model.dvc' was reproduced
 Saving information to 'mlvtools_04_evaluate_test_model.dvc'.
 Debug: Dvc file 'PipelineUseCase1.dvc' changed
 Reproducing 'PipelineUseCase1.dvc'
 Running command:
 	cat ./poc/data/metrics.txt ./poc/data/metrics_test.txt
 accuracy 0.36555075593952485
 accuracy 0.5101653564651667
 Debug: 'PipelineUseCase1.dvc' was reproduced
 Saving information to 'PipelineUseCase1.dvc'.

The evaluation steps (train and test) are re-run using the new model.
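The set of steps that were re-run follows from the dependency graph: a step is stale when one of its dependencies changed, and its outputs then count as changed for the steps further down. The sketch below is a simplified model of that idea, not DVC's implementation. The `STAGES` table is an assumption inferred from the commands in the trace above (the real `.dvc` meta files are not shown here), and `stale_stages` is a hypothetical helper.

```python
# Dependencies and outputs inferred from the commands in the trace above.
# This table is an assumption, not a copy of the real .dvc meta files.
STAGES = {
    "mlvtools_03_classify_text.dvc": {
        "deps": ["poc/data/data_train_tokenized.csv"],
        "outs": ["poc/data/fasttext_model.bin", "poc/data/fasttext_model.vec"],
    },
    "mlvtools_04_evaluate_model.dvc": {
        "deps": ["poc/data/data_train_tokenized.csv", "poc/data/fasttext_model.bin"],
        "outs": ["poc/data/metrics.txt"],
    },
    "mlvtools_04_evaluate_test_model.dvc": {
        "deps": ["poc/data/data_test_tokenized.csv", "poc/data/fasttext_model.bin"],
        "outs": ["poc/data/metrics_test.txt"],
    },
    "PipelineUseCase1.dvc": {
        "deps": ["poc/data/metrics.txt", "poc/data/metrics_test.txt"],
        "outs": [],
    },
}


def stale_stages(stages, changed_files):
    """Return the stages to re-run, in the order they are discovered."""
    dirty = set(changed_files)
    stale = []
    progress = True
    while progress:  # repeat until no new stale stage is found
        progress = False
        for name, stage in stages.items():
            if name in stale:
                continue
            if dirty.intersection(stage["deps"]):
                stale.append(name)
                # Outputs of a stale stage are treated as changed too.
                dirty.update(stage["outs"])
                progress = True
    return stale


result = stale_stages(STAGES, ["poc/data/fasttext_model.bin"])
print(result)
```

Under these assumptions, a change to the model file marks both evaluation steps and the final `PipelineUseCase1.dvc` step as stale, while the extraction, tokenization and classification steps are left alone, which matches the trace.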

Version the new pipeline

 git add *.dvc ./poc
 git commit -m 'Tutorial use case 2 step 1: classify text'

Check Results

The execution trace above shows the new accuracy values:

 Running command:
         cat ./poc/data/metrics.txt ./poc/data/metrics_test.txt
     ...
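To compare runs programmatically rather than by eye, the two lines printed by the final `cat` command can be parsed. This is a minimal sketch: the `parse_metrics` helper is hypothetical, it assumes each metrics file holds one `name value` pair per line as in the trace above, and the sample text is copied from that trace.

```python
# Output of the final `cat` command, copied from the trace above.
# The first line comes from metrics.txt, the second from metrics_test.txt.
TRACE_OUTPUT = """accuracy 0.36555075593952485
accuracy 0.5101653564651667
"""


def parse_metrics(text):
    """Parse 'name value' lines into a list of (name, float) pairs."""
    metrics = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2:  # skip blank or unexpected lines
            metrics.append((parts[0], float(parts[1])))
    return metrics


train_metric, test_metric = parse_metrics(TRACE_OUTPUT)
print("train:", train_metric)
print("test:", test_metric)
```

The same function can be applied to the contents of `./poc/data/metrics.txt` and `./poc/data/metrics_test.txt` on each branch to compare the two pipeline versions.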

Go back to the tutorial branch
```
  git checkout -
  dvc checkout
```

After the checkout, you can verify that the results are those from Use Case 1.

You have reached the end of this tutorial. See Use Case 3: Build a Pipeline from an Existing Pipeline

Or go back to README
