Use Case 2: Create a New Version of a Pipeline

The aim of this use case is to show how to create a new pipeline version based on an existing one. We will reuse the pipeline created in Use Case 1: Build and Reproduce a Pipeline and change the classifier type.

The initial classifier is a neural network trained with FastText (see file resources/03_Classify_text.ipynb). We want to try to improve it by using trigrams instead of unigrams (see file resources/03_bis_Classify_text.ipynb).

To achieve this, we will use Git branches.

Requirements:

Note: you can quickly build the pipeline from Use Case 1 by running make setup (if the setup has not been done yet) and then make pipeline1. Be careful: the pipeline files and DVC meta files will not be committed.

1. Create a New Branch

On the current branch, we have built the first pipeline, PipelineUseCase1.dvc. We want to create a new Git branch to modify a step of this pipeline, that is, to create a new version of it.

git checkout -b tutorial_use_case_2

2. Replace the Classifier

The new classifier Jupyter Notebook is 03_bis_Classify_text.ipynb.

Input and output files must remain the same; only the 'algorithm' part changes.

Step Input: ./poc/data/data_train_tokenized.csv
Step Output: ./poc/data/fasttext_model.bin, ./poc/data/fasttext_model.vec
Generated files: ./poc/pipeline/steps/mlvtools_03_classify_text.py, ./poc/commands/dvc/mlvtools_03_classify_text_dvc

  1. Replace the current classifier Jupyter Notebook with the new version

     cp ./resources/03_bis_Classify_text.ipynb ./poc/pipeline/notebooks/03_Classify_text.ipynb
    
  2. Edit the notebook ./poc/pipeline/notebooks/03_Classify_text.ipynb so that it uses the right paths (see inputs/outputs above)

    The docstring must be:

    """
     :param str input_csv_file: Path to input file
     :param str out_model_path: Path to model files
     :param float learning_rate: Learning rate
     :param int epochs: Number of epochs
    
     :dvc-in input_csv_file: ./poc/data/data_train_tokenized.csv
     :dvc-out out_model_path: ./poc/data/fasttext_model.bin
     :dvc-out: ./poc/data/fasttext_model.vec
     :dvc-extra: --learning-rate 0.7 --epochs 4
    """
    
  3. Commit notebook modification

     git commit -m 'Tutorial: use case 2 step 1 - Modify notebook'  ./poc/pipeline/notebooks/03_Classify_text.ipynb
    
  4. Re-generate Python script

     ipynb_to_python -w . -n ./poc/pipeline/notebooks/03_Classify_text.ipynb -f
    
  5. Re-generate command

     gen_dvc -w . -i ./poc/pipeline/steps/mlvtools_03_classify_text.py -f
    
  6. Run the DVC step

     ./poc/commands/dvc/mlvtools_03_classify_text_dvc
    

DVC asks whether you want to overwrite the corresponding meta file. Answer yes.
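To make the role of the docstring in step 2 more concrete, here is a minimal sketch of how the :dvc-in, :dvc-out and :dvc-extra lines could be read from it. This is an illustration only, not the parser used by the tools above: the function name read_dvc_annotations and the regular expression are assumptions made for this example.

```python
import re
import shlex

# The docstring from step 2, copied as-is.
DOCSTRING = """
 :param str input_csv_file: Path to input file
 :param str out_model_path: Path to model files
 :param float learning_rate: Learning rate
 :param int epochs: Number of epochs

 :dvc-in input_csv_file: ./poc/data/data_train_tokenized.csv
 :dvc-out out_model_path: ./poc/data/fasttext_model.bin
 :dvc-out: ./poc/data/fasttext_model.vec
 :dvc-extra: --learning-rate 0.7 --epochs 4
"""

# Hypothetical pattern: matches ":dvc-in name: value", ":dvc-out: value"
# and ":dvc-extra: value". The parameter name is optional.
DVC_LINE = re.compile(r"^\s*:dvc-(in|out|extra)(?:\s+(\w+))?:\s*(.+?)\s*$")


def read_dvc_annotations(docstring):
    """Collect DVC annotations as lists of (parameter, value) pairs.

    Extra arguments are split like a shell command line.
    """
    found = {"in": [], "out": [], "extra": []}
    for line in docstring.splitlines():
        match = DVC_LINE.match(line)
        if match is None:
            continue  # ":param ..." lines and blank lines are ignored
        kind, param, value = match.groups()
        if kind == "extra":
            found["extra"].extend(shlex.split(value))
        else:
            found[kind].append((param, value))
    return found


annotations = read_dvc_annotations(DOCSTRING)
for kind, values in annotations.items():
    print(kind, values)
```

The sketch shows why the paths must stay identical to those of Use Case 1: the declared input and outputs are what ties this step to the steps before and after it in the pipeline.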

  7. Complete the pipeline run

    Check the pipeline status

     dvc status
    
     > mlvtools_04_evaluate_model.dvc
       	deps
       		changed:  poc/data/fasttext_model.bin
     > mlvtools_04_evaluate_test_model.dvc
       	deps
       		changed:  poc/data/fasttext_model.bin
    

    We see that the metric files, which are generated by the evaluation steps, need to be regenerated because their input files have changed.

    Reproduce the pipeline using the cache

     dvc repro ./PipelineUseCase1.dvc -v
    
     Debug: updater is not old enough to check for updates
     Debug: Dvc file 'poc/data/20news-bydate_py3.pkz.dvc' didn't change
     Debug: Dvc file 'mlvtools_01_extract_dataset.dvc' didn't change
     Debug: Dvc file 'mlvtools_02_tokenize_text.dvc' didn't change
     Debug: Dvc file 'mlvtools_03_classify_text.dvc' didn't change
     Debug: Dvc file 'mlvtools_04_evaluate_model.dvc' changed
     Debug: Removing 'poc/data/metrics.txt'
     Reproducing 'mlvtools_04_evaluate_model.dvc'
     Running command:
     	poc/pipeline/steps/mlvtools_04_evaluate_model.py --data-file ./poc/data/data_train_tokenized.csv --model-file ./poc/data/fasttext_model.bin --result-file ./poc/data/metrics.txt
     Saving 'poc/data/metrics.txt' to cache '.dvc/cache'.
     Debug: Cache type 'reflink' is not supported: EOPNOTSUPP
     Created 'hardlink': .dvc/cache/2a/2818ec7cbf536a5f2057bb47e9f8f2 -> poc/data/metrics.txt
     Debug: 'mlvtools_04_evaluate_model.dvc' was reproduced
     Saving information to 'mlvtools_04_evaluate_model.dvc'.
     Debug: Dvc file 'mlvtools_02_test_tokenize_text.dvc' didn't change
     Debug: Dvc file 'mlvtools_04_evaluate_test_model.dvc' changed
     Debug: Removing 'poc/data/metrics_test.txt'
     Reproducing 'mlvtools_04_evaluate_test_model.dvc'
     Running command:
     	poc/pipeline/steps/mlvtools_04_evaluate_model.py --data-file ./poc/data/data_test_tokenized.csv --model-file ./poc/data/fasttext_model.bin --result-file ./poc/data/metrics_test.txt
     Saving 'poc/data/metrics_test.txt' to cache '.dvc/cache'.
     Created 'hardlink': .dvc/cache/a0/0ba1e87b0fb7970f9b8d6b6eafcc6c -> poc/data/metrics_test.txt
     Debug: 'mlvtools_04_evaluate_test_model.dvc' was reproduced
     Saving information to 'mlvtools_04_evaluate_test_model.dvc'.
     Debug: Dvc file 'PipelineUseCase1.dvc' changed
     Reproducing 'PipelineUseCase1.dvc'
     Running command:
     	cat ./poc/data/metrics.txt ./poc/data/metrics_test.txt
     accuracy 0.36555075593952485
     accuracy 0.5101653564651667
     Debug: 'PipelineUseCase1.dvc' was reproduced
     Saving information to 'PipelineUseCase1.dvc'.
    

The evaluation steps (train and test) are re-run using the new model.
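The trace above shows two mechanisms: a step is re-run only when one of its dependencies has changed, and outputs are stored in the cache under a path derived from their checksum (for example .dvc/cache/2a/2818ec7cbf536a5f2057bb47e9f8f2). The following is a simplified sketch of both ideas, assuming MD5 checksums of file contents. The helper names file_md5, cache_path and changed_dependencies are made up for this example, and the real DVC logic handles more cases (directories, changed commands, outputs).

```python
import hashlib
import os
import tempfile


def file_md5(path, chunk_size=1 << 20):
    """Return the hexadecimal MD5 checksum of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def cache_path(md5_hex, cache_dir=".dvc/cache"):
    """Cache layout seen in the trace: first two hex characters, then the rest."""
    return f"{cache_dir}/{md5_hex[:2]}/{md5_hex[2:]}"


def changed_dependencies(recorded):
    """Return the paths whose current checksum differs from the recorded one.

    `recorded` maps a dependency path to the checksum saved at the last run.
    """
    return [path for path, old in recorded.items() if file_md5(path) != old]


# Demonstration with a temporary file standing in for the model file.
with tempfile.TemporaryDirectory() as workdir:
    model = os.path.join(workdir, "fasttext_model.bin")
    with open(model, "wb") as handle:
        handle.write(b"unigram model")
    recorded = {model: file_md5(model)}       # state saved after the first run
    unchanged = changed_dependencies(recorded)

    with open(model, "wb") as handle:
        handle.write(b"trigram model")        # the classifier step was re-run
    changed = changed_dependencies(recorded)

print("before:", unchanged)
print("after:", changed)
print(cache_path("2a2818ec7cbf536a5f2057bb47e9f8f2"))
```

In this picture, the tokenization steps are skipped because their dependencies still match the recorded checksums, while both evaluation steps are re-run because poc/data/fasttext_model.bin no longer does.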

  8. Version the new pipeline

     git add *.dvc ./poc
     git commit -m 'Tutorial use case 2 step 1: classify text'
    
  9. Check Results

    In the execution trace above we can see the new accuracy is:

     Running command:
             cat ./poc/data/metrics.txt ./poc/data/metrics_test.txt
         ...
    
  10. Go back to the tutorial branch

      git checkout -
      dvc checkout
    

After the checkout, you can verify that the results are those from Use Case 1.
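To compare the results of the two branches, the metric files can be read programmatically. Below is a minimal sketch, assuming each metric file contains lines of the form "name value", as printed by the cat command in the trace above. The function names parse_metrics and metric_delta are hypothetical helpers written for this example.

```python
def parse_metrics(text):
    """Parse lines such as 'accuracy 0.51' into a {name: float} dictionary."""
    metrics = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 2:
            continue  # skip blank or unexpected lines
        name, value = parts
        metrics[name] = float(value)
    return metrics


def metric_delta(before, after):
    """Difference (after - before) for every metric present in both runs."""
    return {name: after[name] - before[name] for name in before if name in after}


# Values printed in the Use Case 2 trace (metrics.txt, then metrics_test.txt).
train_metrics = parse_metrics("accuracy 0.36555075593952485\n")
test_metrics = parse_metrics("accuracy 0.5101653564651667\n")

print(train_metrics)
print(test_metrics)
```

A possible usage is to call parse_metrics on the content of ./poc/data/metrics_test.txt once on each branch (after git checkout and dvc checkout) and pass both results to metric_delta.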

You have reached the end of this tutorial. See Use Case 3: Build a Pipeline from an Existing Pipeline

Or go back to README