The aim of this use case is to show how to create a new pipeline version based on an existing one. We will reuse the pipeline created in Use Case 1: Build and Reproduce a Pipeline and change the classifier type.
The initial classifier is a neural network trained with FastText (see file resources/03_Classify_text.ipynb), and we want to try to improve it by using trigrams instead of unigrams (see file resources/03_bis_Classify_text.ipynb).
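To make the difference between unigrams and trigrams concrete, here is a small illustrative sketch. The helper `word_ngrams` is hypothetical and is not part of FastText or of the tutorial notebooks; it only shows which word sequences become available as features when the n-gram length grows from 1 to 3.

```python
from typing import List


def word_ngrams(tokens: List[str], n: int) -> List[str]:
    """Return all contiguous word sequences of length n (hypothetical helper)."""
    if n < 1:
        raise ValueError("n must be >= 1")
    # Slide a window of n tokens over the sentence
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokens = "the new model is better".split()

unigrams = word_ngrams(tokens, 1)  # single words, no ordering information
trigrams = word_ngrams(tokens, 3)  # sequences of three consecutive words

print(unigrams)
print(trigrams)
```

Unigrams treat each word in isolation, whereas trigrams keep a little local word order, which is the extra signal the new notebook tries to exploit.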
To achieve that, we will rely on Git branches.
Requirements:
- set up the environment (tutorial setup)
- build the pipeline from Use Case 1: Build and Reproduce a Pipeline
Note: it is possible to quickly build the pipeline from Use Case 1 by running
make setup
(only if the setup is not done yet) and then
make pipeline1
Be careful: the pipeline files and DVC meta files will not be committed.
On the current branch we have built the first pipeline, PipelineUseCase1.dvc. We now create a new Git branch in which to modify one step of this pipeline, which amounts to creating a new version of it.
git checkout -b tutorial_use_case_2
The new classifier Jupyter Notebook is 03_bis_Classify_text.ipynb.
Input and output files must remain the same; only the 'algorithm' part changes.
| Step Input: | ./poc/data/data_train_tokenized.csv |
| Step Output: | ./poc/data/fasttext_model.bin, ./poc/data/fasttext_model.vec |
| Generated files: | ./poc/pipeline/steps/mlvtools_03_classify_text.py, ./poc/commands/dvc/mlvtools_03_classify_text_dvc |
-
Replace the current classifier Jupyter Notebook with the new version
cp ./resources/03_bis_Classify_text.ipynb ./poc/pipeline/notebooks/03_Classify_text.ipynb
-
Edit the notebook ./poc/pipeline/notebooks/03_Classify_text.ipynb so that it uses the right paths (see inputs/outputs above).
The docstring must be:
"""
:param str input_csv_file: Path to input file
:param str out_model_path: Path to model files
:param float learning_rate: Learning rate
:param int epochs: Number of epochs
:dvc-in input_csv_file: ./poc/data/data_train_tokenized.csv
:dvc-out out_model_path: ./poc/data/fasttext_model.bin
:dvc-out: ./poc/data/fasttext_model.vec
:dvc-extra: --learning-rate 0.7 --epochs 4
"""
-
Commit notebook modification
git commit -m 'Tutorial: use case 2 step 1 - Modify notebook' ./poc/pipeline/notebooks/03_Classify_text.ipynb
-
Re-generate Python script
ipynb_to_python -w . -n ./poc/pipeline/notebooks/03_Classify_text.ipynb -f
-
Re-generate the DVC command
gen_dvc -w . -i ./poc/pipeline/steps/mlvtools_03_classify_text.py -f
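The generated command depends on the `:dvc-in`, `:dvc-out` and `:dvc-extra` lines of the docstring shown earlier, so a typo in one of those paths ends up in the DVC step. The sketch below is a simplified, hypothetical parser (`parse_dvc_annotations` is not the mlvtools implementation) that reads these annotations from a docstring, which can help check that the paths match the expected inputs and outputs before running the step.

```python
import re
from typing import Dict, List

# Docstring copied from the notebook edited in this use case
DOCSTRING = """
:param str input_csv_file: Path to input file
:param str out_model_path: Path to model files
:param float learning_rate: Learning rate
:param int epochs: Number of epochs
:dvc-in input_csv_file: ./poc/data/data_train_tokenized.csv
:dvc-out out_model_path: ./poc/data/fasttext_model.bin
:dvc-out: ./poc/data/fasttext_model.vec
:dvc-extra: --learning-rate 0.7 --epochs 4
"""

# Matches ":dvc-<kind> [optional_parameter_name]: value"
DVC_LINE = re.compile(r"^\s*:dvc-(in|out|extra)(?:\s+\w+)?:\s*(.+?)\s*$")


def parse_dvc_annotations(docstring: str) -> Dict[str, List[str]]:
    """Collect the values of :dvc-in, :dvc-out and :dvc-extra lines (simplified)."""
    found: Dict[str, List[str]] = {"in": [], "out": [], "extra": []}
    for line in docstring.splitlines():
        match = DVC_LINE.match(line)
        if match:
            kind, value = match.groups()
            found[kind].append(value)
    return found


annotations = parse_dvc_annotations(DOCSTRING)
print(annotations["in"])
print(annotations["out"])
print(annotations["extra"])
```

The printed lists should contain the same input and output paths as the step description above; if they differ, fix the docstring before re-generating the command.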
-
Run the DVC step
./poc/commands/dvc/mlvtools_03_classify_text_dvc
DVC asks whether you want to overwrite the corresponding meta file. Answer yes.
-
Complete the pipeline run
See pipeline status
dvc status
> mlvtools_04_evaluate_model.dvc
    deps
        changed: poc/data/fasttext_model.bin
> mlvtools_04_evaluate_test_model.dvc
    deps
        changed: poc/data/fasttext_model.bin
We see that the metric files, which are generated by the evaluation steps, need to be re-generated because their input files have changed.
Reproduce the pipeline using the cache
dvc repro ./PipelineUseCase1.dvc -v
Debug: updater is not old enough to check for updates
Debug: Dvc file 'poc/data/20news-bydate_py3.pkz.dvc' didn't change
Debug: Dvc file 'mlvtools_01_extract_dataset.dvc' didn't change
Debug: Dvc file 'mlvtools_02_tokenize_text.dvc' didn't change
Debug: Dvc file 'mlvtools_03_classify_text.dvc' didn't change
Debug: Dvc file 'mlvtools_04_evaluate_model.dvc' changed
Debug: Removing 'poc/data/metrics.txt'
Reproducing 'mlvtools_04_evaluate_model.dvc'
Running command: poc/pipeline/steps/mlvtools_04_evaluate_model.py --data-file ./poc/data/data_train_tokenized.csv --model-file ./poc/data/fasttext_model.bin --result-file ./poc/data/metrics.txt
Saving 'poc/data/metrics.txt' to cache '.dvc/cache'.
Debug: Cache type 'reflink' is not supported: EOPNOTSUPP
Created 'hardlink': .dvc/cache/2a/2818ec7cbf536a5f2057bb47e9f8f2 -> poc/data/metrics.txt
Debug: 'mlvtools_04_evaluate_model.dvc' was reproduced
Saving information to 'mlvtools_04_evaluate_model.dvc'.
Debug: Dvc file 'mlvtools_02_test_tokenize_text.dvc' didn't change
Debug: Dvc file 'mlvtools_04_evaluate_test_model.dvc' changed
Debug: Removing 'poc/data/metrics_test.txt'
Reproducing 'mlvtools_04_evaluate_test_model.dvc'
Running command: poc/pipeline/steps/mlvtools_04_evaluate_model.py --data-file ./poc/data/data_test_tokenized.csv --model-file ./poc/data/fasttext_model.bin --result-file ./poc/data/metrics_test.txt
Saving 'poc/data/metrics_test.txt' to cache '.dvc/cache'.
Created 'hardlink': .dvc/cache/a0/0ba1e87b0fb7970f9b8d6b6eafcc6c -> poc/data/metrics_test.txt
Debug: 'mlvtools_04_evaluate_test_model.dvc' was reproduced
Saving information to 'mlvtools_04_evaluate_test_model.dvc'.
Debug: Dvc file 'PipelineUseCase1.dvc' changed
Reproducing 'PipelineUseCase1.dvc'
Running command: cat ./poc/data/metrics.txt ./poc/data/metrics_test.txt
accuracy 0.36555075593952485
accuracy 0.5101653564651667
Debug: 'PipelineUseCase1.dvc' was reproduced
Saving information to 'PipelineUseCase1.dvc'.
The evaluation steps (train and test) are re-run using the new model.
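Only the evaluation steps are re-run because DVC compares the checksum recorded for each dependency with the checksum of the file currently on disk, and only fasttext_model.bin has changed. The following sketch illustrates that idea with a plain MD5 comparison; `file_md5` and `is_stale` are hypothetical helpers, not DVC functions, and real DVC meta files hold more information than a single hash.

```python
import hashlib
import os
import tempfile


def file_md5(path: str) -> str:
    """Return the hexadecimal MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_stale(path: str, recorded_md5: str) -> bool:
    """A step must be re-run when a dependency no longer matches its recorded hash."""
    return file_md5(path) != recorded_md5


with tempfile.TemporaryDirectory() as workdir:
    # Stand-in for ./poc/data/fasttext_model.bin (fake content, illustration only)
    model_path = os.path.join(workdir, "fasttext_model.bin")

    with open(model_path, "wb") as handle:
        handle.write(b"model trained with unigrams")
    recorded = file_md5(model_path)  # hash stored when the step last ran
    stale_before = is_stale(model_path, recorded)

    # Retraining overwrites the model, so its checksum changes
    with open(model_path, "wb") as handle:
        handle.write(b"model trained with trigrams")
    stale_after = is_stale(model_path, recorded)

print(stale_before, stale_after)
```

Steps whose dependencies all keep the same checksum are reported as unchanged in the trace, which is why extraction and tokenization are skipped.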
-
Version the new pipeline
git add *.dvc ./poc
git commit -m 'Tutorial use case 2 step 1: classify text'
-
Check Results
In the execution trace above we can see the new accuracy is:
Running command: cat ./poc/data/metrics.txt ./poc/data/metrics_test.txt ...
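To compare runs without scanning the trace by eye, the metrics files can be parsed directly. The sketch below assumes each file holds lines of the form `accuracy <value>`, as printed by the `cat` command in the execution trace; `parse_metrics` is a hypothetical helper, and the sample strings are copied from that trace rather than read from disk.

```python
from typing import Dict


def parse_metrics(text: str) -> Dict[str, float]:
    """Parse lines of the form '<name> <value>' into a dictionary (assumed format)."""
    metrics: Dict[str, float] = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2:
            name, value = parts
            metrics[name] = float(value)
    return metrics


# Sample contents copied from the execution trace:
# first line comes from metrics.txt, second from metrics_test.txt
train_metrics = parse_metrics("accuracy 0.36555075593952485")
test_metrics = parse_metrics("accuracy 0.5101653564651667")

print(f"metrics.txt accuracy:      {train_metrics['accuracy']:.4f}")
print(f"metrics_test.txt accuracy: {test_metrics['accuracy']:.4f}")
```

In practice the strings would be replaced by the contents of ./poc/data/metrics.txt and ./poc/data/metrics_test.txt, read on each branch after `dvc checkout`.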
-
Go back to tutorial branch
git checkout -
dvc checkout
After the checkout, you can verify that the results are those from Use Case 1.
You have reached the end of this tutorial. Next, see Use Case 3: Build a Pipeline from an Existing Pipeline