The aim of this use case is to show how to create a new pipeline version based on an existing one. We will reuse the pipeline created in Use Case 1: Build and Reproduce a Pipeline and change the classifier type.
The initial classifier is a neural network trained with FastText (see file resources/03_Classify_text.ipynb), and we want to try to improve it by using trigrams instead of unigrams (see file resources/03_bis_Classify_text.ipynb).
To achieve that, we will use Git branches.
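To make the intended change concrete: a unigram is a single token, while a trigram is a sequence of three consecutive tokens. The sketch below only illustrates what word n-grams are. It is not how FastText builds them internally, and `word_ngrams` is a hypothetical helper introduced for this illustration.

```python
def word_ngrams(tokens, n):
    """Return the contiguous word n-grams of a token list as tuples."""
    # A sentence shorter than n tokens yields no n-gram at all.
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


if __name__ == "__main__":
    tokens = "the quick brown fox".split()
    print(word_ngrams(tokens, 1))
    # [('the',), ('quick',), ('brown',), ('fox',)]
    print(word_ngrams(tokens, 3))
    # [('the', 'quick', 'brown'), ('quick', 'brown', 'fox')]
```

Trigrams keep some word-order information that unigrams discard, which is why they may help a text classifier.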
Requirements:
- set up the environment (tutorial setup)
- build the pipeline from Use Case 1: Build and Reproduce a Pipeline
Note: it is possible to quickly build the pipeline from Use Case 1 by running `make setup` (if the setup is not done yet), then `make pipeline1`. Be careful: the pipeline files and DVC meta files will not be committed.
On the current branch, we have built the first pipeline, PipelineUseCase1.dvc. We want to create a new Git branch to modify a step of this pipeline (in effect, to create a new version).

    git checkout -b tutorial_use_case_2
The new classifier Jupyter Notebook is 03_bis_Classify_text.ipynb.
Input and output files must remain the same; only the 'algorithm' part changes.
| Step Input: | ./poc/data/data_train_tokenized.csv |
| Step Output: | ./poc/data/fasttext_model.bin, ./poc/data/fasttext_model.vec |
| Generated files: | ./poc/pipeline/steps/mlvtools_03_classify_text.py, ./poc/commands/dvc/mlvtools_03_classify_text_dvc |
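Because the step's inputs and outputs must stay identical, it can help to check that the docstring of the edited notebook still declares the paths listed above. The following is a minimal sketch of such a check. It relies on a simple regular expression over `:dvc-in` and `:dvc-out` lines of the form used in this tutorial; it is not the parser used by mlvtools, and `declared_paths` is a hypothetical helper.

```python
import re

# Paths from the table above: the contract of the step that must not change.
EXPECTED_INPUTS = {"./poc/data/data_train_tokenized.csv"}
EXPECTED_OUTPUTS = {
    "./poc/data/fasttext_model.bin",
    "./poc/data/fasttext_model.vec",
}

# Matches ":dvc-in name: ./path" as well as the unnamed form ":dvc-out: ./path".
_DVC_LINE = re.compile(r"^\s*:dvc-(in|out)(?:\s+\w+)?:\s*(\S+)\s*$", re.MULTILINE)


def declared_paths(docstring):
    """Return (inputs, outputs) as sets of paths declared in a docstring."""
    inputs, outputs = set(), set()
    for kind, path in _DVC_LINE.findall(docstring):
        (inputs if kind == "in" else outputs).add(path)
    return inputs, outputs


EXAMPLE_DOCSTRING = """
:param str input_csv_file: Path to input file
:param str out_model_path: Path to model files
:dvc-in input_csv_file: ./poc/data/data_train_tokenized.csv
:dvc-out out_model_path: ./poc/data/fasttext_model.bin
:dvc-out: ./poc/data/fasttext_model.vec
:dvc-extra: --learning-rate 0.7 --epochs 4
"""

if __name__ == "__main__":
    ins, outs = declared_paths(EXAMPLE_DOCSTRING)
    print("inputs unchanged:", ins == EXPECTED_INPUTS)
    print("outputs unchanged:", outs == EXPECTED_OUTPUTS)
```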
- Replace the current classifier Jupyter Notebook with the new version

        cp ./resources/03_bis_Classify_text.ipynb ./poc/pipeline/notebooks/03_Classify_text.ipynb

- Edit the notebook ./poc/pipeline/notebooks/03_Classify_text.ipynb with the right paths (see inputs/outputs above). The docstring must be:

        """
        :param str input_csv_file: Path to input file
        :param str out_model_path: Path to model files
        :param float learning_rate: Learning rate
        :param int epochs: Number of epochs
        :dvc-in input_csv_file: ./poc/data/data_train_tokenized.csv
        :dvc-out out_model_path: ./poc/data/fasttext_model.bin
        :dvc-out: ./poc/data/fasttext_model.vec
        :dvc-extra: --learning-rate 0.7 --epochs 4
        """

- Commit the notebook modification

        git commit -m 'Tutorial: use case 2 step 1 - Modify notebook' ./poc/pipeline/notebooks/03_Classify_text.ipynb

- Re-generate the Python script

        ipynb_to_python -w . -n ./poc/pipeline/notebooks/03_Classify_text.ipynb -f

- Re-generate the command

        gen_dvc -w . -i ./poc/pipeline/steps/mlvtools_03_classify_text.py -f

- Run the DVC step

        ./poc/commands/dvc/mlvtools_03_classify_text_dvc

  DVC asks if you want to overwrite the corresponding meta file. The answer is yes.
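The `:dvc-extra` line of the docstring passes `--learning-rate 0.7 --epochs 4` on the step's command line. As an illustration of how such flags could map onto the `learning_rate` and `epochs` parameters, here is a sketch using `argparse`. It is an assumption for illustration only: the script actually generated by ipynb_to_python may be structured differently, and the `--input-csv-file` and `--out-model-path` flag names are guesses modeled on the dash-separated style seen in the tutorial.

```python
import argparse
import shlex


def build_parser():
    """Hypothetical command-line interface mirroring the docstring parameters."""
    parser = argparse.ArgumentParser(description="Illustrative classify-text step")
    parser.add_argument("--input-csv-file", type=str, required=True)
    parser.add_argument("--out-model-path", type=str, required=True)
    parser.add_argument("--learning-rate", type=float, required=True)
    parser.add_argument("--epochs", type=int, required=True)
    return parser


if __name__ == "__main__":
    # Paths come from the docstring; the extra flags come from :dvc-extra.
    command_line = (
        "--input-csv-file ./poc/data/data_train_tokenized.csv "
        "--out-model-path ./poc/data/fasttext_model.bin "
        "--learning-rate 0.7 --epochs 4"
    )
    args = build_parser().parse_args(shlex.split(command_line))
    # argparse turns dashes into underscores for attribute names.
    print(args.learning_rate, args.epochs)
```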
- Complete the pipeline run

  See the pipeline status:

        dvc status
        > mlvtools_04_evaluate_model.dvc
            deps changed: poc/data/fasttext_model.bin
        > mlvtools_04_evaluate_test_model.dvc
            deps changed: poc/data/fasttext_model.bin

  We see that the metric files, which are generated by the evaluation steps, need to be re-generated because their input files have changed.

  Reproduce the pipeline using the cache:

        dvc repro ./PipelineUseCase1.dvc -v
        Debug: updater is not old enough to check for updates
        Debug: Dvc file 'poc/data/20news-bydate_py3.pkz.dvc' didn't change
        Debug: Dvc file 'mlvtools_01_extract_dataset.dvc' didn't change
        Debug: Dvc file 'mlvtools_02_tokenize_text.dvc' didn't change
        Debug: Dvc file 'mlvtools_03_classify_text.dvc' didn't change
        Debug: Dvc file 'mlvtools_04_evaluate_model.dvc' changed
        Debug: Removing 'poc/data/metrics.txt'
        Reproducing 'mlvtools_04_evaluate_model.dvc'
        Running command: poc/pipeline/steps/mlvtools_04_evaluate_model.py --data-file ./poc/data/data_train_tokenized.csv --model-file ./poc/data/fasttext_model.bin --result-file ./poc/data/metrics.txt
        Saving 'poc/data/metrics.txt' to cache '.dvc/cache'.
        Debug: Cache type 'reflink' is not supported: EOPNOTSUPP
        Created 'hardlink': .dvc/cache/2a/2818ec7cbf536a5f2057bb47e9f8f2 -> poc/data/metrics.txt
        Debug: 'mlvtools_04_evaluate_model.dvc' was reproduced
        Saving information to 'mlvtools_04_evaluate_model.dvc'.
        Debug: Dvc file 'mlvtools_02_test_tokenize_text.dvc' didn't change
        Debug: Dvc file 'mlvtools_04_evaluate_test_model.dvc' changed
        Debug: Removing 'poc/data/metrics_test.txt'
        Reproducing 'mlvtools_04_evaluate_test_model.dvc'
        Running command: poc/pipeline/steps/mlvtools_04_evaluate_model.py --data-file ./poc/data/data_test_tokenized.csv --model-file ./poc/data/fasttext_model.bin --result-file ./poc/data/metrics_test.txt
        Saving 'poc/data/metrics_test.txt' to cache '.dvc/cache'.
        Created 'hardlink': .dvc/cache/a0/0ba1e87b0fb7970f9b8d6b6eafcc6c -> poc/data/metrics_test.txt
        Debug: 'mlvtools_04_evaluate_test_model.dvc' was reproduced
        Saving information to 'mlvtools_04_evaluate_test_model.dvc'.
        Debug: Dvc file 'PipelineUseCase1.dvc' changed
        Reproducing 'PipelineUseCase1.dvc'
        Running command: cat ./poc/data/metrics.txt ./poc/data/metrics_test.txt
        accuracy 0.36555075593952485
        accuracy 0.5101653564651667
        Debug: 'PipelineUseCase1.dvc' was reproduced
        Saving information to 'PipelineUseCase1.dvc'.

  The evaluation steps (train and test) are re-run using the new model. The steps upstream of the classifier are not re-run, since their DVC files did not change.
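The reason only the downstream steps are re-run can be pictured as change propagation through the step graph. The sketch below is a simplified model, not DVC's implementation (DVC compares checksums recorded in its meta files). The dependencies and outputs are read off the commands in the trace; treating the two metric files as dependencies of PipelineUseCase1.dvc is an inference from its `cat` command, and `steps_to_rerun` is a hypothetical helper.

```python
# Simplified model of part of the pipeline: step -> (dependencies, outputs).
STEPS = {
    "mlvtools_03_classify_text.dvc": (
        {"poc/data/data_train_tokenized.csv"},
        {"poc/data/fasttext_model.bin", "poc/data/fasttext_model.vec"},
    ),
    "mlvtools_04_evaluate_model.dvc": (
        {"poc/data/data_train_tokenized.csv", "poc/data/fasttext_model.bin"},
        {"poc/data/metrics.txt"},
    ),
    "mlvtools_04_evaluate_test_model.dvc": (
        {"poc/data/data_test_tokenized.csv", "poc/data/fasttext_model.bin"},
        {"poc/data/metrics_test.txt"},
    ),
    "PipelineUseCase1.dvc": (
        {"poc/data/metrics.txt", "poc/data/metrics_test.txt"},
        set(),
    ),
}


def steps_to_rerun(changed_files):
    """Return the steps affected, directly or transitively, by changed files."""
    dirty = set(changed_files)
    stale = []
    progress = True
    while progress:
        progress = False
        for name, (deps, outs) in STEPS.items():
            if name not in stale and deps & dirty:
                stale.append(name)
                dirty |= outs  # outputs of a re-run step count as changed too
                progress = True
    return stale


if __name__ == "__main__":
    for step in steps_to_rerun({"poc/data/fasttext_model.bin"}):
        print(step)
```

With the model file as the only changed input, this yields the two evaluation steps followed by PipelineUseCase1.dvc, which matches the steps reproduced in the trace.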
- Version the new pipeline

        git add *.dvc ./poc
        git commit -m 'Tutorial use case 2 step 1: classify text'

- Check the results

  In the execution trace above, we can see the new accuracy:

        Running command: cat ./poc/data/metrics.txt ./poc/data/metrics_test.txt
        ...

- Go back to the tutorial branch

        git checkout -
        dvc checkout
After the checkout, you can check that the results are those from Use Case 1.
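One way to perform this check is to read the metric files on each branch and compare the values. The sketch below assumes that each metric file contains lines of the form `accuracy <value>`, as printed in the execution trace; `parse_metrics` and `read_metrics` are hypothetical helpers, not part of mlvtools or DVC.

```python
from pathlib import Path


def parse_metrics(text):
    """Parse lines of the form '<name> <value>' into a dict of floats."""
    metrics = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2:  # ignore blank or malformed lines
            metrics[parts[0]] = float(parts[1])
    return metrics


def read_metrics(path):
    """Read and parse one metric file."""
    return parse_metrics(Path(path).read_text())


if __name__ == "__main__":
    # Paths taken from the tutorial; files are skipped if they are absent.
    for path in ("./poc/data/metrics.txt", "./poc/data/metrics_test.txt"):
        if Path(path).exists():
            print(path, read_metrics(path))
```

Running it once on each branch (after `dvc checkout`) gives two sets of values that can be compared side by side.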
You have reached the end of this tutorial. See Use Case 3: Build a Pipeline from an Existing Pipeline.