Implemented testing and updated DVC pipeline with metrics (#7)

remla23-team08 · Jun 5, 2023 · f39ff60
1 parent 259c34c

Showing 19 changed files with 231 additions and 172 deletions.
diff --git a/.dvc/config b/.dvc/config
@@ -1,2 +1,4 @@
+[core]
+    remote = gdrive_remote
 ['remote "gdrive_remote"']
     url = gdrive://1NY8yEl6N1ZhE-q9jnEt6G6cqIHyEiCKc
diff --git a/README.md b/README.md
@@ -1,27 +1,50 @@
 # model-training
-Contains the ML training pipeline used for the main project of course CS4295: Release Engineering for Machine Learning Applications. This pipeline is of an ML model that evaluates restaurant reviews.
+Contains the ML training pipeline used for the main project of course CS4295: Release Engineering for Machine Learning Applications. This pipeline is of an ML model that evaluates restaurant reviews. The repository structure is based on the Cookiecutter template.
 
 ## Dependencies
-All required packages can be found in dep/requirements.txt. To install the required packages, run the following command:
+All required packages can be found in `requirements.txt`. To install the required packages, run the following command:
 
 ```bash
 pip install -r dep/requirements.txt
 ```
 
-## Dataset
-Project was created using the dataset provided by course instructors on [SURFdrive](https://surfdrive.surf.nl/files/index.php/s/207BTysNQFuVZPE?path=%2Fmaterial)
+## Usage
+In order to run the pipeline, ensure that you have `dvc` installed and run the following command:
 
-To use this repository, please download the archive at the link above, unzip the archive and place the `restaurant-sentiment/` folder in the main `model-training` folder.
+```bash
+dvc exp run
+```
 
-## Usage
-To manually generate a new ML model, execute `main.py`. 
+This will automatically download the dataset from an external source, pre-process the dataset, train the model and save the evaluation results in `reports/model_evaluation.json`. Tests are also run automatically. Linting via Pylint and DSLinter is also run automatically as part of the pipeline.
 
-### Preprocessing
-Any preprocessing steps can be found in `preprocessing.py`. These are executed automatically with the execution of `main.py`.
+To view a graphical representation of the pipeline, run the following command:
+```bash
+dvc dag
+```
+### Remote
+A Google Drive folder has been configured as the DVC remote storage (`gdrive_remote`).
+
+### Testing
+The ML pipeline is covered by several tests, which can be found in `tests/`. These are run automatically as part of the pipeline and can also be run manually with the following command:
 
+```bash
+pytest
+```
+### Metrics
+The accuracy metric is stored in `reports/model_evaluation.json`. In order to see the experiment history, run the following command:
+
+```bash
+dvc exp show
+```
+Two experiments are listed, comparing test split sizes of 20% and 10%.
+### Dataset
+The project was created using the dataset provided by the course instructors on [SURFdrive](https://surfdrive.surf.nl/files/index.php/s/207BTysNQFuVZPE?path=%2Fmaterial).
+
+### Preprocessing
+All preprocessing steps can be found in `src/pipeline/preprocessing.py` and are executed automatically when the pipeline runs. The processed data (corpus) is stored in `data/processed/`.
 
 ### Storing the trained model
-The trained model is stored in the `res/` folder.
+The trained model is stored in `data/models/`.
 
 
 ## Pylint & DSLinter
@@ -33,4 +56,4 @@ If you would like to manually verify the code quality, please run the following
 pylint src
 ```
 
-DSLinter is configured and will automatically run. This should return a perfect score of 10.00. A report summarising the findings can be found in the `data/reports` folder. 
+DSLinter is configured and will automatically run. This should return a perfect score of 10.00. A report summarising the findings can be found in `data/reports/`. 
diff --git a/data/external/.data b/data/external/.data
diff --git a/data/models/.gitignore b/data/models/.gitignore
@@ -1,2 +1,4 @@
-c1_BoW_Sentiment_Model.pkl
-c2_Classifier_Sentiment_Model
+*
+!/**/
+!*.*
+*.pkl
diff --git a/dvc.lock b/dvc.lock
@@ -1,50 +1,75 @@
 schema: '2.0'
 stages:
   preprocessing:
-    cmd: python src/preprocessing.py
+    cmd: python src/pipeline/preprocessing.py
     deps:
     - path: data/external/a1_RestaurantReviews_HistoricDump.tsv
       md5: 102f1f4193e0bdebdd6cce7f13e0a839
       size: 54686
-    - path: src/preprocessing.py
-      md5: b45d76ab50b20ccabfb50d591ee7ef02
-      size: 2034
+    - path: src/pipeline/preprocessing.py
+      md5: 2939fdfbdbb8254ed5ee7e228a46d3a5
+      size: 2038
     outs:
     - path: data/processed/corpus.joblib
       md5: 243212bb05cce5e3fdc72bfd2826d329
       size: 31612
   load_data:
-    cmd: python src/load_data.py
+    cmd: python src/pipeline/load_data.py
     deps:
-    - path: src/load_data.py
-      md5: e579b1f5296f89c5f22d8ac4af92e1c0
-      size: 913
+    - path: src/pipeline/load_data.py
+      md5: 4261731cb0748f8fa8805c370a2bafce
+      size: 919
     outs:
     - path: data/external/a1_RestaurantReviews_HistoricDump.tsv
       md5: 102f1f4193e0bdebdd6cce7f13e0a839
       size: 54686
+    - path: data/external/a2_RestaurantReviews_FreshDump.tsv
+      md5: 097c8b95f6b255e5a6a06b29d61fef8e
+      size: 6504
   training:
-    cmd: python src/training.py
+    cmd: python src/pipeline/training.py
     deps:
     - path: data/external/a1_RestaurantReviews_HistoricDump.tsv
       md5: 102f1f4193e0bdebdd6cce7f13e0a839
       size: 54686
     - path: data/processed/corpus.joblib
       md5: 243212bb05cce5e3fdc72bfd2826d329
       size: 31612
-    - path: src/evaluation.py
-      md5: 96c08113733680243cbc537a93cc128d
-      size: 396
-    - path: src/training.py
-      md5: 81ddde09ae93959e83afb4bae0ddd90a
-      size: 2073
+    - path: src/pipeline/preprocessing.py
+      md5: 2939fdfbdbb8254ed5ee7e228a46d3a5
+      size: 2038
+    - path: src/pipeline/training.py
+      md5: 71f7ea4f607346e17a2264aa12da221d
+      size: 1522
     outs:
     - path: data/models/c1_BoW_Sentiment_Model.pkl
-      md5: 47e4584e52d616cbb5af92f988648e27
+      md5: 7b5775b55574c74cf828b4577e73f26d
       size: 39823
     - path: data/models/c2_Classifier_Sentiment_Model
-      md5: e6e6744062a1d370a585d15df7f45934
+      md5: 527a8f24c9766cd8ec50d943997acb76
       size: 46127
-    - path: reports/model_evaluation.txt
-      md5: 35b131f5c189995225c586a8ae7025d9
-      size: 67
+  linting:
+    cmd: pylint src
+    deps:
+    - path: .pylintrc
+      md5: 93822e4a1f2eed84947a1ff37ec8e7ca
+      size: 18348
+  evaluation:
+    cmd: python src/pipeline/evaluation.py --output reports/model_evaluation.json
+    deps:
+    - path: data/models/c1_BoW_Sentiment_Model.pkl
+      md5: 7b5775b55574c74cf828b4577e73f26d
+      size: 39823
+    - path: data/models/c2_Classifier_Sentiment_Model
+      md5: 527a8f24c9766cd8ec50d943997acb76
+      size: 46127
+    - path: src/pipeline/evaluation.py
+      md5: 5b6b5bd1e1be639b55db7c430701fe23
+      size: 2046
+    - path: src/pipeline/preprocessing.py
+      md5: 2939fdfbdbb8254ed5ee7e228a46d3a5
+      size: 2038
+    outs:
+    - path: reports/model_evaluation.json
+      md5: ced9e7cf4502282c409734a3d577f195
+      size: 75
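Each entry in `dvc.lock` pairs a path with an `md5` digest and a `size` in bytes, which is how DVC detects whether a stage's dependencies or outputs have changed. As a rough illustration, the sketch below computes the same two fields for a file using only the standard library. The helper name `lock_entry` is hypothetical, and the digest is a plain MD5 over the raw bytes; DVC's own hashing may differ in details (for example, how text files are handled in some versions), so the values are not guaranteed to match `dvc.lock` exactly.

```python
import hashlib
import sys
from pathlib import Path


def lock_entry(path, chunk_size=1 << 20):
    """Return a dvc.lock-style record (path, md5, size) for one file.

    Hypothetical helper for illustration; not part of DVC's API.
    """
    digest = hashlib.md5()
    size = 0
    with open(path, "rb") as handle:
        # Read in chunks so large model files do not need to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
            size += len(chunk)
    return {"path": Path(path).as_posix(), "md5": digest.hexdigest(), "size": size}


if __name__ == "__main__":
    # Print one record per file given on the command line.
    for file_path in sys.argv[1:]:
        print(lock_entry(file_path))
```

Running it against a tracked file such as `data/processed/corpus.joblib` would give a record that can be compared by eye with the corresponding `dvc.lock` entry.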
diff --git a/dvc.yaml b/dvc.yaml
@@ -1,25 +1,44 @@
 stages:
+  linting:
+    cmd: pylint src
+    deps: 
+      - .pylintrc
   load_data:
-    cmd: python src/load_data.py
+    cmd: python src/pipeline/load_data.py
     deps:
-    - src/load_data.py
+    - src/pipeline/load_data.py
     outs:
     - data/external/a1_RestaurantReviews_HistoricDump.tsv
+    - data/external/a2_RestaurantReviews_FreshDump.tsv
   preprocessing:
-    cmd: python src/preprocessing.py
+    cmd: python src/pipeline/preprocessing.py
     deps:
-    - src/preprocessing.py
+    - src/pipeline/preprocessing.py
     - data/external/a1_RestaurantReviews_HistoricDump.tsv
     outs:
     - data/processed/corpus.joblib
   training:
-    cmd: python src/training.py
+    cmd: python src/pipeline/training.py
     deps:
-    - src/training.py
-    - src/evaluation.py
+    - src/pipeline/training.py
+    - src/pipeline/preprocessing.py
     - data/external/a1_RestaurantReviews_HistoricDump.tsv
     - data/processed/corpus.joblib
     outs:
     - data/models/c1_BoW_Sentiment_Model.pkl
     - data/models/c2_Classifier_Sentiment_Model
-    - reports/model_evaluation.txt
+  evaluation:
+    cmd: python src/pipeline/evaluation.py --output reports/model_evaluation.json
+    deps:
+    - src/pipeline/evaluation.py
+    - src/pipeline/preprocessing.py
+    - data/models/c1_BoW_Sentiment_Model.pkl
+    - data/models/c2_Classifier_Sentiment_Model
+    metrics:
+    - reports/model_evaluation.json
+  # testing:
+  #   cmd: pytest
+  #   deps: 
+  #   - src/pipeline/evaluation.py
+  #   - data/models/c1_BoW_Sentiment_Model.pkl
+  #   - data/models/c2_Classifier_Sentiment_Model
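The `evaluation` stage passes `--output reports/model_evaluation.json` and registers that file under `metrics`, which is what lets `dvc exp show` display its values. The contents of `src/pipeline/evaluation.py` are not shown in this part of the diff, so the following is only a minimal sketch of a script with that command-line shape. The function names, the `accuracy` key, and the placeholder labels are assumptions for illustration; the real script presumably loads the trained models and evaluates them on a test split.

```python
import argparse
import json
from pathlib import Path


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("label lists must be non-empty and of equal length")
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def write_metrics(output_path, metrics):
    """Write a JSON object that can be tracked as a DVC metrics file."""
    path = Path(output_path)
    # Create the reports/ directory on first run if it does not exist yet.
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(metrics), encoding="utf-8")


def main(argv=None):
    parser = argparse.ArgumentParser(description="Evaluate the sentiment model.")
    parser.add_argument("--output", required=True, help="path of the JSON metrics file")
    args = parser.parse_args(argv)

    # Placeholder labels (assumption); a real run would predict on held-out data.
    y_true = [1, 0, 1, 1]
    y_pred = [1, 0, 0, 1]
    write_metrics(args.output, {"accuracy": accuracy(y_true, y_pred)})


# A real script would call main() under an `if __name__ == "__main__":` guard.
```

Keeping the output path as an argument rather than hard-coding it means the same path appears once in `dvc.yaml`, both in `cmd` and under `metrics`.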
diff --git a/reports/.gitignore b/reports/.gitignore
@@ -1 +1 @@
-/model_evaluation.txt
+/model_evaluation.json
diff --git a/reports/pylint_report.txt b/reports/pylint_report.txt
@@ -2,21 +2,21 @@
 
 Report
 ======
-109 statements analysed.
+100 statements analysed.
 
 Statistics by type
 ------------------
 
 +---------+-------+-----------+-----------+------------+---------+
 |type     |number |old number |difference |%documented |%badname |
 +=========+=======+===========+===========+============+=========+
-|module   |7      |7          |=          |100.00      |0.00     |
+|module   |6      |6          |=          |100.00      |0.00     |
 +---------+-------+-----------+-----------+------------+---------+
 |class    |1      |1          |=          |100.00      |0.00     |
 +---------+-------+-----------+-----------+------------+---------+
 |method   |3      |3          |=          |100.00      |0.00     |
 +---------+-------+-----------+-----------+------------+---------+
-|function |3      |3          |=          |100.00      |0.00     |
+|function |1      |1          |=          |100.00      |0.00     |
 +---------+-------+-----------+-----------+------------+---------+
 
 
@@ -25,18 +25,21 @@ External dependencies
 ---------------------
 ::
 
-    joblib (src.main,src.preprocessing,src.training)
-    nltk (src.preprocessing)
-      \-corpus (src.preprocessing)
+    joblib (src.pipeline.evaluation,src.pipeline.preprocessing,src.pipeline.training)
+    nltk (src.pipeline.preprocessing)
+      \-corpus (src.pipeline.preprocessing)
       \-stem 
-        \-porter (src.preprocessing)
-    pandas (src.preprocessing,src.training)
+        \-porter (src.pipeline.preprocessing)
+    pandas (src.pipeline.evaluation,src.pipeline.preprocessing,src.pipeline.training)
     sklearn 
       \-feature_extraction 
-      | \-text (src.training)
-      \-metrics (src.evaluation)
-      \-model_selection (src.classification,src.training)
-      \-naive_bayes (src.classification,src.training)
+      | \-text (src.pipeline.training)
+      \-metrics (src.pipeline.evaluation)
+      \-model_selection (src.pipeline.evaluation,src.pipeline.training)
+      \-naive_bayes (src.pipeline.training)
+    src 
+      \-pipeline 
+        \-preprocessing (src.pipeline.evaluation)
 
 
 
@@ -46,13 +49,13 @@ Raw metrics
 +----------+-------+------+---------+-----------+
 |type      |number |%     |previous |difference |
 +==========+=======+======+=========+===========+
-|code      |122    |49.19 |122      |=          |
+|code      |110    |53.66 |110      |=          |
 +----------+-------+------+---------+-----------+
-|docstring |32     |12.90 |32       |=          |
+|docstring |22     |10.73 |22       |=          |
 +----------+-------+------+---------+-----------+
-|comment   |37     |14.92 |37       |=          |
+|comment   |26     |12.68 |26       |=          |
 +----------+-------+------+---------+-----------+
-|empty     |57     |22.98 |57       |=          |
+|empty     |47     |22.93 |47       |=          |
 +----------+-------+------+---------+-----------+
 
 
@@ -76,7 +79,7 @@ Messages by category
 +-----------+-------+---------+-----------+
 |type       |number |previous |difference |
 +===========+=======+=========+===========+
-|convention |0      |1        |1          |
+|convention |0      |0        |0          |
 +-----------+-------+---------+-----------+
 |refactor   |0      |0        |0          |
 +-----------+-------+---------+-----------+
@@ -97,6 +100,6 @@ Messages
 
 
 
--------------------------------------------------------------------
-Your code has been rated at 10.00/10 (previous run: 9.91/10, +0.09)
+--------------------------------------------------------------------
+Your code has been rated at 10.00/10 (previous run: 10.00/10, +0.00)
 
diff --git a/requirements.txt b/requirements.txt
@@ -7,3 +7,4 @@ dvc_gdrive==2.19.2
 pylint==2.12.2
 mllint==0.12.2
 dslinter==2.0.9
+pytest==7.3.1
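With `pytest` added to the requirements, test functions placed under `tests/` are discovered by their `test_` prefix. The actual tests are not visible in this part of the diff, so the sketch below only illustrates the kind of check such a suite might contain: it verifies that a metrics file holds an accuracy between 0 and 1. The function names and the `accuracy` key are assumptions, and the file is written to a temporary directory rather than read from `reports/`.

```python
import json
from pathlib import Path


def load_accuracy(metrics_path):
    """Read the accuracy value from a JSON metrics file (key name assumed)."""
    with open(metrics_path, encoding="utf-8") as handle:
        return float(json.load(handle)["accuracy"])


def test_metrics_file_has_valid_accuracy(tmp_path: Path):
    # tmp_path is pytest's built-in temporary-directory fixture.
    metrics_file = tmp_path / "model_evaluation.json"
    metrics_file.write_text(json.dumps({"accuracy": 0.8}), encoding="utf-8")

    value = load_accuracy(metrics_file)

    # An accuracy outside [0, 1] would indicate a broken evaluation step.
    assert 0.0 <= value <= 1.0
```

A variant of this test could point `load_accuracy` at the real `reports/model_evaluation.json` after a pipeline run and compare the value against a minimum threshold chosen by the team.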
diff --git a/src/__init__.py b/src/__init__.py
@@ -1,3 +1,3 @@
-"""Top-level package for model-training."""
+"""src init"""
 
 __author__ = """Team 08"""
diff --git a/src/classification.py b/src/classification.py
diff --git a/src/evaluation.py b/src/evaluation.py