Implemented testing and updated DVC pipeline with metrics (#7)
JvanderSaag authored Jun 5, 2023
1 parent 259c34c commit f39ff60
Showing 19 changed files with 231 additions and 172 deletions.
2 changes: 2 additions & 0 deletions .dvc/config
@@ -1,2 +1,4 @@
[core]
remote = gdrive_remote
['remote "gdrive_remote"']
url = gdrive://1NY8yEl6N1ZhE-q9jnEt6G6cqIHyEiCKc
45 changes: 34 additions & 11 deletions README.md
@@ -1,27 +1,50 @@
# model-training
Contains the ML training pipeline used for the main project of course CS4295: Release Engineering for Machine Learning Applications. This pipeline is of an ML model that evaluates restaurant reviews.
Contains the ML training pipeline used for the main project of course CS4295: Release Engineering for Machine Learning Applications. This pipeline is of an ML model that evaluates restaurant reviews. The repository structure is based on the Cookiecutter template.

## Dependencies
All required packages can be found in dep/requirements.txt. To install the required packages, run the following command:
All required packages can be found in `requirements.txt`. To install the required packages, run the following command:

```bash
pip install -r requirements.txt
```

## Dataset
Project was created using the dataset provided by course instructors on [SURFdrive](https://surfdrive.surf.nl/files/index.php/s/207BTysNQFuVZPE?path=%2Fmaterial)
## Usage
In order to run the pipeline, ensure that you have `dvc` installed and run the following command:

To use this repository, please download the archive at the link above, unzip the archive and place the `restaurant-sentiment/` folder in the main `model-training` folder.
```bash
dvc exp run
```

## Usage
To manually generate a new ML model, execute `main.py`.
This automatically downloads the dataset from an external source, preprocesses it, trains the model, and saves the evaluation results in `reports/model_evaluation.json`. Tests and linting (Pylint and DSLinter) are also run automatically as part of the pipeline.

### Preprocessing
Any preprocessing steps can be found in `preprocessing.py`. These are executed automatically with the execution of `main.py`.
To view a graphical representation of the pipeline, run the following command:
``` bash
dvc dag
```
### Remote
A Google Drive folder is configured as remote storage.

### Testing
The tests for the ML pipeline can be found in `tests/`. They are run automatically as part of the pipeline and can also be run manually with the following command:

```bash
pytest
```
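
As a rough illustration of the kind of check that could live in `tests/`, the sketch below validates a metrics file. It is not taken from the repository: the helper `load_accuracy`, the test name, and the `accuracy` JSON key are all assumptions made for the example.

```python
"""Hypothetical test sketch; not taken from the repository's tests/ folder."""
import json
from pathlib import Path


def load_accuracy(metrics_path):
    """Read the 'accuracy' value (an assumed key) from a JSON metrics file."""
    with Path(metrics_path).open(encoding="utf-8") as handle:
        metrics = json.load(handle)
    return metrics["accuracy"]


def test_accuracy_is_valid_fraction(tmp_path):
    """pytest injects tmp_path, a per-test temporary directory."""
    # Write a stand-in metrics file instead of depending on a trained model.
    metrics_file = tmp_path / "model_evaluation.json"
    metrics_file.write_text(json.dumps({"accuracy": 0.75}), encoding="utf-8")

    # An accuracy is a fraction of correct predictions, so it must lie in [0, 1].
    assert 0.0 <= load_accuracy(metrics_file) <= 1.0
```

Saved under `tests/` with a `test_` prefix in the file name, a function like this would be collected by a plain `pytest` run.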
### Metrics
The accuracy metric is stored in `reports/model_evaluation.json`. In order to see the experiment history, run the following command:

```bash
dvc exp show
```
Two experiments are listed, comparing test split sizes of 20% and 10%.
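
A minimal sketch of how an evaluation step might write the accuracy metric to a JSON file that DVC can track. The `--output` flag mirrors the `evaluation` stage command in `dvc.yaml`; the function names, the `accuracy` key, and the placeholder labels are hypothetical and do not reproduce the repository's `evaluation.py`.

```python
"""Hypothetical sketch of an evaluation step that writes a JSON metrics file."""
import argparse
import json
from pathlib import Path


def accuracy(y_true, y_pred):
    """Return the fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("cannot compute accuracy on empty input")
    correct = sum(1 for truth, pred in zip(y_true, y_pred) if truth == pred)
    return correct / len(y_true)


def write_metrics(output_path, metrics):
    """Serialise a metrics dictionary to JSON, creating parent folders."""
    path = Path(output_path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", encoding="utf-8") as handle:
        json.dump(metrics, handle, indent=2)


def main(argv=None):
    """Parse --output and write the metrics file to that location."""
    parser = argparse.ArgumentParser(description="Write evaluation metrics.")
    parser.add_argument("--output", default="reports/model_evaluation.json")
    args = parser.parse_args(argv)

    # Placeholder labels (assumption): a real step would load the trained
    # model and the held-out test split instead.
    y_true = [1, 0, 1, 1]
    y_pred = [1, 0, 0, 1]
    write_metrics(args.output, {"accuracy": accuracy(y_true, y_pred)})
```

Calling `main(["--output", "reports/model_evaluation.json"])` writes the file; a flat JSON object of numeric values like this is a form that `dvc exp show` can display as a metrics column.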
### Dataset
The project was created using the dataset provided by the course instructors on [SURFdrive](https://surfdrive.surf.nl/files/index.php/s/207BTysNQFuVZPE?path=%2Fmaterial).

### Preprocessing
Any preprocessing steps can be found in `preprocessing.py`. These are executed automatically with the execution of the pipeline. Processed data (corpus) is stored in `data/processed/`.

### Storing the trained model
The trained model is stored in the `res/` folder.
The trained model is stored in `data/models/`.


## Pylint & DSLinter
@@ -33,4 +56,4 @@ If you would like to manually verify the code quality, please run the following
pylint src
```

DSLinter is configured and will automatically run. This should return a perfect score of 10.00. A report summarising the findings can be found in the `data/reports` folder.
DSLinter is configured and will automatically run. This should return a perfect score of 10.00. A report summarising the findings can be found in `data/reports/`.
1 change: 0 additions & 1 deletion data/external/.data

This file was deleted.

6 changes: 4 additions & 2 deletions data/models/.gitignore
@@ -1,2 +1,4 @@
c1_BoW_Sentiment_Model.pkl
c2_Classifier_Sentiment_Model
*
!/**/
!*.*
*.pkl
65 changes: 45 additions & 20 deletions dvc.lock
@@ -1,50 +1,75 @@
schema: '2.0'
stages:
preprocessing:
cmd: python src/preprocessing.py
cmd: python src/pipeline/preprocessing.py
deps:
- path: data/external/a1_RestaurantReviews_HistoricDump.tsv
md5: 102f1f4193e0bdebdd6cce7f13e0a839
size: 54686
- path: src/preprocessing.py
md5: b45d76ab50b20ccabfb50d591ee7ef02
size: 2034
- path: src/pipeline/preprocessing.py
md5: 2939fdfbdbb8254ed5ee7e228a46d3a5
size: 2038
outs:
- path: data/processed/corpus.joblib
md5: 243212bb05cce5e3fdc72bfd2826d329
size: 31612
load_data:
cmd: python src/load_data.py
cmd: python src/pipeline/load_data.py
deps:
- path: src/load_data.py
md5: e579b1f5296f89c5f22d8ac4af92e1c0
size: 913
- path: src/pipeline/load_data.py
md5: 4261731cb0748f8fa8805c370a2bafce
size: 919
outs:
- path: data/external/a1_RestaurantReviews_HistoricDump.tsv
md5: 102f1f4193e0bdebdd6cce7f13e0a839
size: 54686
- path: data/external/a2_RestaurantReviews_FreshDump.tsv
md5: 097c8b95f6b255e5a6a06b29d61fef8e
size: 6504
training:
cmd: python src/training.py
cmd: python src/pipeline/training.py
deps:
- path: data/external/a1_RestaurantReviews_HistoricDump.tsv
md5: 102f1f4193e0bdebdd6cce7f13e0a839
size: 54686
- path: data/processed/corpus.joblib
md5: 243212bb05cce5e3fdc72bfd2826d329
size: 31612
- path: src/evaluation.py
md5: 96c08113733680243cbc537a93cc128d
size: 396
- path: src/training.py
md5: 81ddde09ae93959e83afb4bae0ddd90a
size: 2073
- path: src/pipeline/preprocessing.py
md5: 2939fdfbdbb8254ed5ee7e228a46d3a5
size: 2038
- path: src/pipeline/training.py
md5: 71f7ea4f607346e17a2264aa12da221d
size: 1522
outs:
- path: data/models/c1_BoW_Sentiment_Model.pkl
md5: 47e4584e52d616cbb5af92f988648e27
md5: 7b5775b55574c74cf828b4577e73f26d
size: 39823
- path: data/models/c2_Classifier_Sentiment_Model
md5: e6e6744062a1d370a585d15df7f45934
md5: 527a8f24c9766cd8ec50d943997acb76
size: 46127
- path: reports/model_evaluation.txt
md5: 35b131f5c189995225c586a8ae7025d9
size: 67
linting:
cmd: pylint src
deps:
- path: .pylintrc
md5: 93822e4a1f2eed84947a1ff37ec8e7ca
size: 18348
evaluation:
cmd: python src/pipeline/evaluation.py --output reports/model_evaluation.json
deps:
- path: data/models/c1_BoW_Sentiment_Model.pkl
md5: 7b5775b55574c74cf828b4577e73f26d
size: 39823
- path: data/models/c2_Classifier_Sentiment_Model
md5: 527a8f24c9766cd8ec50d943997acb76
size: 46127
- path: src/pipeline/evaluation.py
md5: 5b6b5bd1e1be639b55db7c430701fe23
size: 2046
- path: src/pipeline/preprocessing.py
md5: 2939fdfbdbb8254ed5ee7e228a46d3a5
size: 2038
outs:
- path: reports/model_evaluation.json
md5: ced9e7cf4502282c409734a3d577f195
size: 75
35 changes: 27 additions & 8 deletions dvc.yaml
@@ -1,25 +1,44 @@
stages:
linting:
cmd: pylint src
deps:
- .pylintrc
load_data:
cmd: python src/load_data.py
cmd: python src/pipeline/load_data.py
deps:
- src/load_data.py
- src/pipeline/load_data.py
outs:
- data/external/a1_RestaurantReviews_HistoricDump.tsv
- data/external/a2_RestaurantReviews_FreshDump.tsv
preprocessing:
cmd: python src/preprocessing.py
cmd: python src/pipeline/preprocessing.py
deps:
- src/preprocessing.py
- src/pipeline/preprocessing.py
- data/external/a1_RestaurantReviews_HistoricDump.tsv
outs:
- data/processed/corpus.joblib
training:
cmd: python src/training.py
cmd: python src/pipeline/training.py
deps:
- src/training.py
- src/evaluation.py
- src/pipeline/training.py
- src/pipeline/preprocessing.py
- data/external/a1_RestaurantReviews_HistoricDump.tsv
- data/processed/corpus.joblib
outs:
- data/models/c1_BoW_Sentiment_Model.pkl
- data/models/c2_Classifier_Sentiment_Model
- reports/model_evaluation.txt
evaluation:
cmd: python src/pipeline/evaluation.py --output reports/model_evaluation.json
deps:
- src/pipeline/evaluation.py
- src/pipeline/preprocessing.py
- data/models/c1_BoW_Sentiment_Model.pkl
- data/models/c2_Classifier_Sentiment_Model
metrics:
- reports/model_evaluation.json
# testing:
# cmd: pytest
# deps:
# - src/pipeline/evaluation.py
# - data/models/c1_BoW_Sentiment_Model.pkl
# - data/models/c2_Classifier_Sentiment_Model
2 changes: 1 addition & 1 deletion reports/.gitignore
@@ -1 +1 @@
/model_evaluation.txt
/model_evaluation.json
41 changes: 22 additions & 19 deletions reports/pylint_report.txt
@@ -2,21 +2,21 @@

Report
======
109 statements analysed.
100 statements analysed.

Statistics by type
------------------

+---------+-------+-----------+-----------+------------+---------+
|type |number |old number |difference |%documented |%badname |
+=========+=======+===========+===========+============+=========+
|module |7 |7 |= |100.00 |0.00 |
|module |6 |6 |= |100.00 |0.00 |
+---------+-------+-----------+-----------+------------+---------+
|class |1 |1 |= |100.00 |0.00 |
+---------+-------+-----------+-----------+------------+---------+
|method |3 |3 |= |100.00 |0.00 |
+---------+-------+-----------+-----------+------------+---------+
|function |3 |3 |= |100.00 |0.00 |
|function |1 |1 |= |100.00 |0.00 |
+---------+-------+-----------+-----------+------------+---------+


@@ -25,18 +25,21 @@ External dependencies
---------------------
::

joblib (src.main,src.preprocessing,src.training)
nltk (src.preprocessing)
\-corpus (src.preprocessing)
joblib (src.pipeline.evaluation,src.pipeline.preprocessing,src.pipeline.training)
nltk (src.pipeline.preprocessing)
\-corpus (src.pipeline.preprocessing)
\-stem
\-porter (src.preprocessing)
pandas (src.preprocessing,src.training)
\-porter (src.pipeline.preprocessing)
pandas (src.pipeline.evaluation,src.pipeline.preprocessing,src.pipeline.training)
sklearn
\-feature_extraction
| \-text (src.training)
\-metrics (src.evaluation)
\-model_selection (src.classification,src.training)
\-naive_bayes (src.classification,src.training)
| \-text (src.pipeline.training)
\-metrics (src.pipeline.evaluation)
\-model_selection (src.pipeline.evaluation,src.pipeline.training)
\-naive_bayes (src.pipeline.training)
src
\-pipeline
\-preprocessing (src.pipeline.evaluation)



@@ -46,13 +49,13 @@ Raw metrics
+----------+-------+------+---------+-----------+
|type |number |% |previous |difference |
+==========+=======+======+=========+===========+
|code |122 |49.19 |122 |= |
|code |110 |53.66 |110 |= |
+----------+-------+------+---------+-----------+
|docstring |32 |12.90 |32 |= |
|docstring |22 |10.73 |22 |= |
+----------+-------+------+---------+-----------+
|comment |37 |14.92 |37 |= |
|comment |26 |12.68 |26 |= |
+----------+-------+------+---------+-----------+
|empty |57 |22.98 |57 |= |
|empty |47 |22.93 |47 |= |
+----------+-------+------+---------+-----------+


@@ -76,7 +79,7 @@ Messages by category
+-----------+-------+---------+-----------+
|type |number |previous |difference |
+===========+=======+=========+===========+
|convention |0 |1 |1 |
|convention |0 |0 |0 |
+-----------+-------+---------+-----------+
|refactor |0 |0 |0 |
+-----------+-------+---------+-----------+
@@ -97,6 +100,6 @@ Messages



-------------------------------------------------------------------
Your code has been rated at 10.00/10 (previous run: 9.91/10, +0.09)
--------------------------------------------------------------------
Your code has been rated at 10.00/10 (previous run: 10.00/10, +0.00)

1 change: 1 addition & 0 deletions requirements.txt
@@ -7,3 +7,4 @@ dvc_gdrive==2.19.2
pylint==2.12.2
mllint==0.12.2
dslinter==2.0.9
pytest==7.3.1
2 changes: 1 addition & 1 deletion src/__init__.py
@@ -1,3 +1,3 @@
"""Top-level package for model-training."""
"""src init"""

__author__ = """Team 08"""
30 changes: 0 additions & 30 deletions src/classification.py

This file was deleted.

17 changes: 0 additions & 17 deletions src/evaluation.py

This file was deleted.
