Contains the ML training pipeline used for the main project of the course CS4295: Release Engineering for Machine Learning Applications. The pipeline trains an ML model that evaluates restaurant reviews. The repository structure is based on the Cookiecutter template.
## **Dependencies**
This project uses Poetry instead of pip to manage dependencies. Poetry is a Python dependency management tool that simplifies dependency management and packaging. Poetry also manages the virtual environment from which the project is run, so the user does not need to create one manually.
### **Installation (Poetry)**
To install Poetry, follow the instructions for your operating system on the [Poetry website](https://python-poetry.org/docs/#installation).
### **Installing dependencies**
To install the project dependencies, please run the following command:
```bash
poetry install
```
This will install all dependencies listed in `pyproject.toml` and create a virtual environment for the project. Instead of using `pip` to install each dependency and then running it from a manually created virtual environment, Poetry handles this for you.
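For example, to run a one-off command inside the Poetry-managed environment, or to open a shell with that environment activated (a built-in command in Poetry 1.x):

```bash
# Run a single command inside the project's virtual environment
poetry run python --version

# Spawn a shell with the virtual environment activated (Poetry 1.x)
poetry shell
```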
### **Adding a new dependency**
To add a new dependency, please run the following command:
```bash
poetry add <dependency-name>
```
This will add the dependency to `pyproject.toml` and install it in the virtual environment.
However, if you would like to install a dependency for development purposes, please run the following command:
```bash
poetry add --dev <dependency-name>
```
In any case, dependency changes will also show up in the `poetry.lock` file. This file ensures that all developers use exactly the same versions of the dependencies, so it is good practice (and recommended) to commit it to version control.
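If you edit `pyproject.toml` by hand rather than via `poetry add`, the lock file can be refreshed before reinstalling:

```bash
# Re-resolve and rewrite poetry.lock after manual pyproject.toml edits
poetry lock

# Install exactly the versions pinned in poetry.lock
poetry install
```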
## **Usage**
In order to run the pipeline, ensure that you have `dvc` installed and run the following command:
```bash
dvc repro
```

To view a graphical representation of the pipeline, run the following command:

```bash
dvc dag
```
### **Remote**
A Google Drive folder has been configured as remote storage.
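For reference, a Google Drive remote is configured in DVC roughly as follows; the remote name and folder ID below are placeholders, as the actual remote is already defined in this repository's DVC config:

```bash
# Register a Google Drive folder as the default remote (illustrative values)
dvc remote add -d storage gdrive://<folder-id>

# Push tracked data and models to the remote, or fetch them
dvc push
dvc pull
```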
### **Testing**
In order to test the ML pipeline, several tests are performed, which can be found in `tests/`. These are run automatically as part of the pipeline. They can be run manually using the following command:
```bash
poetry run pytest
```
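Standard pytest options work through Poetry as well; for instance, to run a subset of tests by keyword (the keyword below is only an example):

```bash
# Run only tests whose names match the given keyword, with verbose output
poetry run pytest tests/ -k "preprocess" -v
```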
### **Metrics**
The accuracy metric is stored in `reports/model_evaluation.json`. In order to see the experiment history, run the following command:
```bash
dvc exp show
```
Two experiments are listed, comparing the use of a 20% and 10% test split size.
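A new experiment with a different split size can be queued with DVC's parameter override flag; the parameter name below is illustrative and should match the one defined in this repository's `params.yaml`:

```bash
# Run an experiment overriding the test split parameter (name is illustrative)
dvc exp run --set-param test_split=0.1
```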
### **Dataset**
The project was created using the dataset provided by the course instructors on [SURFdrive](https://surfdrive.surf.nl/files/index.php/s/207BTysNQFuVZPE?path=%2Fmaterial).
### **Preprocessing**
Any preprocessing steps can be found in `preprocessing.py`. These are executed automatically when the pipeline runs. Processed data (the corpus) is stored in `data/processed/`.
### **Storing the trained model**
The trained model is stored in `data/models/`.
## **Pylint & DSLinter**
Pylint and DSLinter have been used and configured to ensure code quality. All configuration options can be found in `.pylintrc`. This configuration file is based on [this example from the DSLinter documentation](https://github.com/SERG-Delft/dslinter/blob/main/docs/pylint-configuration-examples/pylintrc-for-ml-projects/.pylintrc). On top of that, there are a few custom changes, such as adding variable names like `X_train` and `X_test` to Pylint's list of accepted names, as these are commonly used in ML applications. The `init-hook` option in `.pylintrc` is also set to the path of this directory, so that imports within the code do not trigger Pylint warnings.
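The customizations described above correspond roughly to the following `.pylintrc` fragment (the values shown are illustrative, not copied from the actual file):

```ini
[MASTER]
# Point Pylint at the project root so local imports resolve
init-hook='import sys; sys.path.append(".")'

[BASIC]
# Accept common ML variable names such as X_train and X_test
good-names=X_train,X_test,y_train,y_test
```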
If you would like to manually verify the code quality, please run the following command:
```bash
poetry run pylint src
```
DSLinter is configured and will run automatically; it should report a perfect score of 10.00. A report summarising the findings can be found in `data/reports/`.