Commit 5555f07

Feature/docs and poetry (#21)
* feat: modify readme and restrict python version.
* feat: modify readme and restrict python version.
* feat: modify readme and restrict python version.
* feat: modify readme and restrict python version.
1 parent 7dc5be7 commit 5555f07

3 files changed: +244 -194 lines changed

README.md

+55 -9
@@ -1,21 +1,58 @@
11
# model-training
22
Contains the ML training pipeline used for the main project of course CS4295: Release Engineering for Machine Learning Applications. This pipeline is of an ML model that evaluates restaurant reviews. The repository structure is based off the Cookiecutter template.
33

4-
## **Pre-requisites**
5-
6-
* Python = `3.8.*`
4+
## :books: **Table of Contents**
5+
6+
- [model-training](#model-training)
7+
- [:books: **Table of Contents**](#books-table-of-contents)
8+
- [:scroll: **Pre-requisites**](#scroll-pre-requisites)
9+
- [:gear: **Poetry Setup**](#gear-poetry-setup)
10+
- [**Installation (Poetry)**](#installation-poetry)
11+
- [**Python Version**](#python-version)
12+
- [**Installing dependencies**](#installing-dependencies)
13+
- [**Adding a new dependency**](#adding-a-new-dependency)
14+
- [**The `pyproject.toml` Configuration**](#the-pyprojecttoml-configuration)
15+
- [:rocket: **Pipeline Usage**](#rocket-pipeline-usage)
16+
- [**Remote**](#remote)
17+
- [**Testing**](#testing)
18+
- [**Metrics**](#metrics)
19+
- [**Dataset**](#dataset)
20+
- [**Preprocessing**](#preprocessing)
21+
- [**Storing the trained model**](#storing-the-trained-model)
22+
- [:clipboard: **Linting**](#clipboard-linting)
23+
- [:art: **Formatting (isort \& black)**](#art-formatting-isort--black)
24+
- [**isort**](#isort)
25+
- [**black**](#black)
26+
27+
## :scroll: **Pre-requisites**
28+
29+
* Python >= `3.8.*` and <= `3.10.*`
30+
* Installation varies per Python version and OS. Please refer to the [Python website](https://www.python.org/downloads/) for more details.
731
* Poetry
32+
* Refer to the [Installation (Poetry)](#installation-poetry) section for more details
33+
* DVC
34+
* See installation instructions [here](https://dvc.org/doc/install)
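
The prerequisites above can be checked before running anything else. The following is a minimal sketch, assuming `poetry` and `dvc` are installed as standalone executables on the `PATH` (if `dvc` is only available inside the Poetry virtual environment, this check will report it as missing). The names `REQUIRED_TOOLS` and `missing_tools` are illustrative and not part of the repository.

```python
import shutil

# Command-line tools listed in the pre-requisites (assumed to be on PATH).
REQUIRED_TOOLS = ("poetry", "dvc")


def missing_tools(tools=REQUIRED_TOOLS):
    """Return the tools from `tools` that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Missing tools: " + ", ".join(missing))
    else:
        print("All required tools were found on PATH")
```
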
835

936
This project uses Poetry instead of Pip to manage dependencies. Poetry is a Python dependency management tool that simplifies dependency management and packaging. Poetry also manages the virtual environment from which the project is run, so you do not need to create one manually. Make sure you have Poetry installed before proceeding with the next sections.
1037

11-
> If you are not familiar with Poetry, you can find additional details about the setup by referring to the [Poetry Setup](#petry-setup) section.
38+
> **Note:** If you are not familiar with Poetry, you can find additional details about the setup in the [Poetry Setup](#gear-poetry-setup) section. If you have experience with it, you can skip ahead to the [Pipeline Usage](#rocket-pipeline-usage) section.
1239
13-
## **Poetry Setup**
40+
## :gear: **Poetry Setup**
1441

1542
### **Installation (Poetry)**
1643

1744
To install Poetry, follow the instructions for your operating system on the [Poetry website](https://python-poetry.org/docs/#installation).
1845

46+
### **Python Version**
47+
48+
Poetry for this project is configured to use any Python version in the range `3.8.*` to `3.10.*`. If you are using a different version of Python, you need to install a supported version and configure your Poetry environment to use it. For example, to use `python3.8` you can run the following command:
49+
50+
```bash
51+
poetry env use python3.8
52+
```
53+
54+
> **Note**: If the interpreter is not on your `$PATH`, pass the full path to the Python executable instead of `python3.8`. On Linux-based systems where the interpreter is already on the `$PATH`, you can run `poetry env use $(which python3.8)` to resolve the path automatically. If you installed the Python binary in a different location, you must pass the correct path to the executable.
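
To confirm that the interpreter selected by Poetry falls inside the supported range, a small check along these lines may help. It is a sketch only: the bounds come from the range stated above, and the names `MIN_VERSION`, `MAX_VERSION` and `is_supported` are illustrative rather than part of the project.

```python
import sys

# Supported range as stated in this README: 3.8.* through 3.10.*
MIN_VERSION = (3, 8)
MAX_VERSION = (3, 10)


def is_supported(version_info=None):
    """Return True if the (major, minor) pair lies within the supported range."""
    if version_info is None:
        version_info = sys.version_info
    major_minor = tuple(version_info[:2])
    return MIN_VERSION <= major_minor <= MAX_VERSION


if __name__ == "__main__":
    current = ".".join(str(part) for part in sys.version_info[:3])
    status = "supported" if is_supported() else "outside the supported range"
    print(f"Python {current} is {status}")
```

Saved under a hypothetical name such as `check_python.py`, it could be run inside the Poetry environment with `poetry run python check_python.py`.
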
55+
1956
### **Installing dependencies**
2057

2158
To install the project dependencies, please run the following command:
@@ -52,20 +89,29 @@ The `pyproject.toml` file is used to configure the project by managing dependenc
5289
* What sources `bandit` should analyze
5390
* etc.
5491

55-
## **Pipeline Usage**
92+
## :rocket: **Pipeline Usage**
5693

5794
In order to run the pipeline, ensure that you have `dvc` installed and run the following command:
5895

5996
```bash
6097
poetry run dvc exp run
6198
```
6299

63-
This will automatically download the dataset from an external source, pre-process the dataset, train the model and save the evaluation results in `reports/model_evaluation.json`. Tests will also automatically be ran. Linting via Pylint and DSLinter is also automatically run as part of the pipeline.
100+
Alternatively, you can also run the following command:
101+
102+
```bash
103+
poetry run dvc repro
104+
```
105+
106+
Both of these commands will automatically download the dataset from an external source, pre-process the dataset, train the model and save the evaluation results in `reports/model_evaluation.json`. Tests are also run automatically, as is linting via Pylint and DSLinter, as part of the pipeline.
107+
108+
> **Note**: The aforementioned commands will produce reports in the `reports/` folder. Some of these reports relate to the testing phase, namely the `tests-report.xml` and `coverage-report.xml`, whereas the rest relate to `mllint` and `pylint` scores.
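
After a pipeline run, the evaluation results can be inspected programmatically. The sketch below assumes only that `reports/model_evaluation.json` contains valid JSON; the schema is not documented here, so the code prints whatever top-level keys it finds. The helper names `load_metrics` and `format_metrics` are illustrative.

```python
import json
from pathlib import Path

# Location of the evaluation results, as stated in this README.
REPORT_PATH = Path("reports/model_evaluation.json")


def load_metrics(path=REPORT_PATH):
    """Parse the evaluation report and return its JSON content."""
    with Path(path).open(encoding="utf-8") as handle:
        return json.load(handle)


def format_metrics(metrics):
    """Return one 'name: value' line per top-level key, sorted by name."""
    if not isinstance(metrics, dict):
        # The schema is an assumption; fall back to a plain representation.
        return [repr(metrics)]
    return [f"{name}: {value}" for name, value in sorted(metrics.items())]


if __name__ == "__main__":
    if REPORT_PATH.exists():
        for line in format_metrics(load_metrics()):
            print(line)
    else:
        print(f"{REPORT_PATH} not found; run the pipeline first")
```
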
64109
65110
To view a graphical representation of the pipeline, run the following command:
66111
``` bash
67112
poetry run dvc dag
68113
```
114+
69115
### **Remote**
70116

71117
A Google Drive folder has been configured to be used as remote storage.
@@ -101,7 +147,7 @@ Any preprocessing steps can be found in `preprocessing.py`. These are executed a
101147

102148
The trained model is stored in `data/models/`.
103149

104-
## **Linting**
150+
## :clipboard: **Linting**
105151
We are using the mllint tool to check for common mistakes in ML projects (formatting, tests, general good practice rules). The report that was used in the latest run of the pipeline can be found within `reports/mllint_report.md`.
106152

107153
> Note: The mllint tool combines multiple linters and uses rules for testing, configuration and other topics that are specific to ML projects. You can find the official source code for the tool [here](https://github.com/bvobart/mllint).
@@ -117,7 +163,7 @@ poetry run mllint
117163
This will run mllint, which includes several linters. DSLinter is configured and will automatically run. This should return a perfect score of 10.00. A report summarising the findings can be found in `reports/mllint_report.md`.
118164

119165

120-
## **Formatting (isort & black)**
166+
## :art: **Formatting (isort & black)**
121167

122168
The project uses `isort` and `black` to format the code. `isort` is used to sort the imports in the code, while `black` is used to format the code itself. Both of these tools are configured in `pyproject.toml`.
123169
