Fixed import issues, and got mllint to return near-perfect score (#20)
* Fixed module import

* Fixed import problems and linting issues

* Update gitignore to ignore all reports

* Added placeholder test to be expanded later

* Stop tracking mllint report

* Updated dvc pipeline with proper configurations

* Updated readme with new dvc and linting instructions
JvanderSaag authored Jun 21, 2023
1 parent 44585a0 commit cfe068b
Showing 22 changed files with 290 additions and 573 deletions.
23 changes: 23 additions & 0 deletions .mllint.yml
@@ -0,0 +1,23 @@
rules:
  disabled: []
  custom: []
git:
  maxFileSize: 10000000
code-quality:
  linters:
    - pylint
    - black
    - isort
    - bandit
testing:
  report: "reports/tests-report.xml"
  targets:
    minimum: 1
    ratio:
      tests: 1
      other: 4
  coverage:
    report: "reports/coverage-report.xml"
    targets:
      line: 80

2 changes: 1 addition & 1 deletion .pylintrc
@@ -193,7 +193,7 @@ evaluation=10.0 - ((float(5 * error + warning + refactor + convention) / stateme
# Set the output format. Available formats are text, parseable, colorized, json
# and msvs (visual studio). You can also give a reporter class, e.g.
# mypackage.mymodule.MyReporterClass.
output-format=text:reports/pylint_report.txt,colorized
#output-format=text:reports/pylint_report.txt,colorized

# Tells whether to display a full report or only the messages.
reports=y
114 changes: 57 additions & 57 deletions README.md
@@ -3,27 +3,68 @@ Contains the ML training pipeline used for the main project of course CS4295: Re

## **Pre-requisites**

* Python >= `3.8`
* Python = `3.8.*`
* Poetry
* DVC

This project is using Poetry instead of Pip to manage dependencies. Poetry is a Python dependency management tool that simplifies the process of managing dependencies and packaging. Additionally, Poetry is also used to manage the virtual environment from which the project is run, thus not requiring the user to manually create a virtual environment. As such, make sure you have poetry installed before proceeding with the next sections.

> If you are not familiar with Poetry, you can find additional details about the setup by referring to the [Poetry Setup](#poetry-setup) section.
> If you are not familiar with Poetry, you can find additional details about the setup by referring to the [Poetry Setup](#petry-setup) section.
## **Usage**
## **Poetry Setup**

### **Installation (Poetry)**

To install Poetry, please follow the instructions on the [Poetry website](https://python-poetry.org/docs/#installation) and follow the corresponding steps for your operating system.

### **Installing dependencies**

To install the project dependencies, please run the following command:

```bash
poetry install
```

This will install all dependencies listed in `pyproject.toml` and create a virtual environment for the project. As such, instead of using `pip` to install a specific dependency and then run that dependency in a virtual environment, Poetry will handle this for you.

### **Adding a new dependency**

To add a new dependency, please run the following command:

```bash
poetry add <dependency-name>
```

This will add the dependency to `pyproject.toml` and install it in the virtual environment.
However, if you would like to install a dependency for development purposes, please run the following command:

```bash
poetry add --dev <dependency-name>
```

In any case, dependency changes will also show up in the `poetry.lock` file. This file is used to ensure that all developers are using the same versions of the dependencies. Consequently, it is good practice and actually recommended that this file is committed to version control.

### **The `pyproject.toml` Configuration**

The `pyproject.toml` file is used to configure the project by managing dependencies and configuring poetry itself. It is also used to configure additional behaviours for linting and testing - essentially acting as a configuration file for the dependencies used in the project. For example, the `pyproject.toml` file in this project is used to configure the following:
* The Python version
* The project name
* What profile `isort` should use
* What sources `bandit` should analyze
* etc.

## **Pipeline Usage**

In order to run the pipeline, ensure that you have `dvc` installed and run the following command:

```bash
dvc exp run
poetry run dvc exp run
```

This will automatically download the dataset from an external source, pre-process the dataset, train the model and save the evaluation results in `reports/model_evaluation.json`. Tests will also be run automatically. Linting via Pylint and DSLinter is also automatically run as part of the pipeline.

To view a graphical representation of the pipeline, run the following command:
``` bash
dvc dag
poetry run dvc dag
```
### **Remote**

@@ -37,12 +78,14 @@ In order to test the ML pipeline, several tests are performed which can be found
poetry run pytest
```

The coverage report and test report are both found in the `reports/` folder.

### **Metrics**

The accuracy metric is stored in `reports/model_evaluation.json`. In order to see the experiment history, run the following command:

```bash
dvc exp show
poetry run dvc exp show
```
Two experiments are listed, comparing the use of a 20% and 10% test split size.

@@ -58,59 +101,21 @@ Any preprocessing steps can be found in `preprocessing.py`. These are executed a

The trained model is stored in `data/models/`.

## **Poetry Setup**

### **Installation (Poetry)**

To install Poetry, please follow the instructions on the [Poetry website](https://python-poetry.org/docs/#installation) and follow the corresponding steps for your operating system.

### **Installing dependencies**

To install the project dependencies, please run the following command:

```bash
poetry install
```
## **Linting**
We are using the mllint tool to check for common mistakes in ML projects (formatting, tests, general good practice rules). The report that was used in the latest run of the pipeline can be found within `reports/mllint_report.md`.

This will install all dependencies listed in `pyproject.toml` and create a virtual environment for the project. As such, instead of using `pip` to install a specific dependency and then run that dependency in a virtual environment, Poetry will handle this for you.
> Note: The mllint tool combines multiple linters and uses rules for testing, configuration and other topics that are specific to ML projects. You can find the official source code for the tool [here](https://github.com/bvobart/mllint).
### **Adding a new dependency**
Pylint and DSLinter have been configured to ensure the code quality, and are run as part of mllint. All configuration options can be found in `.pylintrc`. This configuration file is based on [this example from the DSLinter documentation](https://github.com/SERG-Delft/dslinter/blob/main/docs/pylint-configuration-examples/pylintrc-for-ml-projects/.pylintrc). Besides this, there are a few custom changes, such as adding the variable names `X_train`, `X_test` etc. to the list of accepted variable names by Pylint, as these variable names are commonly used in ML applications. The `init_hook` variable in `.pylintrc` is also set to the path of this directory, in order to ensure that all imports within the code do not result in a warning from Pylint.

To add a new dependency, please run the following command:
isort and black are used for the formatting. If you would like to manually verify the code quality, please run the following command:

```bash
poetry add <dependency-name>
poetry run mllint
```

This will add the dependency to `pyproject.toml` and install it in the virtual environment.
However, if you would like to install a dependency for development purposes, please run the following command:
This will run mllint, which includes several linters. DSLinter is configured and will automatically run. This should return a perfect score of 10.00. A report summarising the findings can be found in `reports/mllint_report.md`.

```bash
poetry add --dev <dependency-name>
```

In any case, dependency changes will also show up in the `poetry.lock` file. This file is used to ensure that all developers are using the same versions of the dependencies. Consequently, it is good practice and actually recommended that this file is committed to version control.

### **The `pyproject.toml` Configuration**

The `pyproject.toml` file is used to configure the project by managing dependencies and configuring poetry itself. It is also used to configure additional behaviours for linting and testing - essentially acting as a configuration file for the dependencies used in the project. For example, the `pyproject.toml` file in this project is used to configure the following:
* The Python version
* The project name
* What profile `isort` should use
* What sources `bandit` should analyze
* etc.

## **Pylint & DSLinter**

Pylint and DSLinter have been used and configured to ensure the code quality. All configuration options can be found in `.pylintrc`. This configuration file is based on [this example from the DSLinter documentation](https://github.com/SERG-Delft/dslinter/blob/main/docs/pylint-configuration-examples/pylintrc-for-ml-projects/.pylintrc). Besides this, there are a few custom changes, such as adding the variable names `X_train`, `X_test` etc. to the list of accepted variable names by Pylint, as these variable names are commonly used in ML applications. The `init_hook` variable in `.pylintrc` is also set to the path of this directory, in order to ensure that all imports within the code do not result in a warning from Pylint.

If you would like to manually verify the code quality, please run the following command:

```bash
poetry run pylint src
```

DSLinter is configured and will automatically run. This should return a perfect score of 10.00. A report summarising the findings can be found in `reports/pylint_report.txt`.

## **Formatting (isort & black)**

@@ -148,8 +153,3 @@ poetry run black --check .

> Again, there are many more configuration options, therefore consider looking at the [black readthedocs page](https://black.readthedocs.io/en/stable/) if you are interested in more information.
## **mllint setup**

We are using the mllint tool to check for common mistakes in ML projects (formatting, tests, general good practice rules). The report that was used in the latest run of the pipeline can be found within `reports/mllint_report.md`.

> Note: The mllint tool combines multiple linters and uses rules for testing, configuration and other topics that are specific to ML projects. You can find the official source code for the tool [here](https://github.com/bvobart/mllint).
102 changes: 0 additions & 102 deletions data/reports/report.txt

This file was deleted.
