Train an ML model in Scikit-learn for sentiment classification, while keeping track of the performance of the different models via MLflow. The optimal model is found by exploring the model search space through GridSearch (Bayesian optimization is on the TODO list...).
The search space has two dimensions:
- Different vectorizer settings: ngram sizes
- Different classifiers and classifier settings: Naive Bayes, Random Forest and Support Vector Machines.
After evaluating the performance of all the models, the best model is selected. This model is then trained on the complete corpus.
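To give an idea of what this search space looks like in Scikit-learn, here is a minimal sketch of a grid search over a text pipeline. The parameter values and specific estimator settings are illustrative only and do not necessarily match the grid defined in train_hp_optimizer.py:

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import LinearSVC
from sklearn.model_selection import GridSearchCV

# A pipeline with a vectorizer step and a swappable classifier step.
pipeline = Pipeline([
    ("vectorizer", TfidfVectorizer()),
    ("classifier", MultinomialNB()),
])

# Two search dimensions: ngram sizes and classifier (plus classifier settings).
# The hyperparameter values below are placeholders, not the project's actual grid.
param_grid = [
    {
        "vectorizer__ngram_range": [(1, 1), (1, 2)],
        "classifier": [MultinomialNB()],
        "classifier__alpha": [0.1, 1.0],
    },
    {
        "vectorizer__ngram_range": [(1, 1), (1, 2)],
        "classifier": [RandomForestClassifier()],
        "classifier__n_estimators": [100, 300],
    },
    {
        "vectorizer__ngram_range": [(1, 1), (1, 2)],
        "classifier": [LinearSVC()],
        "classifier__C": [0.1, 1.0],
    },
]

search = GridSearchCV(pipeline, param_grid, cv=5, scoring="accuracy", n_jobs=-1)
# search.fit(train_texts, train_labels)  # texts/labels come from the movie review dataset
```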
This project has the following dependencies:
- Large Movie Review Dataset from Stanford (already included in the repo)
- Git LFS
- Python >= 3.7
- Docker
- Optional: Poetry
- Install Git LFS on your machine
$ git lfs install --system --skip-repo
- Clone the repo
- Create a virtual environment with at least Python 3.7 via the tool of your choice (conda, venv, etc.)
- Install the Python dependencies
Using Poetry:
$ poetry install
Without Poetry:
$ pip install -r requirements.txt
- Create the directories `database` and `artifacts` in the `data` directory
$ cd data
$ mkdir database
$ mkdir artifacts
1. Run the MLflow server via the command shown below. This Makefile target starts up the Postgres database and the MLflow server. The MLflow server is accessible at localhost:5000.
$ make mlflow-server
With the current configuration the statistics are stored in the Postgres database, whereas the artifacts are stored on your disk. For production I would recommend using a SQL instance in the Cloud for the statistics and blob storage for the artifacts.
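For orientation, this is roughly how a training script points at that server from the client side. The URI and experiment name below are placeholders, not values taken from the project configuration:

```python
import mlflow

# Placeholder values: the actual tracking URI and experiment name are defined
# in the project's configuration / Makefile.
mlflow.set_tracking_uri("http://localhost:5000")
mlflow.set_experiment("sentiment-classification")
```

The split between the Postgres backend store and the on-disk artifact store is configured on the server side (e.g. via the `--backend-store-uri` and `--default-artifact-root` options of `mlflow server`), so moving to a cloud SQL instance and blob storage only changes the server command, not the training code.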
2. Train the different ML models using Scikit-learn. After the run is finished, the parameters and metrics (performance) of each model are visible in the corresponding experiment in the MLflow dashboard.
$ python train_hp_optimizer.py
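As a rough illustration of what gets logged per candidate model (not the exact code in train_hp_optimizer.py), each parameter combination and its cross-validation score can be written to MLflow like this, assuming a fitted GridSearchCV object named `search` as in the earlier sketch:

```python
import mlflow

# Log one MLflow run per evaluated parameter combination.
for params, mean_score in zip(search.cv_results_["params"],
                              search.cv_results_["mean_test_score"]):
    with mlflow.start_run():
        mlflow.log_params({key: str(value) for key, value in params.items()})
        mlflow.log_metric("mean_cv_accuracy", mean_score)
```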
3. Train the best model on the complete dataset and evaluate performance on the test dataset
$ python train_best_model.py
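Conceptually, this step refits the winning configuration on all training data and scores it on the held-out test split, along the lines of the hedged sketch below (variable names such as `train_texts` and `test_texts` are illustrative and `search` is the fitted GridSearchCV object from the sketch above):

```python
from sklearn.metrics import accuracy_score

best_model = search.best_estimator_          # pipeline with the winning vectorizer/classifier settings
best_model.fit(train_texts, train_labels)    # retrain on the complete training corpus
test_accuracy = accuracy_score(test_labels, best_model.predict(test_texts))
print(f"Test accuracy: {test_accuracy:.3f}")
```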
4. The best model is stored in the directory `trained_model`, in the subdirectory with the corresponding experiment name. The `model.pkl` file is your trained ML model that can be utilized to make predictions!
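For example, loading the pickled pipeline and scoring a couple of reviews could look like this (the path placeholder and the use of joblib are assumptions; adapt them to how the model was actually serialized):

```python
import joblib

# Replace <experiment_name> with the subdirectory created for your experiment.
model = joblib.load("trained_model/<experiment_name>/model.pkl")
print(model.predict([
    "What a fantastic movie, I loved every minute of it!",
    "Utterly boring and far too long.",
]))
```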