This guide will help you set up and run a machine learning pipeline that includes feature engineering, model training, and deployment using Hopsworks and OpenAI.
- 📋 Prerequisites
- 🎯 Getting Started
- ⚡️ Running the H&M Personalized Recommender
- 🤖 Running the ML Pipelines in GitHub Actions
- 🌐 Live Demo
- ☁️ Deploying the Streamlit App
You'll need the following tools installed locally:
Tool | Version | Purpose | Installation Link |
---|---|---|---|
Python | 3.11 | Programming language runtime | Download |
uv | ≥ 0.4.30 | Python package installer and virtual environment manager | Download |
GNU Make | ≥ 3.81 | Build automation tool | Download |
Git | ≥ 2.44.0 | Version control | Download |
The project requires access to these cloud services:
Service | Purpose | Cost | Required Credentials | Setup Guide |
---|---|---|---|---|
Hopsworks | AI Lakehouse for feature store, model registry, and serving | Free tier available | `HOPSWORKS_API_KEY` | Create API Key |
GitHub Actions | Compute & Automation | Free for public repos | - | - |
OpenAI API | LLM API for recommender system | Pay-per-use | `OPENAI_API_KEY` | Quick Start Guide |
Start by cloning the repository and navigating to the project directory:
git clone https://github.com/decodingml/personalized-recommender-course.git
cd personalized-recommender-course
Next, prepare your Python environment and install the project dependencies.
Set up the project environment by running the following:
make install
Test that you have Python 3.11.8 installed in your new `uv` environment:
```bash
uv run python --version
# Output: Python 3.11.8
```
This command will (roughly equivalent manual `uv` commands are sketched after this list):
- Create a virtual environment using `uv`
- Activate the virtual environment
- Install all dependencies from `pyproject.toml`
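If you are curious what `make install` automates, a minimal sketch of the same setup using plain `uv` commands follows (an assumption about the workflow, not the exact Makefile target):

```bash
# Install the Python version pinned in .python-version (assumed to be 3.11.x), if missing locally
uv python install 3.11
# Create the project's local virtual environment (.venv)
uv venv
# Install all dependencies declared in pyproject.toml / uv.lock into it
uv sync
```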
Note
Normally, `uv` will pick the right Python version mentioned in `.python-version` and install it automatically if it is not on your system. If you are having any issues, explicitly install the right Python version by running `make install-python`
Before running any components:
- Create your environment file:
cp .env.example .env
- Open `.env` and configure the required credentials following the inline comments and the recommendations from the Cloud Services section; a placeholder sketch of the resulting file is shown below.
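A minimal `.env` sketch with placeholder values (your `.env.example` may define additional settings beyond these two keys):

```bash
# .env — replace the placeholders with your real keys; never commit this file
HOPSWORKS_API_KEY=<your-hopsworks-api-key>
OPENAI_API_KEY=<your-openai-api-key>
```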
For instructions on exploring the Notebooks, check out the 📚 Course section from the main README.
You can run the entire pipeline at once or execute individual components.
Execute all the ML pipelines in a sequence:
make all
It will take ~1.5 hours to run, depending on your machine.
This runs the following steps:
- Feature engineering
- Retrieval model training
- Ranking model training
- Candidate embeddings creation
- Inference pipeline deployment
- Materialization job scheduling
View results in Hopsworks Serverless: Data Science → Deployments
Start the Streamlit UI:
make start-ui
Accessible at http://localhost:8501/
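If the page does not load, a quick check from another terminal confirms whether Streamlit is serving on the expected port (a plain HTTP request; nothing project-specific is assumed):

```bash
# Should print 200 once the Streamlit server has finished starting up
curl -s -o /dev/null -w "%{http_code}\n" http://localhost:8501/
```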
Each component can be run separately:
- Feature Engineering
make feature-engineering
It will take ~1 hour to run, depending on your machine.
View results in Hopsworks Serverless: Feature Store → Feature Groups
- Retrieval Model Training
make train-retrieval
View results in Hopsworks Serverless: Data Science → Model Registry
- Ranking Model Training
make train-ranking
View results in Hopsworks Serverless: Data Science → Model Registry
- Embeddings Creation
make create-embeddings
View results in Hopsworks Serverless: Feature Store → Feature Groups
- Deployment Creation
make create-deployments
View results in Hopsworks Serverless: Data Science → Deployments
Start the Streamlit UI:
make start-ui
Accessible at http://localhost:8501/
Important
The demo is in 0-cost mode, which means that when there is no traffic, the deployment scales to 0 instances. The first time you interact with it, give it 1-2 minutes to warm up to 1+ instances. Afterward, everything will become smoother.
- Materialization Job Scheduling
make schedule-materialization-jobs
View results in Hopsworks Serverless: Compute → Ingestions
- Deployment Creation with LLM Ranking (Optional)
Optional step to replace the standard deployments (created in Step 5) with the ones powered by LLMs:
make create-deployments-llm-ranking
NOTE: If the script fails, go to Hopsworks Serverless: Data Science → Deployments, forcefully stop all the deployments, and run the command again.
Warning
The LLM ranking deployment overrides the deployment from step 5 (Deployment Creation).
Start the Streamlit UI that interfaces the LLM deployment:
make start-ui-llm-ranking
Accessible at http://localhost:8501/
Warning
The Streamlit UI command is compatible only with its corresponding deployment. For example, running the deployment from step 5 (Deployment Creation) and then starting the UI with `make start-ui-llm-ranking` won't work.
Remove all created resources from Hopsworks Serverless:
make clean-hopsworks-resources
- Ensure `uv` is properly installed and configured before running any commands
- All notebooks are executed using IPython through the `uv` virtual environment
- Components should be run in the specified order when executing individually
This project supports running ML pipelines automatically through GitHub Actions, providing an alternative to local or Colab execution.
Note
This is handy when getting network errors, such as timeouts, on your local machine. GitHub Actions has an enterprise-level network that will run your ML pipelines smoothly.
The ML pipelines can be triggered in three ways:
- Manual trigger through GitHub UI
- Scheduled execution (configurable)
- On push to main branch (configurable)
Create your own copy of the repository to access GitHub Actions:
```
# Use GitHub's UI to fork the repository
https://github.com/original-repo/name → Your-Username/name
```
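If you prefer the terminal, the GitHub CLI can create the same fork (assumes `gh` is installed and authenticated; the repository name matches the clone URL from Getting Started):

```bash
# Fork the course repository under your account and clone the fork locally
gh repo fork decodingml/personalized-recommender-course --clone
```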
Set up required environment variables as GitHub Actions secrets:
Option A: Using GitHub UI
- Navigate to: Repository → Settings → Secrets and variables → Actions
- Click "New repository secret"
- Add required secrets:
  - `HOPSWORKS_API_KEY`
  - `OPENAI_API_KEY`
📚 Set up GitHub Actions Secrets Guide
Option B: Using GitHub CLI
If you have the GitHub CLI installed, instead of setting the GitHub Actions secrets manually, you can set them by running the following:
gh secret set HOPSWORKS_API_KEY
gh secret set OPENAI_API_KEY
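If your local `.env` already contains the two keys, the GitHub CLI can also read them from the dotenv file in one go (assumes the file holds only KEY=value pairs you actually want uploaded as repository secrets):

```bash
# Upload every KEY=value pair from .env as a GitHub Actions secret
gh secret set -f .env
```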
- Go to Actions → ML Pipelines
- Click "Run workflow"
- Select branch (default: main)
- Click "Run workflow"
After triggering the pipeline, you will see it running, signaled by a yellow circle. Click on it to see the progress.
After it is finished, all steps in the workflow run should be marked as successful.
Another option is to run the ML pipelines automatically on a schedule or when new commits are pushed to the main branch.
Edit `.github/workflows/ml_pipelines.yaml` to enable automatic triggers:
```yaml
name: ML Pipelines
on:
  # schedule: # Uncomment to run the pipelines every 2 hours. All the pipelines take ~1.5 hours to run.
  #   - cron: '0 */2 * * *'
  # push: # Uncomment to run pipelines on every new commit to main
  #   branches:
  #     - main
  workflow_dispatch: # Allows manual triggering from GitHub UI
```
- Pipeline Progress
  - View real-time execution in the Actions tab
  - Each step shows detailed logs and status
- Output Verification
  - Access results in Hopsworks Serverless
  - Check Feature Groups, Feature Views, Model Registry, and Deployments
- Full pipeline execution takes approximately 1.5 hours
- Ensure sufficient GitHub Actions minutes available
- Monitor usage when enabling automated triggers
Try out our deployed H&M real-time personalized recommender to see what you'll learn to build by the end of this course: 💻 Live H&M Recommender Streamlit Demo
Important
The demo is in 0-cost mode, which means that when there is no traffic, the deployment scales to 0 instances. The first time you interact with it, give it 1-2 minutes to warm up to 1+ instances. Afterward, everything will become smoother.
Deploying a Streamlit app to Streamlit Cloud is free and straightforward once the GitHub repository has the required files in place:
- `uv.lock` - installs the Python dependencies
- `packages.txt` - installs the system dependencies
- `streamlit_app.py` - entrypoint to the Streamlit application
Fork the repository if you haven't already:
```
# Use GitHub's UI to fork the repository
https://github.com/original-repo/name → Your-Username/name
```
- Create a free account on Streamlit Cloud
- Navigate to New App Deployment
- Configure deployment settings:
Setting | Configuration | Description |
---|---|---|
App Type | Select "Deploy a public app from GitHub" | |
Main Settings | Configure your repository | |
Advanced Settings | Set Python 3.11 and `HOPSWORKS_API_KEY` | |
- Ensure all required files are present in your repository
- Python version must be set to 3.11
- `HOPSWORKS_API_KEY` must be configured in environment variables
- Repository must be public for free tier deployment
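Before deploying, a quick sanity check from the repository root confirms the files Streamlit Cloud relies on are present (filenames taken from the list above):

```bash
# Fails loudly if any of the required files is missing from the repository root
ls uv.lock packages.txt streamlit_app.py
```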