rodrick-mpofu/predict-bpm

Predicting the Beats-per-Minute of Songs

Kaggle Playground Series - Season 5, Episode 9

🎡 Project Overview

This repository contains my solution for the Kaggle Playground Series competition focused on predicting the beats-per-minute (BPM) of songs using machine learning techniques. The goal is to build a regression model that accurately predicts a song's tempo from various audio features.

🎯 Competition Details

  • Competition Type: Regression (Continuous Target Variable)
  • Evaluation Metric: Root Mean Squared Error (RMSE)
  • Timeline: September 1-30, 2025
  • Dataset: Synthetically generated from real-world music data

πŸ“Š Dataset Information

The dataset contains various audio features that can be used to predict a song's BPM. This is a synthetic dataset created from real-world music data, designed to provide realistic patterns while ensuring test labels remain private.

Target Variable: BeatsPerMinute - Continuous values representing song tempo

πŸ† Competition Goals

  • Predict continuous BPM values for songs in the test set
  • Minimize Root Mean Squared Error between predictions and actual BPM
  • Practice regression techniques and feature engineering
  • Explore audio feature relationships with song tempo

πŸ“ Repository Structure

β”œβ”€β”€ data/
β”‚   β”œβ”€β”€ train.csv           # Training dataset with BPM targets
β”‚   β”œβ”€β”€ test.csv            # Test dataset (without target)
β”‚   └── sample_submission.csv # Competition submission format
β”œβ”€β”€ notebooks/
β”‚   └── 01_eda.ipynb        # Exploratory Data Analysis βœ…
β”œβ”€β”€ src/
β”‚   β”œβ”€β”€ __init__.py         # Package initialization
β”‚   β”œβ”€β”€ data_preprocessing.py # Data cleaning and preprocessing βœ…
β”‚   └── utils.py            # Utility functions and helpers βœ…
β”œβ”€β”€ submissions/
β”‚   └── ensemble_submission.csv # Generated predictions
β”œβ”€β”€ .gitignore              # Git ignore rules for ML projects βœ…
β”œβ”€β”€ requirements.txt        # Python dependencies βœ…
└── README.md              # Project documentation

πŸ› οΈ Setup and Installation

  1. Clone this repository:

     git clone https://github.com/rodrick-mpofu/predict-bpm.git
     cd predict-bpm

  2. Install required packages:

     pip install -r requirements.txt

  3. Download the competition data from Kaggle and place it in the data/ folder.

πŸ§ͺ Methodology & Progress

Data Exploration βœ…

  • Analyze distribution of BPM values (completed in EDA notebook)
  • Explore feature correlations and relationships
  • Identify missing values and outliers
  • Visualize audio feature relationships with target
  • Statistical analysis of feature distributions
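The exploration steps above can be sketched with pandas. The feature names below (Energy, Danceability, AudioLoudness) are illustrative stand-ins for whatever columns the real train.csv contains, and the synthetic frame merely mimics its shape:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for data/train.csv: a few synthetic audio features
# plus the BeatsPerMinute target (real column names may differ).
rng = np.random.default_rng(42)
n = 500
train = pd.DataFrame({
    "Energy": rng.uniform(0, 1, n),
    "Danceability": rng.uniform(0, 1, n),
    "AudioLoudness": rng.normal(-8, 3, n),
    "BeatsPerMinute": rng.normal(120, 25, n),
})

# Target distribution summary
print(train["BeatsPerMinute"].describe())

# Missing values per column
print(train.isna().sum())

# Correlation of each feature with the target
corr = train.corr(numeric_only=True)["BeatsPerMinute"].drop("BeatsPerMinute")
print(corr.sort_values(ascending=False))
```

With the real data, `pd.read_csv("data/train.csv")` replaces the synthetic frame and the same three calls cover the distribution, missing-value, and correlation checks.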

Feature Engineering πŸ”„

  • Data preprocessing pipeline implemented
  • Missing value handling strategies
  • Feature scaling and normalization
  • Create polynomial features
  • Generate interaction terms
  • Domain-specific audio feature engineering
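A minimal version of this preprocessing pipeline can be written with scikit-learn; the median imputation strategy and degree-2 interaction terms below are placeholder choices for illustration, not the project's final settings:

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, PolynomialFeatures

# Synthetic feature matrix with a few missing values sprinkled in
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
X[rng.random(X.shape) < 0.05] = np.nan

preprocess = Pipeline([
    ("impute", SimpleImputer(strategy="median")),  # missing-value handling
    ("scale", StandardScaler()),                   # feature scaling
    ("poly", PolynomialFeatures(degree=2, interaction_only=True,
                                include_bias=False)),  # interaction terms
])

# 3 original columns + 3 pairwise interactions -> (100, 6)
Xt = preprocess.fit_transform(X)
print(Xt.shape)
```

Keeping the steps inside one `Pipeline` ensures the imputer and scaler are fit only on training folds during cross-validation, avoiding leakage.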

Modeling Approach πŸ“‹

  • Baseline linear regression
  • Random Forest Regressor
  • Gradient Boosting (XGBoost, LightGBM, CatBoost)
  • Neural Networks (if beneficial)
  • Ensemble methods for final predictions
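A sketch of the baseline comparison, with `make_regression` standing in for the competition features and only the first two model families shown:

```python
import numpy as np
from sklearn.datasets import make_regression  # placeholder for the real data
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold, cross_val_score

X, y = make_regression(n_samples=300, n_features=8, noise=10.0, random_state=0)
cv = KFold(n_splits=5, shuffle=True, random_state=0)

for name, model in [
    ("Baseline (linear)", LinearRegression()),
    ("Random Forest", RandomForestRegressor(n_estimators=100, random_state=0)),
]:
    # scikit-learn returns negated RMSE so that higher is always better
    scores = cross_val_score(model, X, y, cv=cv,
                             scoring="neg_root_mean_squared_error")
    print(f"{name}: RMSE {-scores.mean():.2f} +/- {scores.std():.2f}")
```

The gradient-boosting models (XGBoost, LightGBM, CatBoost) slot into the same loop since they all expose the scikit-learn estimator interface.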

Model Validation πŸ“‹

  • K-fold cross-validation setup
  • Feature importance analysis
  • Hyperparameter tuning with Optuna
  • Model interpretability with SHAP

πŸ“ˆ Current Results

| Model         | CV Score (RMSE) | Public LB Score |
|---------------|-----------------|-----------------|
| Baseline      | -               | -               |
| Random Forest | -               | -               |
| XGBoost       | -               | -               |
| Ensemble      | -               | -               |

πŸ”§ Technical Implementation Details

Current Architecture

  • Data Pipeline: Modular preprocessing with data_preprocessing.py
  • Utility Functions: Reusable components in utils.py for model evaluation and visualization
  • Notebook-Based EDA: Comprehensive exploratory analysis in Jupyter notebooks

Feature Engineering Strategy

  • Missing Value Handling: Statistical imputation based on feature distributions
  • Scaling Strategy: StandardScaler for continuous features, preserving audio feature relationships
  • Feature Selection: Correlation analysis and domain knowledge for audio features
  • Future Plans: Polynomial features, interaction terms, and domain-specific transformations

Model Development Approach

  • Baseline Strategy: Start with linear regression for interpretability
  • Tree-Based Models: Random Forest and Gradient Boosting (XGBoost, LightGBM, CatBoost)
  • Ensemble Strategy: Weighted averaging and stacking approaches
  • Validation: K-fold cross-validation (optionally stratified on binned BPM values) to ensure robust performance estimates
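Weighted averaging, the simplest of the ensemble strategies above, can be illustrated as follows; the fixed weights are purely illustrative and would normally be tuned on out-of-fold predictions:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=400, n_features=8, noise=15.0, random_state=1)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.25, random_state=1)

models = {
    "linear": LinearRegression(),
    "rf": RandomForestRegressor(n_estimators=100, random_state=1),
    "gbm": GradientBoostingRegressor(random_state=1),
}
preds = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    preds[name] = model.predict(X_val)

# Illustrative fixed weights; in practice tune them on out-of-fold predictions
weights = {"linear": 0.4, "rf": 0.3, "gbm": 0.3}
blend = sum(w * preds[m] for m, w in weights.items())

for name, p in preds.items():
    print(name, round(mean_squared_error(y_val, p) ** 0.5, 2))
print("blend", round(mean_squared_error(y_val, blend) ** 0.5, 2))
```

Stacking replaces the fixed weights with a meta-model fit on the base models' out-of-fold predictions.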

Technical Stack

  • Core ML: scikit-learn ecosystem with gradient boosting libraries
  • Hyperparameter Optimization: Optuna for efficient parameter search
  • Model Interpretation: SHAP values for feature importance and model explainability
  • Visualization: matplotlib/seaborn for EDA, plotly for interactive plots

πŸ“‹ Submission Format

Predictions should be submitted in CSV format:

ID,BeatsPerMinute
524164,119.5
524165,127.42
524166,111.11
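Assembling the submission file with pandas might look like the sketch below; the IDs and BPM values are just the example rows above, and in the real pipeline they would come from data/test.csv and the trained model:

```python
import numpy as np
import pandas as pd

# Hypothetical predictions for the three example test IDs shown above
test_ids = [524164, 524165, 524166]
preds = np.array([119.5, 127.42, 111.11])

submission = pd.DataFrame({"ID": test_ids, "BeatsPerMinute": preds})
csv_text = submission.to_csv(index=False)
print(csv_text)

# In the project this would be written out with:
# submission.to_csv("submissions/ensemble_submission.csv", index=False)
```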

πŸš€ How to Run

  1. Exploratory Data Analysis:

    jupyter notebook notebooks/01_eda.ipynb
  2. Training Models:

    python src/models.py
  3. Generate Predictions:

    python src/predict.py

πŸ“š Dependencies

Core Data Science Stack

  • pandas (β‰₯1.5.0) - Data manipulation and analysis
  • numpy (β‰₯1.21.0) - Numerical computing
  • scikit-learn (β‰₯1.1.0) - Machine learning algorithms and preprocessing
  • scipy (β‰₯1.8.0) - Statistical functions
  • statsmodels (β‰₯0.13.0) - Advanced statistical analysis

Machine Learning Libraries

  • xgboost (β‰₯1.6.0) - Gradient boosting framework
  • lightgbm (β‰₯3.3.0) - Fast gradient boosting
  • catboost (β‰₯1.1.0) - Categorical feature handling
  • optuna (β‰₯3.0.0) - Hyperparameter optimization

Visualization & Analysis

  • matplotlib (β‰₯3.5.0) - Static plotting
  • seaborn (β‰₯0.11.0) - Statistical visualization
  • plotly (β‰₯5.10.0) - Interactive visualizations
  • shap (β‰₯0.41.0) - Model interpretation

Development Tools

  • jupyter (β‰₯1.0.0) - Interactive notebooks
  • tqdm (β‰₯4.64.0) - Progress bars
  • joblib (β‰₯1.1.0) - Model persistence
  • feature-engine (β‰₯1.5.0) - Feature engineering utilities

See requirements.txt for complete dependency list with version specifications.

🀝 Competition Context

This competition is part of the Kaggle Tabular Playground Series, designed to provide:

  • Lightweight challenges for skill development
  • Synthetic datasets based on real-world data
  • Opportunities for rapid experimentation
  • Community learning and collaboration

πŸ“ Notes

  • The dataset is synthetically generated but maintains realistic patterns from actual music data
  • Focus on regression techniques and feature engineering
  • RMSE optimization is key to achieving good performance
  • Cross-validation is crucial for reliable model evaluation

πŸŽ–οΈ Competition Prizes

  • 1st-3rd Place: Choice of Kaggle merchandise
  • Focus on learning and community participation

πŸ“„ Citation

Walter Reade and Elizabeth Park. Predicting the Beats-per-Minute of Songs. https://kaggle.com/competitions/playground-series-s5e9, 2025. Kaggle.


πŸ“ž Contact

Feel free to reach out for discussions about approaches, feature engineering ideas, or collaboration opportunities!

Happy Modeling! πŸŽ΅πŸ“Š
