This repository contains my solution for the Kaggle Playground Series competition focused on predicting the beats-per-minute (BPM) of songs using machine learning techniques. The goal is to build a regression model that accurately predicts a song's tempo from various audio features.
- Competition Type: Regression (Continuous Target Variable)
- Evaluation Metric: Root Mean Squared Error (RMSE)
- Timeline: September 1-30, 2025
- Dataset: Synthetically generated from real-world music data
The dataset contains various audio features that can be used to predict a song's BPM. This is a synthetic dataset created from real-world music data, designed to provide realistic patterns while ensuring test labels remain private.
Target Variable: BeatsPerMinute - Continuous values representing song tempo
- Predict continuous BPM values for songs in the test set
- Minimize Root Mean Squared Error between predictions and actual BPM
- Practice regression techniques and feature engineering
- Explore audio feature relationships with song tempo
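Since RMSE is the competition metric, it helps to be explicit about what is being minimized. A minimal sketch (the `rmse` helper is mine, not part of the repository):

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root Mean Squared Error: sqrt of the mean squared residual."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

print(rmse([120.0, 128.0], [118.0, 130.0]))  # 2.0
```

Note that RMSE penalizes large errors quadratically, so a few badly mispredicted tempos hurt the score more than many small misses.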
```
├── data/
│   ├── train.csv               # Training dataset with BPM targets
│   ├── test.csv                # Test dataset (without target)
│   └── sample_submission.csv   # Competition submission format
├── notebooks/
│   └── 01_eda.ipynb            # Exploratory Data Analysis ✅
├── src/
│   ├── __init__.py             # Package initialization
│   ├── data_preprocessing.py   # Data cleaning and preprocessing ✅
│   └── utils.py                # Utility functions and helpers ✅
├── submissions/
│   └── ensemble_submission.csv # Generated predictions
├── .gitignore                  # Git ignore rules for ML projects ✅
├── requirements.txt            # Python dependencies ✅
└── README.md                   # Project documentation
```
1. Clone this repository:

   ```bash
   git clone https://github.com/[username]/predict-bpm.git
   cd predict-bpm
   ```

2. Install required packages:

   ```bash
   pip install -r requirements.txt
   ```

3. Download the competition data from Kaggle and place it in the `data/` folder.
- Analyze distribution of BPM values (completed in EDA notebook)
- Explore feature correlations and relationships
- Identify missing values and outliers
- Visualize audio feature relationships with target
- Statistical analysis of feature distributions
- Data preprocessing pipeline implemented
- Missing value handling strategies
- Feature scaling and normalization
- Create polynomial features
- Generate interaction terms
- Domain-specific audio feature engineering
- Baseline linear regression
- Random Forest Regressor
- Gradient Boosting (XGBoost, LightGBM, CatBoost)
- Neural Networks (if beneficial)
- Ensemble methods for final predictions
- K-fold cross-validation setup
- Feature importance analysis
- Hyperparameter tuning with Optuna
- Model interpretability with SHAP
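The K-fold setup above can be wired to the competition metric directly via scikit-learn's `neg_root_mean_squared_error` scorer. This is a self-contained sketch on synthetic data, not the repository's actual training script:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in for the real train.csv features and BPM target.
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

cv = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(
    RandomForestRegressor(n_estimators=50, random_state=0),
    X, y, cv=cv,
    scoring="neg_root_mean_squared_error",  # sklearn maximizes, hence negated
)
print(f"CV RMSE: {-scores.mean():.2f} +/- {scores.std():.2f}")
```

Shuffling before splitting matters here because Kaggle training files are sometimes ordered, which would otherwise bias the folds.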
| Model | CV Score (RMSE) | Public LB Score |
|---|---|---|
| Baseline | - | - |
| Random Forest | - | - |
| XGBoost | - | - |
| Ensemble | - | - |
- Data Pipeline: Modular preprocessing in `data_preprocessing.py`
- Utility Functions: Reusable components in `utils.py` for model evaluation and visualization
- Notebook-Based EDA: Comprehensive exploratory analysis in Jupyter notebooks
- Missing Value Handling: Statistical imputation based on feature distributions
- Scaling Strategy: StandardScaler for continuous features, preserving audio feature relationships
- Feature Selection: Correlation analysis and domain knowledge for audio features
- Future Plans: Polynomial features, interaction terms, and domain-specific transformations
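The imputation-then-scaling strategy described above can be composed into a single scikit-learn `Pipeline`; this is a minimal sketch with toy data, not the repository's actual `data_preprocessing.py`:

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Median imputation first, then standardization, so the scaler
# never sees NaNs and the imputed values are scaled consistently.
preprocess = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])

X_train = np.array([[1.0, 200.0],
                    [np.nan, 220.0],
                    [3.0, np.nan]])
X_out = preprocess.fit_transform(X_train)
print(X_out.round(2))
```

Fitting the pipeline on the training split only, then calling `transform` on the test split, avoids leaking test-set statistics into the scaler.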
- Baseline Strategy: Start with linear regression for interpretability
- Tree-Based Models: Random Forest and Gradient Boosting (XGBoost, LightGBM, CatBoost)
- Ensemble Strategy: Weighted averaging and stacking approaches
- Validation: K-fold cross-validation (stratified on a binned target, since plain stratification does not apply to continuous BPM values) to ensure robust performance estimates
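Stratified splitting normally requires class labels, so for a continuous target like BPM a common trick is to bin the target into quantiles first. A sketch under that assumption:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedKFold

rng = np.random.default_rng(0)
bpm = rng.uniform(60, 180, size=100)  # stand-in for BeatsPerMinute

# Quantile-bin the continuous target so StratifiedKFold has discrete
# "classes" to balance across folds.
bins = pd.qcut(bpm, q=5, labels=False)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
X_dummy = np.zeros((len(bpm), 1))  # features are irrelevant to the split
for fold, (train_idx, val_idx) in enumerate(skf.split(X_dummy, bins)):
    print(fold, len(train_idx), len(val_idx))
```

Each fold then contains a similar spread of slow and fast songs, which stabilizes the RMSE estimates across folds.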
- Core ML: scikit-learn ecosystem with gradient boosting libraries
- Hyperparameter Optimization: Optuna for efficient parameter search
- Model Interpretation: SHAP values for feature importance and model explainability
- Visualization: matplotlib/seaborn for EDA, plotly for interactive plots
Predictions should be submitted in CSV format:

```csv
ID,BeatsPerMinute
524164,119.5
524165,127.42
524166,111.11
```
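Writing that file from a predictions array is a one-liner with pandas; the IDs and values below are the illustrative ones from the format example, not real predictions:

```python
import pandas as pd

# Predictions keyed by the test-set ID column.
submission = pd.DataFrame({
    "ID": [524164, 524165, 524166],
    "BeatsPerMinute": [119.5, 127.42, 111.11],
})

# index=False keeps the row index out of the CSV, matching the
# sample_submission.csv layout exactly.
submission.to_csv("submission.csv", index=False)
print(open("submission.csv").read())
```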
- Exploratory Data Analysis:

  ```bash
  jupyter notebook notebooks/01_eda.ipynb
  ```

- Training Models:

  ```bash
  python src/models.py
  ```

- Generate Predictions:

  ```bash
  python src/predict.py
  ```
- pandas (≥1.5.0) - Data manipulation and analysis
- numpy (≥1.21.0) - Numerical computing
- scikit-learn (≥1.1.0) - Machine learning algorithms and preprocessing
- scipy (≥1.8.0) - Statistical functions
- statsmodels (≥0.13.0) - Advanced statistical analysis
- xgboost (≥1.6.0) - Gradient boosting framework
- lightgbm (≥3.3.0) - Fast gradient boosting
- catboost (≥1.1.0) - Gradient boosting with categorical feature handling
- optuna (≥3.0.0) - Hyperparameter optimization
- matplotlib (≥3.5.0) - Static plotting
- seaborn (≥0.11.0) - Statistical visualization
- plotly (≥5.10.0) - Interactive visualizations
- shap (≥0.41.0) - Model interpretation
- jupyter (≥1.0.0) - Interactive notebooks
- tqdm (≥4.64.0) - Progress bars
- joblib (≥1.1.0) - Model persistence
- feature-engine (≥1.5.0) - Feature engineering utilities
See requirements.txt for complete dependency list with version specifications.
This competition is part of the Kaggle Tabular Playground Series, designed to provide:
- Lightweight challenges for skill development
- Synthetic datasets based on real-world data
- Opportunities for rapid experimentation
- Community learning and collaboration
- The dataset is synthetically generated but maintains realistic patterns from actual music data
- Focus on regression techniques and feature engineering
- RMSE optimization is key to achieving good performance
- Cross-validation is crucial for reliable model evaluation
- 1st-3rd Place: Choice of Kaggle merchandise
- Focus on learning and community participation
Walter Reade and Elizabeth Park. Predicting the Beats-per-Minute of Songs. https://kaggle.com/competitions/playground-series-s5e9, 2025. Kaggle.
Feel free to reach out for discussions about approaches, feature engineering ideas, or collaboration opportunities!
Happy Modeling! 🎵