This benchmark runs an optimization algorithm over a pre-defined grid of hyperparameters for every library. It measures the time to perform a boosting iteration and the metric value on the test set, using library primitives. The XGBoost benchmark code served as a starting point for this benchmark.
Dependencies:

```
xgboost==0.80
matplotlib==2.2.2
tqdm==4.26.0
scipy==1.1.0
lightgbm==2.2.1
pandas==0.23.4
catboost==0.10.3
numpy==1.15.2
scikit_learn==0.20.0
```
How to run all experiments in one line and get a table with results: the command below runs each library on each dataset (abalone, airline, epsilon, higgs, letters, msrank, msrank-classification, synthetic, synthetic-5k-features) with max_depth=6 three times (with learning rates 0.03, 0.05, and 0.15) for 1000 iterations, dumps logs to the 'logs' directory, writes all results to the file result.json, and creates a table with time per iteration and total learning time (in common-table.txt).
```shell
python benchmark.py --use-gpu
```
For the CPU benchmark, run without the --use-gpu flag.
First you need to define a hyperparameter grid in a JSON file (you may find a grid example named example_params_grid.json in the root directory). Every line defines a hyperparameter name and a list of values to iterate over in the experiments.
```json
{
    "max_depth": [6, 10, 12],
    "learning_rate": [0.02, 0.05, 0.08],
    "reg_lambda": [1, 10],
    "subsample": [0.5, 1.0]
}
```
This is an example of how to run experiments on GPU for all three libraries on the airline dataset, using the grid example_params_grid.json with 5000 iterations:
```shell
python run.py --learners cat lgb xgb --experiment airline --params-grid example_params_grid.json --iterations 5000 --use-gpu
```
By default, logs are written to the 'logs' directory; a compressed version (containing only timestamps and quality values per iteration) is written to the file 'result.json'.
The following datasets are used: Higgs, Epsilon, MSRank (MSLR-WEB10K), CoverType, Airline, Synthetic, Abalone, Letters.
Dataset names accepted by run.py: abalone, airline, cover-type, epsilon, higgs, letters, msrank, msrank-classification, synthetic, synthetic-5k-features
Four types of plots are supported:
- Learning curves for the top N runs across all experiments (--type best).
- Box plots of time per iteration for each library (--type time-per-iter).
- Time to achieve a given percentage of the best quality (--type quality-vs-time).
- Learning curves for specific experiments (--type custom).
The quality-vs-time plot first finds the best quality achieved across all experiments, then fixes a grid of relative percentages of that value. For each quality level it filters the grid-search experiments that reach it and computes the median (circle point), minimum, and maximum time at which the algorithm achieved that score.
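The procedure can be sketched as follows (with hypothetical in-memory data; the real script reads timestamps and quality values from result.json):

```python
import numpy as np

# Hypothetical input: per-experiment arrays of cumulative time (sec) and
# quality after each iteration; higher quality is better in this sketch.
experiments = {
    "exp_a": (np.array([1.0, 2.0, 3.0, 4.0]), np.array([0.70, 0.80, 0.85, 0.88])),
    "exp_b": (np.array([0.5, 1.5, 2.5, 3.5]), np.array([0.60, 0.75, 0.86, 0.90])),
}

# Best quality achieved across all experiments.
best_quality = max(q.max() for _, q in experiments.values())

# Grid of relative quality levels (e.g. --low-percent 0.85 with 4 bins).
for percent in np.linspace(0.85, 1.0, 4):
    threshold = percent * best_quality
    # First time each experiment reached the threshold, if it ever did.
    times = [t[q >= threshold][0]
             for t, q in experiments.values() if (q >= threshold).any()]
    if times:
        print(f"{percent:.0%}: median={np.median(times):.2f} "
              f"min={min(times):.2f} max={max(times):.2f}")
```

For lower-is-better metrics such as RMSE, the comparisons would be reversed; the sketch only shows the median/min/max aggregation per quality level.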
Draw learning curves for the 5 best experiments on the MSRank dataset (multiclass mode):
```shell
python plot.py --type best --top 5 --from-iter 500 -i result.json -o msrank_mc_plots
```
Draw time curves with a 10x8 figure size for experiments on the Higgs dataset, from 85% of the best achieved quality up to 100%, in 30 points:
```shell
python plot.py --type quality-vs-time -f 10 8 -o higgs_plots -i result.json --low-percent 0.85 --num-bins 30
```
You can pass the --only-min option in this mode to draw only the minimum time to achieve each percentage of quality.
Draw time-per-iteration box plots for every library:
```shell
python plot.py --type time-per-iter -f 10 8 -o higgs_plots -i result.json
```
If you want to draw experiments for a specific method only, pass the option --only method_name. Choices for method_name are cat, xgb, lgb, cpu, gpu, and combinations of those.
- Used library versions: CatBoost -- 0.10.3, XGBoost -- 0.80, LightGBM -- 2.2.1.
- Hardware for CPU experiments: Intel Xeon E312xx (Sandy Bridge) (32 cores).
- Hardware for GPU experiments: NVIDIA GTX1080TI + Intel Core i7-6800K (12 cores).
We took the datasets most popular among benchmarks of other libraries and Kaggle competitions. The datasets cover different task types (regression, classification, ranking) and differ in sparsity, number of features, and number of samples. Together they cover the major computational bottlenecks of the GBDT algorithm: with a large number of samples, computing the loss-function gradients may become the bottleneck; with a large number of features, the tree-structure search phase may. To measure the performance of each library in both regimes, we created two synthetic datasets: the first with 10 million objects and 100 features, the second with 100K objects and 5K features.
The optimal hyperparameters (for example, maximal tree depth or learning rate) differ across such different tasks, so we chose a different hyperparameter grid for each dataset. Grid selection consisted of the following steps:
- Run each library with different maximum depths (or numbers of leaves, in the case of LightGBM) and learning rates.
- Select the best runs for each library and rerun them with different sample rates and regularization parameters.
- Create the grid based on all observations; it must contain the best runs of every algorithm.
Learning curves, time-per-iteration distributions, and other plots can be found in the plots folder.
Table with dataset characteristics. Sparsity values were calculated using the formula `sparsity = 100 - 100 * float(not_null_count) / (pool.shape[0] * pool.shape[1])`, where `not_null_count` is the number of values in the dataset that are not NaN, None, or zero (`abs(x) >= 1e-6`).
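A runnable sketch of this computation (a hypothetical helper; a small pandas DataFrame stands in for the real pool):

```python
import numpy as np
import pandas as pd

def sparsity(pool: pd.DataFrame) -> float:
    """Percentage of NaN/None/zero (abs(x) < 1e-6) values in the dataset."""
    values = pool.to_numpy(dtype=float)
    # Count values that are present and non-zero.
    not_null_count = np.count_nonzero(~np.isnan(values) & (np.abs(values) >= 1e-6))
    return 100 - 100 * float(not_null_count) / (pool.shape[0] * pool.shape[1])

pool = pd.DataFrame({"a": [1.0, 0.0, np.nan, 2.0],
                     "b": [0.0, 3.0, 4.0, 5.0]})
print(sparsity(pool))  # 3 of 8 values are zero/NaN -> 37.5
```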
Dataset | #samples | #features | sparsity, % |
---|---|---|---|
Abalone | 4,177 | 8 | 3.923 |
Letters | 20,000 | 16 | 2.621 |
Epsilon | 500,000 | 2,000 | 3.399 |
Higgs | 11,000,000 | 28 | 7.912 |
MSRank | 1,200,192 | 137 | 37.951 |
Airlines | 115,069,017 | 13 | 7.821 |
Synthetic (10M samples, 100 features) | 10,000,000 | 100 | 0.089 |
Synthetic (100K samples, 5K features) | 100,000 | 5,000 | 0.079 |
Parameters grid | iterations | max_depth | learning_rate | subsample | reg_lambda |
---|---|---|---|---|---|
Abalone | 2000 | [6, 8, 10] | [.03, .07, .15] | [.5, .75, 1] | [1] |
Letters | 2000 | [6, 8, 10] | [.03, .07, .15] | [.5, .75, 1] | [1] |
Epsilon | 5000 | [4, 6, 8] | [.09, .15, .25] | [.5, 1] | [1, 10, 100] |
Higgs | 5000 | [6, 8, 10] | [.01, .03, .07, .15, .3] | [.25, .5, .75, 1] | [2, 4, 8, 16] |
MSRank-RMSE | 8000 | [6, 10, 12] | [.02, .05, .08] | [.5, 1] | [1, 10] |
MSRank-MultiClass | 5000 | [6, 8, 10] | [.03, .07, .15] | [.5, .75, 1] | [1] |
Airlines | 5000 | [6, 8, 10] | [.05] | default | default |
Synthetic (10M samples, 100 features) | 5000 | [6, 8, 10] | [.05, .15] | default | default |
Synthetic (100K samples, 5K features) | 1000 | [6, 8] | [.05] | default | default |
Metric abbreviations: RMSE -- root mean squared error; Error -- the number of incorrectly classified samples divided by the total number of samples.
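The Error metric is just the misclassification rate; a minimal sketch:

```python
import numpy as np

def classification_error(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Fraction of incorrectly classified samples."""
    return float(np.mean(y_true != y_pred))

y_true = np.array([0, 1, 1, 0, 1])
y_pred = np.array([0, 1, 0, 0, 1])
print(classification_error(y_true, y_pred))  # 1 mistake out of 5 -> 0.2
```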
Dataset | Metric | CatBoost | XGBoost | LightGBM |
---|---|---|---|---|
Abalone | RMSE | 2.154 | 2.154 | 2.143 |
Letters | Error | 0.020 | 0.026 | 0.023 |
Epsilon | Error | 0.109 | 0.111 | 0.111 |
Higgs | Error | 0.227 | 0.224 | 0.222 |
MSRank-RMSE | RMSE | 0.737 | 0.738 | 0.739 |
MSRank-MultiClass | Error | 0.424 | 0.425 | 0.424 |
Airlines | Error | 0.188 | 0.190 | 0.176 |
Synthetic (10M samples, 100 features) | RMSE | 8.363 | 8.287 | 9.354 |
Synthetic (100K samples, 5K features) | RMSE | 6.814 | 10.39 | 11.060 |
Table with speed measurements. Here we present only experiments with max_depth = 6 (other parameters were set to their defaults).
median time-per-iter, sec | catboost | xgboost | lightgbm |
---|---|---|---|
abalone | 0.005 ± 0.004 | 0.01 ± 0.001 | 0.002 |
airline | 7.08 ± 2.684 | 6.293 ± 0.378 | 5.002 ± 0.375 |
epsilon | 0.271 ± 0.015 | 9.008 ± 0.456 | 1.344 ± 0.185 |
higgs | 0.723 ± 0.032 | 0.75 ± 0.192 | 0.691 ± 0.254 |
letters | 0.265 ± 0.011 | 0.125 ± 0.025 | 0.074 ± 0.02 |
msrank | 0.092 ± 0.004 | 0.256 ± 0.09 | 0.131 ± 0.027 |
msrank-classification | 0.565 ± 0.02 | 1.701 ± 0.198 | 0.95 ± 0.284 |
synthetic | 0.541 ± 0.079 | 1.199 ± 0.23 | 1.097 ± 0.242 |
synthetic-5k-features | 0.166 ± 0.011 | 12.481 ± 3.716 | 1.1 ± 0.213 |
total, sec | catboost | xgboost | lightgbm |
---|---|---|---|
abalone | 6.57 | 13.70 | 6.43 |
airline | 974.107 | 611.934 | 526.182 |
epsilon | 266.77 | 8935.36 | 1323.40 |
higgs | 733.27 | 804.18 | 788.64 |
letters | 259.75 | 176.82 | 112.74 |
msrank | 89.72 | 341.99 | 185.26 |
msrank-classification | 55.349 | 174.726 | 92.670 |
synthetic | 593.87 | 1327.40 | 1154.13 |
synthetic-5k-features | 166.57 | 14951.47 | 1135.94 |
median time-per-iter, sec | catboost | xgboost | lightgbm |
---|---|---|---|
abalone | 0.006 ± 0.001 | 0.005 ± 0.001 | 0.004 ± 0.016 |
airline | 0.486 ± 0.016 | 0.932 ± 0.094 | 3.641 ± 0.612 |
epsilon | 0.045 ± 0.005 | 2.686 ± 0.283 | 0.659 ± 0.107 |
higgs | 0.056 ± 0.001 | 0.111 ± 0.019 | 0.166 ± 0.06 |
letters | 0.009 | 0.034 ± 0.008 | 0.349 ± 0.099 |
msrank | 0.011 ± 0.001 | 0.039 ± 0.006 | 0.053 ± 0.013 |
msrank-classification | 0.022 ± 0.001 | 0.366 ± 0.021 | 0.191 ± 2.961 |
synthetic | 0.035 ± 0.003 | 0.209 ± 0.022 | 0.112 ± 0.03 |
synthetic-5k-features | 0.027 ± 0.002 | 1.357 ± 0.149 | 1.51 ± 0.148 |
total, sec | catboost | xgboost | lightgbm |
---|---|---|---|
abalone | 5.763 | 5.764 | 23.109 |
airline | 485.825 | 928.99 | 3716.344 |
epsilon | 44.971 | 2661.741 | 686.4 |
higgs | 55.601 | 110.485 | 166.228 |
letters | 9.271 | 36.79 | 115.383 |
msrank | 11.057 | 40.003 | 60.688 |
msrank-classification | 22.156 | 372.541 | 204.634 |
synthetic | 36.658 | 210.305 | 123.255 |
synthetic-5k-features | 28.014 | 1353.212 | 1557.471 |
This repository contains modified versions of files released by third parties under other licenses. See the corresponding notices in these files for details.