- We removed calc_feature_importance parameter from Python and R. Now feature importance calculation is almost free, so we always calculate feature importances. Previously you could disable it if it was slowing down your training.
- We removed Doc type for feature importances. Use Shap instead.
- We moved thread_count parameter in Python get_feature_importance method to the end.
In this release we added several very powerful ranking objectives:
- PairLogitPairwise
- YetiRankPairwise
- QueryCrossEntropy (GPU only)
Other ranking improvements:
- We have made improvements to our existing ranking objectives QuerySoftMax and PairLogit.
- We have added group weights support.
- Improvement for datasets with weights
- We now automatically calculate a good learning rate for you at the start of training, so you don't have to specify it. After training has finished, you can look at the training curve on the evaluation dataset and make adjustments to the selected learning rate, but it will already be a good value.
- Several speedups for GPU training.
- 1.5x speedup for applying the model.
- Speedup for multi-classification training.
- 2x speedup for AUC calculation in eval_metrics.
- Several speedups for eval_metrics for other metrics.
- 100x speed up for Shap values calculation.
- Speedup for feature importance calculation. It used to be a bottleneck for GPU training previously, now it's not.
- We added the possibility to skip metric calculation on the train dataset using `MetricName:hints=skip_train~true` (it might speed up your training if metric calculation is a bottleneck, for example, if you calculate many metrics or if you calculate metrics on GPU).
- We added the possibility to calculate metrics only periodically, not on all iterations. Use metric_period for that (previously it only disabled verbose output on each iteration).
- Calculation of expensive metrics on the train dataset is now disabled by default: we don't calculate AUC and PFound on the train dataset unless asked to. You can also disable calculation of other metrics on the train dataset using `MetricName:hints=skip_train~true`. If you want to calculate AUC or PFound on the train dataset, you can use `MetricName:hints=skip_train~false`.
- If you calculate metrics using eval_metrics or during training, you can now use metric_period to skip some iterations. It will speed up eval_metrics and it might speed up training, especially GPU training. Note that the most expensive metric calculation is AUC calculation; for this metric and large datasets it makes sense to use metric_period. If you only want less verbose output, but still want metric values for every iteration written to file, you can use the `verbose=n` parameter.
- Parallelization of calculation of most of the metrics during training
- It is now possible to calculate and visualise custom_metric during training on GPU. You can use our Jupyter visualization, CatBoost viewer or TensorBoard the same way you used them for CPU training. It might be a bottleneck, so if it slows down your training use `metric_period=something` and `MetricName:hints=skip_train~true`
- We switched to CUDA 9.1. Starting from this release CUDA 8.0 will not be supported
- Support for external borders on GPU for cmdline
- We added support of feature combinations to our Shap values implementation.
- Added Shap values for MultiClass and added an example of its usage to our Shap tutorial.
- Added `pretified` parameter to get_feature_importance(). With `pretified=True` the function will return a list of features with names, sorted in descending order by their importance.
- Improved interfaces for eval-feature functionality
- Shap values support in R-package
- It is possible now to save any metainformation to the model.
- Empty values support
- Better support of sklearn
- feature_names_ for CatBoost class
- Added silent parameter
- Better stdout
- Better diagnostic for invalid inputs
- Better documentation
- Added a flag to allow constant labels
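
The `skip_train` hint and `metric_period` described above are passed as plain strings and numbers, so a small helper can keep them consistent. The sketch below only builds a parameter dictionary; `with_skip_train` is a hypothetical helper, not part of the library, and the metric names and the period of 50 are chosen for illustration.

```python
def with_skip_train(metric_name, skip_train):
    """Append the skip_train hint to a metric name (hypothetical helper)."""
    flag = "true" if skip_train else "false"
    return f"{metric_name}:hints=skip_train~{flag}"


# Illustrative training parameters: AUC is explicitly requested on the train
# dataset, Accuracy is skipped there, and metrics are computed periodically.
params = {
    "loss_function": "Logloss",
    "custom_metric": [
        with_skip_train("AUC", skip_train=False),
        with_skip_train("Accuracy", skip_train=True),
    ],
    "metric_period": 50,
}

print(params["custom_metric"])
```

Building the strings in one place makes it harder to mix up `~true` (skip the train dataset) and `~false` (calculate on it).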
We added many new metrics that can be used for visualization, overfitting detection, selecting of best iteration of training or for cross-validation:
- BrierScore
- HingeLoss
- HammingLoss
- ZeroOneLoss
- MSLE
- MAE
- BalancedAccuracy
- BalancedErrorRate
- Kappa
- WKappa
- QueryCrossEntropy
- NDCG
- Saving model as C++ code
- Saving model with categorical features as Python code
Added make files for binary with CUDA and for Python package
We created a new repo with tutorials, now you don't have to clone the whole catboost repo to run Jupyter notebook with a tutorial.
We also have a set of bugfixes, and we are grateful to everyone who has filed a bug report, helping us make the library better.
This release contains contributions from CatBoost team. We want to especially mention @pukhlyakova who implemented lots of useful metrics.
- New model method `get_cat_feature_indices()` in Python wrapper.
- Minor fixes and stability improvements.
- We fixed a bug in CatBoost Pool initialization from `numpy.array` and `pandas.dataframe` with string values that could cause slight inconsistency when using a trained model from older versions. Around 1% of cat feature hashes were treated incorrectly. If you experience a quality drop after the update, you should consider retraining your model.
- Algorithm for finding the most influential training samples for a given object, from the 'Finding Influential Training Samples for Gradient Boosted Decision Trees' paper, is implemented. For every object from the input pool, this mode calculates scores for every object from the train pool. A positive score means that the given train object has made a negative contribution to the given test object prediction, and vice versa for negative scores. The higher the absolute value of the score, the higher the contribution. See the `get_object_importance` model method in the Python package and the `ostr` mode in the cli-version. Tutorial for Python is available here. More details and examples will be published in documentation soon.
- We have implemented a new way of exploring feature importance: Shap values from paper. This allows you to understand which features are most influential for a given object. You can also get more insight into your model, see details in a tutorial.
- Save model as code functionality published. For now you can save a model as Python code with categorical features and as C++ code w/o categorical features.
- Fix `_catboost` reinitialization issues #268 and #269.
- Python module `catboost.utils` extended with `create_cd`. It creates a column description file.
- Now it's possible to load titanic and amazon (Kaggle Amazon Employee Access Challenge) datasets from Python code. Use `catboost.datasets`.
- GPU parameter `use_cpu_ram_for_cat_features` renamed to `gpu_cat_features_storage` with possible values `CpuPinnedMemory` and `GpuRam`. Default is `GpuRam`.
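
To make the column description file mentioned above more concrete, here is a minimal sketch that writes one by hand. It assumes the tab-separated layout of column index followed by column type; `write_column_description` is a hypothetical helper written for illustration and is not the `create_cd` function itself.

```python
import os
import tempfile


def write_column_description(path, label_index, cat_feature_indices):
    """Write one tab-separated 'index, type' row per special column (illustrative)."""
    rows = [(label_index, "Label")]
    rows += [(index, "Categ") for index in cat_feature_indices]
    with open(path, "w", encoding="utf-8") as cd_file:
        # Rows are sorted by column index to keep the file easy to read.
        for index, column_type in sorted(rows):
            cd_file.write(f"{index}\t{column_type}\n")


cd_path = os.path.join(tempfile.mkdtemp(), "train.cd")
write_column_description(cd_path, label_index=0, cat_feature_indices=[3, 1])

with open(cd_path, encoding="utf-8") as cd_file:
    cd_text = cd_file.read()
print(cd_text)
```

Columns that are not listed are left out of the file in this sketch; only the label and the categorical columns are described.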
This release contains contributions from CatBoost team.
As usual we are grateful to all who filed issues or helped resolve them, asked and answered questions.
- GPU: New `DocParallel` mode for tasks without categorical features or with categorical features and `--max-ctr-complexity 1`. Provides best performance for pools with a big number of documents.
- GPU: Distributed training on several GPU hosts via MPI. See instruction how to build binary here.
- GPU: Up to 30% learning speed-up for Maxwell and later GPUs with binarization level > 32
- Hotfixes for GPU version of python wrapper.
- Python wrapper: added methods to download datasets titanic and amazon, to make it easier to try the library (`catboost.datasets`).
- Python wrapper: added method to write column description file (`catboost.utils.create_cd`).
- Made improvements to visualization.
- Support non-numeric values in `GroupId` column.
- Tutorials section updated.
- Fixed problems with eval_metrics (issue #285)
- Other fixes
- Changed parameter order in `train()` function to be consistent with other GBDT libraries.
- `use_best_model` is set to True by default if `eval_set` labels are present.
- New ranking mode `YetiRank` optimizes `NDCG` and `PFound`.
- New visualisation for `eval_metrics` and `cv` in Jupyter notebook.
- Improved per document feature importance.
- Supported `verbose`=`int`: if `verbose` > 1, `metric_period` is set to this value.
- Supported type(`eval_set`) = list in python. Currently supporting only single `eval_set`.
- Binary classification leaf estimation defaults are changed for weighted datasets so that training converges for any weights.
- Add `model_size_reg` parameter to control model size. Fix `ctr_leaf_count_limit` parameter, also to control model size.
- Beta version of distributed CPU training with only float features support.
- Add `subgroupId` to Python/R-packages.
- Add groupwise metrics support in `eval_metrics`.
This release contains contributions from CatBoost team.
We are grateful to all who filed issues or helped resolve them, asked and answered questions.
- `boosting_type` parameter value `Dynamic` is renamed to `Ordered`.
- Data visualisation functionality in Jupyter Notebook requires ipywidgets 7.x+ now.
- `query_id` parameter renamed to `group_id` in Python and R wrappers.
- cv returns pandas.DataFrame by default if Pandas is installed. See new parameter `as_pandas`.
- CatBoost build with make file. Now it’s possible to build command-line CPU version of CatBoost under Linux with make file.
- In column description column name `Target` is changed to `Label`. It will still work with the previous name, but it is recommended to use the new one.
- `eval-metrics` mode added into cmdline version. Metrics can be calculated for a given dataset using a previously trained model.
- New classification metric `CtrFactor` is added.
- Load CatBoost model from memory. You can load your CatBoost model from file or initialize it from buffer in memory.
- Now you can run `fit` function using file with dataset: `fit(train_path, eval_set=eval_path, column_description=cd_file)`. This will reduce memory consumption by up to two times.
- 12% speedup for training.
- JSON output data format is changed.
- Python whl binaries with CUDA 9.1 support for Linux OS published into the release assets.
- Added `bootstrap_type` parameter to `CatBoostClassifier` and `Regressor` (issue #263).
This release contains contributions from newbfg and CatBoost team.
We are grateful to all who filed issues or helped resolve them, asked and answered questions.
- BETA version of distributed multi-host GPU training via MPI
- Added possibility to import coreml model with oblivious trees. Makes possible to migrate pre-flatbuffers model (with float features only) to current format (issue #235)
- Added QuerySoftMax loss function
- Fixed GPU models bug on pools with both categorical and float features (issue #241)
- Use all available cores by default
- Fixed not querywise loss for pool with `QueryId`
- Default float features binarization method set to `GreedyLogSum`
- Hotfix for critical bug in Python and R wrappers (issue #238)
- Added stratified data split in CV
- Fix `is_classification` check and CV for Logloss
- Fixed critical bugs in formula evaluation code (issue #236)
- Added scale_pos_weight parameter
- 25% speedup of the model applier
- 43% speedup for training on large datasets.
- 15% speedup for `QueryRMSE` and calculation of querywise metrics.
- Large speedups when using binary categorical features.
- Significant (x200 on 5k trees and 50k lines dataset) speedup for plot and stage predict calculations in cmdline.
- Compilation time speedup.
- Industry fastest applier implementation.
- Introducing new parameter `boosting-type` to switch between standard boosting scheme and dynamic boosting, described in paper "Dynamic boosting".
- Adding new bootstrap types `bootstrap_type`, `subsample`. Using `Bernoulli` bootstrap type with `subsample < 1` might increase the training speed.
- Better logging for cross-validation, added parameter `logging_level` and `metric_period` (should be set in training parameters) to cv.
- Added a separate `train` function that receives the parameters and returns a trained model.
- Ranking mode `QueryRMSE` now supports default settings for dynamic boosting.
- R-package pre-build binaries are included into release.
- We added many synonyms to our parameter names, now it is more convenient to try CatBoost if you are used to some other library.
- Fix for CPU `QueryRMSE` with weights.
- Adding several missing parameters into wrappers.
- Fix for data split in querywise modes.
- Better logging.
- From this release we'll provide pre-build R-binaries
- More parallelisation.
- Memory usage improvements.
- And some other bug fixes.
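
The note above that a `Bernoulli` bootstrap with `subsample < 1` might increase training speed follows from the sampling scheme: each object is kept independently with probability `subsample`, so each iteration works on fewer objects. The sketch below shows that general idea only; it is not the library's implementation, and `bernoulli_subsample` is a hypothetical name.

```python
import random


def bernoulli_subsample(object_count, subsample, seed=0):
    """Keep each object index independently with probability `subsample`."""
    rng = random.Random(seed)
    return [index for index in range(object_count) if rng.random() < subsample]


# With subsample=0.66 roughly two thirds of the objects are used.
selected = bernoulli_subsample(object_count=10_000, subsample=0.66)
share = len(selected) / 10_000
print(f"kept {len(selected)} of 10000 objects ({share:.1%})")
```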
This release contains contributions from CatBoost team.
We are grateful to all who filed issues or helped resolve them, asked and answered questions.
- We've made single document formula applier 4 times faster!
- `model.shrink` function added in Python and R wrappers.
- Added new training parameter `metric_period` that controls output frequency.
- Added new ranking metric `QueryAverage`.
- This version contains an easy way to implement new user metrics in C++. How-to example is provided.
- Stability improvements and bug fixes
As usual we are grateful to all who filed issues, asked and answered questions.
Cmdline:
- Training parameter `gradient-iterations` renamed to `leaf-estimation-iterations`.
- `border` option removed. If you want to specify border for binary classification mode you need to specify it in the following way: `loss-function Logloss:Border=0.5`
- CTR parameters are changed:
  - Removed `priors`, `per-feature-priors`, `ctr-binarization`;
  - Added `simple-ctr`, `combinations-ctr`, `per-feature-ctr`; More details will be published in our documentation.
Python:
- Training parameter `gradient_iterations` renamed to `leaf_estimation_iterations`.
- `border` option removed. If you want to specify border for binary classification mode you need to specify it in the following way: `loss_function='Logloss:Border=0.5'`
- CTR parameters are changed:
  - Removed `priors`, `per_feature_priors`, `ctr_binarization`;
  - Added `simple_ctr`, `combinations_ctr`, `per_feature_ctr`; More details will be published in our documentation.
- In Python we added a new method `eval_metrics`: now it's possible for a given model to calculate specified metric values for each iteration on a specified dataset.
- One command-line binary for CPU and GPU: in CatBoost you can switch between CPU and GPU training by changing a single parameter value, `task-type CPU` or `GPU` (task_type 'CPU', 'GPU' in python bindings). Windows build still contains two binaries.
- We have sped up the training by up to 30% for datasets with a lot of objects.
- Up to 10% speed-up of GPU implementation on Pascal cards
- Stability improvements and bug fixes
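
For existing Python scripts, the `gradient_iterations` rename and the removal of `border` listed above amount to a small rewrite of the parameter dictionary. The sketch below shows one way to do it; `migrate_params` is a hypothetical helper, and falling back to `Logloss` when no loss function is given is an assumption made for this example.

```python
def migrate_params(old_params):
    """Return a copy of `old_params` with the renames from these notes applied."""
    params = dict(old_params)
    if "gradient_iterations" in params:
        params["leaf_estimation_iterations"] = params.pop("gradient_iterations")
    if "border" in params:
        border = params.pop("border")
        # Assumption: a binary classification setup where Logloss is intended.
        loss = params.get("loss_function", "Logloss")
        params["loss_function"] = f"{loss}:Border={border}"
    return params


old = {"iterations": 500, "gradient_iterations": 3, "border": 0.5}
new = migrate_params(old)
print(new)
```

The helper does not touch the CTR parameters, since their new values depend on how the old priors were used.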
As usual we are grateful to all who filed issues, asked and answered questions.
FlatBuffers model format: new CatBoost versions won't break model compatibility anymore.
- Training speedups: we have sped up the training by 33%.
- Two new ranking modes are available:
  - `PairLogit` - pairwise comparison of objects from the input dataset. The algorithm maximises the probability of correctly ordering all dataset pairs.
  - `QueryRMSE` - mix of regression and ranking. It tries to make the best ranking for each dataset query by input labels.
- We have fixed a bug that caused quality degradation when using weights < 1.
- `Verbose` flag is now deprecated, please use `logging_level` instead. You can set the following levels: `Silent`, `Verbose`, `Info`, `Debug`.
- And some other bug fixes.
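
The idea behind a pairwise mode such as `PairLogit` can be illustrated with a plain pairwise logistic loss: for each (winner, loser) pair, the loss is the negative log-probability that the winner is scored above the loser. This is a sketch of that general objective only; the function name is hypothetical and the exact weighting or normalization used by the library may differ.

```python
import math


def pair_logit_loss(scores, pairs):
    """Sum of -log(sigmoid(score[winner] - score[loser])) over all pairs."""
    total = 0.0
    for winner, loser in pairs:
        margin = scores[winner] - scores[loser]
        # log(1 + exp(-margin)) equals -log(sigmoid(margin)).
        total += math.log1p(math.exp(-margin))
    return total


# Object 0 should rank above 1 and 2, and object 1 above 2.
pairs = [(0, 1), (0, 2), (1, 2)]
good = pair_logit_loss([2.0, 1.0, 0.0], pairs)  # scores agree with the pairs
bad = pair_logit_loss([0.0, 1.0, 2.0], pairs)   # scores contradict the pairs
print(good, bad)
```

Scores that order every pair correctly give a lower loss, which is what minimising this objective pushes the model toward.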
This release contains contributions from: avidale, newbfg, KochetovNicolai and CatBoost team.
We are grateful to all who filed issues or helped resolve them, asked and answered questions.
GPU CUDA support is available. CatBoost supports multi-GPU training. Our GPU implementation is 2 times faster than LightGBM and more than 20 times faster than the XGBoost one. Check out the news with benchmarks on our site.
Stability improvements and bug fixes
This release contains contributions from: daskol and CatBoost team.
We are grateful to all who filed issues or helped resolve them, asked and answered questions.
- R library interface significantly changed
- New model format: CatBoost v0.2 model binary not compatible with previous versions
- Cross-validation parameters changes: we changed overfitting detector parameters of CV in python so that it is same as those in training.
- CTR types: MeanValue => BinarizedTargetMeanValue
- Training speedups: we have sped up the training by 20-30%.
- Accuracy improvement with categoricals: we have changed computation of statistics for categorical features, which leads to better quality.
- New type of overfitting detector: `Iter`. This type of detector was requested by our users. So now you can also stop training by a simple criterion: if after a fixed number of iterations there is no improvement of your evaluation function.
- TensorBoard support: this is another way of looking at the graphs of different error functions both during training and after training has finished. To look at the metrics you need to provide `train_dir` when training your model and then run `"tensorboard --logdir={train_dir}"`
- Jupyter notebook improvements: for our Python library users that experiment with Jupyter notebooks, we have improved our visualisation tool. Now it is possible to save image of the graph. We also have changed scrolling behaviour so that it is more convenient to scroll the notebook.
- NaN features support: we have also added a simple but effective way of dealing with NaN features. If you have some NaNs in the train set, they will be changed to a value that is less than the minimum value or greater than the maximum value in the dataset (this is configurable), so that it is guaranteed that they are in their own bin, and a split would separate NaN values from all other values. By default, no NaNs are allowed, so you need to use option `nan_mode` for that. When applying a model, NaNs will be treated in the same way for the features where NaN values were seen in train. It is not allowed to have NaN values in test if no NaNs in train for this feature were provided.
- Snapshotting: we have added snapshotting to our Python and R libraries. So if you think that something can happen with your training, for example the machine can reboot, you can use the `snapshot_file` parameter - this way after you restart your training it will start from the last completed iteration.
- R library tutorial: we have added tutorial
- Logging customization: we have added `allow_writing_files` parameter. By default some files with logging and diagnostics are written to disk, but you can turn it off by setting this flag to False.
- Multiclass mode improvements: we have added a new objective for multiclass mode - `MultiClassOneVsAll`. We also added `class_names` param - now you don't have to renumber your classes to be able to use multiclass. And we have added two new metrics for multiclass: `TotalF1` and `MCC`. You can use these metrics to see how their values change during training, for overfitting detection, or for cutting the model by the best value of a given metric.
- Any delimiters support: in addition to datasets in `tsv` format, CatBoost now supports files with any delimiters
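
The NaN handling described above can be pictured as replacing each NaN with a value just outside the observed range of the feature, so that a single split separates NaNs from everything else. The sketch below illustrates that idea in plain Python; the function name, the mode names `"Min"` and `"Max"`, and the offset of 1.0 are assumptions for this example rather than the library's actual internals.

```python
import math


def replace_nans(values, mode="Min"):
    """Replace NaNs with a value outside the observed range (illustrative)."""
    observed = [value for value in values if not math.isnan(value)]
    if mode == "Min":
        filler = min(observed) - 1.0  # below every real value
    elif mode == "Max":
        filler = max(observed) + 1.0  # above every real value
    else:
        raise ValueError("mode must be 'Min' or 'Max' in this sketch")
    return [filler if math.isnan(value) else value for value in values]


feature = [0.5, float("nan"), 2.0, -1.0]
as_min = replace_nans(feature, mode="Min")
as_max = replace_nans(feature, mode="Max")
print(as_min, as_max)
```

Because the filler lies strictly outside the range of real values, a threshold between the filler and the nearest real value puts all NaNs on one side of the split.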
Stability improvements and bug fixes
This release contains contributions from: grayskripko, hadjipantelis and CatBoost team.
We are grateful to all who filed issues or helped resolve them, asked and answered questions.