
The sklearn wrapper is not really compatible with the sklearn ecosystem #2966

Closed · ekerazha opened this issue Apr 2, 2020 · 12 comments

@ekerazha

ekerazha commented Apr 2, 2020

I originally wrote this comment in #2946 but it was not the best place (thank you @rth for your reply).

In my opinion there's a big issue with the current scikit-learn wrapper.

In general, most libraries in the scikit-learn ecosystem (or sklearn itself) expect a fit() method where you only pass X and y (and maybe sample_weight).

In the current wrapper we also have other params in fit, such as early_stopping_round. I think we should move as many parameters as possible from the fit method to the estimator constructor.

For example, catboost (https://catboost.ai/docs/concepts/python-reference_catboost.html) allows setting most parameters through the constructor (you can set early_stopping_round either when you create the estimator object or in the fit method).

For example, if I want to create a StackingClassifier (https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.StackingClassifier.html), it's not clear how to pass the additional LightGBM fit() parameters through the StackingClassifier wrapper.
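To illustrate (a minimal sketch; make_classification is just a stand-in dataset, and this reflects the scikit-learn API as of this discussion):

```python
from lightgbm import LGBMClassifier
from sklearn.datasets import make_classification
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=500, random_state=0)

# StackingClassifier.fit(X, y) forwards only X and y (plus sample_weight)
# to the base estimators, so there is no way from here to route
# LightGBM-specific fit() arguments such as eval_set or early_stopping_rounds.
clf = StackingClassifier(
    estimators=[("lgbm", LGBMClassifier())],
    final_estimator=LogisticRegression(),
)
clf.fit(X, y)  # works, but without early stopping
```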

In the past I created a custom LightGBM compatibility layer where I passed parameters to the constructor (inside a fit_params dictionary) that were then used when calling LightGBM's fit() method, roughly as in the sketch below.
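Not my original code, but the idea looked roughly like this (LGBMFitParamsClassifier is a made-up name, and LightGBM's own constructor parameters are omitted for brevity):

```python
from lightgbm import LGBMClassifier
from sklearn.base import BaseEstimator, ClassifierMixin


class LGBMFitParamsClassifier(BaseEstimator, ClassifierMixin):
    """Hypothetical shim: LightGBM fit() arguments are captured once in the
    constructor and replayed on every fit(X, y) call, so the estimator
    exposes the plain signature that sklearn meta-estimators expect."""

    def __init__(self, fit_params=None):
        self.fit_params = fit_params

    def fit(self, X, y):
        self.model_ = LGBMClassifier()
        self.model_.fit(X, y, **(self.fit_params or {}))
        self.classes_ = self.model_.classes_
        return self

    def predict(self, X):
        return self.model_.predict(X)

    def predict_proba(self, X):
        return self.model_.predict_proba(X)
```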

I think we should definitely move as many parameters as possible out of the fit() method to improve compatibility with the sklearn ecosystem.

@StrikerRUS
Collaborator

Yes, the current scikit-learn API does not standardize how to pass additional arguments to fit that cannot be passed into the constructor, e.g. eval_set. They will probably refactor their interface when histogram gradient boosting leaves its beta phase. Refer to #2628 (comment). We should definitely get back to this issue after any corresponding updates on the scikit-learn side.

@StrikerRUS
Collaborator

Moving early_stopping_round alone into __init__ right now makes no sense: if you cannot pass additional parameters to fit, you also cannot pass eval_set, which means early stopping cannot be used at all. Most fit params have no analogue in the sklearn ecosystem for now, so they can be treated as additional functionality that is not fully supported by the various scikit-learn tools.

@ekerazha
Author

ekerazha commented Apr 2, 2020

> Moving early_stopping_round alone into __init__ right now makes no sense: if you cannot pass additional parameters to fit, you also cannot pass eval_set, which means early stopping cannot be used at all. Most fit params have no analogue in the sklearn ecosystem for now, so they can be treated as additional functionality that is not fully supported by the various scikit-learn tools.

A possible approach, used by Apache Spark (and also by the LightGBM estimators in MMLSpark), is to have an additional indicator "column" which is 1 for eval rows and 0 for training rows. You could then pass the name of that column (or its position, if it's not a pandas DataFrame) to the constructor, and that column would be used internally to split off the eval set. It's not a common solution in Python libraries, but it's the one Spark adopted when it faced this issue.
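A rough sketch of how such a wrapper could split the data internally (fit_with_indicator is a hypothetical helper, assuming a plain 2-D NumPy X):

```python
import numpy as np


def fit_with_indicator(estimator, X, y, indicator_col=-1):
    """Hypothetical helper: rows where the indicator column is 1 become
    the eval set; rows where it is 0 stay in the training set."""
    mask = X[:, indicator_col].astype(bool)
    X_feat = np.delete(X, indicator_col, axis=1)  # drop the indicator itself
    return estimator.fit(
        X_feat[~mask], y[~mask],
        eval_set=[(X_feat[mask], y[mask])],
    )
```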

@ekerazha
Author

ekerazha commented Apr 2, 2020

Meanwhile, categorical_feature could be a good candidate to be set from the constructor instead of fit().

@StrikerRUS
Collaborator

> A possible approach, used by Apache Spark (and also by the LightGBM estimators in MMLSpark), is to have an additional indicator "column" which is 1 for eval rows and 0 for training rows.

That approach cannot help in cases where the training and evaluation datasets come from different sources.

> Meanwhile, categorical_feature could be a good candidate to be set from the constructor instead of fit().

Yeah, feature_name and categorical_feature can be removed from fit, because you can pass them in kwargs.
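For example (a minimal sketch, assuming X_train/y_train exist and that kwargs keep being forwarded as LightGBM parameters):

```python
from lightgbm import LGBMClassifier

# categorical_feature passed through the constructor's **kwargs becomes a
# regular LightGBM parameter instead of a fit() argument.
clf = LGBMClassifier(categorical_feature=[0, 3])
clf.fit(X_train, y_train)  # no fit-time categorical_feature needed
```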

@ekerazha
Author

ekerazha commented Apr 2, 2020

> That approach cannot help in cases where the training and evaluation datasets come from different sources.

Why not? You merge them into a single feature matrix/dataframe and use the indicator column to distinguish the training set rows from the eval set rows.
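E.g. with NumPy (X_train, X_eval, y_train, y_eval assumed to already exist):

```python
import numpy as np

# Merge a train set and an eval set into one matrix, with an appended
# indicator column: 0 = training row, 1 = eval row.
X_all = np.vstack([X_train, X_eval])
flag = np.concatenate([np.zeros(len(X_train)), np.ones(len(X_eval))])
X_all = np.column_stack([X_all, flag])
y_all = np.concatenate([y_train, y_eval])
```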

@StrikerRUS
Collaborator

Some types cannot be merged easily, or cannot be merged at all, e.g. a path to a Dataset saved in a binary file.
Moreover, scikit-learn explicitly states that the whole X and y passed into fit are used for training, so the trick you've described is not allowed:

> The fit() method takes the training data as arguments,
> Note that the model is fitted using X and y, but the object holds no reference to X and y.
https://scikit-learn.org/stable/developers/develop.html#fitting

@ekerazha
Author

ekerazha commented Apr 2, 2020

I don't see any particular problem in concatenating numpy arrays or pandas dataframes, which are the types supported by the sklearn wrapper.

About the second point: obviously it's not a perfect solution and it could have some issues (e.g. if you feed a numpy array into a sklearn pipeline where a previous transform changes the number of features, it could be non-obvious to locate the indicator column in the new feature matrix). Still, it would allow things that are impossible with the current sklearn wrapper, which, for the reasons I already explained, is not really compatible with the sklearn ecosystem.

@guolinke
Collaborator

guolinke commented Apr 4, 2020

@ekerazha
I agree that most parameters could be moved to the constructor.
For eval_set, although the additional column is a good hack, my main concern is performance. As most NumPy data is stored row-wise, appending an additional column requires allocating new memory and copying everything over. This would double the (peak) memory cost and slow down training. For example:
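```python
import numpy as np

X = np.random.rand(100_000, 100)  # C-contiguous (row-major) training data
flag = np.zeros((X.shape[0], 1))  # the would-be indicator column

# hstack cannot grow X in place: it allocates a brand-new (n, d + 1)
# buffer and copies every row, roughly doubling peak memory for a while.
X_with_flag = np.hstack([X, flag])
```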

Also, using eval_set (and other row-related parameters, like sample_weight) in fit is very straightforward, and many users are used to this style thanks to the popularity of XGBoost.

@NicolasHug

> Yeah, feature_name and categorical_feature can be removed from fit, because you can pass them in kwargs.

Nothing is really defined yet, but we're actually trying to go in the reverse direction. I.e., we have just introduced monotonic constraints (and hopefully will soon introduce categorical feature support): these are __init__ parameters for now, but ideally we'd want them to be fit parameters, or even just metadata associated with the input data (X and y). Basically, we're trying to move any parameter that is data-specific into fit, or at least out of __init__. Though again, nothing is definite for now.
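For reference, a minimal sketch of the current state on our side (assuming a 3-feature dataset; scikit-learn >= 0.23):

```python
# HistGradientBoosting still requires the experimental enable import here
# (scikit-learn >= 0.23 for monotonic_cst).
from sklearn.experimental import enable_hist_gradient_boosting  # noqa: F401
from sklearn.ensemble import HistGradientBoostingRegressor

# monotonic_cst is an __init__ parameter: one entry per feature, with
# +1 = increasing, -1 = decreasing, 0 = unconstrained. It is data-specific
# information living in the constructor, which is what we may move to fit.
est = HistGradientBoostingRegressor(monotonic_cst=[1, -1, 0])
```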

@StrikerRUS
Collaborator

@NicolasHug Thank you very much for finding time to come here and comment!
I think your comment can be treated as "+1" to my earlier words about that we (LightGBM) should take a break for now and don't start any refactoring processes of scikit-learn wrapper. Instead, we must wait for HistGradientBoosting becomes mature and then unify APIs with it and follow newly introduced rules in development guide. Ideally, I believe there can be created a roadmap issue with pings to maintainers of other gradient boosting libraries that you think make impact in data science society to help them get the latest updates in API and take a part into discussion. I remember there was a similar suggestion some time ago: scikit-learn/scikit-learn#15392 (comment).

Linking #2628 (comment) here.

@StrikerRUS
Collaborator

Closed in favor of #2302. We decided to keep all feature requests in one place.

You're welcome to contribute this feature! Please re-open this issue (or post a comment if you are not the topic starter) if you are actively working on implementing it.
