
The sklearn wrapper is not really compatible with the sklearn ecosystem #2966

Closed · ekerazha opened this issue Apr 2, 2020 · 12 comments

@ekerazha

ekerazha commented Apr 2, 2020

I originally wrote this comment in #2946 but it was not the best place (thank you @rth for your reply).

In my opinion there's a big issue with the current scikit-learn wrapper.

In general, most libraries in the scikit-learn ecosystem (or sklearn itself) expect a fit() method where you only pass X and y (and maybe sample_weight).

In the current wrapper we also have other params in fit, such as early_stopping_round. I think we should move as many parameters as possible from the fit method to the estimator constructor.

For example, catboost (https://catboost.ai/docs/concepts/python-reference_catboost.html) allows setting most parameters through the constructor (you can set early_stopping_round either when you create the estimator object or in the fit method).

For example, if I want to create a StackingClassifier (https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.StackingClassifier.html), it's not clear how to pass the additional LightGBM fit() parameters through the StackingClassifier wrapper.
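To illustrate (a minimal sketch; make_classification is just a stand-in dataset, and this reflects the scikit-learn API as of this discussion):

```python
from lightgbm import LGBMClassifier
from sklearn.datasets import make_classification
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=500, random_state=0)

# StackingClassifier.fit(X, y) forwards only X and y (plus sample_weight)
# to the base estimators, so there is no way from here to route
# LightGBM-specific fit() arguments such as eval_set or early_stopping_rounds.
clf = StackingClassifier(
    estimators=[("lgbm", LGBMClassifier())],
    final_estimator=LogisticRegression(),
)
clf.fit(X, y)  # works, but without early stopping
```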

In the past I created a custom LightGBM compatibility layer where I passed parameters to the constructor (inside a fit_params dictionary) that were then used when calling LightGBM's fit() method, roughly as in the sketch below.
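Not my original code, but the idea looked roughly like this (LGBMFitParamsClassifier is a made-up name, and LightGBM's own constructor parameters are omitted for brevity):

```python
from lightgbm import LGBMClassifier
from sklearn.base import BaseEstimator, ClassifierMixin


class LGBMFitParamsClassifier(BaseEstimator, ClassifierMixin):
    """Hypothetical shim: LightGBM fit() arguments are captured once in the
    constructor and replayed on every fit(X, y) call, so the estimator
    exposes the plain signature that sklearn meta-estimators expect."""

    def __init__(self, fit_params=None):
        self.fit_params = fit_params

    def fit(self, X, y):
        self.model_ = LGBMClassifier()
        self.model_.fit(X, y, **(self.fit_params or {}))
        self.classes_ = self.model_.classes_
        return self

    def predict(self, X):
        return self.model_.predict(X)

    def predict_proba(self, X):
        return self.model_.predict_proba(X)
```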

I think we should definitely move as many parameters as possible out of the fit() method to improve compatibility with the sklearn ecosystem.

@StrikerRUS
Collaborator

Yes, the current scikit-learn API does not standardize how to pass additional arguments to fit that cannot be passed into the constructor, e.g. eval_set. They will probably refactor their interface when histogram gradient boosting leaves its beta phase. Refer to #2628 (comment). We should definitely get back to this issue after any corresponding updates on the scikit-learn side.

@StrikerRUS
Collaborator

Moving early_stopping_round alone into __init__ right now makes no sense: if you cannot pass additional parameters to fit, you also cannot pass eval_set, which means early stopping cannot be used at all. Most fit params have no analogue in the sklearn ecosystem for now, so they can be treated as additional functionality that is not fully supported by the various scikit-learn tools.

@ekerazha
Author

ekerazha commented Apr 2, 2020

> Moving early_stopping_round alone into __init__ right now makes no sense: if you cannot pass additional parameters to fit, you also cannot pass eval_set, which means early stopping cannot be used at all. Most fit params have no analogue in the sklearn ecosystem for now, so they can be treated as additional functionality that is not fully supported by the various scikit-learn tools.

A possible approach, used by Apache Spark (and also by the LightGBM estimators in MMLSpark), is to have an additional indicator "column" which is 1 for eval rows and 0 for training rows. You could then pass the name of that column (or its position, if it's not a pandas DataFrame) to the constructor, and that column would be used internally to split off the eval set. It's not a common solution in Python libraries, but it's the one Spark adopted when it faced this issue.
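A rough sketch of how such a wrapper could split the data internally (fit_with_indicator is a hypothetical helper, assuming a plain 2-D NumPy X):

```python
import numpy as np


def fit_with_indicator(estimator, X, y, indicator_col=-1):
    """Hypothetical helper: rows where the indicator column is 1 become
    the eval set; rows where it is 0 stay in the training set."""
    mask = X[:, indicator_col].astype(bool)
    X_feat = np.delete(X, indicator_col, axis=1)  # drop the indicator itself
    return estimator.fit(
        X_feat[~mask], y[~mask],
        eval_set=[(X_feat[mask], y[mask])],
    )
```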

@ekerazha
Author

ekerazha commented Apr 2, 2020

Meanwhile, categorical_feature could be a good candidate to be set from the constructor instead of fit().

@StrikerRUS
Collaborator

> A possible approach, used by Apache Spark (and also by the LightGBM estimators in MMLSpark), is to have an additional indicator "column" which is 1 for eval rows and 0 for training rows.

That approach cannot help in cases where the training and evaluation datasets come from different sources.

> Meanwhile, categorical_feature could be a good candidate to be set from the constructor instead of fit().

Yeah, feature_name and categorical_feature can be removed from fit, because you can pass them in kwargs.
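For example (a minimal sketch, assuming X_train/y_train exist and that kwargs keep being forwarded as LightGBM parameters):

```python
from lightgbm import LGBMClassifier

# categorical_feature passed through the constructor's **kwargs becomes a
# regular LightGBM parameter instead of a fit() argument.
clf = LGBMClassifier(categorical_feature=[0, 3])
clf.fit(X_train, y_train)  # no fit-time categorical_feature needed
```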

@ekerazha
Author

ekerazha commented Apr 2, 2020

> That approach cannot help in cases where the training and evaluation datasets come from different sources.

Why not? You merge them into a single feature matrix/dataframe and use the indicator column to distinguish the training set rows from the eval set rows.
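E.g. with NumPy (X_train, X_eval, y_train, y_eval assumed to already exist):

```python
import numpy as np

# Merge a train set and an eval set into one matrix, with an appended
# indicator column: 0 = training row, 1 = eval row.
X_all = np.vstack([X_train, X_eval])
flag = np.concatenate([np.zeros(len(X_train)), np.ones(len(X_eval))])
X_all = np.column_stack([X_all, flag])
y_all = np.concatenate([y_train, y_eval])
```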

@StrikerRUS
Collaborator

Some types cannot be merged easily, or cannot be merged at all, e.g. a path to a Dataset saved in a binary file.
Moreover, scikit-learn explicitly states that the whole X and y passed into fit are used for training, so the trick you've described is not allowed:

> The fit() method takes the training data as arguments,
> Note that the model is fitted using X and y, but the object holds no reference to X and y.
https://scikit-learn.org/stable/developers/develop.html#fitting

@ekerazha
Author

ekerazha commented Apr 2, 2020

I don't see any particular problem in concatenating numpy arrays or pandas dataframes, which are the types supported by the sklearn wrapper.

About the second point: obviously it's not a perfect solution and it could have some issues (e.g. if you feed a numpy array into a sklearn pipeline where a previous transform changes the number of features, it could be non-obvious to locate the indicator column in the new feature matrix). Still, it would allow things that are impossible with the current sklearn wrapper, which, for the reasons I already explained, is not really compatible with the sklearn ecosystem.

@guolinke
Collaborator

guolinke commented Apr 4, 2020

@ekerazha
I agree that most parameters could be moved to the constructor.
For eval_set, although the additional column is a good hack, my main concern is performance. As most NumPy data is stored row-wise, appending an additional column requires allocating new memory and copying everything over. This would double the (peak) memory cost and slow down training. For example:
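```python
import numpy as np

X = np.random.rand(100_000, 100)  # C-contiguous (row-major) training data
flag = np.zeros((X.shape[0], 1))  # the would-be indicator column

# hstack cannot grow X in place: it allocates a brand-new (n, d + 1)
# buffer and copies every row, roughly doubling peak memory for a while.
X_with_flag = np.hstack([X, flag])
```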

Also, using eval_set (and other row-related parameters, like sample_weight) in fit is very straightforward, and many users are used to this style thanks to the popularity of XGBoost.

@NicolasHug

> Yeah, feature_name and categorical_feature can be removed from fit, because you can pass them in kwargs.

Nothing is really defined yet, but we're actually trying to go in the reverse direction. I.e., we have just introduced monotonic constraints (and hopefully will soon introduce categorical feature support): these are __init__ parameters for now, but ideally we'd want them to be fit parameters, or even just metadata associated with the input data (X and y). Basically, we're trying to move any parameter that is data-specific into fit, or at least out of __init__. Though again, nothing is definite for now.
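For reference, a minimal sketch of the current state on our side (assuming a 3-feature dataset; scikit-learn >= 0.23):

```python
# HistGradientBoosting still requires the experimental enable import here
# (scikit-learn >= 0.23 for monotonic_cst).
from sklearn.experimental import enable_hist_gradient_boosting  # noqa: F401
from sklearn.ensemble import HistGradientBoostingRegressor

# monotonic_cst is an __init__ parameter: one entry per feature, with
# +1 = increasing, -1 = decreasing, 0 = unconstrained. It is data-specific
# information living in the constructor, which is what we may move to fit.
est = HistGradientBoostingRegressor(monotonic_cst=[1, -1, 0])
```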

@StrikerRUS
Collaborator

@NicolasHug Thank you very much for finding time to come here and comment!
I think your comment can be treated as "+1" to my earlier words about that we (LightGBM) should take a break for now and don't start any refactoring processes of scikit-learn wrapper. Instead, we must wait for HistGradientBoosting becomes mature and then unify APIs with it and follow newly introduced rules in development guide. Ideally, I believe there can be created a roadmap issue with pings to maintainers of other gradient boosting libraries that you think make impact in data science society to help them get the latest updates in API and take a part into discussion. I remember there was a similar suggestion some time ago: scikit-learn/scikit-learn#15392 (comment).

Linking #2628 (comment) here.

@StrikerRUS
Collaborator

Closed in favor of #2302. We decided to keep all feature requests in one place.

You're welcome to contribute this feature! Please re-open this issue (or post a comment if you are not the topic starter) if you are actively working on implementing it.
