The sklearn wrapper is not really compatible with the sklearn ecosystem #2966
Comments
Yes, the current scikit-learn API does not standardize a way to pass additional arguments to fit().
A possible approach, used by Apache Spark (and also by the LightGBM estimators in MMLSpark), is to add an indicator "column" that is 1 for eval rows and 0 for training rows. You would pass the name of that column (or its position, if the data is not a pandas DataFrame) to the constructor, and the column would be used internally to split off the eval set. It's not a common solution in Python libraries, but it's the one Spark adopted when it faced this issue.
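The split step of that indicator-column idea can be sketched roughly as follows (the function name and the `indicator_col` parameter are illustrative, not part of any real wrapper API):

```python
import numpy as np

def split_by_indicator(X, y, indicator_col):
    """Split a combined matrix into train and eval parts using a 0/1 column."""
    mask = X[:, indicator_col].astype(bool)
    # Drop the indicator column itself before training
    X_rest = np.delete(X, indicator_col, axis=1)
    X_train, y_train = X_rest[~mask], y[~mask]
    X_eval, y_eval = X_rest[mask], y[mask]
    return X_train, y_train, X_eval, y_eval

# Combined data: the last column marks eval rows (1) vs training rows (0)
X = np.array([[1.0, 0], [2.0, 0], [3.0, 1], [4.0, 1]])
y = np.array([0, 1, 0, 1])
X_train, y_train, X_eval, y_eval = split_by_indicator(X, y, indicator_col=1)
```

The estimator would do this split internally in fit(), so callers (including meta-estimators) still only pass a single X and y.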
That approach doesn't help when the training and evaluation datasets come from different sources.
Why not? You merge them into a single feature matrix/DataFrame and use the indicator column to distinguish the training rows from the eval rows.
Some input types cannot be merged easily, or at all, e.g. a path to a Dataset saved in a binary file.
I don't see any particular problem in concatenating numpy arrays or pandas DataFrames, which are the types the sklearn wrapper supports. As for the second point, it's obviously not a perfect solution and it could have issues (e.g. if you feed a numpy array to a sklearn pipeline where an earlier transform changes the number of features, it may be non-obvious where the indicator column ends up in the new feature matrix). Still, it would allow things that are currently impossible with the current sklearn wrapper, which is also not really compatible with the sklearn ecosystem, for the reasons I already explained.
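The merge step being discussed is straightforward for numpy arrays; a minimal sketch (variable names are illustrative):

```python
import numpy as np

# Separate training and eval feature matrices
X_train = np.array([[1.0], [2.0]])
X_eval = np.array([[3.0], [4.0]])

# Stack them and append a 0/1 indicator column: 0 = training row, 1 = eval row
indicator = np.concatenate([np.zeros(len(X_train)), np.ones(len(X_eval))])
X_all = np.column_stack([np.vstack([X_train, X_eval]), indicator])
```

The caveat raised above still applies: if a pipeline step before the estimator changes the number or order of columns, the indicator column's position may no longer be where the estimator expects it.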
@ekerazha And use eval_set (and other row-related parameters, like
Nothing is really defined yet, but we're actually trying to go in the reverse direction. I.e., we have just introduced monotonic constraints (and hopefully will soon be introducing categorical feature support): these are
@NicolasHug Thank you very much for finding the time to come here and comment! Linking #2628 (comment) here.
Closing in favor of #2302: we decided to keep all feature requests in one place. You are welcome to contribute this feature! Please re-open this issue (or post a comment, if you are not the topic starter) if you are actively working on implementing it.
I originally wrote this comment in #2946, but that was not the best place for it (thank you @rth for your reply).
In my opinion there's a big issue with the current scikit-learn wrapper.
In general, most libraries in the scikit-learn ecosystem (or sklearn itself) expect a fit() method where you only pass X and y (and maybe sample_weight).
The current wrapper's fit() also takes other params, such as early_stopping_round. I think we should move as many parameters as possible from the fit() method to the estimator constructor.
For example, catboost https://catboost.ai/docs/concepts/python-reference_catboost.html allows setting most parameters through the constructor (early_stopping_round can be set either when you create the estimator object or in the fit() method).
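That dual pattern can be sketched with a toy estimator (illustrative only, not the real LightGBM or catboost API): the parameter is accepted both in the constructor and in fit(), with the fit()-time value taking precedence.

```python
class SketchClassifier:
    """Toy estimator: early_stopping_round can be set in the constructor,
    and fit() can still override it."""

    def __init__(self, early_stopping_round=None):
        self.early_stopping_round = early_stopping_round

    def fit(self, X, y, early_stopping_round=None):
        # The fit()-time value wins; otherwise fall back to the constructor value
        if early_stopping_round is None:
            early_stopping_round = self.early_stopping_round
        self.effective_early_stopping_round_ = early_stopping_round
        return self

# A meta-estimator that only ever calls fit(X, y) still picks up the setting,
# because it lives on the estimator object itself:
clf = SketchClassifier(early_stopping_round=50).fit([[1.0]], [0])
```

This is exactly why constructor parameters compose well with sklearn meta-estimators: they survive cloning and never need to be threaded through someone else's fit() signature.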
For example, if I want to create a StackingClassifier https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.StackingClassifier.html, it's not clear how to pass the additional LightGBM fit() parameters through the StackingClassifier wrapper.
In the past I created a custom LightGBM compatibility layer that accepted the parameters in the constructor (inside a fit_params dictionary) and used them when calling LightGBM's fit() method.
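A compatibility layer like that can be sketched roughly as follows (FitParamsWrapper and its attributes are hypothetical names for illustration, not the actual code):

```python
class FitParamsWrapper:
    """Hypothetical compatibility layer: fit-time keyword arguments are given
    to the constructor and forwarded inside fit(), so meta-estimators that
    only call fit(X, y) can still trigger e.g. early stopping."""

    def __init__(self, estimator, fit_params=None):
        self.estimator = estimator
        self.fit_params = fit_params or {}

    def fit(self, X, y):
        # Forward the stored parameters to the wrapped estimator's fit()
        self.estimator.fit(X, y, **self.fit_params)
        return self

    def predict(self, X):
        return self.estimator.predict(X)
```

Usage would look like `FitParamsWrapper(LGBMClassifier(), fit_params={"early_stopping_rounds": 20, ...})`, after which the wrapper exposes the plain fit(X, y) signature that sklearn meta-estimators expect.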
I think we should definitely move as many parameters as possible out of the fit() method to improve compatibility with the sklearn ecosystem.