[python-package] ignore scikit-learn 'check_sample_weight_equivalence' estimator check (fixes #6678) #6679
Conversation
@@ -41,6 +41,7 @@ bokeh=3.1.*
 fsspec=2024.5.*
 msgpack-python=1.0.*
 pluggy=1.5.*
+pyparsing=3.1.4
While working on this, I discovered #6680.
Have reported that issue upstream: conda-forge/pyparsing-feedstock#48
For LightGBM's purposes (this environment tests on an old version of Python), let's just pin it, as we do for other libraries in these environments.
Thanks!
> I think just skipping this compliance check is the right thing for `lightgbm` right now.

Agree with you! Thanks for the fast fix!
Fixes #6678
Fixes #6680
As of scikit-learn/scikit-learn#29818 (which will be in `scikit-learn` 1.6), `scikit-learn` contains an estimator check that enforces the following behavior: fitting with a `sample_weight` of 0 for a row should be equivalent to dropping that row, and fitting with an integer weight of n should be equivalent to repeating that row n times.

This new check is breaking CI here... LightGBM does not work that way.

- `weight=0.0` samples' feature values still contribute to feature distributions and therefore bin boundaries in `Dataset` histograms (How observations with sample_weight of zero influence the fit of LGBMRegressor #5553)
- every row still counts as exactly 1 sample from the perspective of count-based hyper-parameters like `min_data_in_leaf` ([R-package] Weighted Training - Different results (cross entropy) when using .weight column Vs inputting the expanded data (repeated rows = .weight times) #5626 (comment))

This PR proposes skipping that estimator check, just as `scikit-learn` is currently doing for `HistGradientBoosting*` estimators (code link).
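For context, here is a minimal sketch (not part of this PR) of the equivalence the new check expects; the synthetic data, parameters, and comparison are assumptions for illustration only:

```python
import numpy as np
from lightgbm import LGBMRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = rng.normal(size=200)
# integer weights: 0 = "pretend the row is absent", 2 = "pretend the row appears twice"
w = rng.integers(0, 3, size=200)

params = {"n_estimators": 10, "min_child_samples": 5}

# fit once with sample_weight ...
model_weighted = LGBMRegressor(**params).fit(X, y, sample_weight=w)

# ... and once on data where each row is physically repeated w times
# (weight-0 rows are dropped entirely)
model_repeated = LGBMRegressor(**params).fit(
    np.repeat(X, w, axis=0),
    np.repeat(y, w, axis=0),
)

# scikit-learn's check expects these two models to make (nearly) identical predictions;
# LightGBM generally does not satisfy this, because weight-0 rows still shape
# histogram bin boundaries and every row counts once toward min_child_samples
print(np.allclose(model_weighted.predict(X), model_repeated.predict(X)))
```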
Notes for Reviewers
We could modify LightGBM's `scikit-learn` estimators to match this expected behavior by excluding rows with `weight=0` and creating copies of rows with `int(weight)>=2` before passing the data through to `Dataset` here:

LightGBM/python-package/lightgbm/sklearn.py, lines 938 to 947 in bbeecc0
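For concreteness, a rough sketch of what that row-expansion preprocessing might look like (the helper name is hypothetical, and this only covers dense numpy inputs with integer weights):

```python
import numpy as np

def _expand_rows_by_weight(X, y, sample_weight):
    # Hypothetical helper, not proposed for merging: drop rows with weight 0 and
    # repeat rows whose integer weight is >= 2 before constructing the Dataset.
    # Sparse inputs, Dask collections, and non-integer weights are not handled here.
    counts = np.asarray(sample_weight, dtype=np.int64)
    X = np.asarray(X)
    y = np.asarray(y)
    return np.repeat(X, counts, axis=0), np.repeat(y, counts, axis=0)
```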
I don't support doing that... I think it'd add significant complexity (do count-based hyperparameters need to be modified? how does this affect Dask estimators?) for questionable benefit.
I think just skipping this compliance check is the right thing for `lightgbm` right now.
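For illustration, one way to express that skip in a pytest-based test that runs the scikit-learn check suite (a sketch, not the exact change in this PR; the test name and estimator list are assumptions):

```python
import pytest
from sklearn.utils.estimator_checks import parametrize_with_checks

from lightgbm import LGBMClassifier, LGBMRegressor


@parametrize_with_checks([LGBMClassifier(), LGBMRegressor()])
def test_sklearn_integration(estimator, check):
    # estimator checks arrive as functools.partial objects; unwrap to get the check's name
    check_name = getattr(check, "func", check).__name__
    if check_name == "check_sample_weight_equivalence":
        pytest.skip("LightGBM does not treat sample_weight as equivalent to repeating/removing rows")
    check(estimator)
```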