[python-package] ignore scikit-learn 'check_sample_weight_equivalence' estimator check (fixes #6678) #6679
Conversation
@@ -41,6 +41,7 @@ bokeh=3.1.*
 fsspec=2024.5.*
 msgpack-python=1.0.*
 pluggy=1.5.*
+pyparsing=3.1.4
While working on this, I discovered #6680.
Have reported that issue upstream: conda-forge/pyparsing-feedstock#48
For LightGBM's purposes (this environment tests on an old version of Python), let's just pin it, as we do for other libraries in these environments.
Thanks!
> I think just skipping this compliance check is the right thing for `lightgbm` right now.

Agree with you! Thanks for the fast fix!
Fixes #6678
Fixes #6680
As of scikit-learn/scikit-learn#29818 (which will be in `scikit-learn` 1.6), `scikit-learn` contains an estimator check that enforces the following behavior: fitting with a `sample_weight` of 0 for a row should be equivalent to dropping that row, and fitting with an integer weight of n should be equivalent to repeating that row n times.

This new check is breaking CI here... LightGBM does not work that way.

- `weight=0.0` samples' feature values still contribute to feature distributions and therefore bin boundaries in `Dataset` histograms (How observations with sample_weight of zero influence the fit of LGBMRegressor #5553)
- every row still counts as exactly 1 sample from the perspective of count-based hyper-parameters like `min_data_in_leaf` ([R-package] Weighted Training - Different results (cross entropy) when using .weight column Vs inputting the expanded data (repeated rows = .weight times) #5626 (comment))

This PR proposes skipping that estimator check, just as `scikit-learn` is currently doing for `HistGradientBoosting*` estimators (code link).
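For context, here is a minimal sketch (not part of this PR) of the equivalence the new check expects; the synthetic data, parameters, and comparison are assumptions for illustration only:

```python
import numpy as np
from lightgbm import LGBMRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = rng.normal(size=200)
# integer weights: 0 = "pretend the row is absent", 2 = "pretend the row appears twice"
w = rng.integers(0, 3, size=200)

params = {"n_estimators": 10, "min_child_samples": 5}

# fit once with sample_weight ...
model_weighted = LGBMRegressor(**params).fit(X, y, sample_weight=w)

# ... and once on data where each row is physically repeated w times
# (weight-0 rows are dropped entirely)
model_repeated = LGBMRegressor(**params).fit(
    np.repeat(X, w, axis=0),
    np.repeat(y, w, axis=0),
)

# scikit-learn's check expects these two models to make (nearly) identical predictions;
# LightGBM generally does not satisfy this, because weight-0 rows still shape
# histogram bin boundaries and every row counts once toward min_child_samples
print(np.allclose(model_weighted.predict(X), model_repeated.predict(X)))
```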
Notes for Reviewers
We could modify LightGBM's `scikit-learn` estimators to match this expected behavior by excluding rows with `weight=0` and creating copies of rows with `int(weight)>=2` before passing the data through to `Dataset` here:

LightGBM/python-package/lightgbm/sklearn.py, lines 938 to 947 in bbeecc0
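For concreteness, a rough sketch of what that row-expansion preprocessing might look like (the helper name is hypothetical, and this only covers dense numpy inputs with integer weights):

```python
import numpy as np

def _expand_rows_by_weight(X, y, sample_weight):
    # Hypothetical helper, not proposed for merging: drop rows with weight 0 and
    # repeat rows whose integer weight is >= 2 before constructing the Dataset.
    # Sparse inputs, Dask collections, and non-integer weights are not handled here.
    counts = np.asarray(sample_weight, dtype=np.int64)
    X = np.asarray(X)
    y = np.asarray(y)
    return np.repeat(X, counts, axis=0), np.repeat(y, counts, axis=0)
```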
I don't support doing that... I think it'd add significant complexity (do count-based hyperparameters need to be modified? how does this affect Dask estimators?) for questionable benefit.
I think just skipping this compliance check is the right thing for `lightgbm` right now.
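For illustration, one way to express that skip in a pytest-based test that runs the scikit-learn check suite (a sketch, not the exact change in this PR; the test name and estimator list are assumptions):

```python
import pytest
from sklearn.utils.estimator_checks import parametrize_with_checks

from lightgbm import LGBMClassifier, LGBMRegressor


@parametrize_with_checks([LGBMClassifier(), LGBMRegressor()])
def test_sklearn_integration(estimator, check):
    # estimator checks arrive as functools.partial objects; unwrap to get the check's name
    check_name = getattr(check, "func", check).__name__
    if check_name == "check_sample_weight_equivalence":
        pytest.skip("LightGBM does not treat sample_weight as equivalent to repeating/removing rows")
    check(estimator)
```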