
[dask] Support all LightGBM boosting types #3896

Closed
StrikerRUS opened this issue Feb 3, 2021 · 12 comments

@StrikerRUS
Collaborator

Summary

Right now the Dask interface in https://github.com/microsoft/LightGBM/blob/706f2af7badc26f6ec68729469ec6ec79a66d802/python-package/lightgbm/dask.py is tested only with the gbdt (default) boosting type.
https://lightgbm.readthedocs.io/en/latest/Parameters.html#boosting

I believe that parametrizing the core tests (test_classifier, test_regressor, test_ranker) over the other boosting types (rf, dart, goss) will improve test coverage and give us more confidence in the quality of the Dask module.
Please note that for some boosting types, certain parameters cannot be left at their default values. For example, rf performs the following checks:

CHECK(config->bagging_freq > 0 && config->bagging_fraction < 1.0f && config->bagging_fraction > 0.0f);
CHECK(config->feature_fraction <= 1.0f && config->feature_fraction > 0.0f);
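
As an illustrative sketch (the parameter values below are only an assumption chosen to satisfy the checks above, not values taken from existing tests), the rf boosting type could be exercised with overrides like:

params_rf = {
    'boosting_type': 'rf',
    'bagging_freq': 1,        # must be > 0
    'bagging_fraction': 0.9,  # must be strictly between 0 and 1
    'feature_fraction': 0.9,  # must be in (0, 1]; the default of 1.0 would also pass
}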

How tests can be parametrized:

data_output = ['array', 'scipy_csr_matrix', 'dataframe']

@pytest.mark.parametrize('output', data_output)
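
As a sketch of how that parametrization could be extended to boosting types (the test name, parameter values, and structure below are assumptions, not final test code):

import pytest

boosting_types = ['gbdt', 'rf', 'dart', 'goss']
data_output = ['array', 'scipy_csr_matrix', 'dataframe']

@pytest.mark.parametrize('output', data_output)
@pytest.mark.parametrize('boosting_type', boosting_types)
def test_regressor(output, boosting_type):
    params = {'boosting_type': boosting_type}
    if boosting_type == 'rf':
        # rf requires bagging to be enabled (see the checks quoted above)
        params.update({'bagging_freq': 1, 'bagging_fraction': 0.9})
    ...  # build the data for the requested output type, train, and compare to local training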

@StrikerRUS
Collaborator Author

Closed in favor of #2302. We decided to keep all feature requests in one place.

Contributions to this feature are welcome! Please re-open this issue (or post a comment if you are not the topic starter) if you are actively working on implementing this feature.

@jmoralez
Collaborator

Hi, I'd like to work on this. I'm just starting out and saw all the tests fail for the other boosting_types, so I dug a bit. I'm a little ashamed to say I hadn't really looked at the testing data we generate: for regression, X is a (100, 100) array which gets split into two partitions, so each worker trains on a (50, 100) collection, which I think isn't very good. I tried generating 1,000 samples with 5 features (2 of them informative), so each worker gets (500, 5), and now all the tests pass.

@jameslamb what are your thoughts on this? I believe 1,000 samples are OK, and I don't think we need that many features. The tests should check that distributed training achieves the same result as local training, without making that overly difficult by providing few samples and many features.
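
(For illustration only: the generator settings below are assumptions based on the description above, not the actual test helpers.) Data like that could be produced roughly as follows, assuming scikit-learn and dask:

import dask.array as da
from sklearn.datasets import make_regression

# 1,000 samples and 5 features (2 informative), split into two 500-row chunks
# so that each of two workers trains on a (500, 5) partition.
X, y = make_regression(n_samples=1_000, n_features=5, n_informative=2, random_state=42)
dX = da.from_array(X, chunks=(500, 5))
dy = da.from_array(y, chunks=500)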

@jameslamb
Collaborator

a data size like (1000, 5) or something is totally fine for tests

@jameslamb reopened this Mar 25, 2021
@jmoralez
Collaborator

Hi James. test_classifier sometimes fails for dart because it doesn't use the categorical variables. Would it be within the scope of this PR to try to make a categorical variable more informative, i.e., to modify the target a bit depending on the value of the category?

@jmoralez
Collaborator

Actually (if it's useful) I could try to make a PR that makes categorical variables more informative for all objectives. I was able to lower this threshold

assert_eq(p1_proba, p2_proba, atol=0.3)

to 0.01 by using more samples, but I had to increase it to 0.03 for the dataframe-with-categorical output.
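
(A rough sketch of the idea only; the number of categories and the size of the shift are assumptions, not the change that was actually proposed.) Making a categorical column informative could look like this:

import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n_samples = 1_000

# The target depends on the category, so a model that uses the categorical
# feature (dart included) can actually gain accuracy from it.
cat_col = rng.integers(0, 4, size=n_samples)
y = rng.normal(size=n_samples) + 2.0 * cat_col
X = pd.DataFrame({
    'x0': rng.normal(size=n_samples),
    'cat_0': pd.Categorical(cat_col),
})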

@jameslamb
Collaborator

You can try if you want, but I found it was very difficult to do reliably (for regression, at least), which is why I did this: https://github.com/microsoft/LightGBM/blob/master/tests/python_package_test/test_dask.py#L175. If you have ideas, you're welcome to submit a PR.

@jmoralez
Collaborator

Hi James. The local rf model seems to be underperforming: I can't get test_classifier[rf-binary-classification-dataframe-with-categorical] or test_classifier[rf-multiclass-classification-dataframe-with-categorical] to pass, because the distributed version outperforms the local one, haha. It seems odd, though. Do you have experience using it?

@jameslamb
Collaborator

Oh, interesting. Sorry, I don't have any ideas about why random forest mode would perform better in distributed training than in local training for the datasets we use in tests. It's OK with me to add a condition in the tests, something like if boosting_type == 'rf' and data_output == 'dataframe-with-categorical' (or whatever), that uses a less strict check. Adding these tests will add coverage of code paths that aren't currently tested, so it's fine to have some conditions like that to get the tests merged and then work on resolving them later.
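
(An illustrative sketch only; the tolerance values are assumptions, and assert_eq, p1_proba, and p2_proba refer to the existing test code quoted earlier in this thread.) Such a special case could look roughly like this:

# Relax the comparison for the combination known to behave differently
# (tracked in #4118); keep the strict check for everything else.
if boosting_type == 'rf' and output == 'dataframe-with-categorical':
    atol = 0.8  # effectively a smoke test
else:
    atol = 0.03
assert_eq(p1_proba, p2_proba, atol=atol)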

@jmoralez
Collaborator

I tried some trivial examples and saw that it wasn't performing as I expected, so I opened #4118; I think it may be a bug.

@jameslamb
Collaborator

Thanks for that! To be clear, I don't think the bug (if there is one) has to be solved before this issue is.

@jmoralez
Collaborator

OK, I'll try making those tests pass then. They're the only ones remaining, but they're extremely hard, haha; I was getting absolute differences of 0.5 in the probabilities.

@jameslamb
Collaborator

Yeah, that's totally fine. Since you've documented that behavior in a separate issue, it just means the rf tests aren't really a test of correctness but more of a "smoke test" confirming that at least no errors are raised when using that boosting type with the Dask package.

Closing #4118 in the future could include removing any special cases for rf in the Dask tests.
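
(A sketch only: the cluster setup, data generation, and parameter values below are assumptions, and assume a LightGBM build that ships the Dask estimators.) A smoke-test-style check for rf could simply confirm that fitting and predicting raise no errors:

import dask.array as da
import lightgbm as lgb
from dask.distributed import Client, LocalCluster
from sklearn.datasets import make_blobs

client = Client(LocalCluster(n_workers=2))

# Small, well-separated classification data split into two partitions.
X, y = make_blobs(n_samples=1_000, n_features=5, centers=2, random_state=42)
dX = da.from_array(X, chunks=(500, 5))
dy = da.from_array(y, chunks=500)

# No correctness comparison against local training here; just confirm that
# distributed training and prediction complete for boosting_type='rf'.
dask_clf = lgb.DaskLGBMClassifier(boosting_type='rf', bagging_freq=1, bagging_fraction=0.9)
dask_clf.fit(dX, dy)
preds = dask_clf.predict(dX)
assert preds.compute().shape == (X.shape[0],)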
