
[dask] Support all LightGBM boosting types #3896

Closed
StrikerRUS opened this issue Feb 3, 2021 · 12 comments

@StrikerRUS
Collaborator

Summary

Right now the Dask interface in https://github.com/microsoft/LightGBM/blob/706f2af7badc26f6ec68729469ec6ec79a66d802/python-package/lightgbm/dask.py is tested only with the gbdt (default) boosting type.
https://lightgbm.readthedocs.io/en/latest/Parameters.html#boosting

I believe that parametrizing the core tests (test_classifier, test_regressor, test_ranker) over the other boosting types (rf, dart, goss) will improve test coverage and give us more confidence in the quality of the Dask module.
Please note that for some boosting types, certain parameters cannot be left at their default values. For example, rf performs the following checks:

CHECK(config->bagging_freq > 0 && config->bagging_fraction < 1.0f && config->bagging_fraction > 0.0f);
CHECK(config->feature_fraction <= 1.0f && config->feature_fraction > 0.0f);
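
As an illustrative sketch (the parameter values below are only an assumption chosen to satisfy the checks above, not values taken from existing tests), the rf boosting type could be exercised with overrides like:

params_rf = {
    'boosting_type': 'rf',
    'bagging_freq': 1,        # must be > 0
    'bagging_fraction': 0.9,  # must be strictly between 0 and 1
    'feature_fraction': 0.9,  # must be in (0, 1]; the default of 1.0 would also pass
}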

How tests can be parametrized:

data_output = ['array', 'scipy_csr_matrix', 'dataframe']

@pytest.mark.parametrize('output', data_output)
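
As a sketch of how that parametrization could be extended to boosting types (the test name, parameter values, and structure below are assumptions, not final test code):

import pytest

boosting_types = ['gbdt', 'rf', 'dart', 'goss']
data_output = ['array', 'scipy_csr_matrix', 'dataframe']

@pytest.mark.parametrize('output', data_output)
@pytest.mark.parametrize('boosting_type', boosting_types)
def test_regressor(output, boosting_type):
    params = {'boosting_type': boosting_type}
    if boosting_type == 'rf':
        # rf requires bagging to be enabled (see the checks quoted above)
        params.update({'bagging_freq': 1, 'bagging_fraction': 0.9})
    ...  # build the data for the requested output type, train, and compare to local training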

@StrikerRUS
Collaborator Author

Closed in favor of #2302. We decided to keep all feature requests in one place.

Contributions to this feature are welcome! Please re-open this issue (or post a comment if you are not the topic starter) if you are actively working on implementing this feature.

@jmoralez
Collaborator

Hi, I'd like to work on this. I'm just starting out and saw all the tests fail for the other boosting_types, so I dug a bit. I'm a little ashamed to say I hadn't really looked at the testing data we generate: for regression, X is a (100, 100) array which gets split into two partitions, so each worker trains on a (50, 100) collection, which I think isn't very good. I tried generating 1,000 samples with 5 features (2 of them informative), so each worker gets (500, 5), and now all the tests pass.

@jameslamb what are your thoughts on this? I believe 1,000 samples are OK, and I don't think we need that many features. The tests should check that distributed training achieves the same result as local training, without making that overly difficult by providing few samples and many features.
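
(For illustration only: the generator settings below are assumptions based on the description above, not the actual test helpers.) Data like that could be produced roughly as follows, assuming scikit-learn and dask:

import dask.array as da
from sklearn.datasets import make_regression

# 1,000 samples and 5 features (2 informative), split into two 500-row chunks
# so that each of two workers trains on a (500, 5) partition.
X, y = make_regression(n_samples=1_000, n_features=5, n_informative=2, random_state=42)
dX = da.from_array(X, chunks=(500, 5))
dy = da.from_array(y, chunks=500)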

@jameslamb
Collaborator

a data size like (1000, 5) or something is totally fine for tests

@jameslamb reopened this Mar 25, 2021
@jmoralez
Collaborator

Hi James. test_classifier sometimes fails for dart because it doesn't use the categorical variables. Would it be within the scope of this PR to try to make a categorical variable more informative, i.e., to modify the target a bit depending on the value of the category?

@jmoralez
Collaborator

Actually (if it's useful) I could try to make a PR that makes categorical variables more informative for all objectives. I was able to lower this threshold

assert_eq(p1_proba, p2_proba, atol=0.3)

to 0.01 by using more samples, but I had to increase it to 0.03 for the dataframe-with-categorical output.
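
(A rough sketch of the idea only; the number of categories and the size of the shift are assumptions, not the change that was actually proposed.) Making a categorical column informative could look like this:

import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n_samples = 1_000

# The target depends on the category, so a model that uses the categorical
# feature (dart included) can actually gain accuracy from it.
cat_col = rng.integers(0, 4, size=n_samples)
y = rng.normal(size=n_samples) + 2.0 * cat_col
X = pd.DataFrame({
    'x0': rng.normal(size=n_samples),
    'cat_0': pd.Categorical(cat_col),
})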

@jameslamb
Collaborator

You can try if you want, but I found it was very difficult to do reliably (for regression, at least), which is why I did this: https://github.com/microsoft/LightGBM/blob/master/tests/python_package_test/test_dask.py#L175. If you have ideas, you're welcome to submit a PR.

@jmoralez
Collaborator

Hi James. The local rf model seems to be underperforming: I can't get test_classifier[rf-binary-classification-dataframe-with-categorical] or test_classifier[rf-multiclass-classification-dataframe-with-categorical] to pass, because the distributed version outperforms the local one, haha. It seems odd, though. Do you have experience using it?

@jameslamb
Collaborator

Oh, interesting. Sorry, I don't have any ideas about why random forest mode would perform better in distributed training than in local training for the datasets we use in tests. It's OK with me to add a condition in the tests, something like if boosting_type == 'rf' and data_output == 'dataframe-with-categorical' (or whatever), that uses a less strict check. Adding these tests will add coverage of code paths that aren't currently tested, so it's fine to have some conditions like that to get the tests merged and then work on resolving them later.
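
(An illustrative sketch only; the tolerance values are assumptions, and assert_eq, p1_proba, and p2_proba refer to the existing test code quoted earlier in this thread.) Such a special case could look roughly like this:

# Relax the comparison for the combination known to behave differently
# (tracked in #4118); keep the strict check for everything else.
if boosting_type == 'rf' and output == 'dataframe-with-categorical':
    atol = 0.8  # effectively a smoke test
else:
    atol = 0.03
assert_eq(p1_proba, p2_proba, atol=atol)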

@jmoralez
Collaborator

I tried some trivial examples and saw that it wasn't performing as I expected, so I opened #4118; I think it may be a bug.

@jameslamb
Collaborator

Thanks for that! To be clear, I don't think the bug (if there is one) has to be solved before this issue is.

@jmoralez
Collaborator

OK, I'll try making those tests pass then. They're the only ones remaining, but they're extremely hard, haha; I was getting absolute differences of 0.5 in the probabilities.

@jameslamb
Collaborator

Yeah, that's totally fine. Since you've documented that behavior in a separate issue, it just means the rf tests aren't really a test of correctness but more of a "smoke test" confirming that at least no errors are raised when using that boosting type with the Dask package.

Closing #4118 in the future could include removing any special cases for rf in the Dask tests.
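
(A sketch only: the cluster setup, data generation, and parameter values below are assumptions, and assume a LightGBM build that ships the Dask estimators.) A smoke-test-style check for rf could simply confirm that fitting and predicting raise no errors:

import dask.array as da
import lightgbm as lgb
from dask.distributed import Client, LocalCluster
from sklearn.datasets import make_blobs

client = Client(LocalCluster(n_workers=2))

# Small, well-separated classification data split into two partitions.
X, y = make_blobs(n_samples=1_000, n_features=5, centers=2, random_state=42)
dX = da.from_array(X, chunks=(500, 5))
dy = da.from_array(y, chunks=500)

# No correctness comparison against local training here; just confirm that
# distributed training and prediction complete for boosting_type='rf'.
dask_clf = lgb.DaskLGBMClassifier(boosting_type='rf', bagging_freq=1, bagging_fraction=0.9)
dask_clf.fit(dX, dy)
preds = dask_clf.predict(dX)
assert preds.compute().shape == (X.shape[0],)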
