[dask] Support all LightGBM boosting types #3896
Closed in favor of tracking this in #2302. We decided to keep all feature requests in one place. You are welcome to contribute this feature! Please re-open this issue (or post a comment if you are not the topic starter) if you are actively working on implementing it.
Hi, I'd like to work on this. I'm just starting out and saw all the tests fail for the other boosting types, so I dug a bit. I'm ashamed to say I hadn't really looked at the testing data we generate: for regression, X is a (100, 100) array that gets split into two partitions, so each worker trains on a (50, 100) collection, which I think isn't very good. I tried generating 1,000 samples with 5 features (2 of them informative), so each worker gets (500, 5), and now all the tests pass. @jameslamb, what are your thoughts on this? I believe 1,000 samples are ok, and I don't think we need that many features. The tests should check that distributed training achieves the same result as local training, but not make that overly difficult by providing few samples and many features.
A data size like (1000, 5) or something is totally fine for tests.
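For reference, data of the shape discussed above could be generated like this. This is an illustrative sketch, not the exact code in `test_dask.py`; the Dask chunking shown in comments is one way to give each of two workers a (500, 5) partition.

```python
# Sketch: 1,000 samples, 5 features (2 informative), so that each of two
# Dask partitions is a (500, 5) block. Illustration only, not test_dask.py code.
import numpy as np
from sklearn.datasets import make_regression

X, y = make_regression(n_samples=1000, n_features=5, n_informative=2, random_state=42)

# With Dask, the array would be chunked so each worker gets one partition:
# import dask.array as da
# dX = da.from_array(X, chunks=(500, 5))
# dy = da.from_array(y, chunks=(500,))

print(X.shape)  # (1000, 5)
```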
Hi James. Actually (if it's useful), I could try to make a PR to make the categorical variables more informative for all objectives. I was able to lower this threshold to 0.01 by using more samples, but had to increase it to 0.03 for the dataframe-with-categoricals output.
You can try if you want, but I found it was very difficult to do reliably (for regression at least), which is why I did this: https://github.com/microsoft/LightGBM/blob/master/tests/python_package_test/test_dask.py#L175. If you have ideas, you're welcome to submit a PR.
Hi James. The local
Oh interesting. Sorry, I don't have any ideas why random forest mode would perform better in distributed training than in local training for the datasets we use in tests. It's ok with me to add a condition in the tests for that case.
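Such a condition might look like the following sketch. The helper name and the tolerance values here are hypothetical, chosen only to illustrate relaxing the local-vs-distributed comparison for one boosting type; they are not the actual `test_dask.py` code.

```python
# Hypothetical helper: compare local and distributed predictions, with a
# looser tolerance for "rf" where results were observed to diverge more.
import numpy as np

def assert_preds_close(local_preds, dask_preds, boosting_type):
    # Illustrative tolerances, not the real thresholds from test_dask.py.
    atol = 0.05 if boosting_type == "rf" else 0.01
    np.testing.assert_allclose(local_preds, dask_preds, atol=atol)

# Made-up prediction arrays for demonstration.
assert_preds_close(np.array([0.10, 0.20]), np.array([0.105, 0.195]), "gbdt")
```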
I tried some trivial examples and saw that it wasn't performing as I expected, so I opened #4118. I think it may be a bug.
Thanks for that! To be clear, I don't think the bug (if there is one) has to be solved before this issue is.
Ok, I'll try making those tests pass then. They're the only ones remaining, but they're extremely hard, haha. I'd get absolute differences of 0.5 in the probabilities.
Yeah, that's totally fine. Since you've documented that behavior in a separate issue, closing #4118 in the future could just include removing any special cases for it.
Summary
Right now the Dask interface in https://github.com/microsoft/LightGBM/blob/706f2af7badc26f6ec68729469ec6ec79a66d802/python-package/lightgbm/dask.py is tested only with the `gbdt` (default) boosting type: https://lightgbm.readthedocs.io/en/latest/Parameters.html#boosting

I believe that parametrizing the core tests (`test_classifier`, `test_regressor`, `test_ranker`) with the other boosting types (`rf`, `dart`, `goss`) will improve the tests and give us confidence in the quality of the Dask module.

Please note that for some boosting types, certain parameters cannot be left at their default values. For example, `rf` performs the following checks:

LightGBM/src/boosting/rf.hpp, lines 35 to 36 (at fcfd413)
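In plain terms, `rf` requires bagging to be enabled, so the library defaults will not pass its checks. The sketch below mirrors that constraint in pure Python to show which overrides a test parametrization would need; the exact condition lives in `src/boosting/rf.hpp`, and the specific values chosen here (`bagging_freq=1`, `bagging_fraction=0.9`) are just illustrative.

```python
# Illustrative parameter overrides for boosting_type="rf": the library
# defaults (bagging_freq=0, bagging_fraction=1.0) would be rejected.
rf_params = {
    "boosting_type": "rf",
    "bagging_freq": 1,        # must be > 0 for rf
    "bagging_fraction": 0.9,  # must be strictly between 0 and 1 for rf
}

def valid_rf_params(params):
    """Pure-Python mirror of the rf bagging constraint (sketch, not the C++ code)."""
    return params["bagging_freq"] > 0 and 0 < params["bagging_fraction"] < 1

print(valid_rf_params(rf_params))  # True
print(valid_rf_params({"bagging_freq": 0, "bagging_fraction": 1.0}))  # False
```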
How the tests can be parametrized:

- LightGBM/tests/python_package_test/test_dask.py, line 31 (at ac706e1)
- LightGBM/tests/python_package_test/test_dask.py, line 166 (at ac706e1)
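One possible shape for such a parametrization is sketched below. The test body and helper are hypothetical, not the actual `test_dask.py` code; the point is that `rf` needs extra parameter overrides while the other boosting types can use defaults.

```python
# Sketch of parametrizing a Dask test over boosting types. The helper and
# test body are illustrative, not the real test_dask.py implementation.
import pytest

BOOSTING_TYPES = ["gbdt", "dart", "goss", "rf"]

def base_params(boosting_type):
    params = {"boosting_type": boosting_type}
    if boosting_type == "rf":
        # rf rejects the defaults: bagging must be enabled (rf.hpp checks).
        params.update({"bagging_freq": 1, "bagging_fraction": 0.9})
    return params

@pytest.mark.parametrize("boosting_type", BOOSTING_TYPES)
def test_regressor(boosting_type):
    params = base_params(boosting_type)
    assert params["boosting_type"] == boosting_type
    # ...train DaskLGBMRegressor with `params` and compare against a
    # locally trained model here...
```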