-
Notifications
You must be signed in to change notification settings - Fork 3.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[dask] test training when a worker has no data #3897
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks very much! I have some ideas to simplify this, without needing to introduce a new function to check that the data is only on one worker.
- Assert that
client.nworkers > 1
, just to be sure the client fixture is for a multi-worker cluster. - Just repartition the data to 1 partition (
.repartition(npartitions=1)
for data frame,.rechunk((X.shape[0], X.shape[1]))
for array). assert thatdX.npartitions == 1
. If you have only 1 partition, you know for sure only one worker has data!
other requests
Please make sure this test is run for ranking, classification, and regression. You can borrow from
LightGBM/tests/python_package_test/test_dask.py
Lines 463 to 487 in 876bfe5
@pytest.mark.parametrize('task', ['classification', 'regression', 'ranking']) | |
def test_training_works_if_client_not_provided_or_set_after_construction(task, listen_port, client): | |
if task == 'ranking': | |
_, _, _, _, dX, dy, _, dg = _create_ranking_data( | |
output='array', | |
group=None | |
) | |
model_factory = lgb.DaskLGBMRanker | |
else: | |
_, _, _, dX, dy, _ = _create_data( | |
objective=task, | |
output='array', | |
) | |
dg = None | |
if task == 'classification': | |
model_factory = lgb.DaskLGBMClassifier | |
elif task == 'regression': | |
model_factory = lgb.DaskLGBMRegressor | |
params = { | |
"time_out": 5, | |
"local_listen_port": listen_port, | |
"n_estimators": 1, | |
"num_leaves": 2 | |
} |
For training on a single worker, the result should be identical to non-Dask training. So please also train a regular lightgbm.sklearn.LGBM[Classifier/Regressor/Ranker]
and then check that assert_eq(dask_model.predict(dX).compute(), local_model.predict(X))
. You can see the other tests in this file for references on how to do that, or @
me if you have any questions.
Hi James. I've included the different tasks and outputs. I just saw the PR you made for the categoricals and saw the new output type, I'll wait til that gets merged to include it. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks so much! I like the changes you've made. We're starting to have enough of these tests where tasks
is parameterized, makes sense to concentrate it at the top of tetst_dask.py
like you did.
I left a few minor suggestions. Other than those, I agree with your proposal to wait until #3908 is merged.
Hi James. I've included your comments and the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This look great! Thanks very much for the help
This pull request has been automatically locked since there has not been any recent activity since it was closed. To start a new related discussion, open a new issue at https://github.com/microsoft/LightGBM/issues including a reference to this. |
This includes a test to check that the training process succeeds even when a worker has no data, e.g. verify that #3882 is solved.
I wanted to persist the collections to a specific worker with something like:
However I got:
TypeError: unhashable type: 'Array'
.So in the test I just create a collection with one chunk, persist it and check that it is only in one worker. Any feedback to make this test more robust is welcome.