Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[dask] [python] Store co-local data parts as dicts instead of lists #3853

Merged
merged 2 commits into from
Jan 25, 2021

Conversation

ffineis
Copy link
Contributor

@ffineis ffineis commented Jan 25, 2021

Addresses #3795. The _train function in dask.py places delayed "parts," i.e. data, label, sample_weight, and group data into lists (which get distributed to workers) to enforce co-locality of data. Each worker is distributed a list_of_parts, a list of lists. Once the worker receives its lists of parts, it infers the data, label, and then optionally sample_weight and group from each list_of_parts by assuming that data is in position 0, label is in position 1, and if the model is not a DaskLGBMRanker, the second position is reserved for sample_weight.

This is an issue of readability + scalability - implying the presence of sample_weight and group based on the length of each sublist of list_of_parts in conjunction with the estimator type is... hard to understand and doesn't lend itself well to storing additional parts like eval_sets. Instead, let's just store individual sets of parts as dicts instead of lists. Now each worker will receive its portion of the overall dataset as a list of dicts instead of a list of lists.

@ffineis ffineis requested a review from jameslamb as a code owner January 25, 2021 16:34
Copy link
Collaborator

@jameslamb jameslamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is great, thank you! I just have one very small nitpicky style thing.

python-package/lightgbm/dask.py Outdated Show resolved Hide resolved
@jameslamb jameslamb self-requested a review January 25, 2021 20:57
Copy link
Collaborator

@jameslamb jameslamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is much cleaner and I think it will help reduce the risk of mistakes in the future. Thanks so much!

@jameslamb jameslamb requested a review from StrikerRUS January 25, 2021 21:00
Copy link
Collaborator

@StrikerRUS StrikerRUS left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome! Thanks! Code is much cleaner now.

@github-actions
Copy link

This pull request has been automatically locked since there has not been any recent activity since it was closed. To start a new related discussion, open a new issue at https://github.com/microsoft/LightGBM/issues including a reference to this.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Aug 24, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants