[PR 1/6] Collective in Ray #12637
richardliaw left a comment
Very exciting! Sorry for the slow response; I'll be sure to review within 24 hours next time.
python/ray/util/collective/collective_group/base_collective_group.py
    opts = types.AllReduceOptions
    opts.reduceOp = op
doesn't this set the global variable?
can we instead create an instance?
This is an exposed user API: it does not write; it only reads from the global variable `_group_mgr`.
doesn't types.AllReduceOptions refer to a global setting?
anyways, i think this is a nit :)
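The concern in this thread can be reproduced with a small stand-in. The sketch below uses a hypothetical `AllReduceOptions` class (an assumption for illustration; the actual `types.AllReduceOptions` definition in this PR is not shown here) to show why assigning through the class name changes state shared by every caller, while assigning on an instance does not.

```python
# Hypothetical stand-in for the options type discussed above; the real
# definition in this PR may differ.
class AllReduceOptions:
    reduceOp = "SUM"
    timeout_ms = 30000


def configure_on_class(op):
    # Mirrors the reviewed snippet: no parentheses, so `opts` is the class itself.
    opts = AllReduceOptions
    opts.reduceOp = op  # rebinds the class attribute, visible to every user
    return opts


def configure_on_instance(op):
    # The suggested alternative: create a fresh instance per call.
    opts = AllReduceOptions()
    opts.reduceOp = op  # only this instance is changed
    return opts


local = configure_on_instance("MAX")
default_after_instance = AllReduceOptions.reduceOp  # class default untouched

configure_on_class("MIN")
default_after_class = AllReduceOptions.reduceOp  # class default overwritten
```

With the instance-based version, the shared default stays as it was; with the class-based version, every later reader of `AllReduceOptions.reduceOp` sees the last value written.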
    """
    Initialize the NCCL unique ID for this store.
nit - in future, this first line should fit on first line of docstring:
"""Initialize the NCCL unique ID for this store.
Why are these changes needed?
This is the first PR for the project Collective-in-Ray.
To make each PR more manageable and reviewer-friendly, we split the project code into 6 incremental PRs, listed below:
1. (This one) Basic infrastructure; an in-actor collective interface `ray.util.collective.init_collective_group(*args, **kwargs)`; support for two collectives, `allreduce` and `barrier`; some testing infrastructure, etc.
2. Driver-program interface, which includes: (1) the second interface: `actor.options(collective_options, ...).remote()` and the third interface `declare_collective_group(actors, collective_options, ...)`. See here and there.
3. Support for other collectives: `allgather`, `broadcast`, etc.
4. Communicator caching, and support for num_gpus > 2 per actor/task.
5. CUDA stream management.
6. docs, examples, etc.
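To make the semantics of the two collectives in item 1 concrete, here is a toy, thread-based sketch. It does not use Ray or NCCL; `ToyCollectiveGroup`, `worker`, and `WORLD_SIZE` are hypothetical names introduced only for illustration, and the element-wise sum is just one possible reduce operation.

```python
import threading

WORLD_SIZE = 4  # hypothetical number of participants


class ToyCollectiveGroup:
    """Thread-based toy group; illustrates semantics only, not this PR's backend."""

    def __init__(self, world_size):
        self.world_size = world_size
        self._barrier = threading.Barrier(world_size)
        self._slots = [None] * world_size

    def barrier(self):
        # Block until every rank has reached this point.
        self._barrier.wait()

    def allreduce(self, rank, values):
        # Each rank publishes its contribution, then waits for the others.
        self._slots[rank] = list(values)
        self._barrier.wait()
        # Every rank computes the same element-wise sum over all contributions.
        result = [sum(column) for column in zip(*self._slots)]
        # Wait again so no rank overwrites a slot while others still read it.
        self._barrier.wait()
        return result


def worker(group, rank, results):
    local = [rank, rank * 10]
    results[rank] = group.allreduce(rank, local)
    group.barrier()


group = ToyCollectiveGroup(WORLD_SIZE)
results = [None] * WORLD_SIZE
threads = [
    threading.Thread(target=worker, args=(group, rank, results))
    for rank in range(WORLD_SIZE)
]
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()
```

After the threads finish, every rank holds the same reduced list, which is the defining property of `allreduce`; `barrier` only synchronizes and moves no data.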
This is the first one (1/6).
MPI support is currently excluded from this series of PRs and will be developed later in another sub-project.
The testing pipeline needs to align with the current Ray CI and release tests. @richardliaw
Related issue number
RFC #12174
Checks
`scripts/format.sh` to lint the changes in this PR.