zhisbug (Contributor) commented Dec 22, 2020

This is the 4th PR of the Collective-in-Ray project.

The full series:

  1. (PR [PR 1/6] Collective in Ray #12637, merged) Basic infrastructure; an in-actor collective interface, ray.util.collective.init_collective_group(*args, **kwargs); support for two collectives, allreduce and barrier; some testing infrastructure, etc.
  2. (PR [Collective][PR 2/6] Driver program declarative interfaces #12874, merged) Driver-program interfaces: (1) the second interface, actor.options(collective_options, ...).remote(), and (2) the third interface, declare_collective_group(actors, collective_options, ...).
  3. (PR [Collective][PR 3/6] Other collectives #12864, merged) Support for the other collectives: allgather, broadcast, reduce, reducescatter; tests refactored into distributed tests and single-node cluster tests.
    3.5. (PR [Collective][PR 3.5/6] Send/Recv calls and some initial code for communicator caching #12935, merged) send/recv P2P communication APIs.
  4. (this one) Communicator caching, and support for num_gpus > 1 per actor/task.
  5. CUDA stream management.
  6. docs, examples, etc.
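
To illustrate what "communicator caching" in item 4 can mean, here is a minimal sketch, not the code in this PR. It assumes a cache keyed by the ordered list of device indices, so a communicator set is built once per distinct device list and reused on later collective calls. The class name `CommunicatorCache`, the `factory` callable, and the key format are all hypothetical; a real implementation would create NCCL communicators where the sketch calls `factory`.

```python
from typing import Any, Callable, Dict, List, Sequence


class CommunicatorCache:
    """Hypothetical cache: one communicator set per distinct device list."""

    def __init__(self, factory: Callable[[List[int]], Any]):
        # `factory` stands in for whatever actually creates communicators
        # (an assumption for this sketch; no real NCCL call is made here).
        self._factory = factory
        self._cache: Dict[str, Any] = {}
        self.creations = 0  # number of times the factory was invoked

    @staticmethod
    def make_key(device_ids: Sequence[int]) -> str:
        # Order-sensitive key: [0, 1] and [1, 0] map to different entries.
        return ",".join(str(d) for d in device_ids)

    def get(self, device_ids: Sequence[int]) -> Any:
        # Return the cached communicator set, creating it on first use.
        key = self.make_key(device_ids)
        if key not in self._cache:
            self._cache[key] = self._factory(list(device_ids))
            self.creations += 1
        return self._cache[key]
```

With such a cache, repeated collective calls over the same GPUs pay the communicator setup cost only once, while a call over a different device list triggers one additional creation.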

Related issue number

See this rfc-collective-in-ray

Checks

  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

zhisbug changed the title from "[Collective][PR 4/6][WIP] NCCL Communicator caching and preliminary stream management" to "[WIP][Collective][PR 4/6] NCCL Communicator caching and preliminary stream management" on Dec 22, 2020
zhisbug changed the title from "[WIP][Collective][PR 4/6] NCCL Communicator caching and preliminary stream management" to "[Collective][PR 4/6] NCCL Communicator caching and preliminary stream management" on Jan 23, 2021
zhisbug (Contributor, Author) commented Jan 23, 2021

@richardliaw This is a big one, but it's good to go! It now supports multi-GPU collective calls on actors with more than one GPU!

richardliaw (Contributor) left a comment

Still need to review nccl_collective_group. Will review tomorrow!

zhisbug (Contributor, Author) commented Jan 24, 2021

> Still need to review nccl_collective_group. Will review tomorrow!

Addressed!

richardliaw merged commit 7a78f4e into ray-project:master on Jan 26, 2021
fishbone pushed a commit to fishbone/ray that referenced this pull request Feb 16, 2021
fishbone added a commit to fishbone/ray that referenced this pull request Feb 16, 2021