Skip to content

Conversation

@zhisbug
Copy link
Contributor

@zhisbug zhisbug commented Dec 5, 2020

Why are these changes needed?

This is the first PR for the project Collective-in-Ray.

To make each PR more manageable and friendly to reviewers, we break the entire project code into 6 incremental PRs:
See a list below:

1. (This one) Basic infrastructure; an in-actor collective interface ray.util.collective.init_collective_group(*args, **kwargs); support for two collectives allreduce and barrier; some testing infrastructure, etc.
2. Driver-program interface, which includes: (1) the second interface: actor.options(collective_options, ...).remote() and the third interface declare_collective_group(actors, collective_options, ...). See here and there.
3. Support for other collectives: allgather, broadcast, etc.
4. Communicator caching, and support for num_gpus > 2 per actor/task.
5. CUDA stream management.
6. docs, examples, etc.

This is the first one (1/6).

MPI support is currently excluded from this series of PRs and will be developed later in another sub-project.

The testing pipeline needs to align with the current Ray CI and release tests. @richardliaw

Related issue number

RFC #12174

Checks

  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@zhisbug zhisbug changed the title [WIP][PR 1/6] Collective in Ray [PR 1/6] Collective in Ray Dec 5, 2020
@richardliaw richardliaw self-assigned this Dec 5, 2020
Copy link
Contributor

@richardliaw richardliaw left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very exciting! Sorry for the slow response; I'll be sure to review within 24 hours next time.

Comment on lines +211 to +212
opts = types.AllReduceOptions
opts.reduceOp = op
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

doesn't this set the global variable?

can we instead create an instance?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is an exposed user API: it does not write; It only reads from the global variable _group_mgr.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

doesn't types.AllReduceOptions refer to a global setting?

anyways, i think this is a nit :)

Comment on lines +25 to +26
"""
Initialize the NCCL unique ID for this store.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit - in future, this first line should fit on first line of docstring:

        """Initialize the NCCL unique ID for this store.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants