[RFC] Collective in Ray #12174
Conversation
Comments by Stephanie (@stephanie-wang): Thanks, this is interesting. The API seems very MPI-like to me, which I suppose is good for faster prototyping since it's closer to the implementation code, but I hope we can come up with something that requires less boilerplate, is more compatible with ObjectRefs, and is less error-prone. So far, the Ray model of communication is through tasks and ObjectRefs only; obviously you can communicate between workers out-of-band through TCP, etc., but it is not recommended. So I worry about introducing an API that adds a second model of communication that the user has to think about.

I wrote a quick API proposal following the examples in the doc, where the collective groups are declared at task/actor creation time instead of during execution, and the communication is specified through ObjectRefs. The main difference between this API and the standard ObjectRef API is that the user is limited in what they can do with the ObjectRefs: they can only use them with other objects/processes in the same collective group. https://gist.github.com/stephanie-wang/ae9a82b3f200989ba37749d3268c0907

Note that I'm not saying anything about how this should be implemented; these are just some initial thoughts. The main difficulty is that the system is now responsible for the collective group setup, object storage, and communication.
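To make the shape of that proposal easier to picture without opening the gist, here is a minimal sketch of the declarative style described above. It is not the contents of the linked gist; the `collective_group` option and the `ray.experimental.allreduce` call are hypothetical placeholders used only for illustration.

```python
# Hypothetical sketch only: `collective_group=` and `ray.experimental.allreduce`
# are illustrative placeholders, not an existing Ray API.
import numpy as np
import ray

ray.init()

@ray.remote(num_gpus=1)
class Worker:
    def compute_grad(self):
        # Stands in for a real gradient computation.
        return np.random.rand(1024)

# The collective group is declared when the actors are created, so the system
# knows up front which processes participate (and can gang-schedule them).
workers = [Worker.options(collective_group="grad_allreduce").remote()
           for _ in range(4)]

# Communication is still expressed through ObjectRefs; the restriction is that
# these refs may only be combined with members of the same collective group.
grad_refs = [w.compute_grad.remote() for w in workers]
summed_ref = ray.experimental.allreduce(grad_refs)  # hypothetical collective call
print(ray.get(summed_ref))
```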
Posting earlier conversations on Slack, between myself and Stephanie: Hao:
Stephanie:
Hao:
Posting earlier conversations on Slack, between myself and Richard (@richardliaw): Hao:
Richard:
Hao:
Richard:
Hao:
@zhisbug your comment about the 2 stages is now cut off.
Thanks @zhisbug for sharing the proposal! The examples in the RFC are super helpful. In order to evaluate the APIs a bit more, it'd be nice to see them in the context of concrete use cases. For example, it'd be helpful to see how they would be used by SpaCy or other libraries. Your proposal seems slightly more general than @stephanie-wang's (since you can have an allreduce with both
Are there other collectives that make sense to support beyond allreduce? Is point-to-point GPU communication out of scope for this doc?
Hi @robertnishihara: really appreciate the feedback; see responses inline below.
For SpaCy pipelines to benefit from this infra, we can implement a high-level parameter server strategy or AllReduce strategy based on Ray APIs and this set of collective APIs. Since the communication (especially on GPUs) will be taken care of by CCL libraries, the performance can be improved significantly compared to what they have now. I'll defer the PS implementation using these APIs to a later RFC, but you can think of these distributed ML strategies, such as a sharded PS, as just a series of CCL-based reduce + broadcast calls (but much more optimized and efficient than RPCs).
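To make the sharded-PS point concrete, here is a minimal sketch of one training step expressed as collectives. `col` stands for a hypothetical collective handle whose `reduce`/`broadcast`/`rank` calls mirror NCCL semantics; none of these names are part of an actual API.

```python
import numpy as np

def sharded_ps_step(col, grad_shard: np.ndarray, param_shard: np.ndarray,
                    ps_rank: int, lr: float = 0.01) -> np.ndarray:
    """One PS step for a single shard, written as reduce + broadcast.

    `col` is a hypothetical collective handle (reduce/broadcast/rank),
    used here only to illustrate the pattern.
    """
    # 1) Sum every worker's gradient shard onto the PS rank that owns it.
    col.reduce(grad_shard, dst_rank=ps_rank)
    # 2) The owning rank applies the update to its parameter shard.
    if col.rank() == ps_rank:
        param_shard -= lr * grad_shard
    # 3) Broadcast the refreshed shard back to every worker.
    col.broadcast(param_shard, src_rank=ps_rank)
    return param_shard
```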
Consider a RAG- or REALM-like ML model where some parts of the code are written in CuPy, NumPy, or TensorFlow (e.g., an embedding or database lookup) while other parts are implemented in PyTorch (e.g., layers of transformers). When we model-parallelize this model across distributed GPUs, we might send tensors from a TensorFlow endpoint to a PyTorch endpoint. While I understand this is an imaginary case, it involves little effort to make this interoperability happen without making the API confusing.
Good point. Yeah, gang scheduling might be needed to make sure there are no deadlocks with the proposed APIs. We'll look into it a bit.
Yes, we will bring in all available APIs in NCCL/MPI. Taking NCCL as an example, this includes: reduce, allreduce, broadcast, allgather, reducescatter, and GPU-to-GPU send/recv (NCCL >= 2.7.4). See here. Regarding the proposed APIs and Stephanie's API suggestion: we might expose both, since the set of APIs proposed in this RFC is slightly lower-level; hence we can implement Stephanie's APIs on top of them (i.e., collective group declaration happens at actor/task creation) and make those the recommended ones.
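For concreteness, usage of the lower-level API from inside Ray actors might look roughly like the sketch below. The module path, `init_collective_group`, and `allreduce` signatures are assumptions for illustration, not the final interface; a declarative layer in the spirit of Stephanie's proposal would essentially move the `setup` step into actor creation.

```python
# Illustrative only: the module path and function signatures are assumptions
# about what the lower-level collective API could look like.
import cupy as cp
import ray
import ray.util.collective as col  # hypothetical module path

ray.init()

@ray.remote(num_gpus=1)
class AllreduceWorker:
    def setup(self, world_size, rank):
        # Every participant joins the same named group; a declarative API
        # could move this call into actor creation instead.
        col.init_collective_group(world_size, rank,
                                  backend="nccl", group_name="default")

    def step(self):
        grad = cp.ones((1024,), dtype=cp.float32)
        col.allreduce(grad, group_name="default")  # in-place NCCL allreduce
        return float(grad[0])  # expect world_size if the reduce summed

workers = [AllreduceWorker.remote() for _ in range(2)]
ray.get([w.setup.remote(2, rank) for rank, w in enumerate(workers)])
print(ray.get([w.step.remote() for w in workers]))  # [2.0, 2.0] expected
```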
I agree with @zhisbug that the APIs are compatible. The one that I suggested with
I don't think that's true. Both proposals would support such a use case (see the linked gist).
Actually, in both cases I think this should be done by the user. Ray could provide some integration (e.g., automatically creating a placement group for a collective), but I don't think it makes sense since a user can already do it easily with the existing placement group API.

In general, the reasoning behind the API that I proposed was to make it more declarative, which allows Ray to take on more functionality (by making the calls to the lower-level collective API internally). For example, part of the API that I proposed was to move the declaration of collective groups into task/actor invocation. This would let us provide better errors if gang scheduling fails, e.g., if an actor can't be scheduled. If the user is the one specifying when to initialize the communication group, then the system can't do anything except wait, since it doesn't know which tasks/actors were supposed to be part of the communication group.
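As a rough illustration of the difference between the two styles (the `collective_group` option and `init_group` body below are hypothetical, not a finalized Ray API):

```python
# Hypothetical contrast; neither variant below is a finalized Ray API.
import ray

ray.init()

@ray.remote
class Trainer:
    def init_group(self, world_size, rank):
        # Would call the lower-level collective init here (blocking until
        # all `world_size` participants have joined).
        ...

# (a) Declarative: group membership is named at creation time, so Ray could
#     gang-schedule all four actors and raise early if one cannot be placed.
# trainers = [Trainer.options(collective_group=("ring", 4, rank)).remote()
#             for rank in range(4)]

# (b) Imperative: each actor initializes the group itself; if one actor is
#     never scheduled, the others just block inside init_group with no
#     useful error from the system.
trainers = [Trainer.remote() for _ in range(4)]
ray.get([t.init_group.remote(world_size=4, rank=rank)
         for rank, t in enumerate(trainers)])
```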
I'm OK with merging this, though maybe we can just have a separate repo in the org for RFC/community stuff. Thoughts @anabranch @zhe-thoughts @ericl @edoakes?
Closing this PR. I am going to move it to the ray-project/RFC repo per our discussion last week.
Why are these changes needed?
An initial proposal on adding NCCL/MPI-backed collective communication into Ray.
Related issue number
Checks
I've run scripts/format.sh to lint the changes in this PR.