CollectivePermute support #8815
Comments
Hi, @rpsilva-aws, thanks, IIUC, for btw, what's the current failure that you ran into with
Thanks @ManfeiBai. I have not encountered any failure yet - at least with Neuron's TRN1, but this is a generally concerning docstring/call-out when trying to productionize with this collective on XLA. We need to use this instead of P2P send/recv for other HW specific reasons.
This is what I understood as well from Jack's PR above, but it would be nice if we had a reference point that we can use to cross check with other collectives (particularly CollectivePermute). I'll wait on the XLA team's comment.
@yaochengji - can you please share the latest technical updates on the support of this op outside of SPMD path?
Any updates on whether there are still known XLA limitations with this op? cc: @yaochengji @ddunl
Hi @rpsilva-aws, thanks for asking. Currently I have a PoC script https://github.com/pytorch/xla/blob/chengji/cm/test/torch_distributed/cm_perf.py to demonstrate the CM support. And the current main blocker is that sometimes in real workloads,
Thanks for the context @yaochengji. Is this (
Yes,
🐛 Bug
We currently discourage the use of CollectivePermute in #2384. There is no context behind why this is the case, including whether the motivation was hardware specific. It seems that we have enabled it for All-to-All (#2472) - do we have the same guidance/information from the XLA team?
I cannot find any relevant reference to 'all_to_all_emitter'.
cc: @miladm @ManfeiBai @JackCaoG
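For reference while cross-checking against other collectives, here is a minimal pure-Python sketch of the CollectivePermute semantics as described in the XLA operation semantics: each `(source, target)` pair sends the source replica's input to the target, and any replica that is not a target receives zeros. The function name `simulate_collective_permute` and the list-of-lists data layout are hypothetical and only model the behavior on the host; this is not the torch_xla or Neuron implementation.

```python
from typing import List, Sequence, Tuple


def simulate_collective_permute(
    inputs: Sequence[Sequence[float]],
    pairs: Sequence[Tuple[int, int]],
) -> List[List[float]]:
    """Host-side model of CollectivePermute (hypothetical helper).

    inputs[i] is the operand held by replica i.
    pairs is a list of (source_replica, target_replica) tuples.
    """
    num_replicas = len(inputs)
    sources = [s for s, _ in pairs]
    targets = [t for _, t in pairs]

    # XLA requires that no two pairs share a source or share a target.
    if len(set(sources)) != len(sources):
        raise ValueError("duplicate source replica id in pairs")
    if len(set(targets)) != len(targets):
        raise ValueError("duplicate target replica id in pairs")
    for r in sources + targets:
        if not 0 <= r < num_replicas:
            raise ValueError(f"replica id {r} out of range")

    # Replicas that are not a target of any pair end up with zeros.
    outputs = [[0.0] * len(x) for x in inputs]
    for source, target in pairs:
        outputs[target] = list(inputs[source])
    return outputs


# Ring shift over 4 replicas: replica i sends to replica (i + 1) % 4.
ring_inputs = [[0.0], [1.0], [2.0], [3.0]]
ring_pairs = [(i, (i + 1) % 4) for i in range(4)]
ring_outputs = simulate_collective_permute(ring_inputs, ring_pairs)

# A single pair behaves like a P2P send/recv from replica 0 to replica 1.
p2p_outputs = simulate_collective_permute([[5.0], [6.0]], [(0, 1)])
```

The single-pair case is the one relevant to replacing P2P send/recv: the receiver gets the sender's tensor, and the sender's own output is zeroed because it is not a target.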