What happened + What you expected to happen

Hi, I am working on deploying a distributed torch model with Ray. I found that the performance of the first distributed op (all_reduce in my case) changes after I set CUDA_VISIBLE_DEVICES; dist.all_reduce can take 30s+.

Reproduction script: ray_dist.py
Good: without CUDA_VISIBLE_DEVICES, with CUDA_VISIBLE_DEVICES=0,1, or with CUDA_VISIBLE_DEVICES=1,0
python ray/ray_dist.py
# start first all_reduce
# Take time: 4.486 s
# start second all_reduce
# Take time: 0.002 s
Bad: with CUDA_VISIBLE_DEVICES=6,7
CUDA_VISIBLE_DEVICES=6,7 python ray/ray_dist.py
# start first all_reduce
# Take time: 63.014 s
# start second all_reduce
# Take time: 0.002 s
Good: inside docker run -it --gpus '"device=6,7"' ..
python ray/ray_dist.py
# start first all_reduce
# Take time: 3.183 s
# start second all_reduce
# Take time: 0.001 s
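The full ray_dist.py is not reproduced in the issue. For context, a minimal sketch of a Ray-based repro along these lines might look like the following; the DistActor class, the NCCL backend, the world size of 2, the tensor size, and the master address/port are assumptions, not the author's exact script.

import os
import time

import ray
import torch
import torch.distributed as dist


@ray.remote(num_gpus=1)
class DistActor:
    def setup(self, rank, world_size, master_addr="127.0.0.1", master_port=29500):
        # Ray narrows CUDA_VISIBLE_DEVICES for each actor, so index 0 is this actor's GPU.
        os.environ["MASTER_ADDR"] = master_addr
        os.environ["MASTER_PORT"] = str(master_port)
        dist.init_process_group("nccl", rank=rank, world_size=world_size)
        self.device = torch.device("cuda", 0)

    def run(self):
        x = torch.ones(1024, device=self.device)
        for name in ("first", "second"):
            print(f"start {name} all_reduce")
            start = time.time()
            dist.all_reduce(x)
            torch.cuda.synchronize(self.device)
            print(f"Take time: {time.time() - start:.3f} s")


if __name__ == "__main__":
    ray.init()
    actors = [DistActor.remote() for _ in range(2)]
    ray.get([a.setup.remote(rank, 2) for rank, a in enumerate(actors)])
    ray.get([a.run.remote() for a in actors])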
Issue Severity
Medium: It is a significant difficulty but I can work around it.
grimoire added the bug and triage labels on Feb 19, 2025
Inside DistActor, could you print out CUDA_VISIBLE_DEVICES to make sure each actor is assigned the correct CUDA device? If it is, then I think it's not a Ray issue, since torch.distributed does all the communication outside of Ray.
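For reference, a check along those lines might look like this; the report_devices method name is hypothetical.

import os
import ray

@ray.remote(num_gpus=1)
class DistActor:
    def report_devices(self):
        # Ray sets CUDA_VISIBLE_DEVICES per actor based on the GPUs it assigned to it.
        return os.environ.get("CUDA_VISIBLE_DEVICES", "<unset>"), ray.get_gpu_ids()

# e.g. print(ray.get(actor.report_devices.remote())) for each actor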
jjyao added the question and P1 labels and removed the bug and triage labels on Feb 20, 2025
grimoire changed the title from "[<Ray component: Core|RLlib|etc...>] slow torch.distributed with non-default CUDA_VISIBLE_DEVICES" to "slow torch.distributed with non-default CUDA_VISIBLE_DEVICES" on Feb 20, 2025
CUDA_VISIBLE_DEVICES in the actor does match the devices.
I believe that torch.distributed has no direct connection with Ray. But since another multiprocessing implementation does not show the same behaviour, the Ray actor might be doing something special?
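The non-Ray comparison script is not shown in the issue; a sketch of what such a torch.multiprocessing repro might look like is below, with the tensor size, port, and world size of 2 assumed.

import os
import time

import torch
import torch.distributed as dist
import torch.multiprocessing as mp


def worker(rank, world_size):
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = "29501"
    # With CUDA_VISIBLE_DEVICES=6,7 the two visible GPUs are indices 0 and 1 here.
    torch.cuda.set_device(rank)
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    x = torch.ones(1024, device="cuda")
    start = time.time()
    dist.all_reduce(x)
    torch.cuda.synchronize()
    print(f"rank {rank} first all_reduce: {time.time() - start:.3f} s")


if __name__ == "__main__":
    mp.spawn(worker, args=(2,), nprocs=2)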