
slow torch.distributed with non-default CUDA_VISIBLE_DEVICES #50723

Open
grimoire opened this issue Feb 19, 2025 · 2 comments
Labels
core (Issues that should be addressed in Ray Core) · P1 (Issue that should be fixed within a few weeks) · question (Just a question :))

Comments

@grimoire

What happened + What you expected to happen

Hi, I am working on deploying a distributed torch model with Ray. I found that the performance of the first distributed op (all_reduce in my case) changes after I set CUDA_VISIBLE_DEVICES: the first dist.all_reduce can take 30+ seconds.

Versions / Dependencies

  • ray: 2.42.1
  • python: 3.9.16 / 3.10.0
  • os: CentOS7
  • pytorch: 2.4.0+cuda12.1 / 2.5.1+cuda12.1

Reproduction script

ray_dist.py

import os
import time
import ray
import torch
import torch.distributed as dist
from contextlib import contextmanager


@ray.remote(num_gpus=1)
class DistActor:
    
    def __init__(self, rank, world_size):
        os.environ['MASTER_ADDR'] = '127.0.0.1'
        os.environ['MASTER_PORT'] = '29500'
        dist.init_process_group(backend='nccl', rank=rank, world_size=world_size)

    def all_reduce(self):
        a = torch.rand([1], device='cuda')
        dist.all_reduce(a)

@contextmanager
def timeit(msg):
    print(msg)
    start = time.time()
    yield
    end = time.time()
    duration = (end - start)
    print(f'Take time: {duration:.3f} s')

if __name__ == '__main__':
    ray.init()
    world_size = 2
    actors = [DistActor.remote(rank, world_size) for rank in range(world_size)]

    with timeit('start first all_reduce'):
        ray.get([actor.all_reduce.remote() for actor in actors])

    with timeit('start second all_reduce'):
        ray.get([actor.all_reduce.remote() for actor in actors])

Good: without CUDA_VISIBLE_DEVICES, or with CUDA_VISIBLE_DEVICES=0,1, or with CUDA_VISIBLE_DEVICES=1,0

python ray/ray_dist.py
# start first all_reduce
# Take time: 4.486 s
# start second all_reduce
# Take time: 0.002 s

Bad: with CUDA_VISIBLE_DEVICES=6,7

CUDA_VISIBLE_DEVICES=6,7 python ray/ray_dist.py
# start first all_reduce
# Take time: 63.014 s
# start second all_reduce
# Take time: 0.002 s

Good: inside docker run -it --gpus '"device=6,7"' ..

python ray/ray_dist.py
# start first all_reduce
# Take time: 3.183 s
# start second all_reduce
# Take time: 0.001 s

Issue Severity

Medium: It is a significant difficulty but I can work around it.

@grimoire grimoire added the bug (Something that is supposed to be working; but isn't) and triage (Needs triage (eg: priority, bug/not-bug, and owning component)) labels on Feb 19, 2025
@jcotant1 jcotant1 added the core (Issues that should be addressed in Ray Core) label on Feb 19, 2025
@jjyao
Collaborator

jjyao commented Feb 20, 2025

Inside DistActor, could you print out CUDA_VISIBLE_DEVICES to make sure each actor is assigned the correct CUDA device? If it is, then I don't think this is a Ray issue, since torch.distributed does all of its communication outside of Ray.
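
For reference, a minimal, self-contained sketch of such a check (the DeviceCheckActor name and its report_devices method are illustrative only, not part of the reproduction script; the same method could equally be added to DistActor):

import os
import ray
import torch

@ray.remote(num_gpus=1)
class DeviceCheckActor:
    """Hypothetical helper actor: report the GPU mapping Ray sets up per actor."""

    def report_devices(self):
        return {
            'CUDA_VISIBLE_DEVICES': os.environ.get('CUDA_VISIBLE_DEVICES'),
            'ray_gpu_ids': ray.get_gpu_ids(),          # GPU ids Ray assigned to this actor
            'torch_device_count': torch.cuda.device_count(),
        }

if __name__ == '__main__':
    ray.init()
    actors = [DeviceCheckActor.remote() for _ in range(2)]
    print(ray.get([actor.report_devices.remote() for actor in actors]))

With CUDA_VISIBLE_DEVICES=6,7 set on the driver, each actor is expected to report a single visible device (6 or 7) and torch.cuda.device_count() == 1, since Ray remaps CUDA_VISIBLE_DEVICES per actor when num_gpus=1.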

@jjyao jjyao added the question (Just a question :)) and P1 (Issue that should be fixed within a few weeks) labels and removed the bug and triage labels on Feb 20, 2025
@grimoire grimoire changed the title from "[<Ray component: Core|RLlib|etc...>] slow torch.distributed with non-default CUDA_VISIBLE_DEVICES" to "slow torch.distributed with non-default CUDA_VISIBLE_DEVICES" on Feb 20, 2025
@grimoire
Author

grimoire commented Feb 20, 2025

CUDA_VISIBLE_DEVICES inside the actor does match the assigned devices.
I agree that torch.distributed has no direct connection to Ray, but since the plain torch.multiprocessing implementation below does not show the same behaviour, the Ray actor might be doing something special?

from torch import multiprocessing as mp
import torch
import torch.distributed as dist
import os
import time
from contextlib import contextmanager

def all_reduce():
    a = torch.rand([1], device='cuda')
    dist.all_reduce(a)

@contextmanager
def timeit(msg, enable=True):
    def _print(*args, **kwargs):
        if enable:
            print(*args, **kwargs)
    _print(msg)
    start = time.time()
    yield
    end = time.time()
    duration = (end - start)
    _print(f'Take time: {duration:.3f} s')

def proc(rank, world_size):
    os.environ['MASTER_ADDR'] = '127.0.0.1'
    os.environ['MASTER_PORT'] = '29500'
    with torch.cuda.device(rank):
        dist.init_process_group(backend='nccl', rank=rank, world_size=world_size)

        with timeit('start first all_reduce', rank==0):
            all_reduce()

        with timeit('start second all_reduce', rank==0):
            all_reduce()


if __name__ == '__main__':
    world_size = 2
    procs = [mp.Process(target=proc, args=(rank, world_size)) for rank in range(world_size)]
    [proc.start() for proc in procs]
    [proc.join() for proc in procs]
