
slow torch.distributed with non-default CUDA_VISIBLE_DEVICES #50723

Open
grimoire opened this issue Feb 19, 2025 · 2 comments
Labels
core (Issues that should be addressed in Ray Core) · P1 (Issue that should be fixed within a few weeks) · question (Just a question :))

Comments

@grimoire

What happened + What you expected to happen

Hi, I am working on deploying a distributed torch model with Ray. I found that the performance of the first distributed op (all_reduce in my case) changes after I set CUDA_VISIBLE_DEVICES: the first dist.all_reduce can take 30+ seconds.

Versions / Dependencies

  • ray: 2.42.1
  • python: 3.9.16 / 3.10.0
  • os: CentOS7
  • pytorch: 2.4.0+cuda12.1 / 2.5.1+cuda12.1

Reproduction script

ray_dist.py

import os
import time
import ray
import torch
import torch.distributed as dist
from contextlib import contextmanager


@ray.remote(num_gpus=1)
class DistActor:
    
    def __init__(self, rank, world_size):
        os.environ['MASTER_ADDR'] = '127.0.0.1'
        os.environ['MASTER_PORT'] = '29500'
        dist.init_process_group(backend='nccl', rank=rank, world_size=world_size)

    def all_reduce(self):
        a = torch.rand([1], device='cuda')
        dist.all_reduce(a)

@contextmanager
def timeit(msg):
    print(msg)
    start = time.time()
    yield
    end = time.time()
    duration = (end - start)
    print(f'Take time: {duration:.3f} s')

if __name__ == '__main__':
    ray.init()
    world_size = 2
    actors = [DistActor.remote(rank, world_size) for rank in range(world_size)]

    with timeit('start first all_reduce'):
        ray.get([actor.all_reduce.remote() for actor in actors])

    with timeit('start second all_reduce'):
        ray.get([actor.all_reduce.remote() for actor in actors])

Good: without CUDA_VISIBLE_DEVICES, or with CUDA_VISIBLE_DEVICES=0,1, or with CUDA_VISIBLE_DEVICES=1,0

python ray/ray_dist.py
# start first all_reduce
# Take time: 4.486 s
# start second all_reduce
# Take time: 0.002 s

Bad: with CUDA_VISIBLE_DEVICES=6,7

CUDA_VISIBLE_DEVICES=6,7 python ray/ray_dist.py
# start first all_reduce
# Take time: 63.014 s
# start second all_reduce
# Take time: 0.002 s

Good: inside docker run -it --gpus '"device=6,7"' ..

python ray/ray_dist.py
# start first all_reduce
# Take time: 3.183 s
# start second all_reduce
# Take time: 0.001 s

Issue Severity

Medium: It is a significant difficulty but I can work around it.

@grimoire grimoire added the bug (Something that is supposed to be working; but isn't) and triage (Needs triage (eg: priority, bug/not-bug, and owning component)) labels on Feb 19, 2025
@jcotant1 jcotant1 added the core (Issues that should be addressed in Ray Core) label on Feb 19, 2025
@jjyao
Collaborator

jjyao commented Feb 20, 2025

Inside DistActor, could you print out CUDA_VISIBLE_DEVICES to make sure each actor is assigned the correct CUDA device? If it is, then I don't think this is a Ray issue, since torch.distributed does all of its communication outside of Ray.
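
For reference, a minimal, self-contained sketch of such a check (the DeviceCheckActor name and its report_devices method are illustrative only, not part of the reproduction script; the same method could equally be added to DistActor):

import os
import ray
import torch

@ray.remote(num_gpus=1)
class DeviceCheckActor:
    """Hypothetical helper actor: report the GPU mapping Ray sets up per actor."""

    def report_devices(self):
        return {
            'CUDA_VISIBLE_DEVICES': os.environ.get('CUDA_VISIBLE_DEVICES'),
            'ray_gpu_ids': ray.get_gpu_ids(),          # GPU ids Ray assigned to this actor
            'torch_device_count': torch.cuda.device_count(),
        }

if __name__ == '__main__':
    ray.init()
    actors = [DeviceCheckActor.remote() for _ in range(2)]
    print(ray.get([actor.report_devices.remote() for actor in actors]))

With CUDA_VISIBLE_DEVICES=6,7 set on the driver, each actor is expected to report a single visible device (6 or 7) and torch.cuda.device_count() == 1, since Ray remaps CUDA_VISIBLE_DEVICES per actor when num_gpus=1.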

@jjyao jjyao added the question (Just a question :)) and P1 (Issue that should be fixed within a few weeks) labels and removed the bug and triage labels on Feb 20, 2025
@grimoire grimoire changed the title from "[<Ray component: Core|RLlib|etc...>] slow torch.distributed with non-default CUDA_VISIBLE_DEVICES" to "slow torch.distributed with non-default CUDA_VISIBLE_DEVICES" on Feb 20, 2025
@grimoire
Author

grimoire commented Feb 20, 2025

CUDA_VISIBLE_DEVICES inside the actor does match the assigned devices.
I agree that torch.distributed has no direct connection to Ray, but since the plain torch.multiprocessing implementation below does not show the same behaviour, the Ray actor might be doing something special?

from torch import multiprocessing as mp
import torch
import torch.distributed as dist
import os
import time
from contextlib import contextmanager

def all_reduce():
    a = torch.rand([1], device='cuda')
    dist.all_reduce(a)

@contextmanager
def timeit(msg, enable=True):
    def _print(*args, **kwargs):
        if enable:
            print(*args, **kwargs)
    _print(msg)
    start = time.time()
    yield
    end = time.time()
    duration = (end - start)
    _print(f'Take time: {duration:.3f} s')

def proc(rank, world_size):
    os.environ['MASTER_ADDR'] = '127.0.0.1'
    os.environ['MASTER_PORT'] = '29500'
    with torch.cuda.device(rank):
        dist.init_process_group(backend='nccl', rank=rank, world_size=world_size)

        with timeit('start first all_reduce', rank==0):
            all_reduce()

        with timeit('start second all_reduce', rank==0):
            all_reduce()


if __name__ == '__main__':
    world_size = 2
    procs = [mp.Process(target=proc, args=(rank, world_size)) for rank in range(world_size)]
    [proc.start() for proc in procs]
    [proc.join() for proc in procs]
