error: undefined reference to '__stack_chk_fail' at build #9

Closed
vkobel opened this issue May 2, 2018 · 7 comments

vkobel commented May 2, 2018

Running bazel build runsc on Arch Linux fails with the following error: error: undefined reference to '__stack_chk_fail'.

This can be fixed by adding "-fno-stack-protector " + to the cmd args in the vdso/BUILD file (between -shared and -nostdlib).

EDIT: disabled stack protector instead of enabling it.
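
To make the placement concrete, here is a minimal Python sketch of the same edit applied to a list of compiler flags. The helper name `add_no_stack_protector` and the two-element flag list are assumptions for illustration only; the actual cmd in vdso/BUILD is a concatenated string with more entries than shown here.

```python
from typing import List


def add_no_stack_protector(flags: List[str]) -> List[str]:
    """Return a copy of flags with -fno-stack-protector placed right after -shared."""
    if "-fno-stack-protector" in flags:
        return list(flags)  # already present, nothing to do
    index = flags.index("-shared") + 1  # raises ValueError if -shared is missing
    return flags[:index] + ["-fno-stack-protector"] + flags[index:]


# Illustrative flag list only (hypothetical); not the real contents of vdso/BUILD.
example = ["-shared", "-nostdlib"]
print(" ".join(add_no_stack_protector(example)))
```

The flag ends up between -shared and -nostdlib, matching the position described above.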

Contributor

amscanne commented May 2, 2018

Thanks for the report!

What is the build environment? Since this doesn't happen for us, I'd like to make sure that it can be reproduced to fix it.

Author

vkobel commented May 2, 2018

I'm using kernel x86_64 Linux 4.14.36-1-MANJARO and bazel 0.12.0

simonvik commented May 3, 2018

I can reproduce this on Arch Linux, and @vkobel's patch solves it.

dhaavi commented May 3, 2018

I am experiencing the same error:

/tmp/ccCPNQ0H.o:vdso.cc:function __vdso_gettimeofday: error: undefined reference to '__stack_chk_fail'
/tmp/ccCPNQ0H.o:vdso.cc:function __vdso_time: error: undefined reference to '__stack_chk_fail'

System:
x86_64 Linux 4.16.3-1-ARCH

Bazel:

Build label: 0.12.0- (@non-git)
Build target: bazel-out/k8-opt/bin/src/main/java/com/google/devtools/build/lib/bazel/BazelServer_deploy.jar
Build time: Wed Aug 13 03:13:23 +50251 (1523620552403)
Build timestamp: 1523620552403
Build timestamp as int: 1523620552403

I can supply additional information if needed.

I can also confirm that the patch by @vkobel solves the issue.
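
For anyone triaging similar build logs, the linker output above can be summarized with a short script. This is a minimal sketch: the function name `undefined_references` is hypothetical, and the regular expression is based only on the two error lines quoted in this comment, so other linkers or versions may format the message differently.

```python
import re
from typing import Dict, List

# Matches linker lines such as:
#   /tmp/ccCPNQ0H.o:vdso.cc:function __vdso_time: error: undefined reference to '__stack_chk_fail'
_UNDEF_RE = re.compile(
    r"^(?P<obj>[^:]+):(?P<src>[^:]+):function (?P<func>\S+): "
    r"error: undefined reference to '(?P<symbol>[^']+)'$"
)


def undefined_references(linker_output: str) -> Dict[str, List[str]]:
    """Map each undefined symbol to the functions that reference it."""
    result: Dict[str, List[str]] = {}
    for line in linker_output.splitlines():
        match = _UNDEF_RE.match(line.strip())
        if match:
            result.setdefault(match["symbol"], []).append(match["func"])
    return result
```

Applied to the two lines above, this groups `__vdso_gettimeofday` and `__vdso_time` under `__stack_chk_fail`, which shows that both exported VDSO functions were compiled with stack-protector instrumentation.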

Author

vkobel commented May 3, 2018

I made an AUR package for those interested: https://aur.archlinux.org/packages/gvisor-git/
Currently it's based on my forked repo, but I'll switch it to the Google repo as soon as this is fixed there.

Member

prattmic commented May 3, 2018

@vkobel could you verify that -fno-stack-protector also works? This VDSO shouldn't be using stack protectors at all. (As your build error shows, there is nowhere to go if there is a problem.)

Author

vkobel commented May 4, 2018

@prattmic yes, it does work with -fno-stack-protector. I've updated my fork and the Arch package.
My original attempt included this option, but I didn't expect the VDSO to require stack protectors to be disabled.

chanwit pushed a commit to chanwit/gvisor that referenced this issue May 8, 2018
The VDSO has no hooks to handle stack protector failures.

Fixes google#9

PiperOrigin-RevId: 195460989
Change-Id: Idf1d55bfee1126e551d7274b7f484e03bf440427
tonistiigi pushed a commit to tonistiigi/gvisor that referenced this issue Jan 30, 2019
The VDSO has no hooks to handle stack protector failures.

Fixes google#9

PiperOrigin-RevId: 195460989
Change-Id: Idf1d55bfee1126e551d7274b7f484e03bf440427
Upstream-commit: 7bb10dc
copybara-service bot pushed a commit that referenced this issue Jul 3, 2024
Distributed training isn't working with PyTorch on certain A100 nodes.

Adds the missing ioctl `UVM_UNMAP_EXTERNAL` allowing for certain NCCL operations to succeed when using [`torch.distributed`](https://pytorch.org/docs/stable/distributed.html), fixing distributed training.

## Reproduction

This affects numerous A100 40GB and 80GB instances in our fleet. This reproduction requires 4 A100 GPUs, either 40GB or 80GB.

- **NVIDIA Driver Version**: 550.54.15
- **CUDA Version**: 12.4
- **NVIDIA device**: NVIDIA A100 80GB PCIe

### Steps

1. **Install gvisor**
```bash
ARCH="$(uname -m)"
URL="https://storage.googleapis.com/gvisor/releases/master/latest/${ARCH}"
wget -nc "${URL}/runsc" "${URL}/runsc.sha512"
sha512sum -c runsc.sha512
chmod +x runsc
sudo cp runsc /usr/local/bin/runsc
sudo /usr/local/bin/runsc install
sudo systemctl reload docker
```

2. **Add GPU-enabling gvisor options to the Docker daemon config (typically `/etc/docker/daemon.json`)**

```json
{
    "runtimes": {
        "nvidia": {
            "path": "nvidia-container-runtime",
            "runtimeArgs": []
        },
        "runsc": {
            "path": "/usr/local/bin/runsc",
            "runtimeArgs": ["--nvproxy", "--nvproxy-docker", "-debug-log=/tmp/runsc/", "-debug", "-strace"]
        }
    }
}
```
Reload configs with `sudo systemctl reload docker`.

3. **Run reproduction NCCL test**

This test spawns `world_size` processes (4 in the image below). Every rank except the last sends a torch `Tensor` to the last rank, which receives them over NCCL.

```Dockerfile
# Dockerfile
FROM python:3.9.15-slim-bullseye

RUN pip install torch numpy
COPY <<EOF repro.py
import argparse
import datetime
import os

import torch
import torch.distributed as dist
import torch.multiprocessing as mp

def setup(rank, world_size):
    os.environ["MASTER_ADDR"] = "localhost"
    os.environ["MASTER_PORT"] = "12355"
    dist.init_process_group("nccl", rank=rank, world_size=world_size, timeout=datetime.timedelta(seconds=600))
    torch.cuda.set_device(rank)

def cleanup():
    dist.destroy_process_group()

def send_tensor(rank, world_size):
    try:
        setup(rank, world_size)

        # rank receiving all tensors
        target_rank = world_size - 1

        dist.barrier()

        tensor = torch.ones(5).cuda(rank)
        if rank < target_rank:
            print(f"[RANK {rank}] sending tensor: {tensor}")
            dist.send(tensor=tensor, dst=target_rank)
        elif rank == target_rank:
            for other_rank in range(target_rank):
                tensor = torch.zeros(5).cuda(target_rank)
                dist.recv(tensor=tensor, src=other_rank)
                print(f"[RANK {target_rank}] received tensor from rank={other_rank}: {tensor}")

            print("PASS: NCCL working.")

    except Exception as e:
        print(f"[RANK {rank}] error in send_tensor: {e}")
        raise
    finally:
        cleanup()

def main(world_size: int = 2):
    mp.spawn(send_tensor, args=(world_size,), nprocs=world_size, join=True)

if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Run torch-based NCCL tests")
    parser.add_argument("world_size", type=int, help="number of GPUs to run test on")
    args = parser.parse_args()

    if args.world_size < 2:
        raise RuntimeError(f"world_size must be at least 2, got {args.world_size}")

    main(args.world_size)
EOF

ENTRYPOINT ["python", "repro.py", "4"]
```
Build image with:

```
docker build -f Dockerfile .
```

Then run it with:
```
sudo docker run -it --shm-size=2.00gb --runtime=runsc --gpus='"device=GPU-742ea7fc-dd4f-612c-e860-499bf200a815,GPU-94a801d8-7713-acf6-337d-338b7cfdf19e,GPU-0d19cef2-10ce-e445-a0be-3d330e36c1fd,GPU-ac5046fb-020c-93e8-2784-f44aedbc5bbd"' 040a44863fb1
```
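
The runtime entry from step 2 can also be merged into an existing Docker daemon configuration programmatically rather than by hand. The following is a minimal sketch under stated assumptions: the helper name `with_runsc_runtime` is hypothetical, the flags are copied verbatim from the JSON in step 2, and reading or writing the actual config file is left out.

```python
import json

# Flags copied from the runsc entry in step 2.
RUNSC_ARGS = ["--nvproxy", "--nvproxy-docker", "-debug-log=/tmp/runsc/", "-debug", "-strace"]


def with_runsc_runtime(config: dict, runsc_path: str = "/usr/local/bin/runsc") -> dict:
    """Return a copy of a Docker daemon config with the runsc runtime entry set."""
    updated = json.loads(json.dumps(config))  # deep copy of JSON-compatible data
    runtimes = updated.setdefault("runtimes", {})
    runtimes["runsc"] = {"path": runsc_path, "runtimeArgs": list(RUNSC_ARGS)}
    return updated


if __name__ == "__main__":
    existing = {"runtimes": {"nvidia": {"path": "nvidia-container-runtime", "runtimeArgs": []}}}
    print(json.dumps(with_runsc_runtime(existing), indent=4))
```

The function returns a new dictionary and leaves its input untouched, so an existing `nvidia` runtime entry is preserved alongside the added `runsc` one.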

#### Failure (truncated)
```
...
Exception raised from recvBytes at ../torch/csrc/distributed/c10d/Utils.hpp:672 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7edda14cf897 in /usr/local/lib/python3.11/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0x5b3a23e (0x7edd8d73a23e in /usr/local/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so)
frame #2: c10d::TCPStore::doWait(c10::ArrayRef<std::string>, std::chrono::duration<long, std::ratio<1l, 1000l> >) + 0x2c7 (0x7edd8d734c87 in /usr/local/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so)
frame #3: c10d::TCPStore::doGet(std::string const&) + 0x32 (0x7edd8d734f82 in /usr/local/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so)
frame #4: c10d::TCPStore::get(std::string const&) + 0xa1 (0x7edd8d735fd1 in /usr/local/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so)
frame #5: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7edd8d6ea371 in /usr/local/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so)
frame #6: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7edd8d6ea371 in /usr/local/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so)
frame #7: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7edd8d6ea371 in /usr/local/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so)
frame #8: c10d::ProcessGroupNCCL::broadcastUniqueNCCLID(ncclUniqueId*, bool, std::string const&, int) + 0xa9 (0x7edd54da9189 in /usr/local/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #9: c10d::ProcessGroupNCCL::getNCCLComm(std::string const&, c10::Device&, c10d::OpType, int, bool) + 0xc50 (0x7edd54db0610 in /usr/local/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #10: c10d::ProcessGroupNCCL::recv(std::vector<at::Tensor, std::allocator<at::Tensor> >&, int, int) + 0x5f8 (0x7edd54dcf978 in /usr/local/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #11: <unknown function> + 0x5adc309 (0x7edd8d6dc309 in /usr/local/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so)
frame #12: <unknown function> + 0x5ae6f10 (0x7edd8d6e6f10 in /usr/local/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so)
frame #13: <unknown function> + 0x5ae6fa5 (0x7edd8d6e6fa5 in /usr/local/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so)
frame #14: <unknown function> + 0x5124446 (0x7edd8cd24446 in /usr/local/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so)
frame #15: <unknown function> + 0x1acf4b8 (0x7edd896cf4b8 in /usr/local/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so)
frame #16: <unknown function> + 0x5aee004 (0x7edd8d6ee004 in /usr/local/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so)
frame #17: <unknown function> + 0x5af36b5 (0x7edd8d6f36b5 in /usr/local/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so)
frame #18: <unknown function> + 0xd2fe8e (0x7edda032fe8e in /usr/local/lib/python3.11/site-packages/torch/lib/libtorch_python.so)
frame #19: <unknown function> + 0x47f074 (0x7edd9fa7f074 in /usr/local/lib/python3.11/site-packages/torch/lib/libtorch_python.so)
<omitting python frames>
frame #35: <unknown function> + 0x29d90 (0x7edda2029d90 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #36: __libc_start_main + 0x80 (0x7edda2029e40 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #37: <unknown function> + 0x108e (0x55f950b0c08e in /usr/local/bin/python)
. This may indicate a possible application crash on rank 0 or a network set up issue.
...
```

### Fix
gvisor debug logs show:

```
W0702 20:36:17.577055  445833 uvm.go:148] [  22:  84] nvproxy: unknown uvm ioctl 66 = 0x42
```
I've implemented that ioctl in this PR. This is the output after the fix.

```
[RANK 2] sending tensor: tensor([1., 1., 1., 1., 1.], device='cuda:2')
[RANK 0] sending tensor: tensor([1., 1., 1., 1., 1.], device='cuda:0')
[RANK 1] sending tensor: tensor([1., 1., 1., 1., 1.], device='cuda:1')
[RANK 3] received tensor from rank=0: tensor([1., 1., 1., 1., 1.], device='cuda:3')
[RANK 3] received tensor from rank=1: tensor([1., 1., 1., 1., 1.], device='cuda:3')
[RANK 3] received tensor from rank=2: tensor([1., 1., 1., 1., 1.], device='cuda:3')
PASS: NCCL working.
```
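
When hunting for other missing ioctls, the debug logs can be scanned for the same warning. This is a minimal sketch, assuming the message format matches the single log line quoted above; the function name `unknown_ioctls` is hypothetical, and other nvproxy messages may be worded differently.

```python
import re
from collections import Counter
from typing import Iterable

# Based on the quoted line: "nvproxy: unknown uvm ioctl 66 = 0x42"
_UNKNOWN_IOCTL_RE = re.compile(
    r"nvproxy: unknown (?P<kind>\w+) ioctl (?P<num>\d+) = (?P<hex>0x[0-9a-fA-F]+)"
)


def unknown_ioctls(log_lines: Iterable[str]) -> Counter:
    """Count occurrences of each (kind, ioctl number) reported as unknown."""
    counts: Counter = Counter()
    for line in log_lines:
        match = _UNKNOWN_IOCTL_RE.search(line)
        if match:
            counts[(match["kind"], int(match["num"]))] += 1
    return counts
```

Feeding it the lines of a file under the `-debug-log` directory gives a tally per ioctl number, which makes it easier to see whether a failure involves one unimplemented call or several.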
FUTURE_COPYBARA_INTEGRATE_REVIEW=#10610 from luiscape:master ee88734
PiperOrigin-RevId: 649146570
copybara-service bot pushed a commit that referenced this issue Jul 8, 2024