From 9bba814e490c4c0c8894fce4cf575dfc0fc0c6ca Mon Sep 17 00:00:00 2001 From: Ricardo Decal Date: Thu, 24 Jul 2025 23:34:51 -0400 Subject: [PATCH 01/13] [Docs] Factor out troubleshooting to its own guide; add section for Ray Observability Signed-off-by: Ricardo Decal --- docs/serving/distributed_serving.md | 63 --------------- docs/serving/distributed_troubleshooting.md | 89 +++++++++++++++++++++ 2 files changed, 89 insertions(+), 63 deletions(-) create mode 100644 docs/serving/distributed_troubleshooting.md diff --git a/docs/serving/distributed_serving.md b/docs/serving/distributed_serving.md index 4f111115f307..5ded91b9b648 100644 --- a/docs/serving/distributed_serving.md +++ b/docs/serving/distributed_serving.md @@ -118,68 +118,5 @@ vllm serve /path/to/the/model/in/the/container \ --tensor-parallel-size 16 ``` -## Troubleshooting distributed deployments - -To make tensor parallelism performant, ensure that communication between nodes is efficient, for example, by using high-speed network cards such as InfiniBand. To set up the cluster to use InfiniBand, append additional arguments like `--privileged -e NCCL_IB_HCA=mlx5` to the `run_cluster.sh` script. Contact your system administrator for more information about the required flags. One way to confirm if InfiniBand is working is to run `vllm` with the `NCCL_DEBUG=TRACE` environment variable set, for example `NCCL_DEBUG=TRACE vllm serve ...`, and check the logs for the NCCL version and the network used. If you find `[send] via NET/Socket` in the logs, NCCL uses a raw TCP socket, which is not efficient for cross-node tensor parallelism. If you find `[send] via NET/IB/GDRDMA` in the logs, NCCL uses InfiniBand with GPUDirect RDMA, which is efficient. - -## Enabling GPUDirect RDMA - -To enable GPUDirect RDMA with vLLM, configure the following settings: - -- `IPC_LOCK` security context: add the `IPC_LOCK` capability to the container's security context to lock memory pages and prevent swapping to disk. -- Shared memory with `/dev/shm`: mount `/dev/shm` in the pod spec to provide shared memory for interprocess communication (IPC). - -If you use Docker, set up the container as follows: - -```bash -docker run --gpus all \ - --ipc=host \ - --shm-size=16G \ - -v /dev/shm:/dev/shm \ - vllm/vllm-openai -``` - -If you use Kubernetes, set up the pod spec as follows: - -```yaml -... -spec: - containers: - - name: vllm - image: vllm/vllm-openai - securityContext: - capabilities: - add: ["IPC_LOCK"] - volumeMounts: - - mountPath: /dev/shm - name: dshm - resources: - limits: - nvidia.com/gpu: 8 - requests: - nvidia.com/gpu: 8 - volumes: - - name: dshm - emptyDir: - medium: Memory -... -``` - -Efficient tensor parallelism requires fast inter-node communication, preferably through high-speed network adapters such as InfiniBand. To enable InfiniBand, append flags such as `--privileged -e NCCL_IB_HCA=mlx5` to `run_cluster.sh`. For cluster-specific settings, consult your system administrator. - -To confirm InfiniBand operation, enable detailed NCCL logs: - -```bash -NCCL_DEBUG=TRACE vllm serve ... -``` - -Search the logs for the transport method. Entries containing `[send] via NET/Socket` indicate raw TCP sockets, which perform poorly for cross-node tensor parallelism. Entries containing `[send] via NET/IB/GDRDMA` indicate InfiniBand with GPUDirect RDMA, which provides high performance. - -!!! tip "Verify inter-node GPU communication" - After you start the Ray cluster, verify GPU-to-GPU communication across nodes. Proper configuration can be non-trivial. 
For more information, see [troubleshooting script][troubleshooting-incorrect-hardware-driver]. If you need additional environment variables for communication configuration, append them to `run_cluster.sh`, for example `-e NCCL_SOCKET_IFNAME=eth0`. Setting environment variables during cluster creation is recommended because the variables propagate to all nodes. In contrast, setting environment variables in the shell affects only the local node. For more information, see . - !!! tip "Pre-download Hugging Face models" If you use Hugging Face models, downloading the model before starting vLLM is recommended. Download the model on every node to the same path, or store the model on a distributed file system accessible by all nodes. Then pass the path to the model in place of the repository ID. Otherwise, supply a Hugging Face token by appending `-e HF_TOKEN=` to `run_cluster.sh`. - -!!! tip - The error message `Error: No available node types can fulfill resource request` can appear even when the cluster has enough GPUs. The issue often occurs when nodes have multiple IP addresses and vLLM can't select the correct one. Ensure that vLLM and Ray use the same IP address by setting `VLLM_HOST_IP` in `run_cluster.sh` (with a different value on each node). Use `ray status` and `ray list nodes` to verify the chosen IP address. For more information, see . diff --git a/docs/serving/distributed_troubleshooting.md b/docs/serving/distributed_troubleshooting.md new file mode 100644 index 000000000000..d37f3beb9d95 --- /dev/null +++ b/docs/serving/distributed_troubleshooting.md @@ -0,0 +1,89 @@ +# Troubleshooting distributed deployments + +## Ray observability + +Debugging a distributed system can be challenging due to the large scale and complexity. Ray provides a suite of tools to help monitor, debug, and optimize Ray applications and clusters: + +- Ray Dashboard – real-time cluster and application metrics, centralized logs, and granular observability into individual tasks, actors, nodes, and more. +- Ray Tracing (OpenTelemetry) – distributed execution traces for performance bottleneck analysis +- Ray Distributed Debugger – step-through execution of remote tasks with breakpoints, and port-mortem debugging of unhandled exceptions. +- Ray State CLI & State API – programmatic access to jobs, actors, tasks, objects, and node state +- Ray Logs CLI – programmatic access to Ray logs at the task, actor, cluster, node, job, or worker levels. +- Ray Metrics (Prometheus & Grafana) – scrapeable metrics that can be integrated with existing monitoring systems. +- Distributed Profiling – diagnose performance bottlenecks with integrated tools for analyzing CPU, memory, and GPU usage across a cluster. + +For more information about Ray observability, visit the [official Ray observability docs](https://docs.ray.io/en/latest/ray-observability/index.html). For more information about debugging Ray applications, visit the [Ray Debugging Guide](https://docs.ray.io/en/latest/ray-observability/user-guides/debug-apps/index.html). + +## KubeRay + +https://docs.ray.io/en/latest/serve/advanced-guides/multi-node-gpu-troubleshooting.html#serve-multi-node-gpu-troubleshooting + +## Optimizing network communication for tensor parallelism + +To make tensor parallelism performant, ensure that communication between nodes is efficient, for example, by using high-speed network cards such as InfiniBand. To set up the cluster to use InfiniBand, append additional arguments like `--privileged -e NCCL_IB_HCA=mlx5` to the `run_cluster.sh` script. 
Contact your system administrator for more information about the required flags. One way to confirm if InfiniBand is working is to run `vllm` with the `NCCL_DEBUG=TRACE` environment variable set, for example `NCCL_DEBUG=TRACE vllm serve ...`, and check the logs for the NCCL version and the network used. If you find `[send] via NET/Socket` in the logs, NCCL uses a raw TCP socket, which is not efficient for cross-node tensor parallelism. If you find `[send] via NET/IB/GDRDMA` in the logs, NCCL uses InfiniBand with GPUDirect RDMA, which is efficient. + +## Enabling GPUDirect RDMA + +To enable GPUDirect RDMA with vLLM, configure the following settings: + +- `IPC_LOCK` security context: add the `IPC_LOCK` capability to the container's security context to lock memory pages and prevent swapping to disk. +- Shared memory with `/dev/shm`: mount `/dev/shm` in the pod spec to provide shared memory for interprocess communication (IPC). +## Enabling GPUDirect RDMA + +To enable GPUDirect RDMA with vLLM, configure the following settings: + +- `IPC_LOCK` security context: add the `IPC_LOCK` capability to the container's security context to lock memory pages and prevent swapping to disk. +- Shared memory with `/dev/shm`: mount `/dev/shm` in the pod spec to provide shared memory for interprocess communication (IPC). + +If you use Docker, set up the container as follows: + +```bash +docker run --gpus all \ + --ipc=host \ + --shm-size=16G \ + -v /dev/shm:/dev/shm \ + vllm/vllm-openai +``` + +If you use Kubernetes, set up the pod spec as follows: + +```yaml +... +spec: + containers: + - name: vllm + image: vllm/vllm-openai + securityContext: + capabilities: + add: ["IPC_LOCK"] + volumeMounts: + - mountPath: /dev/shm + name: dshm + resources: + limits: + nvidia.com/gpu: 8 + requests: + nvidia.com/gpu: 8 + volumes: + - name: dshm + emptyDir: + medium: Memory +... +``` + +Efficient tensor parallelism requires fast inter-node communication, preferably through high-speed network adapters such as InfiniBand. To enable InfiniBand, append flags such as `--privileged -e NCCL_IB_HCA=mlx5` to `run_cluster.sh`. For cluster-specific settings, consult your system administrator. + +To confirm InfiniBand operation, enable detailed NCCL logs: + +```bash +NCCL_DEBUG=TRACE vllm serve ... +``` + +Search the logs for the transport method. Entries containing `[send] via NET/Socket` indicate raw TCP sockets, which perform poorly for cross-node tensor parallelism. Entries containing `[send] via NET/IB/GDRDMA` indicate InfiniBand with GPUDirect RDMA, which provides high performance. + +!!! tip "Verify inter-node GPU communication" + After you start the Ray cluster, verify GPU-to-GPU communication across nodes. Proper configuration can be non-trivial. For more information, see [troubleshooting script][troubleshooting-incorrect-hardware-driver]. If you need additional environment variables for communication configuration, append them to `run_cluster.sh`, for example `-e NCCL_SOCKET_IFNAME=eth0`. Setting environment variables during cluster creation is recommended because the variables propagate to all nodes. In contrast, setting environment variables in the shell affects only the local node. For more information, see . + + +!!! tip + The error message `Error: No available node types can fulfill resource request` can appear even when the cluster has enough GPUs. The issue often occurs when nodes have multiple IP addresses and vLLM can't select the correct one. 
Ensure that vLLM and Ray use the same IP address by setting `VLLM_HOST_IP` in `run_cluster.sh` (with a different value on each node). Use `ray status` and `ray list nodes` to verify the chosen IP address. For more information, see . From 8e168d79f4fb91651963402e0ffc73610c9f2af5 Mon Sep 17 00:00:00 2001 From: Ricardo Decal Date: Thu, 24 Jul 2025 23:38:53 -0400 Subject: [PATCH 02/13] cleaned up kubernetes section Signed-off-by: Ricardo Decal --- docs/serving/distributed_troubleshooting.md | 7 ++----- 1 file changed, 2 insertions(+), 5 deletions(-) diff --git a/docs/serving/distributed_troubleshooting.md b/docs/serving/distributed_troubleshooting.md index d37f3beb9d95..7239639b4649 100644 --- a/docs/serving/distributed_troubleshooting.md +++ b/docs/serving/distributed_troubleshooting.md @@ -12,11 +12,8 @@ Debugging a distributed system can be challenging due to the large scale and com - Ray Metrics (Prometheus & Grafana) – scrapeable metrics that can be integrated with existing monitoring systems. - Distributed Profiling – diagnose performance bottlenecks with integrated tools for analyzing CPU, memory, and GPU usage across a cluster. -For more information about Ray observability, visit the [official Ray observability docs](https://docs.ray.io/en/latest/ray-observability/index.html). For more information about debugging Ray applications, visit the [Ray Debugging Guide](https://docs.ray.io/en/latest/ray-observability/user-guides/debug-apps/index.html). - -## KubeRay - -https://docs.ray.io/en/latest/serve/advanced-guides/multi-node-gpu-troubleshooting.html#serve-multi-node-gpu-troubleshooting +For more information about Ray observability, visit the [official Ray observability docs](https://docs.ray.io/en/latest/ray-observability/index.html). For more information about debugging Ray applications, visit the [Ray Debugging Guide](https://docs.ray.io/en/latest/ray-observability/user-guides/debug-apps/index.html). For information about troubleshooting Kubernetes clusters, see the +[official KubeRay troubleshooting guide](https://docs.ray.io/en/latest/serve/advanced-guides/multi-node-gpu-troubleshooting.html). ## Optimizing network communication for tensor parallelism From 1b93f64e1a48cce33cf198f8179025a21352e91a Mon Sep 17 00:00:00 2001 From: Ricardo Decal Date: Thu, 24 Jul 2025 23:41:05 -0400 Subject: [PATCH 03/13] cells. interlinked. Signed-off-by: Ricardo Decal --- docs/serving/distributed_serving.md | 2 ++ 1 file changed, 2 insertions(+) diff --git a/docs/serving/distributed_serving.md b/docs/serving/distributed_serving.md index 5ded91b9b648..c9cb05267096 100644 --- a/docs/serving/distributed_serving.md +++ b/docs/serving/distributed_serving.md @@ -120,3 +120,5 @@ vllm serve /path/to/the/model/in/the/container \ !!! tip "Pre-download Hugging Face models" If you use Hugging Face models, downloading the model before starting vLLM is recommended. Download the model on every node to the same path, or store the model on a distributed file system accessible by all nodes. Then pass the path to the model in place of the repository ID. Otherwise, supply a Hugging Face token by appending `-e HF_TOKEN=` to `run_cluster.sh`. + +For information about distributed debugging, see [Troubleshooting distributed deployments](distributed_troubleshooting.md). 
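To make the `VLLM_HOST_IP` tip above concrete, here is a minimal sketch of passing a node-specific value at cluster-creation time and then checking it. The positional arguments of `run_cluster.sh`, the role flag, and the paths are assumptions for illustration, not taken from this patch series.

```bash
# Hedged sketch: launch this node's container with a node-specific VLLM_HOST_IP
# so Ray and vLLM agree on the address (run_cluster.sh argument order, image,
# and paths are assumed; adjust them to your cluster).
HEAD_NODE_IP=10.0.0.1                        # example head-node address
HOST_IP=$(hostname -I | awk '{print $1}')    # routable IP of this node
bash run_cluster.sh \
    vllm/vllm-openai \
    "${HEAD_NODE_IP}" \
    --worker \
    /path/to/the/huggingface/home \
    -e VLLM_HOST_IP="${HOST_IP}"

# From inside any container, confirm every node registered with the expected IP:
ray status
ray list nodes
```

Repeating the launch on each node with its own `HOST_IP` keeps the value node-specific, which is what the tip above asks for.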
From b62396aa6167659472f717dcd30b73c06daa3678 Mon Sep 17 00:00:00 2001 From: Ricardo Decal Date: Fri, 25 Jul 2025 00:23:10 -0400 Subject: [PATCH 04/13] rm duplicated section Signed-off-by: Ricardo Decal --- docs/serving/distributed_troubleshooting.md | 6 ------ 1 file changed, 6 deletions(-) diff --git a/docs/serving/distributed_troubleshooting.md b/docs/serving/distributed_troubleshooting.md index 7239639b4649..c68903de89d1 100644 --- a/docs/serving/distributed_troubleshooting.md +++ b/docs/serving/distributed_troubleshooting.md @@ -23,12 +23,6 @@ To make tensor parallelism performant, ensure that communication between nodes i To enable GPUDirect RDMA with vLLM, configure the following settings: -- `IPC_LOCK` security context: add the `IPC_LOCK` capability to the container's security context to lock memory pages and prevent swapping to disk. -- Shared memory with `/dev/shm`: mount `/dev/shm` in the pod spec to provide shared memory for interprocess communication (IPC). -## Enabling GPUDirect RDMA - -To enable GPUDirect RDMA with vLLM, configure the following settings: - - `IPC_LOCK` security context: add the `IPC_LOCK` capability to the container's security context to lock memory pages and prevent swapping to disk. - Shared memory with `/dev/shm`: mount `/dev/shm` in the pod spec to provide shared memory for interprocess communication (IPC). From bad32497f599a10b7988898bebc4c7ba3b2f1d8e Mon Sep 17 00:00:00 2001 From: Ricardo Decal Date: Fri, 25 Jul 2025 00:49:39 -0400 Subject: [PATCH 05/13] combined redundant information about enabling gpudirect rdma. added explanation of what is gpudirect rdma Signed-off-by: Ricardo Decal --- docs/serving/distributed_troubleshooting.md | 25 +++++++++++++-------- 1 file changed, 16 insertions(+), 9 deletions(-) diff --git a/docs/serving/distributed_troubleshooting.md b/docs/serving/distributed_troubleshooting.md index c68903de89d1..c91ed8fd4e19 100644 --- a/docs/serving/distributed_troubleshooting.md +++ b/docs/serving/distributed_troubleshooting.md @@ -17,10 +17,19 @@ For more information about Ray observability, visit the [official Ray observabil ## Optimizing network communication for tensor parallelism -To make tensor parallelism performant, ensure that communication between nodes is efficient, for example, by using high-speed network cards such as InfiniBand. To set up the cluster to use InfiniBand, append additional arguments like `--privileged -e NCCL_IB_HCA=mlx5` to the `run_cluster.sh` script. Contact your system administrator for more information about the required flags. One way to confirm if InfiniBand is working is to run `vllm` with the `NCCL_DEBUG=TRACE` environment variable set, for example `NCCL_DEBUG=TRACE vllm serve ...`, and check the logs for the NCCL version and the network used. If you find `[send] via NET/Socket` in the logs, NCCL uses a raw TCP socket, which is not efficient for cross-node tensor parallelism. If you find `[send] via NET/IB/GDRDMA` in the logs, NCCL uses InfiniBand with GPUDirect RDMA, which is efficient. +Efficient tensor parallelism +requires fast inter-node +communication, preferably through +high-speed network adapters such +as InfiniBand. To set up the cluster to use InfiniBand, append additional arguments like `--privileged -e NCCL_IB_HCA=mlx5` to the helper script. Contact your system administrator for more information about the required flags. 
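Before pointing NCCL at an InfiniBand HCA as the paragraph above suggests, it can help to confirm what the node actually exposes. The commands below are an illustrative sketch and assume the standard RDMA userspace tools are installed; they are not part of this patch.

```bash
# Sketch: inspect the RDMA devices visible on this node before choosing a
# value for NCCL_IB_HCA (requires rdma-core / infiniband-diags).
ibv_devices                  # HCA device names, e.g. mlx5_0, mlx5_1
ibstat                       # port state and link layer for each HCA
ls /sys/class/infiniband/    # the same device list straight from sysfs
```

A device-name prefix reported here, such as `mlx5`, is what an argument like `-e NCCL_IB_HCA=mlx5` is matched against.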
+ + + ## Enabling GPUDirect RDMA +GPUDirect RDMA (Remote Direct Memory Access) is an NVIDIA technology that allows network adapters to directly access GPU memory, bypassing the CPU and system memory. This direct access reduces latency and CPU overhead, which is beneficial for large data transfers between GPUs across nodes. + To enable GPUDirect RDMA with vLLM, configure the following settings: - `IPC_LOCK` security context: add the `IPC_LOCK` capability to the container's security context to lock memory pages and prevent swapping to disk. @@ -62,18 +71,16 @@ spec: ... ``` -Efficient tensor parallelism requires fast inter-node communication, preferably through high-speed network adapters such as InfiniBand. To enable InfiniBand, append flags such as `--privileged -e NCCL_IB_HCA=mlx5` to `run_cluster.sh`. For cluster-specific settings, consult your system administrator. +!!! tip "Confirm GPUDirect RDMA operation" + To confirm your InfiniBand card is using GPUDirect RDMA, run vLLM with detailed NCCL logs: `NCCL_DEBUG=TRACE vllm serve ...`. -To confirm InfiniBand operation, enable detailed NCCL logs: - -```bash -NCCL_DEBUG=TRACE vllm serve ... -``` + Then look for the NCCL version and the network used. -Search the logs for the transport method. Entries containing `[send] via NET/Socket` indicate raw TCP sockets, which perform poorly for cross-node tensor parallelism. Entries containing `[send] via NET/IB/GDRDMA` indicate InfiniBand with GPUDirect RDMA, which provides high performance. + - If you find `[send] via NET/IB/GDRDMA` in the logs, then NCCL is using InfiniBand with GPUDirect RDMA, which *is* efficient. + - If you find `[send] via NET/Socket` in the logs, NCCL used a raw TCP socket, which *is not* efficient for cross-node tensor parallelism. !!! tip "Verify inter-node GPU communication" - After you start the Ray cluster, verify GPU-to-GPU communication across nodes. Proper configuration can be non-trivial. For more information, see [troubleshooting script][troubleshooting-incorrect-hardware-driver]. If you need additional environment variables for communication configuration, append them to `run_cluster.sh`, for example `-e NCCL_SOCKET_IFNAME=eth0`. Setting environment variables during cluster creation is recommended because the variables propagate to all nodes. In contrast, setting environment variables in the shell affects only the local node. For more information, see . + After you start the Ray cluster, verify GPU-to-GPU communication across nodes. Proper configuration can be non-trivial. For more information, see [troubleshooting script][troubleshooting-incorrect-hardware-driver]. If you need additional environment variables for communication configuration, append them to , for example `-e NCCL_SOCKET_IFNAME=eth0`. Setting environment variables during cluster creation is recommended because the variables propagate to all nodes. In contrast, setting environment variables in the shell affects only the local node. For more information, see . !!! 
tip From 391a8ecc47841cafebaf382f935d287fee3f53e4 Mon Sep 17 00:00:00 2001 From: Ricardo Decal Date: Tue, 29 Jul 2025 11:32:32 -0700 Subject: [PATCH 06/13] lint Signed-off-by: Ricardo Decal --- docs/serving/distributed_troubleshooting.md | 13 ++++--------- 1 file changed, 4 insertions(+), 9 deletions(-) diff --git a/docs/serving/distributed_troubleshooting.md b/docs/serving/distributed_troubleshooting.md index c91ed8fd4e19..583d940e3bf0 100644 --- a/docs/serving/distributed_troubleshooting.md +++ b/docs/serving/distributed_troubleshooting.md @@ -17,14 +17,10 @@ For more information about Ray observability, visit the [official Ray observabil ## Optimizing network communication for tensor parallelism -Efficient tensor parallelism -requires fast inter-node -communication, preferably through -high-speed network adapters such -as InfiniBand. To set up the cluster to use InfiniBand, append additional arguments like `--privileged -e NCCL_IB_HCA=mlx5` to the helper script. Contact your system administrator for more information about the required flags. - - - +Efficient tensor parallelism requires fast inter-node communication, preferably through high-speed network adapters such as InfiniBand. +To set up the cluster to use InfiniBand, append additional arguments like `--privileged -e NCCL_IB_HCA=mlx5` to the + helper script. +Contact your system administrator for more information about the required flags. ## Enabling GPUDirect RDMA @@ -82,6 +78,5 @@ spec: !!! tip "Verify inter-node GPU communication" After you start the Ray cluster, verify GPU-to-GPU communication across nodes. Proper configuration can be non-trivial. For more information, see [troubleshooting script][troubleshooting-incorrect-hardware-driver]. If you need additional environment variables for communication configuration, append them to , for example `-e NCCL_SOCKET_IFNAME=eth0`. Setting environment variables during cluster creation is recommended because the variables propagate to all nodes. In contrast, setting environment variables in the shell affects only the local node. For more information, see . - !!! tip The error message `Error: No available node types can fulfill resource request` can appear even when the cluster has enough GPUs. The issue often occurs when nodes have multiple IP addresses and vLLM can't select the correct one. Ensure that vLLM and Ray use the same IP address by setting `VLLM_HOST_IP` in `run_cluster.sh` (with a different value on each node). Use `ray status` and `ray list nodes` to verify the chosen IP address. For more information, see . From bddc395963430261bbd4a314b5cf33dad2dec2fb Mon Sep 17 00:00:00 2001 From: Ricardo Decal Date: Wed, 30 Jul 2025 09:58:18 -0700 Subject: [PATCH 07/13] drop ray obs detail. how to use the toolkit in real scenarios for future work. Signed-off-by: Ricardo Decal --- docs/serving/distributed_troubleshooting.md | 12 +----------- 1 file changed, 1 insertion(+), 11 deletions(-) diff --git a/docs/serving/distributed_troubleshooting.md b/docs/serving/distributed_troubleshooting.md index 583d940e3bf0..968a6adb390d 100644 --- a/docs/serving/distributed_troubleshooting.md +++ b/docs/serving/distributed_troubleshooting.md @@ -2,17 +2,7 @@ ## Ray observability -Debugging a distributed system can be challenging due to the large scale and complexity. 
Ray provides a suite of tools to help monitor, debug, and optimize Ray applications and clusters: - -- Ray Dashboard – real-time cluster and application metrics, centralized logs, and granular observability into individual tasks, actors, nodes, and more. -- Ray Tracing (OpenTelemetry) – distributed execution traces for performance bottleneck analysis -- Ray Distributed Debugger – step-through execution of remote tasks with breakpoints, and port-mortem debugging of unhandled exceptions. -- Ray State CLI & State API – programmatic access to jobs, actors, tasks, objects, and node state -- Ray Logs CLI – programmatic access to Ray logs at the task, actor, cluster, node, job, or worker levels. -- Ray Metrics (Prometheus & Grafana) – scrapeable metrics that can be integrated with existing monitoring systems. -- Distributed Profiling – diagnose performance bottlenecks with integrated tools for analyzing CPU, memory, and GPU usage across a cluster. - -For more information about Ray observability, visit the [official Ray observability docs](https://docs.ray.io/en/latest/ray-observability/index.html). For more information about debugging Ray applications, visit the [Ray Debugging Guide](https://docs.ray.io/en/latest/ray-observability/user-guides/debug-apps/index.html). For information about troubleshooting Kubernetes clusters, see the +Debugging a distributed system can be challenging due to the large scale and complexity. Ray provides a suite of tools to help monitor, debug, and optimize Ray applications and clusters. For more information about Ray observability, visit the [official Ray observability docs](https://docs.ray.io/en/latest/ray-observability/index.html). For more information about debugging Ray applications, visit the [Ray Debugging Guide](https://docs.ray.io/en/latest/ray-observability/user-guides/debug-apps/index.html). For information about troubleshooting Kubernetes clusters, see the [official KubeRay troubleshooting guide](https://docs.ray.io/en/latest/serve/advanced-guides/multi-node-gpu-troubleshooting.html). ## Optimizing network communication for tensor parallelism From 5aff50bb881208a375d47e25b485c2aa785f98a7 Mon Sep 17 00:00:00 2001 From: Ricardo Decal Date: Wed, 30 Jul 2025 11:55:46 -0700 Subject: [PATCH 08/13] add header before link to distributed troubleshooting guide Signed-off-by: Ricardo Decal --- docs/serving/distributed_serving.md | 2 ++ 1 file changed, 2 insertions(+) diff --git a/docs/serving/distributed_serving.md b/docs/serving/distributed_serving.md index c9cb05267096..c7e1432c62cb 100644 --- a/docs/serving/distributed_serving.md +++ b/docs/serving/distributed_serving.md @@ -121,4 +121,6 @@ vllm serve /path/to/the/model/in/the/container \ !!! tip "Pre-download Hugging Face models" If you use Hugging Face models, downloading the model before starting vLLM is recommended. Download the model on every node to the same path, or store the model on a distributed file system accessible by all nodes. Then pass the path to the model in place of the repository ID. Otherwise, supply a Hugging Face token by appending `-e HF_TOKEN=` to `run_cluster.sh`. +## Troubleshooting distributed deployments + For information about distributed debugging, see [Troubleshooting distributed deployments](distributed_troubleshooting.md). 
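To make the observability pointer above more concrete, a few of the Ray command-line entry points it refers to are sketched below. Exact flags and output depend on the installed Ray version, so treat this as illustrative rather than part of the patch.

```bash
# Illustrative Ray observability commands (the State and Logs CLIs ship with
# `ray[default]`); run them inside a node's container against a live cluster.
ray status                     # cluster-wide resource and node summary
ray list nodes                 # per-node state, IP addresses, and resources
ray list actors --limit 20     # actor states, useful for spotting dead workers
ray logs cluster --tail 100    # recent logs from the head node
```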
From f243c6d20a1d9ce43d7966147a902ff75882f532 Mon Sep 17 00:00:00 2001 From: Ricardo Decal Date: Thu, 31 Jul 2025 09:53:31 -0700 Subject: [PATCH 09/13] Move the network comms guide back to the distributed serving guide Signed-off-by: Ricardo Decal --- docs/serving/distributed_serving.md | 60 +++++++++++++++++++++ docs/serving/distributed_troubleshooting.md | 60 --------------------- 2 files changed, 60 insertions(+), 60 deletions(-) diff --git a/docs/serving/distributed_serving.md b/docs/serving/distributed_serving.md index c7e1432c62cb..25c6e96ee989 100644 --- a/docs/serving/distributed_serving.md +++ b/docs/serving/distributed_serving.md @@ -118,6 +118,66 @@ vllm serve /path/to/the/model/in/the/container \ --tensor-parallel-size 16 ``` +## Optimizing network communication for tensor parallelism + +Efficient tensor parallelism requires fast inter-node communication, preferably through high-speed network adapters such as InfiniBand. +To set up the cluster to use InfiniBand, append additional arguments like `--privileged -e NCCL_IB_HCA=mlx5` to the + helper script. +Contact your system administrator for more information about the required flags. + +## Enabling GPUDirect RDMA + +GPUDirect RDMA (Remote Direct Memory Access) is an NVIDIA technology that allows network adapters to directly access GPU memory, bypassing the CPU and system memory. This direct access reduces latency and CPU overhead, which is beneficial for large data transfers between GPUs across nodes. + +To enable GPUDirect RDMA with vLLM, configure the following settings: + +- `IPC_LOCK` security context: add the `IPC_LOCK` capability to the container's security context to lock memory pages and prevent swapping to disk. +- Shared memory with `/dev/shm`: mount `/dev/shm` in the pod spec to provide shared memory for interprocess communication (IPC). + +If you use Docker, set up the container as follows: + +```bash +docker run --gpus all \ + --ipc=host \ + --shm-size=16G \ + -v /dev/shm:/dev/shm \ + vllm/vllm-openai +``` + +If you use Kubernetes, set up the pod spec as follows: + +```yaml +... +spec: + containers: + - name: vllm + image: vllm/vllm-openai + securityContext: + capabilities: + add: ["IPC_LOCK"] + volumeMounts: + - mountPath: /dev/shm + name: dshm + resources: + limits: + nvidia.com/gpu: 8 + requests: + nvidia.com/gpu: 8 + volumes: + - name: dshm + emptyDir: + medium: Memory +... +``` + +!!! tip "Confirm GPUDirect RDMA operation" + To confirm your InfiniBand card is using GPUDirect RDMA, run vLLM with detailed NCCL logs: `NCCL_DEBUG=TRACE vllm serve ...`. + + Then look for the NCCL version and the network used. + + - If you find `[send] via NET/IB/GDRDMA` in the logs, then NCCL is using InfiniBand with GPUDirect RDMA, which *is* efficient. + - If you find `[send] via NET/Socket` in the logs, NCCL used a raw TCP socket, which *is not* efficient for cross-node tensor parallelism. + !!! tip "Pre-download Hugging Face models" If you use Hugging Face models, downloading the model before starting vLLM is recommended. Download the model on every node to the same path, or store the model on a distributed file system accessible by all nodes. Then pass the path to the model in place of the repository ID. Otherwise, supply a Hugging Face token by appending `-e HF_TOKEN=` to `run_cluster.sh`. 
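As a worked example of the pre-download tip above, the commands below fetch a model once per node and then serve it by path. The model ID and target directory are placeholders, not values from this patch.

```bash
# Hypothetical example: download the model to the same local path on every
# node (or onto a shared filesystem mounted on all nodes).
huggingface-cli download meta-llama/Llama-3.1-70B-Instruct \
    --local-dir /models/llama-3.1-70b-instruct

# Then pass the path instead of the repository ID:
vllm serve /models/llama-3.1-70b-instruct \
    --tensor-parallel-size 16
```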
diff --git a/docs/serving/distributed_troubleshooting.md b/docs/serving/distributed_troubleshooting.md index 968a6adb390d..19aa65ead46f 100644 --- a/docs/serving/distributed_troubleshooting.md +++ b/docs/serving/distributed_troubleshooting.md @@ -5,66 +5,6 @@ Debugging a distributed system can be challenging due to the large scale and complexity. Ray provides a suite of tools to help monitor, debug, and optimize Ray applications and clusters. For more information about Ray observability, visit the [official Ray observability docs](https://docs.ray.io/en/latest/ray-observability/index.html). For more information about debugging Ray applications, visit the [Ray Debugging Guide](https://docs.ray.io/en/latest/ray-observability/user-guides/debug-apps/index.html). For information about troubleshooting Kubernetes clusters, see the [official KubeRay troubleshooting guide](https://docs.ray.io/en/latest/serve/advanced-guides/multi-node-gpu-troubleshooting.html). -## Optimizing network communication for tensor parallelism - -Efficient tensor parallelism requires fast inter-node communication, preferably through high-speed network adapters such as InfiniBand. -To set up the cluster to use InfiniBand, append additional arguments like `--privileged -e NCCL_IB_HCA=mlx5` to the - helper script. -Contact your system administrator for more information about the required flags. - -## Enabling GPUDirect RDMA - -GPUDirect RDMA (Remote Direct Memory Access) is an NVIDIA technology that allows network adapters to directly access GPU memory, bypassing the CPU and system memory. This direct access reduces latency and CPU overhead, which is beneficial for large data transfers between GPUs across nodes. - -To enable GPUDirect RDMA with vLLM, configure the following settings: - -- `IPC_LOCK` security context: add the `IPC_LOCK` capability to the container's security context to lock memory pages and prevent swapping to disk. -- Shared memory with `/dev/shm`: mount `/dev/shm` in the pod spec to provide shared memory for interprocess communication (IPC). - -If you use Docker, set up the container as follows: - -```bash -docker run --gpus all \ - --ipc=host \ - --shm-size=16G \ - -v /dev/shm:/dev/shm \ - vllm/vllm-openai -``` - -If you use Kubernetes, set up the pod spec as follows: - -```yaml -... -spec: - containers: - - name: vllm - image: vllm/vllm-openai - securityContext: - capabilities: - add: ["IPC_LOCK"] - volumeMounts: - - mountPath: /dev/shm - name: dshm - resources: - limits: - nvidia.com/gpu: 8 - requests: - nvidia.com/gpu: 8 - volumes: - - name: dshm - emptyDir: - medium: Memory -... -``` - -!!! tip "Confirm GPUDirect RDMA operation" - To confirm your InfiniBand card is using GPUDirect RDMA, run vLLM with detailed NCCL logs: `NCCL_DEBUG=TRACE vllm serve ...`. - - Then look for the NCCL version and the network used. - - - If you find `[send] via NET/IB/GDRDMA` in the logs, then NCCL is using InfiniBand with GPUDirect RDMA, which *is* efficient. - - If you find `[send] via NET/Socket` in the logs, NCCL used a raw TCP socket, which *is not* efficient for cross-node tensor parallelism. - !!! tip "Verify inter-node GPU communication" After you start the Ray cluster, verify GPU-to-GPU communication across nodes. Proper configuration can be non-trivial. For more information, see [troubleshooting script][troubleshooting-incorrect-hardware-driver]. If you need additional environment variables for communication configuration, append them to , for example `-e NCCL_SOCKET_IFNAME=eth0`. 
Setting environment variables during cluster creation is recommended because the variables propagate to all nodes. In contrast, setting environment variables in the shell affects only the local node. For more information, see . From e7404c88be9913cba6f90be476043d73c23afe8a Mon Sep 17 00:00:00 2001 From: Ricardo Decal Date: Thu, 31 Jul 2025 09:59:07 -0700 Subject: [PATCH 10/13] Move ray observability section to bottom Signed-off-by: Ricardo Decal --- docs/serving/distributed_troubleshooting.md | 7 ++++--- 1 file changed, 4 insertions(+), 3 deletions(-) diff --git a/docs/serving/distributed_troubleshooting.md b/docs/serving/distributed_troubleshooting.md index 19aa65ead46f..d1d3f89ea6ee 100644 --- a/docs/serving/distributed_troubleshooting.md +++ b/docs/serving/distributed_troubleshooting.md @@ -1,12 +1,13 @@ # Troubleshooting distributed deployments -## Ray observability -Debugging a distributed system can be challenging due to the large scale and complexity. Ray provides a suite of tools to help monitor, debug, and optimize Ray applications and clusters. For more information about Ray observability, visit the [official Ray observability docs](https://docs.ray.io/en/latest/ray-observability/index.html). For more information about debugging Ray applications, visit the [Ray Debugging Guide](https://docs.ray.io/en/latest/ray-observability/user-guides/debug-apps/index.html). For information about troubleshooting Kubernetes clusters, see the -[official KubeRay troubleshooting guide](https://docs.ray.io/en/latest/serve/advanced-guides/multi-node-gpu-troubleshooting.html). !!! tip "Verify inter-node GPU communication" After you start the Ray cluster, verify GPU-to-GPU communication across nodes. Proper configuration can be non-trivial. For more information, see [troubleshooting script][troubleshooting-incorrect-hardware-driver]. If you need additional environment variables for communication configuration, append them to , for example `-e NCCL_SOCKET_IFNAME=eth0`. Setting environment variables during cluster creation is recommended because the variables propagate to all nodes. In contrast, setting environment variables in the shell affects only the local node. For more information, see . !!! tip The error message `Error: No available node types can fulfill resource request` can appear even when the cluster has enough GPUs. The issue often occurs when nodes have multiple IP addresses and vLLM can't select the correct one. Ensure that vLLM and Ray use the same IP address by setting `VLLM_HOST_IP` in `run_cluster.sh` (with a different value on each node). Use `ray status` and `ray list nodes` to verify the chosen IP address. For more information, see . +## Ray observability + +Debugging a distributed system can be challenging due to the large scale and complexity. Ray provides a suite of tools to help monitor, debug, and optimize Ray applications and clusters. For more information about Ray observability, visit the [official Ray observability docs](https://docs.ray.io/en/latest/ray-observability/index.html). For more information about debugging Ray applications, visit the [Ray Debugging Guide](https://docs.ray.io/en/latest/ray-observability/user-guides/debug-apps/index.html). For information about troubleshooting Kubernetes clusters, see the +[official KubeRay troubleshooting guide](https://docs.ray.io/en/latest/serve/advanced-guides/multi-node-gpu-troubleshooting.html). 
\ No newline at end of file From 17c511296d0272b6215161bca682834ef756d76e Mon Sep 17 00:00:00 2001 From: Ricardo Decal Date: Thu, 31 Jul 2025 09:59:49 -0700 Subject: [PATCH 11/13] Promote admonitions into sections, and create link to run_cluster.sh Signed-off-by: Ricardo Decal --- docs/serving/distributed_troubleshooting.md | 10 ++++------ 1 file changed, 4 insertions(+), 6 deletions(-) diff --git a/docs/serving/distributed_troubleshooting.md b/docs/serving/distributed_troubleshooting.md index d1d3f89ea6ee..0508d57e3b68 100644 --- a/docs/serving/distributed_troubleshooting.md +++ b/docs/serving/distributed_troubleshooting.md @@ -1,13 +1,11 @@ # Troubleshooting distributed deployments - - -!!! tip "Verify inter-node GPU communication" +## Verify inter-node GPU communication After you start the Ray cluster, verify GPU-to-GPU communication across nodes. Proper configuration can be non-trivial. For more information, see [troubleshooting script][troubleshooting-incorrect-hardware-driver]. If you need additional environment variables for communication configuration, append them to , for example `-e NCCL_SOCKET_IFNAME=eth0`. Setting environment variables during cluster creation is recommended because the variables propagate to all nodes. In contrast, setting environment variables in the shell affects only the local node. For more information, see . -!!! tip - The error message `Error: No available node types can fulfill resource request` can appear even when the cluster has enough GPUs. The issue often occurs when nodes have multiple IP addresses and vLLM can't select the correct one. Ensure that vLLM and Ray use the same IP address by setting `VLLM_HOST_IP` in `run_cluster.sh` (with a different value on each node). Use `ray status` and `ray list nodes` to verify the chosen IP address. For more information, see . +## No available node types can fulfill resource request +The error message `Error: No available node types can fulfill resource request` can appear even when the cluster has enough GPUs. The issue often occurs when nodes have multiple IP addresses and vLLM can't select the correct one. Ensure that vLLM and Ray use the same IP address by setting `VLLM_HOST_IP` in (with a different value on each node). Use `ray status` and `ray list nodes` to verify the chosen IP address. For more information, see . ## Ray observability Debugging a distributed system can be challenging due to the large scale and complexity. Ray provides a suite of tools to help monitor, debug, and optimize Ray applications and clusters. For more information about Ray observability, visit the [official Ray observability docs](https://docs.ray.io/en/latest/ray-observability/index.html). For more information about debugging Ray applications, visit the [Ray Debugging Guide](https://docs.ray.io/en/latest/ray-observability/user-guides/debug-apps/index.html). For information about troubleshooting Kubernetes clusters, see the -[official KubeRay troubleshooting guide](https://docs.ray.io/en/latest/serve/advanced-guides/multi-node-gpu-troubleshooting.html). \ No newline at end of file +[official KubeRay troubleshooting guide](https://docs.ray.io/en/latest/serve/advanced-guides/multi-node-gpu-troubleshooting.html). 
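The "Verify inter-node GPU communication" section above relies on the sanity-check script from the linked troubleshooting guide. The invocation below is a hedged sketch of how such a multi-node check is commonly launched; the script name, process counts, and rendezvous endpoint are placeholders rather than values from this patch.

```bash
# Illustrative two-node NCCL/GPU communication check (test.py stands in for
# the script from the linked troubleshooting guide); run the same command on
# both nodes so they rendezvous at the head node.
HEAD_NODE_IP=10.0.0.1   # example head-node address
NCCL_DEBUG=TRACE torchrun \
    --nnodes 2 \
    --nproc-per-node=8 \
    --rdzv_backend=c10d \
    --rdzv_endpoint="${HEAD_NODE_IP}:29400" \
    test.py
```

The resulting NCCL trace can then be inspected for `NET/IB/GDRDMA` versus `NET/Socket`, as described elsewhere in this series.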
From 06824165d5558704f019de02f57459270fe97e56 Mon Sep 17 00:00:00 2001 From: Ricardo Decal Date: Thu, 31 Jul 2025 10:44:37 -0700 Subject: [PATCH 12/13] markdownlint Signed-off-by: Ricardo Decal --- docs/serving/distributed_serving.md | 2 +- docs/serving/distributed_troubleshooting.md | 3 +++ 2 files changed, 4 insertions(+), 1 deletion(-) diff --git a/docs/serving/distributed_serving.md b/docs/serving/distributed_serving.md index 25c6e96ee989..06d5fbdbf4a6 100644 --- a/docs/serving/distributed_serving.md +++ b/docs/serving/distributed_serving.md @@ -99,7 +99,7 @@ From any node, enter a container and run `ray status` and `ray list nodes` to ve ### Running vLLM on a Ray cluster !!! tip - If Ray is running inside containers, run the commands in the remainder of this guide _inside the containers_, not on the host. To open a shell inside a container, connect to a node and use `docker exec -it /bin/bash`. + If Ray is running inside containers, run the commands in the remainder of this guide *inside the containers*, not on the host. To open a shell inside a container, connect to a node and use `docker exec -it /bin/bash`. Once a Ray cluster is running, use vLLM as you would in a single-node setting. All resources across the Ray cluster are visible to vLLM, so a single `vllm` command on a single node is sufficient. diff --git a/docs/serving/distributed_troubleshooting.md b/docs/serving/distributed_troubleshooting.md index 0508d57e3b68..f2d9b6105440 100644 --- a/docs/serving/distributed_troubleshooting.md +++ b/docs/serving/distributed_troubleshooting.md @@ -1,10 +1,13 @@ # Troubleshooting distributed deployments ## Verify inter-node GPU communication + After you start the Ray cluster, verify GPU-to-GPU communication across nodes. Proper configuration can be non-trivial. For more information, see [troubleshooting script][troubleshooting-incorrect-hardware-driver]. If you need additional environment variables for communication configuration, append them to , for example `-e NCCL_SOCKET_IFNAME=eth0`. Setting environment variables during cluster creation is recommended because the variables propagate to all nodes. In contrast, setting environment variables in the shell affects only the local node. For more information, see . ## No available node types can fulfill resource request + The error message `Error: No available node types can fulfill resource request` can appear even when the cluster has enough GPUs. The issue often occurs when nodes have multiple IP addresses and vLLM can't select the correct one. Ensure that vLLM and Ray use the same IP address by setting `VLLM_HOST_IP` in (with a different value on each node). Use `ray status` and `ray list nodes` to verify the chosen IP address. For more information, see . + ## Ray observability Debugging a distributed system can be challenging due to the large scale and complexity. Ray provides a suite of tools to help monitor, debug, and optimize Ray applications and clusters. For more information about Ray observability, visit the [official Ray observability docs](https://docs.ray.io/en/latest/ray-observability/index.html). For more information about debugging Ray applications, visit the [Ray Debugging Guide](https://docs.ray.io/en/latest/ray-observability/user-guides/debug-apps/index.html). 
For information about troubleshooting Kubernetes clusters, see the From 22a92228efd7e53717d53e11dca4b4ad85fae17a Mon Sep 17 00:00:00 2001 From: Ricardo Decal Date: Fri, 1 Aug 2025 11:11:11 -0700 Subject: [PATCH 13/13] unindent paragraph, and link to general troubleshooting guide Signed-off-by: Ricardo Decal --- docs/serving/distributed_troubleshooting.md | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/docs/serving/distributed_troubleshooting.md b/docs/serving/distributed_troubleshooting.md index f2d9b6105440..bd45f010ed2a 100644 --- a/docs/serving/distributed_troubleshooting.md +++ b/docs/serving/distributed_troubleshooting.md @@ -1,8 +1,10 @@ # Troubleshooting distributed deployments +For general troubleshooting, see [Troubleshooting](../usage/troubleshooting.md). + ## Verify inter-node GPU communication - After you start the Ray cluster, verify GPU-to-GPU communication across nodes. Proper configuration can be non-trivial. For more information, see [troubleshooting script][troubleshooting-incorrect-hardware-driver]. If you need additional environment variables for communication configuration, append them to , for example `-e NCCL_SOCKET_IFNAME=eth0`. Setting environment variables during cluster creation is recommended because the variables propagate to all nodes. In contrast, setting environment variables in the shell affects only the local node. For more information, see . +After you start the Ray cluster, verify GPU-to-GPU communication across nodes. Proper configuration can be non-trivial. For more information, see [troubleshooting script][troubleshooting-incorrect-hardware-driver]. If you need additional environment variables for communication configuration, append them to , for example `-e NCCL_SOCKET_IFNAME=eth0`. Setting environment variables during cluster creation is recommended because the variables propagate to all nodes. In contrast, setting environment variables in the shell affects only the local node. For more information, see . ## No available node types can fulfill resource request
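As a final illustration of the multi-address guidance above, the sketch below shows one way to inspect a node's interfaces before deciding which values to pass at cluster-creation time. The interface names and addresses are examples, not values from this patch series.

```bash
# Sketch: list this node's interfaces and addresses to find the one that is
# routable from the other nodes (example values shown in the comments).
ip -brief addr show
#   eth0   UP   10.0.0.12/24 ...
#   ib0    UP   192.168.100.12/24 ...

# Append the chosen, node-specific values when launching the cluster helper so
# they propagate to every container, rather than exporting them in a shell:
#   -e VLLM_HOST_IP=10.0.0.12 -e NCCL_SOCKET_IFNAME=eth0
```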