[BugFix] Prevent orphaned process on NCCL destroy#39846
Conversation
Signed-off-by: Jeffrey Wang <jeffreywang@anyscale.com>
There was a problem hiding this comment.
Code Review
This pull request modifies the destroy method in pynccl.py to skip the blocking ncclCommDestroy call, aiming to prevent hangs during uncoordinated shutdowns. Feedback indicates that simply nullifying the communicator causes resource leaks in long-running processes and suggests using ncclCommAbort as a safer, non-blocking alternative. Additionally, a race condition was identified where clearing the communicator reference before updating the disabled state could lead to crashes in multi-threaded environments.
Signed-off-by: Jeffrey Wang <jeffreywang@anyscale.com>
|
Hi @jeffreywang-anyscale, the pre-commit checks have failed. Please run: uv pip install pre-commit>=4.5.1
pre-commit install
pre-commit run --all-filesThen, commit the changes and push to your branch. For future commits, Tip Is
|
|
Hi @jeffreywang-anyscale, the pre-commit checks have failed. Please run: uv pip install pre-commit>=4.5.1
pre-commit install
pre-commit run --all-filesThen, commit the changes and push to your branch. For future commits, Tip Is
|
## Description - **Upgrade vLLM** from 0.18.0 to 0.19.0 across requirements, setup.py, Dockerfile, and lock files. - Adapt to vLLM 0.19.0 **API changes**: pause wait_for_inflight_requests → tri-state mode, fix `vllm.inputs` import paths. - **Stabilize test infra**: add GPU process cleanup between batch tests (temporary cleanup and will be removed once vllm-project/vllm#39846 fixes the underlying vLLM issue), lower NIXL ports to avoid ephemeral range conflicts. ## Related issues N/A ## Additional information > Optional: Add implementation details, API changes, usage examples, screenshots, etc. --------- Signed-off-by: Jeffrey Wang <jeffreywang@anyscale.com> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
There was a problem hiding this comment.
We introduced the nccl destroy call because these stale nccl comm instances broke repeated scale up/down tests. When scaling down from 4 gpus to 2 for instance, the GPU memory of ranks 2,3 is not freed even after procs 2, 3 are killed since the stale nccl communicators on ranks 0, 1 hold cuda ipc references to it, eventually causing OOM on GPUs 2, 3. Merging this PR will most probably break Elastic EP, I think we need some way to properly clean these nccl comms.
Thanks @itayalroy, sounds like aborting the nccl collective group is a more promising approach. Summarize the problems here:
|
However, I will look into enhancing |
## Description - **Upgrade vLLM** from 0.18.0 to 0.19.0 across requirements, setup.py, Dockerfile, and lock files. - Adapt to vLLM 0.19.0 **API changes**: pause wait_for_inflight_requests → tri-state mode, fix `vllm.inputs` import paths. - **Stabilize test infra**: add GPU process cleanup between batch tests (temporary cleanup and will be removed once vllm-project/vllm#39846 fixes the underlying vLLM issue), lower NIXL ports to avoid ephemeral range conflicts. ## Related issues N/A ## Additional information > Optional: Add implementation details, API changes, usage examples, screenshots, etc. --------- Signed-off-by: Jeffrey Wang <jeffreywang@anyscale.com> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Signed-off-by: Jeffrey Wang <jeffreywang@anyscale.com>
Head branch was pushed to by a user without write access
|
@tlrmchlsmth @itayalroy This PR is ready for another pass. I switched to using |
|
The failed test ( |
|
@rtourgeman has tested repeated scale up/down scenarios with this approach and it works |
|
https://buildkite.com/vllm/ci/builds/64131#019de6dd-799e-47e8-ae0c-2a825fdc3436 failure results from the flakiness that #41421 intends to resolved. |
|
@tlrmchlsmth @SageMoore this PR should be ready to go. @itayalroy has verified that this won't break Nvidia's internal elastic EP scaling tests. |
Signed-off-by: Jeffrey Wang <jeffreywang@anyscale.com> Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com>
## Description - **Upgrade vLLM** from 0.18.0 to 0.19.0 across requirements, setup.py, Dockerfile, and lock files. - Adapt to vLLM 0.19.0 **API changes**: pause wait_for_inflight_requests → tri-state mode, fix `vllm.inputs` import paths. - **Stabilize test infra**: add GPU process cleanup between batch tests (temporary cleanup and will be removed once vllm-project/vllm#39846 fixes the underlying vLLM issue), lower NIXL ports to avoid ephemeral range conflicts. ## Related issues N/A ## Additional information > Optional: Add implementation details, API changes, usage examples, screenshots, etc. --------- Signed-off-by: Jeffrey Wang <jeffreywang@anyscale.com> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Signed-off-by: Jeffrey Wang <jeffreywang@anyscale.com> Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com>
Signed-off-by: Jeffrey Wang <jeffreywang@anyscale.com> Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com>
Signed-off-by: Jeffrey Wang <jeffreywang@anyscale.com> Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com>
Signed-off-by: Jeffrey Wang <jeffreywang@anyscale.com> Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com> Signed-off-by: Matt Van Horn <455140+mvanhorn@users.noreply.github.com>
Signed-off-by: Jeffrey Wang <jeffreywang@anyscale.com> Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com>
Purpose
PyNcclCommunicator.destroy()currently callsncclCommDestroy(introduced by #37131), which is a collective and deadlocks during uncoordinated shutdown — when one rank's worker has already exited, surviving ranks block forever indestroy(), leaving orphaned GPU processes that a subsequent test/run can't reclaim. This PR switches toncclCommAbortand runs it in a daemon thread with a join timeout.Why we need to run
ncclCommAbortin a background threadncclCommAbortsets the device abort flag so in-flight kernels can exit without peer rendezvous — that's why it solves the original collective-hang. But its async job still funnels throughcommDestroySync, which spins on:localPersistentRefsis incremented for every CUDA graph that captured a NCCL op on this comm and is only decremented when the captured graph object is destroyed (CUDA fires thecuUserObjectdestructor). The abort flag does not drop that ref count. WithCUDAGraphMode.FULL_AND_PIECEWISE, vLLM holds dozens of such graphs at shutdown. Those graphs are released later in the same main-thread teardown chain (model runner cleanup, gc,__del__, interpreter shutdown), so a direct main-threadncclCommAbortself-deadlocks: the thread that needs to free the graphs is blocked waiting for them to be freed.Prior art
PyTorch's c10d uses the same daemon thread pattern in
ProcessGroupNCCL::abort()(code):Related thread: https://github.com/vllm-project/vllm/pull/37131/changes#r2967953725.
Test Plan
This bug presents when instantiating vLLM with the mp backend and TP > 1 inside a Ray actor and relying on Ray for process lifecycle management. After the fix, the failure no longer reproduces.
Test Result
The second test in https://github.com/ray-project/ray/blob/c91063a871f60550a0c226685a6979b8057a9105/release/llm_tests/batch/test_batch_vllm.py#L511-L520 would've failed without the fix.
Essential Elements of an Effective PR Description Checklist
supported_models.mdandexamplesfor a new model.