[ROCm][CI] Remove TORCH_NCCL_BLOCKING_WAIT=1 After Bugfix In ROCm 7.2 #41840

Merged
robertgshaw2-redhat merged 3 commits into vllm-project:main from ROCm:micah/check-nccl-env-7.2
May 6, 2026

@micah-wil micah-wil commented May 6, 2026

Contributor

As of ROCm 7.2, the TORCH_NCCL_BLOCKING_WAIT=1 workaround is no longer needed for distributed test groups on ROCm. The fix in ROCm/rocm-systems#2177 was merged to address the issue tracked in ROCm/hip#3876. This CI build shows the affected test groups (TGs) passing without TORCH_NCCL_BLOCKING_WAIT=1: https://buildkite.com/vllm/amd-ci/builds/8262/canvas?sid=019dfdb2-1e62-4b4b-885a-b7ddaecc82cb&tab=output (note that the failures in that build are also present on main).

micah-wil added 3 commits May 6, 2026 14:27
Signed-off-by: Micah Williamson <micah.williamson@amd.com>
…T=1"

This reverts commit c219c11.

Signed-off-by: Micah Williamson <micah.williamson@amd.com>
…ROCm 7.2

Signed-off-by: Micah Williamson <micah.williamson@amd.com>

@claude claude Bot left a comment

Claude Code Review

This pull request is from a fork — automated review is disabled. A repository maintainer can comment @claude review to run a one-time review.

@mergify mergify Bot added the ci/build, rocm (Related to AMD ROCm), and bug (Something isn't working) labels May 6, 2026
@github-project-automation github-project-automation Bot moved this to Todo in AMD May 6, 2026

@gemini-code-assist gemini-code-assist Bot left a comment

Contributor

Code Review

This pull request removes the TORCH_NCCL_BLOCKING_WAIT=1 environment variable export from multiple test steps within the .buildkite/test-amd.yaml configuration file. These changes affect various distributed, collective RPC, and model-specific test suites on AMD platforms. I have no feedback to provide as there were no review comments to evaluate.
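
To check locally that no test step still sets the variable, a small scan of the pipeline file is enough. The following is a minimal sketch and not part of this PR: the function name `find_blocking_wait` is hypothetical, and it assumes the variable is set with shell-style `NAME=value` syntax inside the step commands.

```python
import re
from pathlib import Path

# Matches shell-style settings such as `export TORCH_NCCL_BLOCKING_WAIT=1`.
# Assumption: the variable is set as NAME=value somewhere on a single line.
PATTERN = re.compile(r"\bTORCH_NCCL_BLOCKING_WAIT\s*=\s*\S+")


def find_blocking_wait(text: str) -> list[tuple[int, str]]:
    """Return (line_number, stripped_line) for each line that still sets the variable."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        if PATTERN.search(line):
            hits.append((lineno, line.strip()))
    return hits


if __name__ == "__main__":
    # Path taken from the review summary above; run from the repository root.
    pipeline = Path(".buildkite/test-amd.yaml")
    if pipeline.exists():
        for lineno, line in find_blocking_wait(pipeline.read_text()):
            print(f"{pipeline}:{lineno}: {line}")
```

Run against a checkout that includes this change, the script should print nothing if every export was removed. A plain text scan is used instead of a YAML parser so the sketch needs only the standard library.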

@robertgshaw2-redhat robertgshaw2-redhat enabled auto-merge (squash) May 6, 2026 16:42
@github-actions github-actions Bot added the ready label (ONLY add when PR is ready to merge/full CI is needed) May 6, 2026
@robertgshaw2-redhat robertgshaw2-redhat merged commit 7a576e2 into vllm-project:main May 6, 2026
18 checks passed
@github-project-automation github-project-automation Bot moved this from Todo to Done in AMD May 6, 2026
@micah-wil micah-wil deleted the micah/check-nccl-env-7.2 branch May 7, 2026 18:29
libinta pushed a commit to libinta/vllm that referenced this pull request May 8, 2026
….2 (vllm-project#41840)

Signed-off-by: Micah Williamson <micah.williamson@amd.com>
Signed-off-by: Libin Tang <libin.tang@intel.com>
weifang231 pushed a commit to weifang231/eb-vllm that referenced this pull request May 13, 2026
….2 (vllm-project#41840)

Signed-off-by: Micah Williamson <micah.williamson@amd.com>
mfylcek pushed a commit to mfylcek/vllm that referenced this pull request May 19, 2026
….2 (vllm-project#41840)

Signed-off-by: Micah Williamson <micah.williamson@amd.com>
jhu960213 pushed a commit to jhu960213/vllm that referenced this pull request May 20, 2026
….2 (vllm-project#41840)

Signed-off-by: Micah Williamson <micah.williamson@amd.com>
mvanhorn pushed a commit to mvanhorn/vllm that referenced this pull request Jun 4, 2026
….2 (vllm-project#41840)

Signed-off-by: Micah Williamson <micah.williamson@amd.com>
Signed-off-by: Matt Van Horn <455140+mvanhorn@users.noreply.github.com>

Labels

bug (Something isn't working), ci/build, ready (ONLY add when PR is ready to merge/full CI is needed), rocm (Related to AMD ROCm)

Projects

Status: Done

2 participants