Skip to content

[AMD] upgrade to ROCm 7.2.2#24245

Closed
spaparaju wants to merge 1 commit into
sgl-project:mainfrom
spaparaju:fix/rocm-720-base-image-bump-7.2.2
Closed

[AMD] upgrade to ROCm 7.2.2#24245
spaparaju wants to merge 1 commit into
sgl-project:mainfrom
spaparaju:fix/rocm-720-base-image-bump-7.2.2

Conversation

@spaparaju
Copy link
Copy Markdown

Motivation

ROCm 7.2.0 has a bug in hipEventQuery where cross-thread calls ignore THREAD_LOCAL capture mode. This causes the NCCL watchdog to invalidate active HIP graph captures, leading to multi-GPU crashes when using
AITER custom all-reduce operations.

ROCm 7.2.2 includes the fix for this race condition (see ROCm 7.2.2 release notes).

Fixes #24151

Related issues: #23580, #23581

Modifications

Bump the ROCm 7.2 base image from rocm7.2 to rocm7.2.2 in docker/rocm.Dockerfile:

  • BASE_IMAGE_942_ROCM720: rocm/pytorch:rocm7.2_ubuntu22.04_py3.10_pytorch_release_2.9.1rocm/pytorch:rocm7.2.2_ubuntu22.04_py3.10_pytorch_release_2.9.1
  • BASE_IMAGE_950_ROCM720: rocm/pytorch:rocm7.2_ubuntu22.04_py3.10_pytorch_release_2.9.1rocm/pytorch:rocm7.2.2_ubuntu22.04_py3.10_pytorch_release_2.9.1

Accuracy Tests

N/A - This is a Docker base image version bump only. No changes to model code or kernels.

Speed Tests and Profiling

N/A - No changes to inference code. This fixes a race condition crash, not performance.

Checklist

Review and Merge Process

  1. Ping Merge Oncalls to start the process. See the PR Merge Process.
  2. Get approvals from CODEOWNERS and other reviewers.
  3. Trigger CI tests with comments or contact authorized users to do so.
    • Common commands include /tag-and-rerun-ci, /tag-run-ci-label, /rerun-failed-ci
  4. After green CI and required approvals, ask Merge Oncalls or people with Write permission to merge the PR.

@github-actions github-actions Bot added the amd label May 1, 2026
@spaparaju spaparaju changed the title [AMD] fixes #24151, upgrade to ROCm 7.2.2 [AMD] upgrade to ROCm 7.2.2 May 1, 2026
Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request updates the ROCm base image version from 7.2 to 7.2.2 in the Dockerfile for both gfx942 and gfx950 architectures. Feedback indicates that the TORCH_ROCM_FILE variable may also need an update to prevent force-installing a PyTorch wheel built against the older ROCm 7.2.0, which could negate the benefits of the upgrade. Additionally, there are concerns regarding naming inconsistencies and pattern matching in build stages that still use the '720' suffix, potentially causing critical configuration steps to be skipped if the build arguments are changed to match the new version.

Comment thread docker/rocm.Dockerfile
# Default base images
ARG BASE_IMAGE_942="rocm/sgl-dev:rocm7-vllm-20250904"
ARG BASE_IMAGE_942_ROCM720="rocm/pytorch:rocm7.2_ubuntu22.04_py3.10_pytorch_release_2.9.1"
ARG BASE_IMAGE_942_ROCM720="rocm/pytorch:rocm7.2.2_ubuntu22.04_py3.10_pytorch_release_2.9.1"
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

The upgrade to ROCm 7.2.2 in the base image may be ineffective if the TORCH_ROCM_FILE (defined at line 462 as torch-2.9.1+rocm7.2.0...) is not also updated. This wheel is force-installed at line 517, potentially overwriting the PyTorch installation in the new base image with one built against ROCm 7.2.0. If this wheel bundles its own ROCm libraries or was compiled against the buggy hipEventQuery implementation, the fix for the race condition described in the PR may not be realized in the final image. Additionally, if this specific wheel file is not present in the new base image's filesystem (as implied by the path in hack.py), the build will fail.

Comment thread docker/rocm.Dockerfile
ARG BASE_IMAGE_942_ROCM720="rocm/pytorch:rocm7.2.2_ubuntu22.04_py3.10_pytorch_release_2.9.1"
ARG BASE_IMAGE_950="rocm/sgl-dev:rocm7-vllm-20250904"
ARG BASE_IMAGE_950_ROCM720="rocm/pytorch:rocm7.2_ubuntu22.04_py3.10_pytorch_release_2.9.1"
ARG BASE_IMAGE_950_ROCM720="rocm/pytorch:rocm7.2.2_ubuntu22.04_py3.10_pytorch_release_2.9.1"
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The variable names and build stages (e.g., BASE_IMAGE_950_ROCM720 and gfx950-rocm720) still use the 720 suffix, which is now inconsistent with the 7.2.2 version. More importantly, the pattern matching logic in several RUN blocks (lines 121, 149, 514) specifically checks for *rocm720*. If a user attempts to build with GPU_ARCH=gfx950-rocm722 to match the new version, these critical configuration and patching steps will be skipped. Consider updating the naming and pattern matching to be more generic (e.g., rocm72*) to avoid maintenance issues with future patch releases.

@spaparaju spaparaju closed this May 3, 2026
@spaparaju spaparaju deleted the fix/rocm-720-base-image-bump-7.2.2 branch May 3, 2026 18:17
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant