[AMD] upgrade to ROCm 7.2.2 #24245
Code Review
This pull request updates the ROCm base image version from 7.2 to 7.2.2 in the Dockerfile for both gfx942 and gfx950 architectures. Feedback indicates that the TORCH_ROCM_FILE variable may also need an update to prevent force-installing a PyTorch wheel built against the older ROCm 7.2.0, which could negate the benefits of the upgrade. Additionally, there are concerns regarding naming inconsistencies and pattern matching in build stages that still use the '720' suffix, potentially causing critical configuration steps to be skipped if the build arguments are changed to match the new version.
  # Default base images
  ARG BASE_IMAGE_942="rocm/sgl-dev:rocm7-vllm-20250904"
- ARG BASE_IMAGE_942_ROCM720="rocm/pytorch:rocm7.2_ubuntu22.04_py3.10_pytorch_release_2.9.1"
+ ARG BASE_IMAGE_942_ROCM720="rocm/pytorch:rocm7.2.2_ubuntu22.04_py3.10_pytorch_release_2.9.1"
The upgrade to ROCm 7.2.2 in the base image may be ineffective if the TORCH_ROCM_FILE (defined at line 462 as torch-2.9.1+rocm7.2.0...) is not also updated. This wheel is force-installed at line 517, potentially overwriting the PyTorch installation in the new base image with one built against ROCm 7.2.0. If this wheel bundles its own ROCm libraries or was compiled against the buggy hipEventQuery implementation, the fix for the race condition described in the PR may not be realized in the final image. Additionally, if this specific wheel file is not present in the new base image's filesystem (as implied by the path in hack.py), the build will fail.
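The mismatch described in this comment could be caught with a small pre-build check. The following is a minimal sketch, not part of the PR: the helper names `rocm_version_of` and `check_rocm_match` are hypothetical, and the wheel filename in the usage comment is an illustrative stand-in, since the full value of TORCH_ROCM_FILE is not shown here.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the ROCm version embedded in a string such as
# an image tag ("...:rocm7.2.2_ubuntu22.04...") or a wheel name ("torch-2.9.1+rocm7.2.0...").
rocm_version_of() {
  printf '%s\n' "$1" | grep -oE 'rocm[0-9]+(\.[0-9]+)*' | head -n 1 | sed 's/^rocm//'
}

# Hypothetical helper: succeed only when the base image and the wheel
# name the same ROCm version. This is a strict string comparison, so
# "7.2" and "7.2.0" are treated as different.
check_rocm_match() {
  local image_ver wheel_ver
  image_ver="$(rocm_version_of "$1")"
  wheel_ver="$(rocm_version_of "$2")"
  if [ "$image_ver" = "$wheel_ver" ]; then
    echo "ok: image and wheel both target ROCm $image_ver"
  else
    echo "mismatch: image has ROCm $image_ver, wheel targets ROCm $wheel_ver"
    return 1
  fi
}

# Usage (the wheel name is an assumed example, not the real TORCH_ROCM_FILE):
#   check_rocm_match \
#     "rocm/pytorch:rocm7.2.2_ubuntu22.04_py3.10_pytorch_release_2.9.1" \
#     "torch-2.9.1+rocm7.2.0-example.whl"
```

A check like this only compares version strings; it does not tell whether the wheel actually bundles its own ROCm libraries, which would still need to be verified inside the built image.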
  ARG BASE_IMAGE_942_ROCM720="rocm/pytorch:rocm7.2.2_ubuntu22.04_py3.10_pytorch_release_2.9.1"
  ARG BASE_IMAGE_950="rocm/sgl-dev:rocm7-vllm-20250904"
- ARG BASE_IMAGE_950_ROCM720="rocm/pytorch:rocm7.2_ubuntu22.04_py3.10_pytorch_release_2.9.1"
+ ARG BASE_IMAGE_950_ROCM720="rocm/pytorch:rocm7.2.2_ubuntu22.04_py3.10_pytorch_release_2.9.1"
The variable names and build stages (e.g., BASE_IMAGE_950_ROCM720 and gfx950-rocm720) still use the 720 suffix, which is now inconsistent with the 7.2.2 version. More importantly, the pattern matching logic in several RUN blocks (lines 121, 149, 514) specifically checks for *rocm720*. If a user attempts to build with GPU_ARCH=gfx950-rocm722 to match the new version, these critical configuration and patching steps will be skipped. Consider updating the naming and pattern matching to be more generic (e.g., rocm72*) to avoid maintenance issues with future patch releases.
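The difference between the strict and the generic pattern can be illustrated with a short shell sketch. It assumes the RUN blocks test GPU_ARCH with shell glob matching; the function names `matches_rocm720` and `matches_rocm72x` are hypothetical and do not appear in the Dockerfile.

```shell
#!/usr/bin/env bash
# Strict check, as described in the review: only values containing "rocm720" match.
matches_rocm720() {
  case "$1" in
    *rocm720*) return 0 ;;
    *)         return 1 ;;
  esac
}

# Generic check along the lines the review suggests: any "rocm72" prefix
# matches, so rocm720, rocm722, and later 7.2.x patch suffixes are covered.
matches_rocm72x() {
  case "$1" in
    *rocm72*) return 0 ;;
    *)        return 1 ;;
  esac
}

# Compare both checks on a few GPU_ARCH values.
for arch in gfx950-rocm720 gfx950-rocm722 gfx950; do
  if matches_rocm720 "$arch"; then strict=yes; else strict=no; fi
  if matches_rocm72x "$arch"; then generic=yes; else generic=no; fi
  echo "$arch strict=$strict generic=$generic"
done
# Prints:
#   gfx950-rocm720 strict=yes generic=yes
#   gfx950-rocm722 strict=no generic=yes
#   gfx950 strict=no generic=no
```

With the strict pattern, a GPU_ARCH of gfx950-rocm722 falls through to the default branch, which is the silent skip the comment warns about.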
Motivation
ROCm 7.2.0 has a bug in hipEventQuery where cross-thread calls ignore THREAD_LOCAL capture mode. This causes the NCCL watchdog to invalidate active HIP graph captures, leading to multi-GPU crashes when using AITER custom all-reduce operations.
ROCm 7.2.2 includes the fix for this race condition (see ROCm 7.2.2 release notes).
Fixes #24151
Related issues: #23580, #23581
Modifications
Bump the ROCm 7.2 base image from rocm7.2 to rocm7.2.2 in docker/rocm.Dockerfile:
- BASE_IMAGE_942_ROCM720: rocm/pytorch:rocm7.2_ubuntu22.04_py3.10_pytorch_release_2.9.1 → rocm/pytorch:rocm7.2.2_ubuntu22.04_py3.10_pytorch_release_2.9.1
- BASE_IMAGE_950_ROCM720: rocm/pytorch:rocm7.2_ubuntu22.04_py3.10_pytorch_release_2.9.1 → rocm/pytorch:rocm7.2.2_ubuntu22.04_py3.10_pytorch_release_2.9.1
Accuracy Tests
N/A - This is a Docker base image version bump only. No changes to model code or kernels.
Speed Tests and Profiling
N/A - No changes to inference code. This fixes a race-condition crash rather than a performance issue.
Review and Merge Process
/tag-and-rerun-ci, /tag-run-ci-label, /rerun-failed-ci