[Docker][ROCm] Bump ROCm720 base image from 7.2.0 to 7.2.2 to fix hipEventQuery race #24151
Open
andyluo7 wants to merge 1 commit into
The current rocm/pytorch:rocm7.2_ubuntu22.04_py3.10_pytorch_release_2.9.1 tag points at ROCm 7.2.0, which has a known hipEventQuery runtime bug where cross-thread calls ignore THREAD_LOCAL capture mode. This causes the NCCL watchdog to invalidate in-flight HIP graph captures, surfacing as crashes in AITER's IPCBufferPool path (used by the AITER custom all-reduce in v0.1.12.post1). The same race exists with SGLang's own CustomAllreduce, just with a smaller window.

ROCm 7.2.2 ships the upstream fix for this hipEventQuery behavior, so bumping the base image makes AITER custom all-reduce safe to use again with no C++ rebuild and no Python change.

The defensive default SGLANG_USE_AITER_AR=false from sgl-project#23581 is still useful for users stuck on ROCm 7.2.0, but on the bumped base image AITER AR can be re-enabled.

Refs:
- ROCm/aiter#2941, ROCm/aiter#2857
- pytorch/pytorch#176251 (the upstream THREAD_LOCAL fix)
- sgl-project#23580, sgl-project#23581

Signed-off-by: Andy Luo <andyluo7@users.noreply.github.com>
Motivation
The current ROCm720 base image (rocm/pytorch:rocm7.2_ubuntu22.04_py3.10_pytorch_release_2.9.1) points at ROCm 7.2.0, which has a known hipEventQuery runtime bug: cross-thread calls ignore THREAD_LOCAL capture mode, so the NCCL watchdog can invalidate in-flight HIP graph captures.

This surfaces as multi-GPU AITER custom all-reduce crashes (#23580, #23581). AITER's IPCBufferPool change in v0.1.12.post1 only widened the race window; the bug also exists with SGLang's own CustomAllreduce, just with a narrower window.

The fix landed upstream in ROCm 7.2.2 (pytorch/pytorch#176251), confirmed by the AMD AITER team in ROCm/aiter#2941 and ROCm/aiter#2857.
Modifications
Bump both BASE_IMAGE_*_ROCM720 ARGs in docker/rocm.Dockerfile to the 7.2.2 build.

Same Ubuntu / Python / PyTorch versions; only the ROCm patch level changes. No C++ rebuild is required and no Python change is needed.
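Since only the ROCm patch level changes, one way to confirm that a rebuilt image actually picked up the fix is to compare its reported ROCm version against 7.2.2. The following is a minimal sketch, not part of this PR: the helper names are hypothetical, and it assumes the version is available as a "major.minor.patch" string (for example from a file such as /opt/rocm/.info/version, whose presence and format are assumptions about the image layout). It also assumes every release below 7.2.2 lacks the fix; this PR only states that 7.2.0 is affected and 7.2.2 is fixed.

```python
import re

# First ROCm release that this PR describes as carrying the hipEventQuery fix.
FIXED_ROCM = (7, 2, 2)


def parse_rocm_version(text: str) -> tuple[int, int, int]:
    """Parse 'major.minor[.patch]' from the start of a version string.

    Trailing build suffixes such as '-43' are ignored, and a missing
    patch component is treated as 0. (Hypothetical helper.)
    """
    match = re.match(r"\s*(\d+)\.(\d+)(?:\.(\d+))?", text)
    if match is None:
        raise ValueError(f"unrecognized ROCm version string: {text!r}")
    major, minor, patch = match.groups()
    return int(major), int(minor), int(patch or 0)


def has_hip_event_query_fix(version_text: str) -> bool:
    """Return True if the version is at or above 7.2.2.

    Assumption: anything below 7.2.2 is treated as unfixed.
    """
    return parse_rocm_version(version_text) >= FIXED_ROCM
```

A check like this could run as a smoke test after the image build, so that a stale base tag is caught before AITER custom all-reduce is turned back on.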
Result
With this change, AITER custom all-reduce is safe to use again on the rebuilt SGLang ROCm720 images. The defensive default SGLANG_USE_AITER_AR=false from #23581 remains useful for users stuck on ROCm 7.2.0 environments, but the bug class is gone on this base image.

Refs
- ROCm/aiter#2941, ROCm/aiter#2857
- pytorch/pytorch#176251 (the upstream THREAD_LOCAL fix)
- #23580, #23581
cc @sunway513, @TennyWang1223, @HaiShaw