[Docker][ROCm] Bump ROCm720 base image from 7.2.0 to 7.2.2 to fix hipEventQuery race #24151

Open: andyluo7 wants to merge 1 commit into sgl-project:main from andyluo7:base-image-bump-rocm722

@andyluo7 (Contributor)

Motivation

The current ROCm720 base image

rocm/pytorch:rocm7.2_ubuntu22.04_py3.10_pytorch_release_2.9.1

points at ROCm 7.2.0, which has a known hipEventQuery runtime bug: cross-thread calls ignore THREAD_LOCAL capture mode, so the NCCL watchdog can invalidate in-flight HIP graph captures.

This surfaces as multi-GPU AITER custom all-reduce crashes (#23580, #23581). AITER's IPCBufferPool change in v0.1.12.post1 only widened the race window; the same bug exists with SGLang's own CustomAllreduce, just with a narrower window.

The fix landed upstream in ROCm 7.2.2 (pytorch/pytorch#176251), confirmed by the AMD AITER team in ROCm/aiter#2941 and ROCm/aiter#2857.

Modifications

Bump both BASE_IMAGE_*_ROCM720 ARGs in docker/rocm.Dockerfile to the 7.2.2 build:

- rocm/pytorch:rocm7.2_ubuntu22.04_py3.10_pytorch_release_2.9.1
+ rocm/pytorch:rocm7.2.2_ubuntu22.04_py3.10_pytorch_release_2.9.1

The Ubuntu, Python, and PyTorch versions are unchanged; only the ROCm patch level changes. No C++ rebuild is required, and no Python code changes.
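As a quick sanity check on which ROCm patch level a base image tag names, here is a minimal Python sketch. It is not part of this PR. The helper names `rocm_version_from_tag` and `has_hip_event_query_fix` are hypothetical, and it assumes (per the description above) that a tag without a patch component, such as `rocm7.2`, means 7.2.0 and that 7.2.2 is the first fixed release.

```python
import re

# Assumption from this PR: ROCm 7.2.2 is the first release carrying the hipEventQuery fix.
FIXED_ROCM = (7, 2, 2)


def rocm_version_from_tag(image):
    """Parse 'rocm/pytorch:rocm7.2.2_ubuntu...' into a (major, minor, patch) tuple."""
    tag = image.split(":", 1)[1]  # keep only the part after the repository name
    match = re.match(r"rocm(\d+(?:\.\d+)*)_", tag)
    if match is None:
        raise ValueError(f"no ROCm version found in tag: {image!r}")
    parts = tuple(int(p) for p in match.group(1).split("."))
    # Assumption: a missing patch component (e.g. 'rocm7.2') means patch level 0.
    return (parts + (0, 0, 0))[:3]


def has_hip_event_query_fix(image):
    """True if the tag names a ROCm version at or above the assumed fixed release."""
    return rocm_version_from_tag(image) >= FIXED_ROCM


old = "rocm/pytorch:rocm7.2_ubuntu22.04_py3.10_pytorch_release_2.9.1"
new = "rocm/pytorch:rocm7.2.2_ubuntu22.04_py3.10_pytorch_release_2.9.1"
print(rocm_version_from_tag(old), has_hip_event_query_fix(old))
print(rocm_version_from_tag(new), has_hip_event_query_fix(new))
```

The check only inspects the tag string; it does not verify what ROCm runtime is actually installed inside the image.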

Result

With this change, AITER custom all-reduce is safe to use again on the rebuilt SGLang ROCm720 images. The defensive default SGLANG_USE_AITER_AR=false from #23581 remains useful for users stuck on ROCm 7.2.0 environments, but the bug class is gone on this base image.
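For deployments that mix old and new images, the choice described above could be sketched as follows. This is an illustration under stated assumptions, not SGLang's actual logic: the helper names are hypothetical, `"true"` as the enabling value is an assumption (this PR only shows `SGLANG_USE_AITER_AR=false`), and any ROCm version below 7.2.2 is conservatively treated as affected.

```python
import os

# Assumption from this PR: ROCm 7.2.2 is the first release carrying the hipEventQuery fix.
FIXED_ROCM = (7, 2, 2)


def aiter_ar_setting(rocm_version):
    """Pick a SGLANG_USE_AITER_AR value for a (major, minor, patch) tuple.

    Assumption: "true"/"false" are the accepted spellings; only "false"
    appears in the PR text.
    """
    return "true" if tuple(rocm_version) >= FIXED_ROCM else "false"


def apply_aiter_ar_default(rocm_version, env=None):
    """Set the variable only if the user has not already chosen a value."""
    env = os.environ if env is None else env
    env.setdefault("SGLANG_USE_AITER_AR", aiter_ar_setting(rocm_version))
    return env["SGLANG_USE_AITER_AR"]


print(apply_aiter_ar_default((7, 2, 0), env={}))
print(apply_aiter_ar_default((7, 2, 2), env={}))
```

Using `setdefault` keeps an explicit user override intact, so someone who has set `SGLANG_USE_AITER_AR=false` on a 7.2.2 image keeps that setting.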

Refs

cc @sunway513, @TennyWang1223, @HaiShaw

The current rocm/pytorch:rocm7.2_ubuntu22.04_py3.10_pytorch_release_2.9.1
tag points at ROCm 7.2.0, which has a known hipEventQuery runtime bug
where cross-thread calls ignore THREAD_LOCAL capture mode. This causes
the NCCL watchdog to invalidate in-flight HIP graph captures, surfacing
as crashes in AITER's IPCBufferPool path (used by the AITER custom
all-reduce in v0.1.12.post1). The same race exists with SGLang's own
CustomAllreduce, just with a smaller window.

ROCm 7.2.2 ships the upstream fix for this hipEventQuery behavior, so
bumping the base image makes AITER custom all-reduce safe to use again
with no C++ rebuild and no Python change. The defensive default
SGLANG_USE_AITER_AR=false from sgl-project#23581 is still useful for users stuck
on ROCm 7.2.0, but on the bumped base image AITER AR can be re-enabled.

Refs:
- ROCm/aiter#2941, ROCm/aiter#2857
- pytorch/pytorch#176251 (the upstream THREAD_LOCAL fix)
- sgl-project#23580, sgl-project#23581

Signed-off-by: Andy Luo <andyluo7@users.noreply.github.com>

@gemini-code-assist (Bot, Contributor) left a comment

Code Review

This pull request updates the ROCm base image versions for BASE_IMAGE_942_ROCM720 and BASE_IMAGE_950_ROCM720 from 7.2 to 7.2.2 in docker/rocm.Dockerfile. I have no feedback to provide.
