Use unreg path for custom all-reduce during CUDA graph capture by zyzshishui · Pull Request #2075 · ROCm/aiter

zyzshishui · 2026-02-23T06:51:33Z

Motivation

Same as sgl-project/sglang#19162.

Super tiny fix, needed for compatibility with torch_memory_saver. Error path:

1. torch_memory_saver hooks hipMalloc and replaces it with hipMemAddressReserve + hipMemMap (the VMM APIs).
2. During register_graph_buffers, hipIpcGetMemHandle expects a pointer allocated by hipMalloc but in fact receives a buffer from hipMemAddressReserve + hipMemMap.
3. hipIpcGetMemHandle does not validate the pointer and returns an invalid handle (invalid because the Runtime API and the VMM API use different allocation tables), for which I raised a fix here.
4. Other ranks call hipIpcOpenMemHandle(invalid_handle) and fail, causing a hang during CUDA graph capture.
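The error path above can be avoided by never registering graph buffers over IPC while capturing. A minimal sketch of that decision, in plain Python with no GPU calls. The names `Path` and `select_all_reduce_path` are hypothetical and not taken from aiter, and the choice of the unregistered path outside capture is an assumption about how the communicator behaves:

```python
from enum import Enum


class Path(Enum):
    """Hypothetical labels for the two all-reduce code paths."""
    REGISTERED = "registered"      # input buffer is shared with peers via IPC handles
    UNREGISTERED = "unregistered"  # input is staged through an already shared buffer


def select_all_reduce_path(is_capturing: bool,
                           enable_register_for_capturing: bool) -> Path:
    """Pick the all-reduce path for one call.

    Assumption: outside graph capture the unregistered path is used.
    During capture, the registered path is taken only when the caller
    allows graph buffers to be registered.
    """
    if not is_capturing:
        return Path.UNREGISTERED
    if enable_register_for_capturing:
        return Path.REGISTERED
    # Capturing with registration disabled: no hipIpcGetMemHandle call
    # is needed for the graph's input buffers.
    return Path.UNREGISTERED


if __name__ == "__main__":
    for capturing in (False, True):
        for allow in (False, True):
            print(capturing, allow, select_all_reduce_path(capturing, allow).value)
```

With registration disabled, capture never asks for an IPC handle on a VMM-backed pointer, so step 3 of the error path is not reached.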

Technical Details

Test Plan

Test Result

Submission Checklist

Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

valarLip · 2026-02-23T07:12:09Z

aiter/dist/device_communicators/custom_all_reduce.py

        #     return

        self.disabled = False
+        self.tms_cudagraph = os.getenv("SGLANG_MEMORY_SAVER_CUDA_GRAPH", "0")


prefix with SGLANG_MEMORY_SAVER_CUDA_GRAPH?

Good catch. How about adding a tms_cudagraph parameter to __init__ and passing it in from sglang? That would require sglang to use an updated aiter, though. Any suggestions?

Good idea, let's add an "enable_register" param to init.

Added "enable_register_for_capturing", since this only controls behavior during capture, not the actual all-reduce call. But again, we cannot make the follow-up change until sglang bumps its aiter version. I will keep an eye on that.
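A rough sketch of how the constructor parameter discussed in this thread could be wired up on the caller side. Only the parameter name `enable_register_for_capturing` and the environment variable name `SGLANG_MEMORY_SAVER_CUDA_GRAPH` come from the thread. The class `CustomAllReduceSketch`, the helper `env_flag`, the default value `True`, and the idea that the caller derives the argument from the environment variable are all assumptions. Note that `os.getenv(name, "0")` returns a string, and the string `"0"` is truthy in Python, so the value needs explicit parsing before it is used as a flag:

```python
import os


def env_flag(name: str, default: bool = False) -> bool:
    """Parse an environment variable as a boolean (hypothetical helper)."""
    raw = os.getenv(name)
    if raw is None:
        return default
    return raw.strip().lower() in ("1", "true", "yes", "on")


class CustomAllReduceSketch:
    """Stand-in for the communicator; not the real aiter class."""

    def __init__(self, enable_register_for_capturing: bool = True) -> None:
        # Assumed default: keep the existing behavior (register during capture).
        self.enable_register_for_capturing = enable_register_for_capturing


def build_for_caller() -> CustomAllReduceSketch:
    """How a caller such as sglang might pass the flag (assumption)."""
    memory_saver_graphs = env_flag("SGLANG_MEMORY_SAVER_CUDA_GRAPH")
    # Registration is disabled only when the memory saver manages graph memory.
    return CustomAllReduceSketch(
        enable_register_for_capturing=not memory_saver_graphs
    )


if __name__ == "__main__":
    os.environ["SGLANG_MEMORY_SAVER_CUDA_GRAPH"] = "1"
    print(build_for_caller().enable_register_for_capturing)
```

Passing the flag through `__init__` keeps the library free of sglang-specific environment variables, which matches the concern raised in the review.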


1

c19ff57

zyzshishui requested a review from a team February 23, 2026 06:51

zyzshishui mentioned this pull request Feb 23, 2026

[Bug] Custom all-reduce IPC buffers use fixed VA, conflict with torch_memory_saver pause/resume → zombie processes on MI300X #2061

Open

valarLip reviewed Feb 23, 2026


yushengsu-thu approved these changes Feb 23, 2026


zyzshishui and others added 3 commits February 23, 2026 18:53

1

8146fca

Merge branch 'main' into ar

4650e55

1

a1d148f

valarLip approved these changes Mar 3, 2026


valarLip merged commit d4f5e52 into ROCm:main Mar 3, 2026
18 checks passed

zyzshishui deleted the ar branch March 8, 2026 07:54

zyzshishui mentioned this pull request Mar 9, 2026

[ROCm] Use unreg path for aiter custom all-reduce during CUDA graph capture sgl-project/sglang#20155

Merged

5 tasks

apicciau pushed a commit that referenced this pull request Mar 12, 2026

Use unreg path for custom all-reduce during CUDA graph capture (#2075)

7e80878

* 1 * 1 * 1

valarLip pushed a commit that referenced this pull request Mar 18, 2026

Use unreg path for custom all-reduce during CUDA graph capture (#2075)

82637ad

* 1 * 1 * 1

AMD-yanfeiwang pushed a commit to AMD-yanfeiwang/aiter that referenced this pull request Mar 18, 2026

Use unreg path for custom all-reduce during CUDA graph capture (ROCm#…

214624a

…2075) * 1 * 1 * 1

valarLip merged 4 commits into ROCm:main from zyzshishui:ar
