
[CP] Register KV cache allgather buffer with symmetric memory #24040

Merged
ShangmingCai merged 3 commits into sgl-project:main from wangfakang:opt-cp
May 6, 2026

Conversation

@wangfakang
Contributor

wangfakang commented Apr 29, 2026

cc @ShangmingCai @Fridge003 PTAL, thx.

Motivation

[CP] Fix missing symmetric memory registration in cp_all_gather_reorganized_into_tensor_kv_cache (#22914 follow-up)

When PR #22914 refactored and consolidated the NSA utils.py into cp_utils.py, it missed wrapping the KV cache allgather buffer creation in cp_all_gather_reorganized_into_tensor_kv_cache with use_symmetric_memory. This change adds the missing wrapping so the buffer is properly registered with symmetric memory, improving communication efficiency when symmetric memory is available.

Original title: Register cp-atten-allgather buffers with symm memory
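For reference, a minimal sketch of what the fix amounts to. This is not the exact PR diff: the function name and parameters below are illustrative, while use_symmetric_memory, get_attention_cp_group, and is_allocation_symmetric are the helpers used in cp_utils.py (see the review excerpt further down).

    import torch

    def allocate_allgather_kv_output_sketch(
        local_kv: torch.Tensor,
        cp_size: int,
        use_symmetric_memory,        # context manager from sglang's symmetric-memory utils
        attention_cp_group,          # result of get_attention_cp_group()
        allocation_symmetric: bool,  # result of is_allocation_symmetric()
    ) -> torch.Tensor:
        # The key change: the allgather output buffer is allocated *inside* the
        # use_symmetric_memory context, so it is registered with the symmetric
        # memory pool whenever that path is enabled.
        with use_symmetric_memory(
            attention_cp_group, disabled=not allocation_symmetric
        ):
            # Create output tensor with proper shape for all dimensions.
            output = torch.empty(
                (local_kv.shape[0] * cp_size, *local_kv.shape[1:]),
                dtype=local_kv.dtype,
                device=local_kv.device,
            )
        # ... all_gather into `output` and reorganize the KV cache as before ...
        return output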

Checklist

Review and Merge Process

  1. Ping Merge Oncalls to start the process. See the PR Merge Process.
  2. Get approvals from CODEOWNERS and other reviewers.
  3. Trigger CI tests with comments or contact authorized users to do so.
    • Common commands include /tag-and-rerun-ci, /tag-run-ci-label, /rerun-failed-ci
  4. After green CI and required approvals, ask Merge Oncalls or people with Write permission to merge the PR.

Signed-off-by: wangfakang <fakangwang@gmail.com>
Contributor

gemini-code-assist (Bot) left a comment


Code Review

This pull request updates cp_utils.py to ensure that tensor allocations are performed within the use_symmetric_memory context manager. A review comment suggests moving a descriptive comment inside the context manager to improve logical grouping and code clarity.

I am having trouble creating individual review comments, so my feedback is included below.

python/sglang/srt/layers/utils/cp_utils.py, line 133 — severity: medium

The comment 'Create output tensor with proper shape for all dimensions' is placed outside the context manager, but it describes the logic inside the context manager. It should be moved inside to maintain logical grouping.

    with use_symmetric_memory(
        get_attention_cp_group(), disabled=not is_allocation_symmetric()
    ):
        # Create output tensor with proper shape for all dimensions

Collaborator

ShangmingCai left a comment


Looks reasonable.

CC: @Shunkangz could you take a look?

@Shunkangz
Contributor

LGTM. One more small question: does this mean we might need to allocate memory through ncclMemAlloc at runtime if the required size is larger than the pool size, since the number of tokens can vary at runtime? If we want to avoid that, we should run a maximum-token request during warmup. Is that a correct understanding?

@wangfakang
Contributor Author

> LGTM. One more small question: does this mean we might need to allocate memory through ncclMemAlloc at runtime if the required size is larger than the pool size, since the number of tokens can vary at runtime? If we want to avoid that, we should run a maximum-token request during warmup. Is that a correct understanding?

Yes, that's correct. However, when SGLang starts up it currently performs a check: if symm is enabled, it pre-allocates 4GB of memory by default for warmup. This avoids frequent subsequent allocations due to insufficient memory, which can lead to memory fragmentation. Additionally, once the PR for restructuring the symm pool is merged, the memory pool will be shared across the various communications.
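To illustrate the behavior described above, here is a small hypothetical sketch (not sglang's actual allocator API): a fixed pool is pre-registered at startup, and only a request that exceeds the remaining pool space falls back to a runtime ncclMemAlloc-style allocation, which is what warming up with a maximum-token request is meant to avoid.

    SYMM_POOL_BYTES = 4 * 1024**3  # default 4GB warmup pool mentioned above

    class SymmetricPoolSketch:
        """Hypothetical model of the pool behavior; names are not sglang's API."""

        def __init__(self, pool_bytes: int = SYMM_POOL_BYTES):
            self.pool_bytes = pool_bytes
            self.used = 0

        def allocate(self, nbytes: int) -> str:
            if self.used + nbytes <= self.pool_bytes:
                self.used += nbytes
                return "pool"  # served from the pre-registered symmetric pool
            # Oversized request: would need a runtime ncclMemAlloc plus registration,
            # which can fragment memory if it happens repeatedly.
            return "runtime_alloc"

    # A max-token warmup request sizes the pool usage up front, so later
    # variable-length batches stay on the "pool" path.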

@Shunkangz
Contributor

> LGTM. One more small question: does this mean we might need to allocate memory through ncclMemAlloc at runtime if the required size is larger than the pool size, since the number of tokens can vary at runtime? If we want to avoid that, we should run a maximum-token request during warmup. Is that a correct understanding?

> Yes, that's correct. However, when SGLang starts up it currently performs a check: if symm is enabled, it pre-allocates 4GB of memory by default for warmup. This avoids frequent subsequent allocations due to insufficient memory, which can lead to memory fragmentation. Additionally, once the PR for restructuring the symm pool is merged, the memory pool will be shared across the various communications.

Thank you for the detailed explanation.

@ShangmingCai
Collaborator

/tag-and-rerun-ci

@wangfakang
Contributor Author

/rerun-failed-ci

Triggering the waiting test task.

@wangfakang
Contributor Author

wangfakang commented May 6, 2026

Friendly ping @ShangmingCai @Shunkangz
I checked the logs and found that the failing test case is stage-b-test-1-gpu-large (8). The error message is ModuleNotFoundError: No module named 'sglang.srt.layers.moe.fused_moe_triton.fused_moe', which is unrelated to the changes in this PR.

[screenshot: CI log showing the failing test]

@ShangmingCai
Collaborator

/rerun-failed-ci

@ShangmingCai
Collaborator

/rerun-stage stage-c-test-deepep-8-gpu-h200

@ShangmingCai
Collaborator

/rerun-stage stage-c-test-4-gpu-h100

@github-actions
Contributor

github-actions Bot commented May 6, 2026

✅ Triggered stage-c-test-deepep-8-gpu-h200 to run independently (skipping dependencies). View workflow run

@github-actions
Contributor

github-actions Bot commented May 6, 2026

✅ Triggered stage-c-test-4-gpu-h100 to run independently (skipping dependencies). View workflow run

@ShangmingCai
Collaborator

Related CI has passed.

[screenshots: related CI checks passing]

ShangmingCai merged commit bfc1aea into sgl-project:main on May 6, 2026
296 of 345 checks passed
ltcs11 added a commit to ltcs11/sglang that referenced this pull request May 7, 2026
* main: (894 commits)
  [Bug Fix] Fix RunAI streamer: corrupted weights, missing quant init, and broken URIs for multimodal models (sgl-project#22715)
  [Kernel] Deprecate DeepGemm in sgl kernel and apply custom wheel sgl-deep-gemm (sgl-project#24268)
  propagate pytest exit code from test __main__ entries (sgl-project#24487)
  [R3] Avoid implicit CUDA sync in routed experts DP slicing (sgl-project#24550)
  Add ChatCompletionRequest-style support to /v1/tokenize (sgl-project#23981)
  Support Triton MLA FP8 KV cache (sgl-project#20479)
  [diffusion] chore: align LTX-2 with official (sgl-project#24313)
  Expand support matrix for pypi wheel release (sgl-project#24565)
  [codex] Optimize Z-Image packed QKV (sgl-project#24117)
  [Misc] Fix breaking weight checker test (sgl-project#24553)
  [LoRA] Fix qkv_proj LoRA buffer sizing when tp_size > num_key_value_heads (sgl-project#24420)
  ci: bump test_mimo_models.py est_time 330 → 610 (sgl-project#24551)
  [CI] Temporarily disable marco/mcdse-2b-v1 in test_embedding_models (sgl-project#24279)
  Improve metrics, observability, and PD deploy tooling (sgl-project#24521)
  Fix diffusion fallback guards and validation (sgl-project#23335)
  [PD] Prevent update_status to Failed from cleared entries (sgl-project#24539)
  [CP] Register KV cache allgather buffer with symmetric memory (sgl-project#24040)
  Support getting checksums in weight checker (sgl-project#24537)
  Refactor buffer patterns in weight checker (sgl-project#24538)
  Add unit and end-to-end tests for weight checker (sgl-project#24536)
  ...

# Conflicts:
#	python/sglang/srt/managers/scheduler.py
#	python/sglang/srt/model_executor/model_runner.py
LLThomas pushed a commit to LLThomas/sglang that referenced this pull request May 8, 2026
