
Bugfix: fix symm not enabled due to incorrect registration of comm #19329

Open
wangfakang wants to merge 10 commits into sgl-project:main from wangfakang:bugfix_symm

Conversation

@wangfakang
Contributor

@wangfakang wangfakang commented Feb 25, 2026

CC @ShangmingCai @nvcastet @Fridge003 @BBuf @yizhang2077 @ch-wan PTAL, thx.

Motivation

Fix symmetric memory (symm) not being enabled due to incorrect registration of the communication group:

  1. get_local_dp_buffer uses the group obtained by get_tp_group by default to register symm.
  2. The attn_cp_all_gather_into_tensor operation uses the group obtained by get_attention_cp_group to perform allgather.
  3. Steps 1 and 2 cause symm to fail when enabled because the two groups are inconsistent.

def _gather_hidden_states_and_residual(
    hidden_states: torch.Tensor,
    residual: torch.Tensor,
    forward_batch: ForwardBatch,
    layernorm: torch.nn.Module,
    context: CommunicateContext,
    *,
    residual_input_mode,
):
    if hidden_states.shape[0] != 0:
        hidden_states, residual = layernorm(hidden_states, residual)
    # for prefill: attn tp scattered -> full
    # for decode: attn tp full -> full
    if nsa_use_prefill_cp(forward_batch):
        assert context.attn_dp_size == 1
        hidden_states, local_hidden_states = (
            get_local_dp_buffer(),
            hidden_states,
        )
        attn_cp_all_gather_into_tensor(
            hidden_states,
            local_hidden_states,
        )

def get_local_dp_buffer(cls) -> torch.Tensor:
    with use_symmetric_memory(get_tp_group(), disabled=not cls._dp_max_padding):
        ...

def attn_cp_all_gather_into_tensor(output: torch.Tensor, input: torch.Tensor):
    return get_attention_cp_group().all_gather_into_tensor(output, input)

# Set the env var to pass this argument to the C functions.
os.environ["SGLANG_TMP_NCCL_COMM_VALUE"] = str(
    self.group_coordinator.pynccl_comm.comm.value
)

const char *str_val = getenv("SGLANG_TMP_NCCL_COMM_VALUE");
char *endptr;
void* int_val = (void *)strtoull(str_val, &endptr, 0);
ncclComm_t comm = (ncclComm_t)(int_val);
ncclWindow_t win;
NCCLCHECK(ncclCommWindowRegister(comm, ptr, size, &win, NCCL_WIN_COLL_SYMMETRIC));
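The pointer handoff above can be sketched end to end in Python. This is a minimal sketch: the env-var name matches the snippets above, but `pass_comm_via_env` is a hypothetical helper, not sglang code. The C side parses the string with `strtoull(..., 0)`, and `int(s, 0)` performs the same base-0 parse:

```python
import os

def pass_comm_via_env(comm_value: int, var: str = "SGLANG_TMP_NCCL_COMM_VALUE") -> int:
    """Round-trip an integer communicator handle through an env var.

    Hypothetical helper: the Python side stores the handle as a decimal
    string; the C side recovers it with strtoull(getenv(var), &endptr, 0),
    which int(s, 0) mirrors here.
    """
    os.environ[var] = str(comm_value)
    return int(os.environ[var], 0)

# The handle survives the string round trip unchanged.
assert pass_comm_via_env(0x7F3A00001234) == 0x7F3A00001234
```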

Modifications

Modify the functions get_local_dp_buffer and get_global_dp_buffer to accept a GroupCoordinator parameter, so that the communication group registered for symm is consistent with the communication group used by the collective operator.


Review Process

  1. Ping Merge Oncalls to start the PR flow. See the PR Merge Process.
  2. Get approvals from CODEOWNERS and other reviewers.
  3. Trigger CI tests with comments or contact authorized users to do so.
    • /tag-run-ci-label, /rerun-failed-ci, /tag-and-rerun-ci
  4. After green CI and required approvals, ask Merge Oncalls to merge.

Signed-off-by: wangfakang <fakangwang@gmail.com>
@gemini-code-assist
Contributor

Summary of Changes

Hello @wangfakang, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request addresses a critical bug preventing symmetric memory from being properly enabled in distributed operations. The fix involves enhancing the get_local_dp_buffer and get_global_dp_buffer functions to explicitly accept a communication group, thereby resolving inconsistencies between memory registration and collective communication patterns. This change ensures that distributed attention and MoE operations utilize the correct communication groups, leading to proper functionality of symmetric memory.

Highlights

  • Communication Group Inconsistency: Identified and fixed a bug where symmetric memory (symm) was not correctly enabled due to get_local_dp_buffer and get_global_dp_buffer using a default communication group (get_tp_group()) that was inconsistent with the group used by collective operations like attn_cp_all_gather_into_tensor (get_attention_cp_group()).
  • API Modification for Group Coordination: Modified the get_local_dp_buffer and get_global_dp_buffer functions to accept an explicit GroupCoordinator parameter, allowing callers to specify the correct communication group for symmetric memory registration.
  • Consistent Group Usage: Updated various call sites across communicator.py, communicator_nsa_cp.py, and token_dispatcher/standard.py to pass the appropriate communication group (e.g., get_attention_tp_group(), get_attention_cp_group(), or get_tp_group()) to get_local_dp_buffer and get_global_dp_buffer, ensuring consistency between symm registration and collective operations.


Changelog
  • python/sglang/srt/layers/communicator.py
    • Imported get_attention_tp_group.
    • Updated calls to get_local_dp_buffer within _scattered_to_tp_attn_full, _gather_hidden_states_and_residual, _scatter_hidden_states, and _gather to pass get_attention_tp_group().
    • Updated a call to get_global_dp_buffer within _gather_hidden_states_and_residual to pass get_tp_group().
    • Added logic in _scatter_hidden_states to determine the appropriate group (get_tp_group() or get_attention_tp_group()) based on tensor model parallel world size and attention data parallel size.
  • python/sglang/srt/layers/communicator_nsa_cp.py
    • Imported get_attention_cp_group.
    • Updated the call to get_local_dp_buffer within _gather_hidden_states_and_residual to pass get_attention_cp_group().
  • python/sglang/srt/layers/dp_attention.py
    • Modified the class methods _DpGatheredBufferWrapper.get_global_dp_buffer and _DpGatheredBufferWrapper.get_local_dp_buffer to accept a group: GroupCoordinator parameter.
    • Modified the standalone functions get_global_dp_buffer and get_local_dp_buffer to accept and pass through the group parameter to their respective wrapper methods.
  • python/sglang/srt/layers/moe/token_dispatcher/standard.py
    • Updated the call to get_local_dp_buffer within the combine method to pass get_tp_group().

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request correctly addresses a bug where symmetric memory (symm) was not being enabled due to incorrect communication group registration. The fix involves adding a group parameter to get_local_dp_buffer and get_global_dp_buffer to make the communication group explicit. This change is consistently applied across various call sites, ensuring that the appropriate communication group (get_attention_tp_group, get_tp_group, or get_attention_cp_group) is used for symmetric memory registration, aligning it with the group used by the corresponding collective communication operator. The implementation is clean and effectively resolves the issue.

Signed-off-by: wangfakang <fakangwang@gmail.com>
Comment on lines -129 to +130

- def get_local_dp_buffer(cls) -> torch.Tensor:
-     with use_symmetric_memory(get_tp_group(), disabled=not cls._dp_max_padding):
+ def get_local_dp_buffer(cls, group: GroupCoordinator) -> torch.Tensor:
+     with use_symmetric_memory(group, disabled=not cls._dp_max_padding):
Collaborator


Why not make group: Optional[GroupCoordinator] = None, if group==None, then we still use get_tp_group() by default.

Contributor Author

@wangfakang wangfakang Feb 26, 2026


Why not make group: Optional[GroupCoordinator] = None, if group==None, then we still use get_tp_group() by default.

Using default values can easily lead to group inconsistency, so it's necessary to explicitly declare the correct group to ensure consistency with the communication operator's context.

@nvcastet
Collaborator

nvcastet commented Feb 26, 2026

@wangfakang Thanks for your PR!
I believe there is an issue here when multiple groups (communicators) are used with the "use_symmetric_memory" context manager in the same run.
The current behavior is:

  • If the pool does not have enough inactive memory, it will allocate and register memory with the specified group.
  • If the pool has enough inactive memory, it just returns a memory block from the pool (without considering the specified group).

So we would need to come up with a design to address this issue, i.e., making sure the memory returned by the pool is properly registered with the correct group.

In a previous design, we decoupled registration from allocation and could register or re-register memory segments at the exit of the context manager, but now we perform registration at allocation time since we did not want to pay the CPU cost of the PyTorch memory snapshot API.
I guess we could come back to this earlier design but track the allocations ourselves in C++ instead of going through the memory snapshot.
CC @merrymercy

@wangfakang
Contributor Author

wangfakang commented Feb 27, 2026

@nvcastet You're absolutely right. Indeed, there are two distinct issues here and this PR fixes the first issue (missing group passing). The second issue (memory snapshot inconsistency) remains and needs a separate solution, possibly via C++ tracking or reverting to decoupled registration.

@nvcastet
Collaborator

nvcastet commented Feb 27, 2026

Please, if possible, use your own words to answer review comments instead of AI. It is easier when we have more direct and accurate back and forth.
I don't think they are distinct issues. I think it would be better to fix the other issue first before merging this one, since this one would give you the impression buffers are registered when they are actually still associated with the tp group.
And before merging those fixes, we would need to re-run key configs for DSR1 and Qwen to check that we did not break anything.

@wangfakang
Contributor Author

wangfakang commented Mar 2, 2026

@nvcastet I apologize for the confusion. Would reverting to the previous design require another PR? Thank you.

@nvcastet
Collaborator

nvcastet commented Mar 2, 2026

@wangfakang:
@merrymercy can chime in on that topic, but the issue with rolling back was the snapshot() API cost; see the previous version at https://github.com/sgl-project/sglang/pull/12524/changes#diff-1857ea2e79f03309e0776136d7e45b432e0369c20ff8a57d418d68c764bb733f.
We would need a new design decoupling registration without the CPU overhead of snapshot().

@wangfakang
Contributor Author

wangfakang commented Mar 9, 2026

@nvcastet @merrymercy I refactored SymmPool to replace the global MemPool with a per-group MemPool dictionary in #20153. Now each communication group has its own MemPool, ensuring proper memory registration and preventing cross-group allocation issues in multi-comm scenarios.
Combining #19329 and #20153 completely fixes the two issues mentioned by @nvcastet. I can observe the effects through the NCCL tuning log when enabling symm locally.
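The per-group pool idea can be sketched minimally as follows (names and structure are hypothetical and the real #20153 implementation differs; a bytearray stands in for a window-registered buffer): keying the free-block pool by communication group means a reused block was necessarily allocated, and registered, for the group it will be used with.

```python
class SymmPoolSketch:
    """Toy per-group pool: one free list per communication group."""

    def __init__(self):
        self._pools = {}  # group name -> list of (size, block)

    def get_buffer(self, group: str, size: int):
        pool = self._pools.setdefault(group, [])
        for i, (blk_size, blk) in enumerate(pool):
            if blk_size >= size:
                # Reuse: this block was allocated (and, in the real code,
                # window-registered) for exactly this group.
                return pool.pop(i)[1]
        # No reusable block: allocate fresh; the real code would also call
        # ncclCommWindowRegister with this group's communicator here.
        return bytearray(size)

    def put_buffer(self, group: str, blk):
        self._pools.setdefault(group, []).append((len(blk), blk))

pool = SymmPoolSketch()
a = pool.get_buffer("attention_cp", 1024)
pool.put_buffer("attention_cp", a)
b = pool.get_buffer("tp", 1024)   # different group: never reuses a
assert b is not a
assert pool.get_buffer("attention_cp", 512) is a  # same group: reuse
```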

