Bugfix: fix symm not enabled due to incorrect registration of comm #19329

wangfakang wants to merge 10 commits into sgl-project:main from
Conversation
Signed-off-by: wangfakang <fakangwang@gmail.com>
Code Review
This pull request correctly addresses a bug where symmetric memory (symm) was not being enabled due to incorrect communication group registration. The fix involves adding a group parameter to get_local_dp_buffer and get_global_dp_buffer to make the communication group explicit. This change is consistently applied across various call sites, ensuring that the appropriate communication group (get_attention_tp_group, get_tp_group, or get_attention_cp_group) is used for symmetric memory registration, aligning it with the group used by the corresponding collective communication operator. The implementation is clean and effectively resolves the issue.
Signed-off-by: wangfakang <fakangwang@gmail.com>
```diff
-    def get_local_dp_buffer(cls) -> torch.Tensor:
-        with use_symmetric_memory(get_tp_group(), disabled=not cls._dp_max_padding):
+    def get_local_dp_buffer(cls, group: GroupCoordinator) -> torch.Tensor:
+        with use_symmetric_memory(group, disabled=not cls._dp_max_padding):
```
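As a rough sketch of the invariant this diff enforces (all names below are stand-ins, not the real sglang implementation): a buffer only counts as symmetric memory for a collective if it was registered with the same group the collective communicates over.

```python
from contextlib import contextmanager

class FakeGroup:
    """Stand-in for sglang's GroupCoordinator; tracks symm registrations."""
    def __init__(self, name):
        self.name = name
        self.symm_buffers = []

@contextmanager
def use_symmetric_memory(group, disabled=False):
    # Stand-in for the real context manager: allocate, then register the
    # allocation with `group` on exit (registration coupled to allocation).
    buf = object()  # stands in for a torch.Tensor
    yield buf
    if not disabled:
        group.symm_buffers.append(buf)

def get_local_dp_buffer(group):
    # After the fix: the caller passes the comm group explicitly.
    with use_symmetric_memory(group) as buf:
        return buf

def all_gather_uses_symm(group, buf):
    # A collective only benefits from symm if `buf` was registered with
    # the group it communicates over.
    return any(b is buf for b in group.symm_buffers)

tp = FakeGroup("tp")
attn_cp = FakeGroup("attn_cp")

buf = get_local_dp_buffer(attn_cp)
assert all_gather_uses_symm(attn_cp, buf)      # right group: symm works
assert not all_gather_uses_symm(tp, buf)       # wrong group: symm silently off
```

Before the fix, the buffer was always registered with the TP group, so a collective running on the attention-CP group hit the second case.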
Why not make `group: Optional[GroupCoordinator] = None`? If `group` is `None`, we could still use `get_tp_group()` by default.
Using default values can easily lead to group inconsistency, so it's necessary to explicitly declare the correct group to ensure consistency with the communication operator's context.
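The hazard with a defaulted group can be illustrated with a minimal hypothetical sketch (these helpers are illustrative, not the real sglang API):

```python
class Group:
    """Hypothetical stand-in for GroupCoordinator."""
    def __init__(self, name):
        self.name = name

TP = Group("tp")
ATTN_CP = Group("attn_cp")

def get_tp_group():
    return TP

# Defaulted variant: a caller doing a CP all-gather can forget the
# argument and silently register against the TP group.
def get_buffer_defaulted(group=None):
    return group if group is not None else get_tp_group()

assert get_buffer_defaulted() is TP  # silent fallback, no error raised

# Required variant: forgetting the group is an immediate TypeError at
# the call site, so the mismatch cannot slip through unnoticed.
def get_buffer_explicit(group):
    return group

assert get_buffer_explicit(ATTN_CP) is ATTN_CP
```

A required parameter turns a silent correctness bug into a loud call-site error, which is the trade-off the author is defending here.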
@wangfakang Thanks for your PR!
So we would need to come up with a design to address this issue: i.e., making sure the memory returned by the pool is properly registered with the correct group. In a previous design, we decoupled registration from allocation and could register or re-register memory segments at the exit of the context manager, but now we perform the registration with the allocation since we did not want to pay the CPU cost of the PyTorch memory-snapshot API.
@nvcastet You're absolutely right. Indeed, there are two distinct issues here and this PR fixes the first issue (missing group passing). The second issue (memory-snapshot inconsistency) remains and needs a separate solution, possibly via C++ tracking or reverting to decoupled registration.
Please, if possible use your own words to answer review comments instead of AI. It is easier when we have more direct and accurate back and forth.
@nvcastet I apologize for the confusion. Would reverting to the previous design require another PR? Thank you.
@nvcastet @merrymercy I refactored SymmPool to replace the global MemPool with a per-group MemPool dictionary in #20153. Now each communication group has its own MemPool, ensuring proper memory registration and preventing cross-group allocation issues in multi-comm scenarios.
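The per-group pool idea from #20153 might be sketched like this (names and structure are illustrative; the real refactor keys actual MemPool objects by communication group):

```python
class SymmPoolRegistry:
    """Illustrative per-group pool registry (hypothetical sketch)."""
    def __init__(self):
        self._pools = {}  # group name -> pool (here, a plain list)

    def get_pool(self, group_name):
        # Each comm group gets its own pool, so memory allocated for
        # one group is never registered with or handed to another.
        if group_name not in self._pools:
            self._pools[group_name] = []
        return self._pools[group_name]

registry = SymmPoolRegistry()
tp_pool = registry.get_pool("tp")
cp_pool = registry.get_pool("attn_cp")
assert tp_pool is not cp_pool              # pools are isolated per group
assert registry.get_pool("tp") is tp_pool  # and stable across lookups
```

Keying pools by group removes the single shared pool that allowed a buffer allocated under one group's context to be reused by a collective on another.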
CC @ShangmingCai @nvcastet @Fridge003 @BBuf @yizhang2077 @ch-wan PTAL, thx.
Motivation
Fix symm (symmetric memory) not being enabled due to incorrect registration of the comm group.
`get_local_dp_buffer` uses the group obtained by `get_tp_group` by default to register symm, while the `attn_cp_all_gather_into_tensor` operation uses the group obtained by `get_attention_cp_group` to perform the all-gather.

sglang/python/sglang/srt/layers/communicator_nsa_cp.py
Lines 141 to 163 in 671b595
sglang/python/sglang/srt/layers/dp_attention.py
Lines 129 to 130 in c4ef338
sglang/python/sglang/srt/layers/dp_attention.py
Lines 572 to 573 in c4ef338
sglang/python/sglang/srt/distributed/device_communicators/pynccl_allocator.py
Lines 180 to 184 in 671b595
sglang/python/sglang/srt/distributed/device_communicators/pynccl_allocator.py
Lines 52 to 58 in c4ef338
Modifications
Modify the functions `get_local_dp_buffer` and `get_global_dp_buffer` to add a `GroupCoordinator` parameter, which makes the comm group registered by symm consistent with the comm group of the collective operator.

Accuracy Tests
Benchmarking and Profiling
Checklist
Review Process
`/tag-run-ci-label`, `/rerun-failed-ci`, `/tag-and-rerun-ci`