Make GLM4 MoE dual stream capture-aware and configurable (restore non-DeepEP overlap) #13778
voipmonitor wants to merge 4 commits into sgl-project:main from
Conversation
Summary of Changes

Hello @voipmonitor, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed.

This pull request addresses a performance regression in GLM4 Mixture-of-Experts (MoE) models by re-introducing and enhancing the dual-stream processing capability. The changes ensure that the model can leverage parallel execution for improved throughput, especially when CUDA graph capture is active, and provide users with granular control over this behavior through new configuration options. This brings GLM4 MoE's performance and configurability in line with best practices observed in other models such as DeepSeek.

Highlights
Code Review
This pull request restores the dual-stream MoE path for GLM-4 models to improve performance, making it configurable through environment variables. The changes are well-structured. However, I've found a critical issue in the dual-stream implementation that prevents the intended computation overlap, effectively serializing the operations. I've also suggested a small improvement for robustness in handling the configuration options. Addressing the critical issue is essential to achieve the performance goals of this PR.
```python
) -> torch.Tensor:
    ...
    current_stream = torch.cuda.current_stream()
    self.alt_stream.wait_stream(current_stream)
```
The call `self.alt_stream.wait_stream(current_stream)` serializes the execution on the alternate stream with the current stream. This means the operations inside the `with torch.cuda.stream(self.alt_stream):` block will only start after `_forward_shared_experts` on the current stream has completed. This defeats the purpose of using a dual-stream approach, which is to achieve parallelism and overlap computation. Removing this line will allow `_forward_shared_experts` and the operations on the alternate stream (gate, topk, experts) to run concurrently, restoring the intended performance improvement.
`wait_stream` here doesn't serialize the dual-stream overlap; it mirrors the original code's ordering. The call `alt_stream.wait_stream(current_stream)` is placed before we enqueue `shared_output = _forward_shared_experts` on the current stream. That means the alt stream only waits for the prior work that produced `hidden_states`, not for `shared_experts` itself. After that, `shared_experts` runs on the current stream while gate/topk/experts run on the alt stream, so they still overlap. The later `current_stream.wait_stream(self.alt_stream)` is necessary before the final add. If we drop the initial `wait_stream`, we risk the alt stream reading `hidden_states` before upstream ops on the current stream have finished.
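The ordering argument above can be sketched as follows. This is a minimal illustration, not the PR's actual code: `shared_fn` and `routed_fn` are stand-ins for `_forward_shared_experts` and the gate/topk/experts path, and the serial CPU fallback is added so the sketch is self-contained.

```python
import torch

def dual_stream_moe(hidden_states, shared_fn, routed_fn, alt_stream=None):
    """Overlap shared-expert and routed-expert computation on two streams."""
    if alt_stream is None or not torch.cuda.is_available():
        # Serial fallback (CPU, or overlap disabled): no streams involved.
        return shared_fn(hidden_states) + routed_fn(hidden_states)

    current_stream = torch.cuda.current_stream()
    # Fork: alt_stream must wait for the upstream ops (on the current
    # stream) that produced hidden_states -- NOT for shared_fn, which has
    # not been enqueued yet at this point.
    alt_stream.wait_stream(current_stream)
    with torch.cuda.stream(alt_stream):
        routed_output = routed_fn(hidden_states)  # gate / topk / experts
    # Enqueued after the wait above, so this runs concurrently with the
    # work already queued on alt_stream.
    shared_output = shared_fn(hidden_states)
    # Join: the current stream must not perform the final add until the
    # alt_stream work has finished.
    current_stream.wait_stream(alt_stream)
    return shared_output + routed_output
```

With `alt_stream=None` (or on CPU) this degrades to the serial path, which is also one way overlap can be disabled.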
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
@zRzRzRzRzRzRzR - would you please review? Is it OK to restore the non-DeepEP dual-stream MoE path?
Summary:
Why:
Testing:
Local 4× RTX (FP8, CUDA graph enabled) with the default `auto`: throughput restored to pre-#12455 ("[Grammar Fix] GLM-4-MOE `self.first_k_dense_replace` is undefined") levels.
Manual toggles:
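An `auto` default plus manual toggles suggests a tri-state environment flag. The sketch below shows one way such parsing could look; the variable name used in the test and the exact semantics are hypothetical, not the PR's real option names.

```python
import os

def parse_tristate_flag(name: str, default: str = "auto"):
    """Parse an env toggle into True / False / None ('auto').

    None means 'decide at runtime', e.g. enable dual-stream overlap only
    when an alt stream exists and the capture state permits it.
    """
    val = os.environ.get(name, default).strip().lower()
    if val == "auto":
        return None
    # Accept common truthy spellings; everything else is treated as off.
    return val in ("1", "true", "yes", "on")
```

Returning `None` for `auto` keeps the final decision in the model code, which can then combine the flag with runtime conditions such as CUDA graph capture.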
Included optimized Triton `.json` tuning configs for 4× RTX 6000 PRO.
Testing with:
Before PR: ~52 tokens/sec
After PR: ~58 tokens/sec (the same speed as before #12455)