Conversation
Summary of Changes
Hello @yizhang2077, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
This pull request introduces comprehensive support for Deepseek Expert Parallelism (DeepEP) within the Qwen3-Next-FP8 model architecture. The changes enable efficient distributed execution of Mixture-of-Experts (MoE) layers by integrating DeepEP-specific logic for expert management, load balancing, and tensor parallelism, aiming to improve performance and scalability for large language models.
Highlights
- DeepEP Integration: Implemented DeepEP (Deepseek Expert Parallelism) support for Qwen3-Next-FP8 models, enabling specialized handling of Mixture-of-Experts (MoE) layers for improved distributed inference.
- Expert Parallelism Configuration: Enhanced MoE layer initialization to incorporate redundant experts and pass tensor parallelism configurations to sub-layers, optimizing distributed execution and resource utilization.
- Expert Weight Management: Introduced a mechanism to lazily retrieve and manage expert weights within the Qwen3-Next model, facilitating DeepEP's operational requirements and dynamic expert routing.
- Expert Distribution Tracking: Integrated global expert distribution recording to monitor and potentially optimize expert allocation across layers, which is crucial for load balancing in DeepEP.
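To make the expert-distribution-tracking idea above concrete, here is a minimal, hypothetical sketch of per-layer routing counters for load balancing. The class name and API are illustrative only and do not reflect sglang's actual implementation.

```python
# Hypothetical sketch of expert-distribution recording for load balancing.
# Not sglang's real recorder; names and API are illustrative.
from collections import Counter


class ExpertDistributionRecorder:
    """Counts how often each expert is routed to, per layer."""

    def __init__(self, num_layers: int):
        self.counts = [Counter() for _ in range(num_layers)]

    def record(self, layer: int, expert_ids):
        # expert_ids: the routed expert indices for one batch of tokens
        self.counts[layer].update(expert_ids)

    def hottest(self, layer: int, k: int = 3):
        # The most frequently hit experts are candidates for redundancy.
        return self.counts[layer].most_common(k)


rec = ExpertDistributionRecorder(num_layers=2)
rec.record(0, [3, 3, 7, 1])
print(rec.hottest(0, 1))  # [(3, 2)]
```

A recorder like this is what would feed decisions such as which experts to duplicate via redundant copies across EP ranks.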
Code Review
This pull request introduces support for qwen3-next-fp8 with DeepEP. The changes are mainly in qwen2_moe.py and qwen3_next.py. In qwen2_moe.py, a new forward path for DeepEP is added, along with configurations for redundant experts and shared experts. A method to retrieve expert weights is also included. In qwen3_next.py, this new method is used to expose expert weights for load balancing, and expert distribution recording is integrated. The changes are logical and well-implemented. I have one minor suggestion to remove some redundant code to improve clarity.
```python
if get_moe_a2a_backend().is_deepep():
    # TODO: we will support tp < ep in the future
    self.ep_size = get_moe_expert_parallel_world_size()
    self.num_experts = (
        config.num_experts + global_server_args_dict["ep_num_redundant_experts"]
    )
    self.top_k = config.num_experts_per_tok
```
This block of code appears to be redundant. The attributes self.ep_size, self.num_experts, and self.top_k are assigned but are not used within the class. The values for num_experts and top_k were already used during the initialization of self.experts and self.topk respectively. If this code is for future use as hinted by the TODO, it should be commented out. Otherwise, it can be removed to improve code clarity.
* origin/qwen3: (30 commits)
  - chore: bump sgl-kernel 0.3.11 (sgl-project#10630)
  - feat: add fused moe config for Qwen3-Next-80B-A3B-Instruct on B200 (sgl-project#10631)
  - model support: Sarashina2VisionForCausalLM (sgl-project#10632)
  - [Performance] Qwen3-Next: speed up update_mamba_state_after_mtp_verify by 10x; e2e up to 3.54% faster (sgl-project#10586)
  - [Performance] Qwen3-Next: replace arange to cached query_start_loc_li… (sgl-project#10553)
  - [Feature] Speculative decoding support lookahead (sgl-project#9873)
  - refactor: use registry for _get_attention_backend_from_str (sgl-project#10629)
  - [router] refactor worker to builder pattern 1/n (sgl-project#10628)
  - Garbage collector regression in the online server (sgl-project#10621)
  - feat: Add FlexAttention Backend for Efficient Sparse Attention (sgl-project#9947)
  - Fix bias handling in TritonMoeQuantInfo within quantization/mxfp4.py (sgl-project#10579)
  - [Performance] qwen3-next improve causal conv1d in prefill phase (sgl-project#10595)
  - Fix sgl_kernel import failure on devices other than CUDA (sgl-project#10610)
  - support qwen3-next-fp8 deepep (sgl-project#10622)
  - update deepep version for qwen3-next deepep moe (sgl-project#10624)
  - Feat/add heartbeat mechanism for nixl conn (sgl-project#10222)
  - [RL] Add destroy process group api (sgl-project#9979)
  - fix deepep assert when PD disaggregation == null (sgl-project#8274)
  - Scale kkt after reduction (sgl-project#10604)
  - [improvement] add average input/output token length for hicache benchmark stats output (sgl-project#10525)
  - ...
Hi @yizhang2077, I wonder how "E=128" is calculated (it seems to be num_experts / ep_size). In the tuning script, E is always 512, following config.num_experts (https://huggingface.co/Qwen/Qwen3-Next-80B-A3B-Instruct/blob/main/config.json#L26). Does the triton moe kernel tuning script need to change?
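For what it's worth, a back-of-the-envelope check is consistent with that reading. The `ep_size = 4` value below is an assumption for illustration only, not taken from the logs:

```python
# Hypothetical arithmetic: E in the tuned kernel config would be the
# per-rank expert count if experts are split evenly across EP ranks.
num_experts = 512        # config.num_experts from the HF config
ep_num_redundant = 0     # assuming no redundant experts configured
ep_size = 4              # assumed expert-parallel world size
E = (num_experts + ep_num_redundant) // ep_size
print(E)  # 128
```

If that reading is right, the tuning script would need to account for ep_size (and any redundant experts) rather than always using config.num_experts directly.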
@yizhang2077 On main: launch server.
Startup logs: (screenshot)
Motivation
This PR needs #10624 to be merged first.
Modifications
Accuracy Tests
Benchmarking and Profiling
Checklist