Conversation
…py and update Glm4MoeSparseMoeBlock to use DeepSeekV3 routing method Signed-off-by: Xinyuan Tong <xinyuantong.cs@gmail.com>
Summary of Changes

Hello @JustinTong0323, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request addresses an issue preventing the glm-4.7-fp4 model from working correctly with the EAGLE speculative decoding algorithm in sglang. The changes refine the model configuration to properly handle quantization for certain draft models and specify the routing method for the GLM-4 MoE layer, improving compatibility and stability.
Code Review
This pull request introduces fixes to enable glm-4.7-fp4 to work with EAGLE speculative decoding. The changes include disabling modelopt_fp4 quantization for certain NextN draft models to prevent loader errors, and updating the MoE routing method for GLM-4.7 to be compatible with the DeepSeekV3 implementation in sglang. The changes are logical and well-targeted. I have one suggestion to improve performance and maintainability in model_config.py.
```python
nextn_architectures_without_fp4 = [
    "DeepseekV3ForCausalLMNextN",
    "Glm4MoeForCausalLMNextN",
    "BailingMoeForCausalLMNextN",
]
```
For better performance and code clarity, it's recommended to define `nextn_architectures_without_fp4` as a module-level constant (e.g., `_NEXTN_ARCHITECTURES_WITHOUT_FP4`) instead of redefining it inside the `_verify_quantization` method on each call. Using a `frozenset` would also make membership testing more efficient.
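The suggestion above can be sketched as follows. This is a minimal, hypothetical illustration: the helper name `is_nextn_without_fp4` is invented for clarity, while in the PR the membership test is inline in `_verify_quantization`.

```python
# Module-level frozenset: built once at import time, with O(1)
# membership tests, instead of rebuilding a list on every call.
_NEXTN_ARCHITECTURES_WITHOUT_FP4 = frozenset(
    {
        "DeepseekV3ForCausalLMNextN",
        "Glm4MoeForCausalLMNextN",
        "BailingMoeForCausalLMNextN",
    }
)


def is_nextn_without_fp4(architecture: str) -> bool:
    # Hypothetical helper for illustration; the PR performs this
    # check directly inside ModelConfig._verify_quantization.
    return architecture in _NEXTN_ARCHITECTURES_WITHOUT_FP4
```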
/tag-and-rerun-ci
…ptNvFp4FusedMoEMethod Signed-off-by: Xinyuan Tong <xinyuantong.cs@gmail.com>
Closing, because I think we resolved the issue
Motivation
cc @ynwang007, who reported this. Let glm-4.7-fp4 work with EAGLE in sglang.
Usage
Modifications
**model_config.py** — Disable FP4 quantization for draft NextN models
- Adds a safeguard for the architectures `DeepseekV3ForCausalLMNextN`, `Glm4MoeForCausalLMNextN`, and `BailingMoeForCausalLMNextN`
- When a draft model uses `modelopt_fp4` quantization, it is automatically disabled, since NextN layers are not FP4-quantized
- Prevents loader errors that would occur from mismatched quantization

**glm4_moe.py** — Update routing method for `Glm4MoeSparseMoeBlock`
- Imports the `RoutingMethodType` utility
- Explicitly sets `routing_method_type=RoutingMethodType.DeepSeekV3` when creating the `FusedMoE` layer

The fix ensures proper handling of GLM-4 MoE models with FP4 quantization and speculative decoding (NextN draft models).
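The quantization safeguard described above can be sketched as follows. All names here (`resolve_draft_quantization`, the tuple constant) are hypothetical illustrations under the assumption that the draft model's architecture and requested quantization are known; the actual check lives inside sglang's `ModelConfig._verify_quantization`.

```python
# Hypothetical sketch of the safeguard; not the actual sglang code.
NEXTN_ARCHS_WITHOUT_FP4 = (
    "DeepseekV3ForCausalLMNextN",
    "Glm4MoeForCausalLMNextN",
    "BailingMoeForCausalLMNextN",
)


def resolve_draft_quantization(architecture, quantization):
    """Disable modelopt_fp4 for NextN draft models whose layers are not
    FP4-quantized, avoiding loader errors from mismatched quantization."""
    if quantization == "modelopt_fp4" and architecture in NEXTN_ARCHS_WITHOUT_FP4:
        return None  # load the draft model unquantized instead
    return quantization
```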
Accuracy Tests
Benchmarking and Profiling
Checklist
Review Process
/tag-run-ci-label, /rerun-failed-ci, /tag-and-rerun-ci) or contact authorized users to do so.