[Fix] GLM 4.7 + NVFP4 + MTP #17166
Conversation
Summary of Changes

Hello @b8zhong, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request addresses several critical issues related to model loading and performance for specific GLM models, particularly when using NVFP4 quantization and MTP (Multi-Token Prediction) checkpoints. The changes improve the robustness of the model loading process by handling quantization-method compatibility more intelligently and by fixing a common checkpoint-packaging bug. Additionally, it optimizes performance for GLM4Moe models on compatible hardware by automatically enabling a faster MoE runner backend.
Code Review
This pull request introduces several important fixes and improvements, primarily for GLM 4.7 models with NVFP4 quantization and MTP. The changes include more robust quantization-method detection, a workaround for incorrectly packaged checkpoints that auto-detects and includes mtp.safetensors, and a performance enhancement that automatically enables the flashinfer_trtllm MoE runner backend under specific conditions. The code is well-structured and addresses the issues effectively. I have one suggestion to refactor some conditional checks for improved conciseness and readability.
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Force-pushed from a33c14f to 8f8fd35
Hi @b8zhong, did you try a quality sanity test for this model? I used your launch command and got very bad output. Even a very simple request can trigger it, but if we specify
@ConnorLi96 You need nightly flashinfer for this fix; if you did not install it, it will still be broken.
Yep, this does work, very fresh update, lol
```python
elif model_arch in ["Glm4MoeForCausalLM"]:
    if is_sm100_supported():
        quantization_config = getattr(hf_config, "quantization_config", None)
        quant_method = (
            quantization_config.get("quant_method")
            if quantization_config is not None
            else None
        )
        if self.quantization is None and quant_method is not None:
            self.quantization = quant_method
        if (
            self.quantization == "modelopt_fp4"
            and self.moe_a2a_backend == "none"
            and self.moe_runner_backend == "auto"
        ):
            # Only enable flashinfer_trtllm if flashinfer-python version is >= 0.6.2
            if check_pkg_version_at_least("flashinfer-python", "0.6.2"):
                self.moe_runner_backend = "flashinfer_trtllm"
                logger.info(
                    "Use flashinfer_trtllm as MoE runner backend on sm100 for Glm4MoeForCausalLM"
                )
```
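The version gate in the diff calls a `check_pkg_version_at_least` helper from the codebase. A minimal sketch of how such a check could be implemented, assuming the `packaging` library is available (the actual helper in the repository may differ):

```python
from importlib.metadata import PackageNotFoundError, version

from packaging.version import Version


def check_pkg_version_at_least(pkg_name: str, min_version: str) -> bool:
    """Return True if `pkg_name` is installed and its version is >= `min_version`."""
    try:
        installed = version(pkg_name)  # raises if the package is not installed
    except PackageNotFoundError:
        return False
    # Version() handles pre-releases and dev builds, so a nightly like
    # 0.6.2.dev20240101 compares sensibly against "0.6.2".
    return Version(installed) >= Version(min_version)
```

This is why a nightly FlashInfer build is required for the fast path: an older installed `flashinfer-python` fails the check and the runner backend silently stays on the default.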
This if branch has been inserted in the wrong place!
Qwen3NextForCausalLM's Mamba radix cache v2 handling is inside the Glm4MoeForCausalLM branch now.
Hi @jimmy-evo, sorry about breaking this code. It looks like it has been fixed in main, though. I will be more careful next time.
Motivation
A few issues were reported by @ynwang007 and @JustinTong0323.
Modifications
- `modelopt` -> `modelopt_fp4` in fix: GLM4.7-FP4 usage #16581. I think this detection might be a better method.
- `model.safetensors.index.json` should map (you can think of it like a symlink) the layer 92 weights → `mtp.safetensors` (technically, models like DeepSeek store these within the rest of the weights as an additional layer, but I suppose GLM simply decided to do it differently). The MTP weights are stored in a separate `mtp.safetensors` file, but during creation with the ModelOpt library it was not added to the index. Instead, we can automatically detect this case, warn the user about the invalid checkpoint, but still do the remapping online if we have `mtp.safetensors`.
- Enable `trtllm-gen` automatically, which has much better performance than the CUTLASS backend (requires FlashInfer 0.6.2, which will be released later), after a small fix in tiny support glm routing flashinfer-ai/flashinfer#2313.

Accuracy Tests
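The online remapping described above can be sketched as follows. The `model.safetensors.index.json` layout and the `mtp.safetensors` filename follow this PR's description; the function name, the layer id default, and the placeholder-entry mechanics are illustrative rather than the actual loader code:

```python
import json
import logging
import os

logger = logging.getLogger(__name__)


def patch_mtp_index(checkpoint_dir: str, mtp_layer_id: int = 92) -> dict:
    """Detect a checkpoint whose index omits the MTP layer and remap it online.

    If no weight for `model.layers.<mtp_layer_id>.` appears in the index but an
    mtp.safetensors file is present, the checkpoint is the "broken" packaging
    described in the PR: warn, then point the index at mtp.safetensors.
    """
    index_path = os.path.join(checkpoint_dir, "model.safetensors.index.json")
    with open(index_path) as f:
        index = json.load(f)
    weight_map = index["weight_map"]

    prefix = f"model.layers.{mtp_layer_id}."
    has_mtp_entries = any(name.startswith(prefix) for name in weight_map)
    mtp_file = os.path.join(checkpoint_dir, "mtp.safetensors")

    if not has_mtp_entries and os.path.exists(mtp_file):
        logger.warning(
            "Invalid checkpoint: mtp.safetensors exists but is not in the "
            "index; remapping MTP weights online."
        )
        # A real loader would read the tensor names from the safetensors
        # header; here a placeholder entry stands in for that step.
        weight_map[prefix + "__mtp_weights__"] = "mtp.safetensors"
    return index
```

The key point is that the checkpoint is left untouched on disk; only the in-memory index is patched, so a correctly packaged checkpoint takes the normal path with no warning.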
Benchmarking and Profiling
This is an example of a "broken" checkpoint. After our fix, it will run fine automatically: