[Bugfix] Fix the acceptance rate drop issue when applying eagle3 to QuaRot model #6914
Conversation
Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request resolves a performance degradation observed in speculative decoding setups involving Eagle3 and QuaRot quantized models. By ensuring that the draft model's fully connected layer weights are correctly rotationally quantized, the change aims to restore expected acceptance rates and maintain the efficiency of the quantized model inference pipeline.

Highlights
Activity
👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:
If CI fails, you can run linting and testing checks locally according to Contributing and Testing.
Code Review
This pull request fixes a bug causing a drop in acceptance rates for speculative decoding when using a rotationally quantized (QuaRot) model with an Eagle3 draft model. The fix involves applying the same rotational quantization to the draft model's fully-connected layer weights during loading. The approach is correct, but I've identified two critical areas for improvement to enhance robustness: adding error handling for loading the rotation matrix file and making the layer selection for weight rotation more specific to prevent accidentally modifying incorrect layers.
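The two robustness suggestions above can be sketched as follows. This is a minimal illustration, not the PR's actual code: the function names (`load_rotation_matrix`, `is_draft_fc_weight`), the `.npy` file format, and the parameter naming convention are all assumptions.

```python
import os
import numpy as np

def load_rotation_matrix(path):
    # Suggestion 1: fail fast with a clear message when the QuaRot
    # rotation file is missing or unreadable, instead of letting a raw
    # traceback surface deep inside model loading.
    if not os.path.isfile(path):
        raise FileNotFoundError(
            f"QuaRot rotation matrix not found at {path!r}; "
            "the draft fc weight cannot be rotated without it.")
    try:
        return np.load(path)
    except (OSError, ValueError) as exc:
        raise RuntimeError(
            f"failed to load rotation matrix from {path!r}") from exc

def is_draft_fc_weight(name):
    # Suggestion 2: match only the draft model's fc projection weight,
    # not any parameter whose name merely contains "fc" (naming here is
    # hypothetical).
    return name == "fc.weight" or name.endswith(".fc.weight")
```

The exact-match-or-suffix check avoids accidentally rotating unrelated layers such as a hypothetical `myfc.weight`.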
As per the repository's style guide, here is a suggested title for the pull request:
Suggested PR Title:
[Quantization][BugFix] Fix acceptance rate drop issue when applying eagle3 to QuaRot model
…QuaRot model Signed-off-by: zhaomingyu <zhaomingyu13@h-partners.com>
# Dynamically replace the `load_weights` function at runtime,
# and fix `target_config` into the new implementation with a closure.
# Future Plan:
# Remove this patch when vLLM merges the PR.
Could you just put the PR link here?
…QuaRot model (vllm-project#6914) ### What this PR does / why we need it? When using the target model after rotational quantization, the acceptance rate decreases because the fc weight of the draft model has not undergone rotational quantization (issue: vllm-project#6445). We fixed this issue by performing rotational quantization on the fc weight of the draft model in the same way as the main model when loading the draft model. - vLLM version: v0.16.0 - vLLM main: vllm-project/vllm@15d76f7 Signed-off-by: zhaomingyu <zhaomingyu13@h-partners.com>
### What this PR does / why we need it? Add an e2e test for the QuaRot model with eagle3 that runs both the QuaRot model and the float model, and then compares their acceptance rates. The QuaRot model eagle3 adaptation PRs: #6914, #7038. - vLLM version: v0.16.0 - vLLM main: vllm-project/vllm@4034c3d Signed-off-by: zhaomingyu <zhaomingyu13@h-partners.com>
…el (vllm-project#6914) Cherry-pick from upstream main 52d9086. Perform rotational quantization on the fc weight of the draft model in the same way as the main model when loading the draft model.
What this PR does / why we need it?
When using the target model after rotational quantization, the acceptance rate decreases because the fc weight of the draft model has not undergone rotational quantization (issue: #6445). We fixed this issue by performing rotational quantization on the fc weight of the draft model in the same way as the main model when loading the draft model.
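Why rotating the draft fc weight restores the acceptance rate can be shown with a small numeric sketch. The shapes and the random orthogonal matrix are illustrative assumptions (QuaRot typically uses Hadamard-style transforms), but the invariance argument is the same: if the rotated target model hands the draft activations of the form `R.T @ x`, the draft fc weight must be rotated to `W @ R` so its output is unchanged.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8
W = rng.standard_normal((4, d))   # toy draft-model fc weight
x = rng.standard_normal(d)        # toy hidden state from the target model

# Hypothetical orthogonal rotation R; the Q factor of a QR decomposition
# stands in for the real QuaRot transform.
R, _ = np.linalg.qr(rng.standard_normal((d, d)))

# (W R)(R.T x) = W (R R.T) x = W x, since R is orthogonal. Without the
# weight rotation, the draft model would see rotated inputs with an
# unrotated weight, producing mismatched drafts and a lower acceptance rate.
y_plain = W @ x
y_rotated = (W @ R) @ (R.T @ x)
print(np.allclose(y_plain, y_rotated))  # True
```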
Does this PR introduce any user-facing change?
The bug was previously worked around with the tool described in issue #5974. If your version already includes this PR, please use the original eagle3 weights matching a target model quantized by the new version of modelslim, and do not use that tool.
How was this patch tested?