[0.13.0][cherry-pick][Bugfix] Fixed an accuracy problem of sp with eagle3 #5814
Conversation
Code Review
This pull request aims to fix an accuracy issue with the eagle3 speculative decoding method when sequence parallelism (sp) is enabled. The changes primarily involve replacing the torch.ops.vllm.maybe_pad_and_reduce operation with a new Python implementation split_inputs_tp_to_sp for handling sequence splitting in the EagleProposer. Additionally, it introduces logic to differentiate between the main and draft models when checking for MoE architecture to correctly configure the forward context. While reviewing, I found a critical logic error in how the model type is determined for the draft model, which I've detailed in a specific comment. The rest of the changes appear to be correct and aligned with the goal of the pull request.
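The review above describes replacing a custom reduce op with a Python-level sequence-splitting helper. As a rough illustration of what "splitting inputs from tensor parallelism to sequence parallelism" means, here is a minimal sketch; the signature and padding behavior are assumptions for illustration, not the actual `split_inputs_tp_to_sp` from the PR, and plain lists stand in for tensors.

```python
# Hypothetical sketch: pad the token sequence so its length is divisible
# by the tensor-parallel world size, then hand each rank its contiguous
# shard (the sequence-parallel layout). Names/signature are assumptions.

def split_inputs_tp_to_sp(tokens, tp_size, tp_rank, pad_value=0):
    """Return this rank's shard of the padded sequence."""
    pad_len = (-len(tokens)) % tp_size          # tokens needed to reach a multiple
    padded = tokens + [pad_value] * pad_len     # zero-pad the tail
    shard = len(padded) // tp_size              # per-rank shard length
    return padded[tp_rank * shard:(tp_rank + 1) * shard]
```

With a 5-token sequence and `tp_size=2`, rank 0 receives the first three tokens and rank 1 receives the last two plus one padding token.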
/gemini review
Code Review
This pull request addresses an accuracy issue with speculative decoding for eagle3 models by correctly handling different architectures for the main and drafter models, especially concerning Mixture-of-Experts (MoE) configurations. The changes introduce a new utility to detect if the drafter is an MoE model and update the sequence parallelism logic accordingly. The implementation of split_inputs_tp_to_sp is a key part of this fix. While the overall logic seems sound, I've identified a high-risk coding pattern in the split_inputs_tp_to_sp function where a tensor is modified in-place through aliasing, which could be a source of future bugs. My review includes a suggestion to make this operation safer.
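The aliasing hazard flagged in this review is easy to reproduce in miniature. The snippet below is illustrative only (plain lists stand in for tensors; the function names are made up, not the PR's code): writing through a view that aliases a shared buffer silently mutates the original, while an explicit copy keeps the caller's data intact.

```python
# Demonstration of the in-place-through-alias pattern vs. the safer copy.

def unsafe_fill(buffer, shard):
    view = buffer                # alias, not a copy
    view[:len(shard)] = shard    # mutates the caller's buffer
    return view

def safe_fill(buffer, shard):
    out = list(buffer)           # explicit copy breaks the alias
    out[:len(shard)] = shard
    return out
```

In PyTorch the same distinction is between slicing into a tensor (a view) and calling `.clone()` first; hidden mutation through a view is exactly the "source of future bugs" the review warns about.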
/gemini review
Code Review
This pull request addresses an accuracy issue with sequence parallelism in eagle3. The main change involves replacing a custom operator with a new Python function, split_inputs_tp_to_sp, for handling tensor splitting in sequence parallelism. Additionally, the logic for detecting Mixture-of-Experts models is improved to correctly differentiate between the main and drafter models. The changes appear to correctly address the bug. My primary feedback concerns the implementation of the new split_inputs_tp_to_sp function, which is used for in-place modifications in a way that could be risky and hard to maintain. Please see my detailed comment.
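This review notes that MoE detection must distinguish the main model from the drafter. A hedged sketch of that distinction follows; the config attribute names (`num_experts`, `draft_model_config`, etc.) are stand-ins chosen for illustration, not the actual vLLM config schema used by the PR.

```python
from types import SimpleNamespace

def is_moe_config(hf_config) -> bool:
    # Using `num_experts` as the MoE marker is an assumption for illustration.
    return getattr(hf_config, "num_experts", 0) > 1

def use_moe_forward_context(vllm_config, is_drafter: bool) -> bool:
    # Read the drafter's own config when proposing draft tokens,
    # not the target model's -- the bug was checking the wrong one.
    cfg = (vllm_config.speculative_config.draft_model_config
           if is_drafter else vllm_config.model_config)
    return is_moe_config(cfg.hf_config)

# Stub configs: an MoE target model paired with a dense eagle3 drafter.
main = SimpleNamespace(hf_config=SimpleNamespace(num_experts=8))
draft = SimpleNamespace(hf_config=SimpleNamespace())
cfg = SimpleNamespace(
    model_config=main,
    speculative_config=SimpleNamespace(draft_model_config=draft),
)
```

With these stubs, the forward context is configured for MoE only when running the target model, which matches the differentiation the review describes.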
/gemini review
Code Review
This pull request fixes an accuracy issue with speculative decoding (eagle3) when sequence parallelism is enabled. The main changes involve correctly identifying if the drafter model is a Mixture-of-Experts (MoE) model and replacing a faulty reduction operation with a proper sequence splitting function (split_inputs_tp_to_sp). While the new function is conceptually correct, it has a bug where it fails to zero-out padding regions in the output tensor, which can lead to using stale data and causing accuracy problems. I've provided a critical comment with a suggested fix for this issue.
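The stale-padding failure mode this review describes can be shown in a few lines. This is an illustrative sketch (lists stand in for a reused output tensor; the function is hypothetical): if the buffer is recycled across steps and the tail past the valid length is not zeroed, old values leak into the next computation.

```python
# Writing a shard into a reused buffer: the padding tail must be
# cleared explicitly, otherwise positions past len(shard) keep stale
# data from a previous call -- the accuracy bug the review points out.

def write_shard(out, shard, pad_value=0):
    out[:len(shard)] = shard
    # Zero the tail; deleting this loop reproduces the stale-data bug.
    for i in range(len(shard), len(out)):
        out[i] = pad_value
    return out
```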
/gemini review
Code Review
This pull request aims to fix an accuracy issue when using sequence parallelism with eagle3 speculative decoding. The changes correctly differentiate between target and draft model architectures when checking for MoE models, which is a good improvement. However, I've identified a critical bug in the newly introduced split_inputs_tp_to_sp function. It is called in a way that leads to data corruption when sequence parallelism is active. I have provided a detailed comment and a suggested fix for this issue. It's also important to add tests to verify the fix and prevent future regressions, especially since the testing section in the PR description is marked as TODO.
This pull request has conflicts, please resolve those before we can evaluate the pull request.
Signed-off-by: drslark <slarksblood@qq.com>
…gle3 (vllm-project#5814)

### What this PR does / why we need it?
Fixed an accuracy problem when using eagle3 with sp. The problem is described in vllm-project#5825. It also adds a more precise way to determine whether the drafter should use `sp` or not, and changes the drafter's `eager` mode to be a real `eager` in the frontend to avoid an `fx-graph` problem.

### Does this PR introduce _any_ user-facing change?
N/A

### How was this patch tested?
For simplicity, we test it as in vllm-project#5825, and we get the same result as `eagle3` with `sp` disabled.

```text
--------------------------------------------------
total_num_output_tokens: 1000
num_drafts: 437
num_draft_tokens: 1311
num_accepted_tokens: 564
mean acceptance length: 2.29
--------------------------------------------------
acceptance at token 0: 0.62
acceptance at token 1: 0.40
acceptance at token 2: 0.27
acceptance at token 3: 0.00
acceptance at token 4: 0.00
acceptance at token 5: 0.00
```

pick-from: vllm-project#5816

* vLLM version: v0.13.0
* vLLM main: vllm-project/vllm@2f4e654

Signed-off-by: drslark <slarksblood@qq.com>
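The acceptance statistics in the test output above are internally consistent, which is a quick sanity check on the reported numbers. Assuming the usual definition of mean acceptance length as one bonus token plus accepted draft tokens per draft (an assumption about the metric, consistent with the reported 2.29):

```python
# Numbers taken directly from the PR's test output.
num_drafts = 437
num_draft_tokens = 1311
num_accepted_tokens = 564

# 1311 / 437 = 3.0: each draft step proposed exactly 3 tokens.
tokens_per_draft = num_draft_tokens / num_drafts

# Assumed metric: 1 bonus token + accepted draft tokens per draft.
mean_acceptance_length = 1 + num_accepted_tokens / num_drafts
print(tokens_per_draft, round(mean_acceptance_length, 2))  # -> 3.0 2.29
```

That the recomputed value matches the reported 2.29 supports the claim that eagle3 with sp now reproduces the sp-disabled result.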