
[0.13.0][cherry-pick][Bugfix] Fixed an accuracy problem of sp with eagle3#5814

Merged
wangxiyuan merged 1 commit into vllm-project:releases/v0.13.0 from drslark:v0.13.0
Jan 14, 2026

Conversation

@drslark
Contributor

@drslark drslark commented Jan 12, 2026

What this PR does / why we need it?

Fixed an accuracy problem when using eagle3 with sp.

The problem is described in #5825.

It also adds a more precise way to determine whether the drafter should use sp.

It also makes the drafter's eager mode truly eager on the frontend side, to avoid an fx-graph problem.

Does this PR introduce any user-facing change?

N/A

How was this patch tested?

For simplicity, we tested it as in #5825.

With the fix, eagle3 with sp produces the same results as eagle3 with sp disabled:

--------------------------------------------------
total_num_output_tokens: 1000
num_drafts: 437
num_draft_tokens: 1311
num_accepted_tokens: 564
mean acceptance length: 2.29
--------------------------------------------------
acceptance at token 0: 0.62
acceptance at token 1: 0.40
acceptance at token 2: 0.27
acceptance at token 3: 0.00
acceptance at token 4: 0.00
acceptance at token 5: 0.00
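The "more precise way to determine whether the drafter should use sp" presumably hinges on inspecting the draft model's own config rather than reusing the target's answer. A minimal sketch of that idea follows; the helper name and the probed attribute names are assumptions for illustration, not the actual vllm-ascend code:

```python
from types import SimpleNamespace

def is_moe(hf_config) -> bool:
    # Heuristic: MoE configs expose an expert count; the attribute name
    # varies by model family, so probe the common ones. (Assumed names.)
    return any(
        getattr(hf_config, attr, None)
        for attr in ("num_experts", "n_routed_experts", "num_local_experts")
    )

# Target and drafter can differ (e.g. MoE target, dense eagle3 drafter),
# so each forward context should consult its own model's config.
target_config = SimpleNamespace(n_routed_experts=64)  # stand-in for an MoE target
draft_config = SimpleNamespace()                      # stand-in for a dense drafter
print(is_moe(target_config), is_moe(draft_config))  # True False
```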

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request aims to fix an accuracy issue with the eagle3 speculative decoding method when sequence parallelism (sp) is enabled. The changes primarily involve replacing the torch.ops.vllm.maybe_pad_and_reduce operation with a new Python implementation split_inputs_tp_to_sp for handling sequence splitting in the EagleProposer. Additionally, it introduces logic to differentiate between the main and draft models when checking for MoE architecture to correctly configure the forward context. While reviewing, I found a critical logic error in how the model type is determined for the draft model, which I've detailed in a specific comment. The rest of the changes appear to be correct and aligned with the goal of the pull request.
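For intuition, here is a self-contained sketch of what a `split_inputs_tp_to_sp`-style helper does under a pad-then-chunk layout: pad the token dimension to a multiple of the TP world size, then take this rank's contiguous slice. The function name matches the PR, but the body is illustrative, not the actual vllm-ascend implementation:

```python
import torch

def split_inputs_tp_to_sp(hidden_states: torch.Tensor,
                          tp_size: int, tp_rank: int) -> torch.Tensor:
    """Pad the token dim to a multiple of tp_size, then return this
    rank's contiguous chunk (sequence-parallel layout)."""
    num_tokens = hidden_states.shape[0]
    padded = -(-num_tokens // tp_size) * tp_size  # ceil to multiple of tp_size
    if padded != num_tokens:
        pad = hidden_states.new_zeros((padded - num_tokens,
                                       *hidden_states.shape[1:]))
        hidden_states = torch.cat([hidden_states, pad], dim=0)
    chunk = padded // tp_size
    return hidden_states[tp_rank * chunk:(tp_rank + 1) * chunk]

x = torch.arange(10, dtype=torch.float32).reshape(5, 2)  # 5 tokens, hidden=2
local = split_inputs_tp_to_sp(x, tp_size=2, tp_rank=1)
print(local.shape)  # torch.Size([3, 2]): 5 tokens padded to 6, 6 / 2 = 3
```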

Comment thread vllm_ascend/ascend_forward_context.py Outdated
@drslark
Contributor Author

drslark commented Jan 12, 2026

/gemini review

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request addresses an accuracy issue with speculative decoding for eagle3 models by correctly handling different architectures for the main and drafter models, especially concerning Mixture-of-Experts (MoE) configurations. The changes introduce a new utility to detect if the drafter is an MoE model and update the sequence parallelism logic accordingly. The implementation of split_inputs_tp_to_sp is a key part of this fix. While the overall logic seems sound, I've identified a high-risk coding pattern in the split_inputs_tp_to_sp function where a tensor is modified in-place through aliasing, which could be a source of future bugs. My review includes a suggestion to make this operation safer.
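The aliasing hazard the bot describes can be shown in isolation: a tensor obtained via `narrow` shares storage with the buffer it came from, so a later in-place op on it silently mutates the buffer. This sketch is not the PR's code; `take_local_chunk` and its `copy` flag are hypothetical:

```python
import torch

def take_local_chunk(buf: torch.Tensor, tp_rank: int, chunk: int,
                     copy: bool = True) -> torch.Tensor:
    """Return this rank's chunk of a sequence-parallel buffer.

    With copy=False the result aliases `buf` -- the risky pattern the
    review flags. copy=True costs one copy but owns its storage.
    """
    view = buf.narrow(0, tp_rank * chunk, chunk)
    return view.clone() if copy else view

buf = torch.zeros(4, 2)
aliased = take_local_chunk(buf, tp_rank=1, chunk=2, copy=False)
aliased.fill_(1.0)  # mutates buf rows 2..4 through the alias
safe = take_local_chunk(buf, tp_rank=0, chunk=2, copy=True)
safe.fill_(7.0)     # buf rows 0..2 stay zero
print(buf)          # rows 0-1 zeros, rows 2-3 ones
```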

Comment thread vllm_ascend/spec_decode/eagle_proposer.py
@drslark
Contributor Author

drslark commented Jan 12, 2026

/gemini review

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request addresses an accuracy issue with sequence parallelism in eagle3. The main change involves replacing a custom operator with a new Python function, split_inputs_tp_to_sp, for handling tensor splitting in sequence parallelism. Additionally, the logic for detecting Mixture-of-Experts models is improved to correctly differentiate between the main and drafter models. The changes appear to correctly address the bug. My primary feedback concerns the implementation of the new split_inputs_tp_to_sp function, which is used for in-place modifications in a way that could be risky and hard to maintain. Please see my detailed comment.

Comment thread vllm_ascend/spec_decode/eagle_proposer.py
@drslark
Contributor Author

drslark commented Jan 12, 2026

/gemini review

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request fixes an accuracy issue with speculative decoding (eagle3) when sequence parallelism is enabled. The main changes involve correctly identifying if the drafter model is a Mixture-of-Experts (MoE) model and replacing a faulty reduction operation with a proper sequence splitting function (split_inputs_tp_to_sp). While the new function is conceptually correct, it has a bug where it fails to zero-out padding regions in the output tensor, which can lead to using stale data and causing accuracy problems. I've provided a critical comment with a suggested fix for this issue.
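The stale-padding failure mode is easy to reproduce in a toy helper: an uninitialized (or reused) output buffer keeps whatever bytes were already there unless the padding tail is explicitly zeroed. `pad_to_multiple` below is illustrative, not the PR's function; the `zero_()` line is the kind of fix the review suggests:

```python
import torch

def pad_to_multiple(hidden_states: torch.Tensor, multiple: int) -> torch.Tensor:
    """Pad the token dim up to a multiple of `multiple`, zeroing the tail."""
    num_tokens = hidden_states.shape[0]
    padded = -(-num_tokens // multiple) * multiple  # ceil
    out = hidden_states.new_empty((padded, *hidden_states.shape[1:]))
    out[:num_tokens] = hidden_states
    # new_empty leaves the tail uninitialized (and a reused buffer would
    # carry stale values); without this zeroing, downstream reductions
    # fold garbage into the result -- the accuracy bug described above.
    out[num_tokens:].zero_()
    return out

x = torch.ones(5, 4)
y = pad_to_multiple(x, multiple=4)
print(y.shape, float(y[5:].abs().sum()))  # torch.Size([8, 4]) 0.0
```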

Comment thread vllm_ascend/spec_decode/eagle_proposer.py
@drslark
Contributor Author

drslark commented Jan 12, 2026

/gemini review

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request aims to fix an accuracy issue when using sequence parallelism with eagle3 speculative decoding. The changes correctly differentiate between target and draft model architectures when checking for MoE models, which is a good improvement. However, I've identified a critical bug in the newly introduced split_inputs_tp_to_sp function. It is called in a way that leads to data corruption when sequence parallelism is active. I have provided a detailed comment and a suggested fix for this issue. It's also important to add tests to verify the fix and prevent future regressions, especially since the testing section in the PR description is marked as TODO.

Comment thread vllm_ascend/spec_decode/eagle_proposer.py
@drslark drslark force-pushed the v0.13.0 branch 3 times, most recently from ee29b88 to eb72ae8 on January 12, 2026 15:12
@github-actions
Contributor

This pull request has conflicts, please resolve those before we can evaluate the pull request.

@drslark drslark force-pushed the v0.13.0 branch 3 times, most recently from ba83f7a to fa73c7f on January 13, 2026 01:51
@drslark drslark force-pushed the v0.13.0 branch 3 times, most recently from 6767846 to e445290 on January 13, 2026 12:05
Signed-off-by: drslark <slarksblood@qq.com>
@wangxiyuan wangxiyuan added the ready (read for review) and ready-for-test (start test by label for PR) labels Jan 13, 2026
@wangxiyuan wangxiyuan changed the title from "[0.13.0][Bugfix] Fixed an accuracy problem of sp with eagle3" to "[0.13.0][cherry-pick][Bugfix] Fixed an accuracy problem of sp with eagle3" Jan 14, 2026
@wangxiyuan wangxiyuan merged commit 7716fc8 into vllm-project:releases/v0.13.0 Jan 14, 2026
20 checks passed
starmountain1997 pushed a commit to starmountain1997/vllm-ascend that referenced this pull request Jan 31, 2026
…gle3 (vllm-project#5814)

pick-from: vllm-project#5816

* vLLM version: v0.13.0
* vLLM main:
vllm-project/vllm@2f4e654

Signed-off-by: drslark <slarksblood@qq.com>