[Feat] Support async_scheduler and disable_padded_drafter_batch in eagle#4893

Merged
MengqingCao merged 3 commits into vllm-project:main from anon189Ty:eagle_async
Dec 16, 2025

Conversation

@anon189Ty (Contributor) commented Dec 10, 2025

What this PR does / why we need it?

We refactored eagle_proposer.py to follow the framework of eagle.py in vllm v0.12.0, in order to support the padded drafter batch logic and the async scheduler.

Does this PR introduce any user-facing change?

How was this patch tested?

@gemini-code-assist bot left a comment

Code Review

This pull request refactors eagle_proposer.py to support async_scheduler and disable_padded_drafter_batch, aligning it with recent changes in vLLM. The changes are extensive, involving new methods for preparing inputs and handling attention metadata. I've found a couple of critical issues that would cause runtime errors due to undefined variables. Please address these to ensure the new logic works as intended.

Comment on lines +64 to +72
# Currently we do not use pcp. This is used to adapt the pcp branch.
self.pcp_size = 0
self.backup_next_token_ids = CpuGpuBuffer(
    max_batch_size,
    dtype=torch.int32,
    pin_memory=is_pin_memory_available(),
    device=device,
    with_numpy=True,
)
critical

The variable max_batch_size is used here without being defined, which will cause a NameError at runtime. It should be initialized from vllm_config.scheduler_config.max_num_seqs before being used.

Suggested change
    max_batch_size = vllm_config.scheduler_config.max_num_seqs
    # Currently we do not use pcp. This is used to adapt the pcp branch.
    self.pcp_size = 0
    self.backup_next_token_ids = CpuGpuBuffer(
        max_batch_size,
        dtype=torch.int32,
        pin_memory=is_pin_memory_available(),
        device=device,
        with_numpy=True,
    )

@@ -549,29 +435,28 @@ def _propose(

# Compute the slot mapping.
block_numbers = (clamped_positions_cpu // self.block_size)
critical

The variable clamped_positions_cpu is used here but it is not defined in this scope. This seems to be a leftover from refactoring and will cause a NameError. You probably meant to use clamped_positions, which is defined a few lines above and is on the correct device for the subsequent GPU operations.

Suggested change
    block_numbers = (clamped_positions // self.block_size)
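For context, the slot-mapping arithmetic being fixed here follows the usual paged-KV-cache pattern: a token at position `p` lives in physical block `block_table[p // block_size]` at offset `p % block_size`. A minimal, framework-free sketch (plain Python instead of torch tensors; `block_size` and the block table are made-up illustrative values, not taken from the PR):

```python
# Sketch of paged-attention slot mapping: a token at logical position p
# lives in physical block block_table[p // block_size] at offset
# p % block_size. All values here are illustrative only.
block_size = 4
block_table = [7, 2, 9]  # logical block index -> physical block id

def slot_for_position(pos: int) -> int:
    block_number = pos // block_size      # logical block index
    block_id = block_table[block_number]  # physical block id
    offset = pos % block_size             # offset within the block
    return block_id * block_size + offset

print([slot_for_position(p) for p in range(6)])  # [28, 29, 30, 31, 8, 9]
```

This also shows why the reviewed line must operate on the same-device tensor: the division result is immediately used to index the block table in the subsequent GPU ops.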

@github-actions

👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:

  • A PR should do only one thing; smaller PRs enable faster reviews.
  • Every PR should include unit tests and end-to-end tests to ensure it works and is not broken by future PRs.
  • Write the commit message by fulfilling the PR description to help reviewers and future developers understand.

If CI fails, you can run linting and testing checks locally according to Contributing and Testing.

@anon189Ty anon189Ty force-pushed the eagle_async branch 4 times, most recently from e68dae0 to db94642 Compare December 11, 2025 03:05
is_only_prefill=self.is_only_prefill,
graph_pad_size=-1, # It should be -1 when not run in fullgraph mode.
num_input_tokens=num_actual_tokens,
cos=self.cos,
Collaborator:
Check whether cos/sin need to be sliced.

@anon189Ty (author):
Checked. In the current scenario, cos/sin will only be None, so this won't cause any errors. Slicing has been added for subsequent scenarios.

@anon189Ty anon189Ty force-pushed the eagle_async branch 3 times, most recently from 4af6f05 to 364e8a0 Compare December 11, 2025 13:51
@MengqingCao MengqingCao added the `ready` (read for review) and `ready-for-test` (start test by label for PR) labels Dec 11, 2025
@anon189Ty anon189Ty force-pushed the eagle_async branch 2 times, most recently from 40f9983 to 9fa55d2 Compare December 12, 2025 08:11
@github-actions

This pull request has conflicts, please resolve those before we can evaluate the pull request.


if self.speculative_config:
    use_padded_batch_for_eagle = self.speculative_config and \
        self.speculative_config.method in ("mtp", "eagle", "eagle3") and \
Collaborator:

use self.speculative_config.use_eagle()

@anon189Ty (author):
Done.
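The reviewer's suggestion can be sketched with a stand-in config class. Hedged: `SpecConfig` and `use_padded_batch_for_eagle` below are illustrative mocks, not vLLM's real `SpeculativeConfig` API, and the assumed `use_eagle()` semantics simply mirror the method tuple in the diff above:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class SpecConfig:
    # Stand-in for vLLM's SpeculativeConfig; fields are illustrative.
    method: str
    disable_padded_drafter_batch: bool = False

    def use_eagle(self) -> bool:
        # Assumed semantics: true for EAGLE-style drafters, matching the
        # explicit tuple in the reviewed diff.
        return self.method in ("mtp", "eagle", "eagle3")

def use_padded_batch_for_eagle(cfg: Optional[SpecConfig]) -> bool:
    # The reviewed condition with the reviewer's suggestion applied:
    # one helper call instead of an inline method-name tuple.
    return (cfg is not None and cfg.use_eagle()
            and not cfg.disable_padded_drafter_batch)

print(use_padded_batch_for_eagle(SpecConfig("eagle3")))  # True
print(use_padded_batch_for_eagle(SpecConfig("ngram")))   # False
```

Centralizing the method check in `use_eagle()` means new EAGLE-style methods only need to be registered in one place.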

@anon189Ty anon189Ty force-pushed the eagle_async branch 2 times, most recently from 835fc47 to 2bda5de Compare December 13, 2025 13:56
@anon189Ty anon189Ty force-pushed the eagle_async branch 2 times, most recently from 56e2ea1 to e3be15f Compare December 13, 2025 16:45
@anon189Ty anon189Ty force-pushed the eagle_async branch 2 times, most recently from e101000 to 91ae862 Compare December 15, 2025 08:31
wangxiyuan pushed a commit that referenced this pull request Dec 16, 2025
### What this PR does / why we need it?

Currently, we are using `AscendRejectionSampler`, which extends `RejectionSampler`, in spec decoding. `AscendRejectionSampler` overrides `forward` of `RejectionSampler`, aiming only to replace the `rejection_sample` func. This means a lot of `RejectionSampler` code cannot be reused, for example:
- vllm-project/vllm#19482
- vllm-project/vllm#26060
- vllm-project/vllm#29223

#### Proposed Change:
- Delete `AscendRejectionSampler` and use `RejectionSampler` directly in the model runner.
- Patch `RejectionSampler.expand_batch_to_tokens` and `RejectionSampler.rejection_sample`; a better way might be to make them custom ops.
- Modify `NPUModelRunner` following
vllm-project/vllm#26060

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
- [x] test logits processor for spec decoding
- [x] test logprobs for spec decoding
- [x] test logprobs for spec decoding + async scheduling (tested with
#4893)


- vLLM version: v0.12.0
- vLLM main:
vllm-project/vllm@ad32e3e

---------

Signed-off-by: realliujiaxu <realliujiaxu@163.com>
)

# Create an LLM.
llm = LLM(
Collaborator:
I just noticed you're using LLM in a test case; we must clean up NPU HBM when not using VllmRunner. Please modify this case to use VllmRunner, or clean up by hand if you keep it as is.
Please refer to https://github.com/vllm-project/vllm-ascend/blob/main/tests/e2e/conftest.py#L81
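The cleanup pattern the reviewer asks for amounts to a context-manager fixture that releases device memory even when a test fails partway. A generic, hedged sketch: `DummyLLM` and `release_memory` are hypothetical stand-ins (the project's real fixture is the linked VllmRunner in conftest.py, which also empties the NPU cache):

```python
import gc
from contextlib import contextmanager

class DummyLLM:
    # Hypothetical stand-in for the real LLM object under test.
    def __init__(self):
        self.alive = True
    def generate(self, prompt: str) -> str:
        return prompt.upper()

released = []

def release_memory(llm: DummyLLM) -> None:
    # Hypothetical stand-in for device-memory cleanup; a real fixture
    # would also empty the NPU/HBM cache here.
    llm.alive = False
    released.append(llm)
    gc.collect()

@contextmanager
def runner():
    # Fixture pattern: cleanup runs in `finally`, so memory is released
    # even if the test body raises.
    llm = DummyLLM()
    try:
        yield llm
    finally:
        release_memory(llm)

with runner() as llm:
    out = llm.generate("hello")
print(out, released[0].alive)  # HELLO False
```

The point of the `try`/`finally` shape is that a failing assertion inside the `with` block still triggers cleanup, which a bare `llm = LLM(...)` in a test does not guarantee.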

Signed-off-by: anon189Ty <Stari_Falcon@outlook.com>
Co-authored-by: drslark <slarksblood@qq.com>
Signed-off-by: anon189Ty <Stari_Falcon@outlook.com>
@MengqingCao MengqingCao merged commit 5b1da4e into vllm-project:main Dec 16, 2025
14 of 16 checks passed
chenaoxuan pushed a commit to chenaoxuan/vllm-ascend that referenced this pull request Dec 20, 2025
chenaoxuan pushed a commit to chenaoxuan/vllm-ascend that referenced this pull request Dec 20, 2025
ZRJ026 pushed a commit to ZRJ026/vllm-ascend that referenced this pull request Feb 28, 2026
ZRJ026 pushed a commit to ZRJ026/vllm-ascend that referenced this pull request Feb 28, 2026
ZRJ026 pushed a commit to ZRJ026/vllm-ascend that referenced this pull request Mar 4, 2026
ZRJ026 pushed a commit to ZRJ026/vllm-ascend that referenced this pull request Mar 4, 2026

Labels

module:tests, ready (read for review), ready-for-test (start test by label for PR)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants