[Feat] Support async_scheduler and disable_padded_drafter_batch in eagle#4893
MengqingCao merged 3 commits into vllm-project:main
Conversation
Code Review
This pull request refactors eagle_proposer.py to support async_scheduler and disable_padded_drafter_batch, aligning it with recent changes in vLLM. The changes are extensive, involving new methods for preparing inputs and handling attention metadata. I've found a couple of critical issues that would cause runtime errors due to undefined variables. Please address these to ensure the new logic works as intended.
```python
# Currently we do not use pcp. This is used to adapt the pcp branch.
self.pcp_size = 0
self.backup_next_token_ids = CpuGpuBuffer(
    max_batch_size,
    dtype=torch.int32,
    pin_memory=is_pin_memory_available(),
    device=device,
    with_numpy=True,
)
```
The variable `max_batch_size` is used here without being defined, which will cause a `NameError` at runtime. It should be initialized from `vllm_config.scheduler_config.max_num_seqs` before being used.
Suggested change:

```diff
+ max_batch_size = vllm_config.scheduler_config.max_num_seqs
  # Currently we do not use pcp. This is used to adapt the pcp branch.
  self.pcp_size = 0
  self.backup_next_token_ids = CpuGpuBuffer(
      max_batch_size,
      dtype=torch.int32,
      pin_memory=is_pin_memory_available(),
      device=device,
      with_numpy=True,
  )
```
```diff
@@ -549,29 +435,28 @@ def _propose(
         # Compute the slot mapping.
         block_numbers = (clamped_positions_cpu // self.block_size)
```
The variable `clamped_positions_cpu` is used here but is not defined in this scope. This seems to be a leftover from refactoring and will cause a `NameError`. You probably meant to use `clamped_positions`, which is defined a few lines above and is on the correct device for the subsequent GPU operations.
Suggested change:

```diff
- block_numbers = (clamped_positions_cpu // self.block_size)
+ block_numbers = (clamped_positions // self.block_size)
```
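For context, the slot-mapping arithmetic this hunk feeds into can be sketched in isolation: a position is split into a logical block index and an in-block offset, then translated through the block table into a physical KV-cache slot. The `block_table` values below are made up for illustration.

```python
def compute_slot_mapping(positions, block_table, block_size):
    """Map token positions to physical KV-cache slots.

    slot = physical_block_id * block_size + offset_within_block
    """
    slots = []
    for pos in positions:
        block_number = pos // block_size   # logical block index
        block_offset = pos % block_size    # offset inside that block
        physical_block = block_table[block_number]
        slots.append(physical_block * block_size + block_offset)
    return slots

print(compute_slot_mapping([0, 1, 128, 130], block_table=[7, 3], block_size=128))
# [896, 897, 384, 386]
```

This is why the positions tensor must live on the same device as the block table: the division and gather both run on the GPU in the real code.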
👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:
If CI fails, you can run linting and testing checks locally according to Contributing and Testing.
`vllm_ascend/attention/utils.py` (Outdated)
```python
is_only_prefill=self.is_only_prefill,
graph_pad_size=-1,  # It should be -1 when not run in fullgraph mode.
num_input_tokens=num_actual_tokens,
cos=self.cos,
```
Check whether cos/sin need to be sliced.
Checked. In the current scenario, cos/sin will only be None, so this won't cause any errors. Slicing has been added for subsequent scenarios.
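A hedged sketch of the slicing guard described in this reply; the function name and the list-based caches are illustrative stand-ins for the real tensor caches.

```python
def maybe_slice_cos_sin(cos, sin, num_actual_tokens):
    # Current scenario: the caches are None, so there is nothing to slice
    # and the guard is a no-op.
    if cos is None or sin is None:
        return cos, sin
    # Subsequent scenarios: trim the precomputed rotary caches to the
    # actual token count of this batch.
    return cos[:num_actual_tokens], sin[:num_actual_tokens]

print(maybe_slice_cos_sin(None, None, 4))            # (None, None)
print(maybe_slice_cos_sin([1, 2, 3], [4, 5, 6], 2))  # ([1, 2], [4, 5])
```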
This pull request has conflicts, please resolve those before we can evaluate the pull request.
This pull request has conflicts, please resolve those before we can evaluate the pull request.
```diff
  if self.speculative_config:
      use_padded_batch_for_eagle = self.speculative_config and \
-         self.speculative_config.method == "mtp" and \
+         self.speculative_config.method in ("mtp", "eagle", "eagle3") and \
```
Use `self.speculative_config.use_eagle()` instead.
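The suggested helper roughly behaves like the sketch below. The exact method set covered by vLLM's `SpeculativeConfig.use_eagle()` varies by version, so the membership tuple here is an assumption mirroring the diff above, and the class is a stand-in.

```python
class SpeculativeConfig:
    """Illustrative stand-in for vLLM's SpeculativeConfig."""

    def __init__(self, method):
        self.method = method

    def use_eagle(self):
        # Assumption: mtp runs on the eagle code path, matching the
        # membership test in the diff above.
        return self.method in ("mtp", "eagle", "eagle3")

print(SpeculativeConfig("eagle3").use_eagle())  # True
print(SpeculativeConfig("ngram").use_eagle())   # False
```

Centralizing the check in one method keeps call sites from drifting when a new eagle-style method is added.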
### What this PR does / why we need it?

Currently, we are using `AscendRejctionSampler`, which extends `RejctionSampler`, in spec decoding. `AscendRejctionSampler` overrides `forward` of `RejctionSampler` only to replace the `rejection_sample` function. This prevents a lot of `RejctionSampler` code from being reused, for example:

- vllm-project/vllm#19482
- vllm-project/vllm#26060
- vllm-project/vllm#29223

#### Proposed Change:

- Delete `AscendRejctionSampler` and use `RejctionSampler` directly in the model runner.
- Patch `RejctionSampler.expand_batch_to_tokens` and `RejctionSampler.rejection_sample`; a better approach might be to make them custom ops.
- Modify `NPUModelRunner` following vllm-project/vllm#26060

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

- [x] test logits processor for spec decoding
- [x] test logprobs for spec decoding
- [x] test logprobs for spec decoding + async scheduling (tested with #4893)

- vLLM version: v0.12.0
- vLLM main: vllm-project/vllm@ad32e3e

Signed-off-by: realliujiaxu <realliujiaxu@163.com>
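The patching approach in the proposed change can be sketched as follows. The class and method bodies are illustrative stand-ins, not the real sampler: the point is that replacing a single method on the upstream class lets `forward` and the rest of the class be reused, unlike subclassing and overriding `forward` wholesale.

```python
class RejectionSampler:
    """Stand-in for the upstream sampler."""

    def rejection_sample(self, draft, target):
        return "cuda-kernel-path"

    def forward(self, draft, target):
        # Lots of reusable logic lives here in the real class
        # (logprobs, logits processors, ...).
        return self.rejection_sample(draft, target)

def npu_rejection_sample(self, draft, target):
    # NPU-specific replacement for the one method that differs.
    return "npu-kernel-path"

# Patch only that method; forward() and everything else stays shared.
RejectionSampler.rejection_sample = npu_rejection_sample

print(RejectionSampler().forward(None, None))  # npu-kernel-path
```

Registering the replacements as custom ops, as the description suggests, would achieve the same substitution without module-level monkey-patching.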
```python
)

# Create an LLM.
llm = LLM(
```
I just noticed you're using `LLM` in a test case. We must clean up NPU HBM when not using `VllmRunner`. Please modify this case to use `VllmRunner`, or clean up by hand if you keep it as is.
Please refer to https://github.com/vllm-project/vllm-ascend/blob/main/tests/e2e/conftest.py#L81
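The cleanup pattern being requested can be sketched as a context manager; `NpuRunner` and its internals below are hypothetical stand-ins for `VllmRunner`, whose actual teardown also releases device memory.

```python
import gc

class NpuRunner:
    """Illustrative stand-in for VllmRunner's cleanup behavior."""

    def __init__(self, model):
        self.llm = f"LLM({model})"  # placeholder for the real engine object

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        # Drop the engine and force collection so device memory (NPU HBM)
        # is released even if the test body raises.
        del self.llm
        gc.collect()
        return False

with NpuRunner("some-model") as runner:
    pass  # test body goes here
```

The `with` block guarantees teardown runs on every exit path, which is what a bare `LLM(...)` in a test does not.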
Signed-off-by: anon189Ty <Stari_Falcon@outlook.com>
Co-authored-by: drslark <slarksblood@qq.com> Signed-off-by: anon189Ty <Stari_Falcon@outlook.com>
What this PR does / why we need it?
We refactored eagle_proposer.py to adapt it to the framework of eagle.py in vLLM v0.12.0, in order to support the logic of the padded drafter batch and the async scheduler.
Does this PR introduce any user-facing change?
How was this patch tested?