
[P/D][v0.16.0]Adapt to RecomputeScheduler in vLLM 0.16.0#6898

Merged
wangxiyuan merged 3 commits into vllm-project:main from wangxiaoteng888:recompute_scheduler_16
Mar 2, 2026

Conversation

Contributor

@wangxiaoteng888 wangxiaoteng888 commented Mar 2, 2026

What this PR does / why we need it?

Adapt the recompute feature to vLLM 0.16.0, where the D node forwards recompute requests to the P node.

Does this PR introduce any user-facing change?

No

How was this patch tested?

By ci

Signed-off-by: wangxiaoteng <wangxiaoteng@huawei.com>
Contributor

github-actions bot commented Mar 2, 2026

👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:

  • A PR should do only one thing; smaller PRs enable faster reviews.
  • Every PR should include unit tests and end-to-end tests to ensure it works and is not broken by future PRs.
  • Write the commit message by filling in the PR description, to help reviewers and future developers understand the change.

If CI fails, you can run linting and testing checks locally according to Contributing and Testing.

@gemini-code-assist
Contributor

Summary of Changes

Hello, I'm Gemini Code Assist[1]! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly refactors the RecomputeScheduler to improve its efficiency and extend its capabilities. Key changes include simplifying the request lifecycle by removing outdated KV cache transfer mechanisms, introducing explicit support for Mamba models with block-aligned cache splitting, and enhancing the tracking of prefix cache statistics. The update also modernizes type hints and strengthens the handling of request states, ensuring more reliable operation during model execution and asynchronous KV transfers.

Highlights

  • Simplified Request Management: Removed specialized __init__ and add_request logic related to PLACEHOLDER_TOKEN_ID and is_mtp_kv_consumer, streamlining how requests are initialized and added to the scheduler.
  • Mamba Block-Aligned Cache Support: Integrated new logic (_mamba_block_aligned_split) within the scheduling process to support block-aligned chunking for Mamba models, including a new reason for unschedulable requests.
  • Enhanced Prefix Cache Statistics: Introduced new variables (connector_prefix_cache_queries, connector_prefix_cache_hits) and recording logic to track prefix cache performance more comprehensively.
  • Improved Request State Handling: Refined the update_from_output method to robustly handle finished or aborted requests, especially in asynchronous KV transfer scenarios, and standardized the use of request_id for consistency.
  • Type Hint Modernization: Updated the return type of schedule and the scheduler_output parameter type in update_from_output to use the more general SchedulerOutput.
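The Mamba block-aligned cache splitting mentioned above can be illustrated with a small sketch. The helper name, signature, and block-size parameter here are illustrative stand-ins, not the actual `_mamba_block_aligned_split` implementation in vllm_ascend:

```python
def block_aligned_split(num_scheduled_tokens: int, block_size: int) -> int:
    """Round a chunk of tokens down to a multiple of block_size.

    Mamba-style state-space caches are persisted per block, so a partial
    prefill chunk must end on a block boundary. A chunk that aligns down
    to zero tokens cannot be scheduled this step, which corresponds to
    the new "unschedulable" reason added in this PR.
    """
    return (num_scheduled_tokens // block_size) * block_size


# A 300-token chunk with 128-token blocks is trimmed to 256 tokens;
# a 100-token chunk aligns down to 0, i.e. unschedulable this step.
```

This is only a sketch of the alignment arithmetic; the real scheduler also has to re-queue the trimmed remainder of the request.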


Changelog
  • vllm_ascend/core/recompute_scheduler.py
    • Removed PLACEHOLDER_TOKEN_ID import and related __init__ and add_request methods.
    • Updated schedule method return type to SchedulerOutput.
    • Integrated _mamba_block_aligned_split for Mamba block-aligned cache handling.
    • Added a new unschedulable reason for Mamba models.
    • Standardized request_id usage for dictionary operations.
    • Modified spec_token_ids trimming logic.
    • Added logic to skip requests waiting for streaming.
    • Initialized and recorded connector_prefix_cache_stats.
    • Simplified num_new_tokens calculation.
    • Removed redundant spec_token_ids processing for waiting requests.
    • Used any_request_id for common prefix block retrieval.
    • Updated update_from_output signature and added is_finished() check.
    • Ensured generated_token_ids exist before processing scheduled_spec_token_ids.
    • Refactored routed_experts and finish_reason handling.
    • Added num_external_computed_tokens to EngineCoreOutput.
Activity
  • No specific activity was provided in the pull request description.
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

| Feature | Command | Description |
| --- | --- | --- |
| Code Review | /gemini review | Performs a code review for the current pull request in its current state. |
| Pull Request Summary | /gemini summary | Provides a summary of the current pull request in its current state. |
| Comment | @gemini-code-assist | Responds in comments when explicitly tagged, both in pull request comments and review comments. |
| Help | /gemini help | Displays a list of available commands. |

Customization

To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request updates the RecomputeScheduler to align with recent changes in the upstream vLLM repository. The changes include removing outdated MTP KV consumer logic, adding support for Mamba block-aligned splits, and various refactorings for streaming support and speculative decoding.

I've found a critical issue where type hints for schedule and update_from_output were changed to the base SchedulerOutput type. This will cause a runtime AttributeError in update_from_output as it accesses attributes specific to the RecomputeSchedulerOutput subclass. I've added comments to revert these type hints to fix the bug.

Per the repository style guide, here are suggestions for the pull request title and summary:

Suggested PR Title:

[Core][Update] Align RecomputeScheduler with upstream vLLM changes

Suggested PR Summary:

### What this PR does / why we need it?

This PR updates `RecomputeScheduler` to align with recent changes in vLLM (likely for v0.16.0 compatibility). The main changes are:

- Removed outdated MTP KV consumer logic and placeholder token handling for speculative decoding.
- Added support for Mamba block-aligned splits.
- Refactored request ID handling for improved readability.
- Updated logic to support streaming requests.
- Adjusted handling of stopped requests and speculative decoding statistics.

These changes are necessary to keep the forked scheduler compatible with the latest vLLM core logic.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

CI passed with new added/existing test.

 def update_from_output(
     self,
-    scheduler_output: RecomputeSchedulerOutput,
+    scheduler_output: SchedulerOutput,
Contributor
critical

The type hint for scheduler_output has been changed to SchedulerOutput. However, the method body at line 673 accesses scheduler_output.recomputed_reqs, which is an attribute specific to the RecomputeSchedulerOutput subclass. This will cause an AttributeError at runtime because SchedulerOutput does not have this attribute. To fix this bug, the type hint should be reverted to RecomputeSchedulerOutput.

Suggested change
- scheduler_output: SchedulerOutput,
+ scheduler_output: RecomputeSchedulerOutput,

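The failure mode the reviewer describes can be reproduced in a few lines. The classes below are illustrative stand-ins for vLLM's real `SchedulerOutput` and the `RecomputeSchedulerOutput` subclass in `vllm_ascend/core/recompute_scheduler.py`; the point is that accessing a subclass-only attribute through a base-typed parameter only works if every caller actually passes the subclass:

```python
from dataclasses import dataclass, field


@dataclass
class SchedulerOutput:
    # Stand-in for the upstream base class.
    scheduled_new_reqs: list = field(default_factory=list)


@dataclass
class RecomputeSchedulerOutput(SchedulerOutput):
    # Subclass-only field, analogous to recomputed_reqs in the PR.
    recomputed_reqs: list = field(default_factory=list)


def update_from_output(scheduler_output: SchedulerOutput) -> list:
    # The hint says base class, but the body needs the subclass field:
    # fine for RecomputeSchedulerOutput, AttributeError for the base.
    return scheduler_output.recomputed_reqs


update_from_output(RecomputeSchedulerOutput(recomputed_reqs=["req-1"]))  # ok
try:
    update_from_output(SchedulerOutput())
except AttributeError as e:
    print(e)  # prints the AttributeError message for the missing attribute
```

This is why the review asks to keep the hint as `RecomputeSchedulerOutput`: the narrower hint documents the real requirement and lets a type checker catch base-class callers statically instead of failing at runtime.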
Comment thread vllm_ascend/core/recompute_scheduler.py Outdated
 request.record_event(EngineCoreEventType.QUEUED)

-def schedule(self) -> RecomputeSchedulerOutput:
+def schedule(self) -> SchedulerOutput:
Contributor

high

The return type hint for schedule has been changed to SchedulerOutput. While this matches the base class, the method returns a RecomputeSchedulerOutput instance, and the specific fields of this subclass are used in update_from_output. Changing the type hint in update_from_output to SchedulerOutput introduces a bug. To maintain consistency and correctness, it's best to revert this change and use the more specific RecomputeSchedulerOutput type.

Suggested change
- def schedule(self) -> SchedulerOutput:
+ def schedule(self) -> RecomputeSchedulerOutput:
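Reverting the return hint is safe because return types are covariant: an override may declare a more specific return type than the base method, and type checkers accept it. A minimal sketch with illustrative class names (not the real vLLM definitions):

```python
class SchedulerOutput:
    """Stand-in for the upstream base output type."""


class RecomputeSchedulerOutput(SchedulerOutput):
    def __init__(self) -> None:
        self.recomputed_reqs: list[str] = []


class Scheduler:
    def schedule(self) -> SchedulerOutput:
        return SchedulerOutput()


class RecomputeScheduler(Scheduler):
    # Covariant return: narrowing to the subclass is accepted by mypy
    # and consistent with the Liskov substitution principle, and lets
    # callers reach recomputed_reqs without a cast or isinstance check.
    def schedule(self) -> RecomputeSchedulerOutput:
        return RecomputeSchedulerOutput()


out = RecomputeScheduler().schedule()
out.recomputed_reqs.append("req-1")
```

So keeping `-> RecomputeSchedulerOutput` loses nothing in compatibility with the base `Scheduler` interface while preserving static access to the subclass fields.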

@wangxiaoteng888 wangxiaoteng888 changed the title update_recompute_for_16 [recompute][0.16.0]Adapt to RecomputeScheduler in vLLM 0.16.0 Mar 2, 2026
@wangxiaoteng888 wangxiaoteng888 changed the title [recompute][0.16.0]Adapt to RecomputeScheduler in vLLM 0.16.0 [P/D][0.16.0]Adapt to RecomputeScheduler in vLLM 0.16.0 Mar 2, 2026
@wangxiaoteng888 wangxiaoteng888 changed the title [P/D][0.16.0]Adapt to RecomputeScheduler in vLLM 0.16.0 [P/D][v0.16.0]Adapt to RecomputeScheduler in vLLM 0.16.0 Mar 2, 2026
Signed-off-by: wangxiaoteng <wangxiaoteng@huawei.com>
Comment thread vllm_ascend/core/recompute_scheduler.py Outdated
 request.record_event(EngineCoreEventType.QUEUED)

-def schedule(self) -> RecomputeSchedulerOutput:
+def schedule(self) -> SchedulerOutput:
Collaborator
Should return RecomputeSchedulerOutput

Contributor Author
OK, I will fix.

Signed-off-by: wangxiaoteng <wangxiaoteng@huawei.com>
@weiguihua2 weiguihua2 added the ready (read for review) and ready-for-test (start test by label for PR) labels Mar 2, 2026
@wangxiyuan wangxiyuan merged commit dfa9ff7 into vllm-project:main Mar 2, 2026
59 of 60 checks passed
845473182 pushed a commit to 845473182/vllm-ascend that referenced this pull request Mar 5, 2026
…to qwen3next_graph

* 'main' of https://github.com/vllm-project/vllm-ascend: (40 commits)
  [Feature] Add docs of batch invariance and make some extra operators patch (vllm-project#6910)
  [bugfix]Qwen2.5VL accurate question (vllm-project#6975)
  [CI] Add DeepSeek-V3.2 large EP nightly ci (vllm-project#6378)
  [Ops][BugFix] Fix RoPE shape mismatch for mtp models with flashcomm v1 enabled (vllm-project#6939)
  [bugfix]fix file not found error in nightly of single-node (vllm-project#6976)
  [Bugfix] Fix the acceptance rates dorp issue when applying eagle3 to QuaRot model (vllm-project#6914)
  [CI] Enable auto upgrade e2e estimated time for auto-partition suites (vllm-project#6840)
  [Doc][Misc] Fix msprobe_guide.md documentation issues (vllm-project#6965)
  [Nightly][Refactor]Migrate nightly single-node model tests from `.py` to `.yaml` (vllm-project#6503)
  [BugFix] Improve GDN layer detection for multimodal models (vllm-project#6941)
  [feat]ds3.2 pcp support mtp and chunkprefill (vllm-project#6917)
  [CPU binding] Implement global CPU slicing and improve IRQ binding for Ascend NPUs (vllm-project#6945)
  [Triton] Centralize Ascend extension op dispatch in triton_utils (vllm-project#6937)
  [csrc][bugfix] Add compile-time Ascend950/910_95 compatibility for custom ops between CANN8.5 and 9.0 (vllm-project#6936)
  [300I][Bugfix] fix unquant model weight nd2nz error (vllm-project#6851)
  [doc] fix supported_models (vllm-project#6930)
  [CI] nightly test timeout (vllm-project#6912)
  [CI] Upgrade CANN to 8.5.1 (vllm-project#6897)
  [Model]Add Qwen3-Omni quantization Ascend NPU adaptation and optimization (vllm-project#6828)
  [P/D][v0.16.0]Adapt to RecomputeScheduler in vLLM 0.16.0 (vllm-project#6898)
  ...
LCAIZJ pushed a commit to LCAIZJ/vllm-ascend that referenced this pull request Mar 7, 2026
…t#6898)

### What this PR does / why we need it?
Adapt the recompute feature to vLLM 0.16.0, where the D node forwards
recompute requests to the P node.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
By ci
- vLLM version: v0.16.0
- vLLM main:
vllm-project/vllm@15d76f7

---------

Signed-off-by: wangxiaoteng <wangxiaoteng@huawei.com>
