[V1][Hybrid] Enable spec decode and optimize block-aligned split in mamba cache align mode #33024
peakcrosser7 wants to merge 8 commits into vllm-project:main
Conversation
Signed-off-by: huanghaoyan.hhy <huanghaoyan.hhy@alibaba-inc.com>
Code Review
This pull request re-enables speculative decoding for Mamba with align cache mode and refactors the block-aligned splitting logic to better support resumed requests. The changes are logical and well-structured, particularly the introduction of the _mamba_compute_cache_pos helper function. However, I've identified a potential issue in the calculation of the last cacheable position for Eagle mode, which could lead to performance degradation due to cache misses under specific conditions.
heheda12345
left a comment
Is this for correctness or for a higher cache hit rate?
For the cache hit rate of resumed requests, I think we only need to ensure that the state of the last prefill chunk was computed before, so that it is already cached if external storage exists and we don't need to force a recompute at this position. Try to keep the code simple :)
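To make the block-alignment idea above concrete, here is a hypothetical sketch (this is not the actual `_mamba_compute_cache_pos` implementation from the PR; `block_size` and `last_aligned_pos` are illustrative names): the mamba state can only be cached at block boundaries, so the last cacheable position is the largest block-aligned offset not exceeding the computed-token count.

```python
# Hypothetical sketch of block-aligned cacheable positions; not the actual
# vLLM helper from this PR.
def last_aligned_pos(num_computed_tokens: int, block_size: int) -> int:
    # Largest multiple of block_size that does not exceed the number of
    # computed tokens, i.e. the last position at which state can be cached.
    return (num_computed_tokens // block_size) * block_size

assert last_aligned_pos(37, 16) == 32  # last full block boundary
assert last_aligned_pos(32, 16) == 32  # already aligned
assert last_aligned_pos(7, 16) == 0    # no complete block yet
```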
Added the `ready` label for the large unit tests.
The changes are for both correctness and cache hit rate.
# * resumed requests: num_computed_tokens < (
#     num_prompt_tokens + num_output_tokens
# )
if num_computed_tokens < request.num_tokens:
How will this behave for a normal decode where we have:
num_computed_tokens = request.num_tokens - 1?
Oh right, I totally forgot that normal decode also follows this logic. While normal decode isn't affected correctness-wise by it, it still introduces some redundant computation. We could maybe use num_computed_tokens < max(request.num_prompt_tokens, request.num_tokens - 1)? It looks a bit complicated, though. What do you think? The main point here is just to distinguish normal decode from the rest, since it doesn't require block-aligned processing.
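A small sketch of the condition discussed above (hypothetical names; `Request` and `needs_block_aligned_split` mirror the discussion, not the actual vLLM code). A normal decode step always has exactly one uncomputed token, so `num_computed_tokens == request.num_tokens - 1`, and the `max(...)` form excludes it while still catching resumed requests:

```python
# Hypothetical illustration of the proposed condition; not vLLM code.
from dataclasses import dataclass

@dataclass
class Request:
    num_prompt_tokens: int
    num_output_tokens: int

    @property
    def num_tokens(self) -> int:
        return self.num_prompt_tokens + self.num_output_tokens

def needs_block_aligned_split(req: Request, num_computed_tokens: int) -> bool:
    # Resumed/prefill requests lag behind by more than one token; a normal
    # decode has exactly one uncomputed token (the one being generated).
    return num_computed_tokens < max(req.num_prompt_tokens, req.num_tokens - 1)

# Normal decode: all but the newest token are computed -> no split needed.
decode_req = Request(num_prompt_tokens=8, num_output_tokens=3)  # num_tokens = 11
assert not needs_block_aligned_split(decode_req, num_computed_tokens=10)

# Resumed request: computed tokens fall short of the prompt -> split needed.
resumed_req = Request(num_prompt_tokens=8, num_output_tokens=3)
assert needs_block_aligned_split(resumed_req, num_computed_tokens=5)
```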
This pull request has merge conflicts that must be resolved before it can be merged.
Purpose
Test Plan
Test Result
Essential Elements of an Effective PR Description Checklist
supported_models.md and examples for a new model.