
[V1][Hybrid] Enable spec decode and optimize block-aligned split in mamba cache align mode #33024

Closed

peakcrosser7 wants to merge 8 commits into vllm-project:main from peakcrosser7:ups/prefix_cache_fix_resumed

Conversation

peakcrosser7 (Contributor) commented Jan 25, 2026

Purpose

  1. Re-enabled spec decoding: speculative decoding, previously disabled in [V1][Hybrid] Mamba Prefix Caching with align mode #30877, is now re-enabled since the related issues have been confirmed fixed.
  2. Optimized block-aligned splitting for resumed requests: refined the logic so that Mamba states for resumed requests are also cached in a block-aligned fashion, maintaining consistency across prefill phases.
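The block-aligned caching described in item 2 can be sketched roughly as follows. This is a minimal illustration only; `align_chunk_end`, `block_size`, and the trimming policy are assumptions for exposition, not vLLM's actual code:

```python
# Hypothetical sketch of block-aligned chunk splitting for Mamba prefix
# caching. Names and the trimming policy are illustrative assumptions.

def align_chunk_end(start: int, proposed_end: int, block_size: int) -> int:
    """Trim a prefill chunk so it ends on a block boundary, keeping the
    cached Mamba state valid for prefix-cache lookups."""
    aligned_end = (proposed_end // block_size) * block_size
    # Never trim below the chunk start; an unaligned tail is allowed only
    # for the final chunk of the prompt.
    return max(aligned_end, start)

# A 37-token chunk with block_size 16 is trimmed back to the 32-token
# boundary, so the cached state sits exactly on a full block.
assert align_chunk_end(0, 37, 16) == 32
assert align_chunk_end(32, 40, 16) == 32
```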

Test Plan

Test Result


Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing a test command.
  • The test results, such as pasting a before/after results comparison or e2e results.
  • (Optional) Necessary documentation updates, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user-facing, please update the release notes draft in the Google Doc.

Signed-off-by: huanghaoyan.hhy <huanghaoyan.hhy@alibaba-inc.com>
gemini-code-assist bot left a comment

Code Review

This pull request re-enables speculative decoding for Mamba with align cache mode and refactors the block-aligned splitting logic to better support resumed requests. The changes are logical and well-structured, particularly the introduction of the _mamba_compute_cache_pos helper function. However, I've identified a potential issue in the calculation of the last cacheable position for Eagle mode, which could lead to performance degradation due to cache misses under specific conditions.

Signed-off-by: huanghaoyan.hhy <huanghaoyan.hhy@alibaba-inc.com>
@peakcrosser7 peakcrosser7 changed the title [V1][Hybrid] Support spec decode and optimize block-aligned split in mamba cache align mode [V1][Hybrid] Enable spec decode and optimize block-aligned split in mamba cache align mode Jan 26, 2026
Signed-off-by: huanghaoyan.hhy <huanghaoyan.hhy@alibaba-inc.com>
heheda12345 (Collaborator) left a comment

Is this for correctness or for a higher cache hit rate?

For the cache hit rate of resumed requests, I think we only need to ensure that the last prefill chunk's state was computed before, so that it is already cached if external storage exists, and we don't need to force recomputing this position. Try to keep the code simple :)

@heheda12345 heheda12345 added the ready ONLY add when PR is ready to merge/full CI is needed label Jan 27, 2026
heheda12345 (Collaborator)

add ready for the large unit test

peakcrosser7 (Contributor Author)

Is this for correctness or for a higher cache hit rate?

For the cache hit rate of resumed requests, I think we only need to ensure that the last prefill chunk's state was computed before, so that it is already cached if external storage exists, and we don't need to force recomputing this position. Try to keep the code simple :)

The changes are for both correctness and cache hit rate.
Commit 5781011 is for correctness: it ensures all prefill chunks are block-aligned so the Mamba states are valid.
Commit 1920b70 is the optimization part, which specifically targets resumed requests. Previously, prompt and output tokens were treated as a whole for splitting, which often placed the cached state within the output tokens, making it unusable for future request replays. By splitting at a block-aligned position within the original prompt boundary, we cache a state that is much more likely to be hit again. Of course, I didn't consider external storage here.
If you feel this optimization adds too much complexity for such a rare case, I'm fine with removing it to keep the code simple.
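The optimization described above can be sketched roughly as follows. Names and the helper are illustrative assumptions, not vLLM's actual code; the point is only the choice of alignment boundary:

```python
# Hypothetical illustration of the resumed-request optimization: split at
# the last block-aligned position within the ORIGINAL prompt, rather than
# within prompt + output tokens. Names are assumptions for exposition.

def cache_split_pos(num_prompt_tokens: int, num_output_tokens: int,
                    block_size: int) -> int:
    """Pick the position at which the Mamba state is cached.

    Aligning within prompt + output tokens can land the cached state
    inside the output tokens, where a replayed prompt will never hit it.
    Aligning within the prompt boundary keeps the state reusable.
    """
    return (num_prompt_tokens // block_size) * block_size

# prompt=90, output=30, block=16: aligning over all 120 tokens would give
# position 112 (inside the output tokens), while the prompt-boundary
# split gives 80, which a replay of the same prompt can hit.
assert cache_split_pos(90, 30, 16) == 80
```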

Signed-off-by: huanghaoyan.hhy <huanghaoyan.hhy@alibaba-inc.com>
# * resumed requests: num_computed_tokens < (
#       num_prompt_tokens + num_output_tokens
#   )
if num_computed_tokens < request.num_tokens:
A Member left a comment
How will this behave for a normal decode, where we have num_computed_tokens = request.num_tokens - 1?

peakcrosser7 (Contributor Author)
Oh right, I totally forgot that normal decode also follows this logic. While normal decode isn't affected by it correctness-wise, it still introduces some redundant computation. We could maybe use num_computed_tokens < max(request.num_prompt_tokens, request.num_tokens - 1)? It looks a bit complicated. What do you think? The main point here is just to distinguish normal decode from the rest, since it doesn't require block-aligned processing.
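The distinction discussed here can be sketched as follows. This is only an illustration of the proposed condition; the function name is hypothetical, and the attribute names mirror the quoted snippet rather than vLLM's actual code:

```python
# Illustrative check from the discussion above: distinguish a normal
# decode step (no block-aligned split needed) from a resumed request.
# Hypothetical sketch; names mirror the quoted snippet, not vLLM's code.

def needs_block_aligned_split(num_computed_tokens: int,
                              num_prompt_tokens: int,
                              num_tokens: int) -> bool:
    # num_tokens = num_prompt_tokens + num_output_tokens
    return num_computed_tokens < max(num_prompt_tokens, num_tokens - 1)

# Normal decode: num_computed_tokens == num_tokens - 1, so no split.
assert not needs_block_aligned_split(99, 90, 100)
# Resumed request: computed tokens lag behind prompt + output, so split.
assert needs_block_aligned_split(50, 90, 100)
```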

mergify bot commented Feb 2, 2026

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @peakcrosser7.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Feb 2, 2026
peakcrosser7 (Contributor Author)

This PR has now been split into #33705 and #33706

peakcrosser7 (Contributor Author)

Closed because of #33705 and #33706.

@peakcrosser7 peakcrosser7 deleted the ups/prefix_cache_fix_resumed branch February 18, 2026 03:01

Labels

needs-rebase · ready (ONLY add when PR is ready to merge/full CI is needed) · v1


3 participants