
[Mamba1] - Kernel Level Chunk Alignment for Prefix Caching#34798

Merged
DarkLight1337 merged 16 commits into vllm-project:main from Josephasafg:mamba1-chunk-alignment-upstream
Mar 1, 2026

Conversation

@Josephasafg (Contributor) commented Feb 18, 2026

Purpose

The selective_scan_fn kernel processed tokens in fixed-size chunks of kChunkSize = kNThreads * kNItems (typically 2048), regardless of where the sequence started within a cache block. When chunked prefill split a request across scheduler iterations, the kernel could write state at positions that did not align with block boundaries.

Example of the bug:
Block size: 2048, seqlen: 3966
Iteration 1: Process 1866 tokens → state written at position 1866 (partial block 0)
Iteration 2: Process 1100 tokens

  • Old behavior: Process all 1100 in one chunk, write state at position 2966
  • Block 0's state still only covers [0, 1866) but hash represents [0, 2048)
  • Cache hit loads incomplete state
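A minimal arithmetic check (using the numbers from the example above) shows why the old behavior broke the block hash:

```python
block_size = 2048

# Old behavior: iteration 2 processes all 1100 tokens in one chunk,
# so the only state write lands at position 1866 + 1100 = 2966.
state_pos = 1866 + 1100

# The write is mid-block: block 0's boundary (position 2048) never gets
# a snapshot, so a cache hit for block 0 would load incomplete state.
print(state_pos % block_size)  # 918, i.e. not on a block boundary
```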

Solution

Implement kernel-level chunk alignment (similar to PR #24683 for Mamba2) that:

  1. Calculates the first chunk size to complete the current block
  2. Processes subsequent chunks aligned to block boundaries
  3. Writes state at each block boundary completion

Dry Run:
For num_computed=1866, seqlen=1100, block_size=2048:

chunk_start_offset = 1866 % 2048 = 1866
first_chunk_size = min(1100, 2048 - 1866) = 182  // Tokens to complete block 0
remaining = 1100 - 182 = 918
n_chunks = 2

Loop iteration 1 (chunk 0): Process 182 tokens
position: 1866 → 2048
block_idx_completed = (2048 - 1) / 2048 = 0
- Write state to block 0 (now complete)

Loop iteration 2 (chunk 1): Process 918 tokens
position: 2048 → 2966
- Write state to block 1 (last_scheduled)
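The dry run above can be reproduced with a short Python sketch (the helper name is illustrative, not the actual kernel code):

```python
def plan_chunks(num_computed: int, seqlen: int, block_size: int) -> list[int]:
    """Split seqlen new tokens into chunks aligned to block_size boundaries."""
    chunks = []
    chunk_start_offset = num_computed % block_size
    # First chunk is sized to complete the current (possibly partial) block.
    first_chunk_size = min(seqlen, block_size - chunk_start_offset)
    chunks.append(first_chunk_size)
    remaining = seqlen - first_chunk_size
    # Subsequent chunks are full blocks, plus a trailing partial chunk.
    while remaining > 0:
        chunks.append(min(remaining, block_size))
        remaining -= chunks[-1]
    return chunks

print(plan_chunks(1866, 1100, 2048))  # [182, 918] -> n_chunks = 2
```

Each chunk boundary here coincides with a block boundary (except possibly the last), so a state write at the end of each chunk yields exactly the per-block snapshots that prefix caching needs.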

Test Plan

All prefix-caching unit tests for Mamba1/Jamba models pass.
No performance degradation was observed, and output quality has improved.

Test Result


Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

Signed-off-by: Josephasafg <ajgard7@gmail.com>
@gemini-code-assist bot left a comment

Code Review

This pull request introduces kernel-level chunk alignment for prefix caching in Mamba1 to address a bug where state was not written at block boundaries. The changes are primarily within the selective_scan_fwd.cu kernel, with supporting modifications to plumb the new chunk_start_offsets parameter through the C++ and Python layers. The new logic dynamically calculates chunk sizes to align with block boundaries, ensuring correct state writing for prefix caching. The implementation appears robust and correctly reflects the logic outlined in the pull request description. I have reviewed the new chunking calculations, pointer arithmetic, and state-writing logic and found no issues.

@divakar-amd (Contributor) left a comment

Thanks for this PR; I was working on a similar fix for the kernel. Requesting a few changes to make it compatible with ROCm.

Josephasafg and others added 4 commits February 19, 2026 19:31
@divakar-amd (Contributor) commented

Created #34977 to support the fix for selective_scan_fwd.cu introduced in this PR

@Josephasafg (Author) commented

@tdoublep Can you please take a look? Thanks!

Signed-off-by: Josephasafg <ajgard7@gmail.com>
mergify bot commented Feb 22, 2026

Hi @Josephasafg, the pre-commit checks have failed. Please run:

uv pip install pre-commit
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.

Tip

Is mypy or markdownlint failing?
mypy and markdownlint are run differently in CI. If the failure is related to either of these checks, please use the following commands to run them locally:
# For mypy (substitute "3.10" with the failing version if needed)
pre-commit run --hook-stage manual mypy-3.10
# For markdownlint
pre-commit run --hook-stage manual markdownlint

Signed-off-by: Josephasafg <ajgard7@gmail.com>
@Josephasafg (Author) commented

@tdoublep
I went ahead and updated the kernel to use cu_chunk_seqlen and last_chunk_indices directly (same as Mamba2), instead of the raw chunk_start_offsets I had before.

The chunk metadata computation (_compute_chunk_metadata) and tensor building (_build_chunk_metadata_tensors) now live in the base class and are shared by both Mamba1 and Mamba2 builders. The kernel derives current_position and chunk_tokens from these tensors, falling back to simple block_size chunking when they're not provided (non-APC path).
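As a rough illustration of what the shared metadata could look like for a single request (the names follow the comment above, but this is a sketch, not the actual base-class implementation):

```python
from itertools import accumulate

def build_chunk_metadata(num_computed: int, seqlen: int, block_size: int):
    """Sketch: cumulative chunk offsets plus the index of the request's
    last chunk, with chunk sizes aligned to cache-block boundaries."""
    sizes = []
    # First chunk completes the current (possibly partial) block.
    first = min(seqlen, block_size - num_computed % block_size)
    sizes.append(first)
    remaining = seqlen - first
    while remaining > 0:
        sizes.append(min(remaining, block_size))
        remaining -= sizes[-1]
    # cu_chunk_seqlen[i] is the token offset where chunk i starts;
    # the final entry equals seqlen.
    cu_chunk_seqlen = list(accumulate(sizes, initial=0))
    # Index of the final chunk: where the kernel writes last_scheduled state.
    last_chunk_index = len(sizes) - 1
    return cu_chunk_seqlen, last_chunk_index

print(build_chunk_metadata(1866, 1100, 2048))  # ([0, 182, 1100], 1)
```

The kernel can then recover each chunk's token count as `cu_chunk_seqlen[i + 1] - cu_chunk_seqlen[i]`, which is what makes the partial first chunk transparent to the scan loop.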

@Josephasafg Josephasafg requested a review from tdoublep February 22, 2026 15:30
mergify bot commented Feb 24, 2026

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @Josephasafg.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Feb 24, 2026
@Josephasafg Josephasafg force-pushed the mamba1-chunk-alignment-upstream branch from 44419ac to 3939010 Compare February 24, 2026 05:15
@mergify mergify bot removed the needs-rebase label Feb 24, 2026
mergify bot commented Feb 25, 2026

Hi @Josephasafg, the pre-commit checks have failed. Please run the same pre-commit commands listed above, then commit the changes and push to your branch.

Signed-off-by: Josephasafg <ajgard7@gmail.com>
Signed-off-by: Josephasafg <ajgard7@gmail.com>
Signed-off-by: Josephasafg <ajgard7@gmail.com>
@tdoublep (Member) left a comment

Thanks a lot for addressing the review comments. It looks good - just had one question for my understanding.

):
cu_chunk_seqlen_p, _, last_chunk_indices_p = (
self._build_chunk_metadata_tensors(
self.kv_cache_spec.block_size,
@tdoublep (Member) commented on this snippet:

It seems like for mamba1 we are forcing the kernel-level chunk size to be equal to block size, whereas for mamba2 the kernel-level chunk size is something kind of fixed by the model (it is typically equal to 256). Is there a reason we couldn't keep the kernel-level chunk size to whatever it was before prefix caching was introduced - is it just that 2048 is too big? Just want to make sure I understand this difference between mamba1 vs. mamba2.

@Josephasafg (Author) replied:

@tdoublep Good question.
The old mamba1 "chunk size" (kChunkSize = kNThreads * kNItems) was purely a hardware detail: how many tokens fit in one thread block's registers per loop iteration. Mamba1 does not have a block size as a model parameter.

For APC we need state snapshots at block boundaries, so we use block_size as the chunk size for _compute_chunk_metadata; each iteration produces one state write to one cache block. The default mamba_block_size for mamba1 is 2048 (chosen to match the non-APC behavior, though it can also be decreased), which equals kNThreads * kNItems for the largest kernel config, so for full-block chunks the effective iteration size is the same as before.
With cu_chunk_seqlen, the kernel reads whatever chunk sizes the metadata builder provides. This is what enables correct handling of partial first chunks during chunked prefill, something the old fixed kChunkSize didn't handle. In the non-APC fallback, the kernel chunks by 2048, same as before.

Does that answer your question?
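For the non-APC fallback described above, the old fixed-size chunking can be sketched as follows (the kNThreads/kNItems values are illustrative for the largest kernel config, not taken from this PR):

```python
K_N_THREADS = 128   # illustrative value
K_N_ITEMS = 16      # illustrative value
K_CHUNK_SIZE = K_N_THREADS * K_N_ITEMS  # 2048

def fallback_chunks(seqlen: int, chunk_size: int = K_CHUNK_SIZE) -> list[int]:
    """Non-APC path: fixed-size chunks, no block alignment required."""
    return [min(chunk_size, seqlen - start)
            for start in range(0, seqlen, chunk_size)]

print(fallback_chunks(3966))  # [2048, 1918]
```

Since no per-block snapshots are needed on this path, fixed chunks are sufficient and match the pre-PR iteration pattern.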

@tdoublep tdoublep added the ready ONLY add when PR is ready to merge/full CI is needed label Feb 27, 2026
Signed-off-by: Josephasafg <ajgard7@gmail.com>
@DarkLight1337 DarkLight1337 merged commit bbf81f9 into vllm-project:main Mar 1, 2026
115 of 116 checks passed