[Mamba1] - Kernel Level Chunk Alignment for Prefix Caching#34798
DarkLight1337 merged 16 commits into vllm-project:main
Conversation
Signed-off-by: Josephasafg <ajgard7@gmail.com>
Code Review
This pull request introduces kernel-level chunk alignment for prefix caching in Mamba1 to address a bug where state was not written at block boundaries. The changes are primarily within the selective_scan_fwd.cu kernel, with supporting modifications to plumb the new chunk_start_offsets parameter through the C++ and Python layers. The new logic dynamically calculates chunk sizes to align with block boundaries, ensuring correct state writing for prefix caching. The implementation appears robust and correctly reflects the logic outlined in the pull request description. I have reviewed the new chunking calculations, pointer arithmetic, and state-writing logic and found no issues.
divakar-amd
left a comment
Thanks for this PR; I was working on a similar fix for the kernel. Requesting a few changes to make it compatible with ROCm.
Created #34977 to support the fix for
@tdoublep Can you please take a look? Thanks!
Hi @Josephasafg, the pre-commit checks have failed. Please run:

uv pip install pre-commit
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.
@tdoublep The chunk metadata computation (
This pull request has merge conflicts that must be resolved before it can be merged.
Force-pushed from 44419ac to 3939010
tdoublep
left a comment
Thanks a lot for addressing the review comments. It looks good - just had one question for my understanding.
):
    cu_chunk_seqlen_p, _, last_chunk_indices_p = (
        self._build_chunk_metadata_tensors(
            self.kv_cache_spec.block_size,
It seems like for mamba1 we are forcing the kernel-level chunk size to be equal to block size, whereas for mamba2 the kernel-level chunk size is something kind of fixed by the model (it is typically equal to 256). Is there a reason we couldn't keep the kernel-level chunk size to whatever it was before prefix caching was introduced - is it just that 2048 is too big? Just want to make sure I understand this difference between mamba1 vs. mamba2.
@tdoublep Good question.
The old mamba1 "chunk size" (kChunkSize = kNThreads * kNItems) was purely a hardware detail - how many tokens fit in one thread block's registers per loop iteration. mamba1 does not have a block size as a model parameter.
For APC we need state snapshots at block boundaries, so we use block_size as the chunk size for _compute_chunk_metadata: each iteration produces one state write to one cache block. The default mamba_block_size for mamba1 is 2048 (chosen to match the non-APC behavior, though it can also be decreased), which equals kNThreads * kNItems for the largest kernel config, so for full-block chunks the effective iteration size is the same as before.
With cu_chunk_seqlen, the kernel reads whatever chunk sizes the metadata builder provides. This is what enables correct handling of partial first chunks during chunked prefill - something the old fixed kChunkSize didn't handle. In the non-APC fallback, the kernel chunks by 2048 same as before.
Does that answer your question?
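To make the metadata side concrete, here is a rough Python sketch of how a cu_chunk_seqlen-style array could be built for one request, with the first chunk sized to realign with block boundaries. The function name and signature are illustrative, not the actual vLLM helpers.

```python
# Illustrative sketch only: builds cumulative chunk boundaries for one request,
# where the first chunk is shrunk so that every later chunk ends exactly on a
# cache-block boundary. Names here are assumptions, not the real vLLM code.

def build_cu_chunk_seqlen(num_computed: int, seqlen: int, block_size: int):
    """Return cumulative token offsets (within this step) of each kernel chunk."""
    cu = [0]
    pos = num_computed                # absolute position within the sequence
    end = num_computed + seqlen
    while pos < end:
        # Each chunk ends at the next block boundary, or at the end of input.
        next_boundary = (pos // block_size + 1) * block_size
        step = min(end, next_boundary) - pos
        pos += step
        cu.append(cu[-1] + step)
    return cu

# The partial-first-chunk case from the discussion above:
print(build_cu_chunk_seqlen(1866, 1100, 2048))   # [0, 182, 1100]
# A fresh sequence simply chunks by block_size (2048), same as before:
print(build_cu_chunk_seqlen(0, 4096, 2048))      # [0, 2048, 4096]
```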
Force-pushed from b3b3abd to f718db0
Purpose
The selective_scan_fn kernel processed tokens in fixed-size chunks of kChunkSize = kNThreads * kNItems (typically 2048), regardless of where the sequence started within a block. When chunked prefill split a request across scheduler iterations, the kernel would write state at positions that didn't align with block boundaries.

Example of the bug:

Block size: 2048, seqlen: 3966
Iteration 1: Process 1866 tokens → state written at position 1866 (partial block 0)
Iteration 2: Process 1100 tokens → state for block 0's boundary (position 2048) is never written
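The misalignment can be sketched in a few lines of Python — a toy model of the old behavior, not the kernel itself:

```python
# Toy model of the old behavior: with fixed-size chunking, the position of each
# state write depends only on where the scheduler iteration happens to end.

BLOCK_SIZE = 2048

def state_write_positions(iteration_lens):
    """Final absolute position at which each scheduler iteration writes state."""
    pos, writes = 0, []
    for n in iteration_lens:
        pos += n
        writes.append(pos)    # old kernel: state written wherever the run ends
    return writes

writes = state_write_positions([1866, 1100])   # seqlen 3966 from the example
print(writes)                  # [1866, 2966]
print(BLOCK_SIZE in writes)    # False: block 0's boundary state is never written
```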
Solution

Implement kernel-level chunk alignment (similar to PR #24683 for Mamba2) that dynamically sizes the kernel's chunks so each one ends on a block boundary, ensuring state is written for every completed block.
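The alignment logic can be sketched as plain Python (names are illustrative; the real implementation lives in selective_scan_fwd.cu and the metadata builder):

```python
# Illustrative sketch of the aligned chunk loop: the first chunk is sized to
# finish the partially filled block; later chunks cover one block each.

def aligned_chunk_walk(num_computed: int, seqlen: int, block_size: int):
    """Yield (chunk_size, start, end, block_idx) for each kernel chunk."""
    pos, end, steps = num_computed, num_computed + seqlen, []
    while pos < end:
        next_boundary = (pos // block_size + 1) * block_size
        chunk_end = min(end, next_boundary)
        # Cache block that this chunk's final token position falls into.
        block_idx = (chunk_end - 1) // block_size
        steps.append((chunk_end - pos, pos, chunk_end, block_idx))
        pos = chunk_end
    return steps

for size, start, stop, blk in aligned_chunk_walk(1866, 1100, 2048):
    print(f"process {size} tokens: {start} -> {stop}, write state to block {blk}")
# process 182 tokens: 1866 -> 2048, write state to block 0
# process 918 tokens: 2048 -> 2966, write state to block 1
```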
Dry Run:

For num_computed=1866, seqlen=1100, block_size=2048:

chunk_start_offset = 1866 % 2048 = 1866
first_chunk_size = min(1100, 2048 - 1866) = 182  // Tokens to complete block 0
remaining = 1100 - 182 = 918
n_chunks = 2

Loop iteration 1 (chunk 0): Process 182 tokens
  position: 1866 → 2048
  block_idx_completed = (2048 - 1) / 2048 = 0
  → Write state to block 0 (now complete)

Loop iteration 2 (chunk 1): Process 918 tokens
  position: 2048 → 2966
  → Write state to block 1 (last_scheduled)

Test Plan

All prefix caching unittests for Mamba1/Jamba models pass
No degradation in performance, and quality has improved
Test Result
Essential Elements of an Effective PR Description Checklist
supported_models.md and examples for a new model.