[Kernel] Chunk-aligned mamba2 #24683
Conversation
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com>
[Collapsed sections: server/client benchmark commands and results comparing the `main` and `tpa-mamba-aligned` branches.]
tlrmchlsmth left a comment:
PR looks great at first pass. Love to see more red than green.
In the figure in the PR description, why does A1.a fall at the beginning of the chunk rather than the end? I thought A0 should be ahead of it rather than behind.
@tlrmchlsmth A0 isn't actually added to the chunk; it has already been prefilled and doesn't need to be computed again. We just need to partition A1 in such a way that the intermediate states land exactly on the chunk boundaries.
Do the padded regions get loaded at all?
Makes sense. So then the A0-sized padded region could overlap with another chunk, or it could fall off the end of the KV cache tensor, right? Do we mask off the loads of the padded region as well?
No, padding is maybe the wrong word. There isn't any actual padding of tensors in memory here; masking would probably be a better word. If we have 5 chunks like in the above example, we would launch a Triton kernel with a grid size of 5. We are basically trading off a bit of extra compute in order to get intermediate states exactly where we want them within each sequence. It turns out it isn't really a trade-off: since it strips out so much complexity, it is a net win.
Yes, if we don't introduce the padding/masking it will lead to (a) having multiple sequences within the same chunk and (b) needing this whole mapping between "logical" and "physical" chunks to track where everything is.
Yes, we mask off the loads exactly (example: https://github.com/tdoublep/vllm/blob/tpa-aligned-mamba/vllm/model_executor/layers/mamba/ops/ssd_chunk_scan.py#L231).
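To make the masked-load idea concrete, here is a small NumPy analogue of what a masked `tl.load` does inside the kernel: lanes of a chunk that fall past the end of the sequence are masked off, so they contribute nothing and never read out-of-range memory. The function and variable names here are illustrative, not taken from the actual kernel.

```python
import numpy as np

def load_chunk_masked(x, chunk_start, chunk_size, seq_end):
    """Load one fixed-size chunk of tokens from x, masking positions
    past the end of the sequence (analogous to tl.load with a mask)."""
    idx = chunk_start + np.arange(chunk_size)
    mask = idx < seq_end                 # out-of-range lanes are masked off
    safe_idx = np.where(mask, idx, 0)    # avoid OOB indexing; values discarded
    vals = x[safe_idx]
    return np.where(mask[:, None], vals, 0.0)

# 10 tokens of dimension 4; the last chunk starts at token 8 and only
# 2 of its 4 slots are valid, so only 2 rows of ones survive the mask.
x = np.ones((10, 4))
out = load_chunk_masked(x, chunk_start=8, chunk_size=4, seq_end=10)
print(out.sum())  # -> 8.0
```

In the real kernel the same mask guards both the loads and the stores, so the "padded" region is never materialized anywhere.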
Signed-off-by: yewentao256 <zhyanwentao@126.com>
Signed-off-by: Tomer Asida <57313761+tomeras91@users.noreply.github.com>
…fter chunked-aligned mamba is merged (PR vllm-project#24683) Signed-off-by: Tomer Asida <57313761+tomeras91@users.noreply.github.com>
Purpose
This PR changes the way that the mamba2 kernels split the batch into "chunks". The change ensures that (a) no chunk ever contains more than one sequence, and (b) all intermediate states are computed at the chunk boundaries within each sequence.
This change is useful for three reasons:
The downside is that it introduces some "virtual" padding inside the chunks. We don't actually pad anything in GPU memory; we just potentially need to use a larger grid when launching kernels and may do some redundant compute. However, this padding is bounded to at most one chunk per sequence, and my initial experiments suggest the overhead is minimal. In fact, we actually see a significant speedup because we skip the call to the final "varlen" kernel. We follow a very similar approach for handling varlen batches in the Triton attention kernels, so this kind of technique is not without precedent.
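One way to see the bound on the redundant compute is to count the chunks a sequence occupies when every boundary must be a multiple of the chunk size measured from the sequence's own start (the helper name below is illustrative, not from the PR):

```python
import math

def num_aligned_chunks(prefilled, new_tokens, chunk_size):
    """Chunks spanned by the new tokens of one sequence when all chunk
    boundaries must be aligned to the start of the sequence."""
    end = prefilled + new_tokens
    return math.ceil(end / chunk_size) - prefilled // chunk_size

# chunk_size=8: a sequence with 5 prefilled and 20 new tokens needs 4
# chunks (3 + 8 + 8 + 1 tokens), vs ceil(20/8) = 3 if the tokens were
# packed without alignment - at most one extra chunk for this sequence.
print(num_aligned_chunks(5, 20, 8))   # -> 4
print(num_aligned_chunks(0, 16, 8))   # -> 2 (already aligned, no overhead)
```

The extra chunk only appears when the prefilled length is not itself chunk-aligned, which is why the overhead is bounded per sequence rather than growing with sequence length.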
TODO:

- `seq_idx` can be made simpler - we just need to keep track of the `seq_idx` per chunk
- `chunk_indices` and `chunk_offsets`

A simple example for two sequences A and B is shown below. A0 and B0 represent the chunks that were prefilled at the previous step, and A1 and B1 are the new chunks we want to prefill in this iteration.
The idea is that for sequence A, we first take enough tokens from the new part (A1) to ensure that, when taken together with the precomputed part (A0), the state is chunk-aligned. Then we fill chunks with new tokens (from A1) until we run out, at which point we pad to the chunk boundary. Then we repeat for B.
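The partitioning described above can be sketched on the host side as follows. This is a simplified illustration, not the actual kernel metadata: it splits each sequence's new tokens so that every boundary is chunk-aligned relative to that sequence's start, with the last chunk of each sequence virtually padded.

```python
def build_chunk_layout(seqs, chunk_size):
    """seqs: list of (prefilled_len, new_len) pairs, one per sequence.
    Returns (seq_id, start, end) offsets into each sequence's new tokens;
    one tuple per launched chunk, so len(result) is the grid size."""
    chunks = []
    for seq_id, (prefilled, new_len) in enumerate(seqs):
        pos = 0
        while pos < new_len:
            # room until the next chunk boundary of *this* sequence
            room = chunk_size - (prefilled + pos) % chunk_size
            take = min(room, new_len - pos)
            chunks.append((seq_id, pos, pos + take))
            pos += take
    return chunks

# Sequence A: 5 tokens prefilled (A0), 20 new (A1); sequence B: 10 new.
layout = build_chunk_layout([(5, 20), (0, 10)], chunk_size=8)
print(layout)
# A gets chunks of 3, 8, 8, 1 tokens (the first chunk tops A0 up to a
# boundary, the last is virtually padded); B gets 8, 2. Grid size = 6.
```

Note that no tuple ever spans two sequences, and every `start`/`end` offset plus the prefilled length is a multiple of `chunk_size` except the final (padded) chunk of each sequence.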
Test Plan
See correctness + benchmarking below.
Test Result
See correctness + benchmarking below.