[V1][Hybrid] Mamba Prefix Caching with align mode #30877
heheda12345 merged 136 commits into vllm-project:main
Conversation
Signed-off-by: huanghaoyan.hhy <huanghaoyan.hhy@alibaba-inc.com>
Signed-off-by: Chen Zhang <zhangch99@outlook.com>
Hi @peakcrosser7, the pre-commit checks have failed. Please run:

```shell
uv pip install pre-commit
pre-commit install
pre-commit run --all-files
```

Then, commit the changes and push to your branch.
Signed-off-by: huanghaoyan.hhy <huanghaoyan.hhy@alibaba-inc.com>
heheda12345
left a comment
LGTM! Thanks @peakcrosser7 for the great job.
There is still a long way to go for vLLM to reach stable and efficient Mamba support. Though there are some known issues, I'd like to merge this PR first to make it possible for more people to contribute to this work stream. Given the known issues, we keep the prefix caching of linear attention as an experimental feature that needs to be enabled explicitly.
I list some of the problems below. Most of them are not related to prefix caching support directly, but they do block us from moving forward. Help wanted on them!
- Speculative decoding compatibility. There is a correctness issue in the current linear attention implementation, as discussed in #30618. Though this PR includes the code for spec decode + prefix caching, it can only be enabled after #30618 is resolved.
- #31649, detected during the debugging of #30618.
- We need more testing on prefix caching for resumed requests.
```python
mamba_blocks_per_req = (
    max_num_blocks_per_req
    if self.cache_config.enable_prefix_caching
    else 1
) + kv_cache_group.kv_cache_spec.num_speculative_blocks
```
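As a quick sanity check, the expression above can be evaluated with illustrative numbers (the standalone function and the values below are hypothetical, merely mirroring the snippet):

```python
def mamba_blocks_per_req(max_num_blocks_per_req: int,
                         enable_prefix_caching: bool,
                         num_speculative_blocks: int) -> int:
    # Mirrors the snippet: with prefix caching enabled, a request may
    # need up to max_num_blocks_per_req blocks; otherwise one block
    # suffices. Speculative blocks are added either way.
    base = max_num_blocks_per_req if enable_prefix_caching else 1
    return base + num_speculative_blocks

print(mamba_blocks_per_req(8, True, 2))   # -> 10
print(mamba_blocks_per_req(8, False, 2))  # -> 3
```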
I still have trouble squaring this with the code for `max_memory_usage_bytes`, which says that for align mode it is `self.page_size_bytes * (2 + self.num_speculative_blocks)`. Does this imply that we should have `mamba_blocks_per_req = 2 + kv_cache_group.kv_cache_spec.num_speculative_blocks`?
Hi, @tdoublep. Please let me know if I misunderstood anything. Thanks!
Looking at this PR, this is a good format that I should follow (I didn't look at the code change itself). A role-model format!
tdoublep
left a comment
Thanks for the great work! This feature enables prefix caching for a broader set of models.
Let's fix the issues that remain for MTP as a follow-up.
vllm/v1/kv_cache_interface.py (outdated)

```python
# We allocate 1 block for each request now, so max_memory_usage_bytes is
# the same as page_size_bytes.
# Need to update this when supporting prefix caching.
```
This comment is redundant now, I think.
Thanks for pointing that out. We can remove it later.
```python
max_model_len = vllm_config.model_config.max_model_len
return cdiv(max_model_len, self.block_size) * self.page_size_bytes
```
I think this code for "all" mode is actually wrong, but it is not an issue introduced by this PR. Will fix it as a follow-up.
Agreed. It seems "all" mode performs allocation at the granularity of `mamba_block_size`, so we need to fix this later.
This PR appears to fail pre-commit; I have a fix: #32956
…de align` (#7103)

### What this PR does / why we need it?
To support prefix caching for Qwen3.5/Next in vLLM-Ascend, this PR mainly follows the design in [#30877](vllm-project/vllm#30877) and inherits the changes to functions that are overridden in vLLM-Ascend.

Note:
1. `--mamba-cache-mode align` with PD disaggregation is still not supported in vLLM v0.17.0 (see https://github.com/vllm-project/vllm/blob/main/vllm/v1/core/sched/scheduler.py#L295).
2. The current implementation of the hybrid KV cache can result in a very large block_size during scheduling. For example, if we run Qwen3.5-35B-A3B with `-tp 2`, the block_size is adjusted to 2048, which means any prefix shorter than 2048 tokens will never be cached. Although this behavior is consistent with vLLM, it still needs improvement in the future.
3. `--mamba-cache-mode align` requires copying Mamba states during forward steps. vLLM implements this with a Triton kernel, but the original version ran into bugs on Ascend hardware, so we patch in a new Triton kernel to avoid them.

### Does this PR introduce _any_ user-facing change?
To use Mamba prefix caching, set `--enable-prefix-caching` and `--mamba-cache-mode align`. Note that the Mamba state-copy function (see [do_mamba_copy_block](https://github.com/vllm-project/vllm/blob/main/vllm/v1/worker/mamba_utils.py#L132)) does not provide a torch-native version, so it may cause trouble for users who can't use Triton.

- vLLM version: v0.16.0
- vLLM main: vllm-project/vllm@4034c3d

Signed-off-by: Angazenn <supperccell@163.com>
The cleaned-up version of #29272
Purpose
This PR enhances the design of #28176, adopting the same memory layout as FullAttention while adding support for decode caching and speculative decoding.
The core idea of this Mamba Prefix-Caching implementation (referred to as LPC) is to directly cache Mamba states through block-aligned scheduling. This approach enables rapid support for Prefix-caching in Mamba models without modifications to the underlying kernel code. Furthermore, it maintains full compatibility with Speculative-Decoding/MTP/EAGLE.
Currently, this solution supports all Mamba model architectures including GDN, Mamba1, Mamba2, and Short Conv Attention, and has been adapted for relevant Mamba models such as Qwen3-Next-80B-A3B-Instruct and LFM2-700M.
Usage
To enable this feature, start the engine with the `--enable-prefix-caching` and `--mamba-cache-mode align` flags.

Design Details
Block-Aligned Scheduling
Following the design in #28176, requests in the prefill phase are scheduled in multiples of `block_size`. This ensures that the Mamba states can be mapped to a specific block's hash value. The prefix cache stores variable-length chunk states, i.e., the number of tokens (or the incremental length) associated with each cached Mamba state may vary, but it is always a multiple of `block_size`.

Scheduler Logic with Mamba Prefix-Caching Enabled:
- Each scheduled chunk is a multiple of `block_size`, except for the final chunk of the request.
- The unaligned tail of the prompt is deferred to the final chunk, ensuring its size is ≤ `block_size`. This maximizes the length of the prompt that can be cached during the prefill phase.

Block Allocation Design
Prefill Stage
During the prefill stage, requests are scheduled at a block-aligned chunk granularity. For a single scheduling step consisting of `chunk_len` tokens, the system allocates `chunk_len // block_size` blocks: `(chunk_len // block_size) - 1` of them are populated with null-blocks (placeholders).

Note on Speculative Decoding (SPS): In the prefill stage with SPS enabled, the initial execution requires the allocation of `gamma` additional speculative blocks, which are subsequently reused in the following steps.

Decode Stage
Since only a small number of tokens are scheduled per step during decoding, the allocation logic is consistent with FullAttention, where blocks are incrementally allocated one by one.
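The allocation scheme described above can be sketched as follows (a minimal illustration with hypothetical helper names; it assumes the single real block of a prefill chunk is the last slot, and the real allocator in vLLM is considerably more involved):

```python
NULL_BLOCK = -1  # placeholder id for slots that hold no Mamba state

def allocate_chunk_blocks(chunk_len: int, block_size: int,
                          free_blocks: list[int]) -> list[int]:
    """Allocate block slots for one block-aligned prefill chunk.

    Of the chunk_len // block_size slots, all but one are filled with
    null-blocks; only a single real block stores the chunk's Mamba
    state (assumed here to be the last slot).
    """
    num_blocks = chunk_len // block_size
    slots = [NULL_BLOCK] * (num_blocks - 1)
    slots.append(free_blocks.pop())  # one real block for the state
    return slots

def allocate_decode_block(free_blocks: list[int]) -> int:
    """Decode allocates incrementally, one block at a time."""
    return free_blocks.pop()

free = [3, 2, 1, 0]
print(allocate_chunk_blocks(64, 16, free))  # -> [-1, -1, -1, 0]
print(allocate_decode_block(free))          # -> 1
```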
Prefix Caching Logic
Scheduler-side Logic
Similar to the FullAttention prefix-caching logic, only immutable blocks that store Mamba states are cached (excluding the null-blocks), and prefix matching is performed via a reverse hash lookup that requires only a single block to be matched.
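A minimal sketch of the reverse-lookup idea (hypothetical names; not the actual vLLM hashing code, which hashes token-block contents):

```python
def match_cached_prefix(block_hashes: list[str],
                        cached: set[str]) -> int:
    """Reverse hash lookup: probe from the longest candidate prefix.

    Because only immutable, fully-populated state blocks are cached,
    a hit on block i implies all its predecessors belong to the same
    prefix, so a single matched block identifies the whole cached
    prefix. Returns the number of blocks covered by the hit.
    """
    for i in range(len(block_hashes) - 1, -1, -1):
        if block_hashes[i] in cached:
            return i + 1
    return 0

cache = {"h0", "h1"}
print(match_cached_prefix(["h0", "h1", "h2"], cache))  # -> 2
```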
Worker-side Logic
Prefill Phase:
The Preprocess stage is responsible for copying Mamba states before the model forward:
Condition 1: Copy the Mamba state from the previous step to the current step.
Condition 2: Copy the Mamba state from the prefix-cache hit block to the current step.
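The two copy conditions can be illustrated with a toy state cache (hypothetical names and a plain-list cache; the real implementation operates on GPU tensors via a Triton kernel):

```python
def preprocess_copy_states(state_cache: list[list[float]],
                           copy_pairs: list[tuple[int, int]]) -> None:
    """Copy Mamba states between cache slots before the forward pass.

    Each (src, dst) pair covers both conditions: src is either the
    slot written by the previous step of the same request, or a block
    found via a prefix-cache hit.
    """
    for src, dst in copy_pairs:
        state_cache[dst] = list(state_cache[src])  # copy, don't alias

cache = [[0.0, 1.0], [2.0, 3.0], [9.0, 9.0], [9.0, 9.0]]
# Request A continues from slot 0; request B hit the prefix cache at slot 1.
preprocess_copy_states(cache, [(0, 2), (1, 3)])
print(cache[2], cache[3])  # -> [0.0, 1.0] [2.0, 3.0]
```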
Decode Phase:
Without Speculative Decoding: The logic remains consistent with the standard Prefill Phase.
With Speculative Decoding:
The Preprocess stage copies Mamba states when a new block is allocated:
- The number of newly committed tokens is determined by `num_accepted_tokens`.
After receiving the full number of tokens corresponding to the previous block, the Post-process stage copies the Mamba state back to the previous block.
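Under these assumptions, the copy-back accounting might look like this (a hypothetical helper, illustrative only): a block's state is copied back once the accepted tokens carry the request across a block boundary.

```python
def blocks_completed(num_computed_tokens: int,
                     num_accepted_tokens: int,
                     block_size: int) -> int:
    """How many whole blocks become immutable this step.

    With speculative decoding, only accepted tokens advance the state,
    so the Post-process copy-back fires once the accepted tokens fill
    out the previously partial block.
    """
    before = num_computed_tokens // block_size
    after = (num_computed_tokens + num_accepted_tokens) // block_size
    return after - before

# 30 computed tokens plus 4 accepted with block_size 16 crosses one
# block boundary, so one state is copied back to its block.
print(blocks_completed(30, 4, 16))  # -> 1
```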
Test Plan
Test Result
Essential Elements of an Effective PR Description Checklist
`supported_models.md` and `examples` for a new model.