light version of prefix caching for hybrid models gdn attention #30725
joennlae wants to merge 1 commit into vllm-project:main
Conversation
Copied and rebased from vllm-project#28176. Thanks to @peakcrosser7 and @minminsun.

Signed-off-by: Jannis Schönleber <joennlae@gmail.com>
Code Review
This pull request introduces a "light version" of prefix caching for hybrid models, which is a significant new feature. The changes are spread across several files, touching scheduling, cache management, and attention backends. My review has identified a critical bug in how a new environment variable is parsed, which could lead to incorrect behavior. I've also found a potential issue with torch.compile caching due to an incorrect configuration, and a significant code duplication in the scheduler that should be refactored to improve maintainability. The rest of the changes appear consistent with the feature's goal.
```python
"VLLM_USE_LIGHTER_MAMBA_CACHE": lambda: bool(
    os.getenv("VLLM_USE_LIGHTER_MAMBA_CACHE", False)
),
```
The parsing of the `VLLM_USE_LIGHTER_MAMBA_CACHE` environment variable is incorrect. When a user sets `VLLM_USE_LIGHTER_MAMBA_CACHE=0`, `os.getenv` returns the string `"0"`, and `bool("0")` evaluates to `True` because any non-empty string is truthy. This should be parsed like the other boolean environment variables in this file: convert the value to an integer before casting to a boolean.
```diff
-"VLLM_USE_LIGHTER_MAMBA_CACHE": lambda: bool(
-    os.getenv("VLLM_USE_LIGHTER_MAMBA_CACHE", False)
-),
+"VLLM_USE_LIGHTER_MAMBA_CACHE": lambda: bool(
+    int(os.getenv("VLLM_USE_LIGHTER_MAMBA_CACHE", "0"))
+),
```
```python
"VLLM_CPU_MOE_PREPACK",
"VLLM_CPU_SGL_KERNEL",
"VLLM_TEST_FORCE_LOAD_FORMAT",
"VLLM_USE_LIGHTER_MAMBA_CACHE",
```
The VLLM_USE_LIGHTER_MAMBA_CACHE environment variable is being added to ignored_factors in compile_factors. This will exclude it from the torch.compile cache key. However, this flag significantly alters the caching logic and computation graph for Mamba models. To prevent incorrect cache hits and potential runtime errors when using torch.compile, this variable should be part of the cache key. Please remove this line from the ignored_factors set.
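To illustrate why this matters (a generic sketch, not vLLM's actual `compile_factors` implementation: the `cache_key` helper and its signature are hypothetical), excluding a behavior-altering flag from the key makes two incompatible configurations collide on the same cached artifact:

```python
import hashlib

def cache_key(factors: dict, ignored: set) -> str:
    # Hypothetical key derivation: hash only the factors
    # that are not in the ignored set.
    items = sorted((k, v) for k, v in factors.items() if k not in ignored)
    return hashlib.sha256(repr(items).encode()).hexdigest()

env_on = {"VLLM_USE_LIGHTER_MAMBA_CACHE": "1", "OTHER_FLAG": "x"}
env_off = {"VLLM_USE_LIGHTER_MAMBA_CACHE": "0", "OTHER_FLAG": "x"}

# With the flag ignored, both configurations get the same key,
# so a graph compiled for one can be served for the other.
collides = cache_key(env_on, {"VLLM_USE_LIGHTER_MAMBA_CACHE"}) == \
           cache_key(env_off, {"VLLM_USE_LIGHTER_MAMBA_CACHE"})

# Including the flag in the key distinguishes the two.
distinct = cache_key(env_on, set()) != cache_key(env_off, set())
```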
```python
if (
    envs.VLLM_USE_LIGHTER_MAMBA_CACHE
    and self.cache_config.enable_prefix_caching
    and self._has_mamba_spec()
):
    # To enable block-aligned caching of the Mamba state, `num_new_tokens`
    # must be a multiple of `block_size`.
    # As an exception, if `num_new_tokens` is less than `block_size`, the
    # state is simply not cached, requiring no special handling.
    # Additionally, when Eagle mode is enabled, FullAttn prunes the last
    # matching block. To prevent this from causing a Mamba cache miss, the
    # last chunk must be larger than `block_size`.
    block_size = self.block_size
    max_last_chunk = block_size * (2 if self.use_eagle else 1)
    if num_new_tokens < max_last_chunk:
        num_new_tokens = min(num_new_tokens, token_budget)
    else:
        ori_num_new_tokens = num_new_tokens
        num_new_tokens = min(num_new_tokens, token_budget)
        num_new_tokens = num_new_tokens // block_size * block_size
        if (
            self.use_eagle
            and ori_num_new_tokens - num_new_tokens < block_size
        ):
            assert num_new_tokens >= block_size
            num_new_tokens -= block_size
else:
    num_new_tokens = min(num_new_tokens, token_budget)
```
The logic for calculating num_new_tokens when VLLM_USE_LIGHTER_MAMBA_CACHE is enabled is duplicated in two places within the schedule method (here and at lines 593-617). This complex logic, which handles block-aligned caching for Mamba state and special conditions for Eagle mode, is difficult to maintain in two separate places. Any future changes might be applied to one copy but not the other, leading to bugs. This duplicated code should be refactored into a helper method to improve maintainability and reduce the risk of inconsistencies.
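One way the refactor could look (a sketch only: the helper name `_align_mamba_num_new_tokens` and the stand-in `SchedulerSketch` class are assumptions, not the actual vLLM scheduler):

```python
class SchedulerSketch:
    """Minimal stand-in for the scheduler; block_size and use_eagle
    mirror the attributes used in the diff above."""

    def __init__(self, block_size: int, use_eagle: bool):
        self.block_size = block_size
        self.use_eagle = use_eagle

    def _align_mamba_num_new_tokens(
        self, num_new_tokens: int, token_budget: int
    ) -> int:
        # Extracted once so both call sites in schedule() share it.
        block_size = self.block_size
        max_last_chunk = block_size * (2 if self.use_eagle else 1)
        if num_new_tokens < max_last_chunk:
            # Below one (or, with Eagle, two) block(s): the state is
            # simply not cached, so no alignment is needed.
            return min(num_new_tokens, token_budget)
        ori_num_new_tokens = num_new_tokens
        num_new_tokens = min(num_new_tokens, token_budget)
        # Round down to a multiple of block_size for block-aligned caching.
        num_new_tokens = num_new_tokens // block_size * block_size
        if self.use_eagle and ori_num_new_tokens - num_new_tokens < block_size:
            # Keep the last chunk larger than block_size so Eagle's
            # pruning of the last matching block cannot cause a miss.
            assert num_new_tokens >= block_size
            num_new_tokens -= block_size
        return num_new_tokens


# Usage: both occurrences in schedule() collapse to one call.
s = SchedulerSketch(block_size=16, use_eagle=False)
aligned = s._align_mamba_num_new_tokens(100, 80)  # 80 (already block-aligned)
```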
@joennlae did you get a chance to talk with @peakcrosser7? We are actually iterating on #29272
This pull request has merge conflicts that must be resolved before it can be merged.
Closing due to #29272