
light version of prefix caching for hybrid models gdn attention#30725

Closed
joennlae wants to merge 1 commit intovllm-project:mainfrom
44ai-labs:qwen3-next-prefix-caching

Conversation

@joennlae
Contributor

@joennlae joennlae commented Dec 15, 2025

Copied and rebased from #28176

Thanks to @peakcrosser7 and @minminsun


Signed-off-by: Jannis Schönleber <joennlae@gmail.com>
@mergify mergify bot added the qwen (Related to Qwen models) and v1 labels on Dec 15, 2025
Contributor

@gemini-code-assist bot left a comment


Code Review

This pull request introduces a "light version" of prefix caching for hybrid models, which is a significant new feature. The changes are spread across several files, touching scheduling, cache management, and attention backends. My review has identified a critical bug in how a new environment variable is parsed, which could lead to incorrect behavior. I've also found a potential issue with torch.compile caching due to an incorrect configuration, and a significant code duplication in the scheduler that should be refactored to improve maintainability. The rest of the changes appear consistent with the feature's goal.

Comment on lines +1569 to +1571
    "VLLM_USE_LIGHTER_MAMBA_CACHE": lambda: bool(
        os.getenv("VLLM_USE_LIGHTER_MAMBA_CACHE", False)
    ),

critical

The parsing of the VLLM_USE_LIGHTER_MAMBA_CACHE environment variable is incorrect. When a user sets VLLM_USE_LIGHTER_MAMBA_CACHE=0, os.getenv returns the string "0", and bool("0") evaluates to True, which is not the intended behavior. This should be parsed similarly to other boolean environment variables in this file by converting the value to an integer before casting to a boolean.

Suggested change
-    "VLLM_USE_LIGHTER_MAMBA_CACHE": lambda: bool(
-        os.getenv("VLLM_USE_LIGHTER_MAMBA_CACHE", False)
-    ),
+    "VLLM_USE_LIGHTER_MAMBA_CACHE": lambda: bool(
+        int(os.getenv("VLLM_USE_LIGHTER_MAMBA_CACHE", "0"))
+    ),
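A quick sanity check, independent of the vLLM codebase, illustrates the bug: any non-empty string is truthy in Python, so calling `bool()` directly on the raw `os.getenv` string returns True even for `"0"`. The names below only mirror the two lambdas from the suggestion; this is a standalone demonstration, not vLLM code.

```python
import os

# The buggy parser: bool() applied to the raw string value.
buggy = lambda: bool(os.getenv("VLLM_USE_LIGHTER_MAMBA_CACHE", False))
# The corrected parser: cast through int first, as other flags in this file do.
fixed = lambda: bool(int(os.getenv("VLLM_USE_LIGHTER_MAMBA_CACHE", "0")))

os.environ["VLLM_USE_LIGHTER_MAMBA_CACHE"] = "0"
print(buggy())  # True  -- bool("0") is True, because "0" is a non-empty string
print(fixed())  # False -- "0" parses to 0, which casts to False

del os.environ["VLLM_USE_LIGHTER_MAMBA_CACHE"]
print(buggy(), fixed())  # False False -- both fall back to the disabled default
```

With the flag unset, both parsers agree; they only diverge on explicit string values like `"0"`, which is exactly the case a user disabling the feature would hit.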

"VLLM_CPU_MOE_PREPACK",
"VLLM_CPU_SGL_KERNEL",
"VLLM_TEST_FORCE_LOAD_FORMAT",
"VLLM_USE_LIGHTER_MAMBA_CACHE",

high

The VLLM_USE_LIGHTER_MAMBA_CACHE environment variable is being added to ignored_factors in compile_factors. This will exclude it from the torch.compile cache key. However, this flag significantly alters the caching logic and computation graph for Mamba models. To prevent incorrect cache hits and potential runtime errors when using torch.compile, this variable should be part of the cache key. Please remove this line from the ignored_factors set.
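The hazard can be shown with a simplified sketch of a compile cache key (this is not vLLM's actual cache-key code; `compile_cache_key` and the factor dicts are hypothetical): when a behavior-altering flag sits in the ignored set, two runs that compile different graphs collide on the same key.

```python
import hashlib
import json

def compile_cache_key(factors: dict, ignored: set) -> str:
    """Hash only the factors not in the ignored set (a simplified sketch
    of a torch.compile-style cache key, not vLLM's implementation)."""
    kept = {k: v for k, v in sorted(factors.items()) if k not in ignored}
    return hashlib.sha256(json.dumps(kept).encode()).hexdigest()

ignored = {"VLLM_USE_LIGHTER_MAMBA_CACHE"}
run_a = {"model": "qwen3-next", "VLLM_USE_LIGHTER_MAMBA_CACHE": True}
run_b = {"model": "qwen3-next", "VLLM_USE_LIGHTER_MAMBA_CACHE": False}

# With the flag ignored, both runs map to the same compiled artifact,
# even though the flag changes the Mamba caching logic.
print(compile_cache_key(run_a, ignored) == compile_cache_key(run_b, ignored))  # True
# Removing it from the ignored set makes the keys diverge, as the review requests.
print(compile_cache_key(run_a, set()) == compile_cache_key(run_b, set()))      # False
```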

Comment on lines +302 to +329
if (
    envs.VLLM_USE_LIGHTER_MAMBA_CACHE
    and self.cache_config.enable_prefix_caching
    and self._has_mamba_spec()
):
    # To enable block-aligned caching of the Mamba state, `num_new_tokens`
    # must be a multiple of `block_size`.
    # As an exception, if `num_new_tokens` is less than `block_size`, the
    # state is simply not cached, requiring no special handling.
    # Additionally, when Eagle mode is enabled, FullAttn prunes the last
    # matching block. To prevent this from causing a Mamba cache miss, the
    # last chunk must be larger than `block_size`.
    block_size = self.block_size
    max_last_chunk = block_size * (2 if self.use_eagle else 1)
    if num_new_tokens < max_last_chunk:
        num_new_tokens = min(num_new_tokens, token_budget)
    else:
        ori_num_new_tokens = num_new_tokens
        num_new_tokens = min(num_new_tokens, token_budget)
        num_new_tokens = num_new_tokens // block_size * block_size
        if (
            self.use_eagle
            and ori_num_new_tokens - num_new_tokens < block_size
        ):
            assert num_new_tokens >= block_size
            num_new_tokens -= block_size
else:
    num_new_tokens = min(num_new_tokens, token_budget)

high

The logic for calculating num_new_tokens when VLLM_USE_LIGHTER_MAMBA_CACHE is enabled is duplicated in two places within the schedule method (here and at lines 593-617). This complex logic, which handles block-aligned caching for Mamba state and special conditions for Eagle mode, is difficult to maintain in two separate places. Any future changes might be applied to one copy but not the other, leading to bugs. This duplicated code should be refactored into a helper method to improve maintainability and reduce the risk of inconsistencies.

@heheda12345 heheda12345 self-assigned this Dec 17, 2025
@heheda12345
Collaborator

@joennlae did you get a chance to talk with @peakcrosser7? We are actually iterating on #29272.

@mergify

mergify bot commented Dec 17, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @joennlae.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Dec 17, 2025
@joennlae
Copy link
Contributor Author

Closing due to #29272.

@joennlae joennlae closed this Dec 17, 2025

Labels

needs-rebase, qwen (Related to Qwen models), v1
