light version of prefix caching for hybrid models gdn attention #30725
joennlae wants to merge 1 commit into vllm-project:main
Conversation
Copied and rebased from vllm-project#28176. Thanks to @peakcrosser7 and @minminsun.

Signed-off-by: Jannis Schönleber <joennlae@gmail.com>
Code Review
This pull request introduces a "light version" of prefix caching for hybrid models, which is a significant new feature. The changes are spread across several files, touching scheduling, cache management, and attention backends. My review has identified a critical bug in how a new environment variable is parsed, which could lead to incorrect behavior. I've also found a potential issue with torch.compile caching due to an incorrect configuration, and a significant code duplication in the scheduler that should be refactored to improve maintainability. The rest of the changes appear consistent with the feature's goal.
```python
"VLLM_USE_LIGHTER_MAMBA_CACHE": lambda: bool(
    os.getenv("VLLM_USE_LIGHTER_MAMBA_CACHE", False)
),
```
The parsing of the `VLLM_USE_LIGHTER_MAMBA_CACHE` environment variable is incorrect. When a user sets `VLLM_USE_LIGHTER_MAMBA_CACHE=0`, `os.getenv` returns the string `"0"`, and `bool("0")` evaluates to `True` because any non-empty string is truthy. This should be parsed like the other boolean environment variables in this file: convert the value to an integer before casting to a boolean.
```diff
-"VLLM_USE_LIGHTER_MAMBA_CACHE": lambda: bool(
-    os.getenv("VLLM_USE_LIGHTER_MAMBA_CACHE", False)
-),
+"VLLM_USE_LIGHTER_MAMBA_CACHE": lambda: bool(
+    int(os.getenv("VLLM_USE_LIGHTER_MAMBA_CACHE", "0"))
+),
```
```python
"VLLM_CPU_MOE_PREPACK",
"VLLM_CPU_SGL_KERNEL",
"VLLM_TEST_FORCE_LOAD_FORMAT",
"VLLM_USE_LIGHTER_MAMBA_CACHE",
```
The VLLM_USE_LIGHTER_MAMBA_CACHE environment variable is being added to ignored_factors in compile_factors. This will exclude it from the torch.compile cache key. However, this flag significantly alters the caching logic and computation graph for Mamba models. To prevent incorrect cache hits and potential runtime errors when using torch.compile, this variable should be part of the cache key. Please remove this line from the ignored_factors set.
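To illustrate why this matters (a generic sketch, not vLLM's actual `compile_factors` implementation: the `cache_key` helper and its signature are hypothetical), excluding a behavior-altering flag from the key makes two incompatible configurations collide on the same cached artifact:

```python
import hashlib

def cache_key(factors: dict, ignored: set) -> str:
    # Hypothetical key derivation: hash only the factors
    # that are not in the ignored set.
    items = sorted((k, v) for k, v in factors.items() if k not in ignored)
    return hashlib.sha256(repr(items).encode()).hexdigest()

env_on = {"VLLM_USE_LIGHTER_MAMBA_CACHE": "1", "OTHER_FLAG": "x"}
env_off = {"VLLM_USE_LIGHTER_MAMBA_CACHE": "0", "OTHER_FLAG": "x"}

# With the flag ignored, both configurations get the same key,
# so a graph compiled for one can be served for the other.
collides = cache_key(env_on, {"VLLM_USE_LIGHTER_MAMBA_CACHE"}) == \
           cache_key(env_off, {"VLLM_USE_LIGHTER_MAMBA_CACHE"})

# Including the flag in the key distinguishes the two.
distinct = cache_key(env_on, set()) != cache_key(env_off, set())
```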
```python
if (
    envs.VLLM_USE_LIGHTER_MAMBA_CACHE
    and self.cache_config.enable_prefix_caching
    and self._has_mamba_spec()
):
    # To enable block-aligned caching of the Mamba state, `num_new_tokens`
    # must be a multiple of `block_size`.
    # As an exception, if `num_new_tokens` is less than `block_size`, the
    # state is simply not cached, requiring no special handling.
    # Additionally, when Eagle mode is enabled, FullAttn prunes the last
    # matching block. To prevent this from causing a Mamba cache miss, the
    # last chunk must be larger than `block_size`.
    block_size = self.block_size
    max_last_chunk = block_size * (2 if self.use_eagle else 1)
    if num_new_tokens < max_last_chunk:
        num_new_tokens = min(num_new_tokens, token_budget)
    else:
        ori_num_new_tokens = num_new_tokens
        num_new_tokens = min(num_new_tokens, token_budget)
        num_new_tokens = num_new_tokens // block_size * block_size
        if (
            self.use_eagle
            and ori_num_new_tokens - num_new_tokens < block_size
        ):
            assert num_new_tokens >= block_size
            num_new_tokens -= block_size
else:
    num_new_tokens = min(num_new_tokens, token_budget)
```
The logic for calculating num_new_tokens when VLLM_USE_LIGHTER_MAMBA_CACHE is enabled is duplicated in two places within the schedule method (here and at lines 593-617). This complex logic, which handles block-aligned caching for Mamba state and special conditions for Eagle mode, is difficult to maintain in two separate places. Any future changes might be applied to one copy but not the other, leading to bugs. This duplicated code should be refactored into a helper method to improve maintainability and reduce the risk of inconsistencies.
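One way the refactor could look (a sketch only: the helper name `_align_mamba_num_new_tokens` and the stand-in `SchedulerSketch` class are assumptions, not the actual vLLM scheduler):

```python
class SchedulerSketch:
    """Minimal stand-in for the scheduler; block_size and use_eagle
    mirror the attributes used in the diff above."""

    def __init__(self, block_size: int, use_eagle: bool):
        self.block_size = block_size
        self.use_eagle = use_eagle

    def _align_mamba_num_new_tokens(
        self, num_new_tokens: int, token_budget: int
    ) -> int:
        # Extracted once so both call sites in schedule() share it.
        block_size = self.block_size
        max_last_chunk = block_size * (2 if self.use_eagle else 1)
        if num_new_tokens < max_last_chunk:
            # Below one (or, with Eagle, two) block(s): the state is
            # simply not cached, so no alignment is needed.
            return min(num_new_tokens, token_budget)
        ori_num_new_tokens = num_new_tokens
        num_new_tokens = min(num_new_tokens, token_budget)
        # Round down to a multiple of block_size for block-aligned caching.
        num_new_tokens = num_new_tokens // block_size * block_size
        if self.use_eagle and ori_num_new_tokens - num_new_tokens < block_size:
            # Keep the last chunk larger than block_size so Eagle's
            # pruning of the last matching block cannot cause a miss.
            assert num_new_tokens >= block_size
            num_new_tokens -= block_size
        return num_new_tokens


# Usage: both occurrences in schedule() collapse to one call.
s = SchedulerSketch(block_size=16, use_eagle=False)
aligned = s._align_mamba_num_new_tokens(100, 80)  # 80 (already block-aligned)
```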
@joennlae did you get a chance to talk with @peakcrosser7? We are actually iterating on #29272
This pull request has merge conflicts that must be resolved before it can be merged.
Closing due to #29272