[PD][Nixl] Add support for hybrid SSM-FA models #36687
NickLucche merged 22 commits into vllm-project:main
Conversation
Code Review
This pull request introduces comprehensive support for hybrid SSM-FA models, a significant and complex feature. While the changes span test configurations, core KV connector logic, and scheduler behavior, a critical security vulnerability was identified in the scheduler's handling of invalid KV cache blocks for HMA-enabled requests. Specifically, the validation logic fails to check all relevant KV cache groups, which could lead to the use of uninitialized memory and potential PII leakage. Additionally, several debug print statements and potential logic errors in the Nixl connector were observed and should be addressed to ensure production readiness.
```python
register_remote_blocks(blocks_data, mamba=False)
if self._is_mamba:
    assert self.num_descs == len(blocks_data)
```
This assertion is critical for ensuring that the number of descriptors (self.num_descs) matches the actual number of blocks being registered (len(blocks_data)). An inconsistency here could lead to memory corruption or incorrect KV cache transfers. It's important to verify that self.num_descs is always accurately calculated to reflect all registered blocks, including those for Mamba layers and any logical duplications for K/V splits.
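To make the bookkeeping concrete, here is a toy sketch (names are illustrative, not the actual connector code; in the real connector any K/V split must already be reflected in `blocks_data` itself before the count is taken):

```python
def build_blocks_data(base_addrs, num_blocks, block_len, device_id=0):
    """One (start_addr, length, device_id) descriptor per block per region."""
    return [
        (base + b * block_len, block_len, device_id)
        for base in base_addrs
        for b in range(num_blocks)
    ]

# Two registered regions (e.g. separate K and V) x 4 blocks -> 8 descriptors.
blocks_data = build_blocks_data([0x1000, 0x9000], num_blocks=4, block_len=0x100)
num_descs = len(blocks_data)
assert num_descs == 2 * 4  # the count the assert above protects
```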
vllm/v1/core/sched/scheduler.py
Outdated
```diff
  all_req_block_ids = (
      (block_id for group in req_block_ids for block_id in group)
      if is_hma
      else req_block_ids[0]
  )
  req_num_computed_blocks = (
      req_num_computed_tokens + self.block_size - 1
  ) // self.block_size
- for idx, block_id in zip(range(req_num_computed_blocks), req_block_ids):
+ for idx, block_id in enumerate(all_req_block_ids):
+     if idx >= req_num_computed_blocks:
+         break
```
The scheduler's logic for identifying requests affected by invalid KV cache blocks is incomplete when Hybrid Memory Allocator (HMA) is used. The code flattens all KV cache groups into a single list but only iterates through the first req_num_computed_blocks elements. In HMA mode, this typically corresponds to the blocks in the Full Attention group, causing the validation to skip blocks in other groups (e.g., Sliding Window). If a block in these skipped groups failed to load from a remote source, the scheduler will fail to detect it, potentially leading the model runner to use uninitialized or stale GPU memory, which could result in PII leakage or incorrect model outputs.
Modify the validation loop to ensure all blocks in all KV cache groups that are relevant to the computed tokens are checked against the invalid_block_ids set. Since HMA does not support partial recovery, any invalid block in any group should trigger a full eviction and recomputation for the request.
Suggested change:

```python
if is_hma:
    all_req_block_ids = [
        block_id for group in req_block_ids for block_id in group
    ]
    req_num_computed_blocks = len(all_req_block_ids)
else:
    all_req_block_ids = req_block_ids[0]
    req_num_computed_blocks = (
        req_num_computed_tokens + self.block_size - 1
    ) // self.block_size
for idx, block_id in enumerate(all_req_block_ids):
    if idx >= req_num_computed_blocks:
        break
```
```python
print(f"{self.vllm_config.cache_config.mamba_page_size_padded=}\n\n")
# block size: 400, the one from the FA spec
print(f"block size: {self.block_size}\n\n")
print("NUM_BLOCKS: ", self.num_blocks, "\n\n", flush=True)
```
```python
if tensor_size_bytes is None:
    tensor_size_bytes = curr_tensor_size_bytes
```

```python
print(f"{layer_name=}, {[v.shape for v in cache_list]}")
```
```python
local_block_len = self.get_backend_aware_kv_block_len(
    layer_idx=i, first_split=True, mamba_view=mamba
)
print(f"Add agent {i=}, {local_block_len=}\n", flush=True)
```
```python
layer_spec.page_size_bytes
if isinstance(layer_spec, MambaSpec)
else layer_spec.page_size_bytes
// self._physical_blocks_per_logical_kv_block
```
The block_len_per_layer list is populated conditionally for non-Mamba specs and then truncated based on seen_base_addresses. This approach can be fragile. If the order or count of seen_base_addresses does not perfectly align with the non-Mamba layers for which block_len_per_layer was intended, it could lead to incorrect block length assignments. Consider ensuring a more robust mapping or initialization of block_len_per_layer that directly corresponds to the registered regions.
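A hypothetical sketch of that suggestion, keying block lengths by each region's base address so list ordering cannot desynchronize (the `Region` type and all names here are illustrative, not the actual connector attributes):

```python
from dataclasses import dataclass

@dataclass
class Region:
    base_addr: int
    page_size_bytes: int
    is_mamba: bool

def block_len_by_region(regions, physical_blocks_per_logical_block: int):
    """Map each registered non-Mamba region's base address to its block
    length, so lookups never depend on the order of a parallel list."""
    return {
        r.base_addr: r.page_size_bytes // physical_blocks_per_logical_block
        for r in regions
        if not r.is_mamba  # mamba layers keep their own state-size bookkeeping
    }

regions = [Region(0x1000, 4096, False), Region(0x2000, 8192, True)]
lens = block_len_by_region(regions, physical_blocks_per_logical_block=2)
```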
Tested the hybrid SSM P/D disaggregation on 2× H100 with
This pull request has merge conflicts that must be resolved before it can be merged.
Force-pushed from 629d263 to 33b40a3
Hi @NickLucche, the pre-commit checks have failed. Please run:

```shell
uv pip install pre-commit
pre-commit install
pre-commit run --all-files
```

Then, commit the changes and push to your branch.
```shell
  "DP_EP=1 GPU_MEMORY_UTILIZATION=0.8 PREFILLER_TP_SIZE=2 DECODER_TP_SIZE=2 MODEL_NAMES=deepseek-ai/deepseek-vl2-tiny" # MLA+P-TP2, D-DPEP=2 (TP=1)
)
hybrid_ssm_configs=(
  "ENABLE_HMA_FLAG=1 GPU_MEMORY_UTILIZATION=0.8 MODEL_NAMES=nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-FP8 VLLM_SERVE_EXTRA_ARGS=--max-model-len,8192,--trust-remote-code"
```
I hope this fits on CI
```python
if self._is_mamba:
    assert self._is_hma_required
    mamba_spec = next(
```
I could probably wrap this bit to reduce bloat
```python
page_size = (
    layer_spec.page_size_bytes
    if isinstance(layer_spec, MambaSpec)
    else layer_spec.page_size_bytes
    // self._physical_blocks_per_logical_kv_block
)
num_blocks = (
    self._logical_num_blocks
    if isinstance(layer_spec, MambaSpec)
    else self.num_blocks
)
# `page_size` accounts for physical blocks, st KVCache is always
# [`num_blocks` * `page_size`]
if not isinstance(layer_spec, MambaSpec):
    self.block_len_per_layer.append(page_size)
curr_tensor_size_bytes = num_blocks * page_size
if tensor_size_bytes is None:
    tensor_size_bytes = curr_tensor_size_bytes
```
This logic has been moved from the inner loop to here, extending the work to rely solely on `KVCacheConfig` rather than on tensor views.
```python
blocks_data: list[tuple[int, int, int]] = []
local_base_addresses = self.kv_caches_base_addr[self.engine_id][self.tp_rank]

def register_blocks(blocks_data: list[tuple[int, int, int]], mamba: bool):
```
Wrapping the whole block in a function to re-use for the mamba descriptors, appended at the end.
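A minimal sketch of that wrapper pattern (all names here are illustrative, not the actual connector code):

```python
def build_descriptors(attn_bases, mamba_bases, num_blocks, attn_len, mamba_len):
    """One inner helper builds (addr, len, device_id) tuples for a set of
    regions; it is called once for the attention regions and once for the
    mamba regions, so the mamba descriptors land at the end in stable order."""
    blocks_data: list[tuple[int, int, int]] = []

    def register_blocks(bases, block_len):
        for base in bases:
            for b in range(num_blocks):
                blocks_data.append((base + b * block_len, block_len, 0))

    register_blocks(attn_bases, attn_len)    # FA K/V regions first
    register_blocks(mamba_bases, mamba_len)  # mamba state regions appended last
    return blocks_data
```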
```python
# local mapped:| 0| 1| 2| 3| 4| 5| 6| 7| 8| 9|10|11|12|13|14|15|
assert self.kv_topo is not None
block_size_ratio = self.kv_topo.block_size_ratio_from_engine_id(engine_id)
kv_topo = self.kv_topo
```
mypy was complaining
I see the scheduler changes related to failure recovery.
tdoublep left a comment:
Some initial comments (haven't finished reading it all yet).
```python
"deepseek-ai/deepseek-vl2-tiny": 0.19,
"deepseek-ai/DeepSeek-V2-Lite-Chat": 0.65,
"google/gemma-3-4b-it": 0.74,
"nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-FP8": 0.84,
```
I can switch to granite
```python
if not isinstance(layer_spec, MambaSpec):
    self.block_len_per_layer.append(page_size)
```
Maybe I'm missing something, but where does `self.block_len_per_layer` get populated for the Mamba layers?
Good point, there's a comment where I define the var:

```python
# Enable different block lengths for different layers *only* when MLA is used.
# This is not used for SSM layers, which use the counterpart `mamba_ssm_size`.
```

Let me know if that should be expanded. Basically, `UniformTypeKVCacheSpecs` can allow for different page sizes; currently that is only used for the dsv32 Indexer afaik @heheda12345.
```diff
  # we just mock num_blocks to 1 for the dimension check below.
- self._is_kv_layout_blocks_first = (
+ # Hybrid SSM models assume a single blocks_first layout
+ self._is_kv_layout_blocks_first = self.is_mamba or (
```
I wonder if _is_kv_layout_blocks_first could be a property of the attention backend rather than needing to compute it here?
Great point!
We can address it in a scoped PR
```python
# Regular case: backends like FA register K/V in separate regions
return cache if self.split_k_and_v else [cache]
```
Again, this is probably just my lack of familiarity with this part of the code, but how does returning the tensor vs. the tensor wrapped in a list relate to registering the K/V separately?
Because in the original `register_kv_cache` code we iterate over the returned value, as in `for cache in cache_list:`.
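To illustrate the point: the caller iterates over the returned value, so returning the bare tensor versus a one-element list changes what each loop step registers. A stand-in using plain lists in place of torch tensors (illustrative only):

```python
def split_cache(cache, split_k_and_v: bool):
    # `cache` stands in for a KV tensor shaped [2, num_blocks, ...]:
    # index 0 is K, index 1 is V. Returning `cache` lets the caller's
    # `for cache in cache_list` loop register K and V as separate regions;
    # wrapping it as [cache] registers the fused tensor as a single region.
    return cache if split_k_and_v else [cache]

kv = ["K_region", "V_region"]            # stand-in for the stacked KV tensor
assert list(split_cache(kv, True)) == ["K_region", "V_region"]     # two regions
assert list(split_cache(kv, False)) == [["K_region", "V_region"]]  # one region
```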
start refactoring
address kernel block size mismatch by handling 2 num_blocks
Signed-off-by: NickLucche <nlucches@redhat.com>
Signed-off-by: NickLucche <nlucches@redhat.com>
Signed-off-by: NickLucche <nlucches@redhat.com>
/gemini review
Code Review
This pull request adds support for hybrid SSM-FA models to the NixlConnector, which is a significant feature enhancement. The changes are extensive, touching upon KV cache registration, descriptor management, and metadata handling to accommodate the specific requirements of Mamba-based models alongside traditional attention mechanisms. The addition of comprehensive unit tests is commendable. I've identified a critical issue related to the calculation of page sizes for Mamba layers in the presence of a kernel block size mismatch, which could lead to incorrect behavior. The logic is unnecessarily complex and error-prone. My review includes a suggestion to refactor this for correctness and improved clarity.
Signed-off-by: NickLucche <nlucches@redhat.com>
Signed-off-by: NickLucche <nlucches@redhat.com>
Signed-off-by: NickLucche <nlucches@redhat.com>
Signed-off-by: NickLucche <nlucches@redhat.com>
Signed-off-by: NickLucche <nlucches@redhat.com>
Signed-off-by: NickLucche <nlucches@redhat.com>
Signed-off-by: wendyliu235 <wenjun.liu@intel.com>
Signed-off-by: Monishver Chandrasekaran <monishverchandrasekaran@gmail.com>
Signed-off-by: Vinay Damodaran <vrdn@hey.com>
Signed-off-by: EricccYang <yangyang4991@gmail.com>
For a comprehensive description of the changes proposed here, check out the corresponding RFC #36780.
This PR adds support for hybrid SSM-based models, such as nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-FP8, with NixlConnector, enabling KVCache transfer of both FA and Mamba states in disaggregated setups. Currently it only supports homogeneous TP sizes on both P and D.
Note that we're only transferring the actual mamba states and skipping any padding that may be present, as that might have non-trivial size.
UPDATE:
Re this change: in this PR I am trying to further move away from relying on tensor views, while unifying usage of `kv_cache_config` in code as the single source of truth. This is also necessary for Mamba-like models, in which the tensor (`cache` above) gives the unpadded tensor size; that doesn't reflect `num_blocks * physical_page_size`, as one would need to take the padding into account manually.
Important notes:
Requires `--no-async-scheduling` to run correctly. @ZhanqiuHu and I identified a synchronization issue where states may be transferred in a corrupted form, leading to high variance in evaluations. Will address separately, as that is likely unrelated to SSMs.
Test with:
Enable HMA experimental support with `--no-disable-hybrid-kv-cache-manager`, or check out the unit tests added with this PR.
Results from running consecutive full lm-eval runs with no prefix caching:
TODO