[Core][KVConnector] Support HMA+NixlConnector#35758
NickLucche merged 29 commits into vllm-project:main from
Conversation
Signed-off-by: NickLucche <nlucches@redhat.com>
Signed-off-by: Chendi Xue <chendi.xue@intel.com>
Code Review
This pull request enables the NixlConnector to work with the Hybrid KV Cache Manager (HMA), which is a significant step towards optimizing performance for models with hybrid attention mechanisms. The changes are extensive, involving updates to the core connector logic to handle multiple KV cache groups, adapting existing tests, and adding new tests for HMA-specific functionality. While the implementation correctly adapts data structures and logic for HMA, I have identified a critical issue in the failure handling mechanism for HMA transfers, which could lead to silent errors or hangs. Additionally, there is a bug in one of the test scripts that could affect test correctness. Addressing these issues will be crucial for the stability and reliability of this new feature.
/gemini review
Code Review
This pull request introduces support for the Hybrid Memory Allocator (HMA) in the NixlConnector, a valuable enhancement for models utilizing hybrid attention mechanisms. However, it contains critical vulnerabilities where failed KV cache transfers are not properly handled when HMA is enabled. Specifically, the code skips invalidating KV cache blocks upon failure, which can lead to cross-request information leakage as stale data from reused blocks may be exposed, potentially causing correctness issues and system instability. Furthermore, the implementation uses fixed indexing (local_block_ids[0]) that can cause an IndexError and crash the engine when handling full prefix cache hits where the block list is empty. These issues should be addressed by implementing comprehensive block invalidation across all KV cache groups and ensuring safe access to the block ID structures.
```python
if (meta := self._recving_metadata.get(req_id)) and not self._is_hma_required:
    self._invalid_block_ids.update(meta.local_block_ids[0])
```
This section has critical vulnerabilities. Failed transfers in _handle_failed_transfer do not invalidate blocks when HMA is enabled, potentially leading to cross-request information leakage as stale data from reused blocks may be exposed. This can cause correctness issues and system instability. Furthermore, accessing meta.local_block_ids[0] on an empty tuple (possible during full prefix cache hits) will cause an IndexError and crash the engine. Comprehensive block invalidation across all KV cache groups and safe access to block ID structures are required.
Suggested change:

```diff
-if (meta := self._recving_metadata.get(req_id)) and not self._is_hma_required:
-    self._invalid_block_ids.update(meta.local_block_ids[0])
+if (meta := self._recving_metadata.get(req_id)):
+    for group in meta.local_block_ids:
+        self._invalid_block_ids.update(group)
```
```python
if (
    req_meta := self._recving_metadata.get(req_id)
) and not self._is_hma_required:
    self._invalid_block_ids.update(req_meta.local_block_ids[0])
self._failed_recv_reqs.add(req_id)
```
When Hybrid Memory Allocator (HMA) is enabled, failed KV cache transfers do not trigger block invalidation. This is explicitly skipped with a TODO comment. If a transfer fails, the request is still marked as finished and scheduled for execution, leading the engine to use uninitialized or stale data from reused KV cache blocks. This can result in a cross-request information leak where data from a previous request is exposed to a new one. Additionally, accessing req_meta.local_block_ids[0] will raise an IndexError if local_block_ids is an empty tuple, which occurs during full prefix cache hits. This can crash the background handshake thread, leading to a Denial of Service.
Suggested change:

```diff
-if (
-    req_meta := self._recving_metadata.get(req_id)
-) and not self._is_hma_required:
-    self._invalid_block_ids.update(req_meta.local_block_ids[0])
+if (req_meta := self._recving_metadata.get(req_id)):
+    for group in req_meta.local_block_ids:
+        self._invalid_block_ids.update(group)
 self._failed_recv_reqs.add(req_id)
```
With HMA, `local_block_ids` can't be an empty tuple; it can be `([])` (a tuple containing an empty list), though.
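To make the shape discussed above concrete, here is a small hypothetical sketch (not code from this PR; the helper name `invalidate_all_groups` is illustrative) of how `local_block_ids` behaves under HMA — a tuple with one block-id list per KV cache group, where a group's list may be empty on a full prefix cache hit — and how invalidation can iterate safely over every group instead of indexing `[0]`:

```python
# Hypothetical sketch: with HMA, local_block_ids is a tuple with one
# list of block ids per KV cache group. The tuple itself is non-empty,
# but an individual group's list may be empty (e.g. full prefix cache
# hit), so iterating over groups is safer than indexing [0].

def invalidate_all_groups(
    local_block_ids: tuple[list[int], ...],
    invalid_block_ids: set[int],
) -> None:
    """Mark every block in every KV cache group as invalid."""
    for group in local_block_ids:
        invalid_block_ids.update(group)

invalid: set[int] = set()
invalidate_all_groups(([],), invalid)           # full prefix hit: no-op
invalidate_all_groups(([1, 2], [3]), invalid)   # two KV cache groups
print(sorted(invalid))  # [1, 2, 3]
```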
Hi @NickLucche, the pre-commit checks have failed. Please run:

```shell
uv pip install pre-commit
pre-commit install
pre-commit run --all-files
```

Then, commit the changes and push to your branch.
Signed-off-by: NickLucche <nlucches@redhat.com>
Signed-off-by: cong-or <conchubhar.gannon@gmail.com>
Same as #32204, but with no KV block recovery in case of failure during a transfer.
Overview
Currently, connectors are unable to take full advantage of models that employ hybrid attention (full attention + sliding-window attention) and instead treat all layers as full attention, because the Hybrid KV Cache Manager is disabled.
This PR enables NixlConnector to work with the HMA, drastically reducing the number of bytes/regions moved per transfer for SWA+FA models, while laying the groundwork for state-based models (Mamba, etc.).
Example of the former:
Test with:
Enable HMA experimental support with `--no-disable-hybrid-kv-cache-manager`.

lm-eval results:
or the newly added file
EDIT:
I've also validated part of the lm-eval CI locally; you can test out the different tracked configs with

Run:

```shell
python -m pytest -s -v -x tests/v1/kv_connector/unit/test_invalid_blocks_correctness.py::test_hma_sync_recompute_evicts_all_blocks
```

to test the invalid block handling with HMA.

TODOs
cc @heheda12345 @KuntaiDu @ivanium, who are working on this with me.
Benchmarks
ShareGPT results, no prefix caching, 8xH100.
Main:
This PR:
So up to ~7% higher throughput in this small-scale intra-node setup; an inter-node setup would be more interesting to analyze.