
[Core][KVConnector] Support HMA+NixlConnector#35758

Merged
NickLucche merged 29 commits into vllm-project:main from
NickLucche:nixl-hma-rebase-no-recovery
Mar 6, 2026

Conversation


@NickLucche NickLucche commented Mar 2, 2026

Same as #32204, but with no KV block recovery in case of failure during a transfer.

Overview

Currently, connectors cannot take full advantage of models that employ hybrid attention (full attention + sliding-window attention): the Hybrid KV Cache Manager is disabled and all layers are treated as full attention.

This PR enables NixlConnector to work with the HMA, drastically reducing the number of bytes/regions moved per transfer for SWA+FA models, while laying the groundwork for state-based ones (Mamba etc.).
Example of the former:

# NON-HMA (current master)
(EngineCore_DP0 pid=521538) get_block_descs_ids [[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63]]
(EngineCore_DP0 pid=521538)
get_block_descs_ids num output 4284

# HMA --no-enable-prefix-caching --no-disable-hybrid-kv-cache-manager (this PR)
get_block_descs_ids (remote descs) [[47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63], [110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122, 123, 124, 125, 126], ... [379, 380, 381, 382, 383, 384, 385, 386, 387, 388, 389, 390, 391, 392, 393, 394, 395, 396, 397, 398, 399, 400, 401, 402, 403, 404, 405, 406, 407, 408, 409, 410, 411, 412, 413, 414, 415, 416, 417, 418, 419, 420, 421, 422, 423, 424, 425, 426, 427, 428, 429, 430, 431, 432, 433, 434, 435, 436, 437, 438, 439, 440, 441]]


get_block_descs_ids num output 1650

Test with

Enable experimental HMA support with --no-disable-hybrid-kv-cache-manager:

# usual P/D command
vllm serve google/gemma-3-4b-it \
    --trust-remote-code \
    --block-size 64 \
    --no-enable-prefix-caching \
    --no-disable-hybrid-kv-cache-manager \
    --kv-transfer-config '{"kv_connector":"NixlConnector","kv_role":"kv_both"}'

# usual toy_proxy_server.py command

lm-eval results:

|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value|   |Stderr|
|-----|------:|----------------|-----:|-----------|---|----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  | 0.74|±  |0.0441|
|     |       |strict-match    |     5|exact_match|↑  | 0.74|±  |0.0441|

or run the newly added test file:

pytest -x -v -s tests/v1/kv_connector/unit/test_nixl_connector_hma.py

EDIT:
I've also validated part of the lm-eval CI locally; you can test the different tracked configs with

cd tests && ENABLE_HMA_FLAG=1 bash v1/kv_connector/nixl_integration/config_sweep_accuracy_test.sh

To test the invalid-block handling with HMA, run:

python -m pytest -s -v -x tests/v1/kv_connector/unit/test_invalid_blocks_correctness.py::test_hma_sync_recompute_evicts_all_blocks

TODOs

  • pre-commit + mypy
  • Report and handle block-level failures
  • verify logical<>physical kernel block path
  • eval with HMA disabled to make sure there's no regression
  • verify Mamba-like models (defer to separate PR)
  • run lm-eval on different configs
  • verify with Llama4 (old optimization has been removed)
  • verify host-backed transfers (D2H->H2D)
  • block_size_ratio != 1 (defer to separate PR)
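
On the logical<>physical kernel block TODO: a minimal sketch of the kind of mapping involved, assuming each manager (logical) block spans `ratio` kernel blocks. The function name and expansion rule here are hypothetical, for illustration only, not the PR's implementation.

```python
# Hypothetical logical -> physical kernel-block expansion, assuming one
# manager block covers `ratio` contiguous kernel blocks.
def logical_to_physical(logical_block_ids: list[int], ratio: int) -> list[int]:
    return [lb * ratio + off for lb in logical_block_ids for off in range(ratio)]

print(logical_to_physical([3, 7], 2))  # [6, 7, 14, 15]
```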

cc: working with @heheda12345 @KuntaiDu @ivanium

Benchmarks

ShareGPT results, no prefix caching, 8xH100.

Main:

# Main DTP4-PTP4
============ Serving Benchmark Result ============
Successful requests:                     1000
Failed requests:                         0
Maximum request concurrency:             10
Benchmark duration (s):                  74.16
Total input tokens:                      215312
Total generated tokens:                  193901
Request throughput (req/s):              13.48
Output token throughput (tok/s):         2614.71
Peak output token throughput (tok/s):    2933.00
Peak concurrent requests:                36.00
Total token throughput (tok/s):          5518.14
---------------Time to First Token----------------
Mean TTFT (ms):                          92.65
Median TTFT (ms):                        36.87
P99 TTFT (ms):                           2410.05
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          3.31
Median TPOT (ms):                        3.32
P99 TPOT (ms):                           3.46
---------------Inter-token Latency----------------
Mean ITL (ms):                           3.31
Median ITL (ms):                         3.31
P99 ITL (ms):                            4.23
==================================================

# Main "WideEP" D DPEP4 - PTP4
============ Serving Benchmark Result ============
Successful requests:                     1000
Failed requests:                         0
Maximum request concurrency:             10
Benchmark duration (s):                  99.42
Total input tokens:                      215312
Total generated tokens:                  199033
Request throughput (req/s):              10.06
Output token throughput (tok/s):         2001.95
Peak output token throughput (tok/s):    2236.00
Peak concurrent requests:                27.00
Total token throughput (tok/s):          4167.65
---------------Time to First Token----------------
Mean TTFT (ms):                          90.22
Median TTFT (ms):                        41.13
P99 TTFT (ms):                           120.99
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          4.51
Median TPOT (ms):                        4.48
P99 TPOT (ms):                           5.08
---------------Inter-token Latency----------------
Mean ITL (ms):                           4.50
Median ITL (ms):                         4.41
P99 ITL (ms):                            7.92
================================================== 

This PR:

# HMA DTP4 - PTP4
============ Serving Benchmark Result ============
Successful requests:                     1000
Failed requests:                         0
Maximum request concurrency:             10
Benchmark duration (s):                  68.28
Total input tokens:                      215312
Total generated tokens:                  191092
Request throughput (req/s):              14.65
Output token throughput (tok/s):         2798.60
Peak output token throughput (tok/s):    2924.00
Peak concurrent requests:                32.00
Total token throughput (tok/s):          5951.91
---------------Time to First Token----------------
Mean TTFT (ms):                          45.78
Median TTFT (ms):                        37.02
P99 TTFT (ms):                           563.35
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          3.32
Median TPOT (ms):                        3.32
P99 TPOT (ms):                           3.47
---------------Inter-token Latency----------------
Mean ITL (ms):                           3.32
Median ITL (ms):                         3.32
P99 ITL (ms):                            4.23
==================================================

# HMA PR "WideEP" D DPEP4 - PTP4
============ Serving Benchmark Result ============
Successful requests:                     1000
Failed requests:                         0
Maximum request concurrency:             10
Benchmark duration (s):                  98.65
Total input tokens:                      215312
Total generated tokens:                  199033
Request throughput (req/s):              10.14
Output token throughput (tok/s):         2017.54
Peak output token throughput (tok/s):    2218.00
Peak concurrent requests:                30.00
Total token throughput (tok/s):          4200.10
---------------Time to First Token----------------
Mean TTFT (ms):                          88.58
Median TTFT (ms):                        40.12
P99 TTFT (ms):                           168.45
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          4.48
Median TPOT (ms):                        4.47
P99 TPOT (ms):                           4.71
---------------Inter-token Latency----------------
Mean ITL (ms):                           4.47
Median ITL (ms):                         4.44
P99 ITL (ms):                            6.21
==================================================

So up to ~7% higher throughput in this small-scale intra-node setup; an inter-node setup would be more interesting to analyze.
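
For reference, the deltas implied by the total token throughput figures in the tables above:

```python
# Throughput deltas implied by the benchmark tables above (total tok/s).
main_dtp4, hma_dtp4 = 5518.14, 5951.91  # Main vs HMA, DTP4-PTP4
main_wide, hma_wide = 4167.65, 4200.10  # Main vs HMA, "WideEP"
print(f"DTP4-PTP4: +{100 * (hma_dtp4 / main_dtp4 - 1):.1f}%")  # +7.9%
print(f"WideEP:    +{100 * (hma_wide / main_wide - 1):.1f}%")  # +0.8%
```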

NickLucche and others added 19 commits March 2, 2026 15:20
@NickLucche NickLucche changed the title from "Nixl hma rebase no recovery" to "[Core][KVConnector] Support HMA+NixlConnector" Mar 2, 2026

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request enables the NixlConnector to work with the Hybrid KV Cache Manager (HMA), which is a significant step towards optimizing performance for models with hybrid attention mechanisms. The changes are extensive, involving updates to the core connector logic to handle multiple KV cache groups, adapting existing tests, and adding new tests for HMA-specific functionality. While the implementation correctly adapts data structures and logic for HMA, I have identified a critical issue in the failure handling mechanism for HMA transfers, which could lead to silent errors or hangs. Additionally, there is a bug in one of the test scripts that could affect test correctness. Addressing these issues will be crucial for the stability and reliability of this new feature.

@NickLucche
Collaborator Author

/gemini review


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces support for the Hybrid Memory Allocator (HMA) in the NixlConnector, a valuable enhancement for models utilizing hybrid attention mechanisms. However, it contains critical vulnerabilities where failed KV cache transfers are not properly handled when HMA is enabled. Specifically, the code skips invalidating KV cache blocks upon failure, which can lead to cross-request information leakage as stale data from reused blocks may be exposed, potentially causing correctness issues and system instability. Furthermore, the implementation uses fixed indexing (local_block_ids[0]) that can cause an IndexError and crash the engine when handling full prefix cache hits where the block list is empty. These issues should be addressed by implementing comprehensive block invalidation across all KV cache groups and ensuring safe access to the block ID structures.

Comment on lines +2192 to +2193
if (meta := self._recving_metadata.get(req_id)) and not self._is_hma_required:
    self._invalid_block_ids.update(meta.local_block_ids[0])

Label: security-high

This section has critical vulnerabilities. Failed transfers in _handle_failed_transfer do not invalidate blocks when HMA is enabled, potentially leading to cross-request information leakage as stale data from reused blocks may be exposed. This can cause correctness issues and system instability. Furthermore, accessing meta.local_block_ids[0] on an empty tuple (possible during full prefix cache hits) will cause an IndexError and crash the engine. Comprehensive block invalidation across all KV cache groups and safe access to block ID structures are required.

Suggested change
if (meta := self._recving_metadata.get(req_id)) and not self._is_hma_required:
    self._invalid_block_ids.update(meta.local_block_ids[0])
if (meta := self._recving_metadata.get(req_id)):
    for group in meta.local_block_ids:
        self._invalid_block_ids.update(group)

Comment on lines +1390 to 1394
if (
    req_meta := self._recving_metadata.get(req_id)
) and not self._is_hma_required:
    self._invalid_block_ids.update(req_meta.local_block_ids[0])
self._failed_recv_reqs.add(req_id)

Label: security-high

When Hybrid Memory Allocator (HMA) is enabled, failed KV cache transfers do not trigger block invalidation. This is explicitly skipped with a TODO comment. If a transfer fails, the request is still marked as finished and scheduled for execution, leading the engine to use uninitialized or stale data from reused KV cache blocks. This can result in a cross-request information leak where data from a previous request is exposed to a new one. Additionally, accessing req_meta.local_block_ids[0] will raise an IndexError if local_block_ids is an empty tuple, which occurs during full prefix cache hits. This can crash the background handshake thread, leading to a Denial of Service.

Suggested change
if (
    req_meta := self._recving_metadata.get(req_id)
) and not self._is_hma_required:
    self._invalid_block_ids.update(req_meta.local_block_ids[0])
self._failed_recv_reqs.add(req_id)
if (req_meta := self._recving_metadata.get(req_id)):
    for group in req_meta.local_block_ids:
        self._invalid_block_ids.update(group)

@NickLucche (Collaborator, Author) replied:

hma can't return an empty tuple, though it can return ([],), a tuple containing an empty list
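
A quick illustration of the distinction (plain Python, not vLLM code): with one KV-cache group and no blocks, the metadata would hold a one-element tuple containing an empty list, so `[0]` indexing does not raise.

```python
# One KV-cache group with no blocks: a tuple containing an empty list,
# not the empty tuple, so [0] indexing is safe and just yields [].
local_block_ids = ([],)
print(local_block_ids[0])        # [] -- no IndexError
print(len(local_block_ids[0]))   # 0
```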

@mergify

mergify bot commented Mar 5, 2026

Hi @NickLucche, the pre-commit checks have failed. Please run:

uv pip install pre-commit
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.

Tip

Is mypy or markdownlint failing?
mypy and markdownlint are run differently in CI. If the failure is related to either of these checks, please use the following commands to run them locally:
# For mypy (substitute "3.10" with the failing version if needed)
pre-commit run --hook-stage manual mypy-3.10
# For markdownlint
pre-commit run --hook-stage manual markdownlint

@NickLucche NickLucche added the `ready` label (ONLY add when PR is ready to merge/full CI is needed) Mar 5, 2026
@NickLucche NickLucche merged commit 5b3ba94 into vllm-project:main Mar 6, 2026
61 checks passed
cong-or pushed a commit to cong-or/vllm that referenced this pull request Mar 6, 2026
alxfv pushed a commit to alxfv/vllm that referenced this pull request Mar 6, 2026
shaunkotek pushed a commit to shaunkotek/vllm that referenced this pull request Mar 8, 2026

Labels

kv-connector, ready (ONLY add when PR is ready to merge/full CI is needed), v1

3 participants