
[Core][KVConnector] Support HMA+NixlConnector#35758

Merged
NickLucche merged 29 commits into vllm-project:main from
NickLucche:nixl-hma-rebase-no-recovery
Mar 6, 2026

Conversation


@NickLucche NickLucche commented Mar 2, 2026

Same as #32204, but with no KV block recovery in case of failure during a transfer.

Overview

Currently, connectors cannot take full advantage of models that employ hybrid attention (full attention + sliding-window attention): the Hybrid KV Cache Manager is disabled and all layers are treated as full attention.

This PR enables NixlConnector to work with the HMA, drastically reducing the number of bytes/regions moved per transfer for SWA+FA models, while laying the groundwork for state-based ones (Mamba etc.).
Example of the former:

# NON-HMA (current master)
(EngineCore_DP0 pid=521538) get_block_descs_ids [[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63]]
(EngineCore_DP0 pid=521538)
get_block_descs_ids num output 4284

# HMA --no-enable-prefix-caching --no-disable-hybrid-kv-cache-manager (this PR)
get_block_descs_ids (remote descs) [[47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63], [110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122, 123, 124, 125, 126], ... [379, 380, 381, 382, 383, 384, 385, 386, 387, 388, 389, 390, 391, 392, 393, 394, 395, 396, 397, 398, 399, 400, 401, 402, 403, 404, 405, 406, 407, 408, 409, 410, 411, 412, 413, 414, 415, 416, 417, 418, 419, 420, 421, 422, 423, 424, 425, 426, 427, 428, 429, 430, 431, 432, 433, 434, 435, 436, 437, 438, 439, 440, 441]]


get_block_descs_ids num output 1650

Test with

Enable experimental HMA support with --no-disable-hybrid-kv-cache-manager:

# usual P/D command
vllm serve google/gemma-3-4b-it \
    --trust-remote-code \
    --block-size 64 \
    --no-enable-prefix-caching \
    --no-disable-hybrid-kv-cache-manager \
    --kv-transfer-config '{"kv_connector":"NixlConnector","kv_role":"kv_both"}'

# usual toy_proxy_server.py command

lm-eval results:

|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value|   |Stderr|
|-----|------:|----------------|-----:|-----------|---|----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  | 0.74|±  |0.0441|
|     |       |strict-match    |     5|exact_match|↑  | 0.74|±  |0.0441|

or run the newly added test file:

pytest -x -v -s tests/v1/kv_connector/unit/test_nixl_connector_hma.py

EDIT:
I've also validated part of the lm-eval CI locally; you can test the different tracked configs with

cd tests && ENABLE_HMA_FLAG=1 bash v1/kv_connector/nixl_integration/config_sweep_accuracy_test.sh

To test the invalid-block handling with HMA, run:

python -m pytest -s -v -x tests/v1/kv_connector/unit/test_invalid_blocks_correctness.py::test_hma_sync_recompute_evicts_all_blocks

TODOs

  • pre-commit + mypy
  • Report and handle block-level failures
  • verify logical<>physical kernel block path
  • eval with HMA disabled to make sure there's no regression
  • verify Mamba-like models (defer to separate PR)
  • run lm-eval on different configs
  • verify with Llama4 (old optimization has been removed)
  • verify host-backed transfers (D2H->H2D)
  • block_size_ratio != 1 (defer to separate PR)
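
On the logical<>physical kernel block TODO: a minimal sketch of the kind of mapping involved, assuming each manager (logical) block spans `ratio` kernel blocks. The function name and expansion rule here are hypothetical, for illustration only, not the PR's implementation.

```python
# Hypothetical logical -> physical kernel-block expansion, assuming one
# manager block covers `ratio` contiguous kernel blocks.
def logical_to_physical(logical_block_ids: list[int], ratio: int) -> list[int]:
    return [lb * ratio + off for lb in logical_block_ids for off in range(ratio)]

print(logical_to_physical([3, 7], 2))  # [6, 7, 14, 15]
```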

cc: working with @heheda12345 @KuntaiDu @ivanium

Benchmarks

ShareGPT results, no prefix caching, 8xH100.

Main:

# Main DTP4-PTP4
============ Serving Benchmark Result ============
Successful requests:                     1000
Failed requests:                         0
Maximum request concurrency:             10
Benchmark duration (s):                  74.16
Total input tokens:                      215312
Total generated tokens:                  193901
Request throughput (req/s):              13.48
Output token throughput (tok/s):         2614.71
Peak output token throughput (tok/s):    2933.00
Peak concurrent requests:                36.00
Total token throughput (tok/s):          5518.14
---------------Time to First Token----------------
Mean TTFT (ms):                          92.65
Median TTFT (ms):                        36.87
P99 TTFT (ms):                           2410.05
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          3.31
Median TPOT (ms):                        3.32
P99 TPOT (ms):                           3.46
---------------Inter-token Latency----------------
Mean ITL (ms):                           3.31
Median ITL (ms):                         3.31
P99 ITL (ms):                            4.23
==================================================

# Main "WideEP" D DPEP4 - PTP4
============ Serving Benchmark Result ============
Successful requests:                     1000
Failed requests:                         0
Maximum request concurrency:             10
Benchmark duration (s):                  99.42
Total input tokens:                      215312
Total generated tokens:                  199033
Request throughput (req/s):              10.06
Output token throughput (tok/s):         2001.95
Peak output token throughput (tok/s):    2236.00
Peak concurrent requests:                27.00
Total token throughput (tok/s):          4167.65
---------------Time to First Token----------------
Mean TTFT (ms):                          90.22
Median TTFT (ms):                        41.13
P99 TTFT (ms):                           120.99
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          4.51
Median TPOT (ms):                        4.48
P99 TPOT (ms):                           5.08
---------------Inter-token Latency----------------
Mean ITL (ms):                           4.50
Median ITL (ms):                         4.41
P99 ITL (ms):                            7.92
================================================== 

This PR:

# HMA DTP4 - PTP4
============ Serving Benchmark Result ============
Successful requests:                     1000
Failed requests:                         0
Maximum request concurrency:             10
Benchmark duration (s):                  68.28
Total input tokens:                      215312
Total generated tokens:                  191092
Request throughput (req/s):              14.65
Output token throughput (tok/s):         2798.60
Peak output token throughput (tok/s):    2924.00
Peak concurrent requests:                32.00
Total token throughput (tok/s):          5951.91
---------------Time to First Token----------------
Mean TTFT (ms):                          45.78
Median TTFT (ms):                        37.02
P99 TTFT (ms):                           563.35
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          3.32
Median TPOT (ms):                        3.32
P99 TPOT (ms):                           3.47
---------------Inter-token Latency----------------
Mean ITL (ms):                           3.32
Median ITL (ms):                         3.32
P99 ITL (ms):                            4.23
==================================================

# HMA PR "WideEP" D DPEP4 - PTP4
============ Serving Benchmark Result ============
Successful requests:                     1000
Failed requests:                         0
Maximum request concurrency:             10
Benchmark duration (s):                  98.65
Total input tokens:                      215312
Total generated tokens:                  199033
Request throughput (req/s):              10.14
Output token throughput (tok/s):         2017.54
Peak output token throughput (tok/s):    2218.00
Peak concurrent requests:                30.00
Total token throughput (tok/s):          4200.10
---------------Time to First Token----------------
Mean TTFT (ms):                          88.58
Median TTFT (ms):                        40.12
P99 TTFT (ms):                           168.45
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          4.48
Median TPOT (ms):                        4.47
P99 TPOT (ms):                           4.71
---------------Inter-token Latency----------------
Mean ITL (ms):                           4.47
Median ITL (ms):                         4.44
P99 ITL (ms):                            6.21
==================================================

So up to ~7% higher throughput in this small-scale intra-node setup; an inter-node setup would be more interesting to analyze.
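
For reference, the deltas implied by the total token throughput figures in the tables above:

```python
# Throughput deltas implied by the benchmark tables above (total tok/s).
main_dtp4, hma_dtp4 = 5518.14, 5951.91  # Main vs HMA, DTP4-PTP4
main_wide, hma_wide = 4167.65, 4200.10  # Main vs HMA, "WideEP"
print(f"DTP4-PTP4: +{100 * (hma_dtp4 / main_dtp4 - 1):.1f}%")  # +7.9%
print(f"WideEP:    +{100 * (hma_wide / main_wide - 1):.1f}%")  # +0.8%
```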

NickLucche and others added 19 commits March 2, 2026 15:20
@NickLucche NickLucche changed the title from "Nixl hma rebase no recovery" to "[Core][KVConnector] Support HMA+NixlConnector" Mar 2, 2026

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request enables the NixlConnector to work with the Hybrid KV Cache Manager (HMA), which is a significant step towards optimizing performance for models with hybrid attention mechanisms. The changes are extensive, involving updates to the core connector logic to handle multiple KV cache groups, adapting existing tests, and adding new tests for HMA-specific functionality. While the implementation correctly adapts data structures and logic for HMA, I have identified a critical issue in the failure handling mechanism for HMA transfers, which could lead to silent errors or hangs. Additionally, there is a bug in one of the test scripts that could affect test correctness. Addressing these issues will be crucial for the stability and reliability of this new feature.

@NickLucche
Collaborator Author

/gemini review


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces support for the Hybrid Memory Allocator (HMA) in the NixlConnector, a valuable enhancement for models utilizing hybrid attention mechanisms. However, it contains critical vulnerabilities where failed KV cache transfers are not properly handled when HMA is enabled. Specifically, the code skips invalidating KV cache blocks upon failure, which can lead to cross-request information leakage as stale data from reused blocks may be exposed, potentially causing correctness issues and system instability. Furthermore, the implementation uses fixed indexing (local_block_ids[0]) that can cause an IndexError and crash the engine when handling full prefix cache hits where the block list is empty. These issues should be addressed by implementing comprehensive block invalidation across all KV cache groups and ensuring safe access to the block ID structures.

Comment on lines +2192 to +2193
if (meta := self._recving_metadata.get(req_id)) and not self._is_hma_required:
    self._invalid_block_ids.update(meta.local_block_ids[0])

Label: security-high

This section has critical vulnerabilities. Failed transfers in _handle_failed_transfer do not invalidate blocks when HMA is enabled, potentially leading to cross-request information leakage as stale data from reused blocks may be exposed. This can cause correctness issues and system instability. Furthermore, accessing meta.local_block_ids[0] on an empty tuple (possible during full prefix cache hits) will cause an IndexError and crash the engine. Comprehensive block invalidation across all KV cache groups and safe access to block ID structures are required.

Suggested change
if (meta := self._recving_metadata.get(req_id)) and not self._is_hma_required:
    self._invalid_block_ids.update(meta.local_block_ids[0])
if (meta := self._recving_metadata.get(req_id)):
    for group in meta.local_block_ids:
        self._invalid_block_ids.update(group)

Comment on lines +1390 to 1394
if (
    req_meta := self._recving_metadata.get(req_id)
) and not self._is_hma_required:
    self._invalid_block_ids.update(req_meta.local_block_ids[0])
self._failed_recv_reqs.add(req_id)

Label: security-high

When Hybrid Memory Allocator (HMA) is enabled, failed KV cache transfers do not trigger block invalidation. This is explicitly skipped with a TODO comment. If a transfer fails, the request is still marked as finished and scheduled for execution, leading the engine to use uninitialized or stale data from reused KV cache blocks. This can result in a cross-request information leak where data from a previous request is exposed to a new one. Additionally, accessing req_meta.local_block_ids[0] will raise an IndexError if local_block_ids is an empty tuple, which occurs during full prefix cache hits. This can crash the background handshake thread, leading to a Denial of Service.

Suggested change
if (
    req_meta := self._recving_metadata.get(req_id)
) and not self._is_hma_required:
    self._invalid_block_ids.update(req_meta.local_block_ids[0])
self._failed_recv_reqs.add(req_id)
if (req_meta := self._recving_metadata.get(req_id)):
    for group in req_meta.local_block_ids:
        self._invalid_block_ids.update(group)

@NickLucche (Collaborator, Author) replied:

hma can't return an empty tuple, though it can return ([],), a tuple containing an empty list
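
A quick illustration of the distinction (plain Python, not vLLM code): with one KV-cache group and no blocks, the metadata would hold a one-element tuple containing an empty list, so `[0]` indexing does not raise.

```python
# One KV-cache group with no blocks: a tuple containing an empty list,
# not the empty tuple, so [0] indexing is safe and just yields [].
local_block_ids = ([],)
print(local_block_ids[0])        # [] -- no IndexError
print(len(local_block_ids[0]))   # 0
```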

@mergify

mergify bot commented Mar 5, 2026

Hi @NickLucche, the pre-commit checks have failed. Please run:

uv pip install pre-commit
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.

Tip

Is mypy or markdownlint failing?
mypy and markdownlint are run differently in CI. If the failure is related to either of these checks, please use the following commands to run them locally:
# For mypy (substitute "3.10" with the failing version if needed)
pre-commit run --hook-stage manual mypy-3.10
# For markdownlint
pre-commit run --hook-stage manual markdownlint

@NickLucche NickLucche added the `ready` label (ONLY add when PR is ready to merge/full CI is needed) Mar 5, 2026
@NickLucche NickLucche merged commit 5b3ba94 into vllm-project:main Mar 6, 2026
61 checks passed
cong-or pushed a commit to cong-or/vllm that referenced this pull request Mar 6, 2026
alxfv pushed a commit to alxfv/vllm that referenced this pull request Mar 6, 2026
shaunkotek pushed a commit to shaunkotek/vllm that referenced this pull request Mar 8, 2026

Labels

kv-connector, ready (ONLY add when PR is ready to merge/full CI is needed), v1

3 participants