
[Core][NIXL] Support HMA+NixlConnector#31802

Closed
NickLucche wants to merge 75 commits into vllm-project:main from NickLucche:nixl-hma2

Conversation

@NickLucche
Collaborator

@NickLucche NickLucche commented Jan 6, 2026

This PR is based on an early version of #30166, so the diff is currently a mess. I will clean it up and rebase it asap, and provide a more accurate description of the PR then.

UPDATE: check out #32204 for the updated PR

Overview

Currently, connectors cannot take full advantage of models that employ hybrid attention (full attention + sliding-window attention, FA+SWA): because the Hybrid KV Cache Manager is disabled, all layers are treated as FA.

This PR enables NixlConnector to work with the HMA, drastically reducing the number of bytes/regions moved per xfer for SWA+FA models, while laying the groundwork for state-based models (mamba etc.).
Example of the former:

# NON-HMA (current master)
(EngineCore_DP0 pid=521538) get_block_descs_ids [[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63]]
(EngineCore_DP0 pid=521538)
(EngineCore_DP0 pid=521538) get_block_descs_ids num output 4284

# HMA --no-enable-prefix-caching --no-disable-hybrid-kv-cache-manager (this PR)
get_block_descs_ids (remote descs) [[47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63], [110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122, 123, 124, 125, 126], ... [379, 380, 381, 382, 383, 384, 385, 386, 387, 388, 389, 390, 391, 392, 393, 394, 395, 396, 397, 398, 399, 400, 401, 402, 403, 404, 405, 406, 407, 408, 409, 410, 411, 412, 413, 414, 415, 416, 417, 418, 419, 420, 421, 422, 423, 424, 425, 426, 427, 428, 429, 430, 431, 432, 433, 434, 435, 436, 437, 438, 439, 440, 441]]


get_block_descs_ids num output 1650
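
The drop from 4284 to 1650 descriptors comes from sliding-window groups only needing the blocks that cover the last `window_size` tokens, while full-attention groups still need every block. A rough illustration of that arithmetic (hypothetical helpers for intuition only, not the PR's actual code):

```python
def fa_block_ids(seq_len: int, block_size: int) -> list[int]:
    """Full-attention layers need every KV cache block of the sequence."""
    return list(range(-(-seq_len // block_size)))  # ceil division


def swa_block_ids(seq_len: int, block_size: int, window_size: int) -> list[int]:
    """Sliding-window layers only ever read the blocks covering the
    last `window_size` tokens, so earlier blocks need not be transferred."""
    num_blocks = -(-seq_len // block_size)
    needed = -(-window_size // block_size)
    first = max(0, num_blocks - needed)
    return list(range(first, num_blocks))


# e.g. a 4096-token prompt, block_size=64, 1024-token sliding window
print(len(fa_block_ids(4096, 64)))         # 64 blocks for an FA group
print(len(swa_block_ids(4096, 64, 1024)))  # 16 blocks for an SWA group
```

With several SWA groups and one FA group, the total descriptor count shrinks roughly in proportion to the window/sequence ratio, matching the reduction seen in the logs above.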

Test with

Enable HMA experimental support with --no-disable-hybrid-kv-cache-manager:

# usual P/D command
vllm serve google/gemma-3-4b-it \
--trust-remote-code \
--block-size 64 \
--no-enable-prefix-caching \
--no-disable-hybrid-kv-cache-manager \
--kv-transfer-config '{"kv_connector":"NixlConnector","kv_role":"kv_both"}'

# usual toy_proxy_server.py command

lm-eval results:

|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value|   |Stderr|
|-----|------:|----------------|-----:|-----------|---|----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  | 0.74|±  |0.0441|
|     |       |strict-match    |     5|exact_match|↑  | 0.74|±  |0.0441|

TODO

  • Report and handle block-level failures
  • Verify logical<>physical kernel block path
  • Verify mamba-like models (defer to a separate PR)
  • Verify with llama4 (we had an old optimization)
  • Verify host-backed transfers (D2H->H2D)
  • Pre-commit + linting
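
On the logical<>physical point above: when the attention kernel's physical block size is smaller than the manager's logical block size, each logical block id expands into several contiguous kernel block ids. A minimal sketch of that mapping (hypothetical; assumes an integer ratio and may differ from the PR's `_logical_to_kernel_block_ids`):

```python
def logical_to_kernel_block_ids(
    logical_ids: list[int], blocks_per_logical: int
) -> list[int]:
    """Expand each logical block id into its contiguous kernel block ids."""
    return [
        lid * blocks_per_logical + off
        for lid in logical_ids
        for off in range(blocks_per_logical)
    ]


# Logical block size 64, kernel block size 32 -> 2 kernel blocks per logical
print(logical_to_kernel_block_ids([2, 5], 2))  # [4, 5, 10, 11]
```

Any per-group descriptor list would be flattened this way before registering transfer regions with NIXL.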

cc @heheda12345 @KuntaiDu @ivanium, who I'm working with on this.

ivanium and others added 30 commits December 13, 2025 17:25
Signed-off-by: Yifan Qiao <yifanqiao@berkeley.edu>
Co-authored-by: KuntaiDu <kuntai@uchicago.edu>
…indow, and leading padding with null blocks

Signed-off-by: Yifan Qiao <yifanqiao@berkeley.edu>

fixes

Signed-off-by: Yifan Qiao <yifanqiao@berkeley.edu>

fix get_num_blocks_to_allocate

Signed-off-by: Yifan Qiao <yifanqiao@berkeley.edu>
…ocks

Signed-off-by: Yifan Qiao <yifanqiao@berkeley.edu>
Signed-off-by: Yifan Qiao <yifanqiao@berkeley.edu>
Signed-off-by: Yifan Qiao <yifanqiao@berkeley.edu>
…ll blocks inside the single_type_block_manager

Signed-off-by: Yifan Qiao <yifanqiao@berkeley.edu>
…ng that in a follow-up PR

Signed-off-by: Yifan Qiao <yifanqiao@berkeley.edu>
Signed-off-by: Yifan Qiao <yifanqiao@berkeley.edu>
Signed-off-by: Yifan Qiao <yifanqiao@berkeley.edu>
Signed-off-by: Yifan Qiao <yifanqiao@berkeley.edu>
Signed-off-by: Yifan Qiao <yifanqiao@berkeley.edu>
Signed-off-by: Yifan Qiao <yifanqiao@berkeley.edu>
…l_computed_tokens allocation to the same function

Signed-off-by: Yifan Qiao <yifanqiao@berkeley.edu>
Signed-off-by: Yifan Qiao <yifanqiao@berkeley.edu>
Signed-off-by: Yifan Qiao <yifanqiao@berkeley.edu>
Signed-off-by: Yifan Qiao <yifanqiao@berkeley.edu>
Signed-off-by: Yifan Qiao <yifanqiao@berkeley.edu>
Signed-off-by: Yifan Qiao <yifanqiao@berkeley.edu>
Signed-off-by: Yifan Qiao <yifanqiao@berkeley.edu>
…efix_lm issue

Signed-off-by: Yifan Qiao <yifanqiao@berkeley.edu>
Signed-off-by: Yifan Qiao <yifanqiao@berkeley.edu>
Signed-off-by: Yifan Qiao <yifanqiao@berkeley.edu>
…tion endpoints (vllm-project#30769)

Signed-off-by: NickLucche <nlucches@redhat.com>
Signed-off-by: yewentao256 <zhyanwentao@126.com>
Co-authored-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com>
…lm-project#30014)

Signed-off-by: jiahanc <173873397+jiahanc@users.noreply.github.com>
…oups (vllm-project#29627)

Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
Co-authored-by: Stanislaw Wozniak <stw@zurich.ibm.com>
…s only (vllm-project#30475)

Signed-off-by: Roger Wang <hey@rogerw.io>
Co-authored-by: Sun Kim <sunytokki@gmail.com>
Signed-off-by: NickLucche <nlucches@redhat.com>
Signed-off-by: NickLucche <nlucches@redhat.com>
Signed-off-by: NickLucche <nlucches@redhat.com>
Signed-off-by: NickLucche <nlucches@redhat.com>
@mergify

mergify bot commented Jan 6, 2026

Documentation preview: https://vllm--31802.org.readthedocs.build/en/31802/

@mergify mergify bot added documentation Improvements or additions to documentation ci/build deepseek Related to DeepSeek models frontend llama Related to Llama models multi-modality Related to multi-modality (#4194) performance Performance-related issues qwen Related to Qwen models nvidia structured-output v1 labels Jan 6, 2026
@mergify mergify bot added tpu Related to Google TPUs tool-calling kv-connector labels Jan 6, 2026
@mergify

mergify bot commented Jan 6, 2026

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @NickLucche.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Jan 6, 2026
Contributor

@gemini-code-assist bot left a comment

Code Review

This pull request introduces support for HMA (Hybrid Memory Allocator) with the NixlConnector, which is a significant feature. The changes span across CI configurations, low-level CUDA kernels, and core scheduling logic. The PR also includes support for Turing architecture in Marlin kernels and various refactorings and bug fixes. While the changes are extensive and well-structured, I've identified several critical and high-severity issues. These mainly consist of temporary workarounds and FIXME comments that need to be addressed before merging, as well as the removal of important safety checks (static_assert) in CUDA kernels which could lead to memory corruption.

I was unable to create individual review comments, so my feedback is listed below.

csrc/moe/marlin_moe_wna16/marlin_template.h (1021-1025)

critical

The static_assert that validates shared memory layout has been removed. This check is important to prevent potential memory corruption in the CUDA kernel. Was its removal intentional? If so, could you please explain why it's no longer needed? If not, it should be restored.

csrc/quantization/gptq_marlin/marlin_template.h (876-880)

critical

Similar to the other Marlin template, the static_assert for shared memory layout validation has been removed here. This could lead to memory corruption. Please restore it or provide a reason for its removal.

vllm/config/model.py (1126-1128)

critical

This temporary change to disable mm_prefix_lm should be removed before merging, as indicated by the comment.

vllm/distributed/kv_transfer/kv_connector/v1/lmcache_connector.py (80-82)

high

This temporary workaround should be removed before merging, as indicated by the comment.

vllm/distributed/kv_transfer/kv_connector/v1/lmcache_connector.py (212-214)

high

This temporary workaround to return None should be removed before merging.

vllm/distributed/kv_transfer/kv_connector/v1/lmcache_connector.py (324-334)

high

This temporary workaround for request_finished_all_groups should be addressed and the comment removed before merging.

vllm/distributed/kv_transfer/kv_connector/v1/nixl_connector.py (682)

high

Please address this FIXME regarding handling a tuple of blocks.

vllm/distributed/kv_transfer/kv_connector/v1/nixl_connector.py (2195)

high

Please address this FIXME about marking blocks per group in the error handling path.

vllm/distributed/kv_transfer/kv_connector/v1/nixl_connector.py (2264)

high

Please address this FIXME regarding flattening the tuple in _logical_to_kernel_block_ids.

vllm/v1/core/sched/scheduler.py (1623-1624)

high

Please address these FIXME comments regarding HMA and environment variable changes.

vllm/v1/core/sched/scheduler.py (1627)

high

Please address this FIXME about handling blocks across layers.

vllm/v1/core/sched/scheduler.py (1638)

high

Please address this FIXME about per-group caching.

@NickLucche
Collaborator Author

closing in favor of #32204
