
[Core][NIXL] Support HMA+NixlConnector#31802

Closed
NickLucche wants to merge 75 commits into vllm-project:main from NickLucche:nixl-hma2

Conversation

@NickLucche
Collaborator

@NickLucche NickLucche commented Jan 6, 2026

This PR is based on an early version of #30166, so the diff is currently a mess. I will clean it up and rebase it asap, and provide a more accurate description of the PR then.

UPDATE: check out #32204 for the updated PR

Overview

Currently, connectors cannot take full advantage of models that employ hybrid attention (full attention + sliding-window attention, FA+SWA): because the Hybrid KV Cache Manager is disabled, all layers are treated as FA.

This PR enables NixlConnector to work with the HMA, drastically reducing the number of bytes/regions moved per xfer for SWA+FA models, while laying the groundwork for state-based models (mamba etc.).
Example of the former:

# NON-HMA (current master)
(EngineCore_DP0 pid=521538) get_block_descs_ids [[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63]]
(EngineCore_DP0 pid=521538)
(EngineCore_DP0 pid=521538) get_block_descs_ids num output 4284

# HMA --no-enable-prefix-caching --no-disable-hybrid-kv-cache-manager (this PR)
get_block_descs_ids (remote descs) [[47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63], [110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122, 123, 124, 125, 126], ... [379, 380, 381, 382, 383, 384, 385, 386, 387, 388, 389, 390, 391, 392, 393, 394, 395, 396, 397, 398, 399, 400, 401, 402, 403, 404, 405, 406, 407, 408, 409, 410, 411, 412, 413, 414, 415, 416, 417, 418, 419, 420, 421, 422, 423, 424, 425, 426, 427, 428, 429, 430, 431, 432, 433, 434, 435, 436, 437, 438, 439, 440, 441]]


get_block_descs_ids num output 1650
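
The drop from 4284 to 1650 descriptors comes from sliding-window groups only needing the blocks that cover the last `window_size` tokens, while full-attention groups still need every block. A rough illustration of that arithmetic (hypothetical helpers for intuition only, not the PR's actual code):

```python
def fa_block_ids(seq_len: int, block_size: int) -> list[int]:
    """Full-attention layers need every KV cache block of the sequence."""
    return list(range(-(-seq_len // block_size)))  # ceil division


def swa_block_ids(seq_len: int, block_size: int, window_size: int) -> list[int]:
    """Sliding-window layers only ever read the blocks covering the
    last `window_size` tokens, so earlier blocks need not be transferred."""
    num_blocks = -(-seq_len // block_size)
    needed = -(-window_size // block_size)
    first = max(0, num_blocks - needed)
    return list(range(first, num_blocks))


# e.g. a 4096-token prompt, block_size=64, 1024-token sliding window
print(len(fa_block_ids(4096, 64)))         # 64 blocks for an FA group
print(len(swa_block_ids(4096, 64, 1024)))  # 16 blocks for an SWA group
```

With several SWA groups and one FA group, the total descriptor count shrinks roughly in proportion to the window/sequence ratio, matching the reduction seen in the logs above.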

Test with

Enable HMA experimental support with --no-disable-hybrid-kv-cache-manager:

# usual P/D command
vllm serve google/gemma-3-4b-it \
--trust-remote-code \
--block-size 64 \
--no-enable-prefix-caching \
--no-disable-hybrid-kv-cache-manager \
--kv-transfer-config '{"kv_connector":"NixlConnector","kv_role":"kv_both"}'

# usual toy_proxy_server.py command

lm-eval results:

|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value|   |Stderr|
|-----|------:|----------------|-----:|-----------|---|----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  | 0.74|±  |0.0441|
|     |       |strict-match    |     5|exact_match|↑  | 0.74|±  |0.0441|

TODO

  • Report and handle block-level failures
  • Verify logical<>physical kernel block path
  • Verify mamba-like models (defer to a separate PR)
  • Verify with llama4 (we had an old optimization)
  • Verify host-backed transfers (D2H->H2D)
  • Pre-commit + linting
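
On the logical<>physical point above: when the attention kernel's physical block size is smaller than the manager's logical block size, each logical block id expands into several contiguous kernel block ids. A minimal sketch of that mapping (hypothetical; assumes an integer ratio and may differ from the PR's `_logical_to_kernel_block_ids`):

```python
def logical_to_kernel_block_ids(
    logical_ids: list[int], blocks_per_logical: int
) -> list[int]:
    """Expand each logical block id into its contiguous kernel block ids."""
    return [
        lid * blocks_per_logical + off
        for lid in logical_ids
        for off in range(blocks_per_logical)
    ]


# Logical block size 64, kernel block size 32 -> 2 kernel blocks per logical
print(logical_to_kernel_block_ids([2, 5], 2))  # [4, 5, 10, 11]
```

Any per-group descriptor list would be flattened this way before registering transfer regions with NIXL.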

cc @heheda12345 @KuntaiDu @ivanium, who I'm working with on this.

ivanium and others added 30 commits December 13, 2025 17:25
Signed-off-by: Yifan Qiao <yifanqiao@berkeley.edu>
Co-authored-by: KuntaiDu <kuntai@uchicago.edu>
…indow, and leading padding with null blocks

Signed-off-by: Yifan Qiao <yifanqiao@berkeley.edu>

fixes

Signed-off-by: Yifan Qiao <yifanqiao@berkeley.edu>

fix get_num_blocks_to_allocate

Signed-off-by: Yifan Qiao <yifanqiao@berkeley.edu>
…ocks

Signed-off-by: Yifan Qiao <yifanqiao@berkeley.edu>
Signed-off-by: Yifan Qiao <yifanqiao@berkeley.edu>
Signed-off-by: Yifan Qiao <yifanqiao@berkeley.edu>
…ll blocks inside the single_type_block_manager

Signed-off-by: Yifan Qiao <yifanqiao@berkeley.edu>
…ng that in a follow-up PR

Signed-off-by: Yifan Qiao <yifanqiao@berkeley.edu>
Signed-off-by: Yifan Qiao <yifanqiao@berkeley.edu>
Signed-off-by: Yifan Qiao <yifanqiao@berkeley.edu>
Signed-off-by: Yifan Qiao <yifanqiao@berkeley.edu>
Signed-off-by: Yifan Qiao <yifanqiao@berkeley.edu>
Signed-off-by: Yifan Qiao <yifanqiao@berkeley.edu>
…l_computed_tokens allocation to the same function

Signed-off-by: Yifan Qiao <yifanqiao@berkeley.edu>
Signed-off-by: Yifan Qiao <yifanqiao@berkeley.edu>
Signed-off-by: Yifan Qiao <yifanqiao@berkeley.edu>
Signed-off-by: Yifan Qiao <yifanqiao@berkeley.edu>
Signed-off-by: Yifan Qiao <yifanqiao@berkeley.edu>
Signed-off-by: Yifan Qiao <yifanqiao@berkeley.edu>
Signed-off-by: Yifan Qiao <yifanqiao@berkeley.edu>
…efix_lm issue

Signed-off-by: Yifan Qiao <yifanqiao@berkeley.edu>
Signed-off-by: Yifan Qiao <yifanqiao@berkeley.edu>
Signed-off-by: Yifan Qiao <yifanqiao@berkeley.edu>
…tion endpoints (vllm-project#30769)

Signed-off-by: NickLucche <nlucches@redhat.com>
Signed-off-by: yewentao256 <zhyanwentao@126.com>
Co-authored-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com>
…lm-project#30014)

Signed-off-by: jiahanc <173873397+jiahanc@users.noreply.github.com>
…oups (vllm-project#29627)

Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
Co-authored-by: Stanislaw Wozniak <stw@zurich.ibm.com>
…s only (vllm-project#30475)

Signed-off-by: Roger Wang <hey@rogerw.io>
Co-authored-by: Sun Kim <sunytokki@gmail.com>
Signed-off-by: NickLucche <nlucches@redhat.com>
Signed-off-by: NickLucche <nlucches@redhat.com>
Signed-off-by: NickLucche <nlucches@redhat.com>
Signed-off-by: NickLucche <nlucches@redhat.com>
@mergify

mergify bot commented Jan 6, 2026

Documentation preview: https://vllm--31802.org.readthedocs.build/en/31802/

@mergify mergify bot added documentation Improvements or additions to documentation ci/build deepseek Related to DeepSeek models frontend llama Related to Llama models multi-modality Related to multi-modality (#4194) performance Performance-related issues qwen Related to Qwen models nvidia structured-output v1 labels Jan 6, 2026
@mergify mergify bot added tpu Related to Google TPUs tool-calling kv-connector labels Jan 6, 2026
@mergify

mergify bot commented Jan 6, 2026

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @NickLucche.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Jan 6, 2026
Contributor

@gemini-code-assist bot left a comment

Code Review

This pull request introduces support for HMA (Hybrid Memory Allocator) with the NixlConnector, which is a significant feature. The changes span across CI configurations, low-level CUDA kernels, and core scheduling logic. The PR also includes support for Turing architecture in Marlin kernels and various refactorings and bug fixes. While the changes are extensive and well-structured, I've identified several critical and high-severity issues. These mainly consist of temporary workarounds and FIXME comments that need to be addressed before merging, as well as the removal of important safety checks (static_assert) in CUDA kernels which could lead to memory corruption.

I was unable to create individual review comments, so my feedback is listed below.

csrc/moe/marlin_moe_wna16/marlin_template.h (1021-1025)

critical

The static_assert that validates shared memory layout has been removed. This check is important to prevent potential memory corruption in the CUDA kernel. Was its removal intentional? If so, could you please explain why it's no longer needed? If not, it should be restored.

csrc/quantization/gptq_marlin/marlin_template.h (876-880)

critical

Similar to the other Marlin template, the static_assert for shared memory layout validation has been removed here. This could lead to memory corruption. Please restore it or provide a reason for its removal.

vllm/config/model.py (1126-1128)

critical

This temporary change to disable mm_prefix_lm should be removed before merging, as indicated by the comment.

vllm/distributed/kv_transfer/kv_connector/v1/lmcache_connector.py (80-82)

high

This temporary workaround should be removed before merging, as indicated by the comment.

vllm/distributed/kv_transfer/kv_connector/v1/lmcache_connector.py (212-214)

high

This temporary workaround to return None should be removed before merging.

vllm/distributed/kv_transfer/kv_connector/v1/lmcache_connector.py (324-334)

high

This temporary workaround for request_finished_all_groups should be addressed and the comment removed before merging.

vllm/distributed/kv_transfer/kv_connector/v1/nixl_connector.py (682)

high

Please address this FIXME regarding handling a tuple of blocks.

vllm/distributed/kv_transfer/kv_connector/v1/nixl_connector.py (2195)

high

Please address this FIXME about marking blocks per group in the error handling path.

vllm/distributed/kv_transfer/kv_connector/v1/nixl_connector.py (2264)

high

Please address this FIXME regarding flattening the tuple in _logical_to_kernel_block_ids.

vllm/v1/core/sched/scheduler.py (1623-1624)

high

Please address these FIXME comments regarding HMA and environment variable changes.

vllm/v1/core/sched/scheduler.py (1627)

high

Please address this FIXME about handling blocks across layers.

vllm/v1/core/sched/scheduler.py (1638)

high

Please address this FIXME about per-group caching.

@NickLucche
Collaborator Author

closing in favor of #32204
