[NIXL] Heterogeneous KV Layout and block_size - prefill NHD and nP > nD support#30448

Open
xuechendi wants to merge 2 commits into vllm-project:main from xuechendi:dev/prefill_KV_process

Conversation

Contributor

@xuechendi xuechendi commented Dec 11, 2025

Purpose

Proposal codes for #26744

This PR supports the non-host-buffer path for heterogeneous KV layout and block_size: prefill NHD and nP > nD.
Only supported when prefix_cache is disabled.

This PR includes two bug-fix PRs, which need to be merged first:
#30420
#30419

Test Plan

1P 1D with nP > nD + Prefill with NHD

DISABLE_PREFIX_CACHE=true AGREED_BLOCK_SIZE=16  PREFILL_KV_LAYOUT=NHD PREFILL_BLOCK_SIZE=64 DECODE_BLOCK_SIZE=16 bash tests/v1/kv_connector/nixl_integration/run_accuracy_test.sh

=> Accuracy passed, Qwen3-0.6B ≈ 0.41

1P 1D (decoder TP=2) + prefill with NHD

DECODER_TP_SIZE=2 DISABLE_PREFIX_CACHE=true AGREED_BLOCK_SIZE=16  PREFILL_KV_LAYOUT=NHD PREFILL_BLOCK_SIZE=64 DECODE_BLOCK_SIZE=16 bash tests/v1/kv_connector/nixl_integration/run_accuracy_test.sh

=> Accuracy passed, Qwen3-0.6B ≈ 0.41

Design

This PR proposes updating the KV layout and block_size on the prefill side after a request completes its prefill forward pass.

Case 1: the naive case, where the proposal works directly

fwd_prefill => KV update (NHD to HND, blk_size = 16)
=> send finish to decoder => decoder copies data correctly

Case 2: with chunked prefill

fwd partial prefill => temporarily save block_ids in nixl_connector_worker
=> second partial fwd => KV update (NHD to HND, blk_size = 16)
=> send finish to decoder => decoder copies data correctly

Changes proposed in this PR:

  1. Add new metadata fields block_size_after_save and kv_layout_after_save. If prefill uses different settings than decode, these are set to the agreed block_size and 'HND' respectively.
  2. The feature is only enabled when prefix_cache is disabled and enable_permute_local_kv is on.
  3. Add a new kv_caches_postprocess function, called from wait_for_save.
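The postprocess step described above (NHD -> HND permute plus splitting each large prefill block into decode-sized blocks) can be sketched roughly as follows. This is a minimal NumPy illustration under stated assumptions: the function name, the NHD shape (num_blocks, block_size, num_heads, head_dim), and the 64 -> 16 split are illustrative, not the PR's actual kv_caches_postprocess implementation.

```python
import numpy as np

def postprocess_kv_on_save(kv: np.ndarray, src_block_size: int = 64,
                           dst_block_size: int = 16) -> np.ndarray:
    """Hypothetical sketch: NHD (blocks, tokens, heads, dim) -> HND,
    splitting each large prefill block into smaller decode-sized blocks."""
    n_blk, blk, n_heads, head_dim = kv.shape
    assert blk == src_block_size and src_block_size % dst_block_size == 0
    ratio = src_block_size // dst_block_size
    # split each block of src_block_size tokens into `ratio` smaller blocks
    kv = kv.reshape(n_blk * ratio, dst_block_size, n_heads, head_dim)
    # NHD -> HND: swap the token and head axes
    return np.ascontiguousarray(kv.transpose(0, 2, 1, 3))

# 2 prefill blocks of 64 tokens, 4 heads, head_dim 8
src = np.arange(2 * 64 * 4 * 8, dtype=np.float32).reshape(2, 64, 4, 8)
out = postprocess_kv_on_save(src)
print(out.shape)  # (8, 4, 16, 8)
```

With a 64 -> 16 split, each prefill block becomes 4 decode blocks, so 2 blocks become 8, and the head axis moves ahead of the token axis.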

Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces support for heterogeneous KV layouts and block sizes in NIXL, particularly for prefill scenarios. The changes are extensive, touching both the integration test script and the core NIXL connector logic. While the overall direction is good, I've found a few critical issues. The test script has a bug in how it constructs JSON configuration strings, which can lead to test failures. More critically, in the NIXL connector, there's a debugging function with print statements left in the code, and a significant bug in the KV cache post-processing logic where block size changes are not applied in-place to the actual cache tensor. These issues need to be addressed to ensure correctness and stability.


@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.


@xuechendi xuechendi force-pushed the dev/prefill_KV_process branch 7 times, most recently from 82ce6d5 to cf6f09e on December 11, 2025 19:55
Collaborator

orozery commented Dec 12, 2025

@xuechendi @markmc
Beyond verifying the functional correctness of the logic introduced here,
I think it's also worthwhile to check its e2e performance implications.
P/D is all about improving performance, so in theory some P->D flows are better off failing the handshake than proceeding.

@mergify

mergify bot commented Dec 15, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @xuechendi.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Dec 15, 2025
@xuechendi xuechendi force-pushed the dev/prefill_KV_process branch from cf6f09e to 373011b on December 16, 2025 21:24
@mergify mergify bot removed the needs-rebase label Dec 16, 2025
@xuechendi xuechendi force-pushed the dev/prefill_KV_process branch from 373011b to 0399312 on December 16, 2025 21:27
Contributor Author

xuechendi commented Dec 16, 2025

@xuechendi @markmc Other than verifying functional correctness of the logic introduced here, I think it's also worthwhile to also check the e2e performance implications of it. P/D is all about improving performance, so in theory some P->D flows are better to be handshake-failed instead of proceeding.

@orozery, thanks for sharing your thoughts.
This path is for our heterogeneous-architecture proposal combining Intel Gaudi and CUDA; this specific PR covers prefill on Gaudi and decode on CUDA (Gaudi only supports NHD and is more performant with block_size=128).
Our performance team reported that this setup brings performance similar to H200-to-H200, with a much better TCO story.
For H200 to H200, heterogeneous block_size or kv_layout is unnecessary, so we explicitly require enable_permute_local_kv=true to be set in kv_connector_config to enable this feature.

@xuechendi xuechendi force-pushed the dev/prefill_KV_process branch from 0399312 to 6914f52 on December 16, 2025 22:04

mergify bot commented Dec 18, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @xuechendi.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot removed the needs-rebase label Dec 19, 2025
@xuechendi xuechendi force-pushed the dev/prefill_KV_process branch from 851655e to 1260965 on January 13, 2026 19:24
@xuechendi xuechendi force-pushed the dev/prefill_KV_process branch from 1260965 to 76ed03d on January 13, 2026 19:46
if block_size_ratio > 1:
    num_blocks = math.ceil(
        (request.num_tokens - 1) / self.block_size_after_save
    )


Off-by-one error in block count calculation

High Severity

The formula ceil((request.num_tokens - 1) / self.block_size_after_save) undercounts the required number of blocks. The standard vLLM pattern (seen in single_type_kv_cache_manager.py) uses cdiv(num_tokens, block_size) without the -1. For example, with 17 tokens and block_size=16, the correct count is 2 blocks, but this formula yields 1. This causes incomplete KV cache data to be transferred, potentially corrupting the decode process.
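The fix the review describes can be illustrated with the standard ceiling-division pattern; cdiv below is a local stand-in for vLLM's helper, not the library function itself:

```python
import math

def cdiv(a: int, b: int) -> int:
    # ceiling division without floats, the usual pattern for block counts
    return -(a // -b)

num_tokens, block_size = 17, 16
buggy = math.ceil((num_tokens - 1) / block_size)  # drops the 17th token
fixed = cdiv(num_tokens, block_size)              # covers all 17 tokens
print(buggy, fixed)  # 1 2
```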


@xuechendi xuechendi force-pushed the dev/prefill_KV_process branch from 76ed03d to 2a07d6b on January 13, 2026 20:12
@xuechendi xuechendi force-pushed the dev/prefill_KV_process branch 2 times, most recently from 73d5da4 to 0d69f5b on January 13, 2026 21:17
1. update example to support prefill HND and agreed_block_size
2. enable prefill side kv_layout and block_size update

Signed-off-by: Chendi Xue <chendi.xue@intel.com>
@xuechendi xuechendi force-pushed the dev/prefill_KV_process branch from 0d69f5b to eae6e09 on January 13, 2026 22:00
@xuechendi
Contributor Author

@NickLucche, could you help review this PR to support the nP > nD case?

Collaborator

@NickLucche NickLucche left a comment


Thanks for the work @xuechendi!

I currently have 2 main issues with the current implementation:

  • This makes the scheduler KV-cache dependent. In its current form, the scheduler operates only on the "logical" plane of block ids and is agnostic to layout/block_size and other physical configs. This is a separation of concerns which I quite like and would like to keep unless really necessary. One alternative I would propose for the block_size mismatch is to group the scheduler changes you're proposing into a Mixin class.
  • The "AGREED_BLOCK_SIZE" design breaks the current handshake-based workflow: today, 2 peer instances exchange info at runtime and match their config/logic accordingly after discovery. Allowing a top-down agreement through config may end up feeling "custom" or out of the ordinary. Let's discuss what other options we have to maintain the discovery-based paradigm.

Also, this PR will clash with #32204, so I would like to land it after the latter, to address potential conflicts with this feature in this PR rather than in the HMA one.

Comment on lines +259 to +264
assert vllm_config.kv_transfer_config.enable_permute_local_kv
if vllm_config.cache_config.enable_prefix_caching:
logger.warning_once(
"KV cache postprocess is not compatible with prefix caching."
)
return False, current_kv_cache_layout, current_block_size
Collaborator


Shouldn't we check this during handshake validation? I think we should crash somewhere if we're in a hetero block-size scenario with an unsupported config.

Collaborator


same for these other constraints you listed

prefill NHD and nP > nD

"All kv cache tensors must have the same number of blocks"
)

block_size_ratio_on_save = self.block_size // self.block_size_on_save
Collaborator


this is constant and should be computed once, after this loop detects any logical<>kernel block size mismatch

Comment on lines +1481 to +1484
block_len_per_layer_on_save.append(
curr_tensor_size_bytes
// self.num_blocks
// block_size_ratio_on_save
Collaborator


this is basically block_len_per_layer // block_size_ratio_on_save, where the denominator is a constant.
We don't need a separate bookkeeping struct for it; just compute it on the fly from block_len_per_layer

Comment on lines +1486 to +1488
num_blocks_on_save = (
curr_tensor_size_bytes // block_len_per_layer_on_save[-1]
)
Collaborator


ditto, this should not be computed in a for loop.
It can also be simplified:

num_blocks = curr_tensor_size_bytes / N
num_blocks_on_save = curr_tensor_size_bytes / (N / block_size_ratio_on_save)
                   = (curr_tensor_size_bytes / N) * block_size_ratio_on_save
                   = num_blocks * block_size_ratio_on_save
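A quick numeric check of that identity, with made-up sizes (the values are purely illustrative):

```python
# hypothetical sizes purely to verify the algebra in the comment above
curr_tensor_size_bytes = 1 << 20       # 1 MiB per-layer tensor
N = 4096                               # stand-in for block_len_per_layer
block_size_ratio_on_save = 4           # e.g. block_size 64 saved as 16

num_blocks = curr_tensor_size_bytes // N
num_blocks_on_save = curr_tensor_size_bytes // (N // block_size_ratio_on_save)
assert num_blocks_on_save == num_blocks * block_size_ratio_on_save
print(num_blocks, num_blocks_on_save)  # 256 1024
```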

Comment on lines +1535 to +1538
(
self.src_xfer_handles_by_block_size[self.block_size_on_save],
self.src_blocks_data,
) = self.register_local_xfer_handler(self.block_size_on_save)
Collaborator


is this pre-commit or..?

Contributor Author


yes, because I updated self.block_size to self.block_size_on_save

Comment on lines +1108 to +1110
self.postprocess_kv_caches_on_save,
self.kv_cache_layout_on_save,
self.block_size_on_save,
Collaborator


some of this should go into the compat check right?

Comment on lines +267 to +271
agreed_block_size = int(
vllm_config.kv_transfer_config.get_from_extra_config(
"agreed_block_size", current_block_size
)
)
Collaborator


Do I get this right: this is an agreement through config before the handshake? That would actually break the current flow, since all similar logic is discovery-based: configs are matched between instances and evaluated at runtime.

Do we have alternatives to this pre-agreement? It may also end up looking "custom" wrt the current UX @xuechendi
cc @markmc

Member


@xuechendi could we handle this with an additional factor in compute_nixl_compatibility_hash() ?

Contributor Author


Yes, @NickLucche, agreed_block_size breaks the current discovery-based design: the prefill side will use agreed_block_size as its block_size when communicating with decoders.

I tried to think of possible alternatives, but I really can't come up with any. If we go with my previous PR, which lets the decoder side permute for the larger block_size, the decoder ends up allocating temporary memory to hold a block_size=128 buffer when it only has block_size=16, num_blocks=4 buffers.

remote: RemoteMeta | None = None


def if_postprocess_kvcache_on_save(
Collaborator


not a great function name; better to change it to should_postprocess_kvcache_on_save (or a "needs"-style name), or should_transform_kv_for_transfer, or similar

Signed-off-by: Chendi Xue <chendi.xue@intel.com>
yeonsily added a commit to hsubramony/vllm-gaudi that referenced this pull request Jan 21, 2026
enable support for prefill side kv_layout and block_size update
1. update example to support prefill HND and agreed_block_size
2. enable prefill side kv_layout and block_size update

Signed-off-by: Chendi Xue <chendi.xue@intel.com>
Signed-off-by: Yeonsil Yoon <yeon.sil.yoon@intel.com>
@mergify

mergify bot commented Jan 22, 2026

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @xuechendi.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Jan 22, 2026
yeonsily added a commit to yeonsily/vllm-gaudi that referenced this pull request Jan 22, 2026
enable support for prefill side kv_layout and block_size update
1. update example to support prefill HND and agreed_block_size
2. enable prefill side kv_layout and block_size update

Signed-off-by: Chendi Xue <chendi.xue@intel.com>
Signed-off-by: Yeonsil Yoon <yeon.sil.yoon@intel.com>
@mergify mergify bot removed the needs-rebase label Jan 28, 2026
xuechendi pushed a commit to vllm-project/vllm-gaudi that referenced this pull request Jan 28, 2026
1. update example to support prefill HND and agreed_block_size
2. enable prefill side kv_layout and block_size update

Port vllm-project/vllm#30448 to vllm-gaudi

---------

Signed-off-by: Chendi Xue <chendi.xue@intel.com>
Signed-off-by: Yeonsil Yoon <yeon.sil.yoon@intel.com>
testdig pushed a commit to testdig/vllm-gaudi-fork that referenced this pull request Jan 29, 2026
…-project#867)

1. update example to support prefill HND and agreed_block_size
2. enable prefill side kv_layout and block_size update

Port vllm-project/vllm#30448 to vllm-gaudi

---------

Signed-off-by: Chendi Xue <chendi.xue@intel.com>
Signed-off-by: Yeonsil Yoon <yeon.sil.yoon@intel.com>
Signed-off-by: Wang, Zheng W <zheng.w.wang@intel.com>
yiliu30 pushed a commit to yiliu30/vllm-gaudi that referenced this pull request Feb 4, 2026
…-project#867)

1. update example to support prefill HND and agreed_block_size
2. enable prefill side kv_layout and block_size update

Port vllm-project/vllm#30448 to vllm-gaudi

---------

Signed-off-by: Chendi Xue <chendi.xue@intel.com>
Signed-off-by: Yeonsil Yoon <yeon.sil.yoon@intel.com>
slokesha pushed a commit to libinta/vllm-gaudi that referenced this pull request Feb 9, 2026
…-project#867)

1. update example to support prefill HND and agreed_block_size
2. enable prefill side kv_layout and block_size update

Port vllm-project/vllm#30448 to vllm-gaudi

---------

Signed-off-by: Chendi Xue <chendi.xue@intel.com>
Signed-off-by: Yeonsil Yoon <yeon.sil.yoon@intel.com>
Signed-off-by: slokesha <slokeshappa@habana.ai>
