[NIXL] Heterogeneous KV Layout and block_size - prefill NHD and nP > nD support#30448

Open
xuechendi wants to merge 2 commits into vllm-project:main from xuechendi:dev/prefill_KV_process

Conversation

Contributor

@xuechendi xuechendi commented Dec 11, 2025

Purpose

Proposal codes for #26744

This PR supports the non-host-buffer path for heterogeneous KV layout and block_size: prefill NHD and nP > nD.
Only supported when prefix_cache is disabled.

This PR includes two bug-fix PRs, which need to be merged first:
#30420
#30419

Test Plan

1P 1D with nP > nD + Prefill with NHD

DISABLE_PREFIX_CACHE=true AGREED_BLOCK_SIZE=16  PREFILL_KV_LAYOUT=NHD PREFILL_BLOCK_SIZE=64 DECODE_BLOCK_SIZE=16 bash tests/v1/kv_connector/nixl_integration/run_accuracy_test.sh

=> Accuracy passed, Qwen3-0.6B ≈ 0.41

1P 1D (decoder TP=2) + prefill with NHD

DECODER_TP_SIZE=2 DISABLE_PREFIX_CACHE=true AGREED_BLOCK_SIZE=16  PREFILL_KV_LAYOUT=NHD PREFILL_BLOCK_SIZE=64 DECODE_BLOCK_SIZE=16 bash tests/v1/kv_connector/nixl_integration/run_accuracy_test.sh

=> Accuracy passed, Qwen3-0.6B ≈ 0.41

Design

This PR proposes updating the KV layout and block_size on the prefill side after a request completes its prefill forward pass.

Case 1: the naive case, where the proposal works directly

fwd_prefill => KV update (NHD to HND, blk_size = 16)
=> send finish to decoder => decoder copies data correctly

Case 2: with chunked prefill

fwd partial prefill => temporarily save block_ids in nixl_connector_worker
=> second partial fwd => KV update (NHD to HND, blk_size = 16)
=> send finish to decoder => decoder copies data correctly

Changes proposed in this PR:

  1. Add new metadata fields block_size_after_save and kv_layout_after_save. If prefill uses different settings than decode, these are set to the agreed block_size and 'HND' respectively.
  2. The feature is only enabled when prefix_cache is disabled and enable_permute_local_kv is on.
  3. Add a new kv_caches_postprocess function, called from wait_for_save.
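The postprocess step described above (NHD -> HND permute plus splitting each large prefill block into decode-sized blocks) can be sketched roughly as follows. This is a minimal NumPy illustration under stated assumptions: the function name, the NHD shape (num_blocks, block_size, num_heads, head_dim), and the 64 -> 16 split are illustrative, not the PR's actual kv_caches_postprocess implementation.

```python
import numpy as np

def postprocess_kv_on_save(kv: np.ndarray, src_block_size: int = 64,
                           dst_block_size: int = 16) -> np.ndarray:
    """Hypothetical sketch: NHD (blocks, tokens, heads, dim) -> HND,
    splitting each large prefill block into smaller decode-sized blocks."""
    n_blk, blk, n_heads, head_dim = kv.shape
    assert blk == src_block_size and src_block_size % dst_block_size == 0
    ratio = src_block_size // dst_block_size
    # split each block of src_block_size tokens into `ratio` smaller blocks
    kv = kv.reshape(n_blk * ratio, dst_block_size, n_heads, head_dim)
    # NHD -> HND: swap the token and head axes
    return np.ascontiguousarray(kv.transpose(0, 2, 1, 3))

# 2 prefill blocks of 64 tokens, 4 heads, head_dim 8
src = np.arange(2 * 64 * 4 * 8, dtype=np.float32).reshape(2, 64, 4, 8)
out = postprocess_kv_on_save(src)
print(out.shape)  # (8, 4, 16, 8)
```

With a 64 -> 16 split, each prefill block becomes 4 decode blocks, so 2 blocks become 8, and the head axis moves ahead of the token axis.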

Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces support for heterogeneous KV layouts and block sizes in NIXL, particularly for prefill scenarios. The changes are extensive, touching both the integration test script and the core NIXL connector logic. While the overall direction is good, I've found a few critical issues. The test script has a bug in how it constructs JSON configuration strings, which can lead to test failures. More critically, in the NIXL connector, there's a debugging function with print statements left in the code, and a significant bug in the KV cache post-processing logic where block size changes are not applied in-place to the actual cache tensor. These issues need to be addressed to ensure correctness and stability.


@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.


@xuechendi xuechendi force-pushed the dev/prefill_KV_process branch 7 times, most recently from 82ce6d5 to cf6f09e on December 11, 2025 19:55
Collaborator

orozery commented Dec 12, 2025

@xuechendi @markmc
Beyond verifying the functional correctness of the logic introduced here,
I think it's also worthwhile to check its e2e performance implications.
P/D is all about improving performance, so in theory some P->D flows are better off failing the handshake than proceeding.

@mergify

mergify bot commented Dec 15, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @xuechendi.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Dec 15, 2025
@xuechendi xuechendi force-pushed the dev/prefill_KV_process branch from cf6f09e to 373011b on December 16, 2025 21:24
@mergify mergify bot removed the needs-rebase label Dec 16, 2025
@xuechendi xuechendi force-pushed the dev/prefill_KV_process branch from 373011b to 0399312 on December 16, 2025 21:27
Contributor Author

xuechendi commented Dec 16, 2025

@xuechendi @markmc Other than verifying functional correctness of the logic introduced here, I think it's also worthwhile to also check the e2e performance implications of it. P/D is all about improving performance, so in theory some P->D flows are better to be handshake-failed instead of proceeding.

@orozery, thanks for sharing your thoughts.
This path is for our heterogeneous-architecture proposal combining Intel Gaudi and CUDA; this specific PR covers prefill on Gaudi and decode on CUDA (Gaudi only supports NHD and is more performant with block_size=128).
Our performance team reported that this setup brings performance similar to H200-to-H200, with a much better TCO story.
For H200 to H200, heterogeneous block_size or kv_layout is unnecessary, so we explicitly require enable_permute_local_kv=true to be set in kv_connector_config to enable this feature.

@xuechendi xuechendi force-pushed the dev/prefill_KV_process branch from 0399312 to 6914f52 on December 16, 2025 22:04

mergify bot commented Dec 18, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @xuechendi.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot removed the needs-rebase label Dec 19, 2025
@xuechendi xuechendi force-pushed the dev/prefill_KV_process branch from 851655e to 1260965 on January 13, 2026 19:24
@xuechendi xuechendi force-pushed the dev/prefill_KV_process branch from 1260965 to 76ed03d on January 13, 2026 19:46
if block_size_ratio > 1:
    num_blocks = math.ceil(
        (request.num_tokens - 1) / self.block_size_after_save
    )


Off-by-one error in block count calculation

High Severity

The formula ceil((request.num_tokens - 1) / self.block_size_after_save) undercounts the required number of blocks. The standard vLLM pattern (seen in single_type_kv_cache_manager.py) uses cdiv(num_tokens, block_size) without the -1. For example, with 17 tokens and block_size=16, the correct count is 2 blocks, but this formula yields 1. This causes incomplete KV cache data to be transferred, potentially corrupting the decode process.
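The fix the review describes can be illustrated with the standard ceiling-division pattern; cdiv below is a local stand-in for vLLM's helper, not the library function itself:

```python
import math

def cdiv(a: int, b: int) -> int:
    # ceiling division without floats, the usual pattern for block counts
    return -(a // -b)

num_tokens, block_size = 17, 16
buggy = math.ceil((num_tokens - 1) / block_size)  # drops the 17th token
fixed = cdiv(num_tokens, block_size)              # covers all 17 tokens
print(buggy, fixed)  # 1 2
```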


@xuechendi xuechendi force-pushed the dev/prefill_KV_process branch from 76ed03d to 2a07d6b on January 13, 2026 20:12
@xuechendi xuechendi force-pushed the dev/prefill_KV_process branch 2 times, most recently from 73d5da4 to 0d69f5b on January 13, 2026 21:17
1. update example to support prefill HND and agreed_block_size
2. enable prefill side kv_layout and block_size update

Signed-off-by: Chendi Xue <chendi.xue@intel.com>
@xuechendi xuechendi force-pushed the dev/prefill_KV_process branch from 0d69f5b to eae6e09 on January 13, 2026 22:00
@xuechendi
Contributor Author

@NickLucche, could you help review this PR to support the nP > nD case?

Collaborator

@NickLucche NickLucche left a comment


Thanks for the work @xuechendi!

I currently have 2 main issues with the current implementation:

  • This makes the scheduler KV-cache dependent. In its current form, the scheduler operates only on the "logical" plane of block ids and is agnostic to layout/block_size and other physical configs. This is a separation of concerns which I quite like and would like to keep unless really necessary. One alternative I would propose for the block_size mismatch is to group the scheduler changes you're proposing into a Mixin class.
  • The "AGREED_BLOCK_SIZE" design breaks the current handshake-based workflow: today, 2 peer instances exchange info at runtime and match their config/logic accordingly after discovery. Allowing a top-down agreement through config may end up feeling "custom" or out of the ordinary. Let's discuss what other options we have to maintain the discovery-based paradigm.

Also, this PR will clash with #32204, so I would like to land it after the latter, to address potential conflicts with this feature in this PR rather than in the HMA one.

Comment on lines +259 to +264
assert vllm_config.kv_transfer_config.enable_permute_local_kv
if vllm_config.cache_config.enable_prefix_caching:
logger.warning_once(
"KV cache postprocess is not compatible with prefix caching."
)
return False, current_kv_cache_layout, current_block_size
Collaborator


Shouldn't we check this during handshake validation? I think we should crash somewhere if we're in a hetero block-size scenario with an unsupported config.

Collaborator


same for these other constraints you listed

prefill NHD and nP > nD

"All kv cache tensors must have the same number of blocks"
)

block_size_ratio_on_save = self.block_size // self.block_size_on_save
Collaborator


this is constant and should be computed once, after this loop detects any logical<>kernel block size mismatch

Comment on lines +1481 to +1484
block_len_per_layer_on_save.append(
curr_tensor_size_bytes
// self.num_blocks
// block_size_ratio_on_save
Collaborator


this is basically block_len_per_layer // block_size_ratio_on_save, where the denominator is a constant.
We don't need a separate bookkeeping struct for it; just compute it on the fly from block_len_per_layer

Comment on lines +1486 to +1488
num_blocks_on_save = (
curr_tensor_size_bytes // block_len_per_layer_on_save[-1]
)
Collaborator


ditto, this should not be computed in a for loop.
It can also be simplified:

num_blocks = curr_tensor_size_bytes / N
num_blocks_on_save = curr_tensor_size_bytes / (N / block_size_ratio_on_save)
                   = (curr_tensor_size_bytes / N) * block_size_ratio_on_save
                   = num_blocks * block_size_ratio_on_save
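A quick numeric check of that identity, with made-up sizes (the values are purely illustrative):

```python
# hypothetical sizes purely to verify the algebra in the comment above
curr_tensor_size_bytes = 1 << 20       # 1 MiB per-layer tensor
N = 4096                               # stand-in for block_len_per_layer
block_size_ratio_on_save = 4           # e.g. block_size 64 saved as 16

num_blocks = curr_tensor_size_bytes // N
num_blocks_on_save = curr_tensor_size_bytes // (N // block_size_ratio_on_save)
assert num_blocks_on_save == num_blocks * block_size_ratio_on_save
print(num_blocks, num_blocks_on_save)  # 256 1024
```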

Comment on lines +1535 to +1538
(
self.src_xfer_handles_by_block_size[self.block_size_on_save],
self.src_blocks_data,
) = self.register_local_xfer_handler(self.block_size_on_save)
Collaborator


is this pre-commit or..?

Contributor Author


yes, because I updated self.block_size to self.block_size_on_save

Comment on lines +1108 to +1110
self.postprocess_kv_caches_on_save,
self.kv_cache_layout_on_save,
self.block_size_on_save,
Collaborator


some of this should go into the compat check right?

Comment on lines +267 to +271
agreed_block_size = int(
vllm_config.kv_transfer_config.get_from_extra_config(
"agreed_block_size", current_block_size
)
)
Collaborator


Do I get this right: this is an agreement through config before the handshake? That would actually break the current flow, since all similar logic is discovery-based: configs are matched between instances and evaluated at runtime.

Do we have alternatives to this pre-agreement? It may also end up looking "custom" wrt the current UX @xuechendi
cc @markmc

Member


@xuechendi could we handle this with an additional factor in compute_nixl_compatibility_hash() ?

Contributor Author


Yes, @NickLucche, agreed_block_size breaks the current discovery-based design: the prefill side will use agreed_block_size as its block_size when communicating with decoders.

I tried to think of possible alternatives, but I really can't come up with any. If we go with my previous PR, which lets the decoder side permute for the larger block_size, the decoder ends up allocating temporary memory to hold a block_size=128 buffer when it only has block_size=16, num_blocks=4 buffers.

remote: RemoteMeta | None = None


def if_postprocess_kvcache_on_save(
Collaborator


not a great function name; better to change it to should_postprocess_kvcache_on_save (or a "needs"-style name), or should_transform_kv_for_transfer, or similar

Signed-off-by: Chendi Xue <chendi.xue@intel.com>
yeonsily added a commit to hsubramony/vllm-gaudi that referenced this pull request Jan 21, 2026
enable support for prefill side kv_layout and block_size update
1. update example to support prefill HND and agreed_block_size
2. enable prefill side kv_layout and block_size update

Signed-off-by: Chendi Xue <chendi.xue@intel.com>
Signed-off-by: Yeonsil Yoon <yeon.sil.yoon@intel.com>
@mergify

mergify bot commented Jan 22, 2026

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @xuechendi.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Jan 22, 2026
yeonsily added a commit to yeonsily/vllm-gaudi that referenced this pull request Jan 22, 2026
enable support for prefill side kv_layout and block_size update
1. update example to support prefill HND and agreed_block_size
2. enable prefill side kv_layout and block_size update

Signed-off-by: Chendi Xue <chendi.xue@intel.com>
Signed-off-by: Yeonsil Yoon <yeon.sil.yoon@intel.com>
@mergify mergify bot removed the needs-rebase label Jan 28, 2026
xuechendi pushed a commit to vllm-project/vllm-gaudi that referenced this pull request Jan 28, 2026
1. update example to support prefill HND and agreed_block_size
2. enable prefill side kv_layout and block_size update

Port vllm-project/vllm#30448 to vllm-gaudi

---------

Signed-off-by: Chendi Xue <chendi.xue@intel.com>
Signed-off-by: Yeonsil Yoon <yeon.sil.yoon@intel.com>
testdig pushed a commit to testdig/vllm-gaudi-fork that referenced this pull request Jan 29, 2026
…-project#867)

1. update example to support prefill HND and agreed_block_size
2. enable prefill side kv_layout and block_size update

Port vllm-project/vllm#30448 to vllm-gaudi

---------

Signed-off-by: Chendi Xue <chendi.xue@intel.com>
Signed-off-by: Yeonsil Yoon <yeon.sil.yoon@intel.com>
Signed-off-by: Wang, Zheng W <zheng.w.wang@intel.com>
yiliu30 pushed a commit to yiliu30/vllm-gaudi that referenced this pull request Feb 4, 2026
…-project#867)

1. update example to support prefill HND and agreed_block_size
2. enable prefill side kv_layout and block_size update

Port vllm-project/vllm#30448 to vllm-gaudi

---------

Signed-off-by: Chendi Xue <chendi.xue@intel.com>
Signed-off-by: Yeonsil Yoon <yeon.sil.yoon@intel.com>
slokesha pushed a commit to libinta/vllm-gaudi that referenced this pull request Feb 9, 2026
…-project#867)

1. update example to support prefill HND and agreed_block_size
2. enable prefill side kv_layout and block_size update

Port vllm-project/vllm#30448 to vllm-gaudi

---------

Signed-off-by: Chendi Xue <chendi.xue@intel.com>
Signed-off-by: Yeonsil Yoon <yeon.sil.yoon@intel.com>
Signed-off-by: slokesha <slokeshappa@habana.ai>
