[NIXL] Heterogeneous KV Layout and block_size - prefill NHD and nP > nD support #30448
xuechendi wants to merge 2 commits into vllm-project:main from
Conversation
Code Review
This pull request introduces support for heterogeneous KV layouts and block sizes in NIXL, particularly for prefill scenarios. The changes are extensive, touching both the integration test script and the core NIXL connector logic. While the overall direction is good, I've found a few critical issues. The test script has a bug in how it constructs JSON configuration strings, which can lead to test failures. More critically, in the NIXL connector, there's a debugging function with print statements left in the code, and a significant bug in the KV cache post-processing logic where block size changes are not applied in-place to the actual cache tensor. These issues need to be addressed to ensure correctness and stability.
💡 Codex Review
Here are some automated review suggestions for this pull request.
@xuechendi @markmc

This pull request has merge conflicts that must be resolved before it can be merged.
@orozery, thanks for sharing your thoughts.
```python
if block_size_ratio > 1:
    num_blocks = math.ceil(
        (request.num_tokens - 1) / self.block_size_after_save
    )
```
Off-by-one error in block count calculation
High Severity
The formula ceil((request.num_tokens - 1) / self.block_size_after_save) undercounts the required number of blocks. The standard vLLM pattern (seen in single_type_kv_cache_manager.py) uses cdiv(num_tokens, block_size) without the -1. For example, with 17 tokens and block_size=16, the correct count is 2 blocks, but this formula yields 1. This causes incomplete KV cache data to be transferred, potentially corrupting the decode process.
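The undercount is easy to verify with a quick sketch (standalone reimplementation of the `cdiv` pattern, not the actual vLLM import):

```python
import math

def cdiv(a: int, b: int) -> int:
    # Ceiling division via negated floor division, the standard vLLM pattern.
    return -(-a // b)

num_tokens, block_size = 17, 16
# The formula from the diff drops the last partial block:
buggy = math.ceil((num_tokens - 1) / block_size)
# cdiv(num_tokens, block_size) counts it correctly:
correct = cdiv(num_tokens, block_size)
print(buggy, correct)  # 1 2
```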
1. update example to support prefill HND and agreed_block_size 2. enable prefill side kv_layout and block_size update Signed-off-by: Chendi Xue <chendi.xue@intel.com>
@NickLucche, may you help review this PR to support the nP > nD case?
NickLucche left a comment
Thanks for the work @xuechendi !
I currently have 2 main issues with the current implementation:
- This makes the scheduler KV-cache dependent. In its current form, the scheduler operates only on the "logical" plane of block ids and is agnostic to layout/block_size and other physical configs. This is a separation of concerns which I quite like and would like to keep unless really necessary. One alternative I would propose for the block_size mismatch is to group the scheduler changes you're proposing into a Mixin class.
- The "AGREED_BLOCK_SIZE" design breaks the current workflow, which is handshake based: currently, 2 peer instances exchange info at runtime and match their config/logic accordingly after discovery. Allowing a top-down agreement through config may end up feeling "custom" or unusual. Let's discuss what other options we have to maintain the discovery-based paradigm.
Also, this PR will clash with #32204. Therefore I would like to land it after the latter to address potential conflicts with this feature in this PR rather than on the hma one.
```python
assert vllm_config.kv_transfer_config.enable_permute_local_kv
if vllm_config.cache_config.enable_prefix_caching:
    logger.warning_once(
        "KV cache postprocess is not compatible with prefix caching."
    )
    return False, current_kv_cache_layout, current_block_size
```
shouldn't we check this during handshake validation? I think we should crash somewhere if we're in a hetero block-size scenario with an unsupported config
same for these other constraints you listed
```python
        "All kv cache tensors must have the same number of blocks"
    )

block_size_ratio_on_save = self.block_size // self.block_size_on_save
```
this is constant and should be computed once after this loop detects any logical<>kernel block size mismatch
```python
block_len_per_layer_on_save.append(
    curr_tensor_size_bytes
    // self.num_blocks
    // block_size_ratio_on_save
)
```
this is basically block_len_per_layer // block_size_ratio_on_save, where the denominator is a constant.
We don't need a separate bookkeeping struct for it; just compute it on the fly from block_len_per_layer
```python
num_blocks_on_save = (
    curr_tensor_size_bytes // block_len_per_layer_on_save[-1]
)
```
ditto, this should not be computed in a for loop.
It can also be simplified: with N = block_len_per_layer,

num_blocks = curr_tensor_size_bytes / N
num_blocks_on_save = curr_tensor_size_bytes / (N / block_size_ratio_on_save)
=> num_blocks_on_save = curr_tensor_size_bytes / N * block_size_ratio_on_save
=> num_blocks_on_save = num_blocks * block_size_ratio_on_save
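The algebra above can be sanity-checked numerically; the values below are purely illustrative, with N standing in for the per-layer block length in bytes:

```python
# Illustrative values only.
N = 4096                       # block_len_per_layer (bytes)
block_size_ratio_on_save = 8   # logical block_size / block_size_on_save
curr_tensor_size_bytes = N * 1024  # a layer holding 1024 logical blocks

num_blocks = curr_tensor_size_bytes // N

# Per-iteration computation from the diff:
block_len_on_save = N // block_size_ratio_on_save
num_blocks_on_save_loop = curr_tensor_size_bytes // block_len_on_save

# Simplified closed form, computed once outside the loop:
num_blocks_on_save = num_blocks * block_size_ratio_on_save

assert num_blocks_on_save_loop == num_blocks_on_save
```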
```python
(
    self.src_xfer_handles_by_block_size[self.block_size_on_save],
    self.src_blocks_data,
) = self.register_local_xfer_handler(self.block_size_on_save)
```
is this pre-commit or..?
yes, because I updated self.block_size to self.block_size_on_save
```python
self.postprocess_kv_caches_on_save,
self.kv_cache_layout_on_save,
self.block_size_on_save,
```
some of this should go into the compat check, right?
```python
agreed_block_size = int(
    vllm_config.kv_transfer_config.get_from_extra_config(
        "agreed_block_size", current_block_size
    )
)
```
do I get this right: this is an agreement through config before the handshake, which actually breaks the current flow, as all similar logic is discovery-based: configs are matched between instances and evaluated at runtime.
do we have alternatives to this pre-agreement? It may also end up looking "custom" wrt the current UX @xuechendi
cc @markmc
@xuechendi could we handle this with an additional factor in compute_nixl_compatibility_hash() ?
Yes, @NickLucche, agreed_block_size breaks the current discovery-based design in that the prefill side will use agreed_block_size as its block_size when communicating with decoders.
I tried to search for possible alternatives, but I really can't come up with any. If we go with my previous PR and let the decoder side permute for the larger block_size, the decoder ends up allocating temp memory to hold a block_size=128 buffer when it only has block_size=16, num_blocks=4 buffers.
```python
remote: RemoteMeta | None = None

def if_postprocess_kvcache_on_save(
```
not a great function name; better to change it to should_postprocess_kvcache_on_save, a "needs_" variant, should_transform_kv_for_transfer, or similar
Signed-off-by: Chendi Xue <chendi.xue@intel.com>
enable support for prefill side kv_layout and block_size update 1. update example to support prefill HND and agreed_block_size 2. enable prefill side kv_layout and block_size update Signed-off-by: Chendi Xue <chendi.xue@intel.com> Signed-off-by: Yeonsil Yoon <yeon.sil.yoon@intel.com>
1. update example to support prefill HND and agreed_block_size 2. enable prefill side kv_layout and block_size update Port vllm-project/vllm#30448 to vllm-gaudi --------- Signed-off-by: Chendi Xue <chendi.xue@intel.com> Signed-off-by: Yeonsil Yoon <yeon.sil.yoon@intel.com>
Purpose
Proposed code for #26744.
This PR supports the non-host-buffer path for heterogeneous KV layout and block_size: prefill NHD and nP > nD.
Only supported when prefix caching is disabled.
This PR includes two bug-fix PRs and is pending on them being merged first:
#30420
#30419
Test Plan
1P1D with nP > nD + prefill with NHD
=> Accuracy passed, Qwen3-0.6B = ~0.41
1P2D + prefill with NHD
=> Accuracy passed, Qwen3-0.6B = ~0.41
Design
This PR proposes updating the KV layout and block_size on the prefill side after a request completes its prefill forward pass.
Case 1: the very naive case, where the proposal works properly.
Case 2: with chunked prefill.
Changes proposed in this PR:
- block_size_after_save and kv_layout_after_save => if prefill is using a different setting than decode and enable_permute_local_kv is on, block_size_after_save and kv_layout_after_save will use the agreed block_size and 'HND' respectively.
- Added a kv_caches_postprocess function, called at wait_for_save.
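To make the save-side transform concrete, the sketch below regroups prefill blocks of block_size=16 into an agreed block_size of 128 and permutes each block from NHD (tokens, heads, dim) to HND (heads, tokens, dim). It is illustrative only: nested lists stand in for device tensors, the shapes are tiny, and none of the names come from the actual kv_caches_postprocess implementation:

```python
# Illustrative shapes only; real caches are device tensors.
num_blocks, block_size = 16, 16
agreed_block_size = 128
num_heads, head_dim = 2, 4
ratio = agreed_block_size // block_size  # 8 small blocks per agreed block

# Each (block, token, head, dim) slot holds a recognizable payload
# so the mapping can be traced through both steps.
nhd = [[[[(b, t, h, d) for d in range(head_dim)]
         for h in range(num_heads)]
        for t in range(block_size)]
       for b in range(num_blocks)]

# Step 1: merge `ratio` consecutive small blocks into one agreed-size block
# (concatenating their token lists).
merged = [sum((nhd[b * ratio + i] for i in range(ratio)), [])
          for b in range(num_blocks // ratio)]

# Step 2: permute each merged block from NHD to HND.
hnd = [[[blk[t][h] for t in range(agreed_block_size)]
        for h in range(num_heads)]
       for blk in merged]

# 16 small blocks become 2 agreed-size blocks; global token 17 of the
# first agreed block is token 1 of small block 1.
assert len(hnd) == num_blocks // ratio
assert hnd[0][1][17] == nhd[1][1][1]
```

The same two steps (regroup to the agreed block size, then permute the layout) are what the decode side would otherwise have to undo, which is why doing them once on the prefill side before transfer avoids per-decoder temp buffers.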