[Model Runner v2] fix pd accuracy#42888
Conversation
There was a problem hiding this comment.
Code Review
This pull request updates the KV cache registration logic in the NIXL worker to correctly validate tensor dimensions when the V2 model runner is enabled by using logical block counts. A critical issue was identified where related variables still use physical units for stride and block length calculations, which could lead to incorrect memory addressing and potential out-of-bounds access.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: d8e8ac794c
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
|
I filed the original issue #42846 and have a PR #42872 open that takes a different approach — worth comparing the two. Your fix skips the #42872 instead fixes the reshape path directly: If I've mischaracterized your approach, happy to be corrected. Happy to discuss which approach is preferable — or whether parts of both are worth combining. |
I believe this matches V2's intended design — keep the KV cache at the logical shape (rather than reshape to kernel grain like V1 does) and defer any kernel-grain conversion to attention metadata construction. The PR also passes the accuracy tests in CI |
|
Thanks for looking into this. I think the key question is whether MRV2 actually has a later logical->kernel block conversion for the dense FlashInfer path. From what I can see, the FlashInfer metadata builder only uses the block size from its kv_cache_spec:
So unless MRV2 passes kernel_block_size into the metadata builder and reshapes the KV cache accordingly, the dense FlashInfer path still operates as if page_size is 128. I do not see a separate metadata-time conversion that turns logical block ids into kernel block ids for this path. That is why I am hesitant about fixing this in the NIXL worker. Forcing _physical_blocks_per_logical_kv_block = 1 makes NIXL accept the current logical KV cache shape, but it does not address the underlying mismatch between:
In other words, NIXL is probably just the first component that asserts on the mismatch. The more general fix should be in MRV2 KV cache initialization / block table construction, matching the MRV1 behavior around prepare_kernel_block_sizes(). If there is a specific MRV2 metadata path that performs this logical->kernel conversion for dense FlashInfer, could you point me to it? Thanks. |
NickLucche
left a comment
There was a problem hiding this comment.
not sure it's working as is for all cases, especially with hybrid ssm.
I was hoping we could clarify the necessity for this change in MRv2, given it's breaking interface for all connectors, not just nixl.
I am reverting it here #42766
Purpose
FIX #42846
Related error: https://buildkite.com/vllm/ci/builds/66617/canvas?jid=019e38db-bb6b-4ab4-8378-fa658fc52ac8&tab=output
_sync_block_size_with_kerneldoublesself.num_blocksand halvesself.block_sizeso NIXL descriptors walk the cache at the kernel block_size. That matches V1's KV cache layout (V1 reshapes the cache to kernel granularity ingpu_model_runner._reshape_kv_cache_tensors), but not V2's — V2 (gpu/attn_utils._reshape_kv_cache) keeps the tensor at logical block_size.Fix
Skip the
num_blocks *= ratio/block_size //= ratiostep under V2 and pin_physical_blocks_per_logical_kv_block = 1so the rest of the NIXL worker (byte accounting inregister_kv_caches, descriptor strides in_build_fa_local/_build_fa_remote,_logical_to_kernel_block_ids) stays 1:1 with V2's logical-grain cache layout. V1 path is unchanged.Test Plan
Test Result
Essential Elements of an Effective PR Description Checklist
supported_models.mdandexamplesfor a new model.