[Bugfix][Model Runner v2] Fix MRV2 KV cache kernel block sizing.#42872
[Bugfix][Model Runner v2] Fix MRV2 KV cache kernel block sizing.#42872chfeng-cs wants to merge 1 commit into
Conversation
There was a problem hiding this comment.
Code Review
This pull request introduces kernel block expansion for the vLLM V1 worker, allowing the KV cache to be managed with larger logical blocks while being processed by kernels using smaller physical block sizes. It includes changes to BlockTables to handle the mapping between these block types and updates the attention backend initialization to propagate kernel_block_sizes. However, the current implementation has critical issues regarding the alignment of kernel_block_sizes with KV cache groups, as certain specifications are skipped during preparation, leading to incorrect indexing or missing caches. Additionally, there is a potential shape mismatch when storage_block_size differs from the logical block size, which could result in out-of-bounds memory access.
| kernel_block_sizes = prepare_kernel_block_sizes(kv_cache_config, attn_groups) | ||
| for kv_cache_group_id, groups in enumerate(attn_groups): | ||
| kernel_block_size = ( | ||
| kernel_block_sizes[kv_cache_group_id] | ||
| if kv_cache_group_id < len(kernel_block_sizes) | ||
| else None | ||
| ) |
There was a problem hiding this comment.
The indexing of kernel_block_sizes by kv_cache_group_id is incorrect because prepare_kernel_block_sizes (in vllm/v1/worker/utils.py) skips EncoderOnlyAttentionSpec groups. This results in a length mismatch and misaligned mapping between groups and their kernel block sizes. If an encoder-only group exists, subsequent groups will receive the wrong block size or None, causing them to fallback to logical block sizes and defeating the purpose of this fix. prepare_kernel_block_sizes should be updated to return a list of the same length as kv_cache_groups (e.g., by using None for skipped groups).
There was a problem hiding this comment.
prepare_kernel_block_sizes() currently returns a compact list because EncoderOnlyAttentionSpec does not allocate KV cache. In the current KV cache config construction, encoder-only
groups are appended after regular KV cache groups, so the compact list remains aligned for all non-encoder-only groups, and the trailing encoder-only group is skipped by the existing guard.
This PR keeps that existing behavior and focuses on the Qwen3 + FlashInfer + NIXL MRV2 regression. If MRV2 later allows encoder-only groups before regular attention groups, we should revisit this
indexing contract separately.
| if kv_cache_group_id >= len(kernel_block_sizes): | ||
| continue |
There was a problem hiding this comment.
In _reshape_kv_cache, the loop incorrectly assumes that kernel_block_sizes is 1-to-1 with kv_cache_config.kv_cache_groups. Because prepare_kernel_block_sizes skips certain specs, len(kernel_block_sizes) may be less than the number of groups. The continue on line 173 will cause the last groups in the configuration to be skipped entirely, leading to missing KV caches for those layers.
There was a problem hiding this comment.
Same underlying concern as above. Under the current KV cache config construction, the groups skipped by this guard are trailing encoder-only groups, which do not allocate KV cache.
| num_blocks_per_kv_block = ( | ||
| kv_cache_spec.block_size // kernel_block_size | ||
| ) | ||
| kernel_num_blocks = num_blocks * num_blocks_per_kv_block | ||
| if kv_cache_spec.storage_block_size != kv_cache_spec.block_size: | ||
| shape_block_size = kv_cache_spec.storage_block_size | ||
| else: | ||
| shape_block_size = kernel_block_size |
There was a problem hiding this comment.
There is a mismatch between kernel_num_blocks and shape_block_size when storage_block_size is used. kernel_num_blocks is currently calculated based on kernel_block_size, but shape_block_size might be set to storage_block_size. If these two differ (e.g., in MLA where storage_block_size is 1 but kernel_block_size might be larger), the resulting KV cache shape will be incorrect and likely too small, leading to out-of-bounds access. kernel_num_blocks should be calculated using the same block size used for the shape.
| num_blocks_per_kv_block = ( | |
| kv_cache_spec.block_size // kernel_block_size | |
| ) | |
| kernel_num_blocks = num_blocks * num_blocks_per_kv_block | |
| if kv_cache_spec.storage_block_size != kv_cache_spec.block_size: | |
| shape_block_size = kv_cache_spec.storage_block_size | |
| else: | |
| shape_block_size = kernel_block_size | |
| if kv_cache_spec.storage_block_size != kv_cache_spec.block_size: | |
| shape_block_size = kv_cache_spec.storage_block_size | |
| else: | |
| shape_block_size = kernel_block_size | |
| num_blocks_per_kv_block = ( | |
| kv_cache_spec.block_size // shape_block_size | |
| ) | |
| kernel_num_blocks = num_blocks * num_blocks_per_kv_block |
|
Hi @chfeng-cs, the pre-commit checks have failed. Please run: uv pip install pre-commit>=4.5.1
pre-commit install
pre-commit run --all-filesThen, commit the changes and push to your branch. For future commits, Tip Is
|
eb7fe50 to
931f274
Compare
|
Hi @chfeng-cs, the pre-commit checks have failed. Please run: uv pip install pre-commit>=4.5.1
pre-commit install
pre-commit run --all-filesThen, commit the changes and push to your branch. For future commits, Tip Is
|
931f274 to
eb7fe50
Compare
|
Hi @chfeng-cs, the pre-commit checks have failed. Please run: uv pip install pre-commit>=4.5.1
pre-commit install
pre-commit run --all-filesThen, commit the changes and push to your branch. For future commits, Tip Is
|
eb7fe50 to
1f2b399
Compare
|
Hi @chfeng-cs, the pre-commit checks have failed. Please run: uv pip install pre-commit>=4.5.1
pre-commit install
pre-commit run --all-filesThen, commit the changes and push to your branch. For future commits, Tip Is
|
Use backend kernel block sizes when initializing Model Runner V2 attention metadata, KV cache views, and block tables. This keeps FlashInfer's physical block view consistent with NIXL registration when the KV manager block size is larger than the kernel block size. Add a focused regression test for MRV2 block table logical-to-kernel block expansion. Signed-off-by: fengchuanheng <fengchuanheng@sjtu.edu.cn>
1f2b399 to
25cf3a2
Compare
|
Closing in favor of #42766 and #42955. Thanks for the guidance @NickLucche. |
|
Thanks @chfeng-cs |
Purpose
Fix Model Runner V2 KV cache handling when the backend kernel block size differs from the KV manager block size.
For FlashInfer with
--block-size 128, MRV2 was still constructing KV cache/block table state using the logical block size, while NIXL expected the physical/kernel block view. This caused NIXL KVcache registration to fail during startup.
Closes #42846
Test Plan
Test Result
Essential Elements of an Effective PR Description Checklist
supported_models.mdandexamplesfor a new model.