[Bugfix][Model Runner v2] Fix MRV2 KV cache kernel block sizing. #42872

Closed
chfeng-cs wants to merge 1 commit into vllm-project:main from chfeng-cs:fix-mrv2-flashinfer-kernel-block-size

Conversation

@chfeng-cs
Contributor

@chfeng-cs chfeng-cs commented May 17, 2026

Purpose

Fix Model Runner V2 KV cache handling when the backend kernel block size differs from the KV manager block size.

For FlashInfer with --block-size 128, MRV2 was still constructing KV cache/block table state using the logical block size, while NIXL expected the physical/kernel block view. This caused NIXL KV cache registration to fail during startup.
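
To illustrate the logical-versus-kernel block distinction described above, here is a minimal sketch of how one KV manager (logical) block could be expanded into several kernel (physical) blocks. The function name `expand_to_kernel_blocks` and the kernel block size of 64 are assumptions chosen for illustration; they are not taken from this PR's diff or from the vLLM API.

```python
def expand_to_kernel_blocks(
    logical_block_ids: list[int],
    logical_block_size: int,
    kernel_block_size: int,
) -> list[int]:
    """Map each logical block ID to the kernel block IDs it covers.

    Hypothetical helper: assumes each logical block is split into
    contiguous kernel blocks of equal size.
    """
    if logical_block_size % kernel_block_size != 0:
        raise ValueError("logical block size must be a multiple of kernel block size")
    ratio = logical_block_size // kernel_block_size
    # Logical block b covers kernel blocks [b * ratio, (b + 1) * ratio).
    return [b * ratio + offset for b in logical_block_ids for offset in range(ratio)]


# With --block-size 128 and an assumed kernel block size of 64,
# each logical block becomes two kernel blocks.
kernel_ids = expand_to_kernel_blocks([0, 3], logical_block_size=128, kernel_block_size=64)
print(kernel_ids)  # [0, 1, 6, 7]
```

If the block table stores logical IDs while another component indexes the cache in kernel-sized blocks, the two views disagree about how many blocks exist and where each one starts, which is the kind of inconsistency this PR describes.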

Closes #42846

Test Plan

  • Added a focused MRV2 block table regression test.
  • Verified the Qwen3 + FlashInfer + NIXL single-GPU startup repro.

Test Result

> pytest tests/v1/worker/test_gpu_model_runner_v2.py -q
.                                                                        [100%]
1 passed in 0.71s

Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.

@mergify mergify Bot added the v1 and bug (Something isn't working) labels May 17, 2026
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request introduces kernel block expansion for the vLLM V1 worker, allowing the KV cache to be managed with larger logical blocks while being processed by kernels using smaller physical block sizes. It includes changes to BlockTables to handle the mapping between these block types and updates the attention backend initialization to propagate kernel_block_sizes. However, the current implementation has critical issues regarding the alignment of kernel_block_sizes with KV cache groups, as certain specifications are skipped during preparation, leading to incorrect indexing or missing caches. Additionally, there is a potential shape mismatch when storage_block_size differs from the logical block size, which could result in out-of-bounds memory access.

Comment on lines +101 to +107
kernel_block_sizes = prepare_kernel_block_sizes(kv_cache_config, attn_groups)
for kv_cache_group_id, groups in enumerate(attn_groups):
    kernel_block_size = (
        kernel_block_sizes[kv_cache_group_id]
        if kv_cache_group_id < len(kernel_block_sizes)
        else None
    )
Contributor

high

The indexing of kernel_block_sizes by kv_cache_group_id is incorrect because prepare_kernel_block_sizes (in vllm/v1/worker/utils.py) skips EncoderOnlyAttentionSpec groups. This results in a length mismatch and a misaligned mapping between groups and their kernel block sizes. If an encoder-only group exists, subsequent groups will receive the wrong block size or None, causing them to fall back to logical block sizes and defeating the purpose of this fix. prepare_kernel_block_sizes should be updated to return a list of the same length as kv_cache_groups (e.g., by using None for skipped groups).

Contributor Author

prepare_kernel_block_sizes() currently returns a compact list because EncoderOnlyAttentionSpec does not allocate KV cache. In the current KV cache config construction, encoder-only groups are appended after regular KV cache groups, so the compact list remains aligned for all non-encoder-only groups, and the trailing encoder-only group is skipped by the existing guard.

This PR keeps that existing behavior and focuses on the Qwen3 + FlashInfer + NIXL MRV2 regression. If MRV2 later allows encoder-only groups before regular attention groups, we should revisit this indexing contract separately.
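
The ordering assumption discussed in this thread can be sketched as follows. This is an illustrative model only: the `Group` class, `compact_sizes`, `aligned_sizes`, and `lookup` are hypothetical stand-ins, not the actual `prepare_kernel_block_sizes` implementation, and the block sizes are made up.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Group:
    """Hypothetical stand-in for a KV cache group."""
    name: str
    kernel_block_size: int
    encoder_only: bool = False


def compact_sizes(groups: list[Group]) -> list[int]:
    # Compact list: encoder-only groups contribute no entry.
    return [g.kernel_block_size for g in groups if not g.encoder_only]


def aligned_sizes(groups: list[Group]) -> list[Optional[int]]:
    # Same length as the group list: skipped groups are marked with None.
    return [None if g.encoder_only else g.kernel_block_size for g in groups]


def lookup(sizes: list, group_id: int):
    # Mirrors the guard in the reviewed snippet: out-of-range IDs yield None.
    return sizes[group_id] if group_id < len(sizes) else None


trailing = [Group("attn_a", 16), Group("attn_b", 32), Group("enc", 0, encoder_only=True)]
leading = [Group("enc", 0, encoder_only=True), Group("attn_a", 16), Group("attn_b", 32)]

# Encoder-only group last: indexing the compact list by group ID stays aligned.
print([lookup(compact_sizes(trailing), i) for i in range(3)])  # [16, 32, None]

# Encoder-only group first: every regular group reads the wrong entry.
print([lookup(compact_sizes(leading), i) for i in range(3)])   # [16, 32, None]

# A same-length list stays aligned regardless of ordering.
print(aligned_sizes(leading))                                  # [None, 16, 32]
```

In the `leading` case the compact lookup hands 16 to the encoder-only group, 32 to `attn_a`, and None to `attn_b`, which is the misalignment the reviewer describes; the author's reply relies on that ordering not occurring in the current config construction.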

Comment on lines +172 to +173
if kv_cache_group_id >= len(kernel_block_sizes):
    continue
Contributor

high

In _reshape_kv_cache, the loop incorrectly assumes that kernel_block_sizes is 1-to-1 with kv_cache_config.kv_cache_groups. Because prepare_kernel_block_sizes skips certain specs, len(kernel_block_sizes) may be less than the number of groups. The continue on line 173 will cause the last groups in the configuration to be skipped entirely, leading to missing KV caches for those layers.

Contributor Author

Same underlying concern as above. Under the current KV cache config construction, the groups skipped by this guard are trailing encoder-only groups, which do not allocate KV cache.

Comment thread vllm/v1/worker/gpu/attn_utils.py Outdated
Comment on lines +187 to +194
num_blocks_per_kv_block = (
    kv_cache_spec.block_size // kernel_block_size
)
kernel_num_blocks = num_blocks * num_blocks_per_kv_block
if kv_cache_spec.storage_block_size != kv_cache_spec.block_size:
    shape_block_size = kv_cache_spec.storage_block_size
else:
    shape_block_size = kernel_block_size
Contributor

high

There is a mismatch between kernel_num_blocks and shape_block_size when storage_block_size is used. kernel_num_blocks is currently calculated based on kernel_block_size, but shape_block_size might be set to storage_block_size. If these two differ (e.g., in MLA where storage_block_size is 1 but kernel_block_size might be larger), the resulting KV cache shape will be incorrect and likely too small, leading to out-of-bounds access. kernel_num_blocks should be calculated using the same block size used for the shape.

Suggested change
- num_blocks_per_kv_block = (
-     kv_cache_spec.block_size // kernel_block_size
- )
- kernel_num_blocks = num_blocks * num_blocks_per_kv_block
- if kv_cache_spec.storage_block_size != kv_cache_spec.block_size:
-     shape_block_size = kv_cache_spec.storage_block_size
- else:
-     shape_block_size = kernel_block_size
+ if kv_cache_spec.storage_block_size != kv_cache_spec.block_size:
+     shape_block_size = kv_cache_spec.storage_block_size
+ else:
+     shape_block_size = kernel_block_size
+ num_blocks_per_kv_block = (
+     kv_cache_spec.block_size // shape_block_size
+ )
+ kernel_num_blocks = num_blocks * num_blocks_per_kv_block
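
The arithmetic behind this review comment can be checked with a small standalone sketch. The two functions below restate the original and suggested orderings using plain integers; the function names, the example sizes (10 blocks, block size 128, kernel block size 64, storage block size 32), and the assumption that capacity equals block count times per-block size are all illustrative, not taken from vLLM.

```python
def original_order(num_blocks, block_size, kernel_block_size, storage_block_size):
    # Block count is derived from kernel_block_size...
    ratio = block_size // kernel_block_size
    kernel_num_blocks = num_blocks * ratio
    # ...but the per-block shape may come from storage_block_size instead.
    if storage_block_size != block_size:
        shape_block_size = storage_block_size
    else:
        shape_block_size = kernel_block_size
    return kernel_num_blocks, shape_block_size


def suggested_order(num_blocks, block_size, kernel_block_size, storage_block_size):
    # Pick the per-block shape first, then derive the count from the same value.
    if storage_block_size != block_size:
        shape_block_size = storage_block_size
    else:
        shape_block_size = kernel_block_size
    ratio = block_size // shape_block_size
    return num_blocks * ratio, shape_block_size


def capacity(result):
    # Assumed capacity model: number of blocks times slots per block.
    count, size = result
    return count * size


expected = 10 * 128  # 1280 slots for 10 logical blocks of size 128

# storage_block_size differs from block_size: the original ordering falls short.
print(capacity(original_order(10, 128, 64, 32)))   # 640
print(capacity(suggested_order(10, 128, 64, 32)))  # 1280

# storage_block_size equals block_size: both orderings agree.
print(capacity(original_order(10, 128, 64, 128)))  # 1280
```

Under these assumed numbers the original ordering yields half the expected capacity, which matches the reviewer's concern that the resulting shape could be too small.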

@mergify
Contributor

mergify Bot commented May 17, 2026

Hi @chfeng-cs, the pre-commit checks have failed. Please run:

uv pip install "pre-commit>=4.5.1"
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.

Tip

Is mypy failing?
mypy is run differently in CI. If the failure is related to this check, please use the following command to run it locally:
# For mypy (substitute "3.10" with the failing version if needed)
pre-commit run --hook-stage manual mypy-3.10

@chfeng-cs chfeng-cs force-pushed the fix-mrv2-flashinfer-kernel-block-size branch from eb7fe50 to 931f274 Compare May 17, 2026 13:42
@chfeng-cs chfeng-cs force-pushed the fix-mrv2-flashinfer-kernel-block-size branch from 931f274 to eb7fe50 Compare May 17, 2026 15:50
@chfeng-cs chfeng-cs force-pushed the fix-mrv2-flashinfer-kernel-block-size branch from eb7fe50 to 1f2b399 Compare May 18, 2026 05:16
@chfeng-cs chfeng-cs changed the title from "[Bugfix][CI] Fix MRV2 KV cache kernel block sizing." to "[Bugfix][[Model Runner v2]] Fix MRV2 KV cache kernel block sizing." May 18, 2026
@chfeng-cs chfeng-cs changed the title from "[Bugfix][[Model Runner v2]] Fix MRV2 KV cache kernel block sizing." to "[Bugfix][Model Runner v2] Fix MRV2 KV cache kernel block sizing." May 18, 2026
Use backend kernel block sizes when initializing Model Runner V2
attention metadata, KV cache views, and block tables. This keeps
FlashInfer's physical block view consistent with NIXL registration
when the KV manager block size is larger than the kernel block size.

Add a focused regression test for MRV2 block table logical-to-kernel
block expansion.

Signed-off-by: fengchuanheng <fengchuanheng@sjtu.edu.cn>
@chfeng-cs chfeng-cs force-pushed the fix-mrv2-flashinfer-kernel-block-size branch from 1f2b399 to 25cf3a2 Compare May 18, 2026 05:50
@chfeng-cs
Contributor Author

Closing in favor of #42766 and #42955. Thanks for the guidance @NickLucche.

@chfeng-cs chfeng-cs closed this May 18, 2026
@njhill
Member

njhill commented May 19, 2026

Thanks @chfeng-cs

@njhill njhill added the v2 label May 20, 2026

Labels

bug (Something isn't working), v1, v2

Development

Successfully merging this pull request may close these issues.

[Bug][CI] NIXL + FlashInfer fails with Qwen3 MRV2 and --block-size 128

2 participants