[Bugfix] Fix DeepseekV32 AssertionError: num_kv_heads == 1#33086
chaunceyjiang wants to merge 2 commits into vllm-project:main
Conversation
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com>
Code Review
This pull request addresses a bug by removing a restrictive assertion related to num_kv_heads in the DeepseekV32IndexerBackend. While this resolves the immediate AssertionError, it's crucial to ensure that the backend's get_kv_cache_shape and underlying attention kernels correctly handle scenarios where num_kv_heads is greater than 1. The PR description could benefit from more details regarding the specific DeepseekV3.2 configurations that triggered the error and how the backend is expected to behave with varying num_kv_heads values. Additionally, filling out the 'Purpose', 'Test Plan', and 'Test Result' sections in the PR body would greatly enhance clarity and reviewability.
```python
    head_size: int,
    cache_dtype_str: str = "auto",
) -> tuple[int, ...]:
    assert num_kv_heads == 1
```
vllm/vllm/distributed/kv_transfer/kv_connector/utils.py
Lines 322 to 326 in 64e3d67
/cc @NickLucche
I’m not sure whether this fix is correct. PTAL.
@chaunceyjiang I think the issue is in how we're using the function, let me look into it
Hi @NickLucche, for a change like #33090, I’m inclined to remove this assert. This would make the use of get_kv_cache_shape more flexible.
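For context, here is a minimal sketch of what the shape helper under discussion could look like once the assertion is removed. The function name matches the snippet above, but the parameter set and the returned shape layout are illustrative assumptions, not vLLM's actual implementation:

```python
# Hypothetical sketch: a kv-cache shape helper that no longer assumes
# a single KV head. Names and shape layout are illustrative only.
def get_kv_cache_shape(
    num_blocks: int,
    block_size: int,
    num_kv_heads: int,
    head_size: int,
    cache_dtype_str: str = "auto",
) -> tuple[int, ...]:
    # Previously: `assert num_kv_heads == 1`. Dropping it lets callers
    # (e.g. KV-transfer utilities) pass num_kv_heads > 1, at the cost of
    # relying on the underlying kernels to handle that case correctly.
    return (num_blocks, block_size, num_kv_heads, head_size)
```

The trade-off raised in the review still applies: without the assert, any caller that passes `num_kv_heads > 1` must be sure the backend's kernels actually support it.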
The same applies to FlashMLASparseBackend.
Purpose
Fix #33074
Introduced in #30207
vllm/vllm/distributed/kv_transfer/kv_connector/utils.py
Lines 322 to 326 in 64e3d67
Test Plan
Test Result
Essential Elements of an Effective PR Description Checklist
supported_models.md and examples for a new model.