[Bugfix] limit cudagraph capture sizes by num_blocks for GDN models#34881

Closed
ZJY0516 wants to merge 7 commits into vllm-project:main from ZJY0516:fix-qwen-35-dp
Conversation

Member

@ZJY0516 ZJY0516 commented Feb 19, 2026

Purpose

FIX #34094
Fixes a corner case bug where vLLM crashes with an AssertionError during CUDA graph capture for GDN (Gated Delta Net) models when the configured cudagraph capture size exceeds the available KV cache blocks (num_blocks).

Problem

For GDN models using CUDA graphs, vLLM determines the cudagraph capture sizes at config initialization time based on max_num_seqs. However, the actual number of KV cache blocks (num_blocks) is determined later during memory profiling and can be smaller than the cudagraph capture size due to memory constraints.

When this happens, during causal_conv1d_update, the following assertion fails because num_cache_lines (which equals num_blocks) is smaller than the batch size:

assert num_cache_lines >= batch  # AssertionError!

The GDN attention backend uses num_blocks as cache lines for conv states. When CUDA graph capture creates batches larger than num_blocks, the conv state cache doesn't have enough slots, triggering the assertion.
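The failing invariant can be illustrated with a minimal, self-contained sketch (names and numbers are illustrative; this is not vLLM's actual kernel code):

```python
# Minimal sketch of the failing invariant (illustrative names, not vLLM's code).
# The conv state cache has one "cache line" per KV cache block, so its outer
# dimension equals num_blocks. During full CUDA graph capture, the batch size
# is a cudagraph capture size, which can exceed num_blocks.

def check_conv_state_capacity(num_cache_lines: int, batch: int) -> None:
    # Mirrors `assert num_cache_lines >= batch` in causal_conv1d_update.
    assert num_cache_lines >= batch, (
        f"conv state cache has {num_cache_lines} lines but the captured "
        f"batch needs {batch}"
    )

# num_blocks from memory profiling (e.g. 348) vs. capture sizes 336 and 512:
check_conv_state_capacity(348, 336)      # fits: no error
try:
    check_conv_state_capacity(348, 512)  # the pre-fix crash
except AssertionError:
    pass
```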

Solution

Add _maybe_limit_cudagraph_sizes_by_num_blocks() in GPUModelRunner.initialize_kv_cache() to:

Check if the model uses GDN attention (by detecting GDN_ATTN backend)
Check if CUDAGraphMode has FULL mode enabled (only affects FULL/FULL_AND_PIECEWISE modes)
If max_cudagraph_capture_size > num_blocks, filter cudagraph_capture_sizes to only include sizes ≤ num_blocks and update max_cudagraph_capture_size accordingly
This ensures CUDA graph capture never attempts batch sizes larger than available cache lines for GDN models.

Test

vllm serve Qwen/Qwen3.5-35B-A3B --port 8001 --gpu-memory-utilization 0.55

Log:

(EngineCore_DP0 pid=808156) WARNING 02-27 16:51:15 [gpu_model_runner.py:6032] Limiting max_cudagraph_capture_size from 512 to 336 due to num_blocks=348 constraint for GDN model
lm_eval \
    --model local-completions \
    --model_args "model=Qwen/Qwen3.5-35B-A3B,base_url=http://0.0.0.0:8001/v1/completions,tokenized_requests=False,tokenizer_backend=None,num_concurrent=256,timeout=5000,max_length=4096" \
    --tasks gsm8k \
    --num_fewshot 5
Tasks  Version  Filter            n-shot  Metric       Value   Stderr
gsm8k  3        flexible-extract  5       exact_match  0.8658  ±0.0094
                strict-match      5       exact_match  0.8484  ±0.0099

cc @tdoublep @ywang96


Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

Signed-off-by: zjy0516 <riverclouds.zhu@qq.com>
@ZJY0516 ZJY0516 requested a review from tdoublep as a code owner February 19, 2026 07:46
@mergify mergify bot added qwen Related to Qwen models bug Something isn't working labels Feb 19, 2026
Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request addresses an assertion error by improving the validation logic in causal_conv1d_update. The new assertion correctly checks that all conv_state_indices are within the valid range of num_cache_lines. However, I've identified a critical edge case in the new code. When the batch size is zero, conv_state_indices.max() will be called on an empty tensor, causing a RuntimeError. I've provided a suggestion to fix this.
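The zero-batch edge case the bot describes can be sketched in plain Python (a list stands in for the index tensor, and `validate_conv_state_indices` is a hypothetical name, not vLLM's code):

```python
# Sketch of the edge case flagged above: taking the max of an empty index set
# raises, so the range check must be guarded when the batch is empty.
# (Plain-Python stand-in for the tensor logic; names are illustrative.)

def validate_conv_state_indices(indices: list[int], num_cache_lines: int) -> None:
    if not indices:
        # batch == 0: tensor.max() on an empty tensor would raise at runtime,
        # so skip validation entirely for empty batches.
        return
    assert max(indices) < num_cache_lines, "conv state index out of range"

validate_conv_state_indices([0, 5, 7], num_cache_lines=8)  # passes
validate_conv_state_indices([], num_cache_lines=8)         # guarded, no error
```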

Signed-off-by: zjy0516 <riverclouds.zhu@qq.com>
Member

@tdoublep tdoublep left a comment


I don't really understand this change. If the assert is duplicated - why doesn't the first one fail?

This check seems like a correct thing to have imo.

@ZJY0516
Member Author

ZJY0516 commented Feb 19, 2026

I don't really understand this change. If the assert is duplicated - why doesn't the first one fail?

This check seems like a correct thing to have imo.

Because conv_state_indices is not None in this case

@tdoublep
Member

OK, but then we read the batch size from the conv_state_indices, and we want to verify that the number of possible blocks (e.g., the outer dimension of conv_state) is greater than or equal to the batch size. Isn't this a reasonable check? Is this something to do with the interaction of DP?

@ZJY0516
Member Author

ZJY0516 commented Feb 19, 2026

OK, but then we read the batch size from the conv_state_indices, and we want to verify that the number of possible blocks (e.g., the outer dimension of conv_state) is greater than or equal to the batch size. Isn't this a reasonable check? Is this something to do with the interaction of DP?

Yes, you are right. I'll try to fix it in another way

@ZJY0516 ZJY0516 marked this pull request as draft February 19, 2026 13:15
Signed-off-by: zjy0516 <riverclouds.zhu@qq.com>
@ywang96 ywang96 added this to Qwen3.5 Feb 23, 2026
@ywang96 ywang96 moved this to Investigating in Qwen3.5 Feb 23, 2026
Signed-off-by: zjy0516 <riverclouds.zhu@qq.com>
@mergify mergify bot added the v1 label Feb 27, 2026
@ZJY0516 ZJY0516 changed the title [Bugfix] fix qwen 3.5 dp+ep assertion error [Bugfix] limit cudagraph capture sizes by num_blocks for GDN models Feb 27, 2026
@ZJY0516 ZJY0516 marked this pull request as ready for review February 27, 2026 08:30
@mergify mergify bot added the nvidia label Feb 27, 2026
@mergify
Contributor

mergify bot commented Feb 27, 2026

Hi @ZJY0516, the pre-commit checks have failed. Please run:

uv pip install pre-commit
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.

Tip

Is mypy or markdownlint failing?
mypy and markdownlint are run differently in CI. If the failure is related to either of these checks, please use the following commands to run them locally:
# For mypy (substitute "3.10" with the failing version if needed)
pre-commit run --hook-stage manual mypy-3.10
# For markdownlint
pre-commit run --hook-stage manual markdownlint

Signed-off-by: zjy0516 <riverclouds.zhu@qq.com>
Member

@tdoublep tdoublep left a comment


PTAL at this other PR that seems to implement a similar solution:
#34571

Comment on lines +6021 to +6026
has_gdn = any(
    layer.get_attn_backend().get_name() == "GDN_ATTN"
    for layer in attn_layers.values()
)
if not has_gdn:
    return
Member


Why is this a specific change to GDN? I think we need to do this for all hybrid models actually.

return

original_sizes = self.compilation_config.cudagraph_capture_sizes or []
filtered_sizes = [s for s in original_sizes if s <= num_blocks]
Member


What happens if this is empty?

else:
    break

def _maybe_limit_cudagraph_sizes_by_num_blocks(self, num_blocks: int) -> None:
Member


Does this need to be in GPU model runner?


Labels

bug Something isn't working nvidia qwen Related to Qwen models v1

Projects

Status: Done

Development

Successfully merging this pull request may close these issues.

[Bug]: assert num_cache_lines >= batch

3 participants