[Bugfix] limit cudagraph capture sizes by num_blocks for GDN models #34881
ZJY0516 wants to merge 7 commits into vllm-project:main from
Conversation
Code Review
This pull request addresses an assertion error by improving the validation logic in causal_conv1d_update. The new assertion correctly checks that all conv_state_indices are within the valid range of num_cache_lines. However, I've identified a critical edge case in the new code. When the batch size is zero, conv_state_indices.max() will be called on an empty tensor, causing a RuntimeError. I've provided a suggestion to fix this.
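A minimal sketch of the suggested guard, with plain Python lists standing in for the tensor logic (`check_conv_state_indices` is a hypothetical helper name, not code from the PR): skip the range check entirely when the batch is empty, since taking the max of an empty collection raises at runtime.

```python
def check_conv_state_indices(conv_state_indices, num_cache_lines):
    # Sketch of the suggested fix: max() on an empty sequence raises an
    # error (RuntimeError for an empty torch.Tensor), so guard on
    # emptiness before validating the index range.
    if len(conv_state_indices) == 0:
        return  # nothing to validate for a zero-size batch
    assert max(conv_state_indices) < num_cache_lines, (
        "conv_state_indices out of range of num_cache_lines"
    )
```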
tdoublep left a comment
I don't really understand this change. If the assert is duplicated - why doesn't the first one fail?
This check seems like a correct thing to have imo.
Because

OK, but then we read the batch size from the

Yes, you are right. I'll try to fix it in another way
Hi @ZJY0516, the pre-commit checks have failed. Please run:

```shell
uv pip install pre-commit
pre-commit install
pre-commit run --all-files
```

Then, commit the changes and push to your branch.
```python
has_gdn = any(
    layer.get_attn_backend().get_name() == "GDN_ATTN"
    for layer in attn_layers.values()
)
if not has_gdn:
    return
```
Why is this a specific change to GDN? I think we need to do this for all hybrid models actually.
```python
return

original_sizes = self.compilation_config.cudagraph_capture_sizes or []
filtered_sizes = [s for s in original_sizes if s <= num_blocks]
```
What happens if this is empty?
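One way the empty case could be handled, sketched as a hypothetical fallback (this is an assumption about a possible resolution, not the PR's actual code; `limit_capture_sizes` is an illustrative name):

```python
def limit_capture_sizes(original_sizes, num_blocks):
    # Filter capture sizes to those that fit in the available blocks.
    filtered = [s for s in original_sizes if s <= num_blocks]
    if not filtered:
        # Assumption: if no configured size fits, fall back to a single
        # capture size of num_blocks (or capture nothing when there are
        # no blocks at all) rather than leaving an empty list behind.
        filtered = [num_blocks] if num_blocks > 0 else []
    return filtered
```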
```python
else:
    break

def _maybe_limit_cudagraph_sizes_by_num_blocks(self, num_blocks: int) -> None:
```
Does this need to be in GPU model runner?
Purpose
FIX #34094
Fixes a corner-case bug where vLLM crashes with an AssertionError during CUDA graph capture for GDN (Gated Delta Net) models when the configured cudagraph capture size exceeds the number of available KV cache blocks (num_blocks).
Problem

For GDN models using CUDA graphs, vLLM determines the cudagraph capture sizes at config initialization time based on `max_num_seqs`. However, the actual number of KV cache blocks (`num_blocks`) is determined later, during memory profiling, and can be smaller than the cudagraph capture size due to memory constraints. When this happens, an assertion in `causal_conv1d_update` fails because `num_cache_lines` (which equals `num_blocks`) is smaller than the batch size. The GDN attention backend uses `num_blocks` as the number of cache lines for conv states; when CUDA graph capture creates batches larger than `num_blocks`, the conv state cache doesn't have enough slots, triggering the assertion.
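To make the failure mode concrete, a toy numeric sketch (the values are illustrative only, not taken from the bug report):

```python
# Illustrative numbers for the GDN cudagraph-capture failure mode.
num_blocks = 128          # KV cache blocks found during memory profiling;
                          # also the number of conv-state cache lines
capture_batch_size = 256  # a cudagraph capture size derived from max_num_seqs

# Each sequence in the captured batch needs its own conv-state cache line,
# so this configuration is invalid: the batch outgrows the cache, and
# causal_conv1d_update's range check on conv_state_indices must fail.
assert capture_batch_size > num_blocks  # the bad configuration the PR avoids
```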
Solution

Add `_maybe_limit_cudagraph_sizes_by_num_blocks()` in `GPUModelRunner.initialize_kv_cache()` to:

- Check if the model uses GDN attention (by detecting the `GDN_ATTN` backend)
- Check if `CUDAGraphMode` has FULL mode enabled (only affects `FULL`/`FULL_AND_PIECEWISE` modes)
- If `max_cudagraph_capture_size > num_blocks`, filter `cudagraph_capture_sizes` to only include sizes ≤ `num_blocks` and update `max_cudagraph_capture_size` accordingly

This ensures CUDA graph capture never attempts batch sizes larger than the available cache lines for GDN models.
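The steps above can be sketched as follows. This is a minimal standalone sketch, not the PR's actual implementation: the config class is stubbed with a dataclass, and the GDN/`CUDAGraphMode` checks are assumed to have already passed.

```python
from dataclasses import dataclass, field

@dataclass
class CompilationConfig:
    # Stub mirroring the two fields the PR description mentions.
    cudagraph_capture_sizes: list = field(default_factory=list)
    max_cudagraph_capture_size: int = 0

def maybe_limit_cudagraph_sizes_by_num_blocks(cfg, num_blocks):
    # Assumed precondition: the model uses GDN attention and a FULL
    # cudagraph mode; otherwise this step is skipped entirely.
    if cfg.max_cudagraph_capture_size <= num_blocks:
        return  # every configured capture size already fits
    # Keep only capture sizes that fit within the available blocks,
    # then shrink the maximum to match.
    cfg.cudagraph_capture_sizes = [
        s for s in cfg.cudagraph_capture_sizes if s <= num_blocks
    ]
    if cfg.cudagraph_capture_sizes:
        cfg.max_cudagraph_capture_size = max(cfg.cudagraph_capture_sizes)
```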
Test
cc @tdoublep @ywang96