
[Bugfix] Fix FlashInfer GDN warmup ValueError on SM90 GPUs#36876

Merged

tdoublep merged 5 commits into vllm-project:main from tdoublep:fix-gdn-warmup-flashinfer-unpack on Mar 13, 2026

Conversation

@tdoublep
Member

@tdoublep tdoublep commented Mar 12, 2026

Summary

  • PR [Bugfix] Warm up Triton autotuner for GDN layers during V1 profiling #36599 added Triton autotuner warmup for GDN layers during V1 profiling, which also exercises the FlashInfer path on SM90 GPUs
  • FlashInfer's chunk_gated_delta_rule returns a single tensor when output_final_state=False, but fi_chunk_gated_delta_rule always unpacked two values, causing a ValueError
  • Fix: handle the return value based on output_final_state — unpack the tuple when True, use the single tensor when False
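A minimal sketch of the fix, assuming (per the summary above) that FlashInfer's return shape depends on output_final_state: a single output when False, an (output, final_state) tuple when True. The backend stub and its placeholder arithmetic are illustrative only; the function names follow the traceback below, not the real FlashInfer API.

```python
# Hypothetical stand-in for FlashInfer's chunk_gated_delta_rule: returns a
# bare output when output_final_state=False, and an (output, final_state)
# tuple when output_final_state=True. Placeholder math, not the real kernel.
def chunk_gated_delta_rule_fi(q, k, v, output_final_state=False):
    output = [qi + ki + vi for qi, ki, vi in zip(q, k, v)]
    if output_final_state:
        return output, sum(output)
    return output

# Wrapper mirroring the fix: only unpack two values when a final state was
# actually requested; otherwise treat the result as the single output.
def fi_chunk_gated_delta_rule(q, k, v, output_final_state=False):
    result = chunk_gated_delta_rule_fi(q, k, v, output_final_state=output_final_state)
    if output_final_state:
        output, final_state = result
    else:
        output, final_state = result, None
    return output, final_state
```

With this shape handling, the warmup path (which calls with output_final_state=False) no longer attempts a two-value unpack on a single tensor.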

Error before fix

WARNING 03-12 10:34:06 [qwen3_next.py:724] GDN prefill kernel warmup (T=16) failed for layer model.layers.0.linear_attn. First inference may OOM due to autotuner.
WARNING 03-12 10:34:06 [qwen3_next.py:724] Traceback (most recent call last):
WARNING 03-12 10:34:06 [qwen3_next.py:724]   File "/workspace/vllm/vllm/model_executor/models/qwen3_next.py", line 712, in _warmup_prefill_kernels
WARNING 03-12 10:34:06 [qwen3_next.py:724]     self.chunk_gated_delta_rule(
WARNING 03-12 10:34:06 [qwen3_next.py:724]   File "/workspace/vllm/vllm/model_executor/models/qwen3_next.py", line 176, in forward_cuda
WARNING 03-12 10:34:06 [qwen3_next.py:724]     return fi_chunk_gated_delta_rule(
WARNING 03-12 10:34:06 [qwen3_next.py:724]   File "/workspace/vllm/vllm/model_executor/models/qwen3_next.py", line 138, in fi_chunk_gated_delta_rule
WARNING 03-12 10:34:06 [qwen3_next.py:724]     output, final_state = chunk_gated_delta_rule_fi(
WARNING 03-12 10:34:06 [qwen3_next.py:724] ValueError: too many values to unpack (expected 2)

This error repeats for T=16, T=32, and T=64 for each GDN layer.
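The ValueError itself is ordinary Python tuple-unpacking behavior: iterating a tensor yields slices along its first dimension, so unpacking a single returned tensor with more than two rows into two names fails with exactly this message. A minimal list-based reproduction (the list stands in for the returned tensor):

```python
# The list stands in for the single tensor FlashInfer returns when
# output_final_state=False; the two-name unpack matches the pre-fix
# wrapper and fails once the value has more than two elements.
result = [0.0, 0.0, 0.0]
err = None
try:
    output, final_state = result
except ValueError as e:
    err = str(e)
print(err)  # too many values to unpack (expected 2)
```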

Test plan

  • Ran tests/v1/e2e/test_mamba_prefix_cache.py::test_mamba_prefix_cache on H100 (SM90) with all caches cleared — passes without the ValueError warning

🤖 Generated with Claude Code

@mergify mergify bot added the qwen (Related to Qwen models) and bug (Something isn't working) labels Mar 12, 2026
@tdoublep tdoublep requested a review from ywang96 March 12, 2026 12:10
@tdoublep tdoublep marked this pull request as ready for review March 12, 2026 12:11
@tdoublep tdoublep requested a review from sighingnow as a code owner March 12, 2026 12:11
@ZJY0516
Member

ZJY0516 commented Mar 12, 2026

Sorry about this. I think we still need the warmup, because the FlashInfer kernel is a JIT kernel.

@tdoublep
Member Author

@ZJY0516 OK, then we just need to fix the return types. Let me update the PR.

@tdoublep tdoublep force-pushed the fix-gdn-warmup-flashinfer-unpack branch from 99d804b to 927f39c Compare March 12, 2026 12:17
@tdoublep tdoublep changed the title from "[Bugfix] Skip GDN Triton warmup on SM90 GPUs using FlashInfer" to "[Bugfix] Fix FlashInfer GDN warmup ValueError on SM90 GPUs" Mar 12, 2026
Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request correctly addresses a crash during the Gated Delta Net (GDN) layer warmup on SM90 GPUs. The fix involves skipping the warmup, which is intended for Triton autotuning, on SM90 architectures as they use the FlashInfer backend and do not require this step. The change is implemented by adding a conditional check for CUDA and SM90 device capability. While the fix is correct, I've added a comment regarding code duplication that could be addressed to improve long-term maintainability.


vllm/model_executor/models/qwen3_next.py (682-683)

high

While this check correctly fixes the issue, it duplicates the logic from ChunkGatedDeltaRule.__init__ which is used to determine whether to use the FlashInfer backend. This could lead to future maintenance issues if the backend selection logic changes but this check is not updated in tandem.

To improve maintainability and avoid this duplication, consider centralizing the backend choice logic. For example, you could add a property to the ChunkGatedDeltaRule class to indicate which backend is in use:

# In ChunkGatedDeltaRule
@property
def uses_flashinfer(self) -> bool:
    return self._forward_method == self.forward_cuda

Then, you could use this property here to make the decision, ensuring the warmup logic always stays in sync with the actual backend being used:

# In _warmup_prefill_kernels
if self.chunk_gated_delta_rule.uses_flashinfer:
    return

This would make the code more robust to future changes.

FlashInfer's chunk_gated_delta_rule returns a single tensor when
output_final_state=False, but the wrapper always unpacked two values.
This caused a ValueError during GDN kernel warmup (added in vllm-project#36599)
on SM90 GPUs (H100/H200).

Handle the return value based on output_final_state: unpack the tuple
when True, use the single tensor when False.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com>
@tdoublep tdoublep force-pushed the fix-gdn-warmup-flashinfer-unpack branch from 927f39c to 5fea689 Compare March 12, 2026 12:23
@tdoublep
Member Author

@ZJY0516 Please take another look. We now fix the error more explicitly, and still allow the warmup phase to happen when using FI kernels.

Member

@ZJY0516 ZJY0516 left a comment


Thanks for fixing this

@mergify

mergify bot commented Mar 12, 2026

Hi @tdoublep, the pre-commit checks have failed. Please run:

uv pip install pre-commit
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.

Tip

Is mypy failing?
mypy is run differently in CI. If the failure is related to this check, please use the following command to run it locally:
# For mypy (substitute "3.10" with the failing version if needed)
pre-commit run --hook-stage manual mypy-3.10


@ZJY0516
Member

ZJY0516 commented Mar 12, 2026

Could we merge this asap?


@ywang96 ywang96 added the ready (ONLY add when PR is ready to merge/full CI is needed) label Mar 12, 2026
@xyang16
Contributor

xyang16 commented Mar 13, 2026

Can this fix get merged?

@tdoublep tdoublep merged commit f296a19 into vllm-project:main Mar 13, 2026
52 checks passed
whycoming pushed a commit to whycoming/vllm that referenced this pull request Mar 13, 2026
athrael-soju pushed a commit to athrael-soju/vllm that referenced this pull request Mar 16, 2026
wendyliu235 pushed a commit to wendyliu235/vllm-public that referenced this pull request Mar 18, 2026

Labels

bug (Something isn't working) · qwen (Related to Qwen models) · ready (ONLY add when PR is ready to merge/full CI is needed)


4 participants