[Bugfix] Fix FlashInfer GDN warmup ValueError on SM90 GPUs#36876
tdoublep merged 5 commits into vllm-project:main
Conversation
|
Sorry about this. I think we still need this because the FlashInfer kernel is a JIT kernel. |
|
@ZJY0516 OK, then we just need to fix the return types. Let me update the PR. |
Force-pushed 99d804b to 927f39c
Code Review
This pull request correctly addresses a crash during the Gated Delta Net (GDN) layer warmup on SM90 GPUs. The fix involves skipping the warmup, which is intended for Triton autotuning, on SM90 architectures as they use the FlashInfer backend and do not require this step. The change is implemented by adding a conditional check for CUDA and SM90 device capability. While the fix is correct, I've added a comment regarding code duplication that could be addressed to improve long-term maintainability.
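As a hedged illustration of the guard this review describes (the PR was later revised to fix the return types instead of skipping the warmup entirely), the backend decision boils down to checking for a CUDA device with SM90 compute capability. The function name and signature below are hypothetical, not from the PR:

```python
def should_skip_gdn_warmup(device_type: str, capability: tuple[int, int]) -> bool:
    """Return True when the GDN warmup (Triton autotuning) can be skipped.

    On CUDA devices with SM90 compute capability (H100/H200), the FlashInfer
    backend is used, so the Triton autotune warmup step is unnecessary.
    """
    return device_type == "cuda" and capability[0] == 9

# In practice the inputs would come from torch.cuda.get_device_capability().
```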
vllm/model_executor/models/qwen3_next.py (682-683)
While this check correctly fixes the issue, it duplicates the logic from ChunkGatedDeltaRule.__init__ which is used to determine whether to use the FlashInfer backend. This could lead to future maintenance issues if the backend selection logic changes but this check is not updated in tandem.
To improve maintainability and avoid this duplication, consider centralizing the backend choice logic. For example, you could add a property to the ChunkGatedDeltaRule class to indicate which backend is in use:
```python
# In ChunkGatedDeltaRule
@property
def uses_flashinfer(self) -> bool:
    return self._forward_method == self.forward_cuda
```

Then, you could use this property here to make the decision, ensuring the warmup logic always stays in sync with the actual backend being used:

```python
# In _warmup_prefill_kernels
if self.chunk_gated_delta_rule.uses_flashinfer:
    return
```

This would make the code more robust to future changes.
FlashInfer's chunk_gated_delta_rule returns a single tensor when output_final_state=False, but the wrapper always unpacked two values. This caused a ValueError during GDN kernel warmup (added in vllm-project#36599) on SM90 GPUs (H100/H200). Handle the return value based on output_final_state: unpack the tuple when True, use the single tensor when False. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com>
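The unpacking fix described in this commit message can be sketched as follows. The wrapper name mirrors `fi_chunk_gated_delta_rule` from the summary, but the kernel here is a stand-in passed as an argument, so treat this as an illustration of the return-value handling rather than vLLM's actual code:

```python
def fi_chunk_gated_delta_rule(kernel, *args, output_final_state: bool = False):
    """Normalize the kernel's return value for downstream callers.

    The kernel returns (core_out, final_state) when output_final_state=True,
    but only core_out when it is False; unconditionally unpacking two values
    raises a ValueError in the latter case.
    """
    result = kernel(*args, output_final_state=output_final_state)
    if output_final_state:
        core_attn_out, final_state = result
    else:
        core_attn_out, final_state = result, None
    return core_attn_out, final_state
```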
Force-pushed 927f39c to 5fea689
|
@ZJY0516 Please take another look. We now fix the error more explicitly, and still allow the warmup phase to happen when using FI kernels. |
|
Hi @tdoublep, the pre-commit checks have failed. Please run:

```shell
uv pip install pre-commit
pre-commit install
pre-commit run --all-files
```

Then, commit the changes and push to your branch.
|
|
Could we merge this asap? |
|
Can this fix get merged? |
…ect#36876) Signed-off-by: whycoming <120623296@qq.com>
…ect#36876) Signed-off-by: Athrael Soju <athrael.soju@gmail.com>
…ect#36876) Signed-off-by: wendyliu235 <wenjun.liu@intel.com>
Summary
`chunk_gated_delta_rule` returns a single tensor when `output_final_state=False`, but `fi_chunk_gated_delta_rule` always unpacked two values, causing a `ValueError`. Handle the return value based on `output_final_state`: unpack the tuple when `True`, use the single tensor when `False`.
Error before fix
This error repeats for T=16, T=32, and T=64 for each GDN layer.
Test plan
`tests/v1/e2e/test_mamba_prefix_cache.py::test_mamba_prefix_cache` on H100 (SM90) with all caches cleared: passes without the `ValueError` warning.
🤖 Generated with Claude Code