
Bump Flashinfer Version and Re-enable DeepSeek NVFP4 AR+Norm Fusion#34899

Merged
vllm-bot merged 3 commits into vllm-project:main from wzhao18:wzhao/bump-flashinfer-version
Feb 20, 2026

Conversation

wzhao18 (Contributor) commented Feb 19, 2026

Purpose

The patch fixing the DeepSeek V3 accuracy issue with AR+rms+fp4 fusion (#34395) is included in flashinfer 0.6.4. This PR bumps the flashinfer version and re-enables the fusion pass by default.

Test Plan

vllm serve nvidia/DeepSeek-V3.1-NVFP4 -tp=4 -cc.pass_config.fuse_allreduce_rms=True
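The `-cc` flag above applies a dotted override onto vLLM's compilation config; the same setting can be expressed as a JSON payload. A minimal sketch of building that payload (the key names are taken from the test command above, so verify them against your vLLM version):

```python
import json

# JSON-object form of the dotted override used above:
#   -cc.pass_config.fuse_allreduce_rms=True
compilation_config = {
    "pass_config": {
        # AR+rms+fp4 fusion pass, re-enabled by default once flashinfer >= 0.6.4
        "fuse_allreduce_rms": True,
    }
}

cc_json = json.dumps(compilation_config)
print(cc_json)  # can be passed as a single -cc '<json>' argument instead
```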

Test Result

lm_eval --model local-completions --model_args "base_url=http://0.0.0.0:8000/v1/completions,max_length=8192,tokenized_requests=False,tokenizer_backend=None,num_concurrent=32" --tasks gsm8k --num_fewshot 5
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.9538|±  |0.0058|
|     |       |strict-match    |     5|exact_match|↑  |0.9492|±  |0.0060|
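For reference, the Value and Stderr columns above can be turned into an approximate 95% confidence interval with the usual normal approximation. A small illustrative sketch (not part of lm_eval's output):

```python
def ci95(value: float, stderr: float) -> tuple[float, float]:
    """Approximate 95% confidence interval: value +/- 1.96 * stderr."""
    half = 1.96 * stderr
    return (value - half, value + half)

# flexible-extract row above: 0.9538 +/- 0.0058
low, high = ci95(0.9538, 0.0058)
print(f"[{low:.4f}, {high:.4f}]")  # roughly [0.9424, 0.9652]
```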


mergify bot added the ci/build, deepseek (Related to DeepSeek models), and nvidia labels Feb 19, 2026
gemini-code-assist bot left a comment

Code Review

This pull request correctly bumps the flashinfer version to 0.6.4 across several configuration and dependency files. This update incorporates a fix for an accuracy issue with AR+rms+fp4 fusion in Deepseek V3 models. As a result, the workaround that previously disabled this fusion pass has been removed from vllm/model_executor/models/config.py. The changes are consistent, well-justified, and should improve performance by re-enabling the fusion pass. The provided test results support the correctness of this change.

ProExpertProg (Collaborator) left a comment

Thanks for the quick turnaround, very clean. cc @mgoin any blockers for upgrading the FI version?

@github-project-automation github-project-automation bot moved this to Ready in NVIDIA Feb 19, 2026
@ProExpertProg ProExpertProg added the ready ONLY add when PR is ready to merge/full CI is needed label Feb 19, 2026
wzhao18 (Contributor, Author) commented Feb 19, 2026

@ProExpertProg Can we run full CI for this PR? I'm not sure the default runs a wide enough sweep of tests.

mgoin (Member) commented Feb 19, 2026

Yeah, this should be good! I would like to enable a lot of extra testing before merge to make sure we catch the MoE tests.

@ProExpertProg ProExpertProg added the ready-run-all-tests Trigger CI with all tests for wide-ranging PRs label Feb 19, 2026
ProExpertProg (Collaborator) commented
Okay enabled all CI tests!

wzhao18 force-pushed the wzhao/bump-flashinfer-version branch from 9abeaa8 to e41b6a1 on February 19, 2026 17:19
wzhao18 (Contributor, Author) commented Feb 19, 2026

CI fails because flashinfer-jit-cache 0.6.4 is not yet available on https://flashinfer.ai/whl/cu129/flashinfer-jit-cache/. Will look into this.

EDIT: seems there is an issue in flashinfer's release build: https://github.com/flashinfer-ai/flashinfer/actions/runs/22168423112

@mgoin mgoin self-requested a review February 19, 2026 18:25
wzhao18 force-pushed the wzhao/bump-flashinfer-version branch from e41b6a1 to 6afd78b on February 19, 2026 21:23
Signed-off-by: wzhao18 <wzhao18.sz@gmail.com>
wzhao18 force-pushed the wzhao/bump-flashinfer-version branch from 6afd78b to 42b4213 on February 20, 2026 01:23
wzhao18 (Contributor, Author) commented Feb 20, 2026

Took a look at the failed tests. Some tests aborted without running. The only one that looks suspicious is:

FAILED evals/gsm8k/test_gsm8k_correctness.py::test_gsm8k_correctness[DeepSeek-V2-Lite-Instruct-FP8] - AssertionError: GSM8K metric too low: 0.0114 < 0.7200 - 0.0800 = 0.6400
assert np.float64(0.011372251705837756) >= (0.72 - 0.08)
==================================================== 1 failed, 6 passed, 17 warnings in 1504.50s (0:25:04) =====================================================

But I ran it again locally and it passes.
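For context, the assertion in that failure implements a simple fixed-tolerance check: the measured GSM8K exact_match must be no more than 0.08 below the 0.72 baseline. A minimal sketch (the helper name is hypothetical; the numbers are copied from the log and the results table above):

```python
def passes_threshold(measured: float, expected: float, tolerance: float) -> bool:
    """True if the measured metric is within `tolerance` below `expected`."""
    return measured >= expected - tolerance

# The CI failure above: 0.0114 is far below 0.72 - 0.08 = 0.64.
print(passes_threshold(0.0114, 0.72, 0.08))   # False
# This PR's DeepSeek-V3.1-NVFP4 result comfortably clears the same bar.
print(passes_threshold(0.9538, 0.72, 0.08))   # True
```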

vadiklyutiy (Collaborator) commented
May we rerun CI?

mgoin (Member) commented Feb 20, 2026

@vadiklyutiy don't worry about re-running; I've validated that the failures are unrelated. I just want to wait for a few other PRs to merge first so main CI has good signal.

@vllm-bot vllm-bot merged commit ea5f903 into vllm-project:main Feb 20, 2026
143 of 151 checks passed
@github-project-automation github-project-automation bot moved this from Ready to Done in NVIDIA Feb 20, 2026
dosubot bot commented Feb 20, 2026

Related Documentation: checked 0 published document(s) in 1 knowledge base(s). No updates required.

DarkLight1337 added a commit to DarkLight1337/vllm that referenced this pull request Feb 21, 2026
…llm-project#34899)

Signed-off-by: wzhao18 <wzhao18.sz@gmail.com>
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk>
joeqzzuo pushed a commit to joeqzzuo/vllm that referenced this pull request Feb 21, 2026
…llm-project#34899)

Signed-off-by: wzhao18 <wzhao18.sz@gmail.com>
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk>
Signed-off-by: joezuo <qianzhou.zuo@gmail.com>
yugong333 pushed a commit to yugong333/vllm that referenced this pull request Feb 22, 2026
…llm-project#34899)

Signed-off-by: wzhao18 <wzhao18.sz@gmail.com>
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk>
jmamou pushed a commit to jmamou/vllm that referenced this pull request Feb 23, 2026
…llm-project#34899)

Signed-off-by: wzhao18 <wzhao18.sz@gmail.com>
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk>
puririshi98 added a commit to puririshi98/vllm that referenced this pull request Feb 25, 2026
This test aims to cover: vllm-project#34899

Not sure where this test would be put in the tests folder, open to suggestions from reviewers.

Signed-off-by: Rishi Puri <riship@nvidia.com>
@ProExpertProg ProExpertProg linked an issue Feb 26, 2026 that may be closed by this pull request
1 task
llsj14 pushed a commit to llsj14/vllm that referenced this pull request Mar 1, 2026
…llm-project#34899)

Signed-off-by: wzhao18 <wzhao18.sz@gmail.com>
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk>
scottgl9 added a commit to scottgl9/vllm that referenced this pull request Mar 2, 2026
Add comprehensive performance analysis for MiniMax-M2.5-REAP-139B-A10B-NVFP4-GB10:

Architecture confirmed:
- Attention IS NVFP4 in this model (ignore list = only lm_head + MoE gates)
- 3 MTP modules present (layers 62-64) — biggest performance lever available
- Per-step weight load: ~6.15 GB → 36–44 tok/s theoretical ceiling on GB10

Performance gap analysis:
- Current: 24 tok/s on Strix Halo (AMD); GB10 expected similar baseline
- vLLM is 1.78x slower than SGLang at BS=1 for NVFP4 MoE (documented gap)
- Gap sources: activation quant overhead, kernel launch overhead, no fused
  shuffle+reduce in MoE, generic CUTLASS configs

Key new PRs to integrate:
- vllm-project#35041 (OPEN): MTP+NVFP4 weight shape mismatch — required for MTP+NVFP4
- vllm-project#35442 (OPEN): Non-blocking MTP token copy — 6ms→200µs CPU-GPU sync
- vllm-project#33303 (OPEN): MiniMax PP+DP for multi-Spark scaling

Already-merged PRs confirmed in HEAD:
- vllm-project#34718 (act_quant_fusion.py): SiLU+FP4 fusion
- vllm-project#34899 (allreduce_rms_fusion.py): NVFP4 AR+Norm fusion
- vllm-project#30885: 8x4 SF tiling (not yet effective on GB10 — TRTLLM backend blocked)
tunglinwood pushed a commit to tunglinwood/vllm that referenced this pull request Mar 4, 2026
…llm-project#34899)

Signed-off-by: wzhao18 <wzhao18.sz@gmail.com>
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk>
scottgl9 added a commit to scottgl9/vllm that referenced this pull request Mar 4, 2026
askliar pushed a commit to askliar/vllm that referenced this pull request Mar 9, 2026
…llm-project#34899)

Signed-off-by: wzhao18 <wzhao18.sz@gmail.com>
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk>
Signed-off-by: Andrii Skliar <askliar@nvidia.com>
Copilot AI pushed a commit to machov/vllm that referenced this pull request Mar 10, 2026
…llm-project#34899)

Signed-off-by: wzhao18 <wzhao18.sz@gmail.com>
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk>

Labels

ci/build · deepseek (Related to DeepSeek models) · nvidia · ready (ONLY add when PR is ready to merge/full CI is needed) · ready-run-all-tests (Trigger CI with all tests for wide-ranging PRs)

Projects

Status: Done

Development

Successfully merging this pull request may close these issues.

[Bug]: AR+rms+fp4 fusion results in total accuracy collapse for DSV3-fp4

6 participants