Bump Flashinfer Version and Re-enable DeepSeek NVFP4 AR+Norm Fusion #34899
vllm-bot merged 3 commits into vllm-project:main
Conversation
Code Review
This pull request correctly bumps the flashinfer version to 0.6.4 across several configuration and dependency files. This update incorporates a fix for an accuracy issue with AR+rms+fp4 fusion in Deepseek V3 models. As a result, the workaround that previously disabled this fusion pass has been removed from vllm/model_executor/models/config.py. The changes are consistent, well-justified, and should improve performance by re-enabling the fusion pass. The provided test results support the correctness of this change.
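For illustration only, the removed workaround follows a pattern common to version-gated features: disable a fusion flag for the affected model family when the installed kernel library predates the fix. The sketch below is hypothetical; the function and key names (`maybe_disable_fusion`, `enable_ar_norm_fusion`, `model_family`) are assumptions, not vLLM's actual code in `vllm/model_executor/models/config.py`:

```python
# Hypothetical sketch of the workaround pattern this PR removes; all names
# here are illustrative assumptions, not vLLM's actual API.
FIX_VERSION = (0, 6, 4)  # flashinfer release containing the AR+rms+fp4 accuracy fix


def maybe_disable_fusion(config: dict, flashinfer_version: tuple) -> dict:
    """Disable AR+Norm fusion for DeepSeek models on pre-fix flashinfer builds."""
    if config.get("model_family") == "deepseek" and flashinfer_version < FIX_VERSION:
        config["enable_ar_norm_fusion"] = False
    return config
```

With flashinfer >= 0.6.4 the guard never triggers, which is why the PR can delete it outright rather than keep a dead branch.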
ProExpertProg
left a comment
Thanks for the quick turnaround, very clean. cc @mgoin any blockers for upgrading the FI version?
@ProExpertProg Can we run full CI for this PR? Not sure the default run performs a wide enough sweep of tests.
Yeah, this should be good! I would like to enable a lot of extra testing before merge to make sure we cover the MoE tests.
Okay, enabled all CI tests!
Force-pushed from 9abeaa8 to e41b6a1
CI fails because flashinfer-jit-cache 0.6.4 is not yet available on https://flashinfer.ai/whl/cu129/flashinfer-jit-cache/. Will check on this. EDIT: there seems to be an issue in flashinfer's release build: https://github.com/flashinfer-ai/flashinfer/actions/runs/22168423112
Force-pushed from e41b6a1 to 6afd78b
Signed-off-by: wzhao18 <wzhao18.sz@gmail.com>
Force-pushed from 6afd78b to 42b4213
Took a look at the failed tests. Some aborted without running. Only one looks suspicious, but I reran it locally and it passes.
Can we rerun CI?
@vadiklyutiy Don't worry about re-running; I've validated that the failures are unrelated. I just want to wait for a few other PRs to merge first so main CI has a good signal.
…llm-project#34899) Signed-off-by: wzhao18 <wzhao18.sz@gmail.com> Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk>
This test aims to cover: vllm-project#34899. Not sure where this test should live in the tests folder; open to suggestions from reviewers. Signed-off-by: Rishi Puri <riship@nvidia.com>
Add comprehensive performance analysis for MiniMax-M2.5-REAP-139B-A10B-NVFP4-GB10.

Architecture confirmed:
- Attention IS NVFP4 in this model (the ignore list contains only lm_head and the MoE gates)
- 3 MTP modules present (layers 62-64), the biggest performance lever available
- Per-step weight load: ~6.15 GB → 36-44 tok/s theoretical ceiling on GB10

Performance gap analysis:
- Current: 24 tok/s on Strix Halo (AMD); GB10 expected to show a similar baseline
- vLLM is 1.78x slower than SGLang at BS=1 for NVFP4 MoE (documented gap)
- Gap sources: activation quant overhead, kernel launch overhead, no fused shuffle+reduce in MoE, generic CUTLASS configs

Key new PRs to integrate:
- vllm-project#35041 (OPEN): MTP+NVFP4 weight shape mismatch fix, required for MTP+NVFP4
- vllm-project#35442 (OPEN): non-blocking MTP token copy, 6 ms → 200 µs CPU-GPU sync
- vllm-project#33303 (OPEN): MiniMax PP+DP for multi-Spark scaling

Already-merged PRs confirmed in HEAD:
- vllm-project#34718 (act_quant_fusion.py): SiLU+FP4 fusion
- vllm-project#34899 (allreduce_rms_fusion.py): NVFP4 AR+Norm fusion
- vllm-project#30885: 8x4 SF tiling (not yet effective on GB10; TRTLLM backend blocked)
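The 36-44 tok/s ceiling above follows from a simple bandwidth-bound decode model: each generated token requires streaming the active weights through memory once. A minimal sketch of that arithmetic, assuming a GB10 peak memory bandwidth of ~273 GB/s and an 80% achievable-efficiency lower bound (both numbers are assumptions; only the 6.15 GB per-step weight load comes from the analysis):

```python
# Bandwidth-bound decode ceiling: tokens/s = effective bandwidth / bytes read per step.
# PEAK_BW_GB_S is an assumed GB10 figure; WEIGHTS_PER_STEP_GB is from the analysis above.
PEAK_BW_GB_S = 273.0
WEIGHTS_PER_STEP_GB = 6.15


def decode_ceiling_tok_s(bw_gb_s: float, gb_per_step: float, efficiency: float = 1.0) -> float:
    """One decode step reads all active weights once, so throughput is capped by bandwidth."""
    return efficiency * bw_gb_s / gb_per_step


hi = decode_ceiling_tok_s(PEAK_BW_GB_S, WEIGHTS_PER_STEP_GB)        # ~44 tok/s at peak
lo = decode_ceiling_tok_s(PEAK_BW_GB_S, WEIGHTS_PER_STEP_GB, 0.8)   # ~36 tok/s at 80% efficiency
print(f"{lo:.0f}-{hi:.0f} tok/s")
```

This is why MTP is called the biggest lever: accepted speculative tokens amortize the same 6.15 GB weight read over multiple output tokens, raising the effective ceiling.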
Purpose
The patch fixing the Deepseek V3 accuracy issue with AR+rms+fp4 fusion (#34395) is included in flashinfer 0.6.4. This PR bumps the flashinfer version and re-enables the fusion pass by default.
Test Plan
Test Result
Essential Elements of an Effective PR Description Checklist
supported_models.md and examples for a new model.