Bump Flashinfer Version and Re-enable DeepSeek NVFP4 AR+Norm Fusion #34899
vllm-bot merged 3 commits into vllm-project:main
Conversation
Code Review
This pull request correctly bumps the flashinfer version to 0.6.4 across several configuration and dependency files. This update incorporates a fix for an accuracy issue with AR+rms+fp4 fusion in Deepseek V3 models. As a result, the workaround that previously disabled this fusion pass has been removed from vllm/model_executor/models/config.py. The changes are consistent, well-justified, and should improve performance by re-enabling the fusion pass. The provided test results support the correctness of this change.
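For illustration only, the removed workaround follows a pattern common to version-gated features: disable a fusion flag for the affected model family when the installed kernel library predates the fix. The sketch below is hypothetical; the function and key names (`maybe_disable_fusion`, `enable_ar_norm_fusion`, `model_family`) are assumptions, not vLLM's actual code in `vllm/model_executor/models/config.py`:

```python
# Hypothetical sketch of the workaround pattern this PR removes; all names
# here are illustrative assumptions, not vLLM's actual API.
FIX_VERSION = (0, 6, 4)  # flashinfer release containing the AR+rms+fp4 accuracy fix


def maybe_disable_fusion(config: dict, flashinfer_version: tuple) -> dict:
    """Disable AR+Norm fusion for DeepSeek models on pre-fix flashinfer builds."""
    if config.get("model_family") == "deepseek" and flashinfer_version < FIX_VERSION:
        config["enable_ar_norm_fusion"] = False
    return config
```

With flashinfer >= 0.6.4 the guard never triggers, which is why the PR can delete it outright rather than keep a dead branch.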
ProExpertProg
left a comment
Thanks for the quick turnaround, very clean. cc @mgoin any blockers for upgrading the FI version?
@ProExpertProg Can we run full CI for this PR? Not sure the default run performs a wide enough sweep of tests.
Yeah, this should be good! I would like to enable a lot of extra testing before merge to make sure we cover the MoE tests.
Okay, enabled all CI tests!
Force-pushed from 9abeaa8 to e41b6a1
CI fails because flashinfer-jit-cache 0.6.4 is not yet available on https://flashinfer.ai/whl/cu129/flashinfer-jit-cache/. Will check on this. EDIT: there seems to be an issue in flashinfer's release build: https://github.com/flashinfer-ai/flashinfer/actions/runs/22168423112
Force-pushed from e41b6a1 to 6afd78b
Signed-off-by: wzhao18 <wzhao18.sz@gmail.com>
Force-pushed from 6afd78b to 42b4213
Took a look at the failed tests. Some aborted without running. Only one looks suspicious, but I reran it locally and it passes.
Can we rerun CI?
@vadiklyutiy Don't worry about re-running; I've validated that the failures are unrelated. I just want to wait for a few other PRs to merge first so main CI has a good signal.
…llm-project#34899) Signed-off-by: wzhao18 <wzhao18.sz@gmail.com> Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk>
This test aims to cover: vllm-project#34899. Not sure where this test should live in the tests folder; open to suggestions from reviewers. Signed-off-by: Rishi Puri <riship@nvidia.com>
Add comprehensive performance analysis for MiniMax-M2.5-REAP-139B-A10B-NVFP4-GB10.

Architecture confirmed:
- Attention IS NVFP4 in this model (the ignore list contains only lm_head and the MoE gates)
- 3 MTP modules present (layers 62-64), the biggest performance lever available
- Per-step weight load: ~6.15 GB → 36-44 tok/s theoretical ceiling on GB10

Performance gap analysis:
- Current: 24 tok/s on Strix Halo (AMD); GB10 expected to show a similar baseline
- vLLM is 1.78x slower than SGLang at BS=1 for NVFP4 MoE (documented gap)
- Gap sources: activation quant overhead, kernel launch overhead, no fused shuffle+reduce in MoE, generic CUTLASS configs

Key new PRs to integrate:
- vllm-project#35041 (OPEN): MTP+NVFP4 weight shape mismatch fix, required for MTP+NVFP4
- vllm-project#35442 (OPEN): non-blocking MTP token copy, 6 ms → 200 µs CPU-GPU sync
- vllm-project#33303 (OPEN): MiniMax PP+DP for multi-Spark scaling

Already-merged PRs confirmed in HEAD:
- vllm-project#34718 (act_quant_fusion.py): SiLU+FP4 fusion
- vllm-project#34899 (allreduce_rms_fusion.py): NVFP4 AR+Norm fusion
- vllm-project#30885: 8x4 SF tiling (not yet effective on GB10; TRTLLM backend blocked)
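The 36-44 tok/s ceiling above follows from a simple bandwidth-bound decode model: each generated token requires streaming the active weights through memory once. A minimal sketch of that arithmetic, assuming a GB10 peak memory bandwidth of ~273 GB/s and an 80% achievable-efficiency lower bound (both numbers are assumptions; only the 6.15 GB per-step weight load comes from the analysis):

```python
# Bandwidth-bound decode ceiling: tokens/s = effective bandwidth / bytes read per step.
# PEAK_BW_GB_S is an assumed GB10 figure; WEIGHTS_PER_STEP_GB is from the analysis above.
PEAK_BW_GB_S = 273.0
WEIGHTS_PER_STEP_GB = 6.15


def decode_ceiling_tok_s(bw_gb_s: float, gb_per_step: float, efficiency: float = 1.0) -> float:
    """One decode step reads all active weights once, so throughput is capped by bandwidth."""
    return efficiency * bw_gb_s / gb_per_step


hi = decode_ceiling_tok_s(PEAK_BW_GB_S, WEIGHTS_PER_STEP_GB)        # ~44 tok/s at peak
lo = decode_ceiling_tok_s(PEAK_BW_GB_S, WEIGHTS_PER_STEP_GB, 0.8)   # ~36 tok/s at 80% efficiency
print(f"{lo:.0f}-{hi:.0f} tok/s")
```

This is why MTP is called the biggest lever: accepted speculative tokens amortize the same 6.15 GB weight read over multiple output tokens, raising the effective ceiling.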
Purpose
The patch fixing the Deepseek V3 accuracy issue with AR+rms+fp4 fusion (#34395) is included in flashinfer 0.6.4. This PR bumps the flashinfer version and re-enables the fusion pass by default.
Test Plan
Test Result
Essential Elements of an Effective PR Description Checklist
supported_models.md and examples for a new model.