[Performance][Fix] update nvfp4 code to support renorm routing #28569

Merged
vllm-bot merged 4 commits into vllm-project:main from jiahanc:Qwen3nvfp4 on Nov 17, 2025
@jiahanc jiahanc (Contributor) commented Nov 12, 2025

Purpose

Fixes #28007

  • Add support for multiple routing methods to the FlashInfer FP4 TRTLLM MoE path, including the renormalize ("renorm") routing needed by models such as Qwen3.
  • Add the FlashInfer TRTLLM MoE backend to the global_sf list, from which it was missing.
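
For readers unfamiliar with the term, the following is a minimal sketch of what "renorm" routing is commonly taken to mean in MoE layers: select the top-k experts by router logit, then apply softmax over only the selected logits so the weights sum to 1. This is an illustration under that assumption, not code from this PR. The function names `renormalize_route` and `softmax_then_topk_route` are hypothetical and are not part of vLLM or FlashInfer.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def renormalize_route(logits, top_k):
    """Hypothetical helper: pick the top_k experts by logit, then softmax
    over only the selected logits, so the returned weights sum to 1."""
    order = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
    chosen = order[:top_k]
    weights = softmax([logits[i] for i in chosen])
    return chosen, weights


def softmax_then_topk_route(logits, top_k):
    """Hypothetical helper for comparison: softmax over all experts, then
    keep the top_k probabilities without rescaling them."""
    probs = softmax(logits)
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    chosen = order[:top_k]
    return chosen, [probs[i] for i in chosen]


if __name__ == "__main__":
    router_logits = [2.0, 0.5, 1.0, -1.0]  # one token, four experts
    print(renormalize_route(router_logits, top_k=2))
    print(softmax_then_topk_route(router_logits, top_k=2))
```

Both variants select the same experts; they differ only in whether the selected weights are rescaled to sum to 1. A kernel that hard-codes one variant will produce different expert weights for a model trained with the other, which is why the routing method has to be passed through explicitly.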

Test Plan

VLLM_USE_FLASHINFER_MOE_FP4=1 VLLM_FLASHINFER_MOE_BACKEND=latency vllm serve nvidia/Qwen3-235B-A22B-FP4 \
  --max-num-batched-tokens 8192 \
  --max-model-len 16384 \
  --no-enable-prefix-caching \
  --cuda_graph_sizes 1024 \
  --async-scheduling \
  -tp 2 \
  --enable-expert-parallel

lm_eval --model local-completions --tasks gsm8k \
  --model_args model=nvidia/Qwen3-235B-A22B-FP4,base_url=http://0.0.0.0:8000/v1/completions,max_retries=3,tokenized_requests=False,timeout=1200,max_gen_toks=2048,max_length=8192 \
  --batch_size 2048 --trust_remote_code --limit 0.5

Test Result

[2025-11-12 20:50:33] INFO evaluation_tracker.py:280: Output path not provided, skipping saving results aggregated
local-completions (model=nvidia/Qwen3-235B-A22B-FP4,base_url=http://0.0.0.0:8000/v1/completions,max_retries=3,tokenized_requests=False,timeout=1200,max_gen_toks=2048,max_length=8192,trust_remote_code=True), gen_kwargs: (None), limit: 0.5, num_fewshot: None, batch_size: 2048
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.9348|±  |0.0096|
|     |       |strict-match    |     5|exact_match|↑  |0.9348|±  |0.0096|
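
As a rough sanity check on the reported standard error, the sketch below recomputes it from the accuracy and the sample count. It rests on two assumptions that are not stated in the PR: that the GSM8K test split has 1,319 problems (so `--limit 0.5` leaves about 660), and that the harness reports the sample standard error of the mean of 0/1 scores. The helper name `binomial_stderr` is hypothetical.

```python
import math


def binomial_stderr(accuracy, n_samples):
    """Sample standard error of the mean of 0/1 scores.

    Uses n - 1 in the denominator, matching a sample (not population)
    standard deviation divided by sqrt(n).
    """
    return math.sqrt(accuracy * (1.0 - accuracy) / (n_samples - 1))


# Assumption: 1,319 GSM8K test problems, halved by --limit 0.5.
n_samples = math.ceil(1319 * 0.5)
stderr = binomial_stderr(0.9348, n_samples)
print(f"n={n_samples}, stderr={stderr:.4f}")
```

Under these assumptions the computed value is about 0.0096, consistent with the Stderr column above.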

@mergify mergify bot added the frontend and performance labels Nov 12, 2025
@mergify mergify bot added the nvidia label Nov 13, 2025
@jiahanc jiahanc marked this pull request as ready for review November 13, 2025 17:14
@jiahanc jiahanc requested a review from pavanimajety November 13, 2025 17:16
@jiahanc jiahanc changed the title from "[Performance] update nvfp4 code to support renorm routing" to "[Performance][Fix] update nvfp4 code to support renorm routing" Nov 13, 2025

@yewentao256 yewentao256 (Member) left a comment

Thanks for the work!

@pavanimajety pavanimajety (Collaborator) left a comment

Thanks for the fix, LGTM.

@github-project-automation github-project-automation bot moved this to In review in NVIDIA Nov 13, 2025
@pavanimajety pavanimajety added the ready label Nov 13, 2025
Signed-off-by: jiahanc <173873397+jiahanc@users.noreply.github.com>
Signed-off-by: jiahanc <173873397+jiahanc@users.noreply.github.com>

@mgoin mgoin (Member) left a comment

LGTM, thank you

@mgoin mgoin enabled auto-merge (squash) November 14, 2025 17:46
@pavanimajety pavanimajety enabled auto-merge (squash) November 15, 2025 04:55
@vllm-bot vllm-bot merged commit 561253b into vllm-project:main Nov 17, 2025
52 of 53 checks passed
@github-project-automation github-project-automation bot moved this from In review to Done in NVIDIA Nov 17, 2025
devpatelio pushed a commit to SumanthRH/vllm that referenced this pull request Nov 29, 2025
…project#28569)

Signed-off-by: jiahanc <173873397+jiahanc@users.noreply.github.com>
Co-authored-by: Michael Goin <mgoin64@gmail.com>
kitaekatt pushed a commit to kitaekatt/vllm that referenced this pull request Dec 1, 2025
…project#28569)

Signed-off-by: jiahanc <173873397+jiahanc@users.noreply.github.com>
Co-authored-by: Michael Goin <mgoin64@gmail.com>

Labels: frontend, nvidia, performance, ready

Projects: NVIDIA (Status: Done)

Linked issue (may be closed by merging this pull request): [Bug]: Can't run Flashinfer MoE TRTLLM backend FP4 for Qwen3 235B

5 participants