
[Perf] Do FP4 quant before All gather on flashinfer trtllmgen MOE #30014

Merged
pavanimajety merged 6 commits into vllm-project:main from jiahanc:postQuantComm
Dec 16, 2025
Conversation

Contributor

@jiahanc jiahanc commented Dec 4, 2025

Purpose

Move the FP4 quantization before the All-Gather when FlashInfer TRTLLM-Gen MoE is used with DP + EP enabled. This reduces the message size in the All-Gather and thus speeds up the All-Gather kernel. Blocked by #29804; the same optimization will be added to flashinfer_trtllm_fp4_routed_moe after that PR is merged.
Original size

hidden_states (num_tokens, hidden_size), dtype bf16
routing_logits (num_tokens, num_experts), dtype fp32

After change

hidden_states (num_tokens, hidden_size/2), dtype uint8
hidden_states_sf (num_tokens, hidden_size/16), dtype fp8
routing_logits (num_tokens, num_experts), dtype fp32

For DeepSeek-R1, this is roughly a 2.4x reduction in message size.
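The per-token payload can be checked with simple byte counting. A sketch, assuming DeepSeek-R1's hidden_size=7168 and num_experts=256; raw counting gives ~3x, and the reported ~2.4x likely reflects layout/padding overhead in the real scale-factor tensor:

```python
# Per-token all-gather payload, before vs. after quantizing first.
# Assumed DeepSeek-R1 shapes (hypothetical byte counting only; the
# actual scale-factor layout may carry extra padding).
hidden_size, num_experts = 7168, 256

# Before: bf16 hidden states (2 B/elem) + fp32 routing logits (4 B/elem)
before = hidden_size * 2 + num_experts * 4                       # 15360 bytes/token

# After: packed FP4 (two values per uint8 byte) + fp8 scale factors
# (one per 16 elements) + the same fp32 routing logits
after = hidden_size // 2 + hidden_size // 16 + num_experts * 4   # 5056 bytes/token

print(before, after, round(before / after, 2))                   # 15360 5056 3.04
```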

Test Plan

python3 -m vllm.entrypoints.openai.api_server \
  --model nvidia/DeepSeek-R1-0528-FP4-v2 \
  --tokenizer nvidia/DeepSeek-R1-0528-FP4-v2 \
  --dtype auto \
  --kv-cache-dtype fp8 \
  --tensor-parallel-size 1 \
  --pipeline-parallel-size 1 \
  --data-parallel-size 4 \
  --enable-expert-parallel \
  --swap-space 16 \
  --max-num-seqs 1024 \
  --trust-remote-code \
  --max-model-len 4096 \
  --gpu-memory-utilization 0.8 \
  --max-num-batched-tokens 16384 \
  --no-enable-prefix-caching \
  --compilation_config.pass_config.enable_fi_allreduce_fusion true \
  --compilation_config.pass_config.enable_attn_fusion true \
  --compilation_config.pass_config.enable_noop true \
  --compilation_config.custom_ops+=+quant_fp8,+rms_norm \
  --compilation_config.max_cudagraph_capture_size 2048 &


lm_eval --model local-completions --tasks gsm8k --model_args model=nvidia/DeepSeek-R1-0528-FP4-v2,base_url=http://0.0.0.0:8000/v1/completions,max_retries=3,tokenized_requests=False,timeout=1200,max_gen_toks=2048,max_length=8192 --batch_size 2048 --trust_remote_code --limit 0.5

Test Result

local-completions (model=nvidia/DeepSeek-R1-0528-FP4-v2,base_url=http://0.0.0.0:8000/v1/completions,max_retries=3,tokenized_requests=False,timeout=1200,max_gen_toks=2048,max_length=8192,trust_remote_code=True), gen_kwargs: (None), limit: 0.5, num_fewshot: None, batch_size: 2048
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.9712|±  |0.0065|
|     |       |strict-match    |     5|exact_match|↑  |0.9712|±  |0.0065|

Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a performance optimization by moving FP4 quantization before the All-Gather operation in Flashinfer TRTLLMGEN MOE. This reduces the communication overhead. The changes correctly plumb the necessary extra_tensors through the distributed communication layers. I've identified one critical issue regarding an incorrect return type annotation that should be addressed.

Contributor Author

jiahanc commented Dec 4, 2025

Perf test on 4xGB200, pure prefill test (ISL 2048, OSL 1)
Original perf:

============ Serving Benchmark Result ============
Successful requests:                     512       
Failed requests:                         0         
Maximum request concurrency:             128       
Benchmark duration (s):                  15.15     
Total input tokens:                      1048064   
Total generated tokens:                  512       
Request throughput (req/s):              33.80     
Output token throughput (tok/s):         33.80     
Peak output token throughput (tok/s):    55.00     
Peak concurrent requests:                183.00    
Total Token throughput (tok/s):          69219.00  

With the quant-before-all-gather optimization:

============ Serving Benchmark Result ============
Successful requests:                     512       
Failed requests:                         0         
Maximum request concurrency:             128       
Benchmark duration (s):                  14.22     
Total input tokens:                      1048064   
Total generated tokens:                  512       
Request throughput (req/s):              35.99     
Output token throughput (tok/s):         35.99     
Peak output token throughput (tok/s):    64.00     
Peak concurrent requests:                192.00    
Total Token throughput (tok/s):          73714.89  
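Reading the two runs against each other (simple arithmetic on the numbers above, not part of the PR itself):

```python
# Speedup implied by the two benchmark runs above.
baseline_tok_s, optimized_tok_s = 69219.00, 73714.89
print(round(optimized_tok_s / baseline_tok_s, 3))  # 1.065 -> ~6.5% more throughput

# The wall-clock durations agree: 15.15 s vs. 14.22 s.
print(round(15.15 / 14.22, 3))                     # 1.065
```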


@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.


Comment on lines +332 to +336
return self.all2all_manager.dispatch(
    hidden_states,
    router_logits,
    is_sequence_parallel,
    extra_tensors,  # type: ignore[call-arg]


P1 Badge Dispatch passes extra tensors to backends that cannot accept them

The new pre-quantized path now calls all2all_manager.dispatch(...) with an extra_tensors argument (see call below), but only NaiveAll2AllManager and AgRsAll2AllManager were updated to accept that parameter. Other supported backends (e.g., PPLXAll2AllManager.dispatch at vllm/distributed/device_communicators/all2all.py:229-235 and DeepEPAll2AllManagerBase.dispatch at lines 276-282) still take only (hidden_states, router_logits, is_sequence_parallel). When ModelOpt FP4 MoE runs with those backends and post_quant_allgather is enabled, this call will raise TypeError: dispatch() takes 4 positional arguments but 5 were given, crashing inference for those configurations.


Contributor Author

jiahanc commented Dec 4, 2025

@mgoin @pavanimajety may you help review?

Member

mgoin commented Dec 4, 2025

cc @bnellnm @varun-sundar-rabindranath for the dispatch change


mergify bot commented Dec 8, 2025

Hi @jiahanc, the pre-commit checks have failed. Please run:

uv pip install pre-commit
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.

Tip

Is mypy or markdownlint failing?
mypy and markdownlint are run differently in CI. If the failure is related to either of these checks, please use the following commands to run them locally:
# For mypy (substitute "3.10" with the failing version if needed)
pre-commit run --hook-stage manual mypy-3.10
# For markdownlint
pre-commit run --hook-stage manual markdownlint

@jiahanc jiahanc marked this pull request as draft December 8, 2025 23:22
@jiahanc jiahanc marked this pull request as ready for review December 11, 2025 23:46
Contributor Author

jiahanc commented Dec 11, 2025

@bnellnm @pavanimajety may you help re-review the PR? Thanks~


@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.


Comment on lines +1952 to +1954
dispatch_res = get_ep_group().dispatch(
    hidden_states_to_dispatch,
    router_logits,

P1 Badge Dispatch uses undefined hidden_states_to_dispatch

In the DP dispatch path the call to get_ep_group().dispatch(...) uses hidden_states_to_dispatch, but that variable is only assigned inside the if post_quant_allgather branch above; when the optimization is disabled (e.g., any dp_size>1 run that is not ModelOpt TRTLLM), this block is skipped and hidden_states_to_dispatch is undefined, so the forward will crash with an UnboundLocalError before any dispatch occurs.
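A minimal sketch of the fix the review suggests, with hypothetical function and variable names: bind `hidden_states_to_dispatch` on both branches so the fallback path still has a value to dispatch.

```python
# Hypothetical control-flow sketch for the issue above: ensure
# hidden_states_to_dispatch is defined whether or not the
# quant-before-all-gather optimization runs.
def prepare_dispatch(hidden_states, post_quant_allgather, quantize_fp4):
    extra_tensors = None
    if post_quant_allgather:
        # Quantize first so the all-gather moves far fewer bytes.
        hidden_states_to_dispatch, scale_factors = quantize_fp4(hidden_states)
        extra_tensors = [scale_factors]
    else:
        # Fallback path: dispatch the bf16 activations unchanged.
        hidden_states_to_dispatch = hidden_states
    return hidden_states_to_dispatch, extra_tensors
```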


@mgoin mgoin added the performance (Performance-related issues), moe, and ready (ONLY add when PR is ready to merge/full CI is needed) labels Dec 12, 2025
@jiahanc jiahanc force-pushed the postQuantComm branch 2 times, most recently from 6c9de8d to fdfd942 on December 12, 2025 21:18
Signed-off-by: jiahanc <173873397+jiahanc@users.noreply.github.com>
@mgoin mgoin moved this to In review in NVIDIA Dec 16, 2025
@github-project-automation github-project-automation bot moved this from In review to Ready in NVIDIA Dec 16, 2025
@pavanimajety pavanimajety merged commit 254a7f8 into vllm-project:main Dec 16, 2025
62 checks passed
@github-project-automation github-project-automation bot moved this from Ready to Done in NVIDIA Dec 16, 2025
NickLucche pushed a commit to NickLucche/vllm that referenced this pull request Dec 17, 2025
…lm-project#30014)

Signed-off-by: jiahanc <173873397+jiahanc@users.noreply.github.com>
Majid-Taheri pushed a commit to Majid-Taheri/vllm that referenced this pull request Dec 23, 2025
…lm-project#30014)

Signed-off-by: jiahanc <173873397+jiahanc@users.noreply.github.com>
Signed-off-by: Ubuntu <mjtaheri68@gmail.com>
dsuhinin pushed a commit to dsuhinin/vllm that referenced this pull request Jan 21, 2026
…lm-project#30014)

Signed-off-by: jiahanc <173873397+jiahanc@users.noreply.github.com>
Signed-off-by: dsuhinin <suhinin.dmitriy@gmail.com>
ItzDEXX pushed a commit to ItzDEXX/vllm that referenced this pull request Feb 19, 2026
…lm-project#30014)

Signed-off-by: jiahanc <173873397+jiahanc@users.noreply.github.com>

Labels

  • nvidia
  • performance (Performance-related issues)
  • ready (ONLY add when PR is ready to merge/full CI is needed)

Projects

Status: Done

Development

Successfully merging this pull request may close these issues.

4 participants