[Perf] Do FP4 quant before All gather on flashinfer trtllmgen MOE #30014
pavanimajety merged 6 commits into vllm-project:main
Conversation
Code Review
This pull request introduces a performance optimization by moving FP4 quantization before the All-Gather operation in Flashinfer TRTLLMGEN MOE. This reduces the communication overhead. The changes correctly plumb the necessary extra_tensors through the distributed communication layers. I've identified one critical issue regarding an incorrect return type annotation that should be addressed.
Perf test on 4xGB200, pure prefill (ISL 2048, OSL 1), with the quant-before-all-gather optimization
💡 Codex Review
Here are some automated review suggestions for this pull request.
```python
return self.all2all_manager.dispatch(
    hidden_states,
    router_logits,
    is_sequence_parallel,
    extra_tensors,  # type: ignore[call-arg]
)
```
Dispatch passes extra tensors to backends that cannot accept them
The new pre-quantized path now calls `all2all_manager.dispatch(...)` with an `extra_tensors` argument (see the call above), but only `NaiveAll2AllManager` and `AgRsAll2AllManager` were updated to accept that parameter. Other supported backends (e.g., `PPLXAll2AllManager.dispatch` at vllm/distributed/device_communicators/all2all.py:229-235 and `DeepEPAll2AllManagerBase.dispatch` at lines 276-282) still take only `(hidden_states, router_logits, is_sequence_parallel)`. When ModelOpt FP4 MoE runs with those backends and `post_quant_allgather` is enabled, this call will raise `TypeError: dispatch() takes 4 positional arguments but 5 were given`, crashing inference for those configurations.
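A minimal sketch of the failure mode this comment describes. The classes below are toy stand-ins (names follow the review comment, bodies are hypothetical, not vLLM code); only the `dispatch()` signatures matter:

```python
# Toy reproduction of the signature mismatch: a backend whose dispatch() was
# not updated with the new extra_tensors parameter raises TypeError when the
# caller always forwards it positionally.
class NaiveAll2AllManager:
    # updated backend: accepts the new optional extra_tensors parameter
    def dispatch(self, hidden_states, router_logits, is_sequence_parallel,
                 extra_tensors=None):
        return hidden_states, router_logits, extra_tensors

class PPLXAll2AllManager:
    # not updated: still the old three-parameter signature
    def dispatch(self, hidden_states, router_logits, is_sequence_parallel):
        return hidden_states, router_logits

def pre_quant_dispatch(manager):
    # the new pre-quantized path always forwards extra_tensors
    return manager.dispatch("hidden", "logits", True, ["fp4_scale_factors"])

pre_quant_dispatch(NaiveAll2AllManager())   # fine
try:
    pre_quant_dispatch(PPLXAll2AllManager())
except TypeError as e:
    print(e)  # takes 4 positional arguments but 5 were given
```

Giving `extra_tensors` a `None` default in the base signature (and in every backend override) would keep backends that ignore it source-compatible with the new call site.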
@mgoin @pavanimajety, could you help review?
cc @bnellnm @varun-sundar-rabindranath for the changes in vllm/model_executor/layers/quantization/utils/flashinfer_fp4_moe.py
Hi @jiahanc, the pre-commit checks have failed. Please run:

```shell
uv pip install pre-commit
pre-commit install
pre-commit run --all-files
```

Then, commit the changes and push to your branch.
@bnellnm @pavanimajety, could you help re-review the PR? Thanks~
```python
dispatch_res = get_ep_group().dispatch(
    hidden_states_to_dispatch,
    router_logits,
```
Dispatch uses undefined `hidden_states_to_dispatch`
In the DP dispatch path, the call to `get_ep_group().dispatch(...)` uses `hidden_states_to_dispatch`, but that variable is only assigned inside the `if post_quant_allgather` branch above; when the optimization is disabled (e.g., any `dp_size > 1` run that is not ModelOpt TRTLLM), this block is skipped and `hidden_states_to_dispatch` is undefined, so the forward will crash with an `UnboundLocalError` before any dispatch occurs.
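A toy reproduction of that control flow, plus one way to fix it (the helper and function names here are hypothetical stand-ins, not vLLM code):

```python
def quantize_fp4(x):
    # hypothetical stand-in for the FP4 quantization step
    return ("fp4", x)

def forward_buggy(hidden_states, post_quant_allgather):
    if post_quant_allgather:
        hidden_states_to_dispatch = quantize_fp4(hidden_states)
    # BUG: when the branch above is skipped, the name was never bound,
    # so this line raises UnboundLocalError
    return hidden_states_to_dispatch

def forward_fixed(hidden_states, post_quant_allgather):
    # fix: bind the name on every path before it is used
    hidden_states_to_dispatch = (
        quantize_fp4(hidden_states) if post_quant_allgather else hidden_states
    )
    return hidden_states_to_dispatch

try:
    forward_buggy("h", post_quant_allgather=False)
except UnboundLocalError as e:
    print("crash:", e)
print(forward_fixed("h", post_quant_allgather=False))  # h
```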
Signed-off-by: jiahanc <173873397+jiahanc@users.noreply.github.com>
2e0f12b to
15f9155
Compare
…lm-project#30014) Signed-off-by: jiahanc <173873397+jiahanc@users.noreply.github.com>
Purpose
Move the FP4 quant before the All-Gather when Flashinfer TRTLLMGEN MOE is used with DP + EP enabled. This reduces the message size in the All-Gather and thus speeds up the All-Gather kernel.
Blocked by #29804; will add the same optimization to flashinfer_trtllm_fp4_routed_moe after it is merged.
Original size vs. size after the change (screenshots): on the DeepSeek-R1 case, the message size is 2.4x smaller.
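A back-of-envelope sketch of why quantizing first shrinks the All-Gather payload. The sizes below are assumptions for illustration, not the exact tensors vLLM sends; the measured 2.4x figure above will differ from this idealized ratio because the real payload includes additional tensors:

```python
# Idealized per-token payload for the all-gather, before vs. after moving
# FP4 quantization ahead of it. hidden_size and the scale-factor block size
# are illustrative assumptions.
hidden_size = 7168        # e.g. DeepSeek-R1 hidden dim
sf_block = 16             # one FP8 scale factor per 16 FP4 elements

bf16_bytes = hidden_size * 2             # bf16: 2 bytes per element
fp4_bytes = hidden_size // 2             # FP4: two elements per byte
sf_bytes = hidden_size // sf_block       # FP8 scales: 1 byte per block

before = bf16_bytes                      # gather bf16, quantize afterwards
after = fp4_bytes + sf_bytes             # quantize first, gather FP4 + scales
print(before, after, round(before / after, 2))  # 14336 4032 3.56
```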
Test Plan
Test Result
Essential Elements of an Effective PR Description Checklist
supported_models.md and examples for a new model.