
[BugFix] Fix TRT-LLM NVFP4 DP/EP #32349

Merged
robertgshaw2-redhat merged 7 commits into vllm-project:main from jiahanc:fix-trtllm-moe
Jan 19, 2026

Conversation

@jiahanc
Contributor

@jiahanc jiahanc commented Jan 14, 2026

Purpose

The FlashInfer TRT-LLM NVFP4 MoE path was broken by #31692:

(EngineCore_DP2 pid=545)   File "/usr/local/lib/python3.12/dist-packages/torch/_ops.py", line 841, in __call__
(EngineCore_DP2 pid=545)     return self._op(*args, **kwargs)
(EngineCore_DP2 pid=545)            ^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP2 pid=545)   File "/scratch/vllm/vllm/model_executor/layers/fused_moe/layer.py", line 2146, in moe_forward_shared
(EngineCore_DP2 pid=545)     return self.forward_impl(hidden_states, router_logits)
(EngineCore_DP2 pid=545)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP2 pid=545)   File "/scratch/vllm/vllm/model_executor/layers/fused_moe/layer.py", line 1950, in forward_impl
(EngineCore_DP2 pid=545)     self.quant_method.prepare_dp_allgather_tensor(
(EngineCore_DP2 pid=545)   File "/scratch/vllm/vllm/model_executor/layers/quantization/modelopt.py", line 1576, in prepare_dp_allgather_tensor
(EngineCore_DP2 pid=545)     assert self.moe_quant_config is not None
(EngineCore_DP2 pid=545)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
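For context, the failure mode can be reproduced in miniature: on the TRT-LLM NVFP4 path, `moe_quant_config` is legitimately left as `None`, so the helper that asserts on it before the DP allgather fails. The following is a minimal, hypothetical sketch of that shape (class and method names are illustrative, not vLLM's actual API):

```python
class QuantMethod:
    """Illustrative stand-in for a MoE quantization method."""

    def __init__(self, moe_quant_config=None):
        # On the TRT-LLM NVFP4 path this config may legitimately be None.
        self.moe_quant_config = moe_quant_config

    def prepare_dp_allgather_tensor(self, hidden_states):
        # Mirrors the failing assertion from the traceback above.
        assert self.moe_quant_config is not None
        return hidden_states


method = QuantMethod(moe_quant_config=None)
try:
    method.prepare_dp_allgather_tensor([1.0, 2.0])
    crashed = False
except AssertionError:
    crashed = True  # this is the crash the PR fixes

print(crashed)  # → True
```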

Test Plan

VLLM_RANDOMIZE_DP_DUMMY_INPUTS=1 \
VLLM_ATTENTION_BACKEND=FLASHINFER_MLA \
VLLM_USE_FLASHINFER_MOE_FP4=1 \
VLLM_USE_TRTLLM_RAGGED_DEEPSEEK_PREFILL=1 \
VLLM_FLASHINFER_MOE_BACKEND=latency \
python3 -m vllm.entrypoints.openai.api_server \
  --model nvidia/DeepSeek-R1-0528-FP4-v2 \
  --tokenizer nvidia/DeepSeek-R1-0528-FP4-v2 \
  --kv-cache-dtype fp8 \
  --tensor-parallel-size 1 \
  --pipeline-parallel-size 1 \
  --data-parallel-size 4 \
  --enable-expert-parallel \
  --max-model-len 34816 \
  --disable-uvicorn-access-log \
  --no-enable-prefix-caching \
  --async-scheduling \
  --all2all-backend allgather_reducescatter \
  --compilation-config.cudagraph_mode FULL_DECODE_ONLY \
  --compilation_config.custom_ops+=+rms_norm,+rotary_embedding \
  --gpu-memory-utilization 0.9 \
  --stream-interval 50 \
  --max-num-seqs 1024 \
  --max-num-batched-tokens 8192 \
  --max-cudagraph-capture-size 1024 &
lm_eval --model local-completions --tasks gsm8k --model_args model=nvidia/DeepSeek-R1-0528-FP4-v2,base_url=http://0.0.0.0:8000/v1/completions,max_retries=3,tokenized_requests=False,timeout=1200,max_gen_toks=2048,max_length=8192 --batch_size 2048 --trust_remote_code --limit 0.5

Test Result

local-completions (model=nvidia/DeepSeek-R1-0528-FP4-v2,base_url=http://0.0.0.0:8000/v1/completions,max_retries=3,tokenized_requests=False,timeout=1200,max_gen_toks=2048,max_length=8192,trust_remote_code=True), gen_kwargs: (None), limit: 0.5, num_fewshot: None, batch_size: 2048
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.9621|±  |0.0074|
|     |       |strict-match    |     5|exact_match|↑  |0.9621|±  |0.0074|

Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

@jiahanc
Contributor Author

jiahanc commented Jan 14, 2026

@pavanimajety @mgoin can you help review? Thanks!

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request aims to fix a crash that occurs when using FlashInfer with NVFP4 quantization for MoE layers, specifically with the TRT-LLM backend. The issue stems from an assertion failure where moe_quant_config was unexpectedly None. The proposed change correctly addresses the crash for the TRT-LLM backend by sourcing the a1_gscale from the layer object directly. However, my review indicates that this fix, while effective for the intended case, may introduce an AttributeError for other backends like FlashInfer-CUTLASS, where layer.a1_gscale is not set. I've provided a critical comment with a suggested code change to ensure the fix is robust across all supported backends.

     hidden_states_fp4, hidden_states_sf = flashinfer.fp4_quantize(
         hidden_states,
-        a1_gscale,
+        layer.a1_gscale,
Contributor


critical

While this change correctly fixes the issue for the FLASHINFER_TRTLLM backend, it could introduce a bug for other backends like FLASHINFER_CUTLASS. For non-TRTLLM backends, self.moe_quant_config is created, but layer.a1_gscale is not set, which would lead to an AttributeError here.

A more robust approach would be to conditionally access a1_gscale from either self.moe_quant_config or layer.

Suggested change:
-        layer.a1_gscale,
+        self.moe_quant_config.a1_gscale if self.moe_quant_config else layer.a1_gscale,
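The suggested fallback can be expressed as a small, self-contained pattern: prefer the scale from the quant config when the config exists, otherwise read it from the layer. This is a hypothetical sketch of that logic (`SimpleNamespace` stands in for the real config and layer objects):

```python
from types import SimpleNamespace


def resolve_a1_gscale(quant_method, layer):
    """Prefer the quant config's scale; fall back to the layer attribute.

    Mirrors the reviewer's suggestion: non-TRTLLM backends populate the
    config but may not set layer.a1_gscale, while the TRT-LLM path does
    the opposite, so neither source alone is safe.
    """
    cfg = quant_method.moe_quant_config
    return cfg.a1_gscale if cfg is not None else layer.a1_gscale


# Non-TRTLLM backend: config populated, layer has no a1_gscale attribute.
cutlass = SimpleNamespace(moe_quant_config=SimpleNamespace(a1_gscale=0.5))
assert resolve_a1_gscale(cutlass, SimpleNamespace()) == 0.5

# TRT-LLM backend: config is None, scale lives on the layer.
trtllm = SimpleNamespace(moe_quant_config=None)
assert resolve_a1_gscale(trtllm, SimpleNamespace(a1_gscale=0.25)) == 0.25
```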

@robertgshaw2-redhat robertgshaw2-redhat changed the title [fix] fix FI NVFP4 trtllm moe break [BugFix] Fix TRT-LLM NVFP4 DP/EP Jan 14, 2026
@robertgshaw2-redhat
Collaborator

thanks for the fix! sorry I did not catch this one in my testing.

@mergify mergify bot added the bug Something isn't working label Jan 14, 2026
@robertgshaw2-redhat
Collaborator

Can you post your deployment?

@robertgshaw2-redhat
Collaborator

add your launch command. I want to verify the result.

@robertgshaw2-redhat
Collaborator

Could you please add this case to the B200 Temporary MoE Refactor job?


@cursor cursor bot left a comment


Comment @cursor review or bugbot run to trigger another review on this PR

@robertgshaw2-redhat
Collaborator

The bot is correct. This maybe_gather_dp is a hack. Can you please add an assert that this should only be called with the TRT-LLM backend? And add a TODO for Rob to refactor away the need for maybe_gather_dp.
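The requested guard could look like the following minimal sketch. The class name, attribute, and backend string are illustrative, not vLLM's actual code:

```python
class TrtllmOnlyQuantMethod:
    """Hypothetical sketch of the backend guard the reviewer asked for."""

    def __init__(self, backend):
        self.backend = backend

    def maybe_gather_dp(self, hidden_states):
        # TODO(rob): refactor away the need for maybe_gather_dp.
        assert self.backend == "FLASHINFER_TRTLLM", (
            "maybe_gather_dp is only valid for the TRT-LLM backend"
        )
        return hidden_states


ok = TrtllmOnlyQuantMethod("FLASHINFER_TRTLLM").maybe_gather_dp([1.0])
assert ok == [1.0]
```

Any other backend would now fail loudly at the assert instead of crashing later inside the allgather path.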

@jiahanc
Contributor Author

jiahanc commented Jan 14, 2026

> add your launch command. I want to verify the result.

Added in the PR test plan

@jiahanc
Contributor Author

jiahanc commented Jan 14, 2026

> Could you please add this case to the B200 Temporary MoE Refactor job?

Could you elaborate on this? Is there a GitHub issue I should add it to?

@robertgshaw2-redhat
Collaborator

> Could you please add this case to the B200 Temporary MoE Refactor job?
>
> Could you elaborate on this? Is there a GitHub issue I should add it to?

https://github.com/vllm-project/vllm/blob/main/.buildkite/test-pipeline.yaml#L1441-L1446

@jiahanc
Contributor Author

jiahanc commented Jan 14, 2026

@robertgshaw2-redhat updated the PR per your comments; could you help review? Thanks!

@robertgshaw2-redhat robertgshaw2-redhat added the ready ONLY add when PR is ready to merge/full CI is needed label Jan 14, 2026
@robertgshaw2-redhat
Collaborator

I made some improvements to the PR

@robertgshaw2-redhat
Collaborator

Thanks for the fix! I triggered the tests. Please ping me on Slack if you need any more help getting this landed.

Signed-off-by: jiahanc <173873397+jiahanc@users.noreply.github.com>
Signed-off-by: jiahanc <173873397+jiahanc@users.noreply.github.com>
Signed-off-by: jiahanc <173873397+jiahanc@users.noreply.github.com>
Robert Shaw and others added 4 commits January 15, 2026 20:20
Signed-off-by: Robert Shaw <robshaw@redhat.com>
Signed-off-by: Robert Shaw <robshaw@redhat.com>
Signed-off-by: Robert Shaw <robshaw@redhat.com>
Signed-off-by: jiahanc <173873397+jiahanc@users.noreply.github.com>
@hjjq
Contributor

hjjq commented Jan 15, 2026

Hi @robertgshaw2-redhat , the only failing test seems to be due to container startup errors, can you rerun it and merge if it passes? Thanks!

@github-project-automation github-project-automation bot moved this to Ready in NVIDIA Jan 19, 2026
@robertgshaw2-redhat robertgshaw2-redhat merged commit 7350331 into vllm-project:main Jan 19, 2026
57 checks passed
@github-project-automation github-project-automation bot moved this from Ready to Done in NVIDIA Jan 19, 2026
gopalsarda pushed a commit to gopalsarda/vllm that referenced this pull request Jan 20, 2026
Signed-off-by: jiahanc <173873397+jiahanc@users.noreply.github.com>
Signed-off-by: Robert Shaw <robshaw@redhat.com>
Co-authored-by: Robert Shaw <robshaw@redhat.com>
dsuhinin pushed a commit to dsuhinin/vllm that referenced this pull request Jan 21, 2026
Signed-off-by: jiahanc <173873397+jiahanc@users.noreply.github.com>
Signed-off-by: Robert Shaw <robshaw@redhat.com>
Co-authored-by: Robert Shaw <robshaw@redhat.com>
Signed-off-by: dsuhinin <suhinin.dmitriy@gmail.com>
lapy pushed a commit to lapy/vllm that referenced this pull request Jan 27, 2026
Signed-off-by: jiahanc <173873397+jiahanc@users.noreply.github.com>
Signed-off-by: Robert Shaw <robshaw@redhat.com>
Co-authored-by: Robert Shaw <robshaw@redhat.com>
ItzDEXX pushed a commit to ItzDEXX/vllm that referenced this pull request Feb 19, 2026
Signed-off-by: jiahanc <173873397+jiahanc@users.noreply.github.com>
Signed-off-by: Robert Shaw <robshaw@redhat.com>
Co-authored-by: Robert Shaw <robshaw@redhat.com>

Labels

bug Something isn't working nvidia ready ONLY add when PR is ready to merge/full CI is needed

Projects

Status: Done

4 participants