
[RL] [FlashInfer] Integrate FlashInfer trtllm_fp4_block_scale_routed_moe#22209

Closed
zianglih wants to merge 6 commits into sgl-project:main from zianglih:nvfp4-routed

@zianglih
Contributor

@zianglih zianglih commented Apr 6, 2026

Motivation

@HumansAnd

This PR largely mirrors existing routed MoE integration:

This PR also depends on #22204 for FlashInfer trtllm moe refactoring.

Modifications

  • Rename and expand test_update_weights_from_disk_blackwell.py, which now covers both mxfp8 and nvfp4
  • Expand test_flashinfer_trtllm_gen_moe_backend.py for nvfp4 coverage
  • Add integration for trtllm_fp4_block_scale_routed_moe

Accuracy Tests

gsm8k

python3 -m sglang.launch_server --kv-cache-dtype bf16 --model nvidia/Qwen3-30B-A3B-NVFP4
python3 benchmark/gsm8k/bench_sglang.py --num-shots 8 --num-questions 1209 --parallel 1209 --platinum
Accuracy: 0.945
Invalid: 0.001
Latency: 8.517 s
Output throughput: 17180.418 token/s
curl -sS http://localhost:30000/update_weights_from_disk \
  -H 'Content-Type: application/json' \
  -d '{
    "model_path": "nvidia/Qwen3-30B-A3B-NVFP4",
    "flush_cache": true,
    "abort_all_requests": false
  }'
python3 benchmark/gsm8k/bench_sglang.py --num-shots 8 --num-questions 1209 --parallel 1209 --platinum
Accuracy: 0.942
Invalid: 0.001
Latency: 7.789 s
Output throughput: 18788.124 token/s
python3 -m pytest -s -q test/registered/backends/test_flashinfer_trtllm_gen_moe_backend.py -k NVFP4
============================================================================ warnings summary ============================================================================
../../usr/local/lib/python3.12/dist-packages/_pytest/config/__init__.py:1428
  /usr/local/lib/python3.12/dist-packages/_pytest/config/__init__.py:1428: PytestConfigWarning: Unknown config option: asyncio_mode
  
    self._warn_or_fail_if_strict(f"Unknown config option: {key}\n")

<frozen importlib._bootstrap>:488
  <frozen importlib._bootstrap>:488: DeprecationWarning: builtin type SwigPyPacked has no __module__ attribute

<frozen importlib._bootstrap>:488
  <frozen importlib._bootstrap>:488: DeprecationWarning: builtin type SwigPyObject has no __module__ attribute

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
2 passed, 6 deselected, 3 warnings in 137.38s (0:02:17)
sys:1: DeprecationWarning: builtin type swigvarlink has no __module__ attribute

python3 -m pytest -s -q test/registered/rl/test_update_weights_from_disk_blackwell.py -k NVFP4
============================================================================ warnings summary ============================================================================
../../usr/local/lib/python3.12/dist-packages/_pytest/config/__init__.py:1428
  /usr/local/lib/python3.12/dist-packages/_pytest/config/__init__.py:1428: PytestConfigWarning: Unknown config option: asyncio_mode
  
    self._warn_or_fail_if_strict(f"Unknown config option: {key}\n")

<frozen importlib._bootstrap>:488
  <frozen importlib._bootstrap>:488: DeprecationWarning: builtin type SwigPyPacked has no __module__ attribute

<frozen importlib._bootstrap>:488
  <frozen importlib._bootstrap>:488: DeprecationWarning: builtin type SwigPyObject has no __module__ attribute

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
1 passed, 1 deselected, 3 warnings, 3 subtests passed in 92.21s (0:01:32)
sys:1: DeprecationWarning: builtin type swigvarlink has no __module__ attribute
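
The reload step in the gsm8k run above can also be driven from Python. The following is a minimal sketch, not code from this PR: it builds the same JSON body as the curl command using only the standard library. The helper names `build_update_request`, `send_update`, and `accuracy_within`, as well as the 0.01 tolerance, are hypothetical; the endpoint path and payload fields are copied from the curl call.

```python
import json
import urllib.request


def build_update_request(base_url: str, model_path: str) -> urllib.request.Request:
    """Build a POST request mirroring the curl call (hypothetical helper)."""
    payload = {
        "model_path": model_path,
        "flush_cache": True,
        "abort_all_requests": False,
    }
    return urllib.request.Request(
        url=f"{base_url}/update_weights_from_disk",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_update(req: urllib.request.Request, timeout: float = 600.0) -> str:
    """Send the request and return the raw response body as text.

    The response format is not assumed here, so it is returned unparsed.
    """
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return resp.read().decode("utf-8")


def accuracy_within(before: float, after: float, tolerance: float = 0.01) -> bool:
    """Check that accuracy after the reload stays close to the earlier run.

    The default tolerance is an assumption for illustration only.
    """
    return abs(before - after) <= tolerance
```

With a server started as in the launch command above, `send_update(build_update_request("http://localhost:30000", "nvidia/Qwen3-30B-A3B-NVFP4"))` would issue the reload. For the two runs reported here, `accuracy_within(0.945, 0.942)` holds under the assumed tolerance.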

Speed Tests and Profiling

Checklist

Review and Merge Process

  1. Ping Merge Oncalls to start the process. See the PR Merge Process.
  2. Get approvals from CODEOWNERS and other reviewers.
  3. Trigger CI tests with comments or contact authorized users to do so.
    • Common commands include /tag-and-rerun-ci, /tag-run-ci-label, /rerun-failed-ci
  4. After green CI and required approvals, ask Merge Oncalls or people with Write permission to merge the PR.

Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces support for FP4 MoE using the FlashInfer/TRT-LLM backend, including a new routed MoE wrapper and integration with the model optimization quantization path. The implementation refactors weight handling to use standard parameter names and adds comprehensive tests for NVFP4 backends and weight updates. Feedback was provided to refactor the new FP4 MoE wrapper using a keyword argument dictionary to ensure consistency with existing wrappers in the codebase.

)
metrics = run_eval(args)
print(f"{metrics=}")
self.assertGreater(metrics["score"], 0.89)
Contributor Author

Set to 0.89 according to #22136

@nvpohanh
Collaborator

nvpohanh commented Apr 6, 2026

cc @trevor-m

@trevor-m
Collaborator

trevor-m commented Apr 7, 2026

We also have #21240

@zianglih
Contributor Author

zianglih commented Apr 7, 2026

@trevor-m do you plan to merge that PR? I can close this one since the implementation looks identical.

@zianglih
Contributor Author

zianglih commented Apr 7, 2026

I will strip this PR down to the weight update and test changes and hold until #21240 merges.

@zianglih zianglih changed the title [RL] [FlashInfer] Integrate FlashInfer trtllm_fp4_block_scale_routed_moe [RL] [FlashInfer] Fix weight update and expand tests for FlashInfer nvfp4 moe Apr 7, 2026
@zianglih zianglih changed the title [RL] [FlashInfer] Fix weight update and expand tests for FlashInfer nvfp4 moe [RL] [FlashInfer] Refactor NVFP4 trtllm shuffling/swizzling to in-place replacement Apr 7, 2026
@zianglih zianglih changed the title [RL] [FlashInfer] Refactor NVFP4 trtllm shuffling/swizzling to in-place replacement [RL] [FlashInfer] Integrate FlashInfer trtllm_fp4_block_scale_routed_moe Apr 7, 2026
@zianglih
Contributor Author

zianglih commented Apr 7, 2026

Closing this PR since the FlashInfer trtllm nvfp4 routed MoE implementation duplicates #21240.

Moving weight update refactoring and test file changes to:

@zianglih zianglih closed this Apr 7, 2026

Labels

blackwell (SM100/SM120), quant (LLM Quantization)

3 participants