[NVIDIA] [feat] Integrate flashinfer Trtllmgen bf16 moe by Linda-Stadter · Pull Request #32954 · vllm-project/vllm

Linda-Stadter · 2026-01-23T17:15:31Z

Purpose

Integrate the flashinfer trtllm-gen BF16 MoE into supported models.
This is a rebased version of PR #28238 by @jiahanc, adapted to the latest MoE refactoring changes. I have also verified that the accuracy issues discussed in #28238 are resolved.

Test Plan

VLLM_USE_FLASHINFER_MOE_FP16=1 VLLM_FLASHINFER_MOE_BACKEND=latency vllm serve Qwen/Qwen3-Next-80B-A3B-Instruct --max-num-batched-tokens 8192 --max-model-len 32768 --no-enable-prefix-caching --async-scheduling --compilation_config.pass_config.enable_fi_allreduce_fusion true --compilation_config.pass_config.enable_noop true --compilation_config.custom_ops+=+rms_norm --compilation_config.cudagraph_mode FULL_DECODE_ONLY --compilation_config.splitting_ops [] --disable-log-requests -tp 2

lm_eval --model local-completions --tasks gsm8k --model_args model=Qwen/Qwen3-Next-80B-A3B-Instruct,base_url=http://0.0.0.0:8000/v1/completions,max_retries=3,tokenized_requests=False,timeout=1200,max_gen_toks=2048,max_length=8192 --batch_size 2048 --trust_remote_code --limit 0.5

Test Result


|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.9606|±  |0.0076|
|     |       |strict-match    |     5|exact_match|↑  |0.9167|±  |0.0108|

Test Plan

VLLM_USE_FLASHINFER_MOE_FP16=1 VLLM_FLASHINFER_MOE_BACKEND="latency" vllm serve Qwen/Qwen3-Next-80B-A3B-Instruct -tp 2 --enable-expert-parallel --async-scheduling --no-enable-prefix-caching

lm_eval --model local-completions --model_args model=Qwen/Qwen3-Next-80B-A3B-Instruct,base_url=http://0.0.0.0:8000/v1/completions -t gsm8k --num_fewshot 5 --batch_size 250
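Before starting a full lm_eval run, it can help to confirm that the server answers a single request. The sketch below is not part of the PR; it assumes the server from the test plan is listening on the same base_url and exposes the OpenAI-compatible completions endpoint. The helper names `build_payload` and `smoke_test` are hypothetical.

```python
import json
import urllib.request

# Values copied from the test plan above; adjust them if your server differs.
BASE_URL = "http://0.0.0.0:8000/v1/completions"
MODEL = "Qwen/Qwen3-Next-80B-A3B-Instruct"


def build_payload(prompt: str, max_tokens: int = 32) -> dict:
    """Build a request body for the OpenAI-compatible completions endpoint."""
    return {
        "model": MODEL,
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": 0.0,  # greedy decoding keeps the check repeatable
    }


def smoke_test(prompt: str = "2 + 2 =", timeout: float = 60.0) -> str:
    """Send one completion request and return the generated text.

    Hypothetical helper: it needs a running server and raises
    urllib.error.URLError if nothing is listening at BASE_URL.
    """
    request = urllib.request.Request(
        BASE_URL,
        data=json.dumps(build_payload(prompt)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    return body["choices"][0]["text"]
```

Calling `smoke_test()` once the server has finished loading should return a short completion; an error here points to a serving problem rather than an accuracy problem.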

Test Result

|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.8537|±  |0.0097|
|     |       |strict-match    |     5|exact_match|↑  |0.8135|±  |0.0107|
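The Stderr columns in the two result tables can be sanity-checked against the number of evaluated samples. The sketch below is an illustration, not something taken from the PR. It assumes the reported value is the standard error of a mean of 0/1 scores computed with the sample standard deviation, that the GSM8K test split has 1319 problems, and that `--limit 0.5` in the first run evaluates about half of them (659 here). The sample counts are not stated in the tables, and the function name is hypothetical.

```python
import math

GSM8K_TEST_SIZE = 1319  # assumption: size of the GSM8K test split


def binomial_stderr(accuracy: float, n: int) -> float:
    """Standard error of the mean of n 0/1 scores with the given mean.

    Uses the sample standard deviation (n - 1 in the denominator),
    which is an assumption about how the tables were produced.
    """
    return math.sqrt(accuracy * (1.0 - accuracy) / (n - 1))


# First run: --limit 0.5, so roughly half of the test split (assumed 659).
half = int(GSM8K_TEST_SIZE * 0.5)
first_run = [binomial_stderr(acc, half) for acc in (0.9606, 0.9167)]

# Second run: no --limit, so the full test split is assumed.
second_run = [binomial_stderr(acc, GSM8K_TEST_SIZE) for acc in (0.8537, 0.8135)]

for value in first_run + second_run:
    print(f"{value:.4f}")
```

Under these assumptions the computed values round to the Stderr figures reported in both tables, which is consistent with the first run covering about half of the test split and the second run covering all of it.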

Essential Elements of an Effective PR Description Checklist

The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
The test plan, such as providing test command.
The test results, such as pasting the results comparison before and after, or e2e results
(Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
(Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

Signed-off-by: Linda-Stadter <57756729+Linda-Stadter@users.noreply.github.com>

github-actions · 2026-01-23T17:15:40Z

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, which covers a small and essential subset of CI tests to quickly catch errors.

You can ask your reviewers to trigger select CI tests on top of fastcheck CI.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

If you have any questions, please reach out to us on Slack at https://slack.vllm.ai.

🚀

gemini-code-assist

Code Review

This pull request integrates flashinfer trtllm-gen BF16 moe to supported models. The changes include adding new functions for BF16 support in flashinfer_trtllm_moe.py, modifying unquantized.py to include FLASHINFER_TRTLLM as a backend option, and updating unquantized_fused_moe_method.py to handle the new backend. The review focuses on ensuring the correctness of the new BF16 implementation and the proper integration of the TRTLLM backend.

mergify · 2026-01-23T17:19:46Z

Hi @Linda-Stadter, the pre-commit checks have failed. Please run:

uv pip install pre-commit
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.

Tip

Is mypy or markdownlint failing?

mypy and markdownlint are run differently in CI. If the failure is related to either of these checks, please use the following commands to run them locally:

# For mypy (substitute "3.10" with the failing version if needed)
pre-commit run --hook-stage manual mypy-3.10
# For markdownlint
pre-commit run --hook-stage manual markdownlint

cursor

Cursor Bugbot has reviewed your changes and found 2 potential issues.

vadiklyutiy · 2026-01-23T23:35:03Z

@pavanimajety
This is a continuation of #28238 by @jiahanc, which you previously approved. Please take a look.

mergify · 2026-01-26T17:30:20Z

Hi @Linda-Stadter, the pre-commit checks have failed. Please run the pre-commit commands listed in the first mergify comment above, then commit the changes and push to your branch.

mergify · 2026-01-27T15:28:26Z

Hi @Linda-Stadter, the pre-commit checks have failed. Please run the pre-commit commands listed in the first mergify comment above, then commit the changes and push to your branch.

pavanimajety

LGTM, minor feedback.

vadiklyutiy

Looks good

vllm-project/vllm#32954

[NVIDIA] [feat] Integrate flashinfer Trtllmgen bf16 moe

5aab0b1

Signed-off-by: Linda-Stadter <57756729+Linda-Stadter@users.noreply.github.com>

Linda-Stadter requested review from mgoin, pavanimajety, robertgshaw2-redhat, tlrmchlsmth and yewentao256 as code owners January 23, 2026 17:15

mergify Bot added the nvidia label Jan 23, 2026

github-project-automation Bot added this to NVIDIA Jan 23, 2026

gemini-code-assist Bot reviewed Jan 23, 2026

cursor Bot reviewed Jan 23, 2026

Comment thread vllm/model_executor/layers/quantization/utils/flashinfer_utils.py

Comment thread vllm/model_executor/layers/fused_moe/unquantized_fused_moe_method.py

Fix pre-commit and comments

6863d85

Signed-off-by: Linda-Stadter <57756729+Linda-Stadter@users.noreply.github.com>

Linda-Stadter force-pushed the trtllmgen_bf16_moe_rebased branch from 731bf70 to 6863d85 Compare January 26, 2026 17:39

vadiklyutiy reviewed Jan 27, 2026

Comment thread vllm/model_executor/layers/fused_moe/oracle/unquantized.py Outdated

Comment thread vllm/model_executor/layers/fused_moe/oracle/unquantized.py

Fix forward native + resolve comments

38056f1

Signed-off-by: Linda-Stadter <57756729+Linda-Stadter@users.noreply.github.com>

Linda-Stadter force-pushed the trtllmgen_bf16_moe_rebased branch from 7bf4ac6 to 38056f1 Compare January 27, 2026 15:34

pavanimajety reviewed Jan 27, 2026

Comment thread vllm/model_executor/layers/fused_moe/oracle/unquantized.py Outdated

Comment thread vllm/model_executor/layers/fused_moe/oracle/unquantized.py Outdated

Comment thread vllm/model_executor/layers/fused_moe/unquantized_fused_moe_method.py Outdated

pavanimajety added the ready label (ONLY add when PR is ready to merge/full CI is needed) Jan 27, 2026

mgoin reviewed Jan 27, 2026

Comment thread vllm/model_executor/layers/fused_moe/oracle/unquantized.py Outdated

Comment thread vllm/model_executor/layers/fused_moe/flashinfer_trtllm_moe.py Outdated

Comment thread vllm/model_executor/layers/fused_moe/unquantized_fused_moe_method.py Outdated

Comments and updated condition

d7202bb

Signed-off-by: Linda-Stadter <57756729+Linda-Stadter@users.noreply.github.com>

vadiklyutiy self-requested a review January 28, 2026 16:19

vadiklyutiy approved these changes Jan 28, 2026

pavanimajety approved these changes Jan 29, 2026

github-project-automation Bot moved this to Ready in NVIDIA Jan 29, 2026

pavanimajety merged commit 0493d89 into vllm-project:main Jan 29, 2026
59 of 60 checks passed

github-project-automation Bot moved this from Ready to Done in NVIDIA Jan 29, 2026

vadiklyutiy mentioned this pull request Jan 29, 2026

[Tracking Issue]: Qwen3-next performance optimisations #27225

Closed

12 tasks

kyuyeunk mentioned this pull request Jan 30, 2026

[Bugfix] Fix monolithic select vllm-project/tpu-inference#1573

Merged

pawel-olejniczak mentioned this pull request Jan 30, 2026

[FIX_FOR_VLLM_LATEST] Fix the MultiModalKwargsItem to align with vllm changes vllm-project/vllm-gaudi#903

Closed

apd10 pushed a commit to apd10/vllm that referenced this pull request Jan 31, 2026

[NVIDIA] [feat] Integrate flashinfer Trtllmgen bf16 moe (vllm-project…

2547580

…#32954) Signed-off-by: Linda-Stadter <57756729+Linda-Stadter@users.noreply.github.com>

Linda-Stadter mentioned this pull request Feb 3, 2026

[NVIDIA][test] Tests for flashinfer TRTLLM BF16 MoE #33715

Merged

5 tasks

jiahanc mentioned this pull request Feb 6, 2026

[NVIDIA] [feat] Integrate flashinfer Trtllmgen bf16 moe #28238

Closed

5 tasks

qdanik added a commit to qdanik/vllm that referenced this pull request Feb 20, 2026

[NVIDIA] [feat] Integrate flashinfer Trtllmgen bf16 moe

d54d3df

vllm-project/vllm#32954

4 participants