
[NVIDIA] [feat] Integrate flashinfer Trtllmgen bf16 moe#32954

Merged
pavanimajety merged 4 commits into vllm-project:main from Linda-Stadter:trtllmgen_bf16_moe_rebased
Jan 29, 2026

Conversation

@Linda-Stadter Linda-Stadter (Contributor) commented Jan 23, 2026

Purpose

  • Integrate the flashinfer trtllm-gen BF16 MoE into supported models.
  • This is a rebased version of PR #28238 by @jiahanc, adapted to the latest MoE refactoring changes. I have verified that the accuracy issues discussed in #28238 are resolved.

Test Plan

VLLM_USE_FLASHINFER_MOE_FP16=1 VLLM_FLASHINFER_MOE_BACKEND=latency \
  vllm serve Qwen/Qwen3-Next-80B-A3B-Instruct \
  --max-num-batched-tokens 8192 --max-model-len 32768 \
  --no-enable-prefix-caching --async-scheduling \
  --compilation_config.pass_config.enable_fi_allreduce_fusion true \
  --compilation_config.pass_config.enable_noop true \
  --compilation_config.custom_ops+=+rms_norm \
  --compilation_config.cudagraph_mode FULL_DECODE_ONLY \
  --compilation_config.splitting_ops [] \
  --disable-log-requests -tp 2

lm_eval --model local-completions --tasks gsm8k \
  --model_args model=Qwen/Qwen3-Next-80B-A3B-Instruct,base_url=http://0.0.0.0:8000/v1/completions,max_retries=3,tokenized_requests=False,timeout=1200,max_gen_toks=2048,max_length=8192 \
  --batch_size 2048 --trust_remote_code --limit 0.5

Test Result


|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.9606|±  |0.0076|
|     |       |strict-match    |     5|exact_match|↑  |0.9167|±  |0.0108|
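For context, the stderr column can be turned into an approximate 95% confidence interval with a quick illustrative calculation (this is not part of lm_eval's output, just a sketch of how to read value ± stderr):

```python
def ci95(value: float, stderr: float) -> tuple[float, float]:
    """Approximate 95% confidence interval as value +/- 1.96 * stderr."""
    half = 1.96 * stderr
    return (round(value - half, 4), round(value + half, 4))

# flexible-extract: 0.9606 +/- 0.0076
print(ci95(0.9606, 0.0076))  # (0.9457, 0.9755)
```

This puts the flexible-extract accuracy comfortably above 0.94, consistent with the claim that the earlier accuracy issues are resolved.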

Test Plan (with expert parallelism)

VLLM_USE_FLASHINFER_MOE_FP16=1 VLLM_FLASHINFER_MOE_BACKEND="latency" vllm serve Qwen/Qwen3-Next-80B-A3B-Instruct -tp 2 --enable-expert-parallel --async-scheduling --no-enable-prefix-caching

lm_eval --model local-completions --model_args model=Qwen/Qwen3-Next-80B-A3B-Instruct,base_url=http://0.0.0.0:8000/v1/completions -t gsm8k --num_fewshot 5 --batch_size 250

Test Result

|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.8537|±  |0.0097|
|     |       |strict-match    |     5|exact_match|↑  |0.8135|±  |0.0107| 

Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

Signed-off-by: Linda-Stadter <57756729+Linda-Stadter@users.noreply.github.com>
@github-actions

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only fastcheck CI runs, which covers a small but essential subset of CI tests to quickly catch errors.

You can ask your reviewers to trigger select CI tests on top of fastcheck CI.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

If you have any questions, please reach out to us on Slack at https://slack.vllm.ai.

🚀

Contributor

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request integrates the flashinfer trtllm-gen BF16 MoE into supported models. The changes add new functions for BF16 support in flashinfer_trtllm_moe.py, modify unquantized.py to include FLASHINFER_TRTLLM as a backend option, and update unquantized_fused_moe_method.py to handle the new backend. The review focuses on the correctness of the new BF16 implementation and the proper integration of the TRTLLM backend.

Comment thread vllm/model_executor/layers/fused_moe/oracle/unquantized.py Outdated
Comment thread vllm/model_executor/layers/fused_moe/flashinfer_trtllm_moe.py
Comment thread vllm/model_executor/layers/fused_moe/flashinfer_trtllm_moe.py
Comment thread vllm/model_executor/layers/fused_moe/unquantized_fused_moe_method.py Outdated
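The dispatch the review summary describes can be pictured with a small, purely illustrative sketch. The enum and function names here are hypothetical, not vLLM's actual API; it only shows the shape of the gating implied by the test commands, where the TRTLLM path is opted into via env vars and applies to the BF16 unquantized path:

```python
import os
from enum import Enum


class MoeBackend(Enum):
    """Hypothetical backend enum for illustration only."""
    TRITON = "triton"
    FLASHINFER_TRTLLM = "flashinfer_trtllm"


def select_moe_backend(dtype: str) -> MoeBackend:
    """Sketch of unquantized-MoE backend dispatch (not vLLM's real code).

    The trtllm-gen kernels are chosen only when the user opts in via
    VLLM_USE_FLASHINFER_MOE_FP16, requests the latency backend, and the
    activation dtype is bf16; otherwise fall back to the default path.
    """
    use_flashinfer = os.environ.get("VLLM_USE_FLASHINFER_MOE_FP16", "0") == "1"
    backend = os.environ.get("VLLM_FLASHINFER_MOE_BACKEND", "")
    if use_flashinfer and backend == "latency" and dtype == "bfloat16":
        return MoeBackend.FLASHINFER_TRTLLM
    return MoeBackend.TRITON
```

With both env vars unset, the sketch falls through to the default path, which matches the opt-in behavior exercised by the test plan above.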
@mergify
Contributor

mergify Bot commented Jan 23, 2026

Hi @Linda-Stadter, the pre-commit checks have failed. Please run:

uv pip install pre-commit
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.

Tip

Is mypy or markdownlint failing?
mypy and markdownlint are run differently in CI. If the failure is related to either of these checks, please use the following commands to run them locally:
# For mypy (substitute "3.10" with the failing version if needed)
pre-commit run --hook-stage manual mypy-3.10
# For markdownlint
pre-commit run --hook-stage manual markdownlint


@cursor cursor Bot left a comment


Cursor Bugbot has reviewed your changes and found 2 potential issues.

Bugbot Autofix is OFF. To automatically fix reported issues with Cloud Agents, enable Autofix in the Cursor dashboard.

Comment @cursor review or bugbot run to trigger another review on this PR

Comment thread vllm/model_executor/layers/quantization/utils/flashinfer_utils.py
@vadiklyutiy
Collaborator

@pavanimajety
This is a continuation of #28238 by @jiahanc, which you previously approved. Please take a look.

@mergify
Contributor

mergify Bot commented Jan 26, 2026

Hi @Linda-Stadter, the pre-commit checks have failed again. Please run the pre-commit commands from the earlier mergify comment, then commit the changes and push to your branch.

Signed-off-by: Linda-Stadter <57756729+Linda-Stadter@users.noreply.github.com>
@Linda-Stadter Linda-Stadter force-pushed the trtllmgen_bf16_moe_rebased branch from 731bf70 to 6863d85 Compare January 26, 2026 17:39
Comment thread vllm/model_executor/layers/fused_moe/oracle/unquantized.py Outdated
Comment thread vllm/model_executor/layers/fused_moe/oracle/unquantized.py
@mergify
Contributor

mergify Bot commented Jan 27, 2026

Hi @Linda-Stadter, the pre-commit checks have failed again. Please run the pre-commit commands from the earlier mergify comment, then commit the changes and push to your branch.

Signed-off-by: Linda-Stadter <57756729+Linda-Stadter@users.noreply.github.com>
@Linda-Stadter Linda-Stadter force-pushed the trtllmgen_bf16_moe_rebased branch from 7bf4ac6 to 38056f1 Compare January 27, 2026 15:34
Collaborator

@pavanimajety pavanimajety left a comment


LGTM, minor feedback.

Comment thread vllm/model_executor/layers/fused_moe/oracle/unquantized.py Outdated
Comment thread vllm/model_executor/layers/fused_moe/oracle/unquantized.py Outdated
Comment thread vllm/model_executor/layers/fused_moe/unquantized_fused_moe_method.py Outdated
@pavanimajety pavanimajety added the ready ONLY add when PR is ready to merge/full CI is needed label Jan 27, 2026
Comment thread vllm/model_executor/layers/fused_moe/oracle/unquantized.py Outdated
Comment thread vllm/model_executor/layers/fused_moe/flashinfer_trtllm_moe.py Outdated
Comment thread vllm/model_executor/layers/fused_moe/unquantized_fused_moe_method.py Outdated
Signed-off-by: Linda-Stadter <57756729+Linda-Stadter@users.noreply.github.com>
@vadiklyutiy vadiklyutiy self-requested a review January 28, 2026 16:19
Collaborator

@vadiklyutiy vadiklyutiy left a comment


Looks good.

@github-project-automation github-project-automation Bot moved this to Ready in NVIDIA Jan 29, 2026
@pavanimajety pavanimajety merged commit 0493d89 into vllm-project:main Jan 29, 2026
59 of 60 checks passed
@github-project-automation github-project-automation Bot moved this from Ready to Done in NVIDIA Jan 29, 2026
apd10 pushed a commit to apd10/vllm that referenced this pull request Jan 31, 2026
…#32954)

Signed-off-by: Linda-Stadter <57756729+Linda-Stadter@users.noreply.github.com>
PiratePai pushed a commit to PiratePai/epd_shm that referenced this pull request Feb 3, 2026
…#32954)

Signed-off-by: Linda-Stadter <57756729+Linda-Stadter@users.noreply.github.com>
Signed-off-by: Pai <416932041@qq.com>
qdanik added a commit to qdanik/vllm that referenced this pull request Feb 20, 2026

Labels

nvidia ready ONLY add when PR is ready to merge/full CI is needed

Projects

Status: Done


4 participants