[NVIDIA] [feat] Integrate flashinfer Trtllmgen bf16 moe #32954
pavanimajety merged 4 commits into vllm-project:main
Signed-off-by: Linda-Stadter <57756729+Linda-Stadter@users.noreply.github.com>
Code Review
This pull request integrates the FlashInfer TRT-LLM-gen BF16 MoE kernels for supported models. The changes add new functions for BF16 support in flashinfer_trtllm_moe.py, modify unquantized.py to include FLASHINFER_TRTLLM as a backend option, and update unquantized_fused_moe_method.py to handle the new backend. The review focuses on the correctness of the new BF16 implementation and on the proper integration of the TRTLLM backend.
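As a rough illustration of what "adding FLASHINFER_TRTLLM as a backend option" could look like, the sketch below picks a backend from the two environment variables used in this PR's test plan. Everything except the variable names and the FLASHINFER_TRTLLM label is hypothetical: the enum, the function name, the other two members, and the assumption that the "latency" setting maps to the TRT-LLM-gen path are for illustration only and are not taken from the vLLM source.

```python
from enum import Enum
from typing import Mapping


class MoeBackendSketch(Enum):
    """Hypothetical backend labels; only FLASHINFER_TRTLLM is named in the PR."""
    DEFAULT = "default"                      # placeholder for the non-FlashInfer path
    FLASHINFER_OTHER = "flashinfer_other"    # placeholder for any other FlashInfer path
    FLASHINFER_TRTLLM = "flashinfer_trtllm"


def select_backend_sketch(env: Mapping[str, str]) -> MoeBackendSketch:
    """Choose a backend from environment variables (illustrative logic only)."""
    # The test plan sets VLLM_USE_FLASHINFER_MOE_FP16=1 to opt in.
    if env.get("VLLM_USE_FLASHINFER_MOE_FP16", "0") != "1":
        return MoeBackendSketch.DEFAULT
    # Assumption: "latency" selects the TRT-LLM-gen kernels.
    mode = env.get("VLLM_FLASHINFER_MOE_BACKEND", "").strip('"').lower()
    if mode == "latency":
        return MoeBackendSketch.FLASHINFER_TRTLLM
    return MoeBackendSketch.FLASHINFER_OTHER


if __name__ == "__main__":
    chosen = select_backend_sketch(
        {"VLLM_USE_FLASHINFER_MOE_FP16": "1", "VLLM_FLASHINFER_MOE_BACKEND": "latency"}
    )
    print(chosen.name)
```

Passing the environment as a mapping (for example `os.environ`) keeps the selection logic easy to test without mutating process state.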
Hi @Linda-Stadter, the pre-commit checks have failed. Please run:
uv pip install pre-commit
pre-commit install
pre-commit run --all-files
Then, commit the changes and push to your branch.
Cursor Bugbot has reviewed your changes and found 2 potential issues.
@pavanimajety
Signed-off-by: Linda-Stadter <57756729+Linda-Stadter@users.noreply.github.com>
731bf70 to 6863d85
Signed-off-by: Linda-Stadter <57756729+Linda-Stadter@users.noreply.github.com>
7bf4ac6 to 38056f1
pavanimajety left a comment
LGTM, minor feedback.
Signed-off-by: Linda-Stadter <57756729+Linda-Stadter@users.noreply.github.com>
…#32954) Signed-off-by: Linda-Stadter <57756729+Linda-Stadter@users.noreply.github.com>
…#32954) Signed-off-by: Linda-Stadter <57756729+Linda-Stadter@users.noreply.github.com> Signed-off-by: Pai <416932041@qq.com>
Purpose
Test Plan
VLLM_USE_FLASHINFER_MOE_FP16=1 VLLM_FLASHINFER_MOE_BACKEND=latency vllm serve Qwen/Qwen3-Next-80B-A3B-Instruct --max-num-batched-tokens 8192 --max-model-len 32768 --no-enable-prefix-caching --async-scheduling --compilation_config.pass_config.enable_fi_allreduce_fusion true --compilation_config.pass_config.enable_noop true --compilation_config.custom_ops+=+rms_norm --compilation_config.cudagraph_mode FULL_DECODE_ONLY --compilation_config.splitting_ops [] --disable-log-requests -tp 2
lm_eval --model local-completions --tasks gsm8k --model_args model=Qwen/Qwen3-Next-80B-A3B-Instruct,base_url=http://0.0.0.0:8000/v1/completions,max_retries=3,tokenized_requests=False,timeout=1200,max_gen_toks=2048,max_length=8192 --batch_size 2048 --trust_remote_code --limit 0.5
Test Result
Test Plan
VLLM_USE_FLASHINFER_MOE_FP16=1 VLLM_FLASHINFER_MOE_BACKEND="latency" vllm serve Qwen/Qwen3-Next-80B-A3B-Instruct -tp 2 --enable-expert-parallel --async-scheduling --no-enable-prefix-caching
lm_eval --model local-completions --model_args model=Qwen/Qwen3-Next-80B-A3B-Instruct,base_url=http://0.0.0.0:8000/v1/completions -t gsm8k --num_fewshot 5 --batch_size 250
Test Result
Essential Elements of an Effective PR Description Checklist
supported_models.md and examples for a new model.