Fix MoE backend selection for LoRA (unquantized MoE) #40273
robertgshaw2-redhat merged 9 commits into vllm-project:main
Conversation
Code Review
This pull request introduces logic to force the Triton backend for unquantized MoE when LoRA is enabled, and adds a corresponding test case. Feedback indicates that the initial early-return implementation is problematic: it bypasses support for BATCHED_TRITON (required for models like DeepSeek-V3), skips backend-selection logging, and overrides user-specified backend preferences. The suggestion was to filter the available backends instead of returning early, as sketched below.
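A minimal sketch of that filtering suggestion, with illustrative names only (the enum values and helper below are stand-ins, not vLLM's actual identifiers): rather than short-circuiting to Triton, the selector drops LoRA-incompatible candidates, so BATCHED_TRITON stays reachable and explicit user choices can still be validated and logged.

```python
# Illustrative sketch only: names below are stand-ins, not vLLM's real API.
from enum import Enum


class UnquantizedMoeBackend(Enum):
    FLASHINFER_CUTLASS = "flashinfer_cutlass"
    TRITON = "triton"
    BATCHED_TRITON = "batched_triton"


# Backends assumed LoRA-compatible for the purpose of this sketch.
LORA_COMPATIBLE = {UnquantizedMoeBackend.TRITON, UnquantizedMoeBackend.BATCHED_TRITON}


def filter_backends_for_lora(
    candidates: list[UnquantizedMoeBackend], is_lora_enabled: bool
) -> list[UnquantizedMoeBackend]:
    """Drop LoRA-incompatible backends rather than returning Triton early."""
    if not is_lora_enabled:
        return candidates
    return [b for b in candidates if b in LORA_COMPATIBLE]
```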
Force-pushed from 9056200 to 0168ea9
Signed-off-by: Daniel Serebrenik <daserebrenik@nvidia.com>
Force-pushed from 4f5f7fe to d57b541
Signed-off-by: Daniel Serebrenik <daserebrenik@nvidia.com>
Force-pushed from 51f70ad to 4c0c318
Should keep ROCm behavior unchanged. Also update tests.
Signed-off-by: Daniel Serebrenik <daserebrenik@nvidia.com>
Force-pushed from 2c410cb to cdf3444
Signed-off-by: Daniel Serebrenik <daserebrenik@nvidia.com>
tomeras91
left a comment
Thanks @danisereb!
Added a few suggestions
- Use Triton for both CUDA and ROCm (aligned with select_fp8_moe_backend).
- Update tests accordingly.
Signed-off-by: Daniel Serebrenik <daserebrenik@nvidia.com>
Force-pushed from 82f0ef3 to 6b852cd
Signed-off-by: Daniel Serebrenik <daserebrenik@nvidia.com>
if current_platform.is_out_of_tree():
    return UnquantizedMoeBackend.OOT, None

if moe_config.is_lora_enabled:
This logic is now aligned with select_fp8_moe_backend (early exit if LoRA is enabled).
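For context, a self-contained sketch of the early-exit shape the PR converged on; the function, enum values, and config fields below are simplified stand-ins that approximate the diff above, not verbatim vLLM source.

```python
# Approximate shape of the selection logic (not verbatim vLLM source).
from dataclasses import dataclass
from enum import Enum


class UnquantizedMoeBackend(Enum):
    OOT = "oot"
    TRITON = "triton"
    FLASHINFER_CUTLASS = "flashinfer_cutlass"


@dataclass
class MoeConfig:
    is_lora_enabled: bool = False


def select_unquantized_moe_backend(moe_config: MoeConfig, is_oot_platform: bool):
    if is_oot_platform:
        return UnquantizedMoeBackend.OOT, None
    # FlashInfer CUTLASS lacks LoRA support (moe_sum), so force Triton on
    # both CUDA and ROCm whenever LoRA is enabled, mirroring the early exit
    # in select_fp8_moe_backend.
    if moe_config.is_lora_enabled:
        return UnquantizedMoeBackend.TRITON, None
    return UnquantizedMoeBackend.FLASHINFER_CUTLASS, None
```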
tomeras91
left a comment
Much better now.
Left a small nit
@skipif_not_cuda_rocm
def test_select_explicit_triton_ignores_flashinfer_env(monkeypatch):
nit: This test can run on all platforms; nothing about it is CUDA/ROCm-specific.
Wasn't sure about XPUs.
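To illustrate the nit, a self-contained toy version of the test's shape: the env var and selector here are placeholders, not vLLM's real identifiers (the actual test lives in tests/kernels/moe/test_unquantized_backend_selection.py), and nothing in it requires CUDA, ROCm, or an XPU.

```python
# Toy illustration: placeholder env var and selector, not vLLM's real ones.
import os
from typing import Optional


def _toy_select_backend(requested: Optional[str]) -> str:
    """Toy selector: an explicit user choice wins over the env toggle."""
    if requested is not None:
        return requested
    if os.environ.get("TOY_USE_FLASHINFER") == "1":  # placeholder env var
        return "flashinfer_cutlass"
    return "triton"


def test_select_explicit_triton_ignores_flashinfer_env(monkeypatch):
    # Even with the FlashInfer toggle set, an explicit Triton request wins.
    monkeypatch.setenv("TOY_USE_FLASHINFER", "1")
    assert _toy_select_backend("triton") == "triton"
```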
Revert "Fix MoE backend selection for LoRA (unquantized MoE) (vllm-project#40273)"
This reverts commit d1135a5.
Purpose

When using LoRA adapters with Nemotron Nano BF16
(https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16),
an error was raised.

In previous vLLM versions the default backend for unquantized MoE was `TritonExperts`. The new default backend, FlashInfer CUTLASS, does not support LoRA (see the `FlashInferExperts` class, `moe_sum` function). This PR selects `TritonExperts` when LoRA is enabled, aligning with `select_fp8_moe_backend`, `select_mxfp8_moe_backend`, and `select_gpt_oss_mxfp4_moe_backend`.
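As a usage illustration, a sketch of the repro/verification path using vLLM's offline LoRA API; the adapter path and name are hypothetical, and with this PR enabling LoRA should route the unquantized MoE to `TritonExperts`.

```python
# Sketch of the repro/verification path; the adapter path is hypothetical.
from vllm import LLM
from vllm.lora.request import LoRARequest

llm = LLM(
    model="nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16",
    enable_lora=True,        # triggers the LoRA-aware MoE backend selection
    tensor_parallel_size=2,  # TP1/2/4 were exercised in the PR's testing
)
outputs = llm.generate(
    ["Hello from a LoRA adapter"],
    lora_request=LoRARequest("demo_adapter", 1, "/path/to/adapter"),  # hypothetical
)
print(outputs[0].outputs[0].text)
```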
Test Plan

Add new test(s) in `tests/kernels/moe/test_unquantized_backend_selection.py`:

pytest tests/kernels/moe/test_unquantized_backend_selection.py

Check that LoRA works with Nemotron Nano BF16.

Test Result

All tests in `test_unquantized_backend_selection.py` passed.
LoRA adapters now work with Nemotron Nano BF16 (TP1/2/4).
When running without LoRA adapters, the expected default backend is selected.