[Bugfix] fix qwen3-omni performance regression #3575
VLLM_USE_FLASHINFER_MOE_FP16 seems to default to 0. However, the code also checks whether the VLLM_USE_FLASHINFER_MOE_FP16 environment variable is set at all.
Would the regression occur for thinker-only when
regression for both thinker & talker when |
Even if the environment variable VLLM_USE_FLASHINFER_MOE_FP16 is not set, the FlashInfer backend will still be used.
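The point made in the comments above, that a default value of 0 differs from the variable being explicitly set, can be illustrated with a minimal sketch. This is not vLLM's actual selection code: the function `pick_backend_sketch`, its `flashinfer_available` parameter, and the returned labels are hypothetical and only model the behaviour described in this thread.

```python
ENV_NAME = "VLLM_USE_FLASHINFER_MOE_FP16"


def pick_backend_sketch(environ, flashinfer_available):
    """Hypothetical selection logic; NOT vLLM's actual implementation."""
    if ENV_NAME in environ:
        # Explicitly set: honour the user's choice.
        return "flashinfer_cutlass" if environ[ENV_NAME] == "1" else "triton"
    # Not set: assumed auto-selection, which may still pick FlashInfer.
    return "flashinfer_cutlass" if flashinfer_available else "triton"


# Unset variable: FlashInfer can still be chosen even though the default is "0".
unset_choice = pick_backend_sketch({}, flashinfer_available=True)
# Explicit "0": Triton is used.
explicit_choice = pick_backend_sketch({ENV_NAME: "0"}, flashinfer_available=True)
```

Under these assumptions, only an explicit `VLLM_USE_FLASHINFER_MOE_FP16=0` guarantees the Triton path.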
ZJY0516
left a comment
Please add a TODO comment to remove this later. It's a vLLM MoE performance regression.
…e regression Signed-off-by: rein yang <ruiruyang2@gmail.com>
DONE
Reopening because the regression still exists on the main branch; the environment setting was only introduced in CI.
Signed-off-by: rein yang <ruiruyang2@gmail.com> Co-authored-by: Hongsheng Liu <liuhongsheng4@huawei.com>
Signed-off-by: rein yang <ruiruyang2@gmail.com> Co-authored-by: Hongsheng Liu <liuhongsheng4@huawei.com> Signed-off-by: akshatvishu <akshatnayak197@gmail.com>
To avoid performance degradation with the FlashInfer CUTLASS unquantized MoE backend, Qwen3-Omni sets VLLM_USE_FLASHINFER_MOE_FP16=0 by default, which selects the Triton unquantized MoE backend.
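One way such a default could be applied is sketched below. The helper name `apply_qwen3_omni_moe_default` is hypothetical, and the choice of `setdefault` (so that an explicit user setting still wins) is an assumption, not necessarily what this PR implements.

```python
import os

ENV_NAME = "VLLM_USE_FLASHINFER_MOE_FP16"


def apply_qwen3_omni_moe_default(environ=None):
    """Hypothetical helper: pin the Triton path unless the user chose otherwise."""
    if environ is None:
        environ = os.environ
    # TODO: remove once the upstream vLLM MoE performance regression is fixed.
    # setdefault only writes "0" when the variable is not already set.
    environ.setdefault(ENV_NAME, "0")
    return environ[ENV_NAME]
```

In this sketch, a user who exports `VLLM_USE_FLASHINFER_MOE_FP16=1` before starting the server keeps the FlashInfer backend, while everyone else gets Triton.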
Purpose
Resolves #3556.

Before: a non-async-chunk Qwen3-Omni server started with the FlashInfer CUTLASS unquantized MoE backend.
Now: a non-async-chunk Qwen3-Omni server starts with the Triton unquantized MoE backend.
TTFP drops from 46525.73 ms to 29900.31 ms on the same benchmark.
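For reference, the relative improvement implied by the two TTFP figures above can be worked out with a few lines of arithmetic. The variable names are illustrative only; the two inputs are the values reported in this PR.

```python
# TTFP values reported in this PR, in milliseconds.
ttfp_before_ms = 46525.73  # FlashInfer CUTLASS unquantized MoE backend
ttfp_after_ms = 29900.31   # Triton unquantized MoE backend

saved_ms = ttfp_before_ms - ttfp_after_ms
reduction_pct = 100.0 * saved_ms / ttfp_before_ms
speedup = ttfp_before_ms / ttfp_after_ms

print(f"saved {saved_ms:.2f} ms, {reduction_pct:.1f}% lower TTFP, {speedup:.2f}x faster")
```

That is a reduction of roughly 35.7%, or about a 1.56x speedup, on this single benchmark run.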
Test Plan
Test Result
@ZeldaHuang @amy-why-3459 @tzhouam @Gaohan123 @hsliuustc0106