[LoRA] Set default MXFP4 LoRA backend to Marlin #30598
vllm-bot merged 2 commits into vllm-project:main
Conversation
Signed-off-by: Xin Yang <xyangx@amazon.com>
Code Review
This pull request changes the default MXFP4 backend for LoRA from Triton to Marlin, citing better performance and accuracy. The logic is updated to select Marlin by default, and only fall back to Triton if VLLM_MXFP4_USE_MARLIN is explicitly set to 0 and Triton kernels are supported. The change is correct, well-scoped to LoRA as per the PR title, and aligns with the stated purpose. The implementation is clear and concise.
yewentao256
left a comment
Please add metrics in the PR description to show the acc / perf issue you mentioned.
- `lm_eval` for acc
- `vllm bench ...` for performance
jeejeelee
left a comment
After addressing @yewentao256's comment, LGTM
@yewentao256 Thanks for review!
Let's fix the issue first then. Do you have time to take a deep look into this issue?
Signed-off-by: Xin Yang <xyangx@amazon.com> Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk>
Purpose
This PR sets the default MXFP4 LoRA backend to Marlin, because Triton has accuracy issues and Marlin has slightly better performance.
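A minimal sketch of the intended selection behavior (the function name and its argument here are hypothetical, for illustration only; the actual vLLM code differs):

```python
import os


def use_marlin_for_mxfp4_lora(triton_kernels_supported: bool) -> bool:
    """Hypothetical sketch: prefer Marlin by default; fall back to Triton
    only when the user explicitly sets VLLM_MXFP4_USE_MARLIN=0 AND the
    triton_kernels path is actually supported."""
    flag = os.environ.get("VLLM_MXFP4_USE_MARLIN")
    if flag == "0" and triton_kernels_supported:
        # Explicit opt-out into the Triton backend.
        return False
    # Flag unset, flag=1, or triton_kernels unsupported: use Marlin.
    return True
```

This mirrors the description above: Marlin is chosen in every case except the one explicit opt-out.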
Triton is used only when `VLLM_MXFP4_USE_MARLIN=0` is set explicitly and `triton_kernels` is supported. Marlin is used when:
- `VLLM_MXFP4_USE_MARLIN` is not set
- `VLLM_MXFP4_USE_MARLIN=1`
- `triton_kernels` is not supported

Benchmarking
Marlin:
Triton:
Marlin is slightly better because in MXFP4 the Triton LoRA path is implemented in `UnfusedOAITritonExperts`. It has to unfuse the activation and reduction in order to inject LoRA modules, so it loses `triton_kernels`'s optimizations of fused activation and fused `moe_sum`.

Essential Elements of an Effective PR Description Checklist
- `supported_models.md` and `examples` for a new model.