[torch.compile] Add torch inductor pass for fusing silu_and_mul with subsequent scaled_fp8_quant operations#10867
Conversation
👋 Hi! Thank you for contributing to the vLLM project. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.
Signed-off-by: Sage Moore <sage@neuralmagic.com>
27be0bd to e2fda7f
Signed-off-by: Sage Moore <sage@neuralmagic.com>
tlrmchlsmth left a comment
Focused on csrc/quantization/activation_kernels.cu. Spotted a couple of potential int32_t overflows.
This pull request has merge conflicts that must be resolved before it can be merged.
…silu-mul-quant Signed-off-by: Sage Moore <sage@neuralmagic.com>
Because patterns can only be registered once, the pass is a singleton. This will be addressed in a future version of PyTorch: https://github.com/pytorch/pytorch/pull/139321#issuecomment-2452354980
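The singleton constraint described in that comment can be sketched in plain Python. This is an illustrative stand-in, not vLLM's actual pass class: the `_register_patterns` body here just counts registrations, where the real pass would call into torch's pattern-registration machinery, which at the time could only be invoked once per process.

```python
class FusionPass:
    """Hypothetical sketch: because patterns could only be registered once,
    the pass is kept as a process-wide singleton, so constructing it again
    never re-registers its patterns."""

    _instance = None

    def __new__(cls):
        if cls._instance is None:
            inst = super().__new__(cls)
            inst.registrations = 0
            inst._register_patterns()
            cls._instance = inst
        return cls._instance

    def _register_patterns(self):
        # Stand-in for the one-time pattern-registration call; counting
        # lets us verify it runs exactly once regardless of how many
        # times the pass is constructed.
        self.registrations += 1
```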
This should have been fixed in pytorch/pytorch#139321 (@eellison), and yes, that's in PyTorch 2.7.0.
Nice! In that case, @SageMoore could you clean this up before landing?
…silu-mul-quant Signed-off-by: Sage Moore <sage@neuralmagic.com>
Here are lm_eval results for
tlrmchlsmth left a comment
Still LGTM, and thanks for cleaning up that last piece!
A follow-up question: are we planning on doing the dynamic pathway?
…subsequent scaled_fp8_quant operations (vllm-project#10867) Signed-off-by: Sage Moore <sage@neuralmagic.com>
Credit to @LucasWilkinson for the kernel.
This pass currently only supports static per-tensor quantization. Other quantization schemes will be added in subsequent PRs.
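To make the fusion concrete, here is a rough pure-Python sketch of the two operations the pass fuses; the function names, row layout, and scale value are illustrative, not vLLM's API. `silu_and_mul` applies SiLU to the first half of each row and multiplies it elementwise by the second half; static per-tensor fp8 quantization then divides by a precomputed scale and clamps to the e4m3 range. The pass rewrites this two-step graph into a single fused kernel, avoiding a round trip of the intermediate activation through global memory.

```python
import math

FP8_E4M3_MAX = 448.0  # largest finite value of the fp8 e4m3 format

def silu_and_mul(x, d):
    # Input row has 2*d elements: SiLU(first half) * second half.
    return [x[i] / (1.0 + math.exp(-x[i])) * x[d + i] for i in range(d)]

def static_fp8_quant(vals, scale):
    # Static per-tensor quantization: divide by a precomputed scale and
    # clamp to the fp8 representable range (rounding omitted in this sketch).
    return [max(-FP8_E4M3_MAX, min(FP8_E4M3_MAX, v / scale)) for v in vals]

# Unfused form: two passes over the data, with silu_and_mul's output
# materialized in between. The fused kernel does both in one pass.
row = [1.0, -2.0, 3.0, 0.5, 4.0, 2.0]  # 2*d elements with d = 3
out = static_fp8_quant(silu_and_mul(row, 3), scale=0.1)
```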
I've attached some QPS sweeps that were run using neuralmagic/Meta-Llama-3.1-8B-Instruct-FP8 on an H100. Generally speaking, this pass improves the TPOT of FP8 Llama by 2-3%. There are similar improvements in TTFT, with the exception of 20 QPS, which is much (~2x) faster.
fused_results
torch_compile_results