[torch.compile] Add torch inductor pass for fusing silu_and_mul with subsequent scaled_fp8_quant operations#10867
Conversation
👋 Hi! Thank you for contributing to the vLLM project. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.
Signed-off-by: Sage Moore <sage@neuralmagic.com>
27be0bd to e2fda7f
Signed-off-by: Sage Moore <sage@neuralmagic.com>
tlrmchlsmth left a comment
Focused on csrc/quantization/activation_kernels.cu. Spotted a couple of potential int32_t overflows.
This pull request has merge conflicts that must be resolved before it can be merged.
…silu-mul-quant Signed-off-by: Sage Moore <sage@neuralmagic.com>
Because patterns can only be registered once, the pass is a singleton. This will be addressed in a future version of PyTorch: https://github.com/pytorch/pytorch/pull/139321#issuecomment-2452354980
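The singleton constraint described in that comment can be sketched in plain Python. This is an illustrative stand-in, not vLLM's actual pass class: the `_register_patterns` body here just counts registrations, where the real pass would call into torch's pattern-registration machinery, which at the time could only be invoked once per process.

```python
class FusionPass:
    """Hypothetical sketch: because patterns could only be registered once,
    the pass is kept as a process-wide singleton, so constructing it again
    never re-registers its patterns."""

    _instance = None

    def __new__(cls):
        if cls._instance is None:
            inst = super().__new__(cls)
            inst.registrations = 0
            inst._register_patterns()
            cls._instance = inst
        return cls._instance

    def _register_patterns(self):
        # Stand-in for the one-time pattern-registration call; counting
        # lets us verify it runs exactly once regardless of how many
        # times the pass is constructed.
        self.registrations += 1
```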
This should have been fixed in pytorch/pytorch#139321 (@eellison), and yes, that's in PyTorch 2.7.0.
Nice! In that case, @SageMoore could you clean this up before landing?
…silu-mul-quant Signed-off-by: Sage Moore <sage@neuralmagic.com>
Here are lm_eval results for
tlrmchlsmth left a comment
Still LGTM, and thanks for cleaning up that last piece!
A follow-up question: are we planning on doing the dynamic pathway?
…subsequent scaled_fp8_quant operations (vllm-project#10867) Signed-off-by: Sage Moore <sage@neuralmagic.com>
Credit to @LucasWilkinson for the kernel.
This pass currently only supports static per-tensor quantization. Other quantization schemes will be added in subsequent PRs.
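To make the fusion concrete, here is a rough pure-Python sketch of the two operations the pass fuses; the function names, row layout, and scale value are illustrative, not vLLM's API. `silu_and_mul` applies SiLU to the first half of each row and multiplies it elementwise by the second half; static per-tensor fp8 quantization then divides by a precomputed scale and clamps to the e4m3 range. The pass rewrites this two-step graph into a single fused kernel, avoiding a round trip of the intermediate activation through global memory.

```python
import math

FP8_E4M3_MAX = 448.0  # largest finite value of the fp8 e4m3 format

def silu_and_mul(x, d):
    # Input row has 2*d elements: SiLU(first half) * second half.
    return [x[i] / (1.0 + math.exp(-x[i])) * x[d + i] for i in range(d)]

def static_fp8_quant(vals, scale):
    # Static per-tensor quantization: divide by a precomputed scale and
    # clamp to the fp8 representable range (rounding omitted in this sketch).
    return [max(-FP8_E4M3_MAX, min(FP8_E4M3_MAX, v / scale)) for v in vals]

# Unfused form: two passes over the data, with silu_and_mul's output
# materialized in between. The fused kernel does both in one pass.
row = [1.0, -2.0, 3.0, 0.5, 4.0, 2.0]  # 2*d elements with d = 3
out = static_fp8_quant(silu_and_mul(row, 3), scale=0.1)
```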
I've attached some QPS sweeps that were run using neuralmagic/Meta-Llama-3.1-8B-Instruct-FP8 on an H100. Generally speaking, this pass improves the TPOT of FP8 Llama by 2-3%. There are similar improvements in TTFT, with the exception of 20 QPS, which is much (~2x) faster.
fused_results
torch_compile_results