[torch.compile] Enable attention and allreduce fusion without custom ops enabled #24604
Conversation
The reason this is needed is that it lets us do fusion without having to enable custom ops (-O.custom_ops=["+quant_fp8"]). Enabling custom ops costs performance, as demonstrated in the PR description, because there are 4 quant ops per layer, one per matmul. I agree this is a somewhat fragile approach. I would be happy to work on a "lowering" approach where we preserve the high-level structure of ops until later. The downside is that it would require more work (I think), and we might lose access to optimizations that currently happen before our passes. But I think it wouldn't hurt Inductor in general to have a more explicit sense of converting between higher-level and lower-level representations (or we could just move where our custom passes happen). We can tie this work into the "autotuning custom op implementations" effort, as done in pytorch/pytorch#164212.
As discussed offline, we are going to proceed by merging this PR. After PTC, we will move our custom op matching passes to
View/slice no-op eliminations were upstreamed to PyTorch, so I'm wondering if that is sufficient: pytorch/pytorch#151095, pytorch/pytorch#151175
@BoyuanFeng wouldn't that run after |
Purpose
This PR enables matching the torch implementations of the custom ops QuantFP8 and RMSNorm. On main, fusion currently requires enabling the custom ops, but they are slower than their torch counterparts, which reduces the benefit of the custom fusion passes. We add a set of "matcher util" objects that can be called inside patterns and automatically get traced to the same fx nodes as the custom op they correspond to, in both its enabled and disabled form.
This PR also adds debugging utilities and E2E fusion tests that verify fusions happen in full models, not just in unit tests.
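A minimal, self-contained sketch of the matcher-util idea (all names here, e.g. MatcherRMSNorm and the demo::rms_norm op, are illustrative placeholders rather than the helpers actually added by this PR): the util emits a single custom-op node when the custom op is enabled, and the decomposed torch ops of the native implementation when it is disabled, so one fusion pattern written against it can match both forms.

```python
import torch
from torch.library import custom_op


# Toy custom op standing in for a fused kernel (hypothetical, for illustration only).
@custom_op("demo::rms_norm", mutates_args=())
def demo_rms_norm(x: torch.Tensor, weight: torch.Tensor, eps: float) -> torch.Tensor:
    var = x.float().pow(2).mean(dim=-1, keepdim=True)
    return (x.float() * torch.rsqrt(var + eps)).to(x.dtype) * weight


@demo_rms_norm.register_fake
def _(x, weight, eps):
    return torch.empty_like(x)


class MatcherRMSNorm:
    """Callable used inside fusion patterns; traces to whichever form is active."""

    def __init__(self, eps: float, custom_op_enabled: bool):
        self.eps = eps
        self.custom_op_enabled = custom_op_enabled

    def __call__(self, x: torch.Tensor, weight: torch.Tensor) -> torch.Tensor:
        if self.custom_op_enabled:
            # Produces a single custom-op node in the traced graph.
            return torch.ops.demo.rms_norm(x, weight, self.eps)
        # Produces the same decomposed ops as the native torch implementation.
        var = x.float().pow(2).mean(dim=-1, keepdim=True)
        return (x.float() * torch.rsqrt(var + self.eps)).to(x.dtype) * weight
```

A fusion pattern (e.g. RMSNorm followed by FP8 quant) can then be written once in terms of such a util and registered with the pattern matcher; it matches the graph regardless of whether the custom op is turned on.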
Test Plan
Unit tests, plus newly added fusion E2E tests.
Test Result
All tests pass.
Performance numbers
Below are B200 numbers (with flashinfer) from vllm bench serve. We test the following regimes, each adding the corresponding arguments to the base serve command:
- none: -O.custom_ops='["none"]' -O.pass_config={"enable_fi_allreduce_fusion":false,"enable_attn_fusion":false,"enable_noop":true}
- none_fusion_attention: -O.custom_ops='["none"]' -O.pass_config={"enable_fi_allreduce_fusion":false,"enable_attn_fusion":true,"enable_noop":true}
- none_fusion_attention_allreduce: -O.custom_ops='["none"]' -O.pass_config={"enable_fi_allreduce_fusion":true,"enable_attn_fusion":true,"enable_noop":true}
- rms_quant: -O.custom_ops='["none", "+quant_fp8", "+rms_norm"]' -O.pass_config={"enable_fi_allreduce_fusion":false,"enable_attn_fusion":false,"enable_noop":true}
- rms_quant_fusion_attention: -O.custom_ops='["none", "+quant_fp8", "+rms_norm"]' -O.pass_config={"enable_fi_allreduce_fusion":false,"enable_attn_fusion":true,"enable_noop":true}
- rms_quant_fusion_attention_allreduce: -O.custom_ops='["none", "+quant_fp8", "+rms_norm"]' -O.pass_config={"enable_fi_allreduce_fusion":true,"enable_attn_fusion":true,"enable_noop":true}
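For reference, the none_fusion_attention_allreduce regime expressed through the offline LLM API might look roughly like the sketch below. This is a hedged illustration based on the flags above, not the exact invocation used for the benchmarks, and the compilation-config schema may differ between vLLM versions.

```python
# Hypothetical sketch: the none_fusion_attention_allreduce regime configured via
# the offline API instead of CLI -O flags. Keys mirror the flags listed above.
from vllm import LLM

llm = LLM(
    model="redhatai/meta-llama-3.1-70B-Instruct-FP8",
    tensor_parallel_size=4,
    compilation_config={
        "custom_ops": ["none"],
        "pass_config": {
            "enable_fi_allreduce_fusion": True,
            "enable_attn_fusion": True,
            "enable_noop": True,
        },
    },
)
```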
Regimes 2 (none_fusion_attention) and 3 (none_fusion_attention_allreduce) are newly possible with this PR. On main, results are similar, except those two are worse because fusion cannot happen without custom ops enabled.
redhatai/meta-llama-3.1-70B-Instruct-FP8 (TP=1): Past QPS=10 the server is overloaded, so latency spikes and becomes much more variable. Also note that allreduce fusion is a no-op for TP=1.
[Charts: TTFT / TPOT / ITL median (ms) vs. QPS]
redhatai/meta-llama-3.1-70B-Instruct-FP8 (TP=4): Note that allreduce fusion reduces TPOT at low QPS but increases it at high QPS, and it increases TTFT across the board; this will be addressed in #24248 and #24252.
[Charts: TTFT / TPOT / ITL median (ms) vs. QPS]