[vLLM IR] 3/N fused_add_rms_norm and maybe_inplace#36823

Draft
ProExpertProg wants to merge 8 commits into luka/vllm-ir/rms-norm-batch-invariant from luka/vllm-ir/rms-norm-inplace

Conversation

@ProExpertProg
Collaborator

@ProExpertProg ProExpertProg commented Mar 11, 2026

Purpose

Test Plan

Test Result


Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

@mergify mergify bot added the nvidia label Mar 11, 2026
@mergify
Contributor

mergify bot commented Mar 11, 2026

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @ProExpertProg.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Mar 11, 2026
@ProExpertProg ProExpertProg force-pushed the luka/vllm-ir/rms-norm-inplace branch from 4b47060 to 837d6f3 Compare March 11, 2026 21:38
@mergify mergify bot removed the needs-rebase label Mar 11, 2026
Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a maybe_inplace mechanism to the vLLM IR, allowing for performance optimizations through in-place operations while maintaining functional semantics for default op calls. It also adds a new fused_add_rms_norm op that leverages this new capability. The changes are extensive, touching the IR definition, compiler passes, and kernel implementations. My review has identified a critical safety issue where a potential misuse of an in-place operation only triggers a warning instead of an error, and a high-severity issue related to an incomplete compiler pass (CloneCleanupPass) that is being added.
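The maybe_inplace mechanism described above can be sketched in plain Python. This is an illustrative sketch only: `register_op`, `choose_overload`, and the toy kernels are hypothetical names, not vLLM's actual API.

```python
# Illustrative sketch only: register_op/choose_overload and the toy
# kernels below are hypothetical stand-ins, not vLLM's actual API.
OPS = {}

def register_op(name, default, inplace=None):
    # Each op pairs a functional default overload with an optional
    # in-place variant.
    OPS[name] = {"default": default, "inplace": inplace}

def choose_overload(name, input_use_count):
    # A compiler pass may substitute the in-place variant, but only
    # when the input has no other users; otherwise the functional
    # default overload keeps its value semantics.
    op = OPS[name]
    if op["inplace"] is not None and input_use_count == 1:
        return op["inplace"]
    return op["default"]

def scale_half(x):
    # Toy functional kernel: returns a new list, input untouched.
    return [v * 0.5 for v in x]

def scale_half_(x):
    # Toy in-place kernel: overwrites x and returns it.
    for i, v in enumerate(x):
        x[i] = v * 0.5
    return x

register_op("scale_half", default=scale_half, inplace=scale_half_)

# Input still has another user -> functional overload is chosen.
assert choose_overload("scale_half", input_use_count=2) is scale_half
# Input is dead after this op -> safe to mutate in place.
assert choose_overload("scale_half", input_use_count=1) is scale_half_
```

The key design point is that the default call site never observes mutation; only the compiler, after proving the input is dead, opts into the in-place variant.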

Comment on lines +69 to +73

```python
logger.warning(
    "Node %s (input to %s) has another use", arg, node
)
# TODO raise error, this is undefined behavior, which should not be allowed.
# Users can just use the default overload if they want to keep activation inputs untouched.
```

critical

The check for other users of an activation input that is about to be modified in-place currently only logs a warning. This can lead to silent correctness issues if the input tensor is used elsewhere after being modified. As the TODO comment suggests, this should raise an error to prevent such undefined behavior. Allowing compilation to proceed with a warning could introduce hard-to-debug errors downstream.

Suggested change

```diff
-logger.warning(
-    "Node %s (input to %s) has another use", arg, node
-)
-# TODO raise error, this is undefined behavior, which should not be allowed.
-# Users can just use the default overload if they want to keep activation inputs untouched.
+raise ValueError(
+    f"Node {arg} (input to {node}) has another use. "
+    "Using maybe_inplace on an input with multiple users is not allowed. "
+    "Use the default overload if you want to keep activation inputs untouched."
+)
```
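To illustrate the hazard this comment describes, here is a minimal plain-Python sketch (lists standing in for tensors; the function names are hypothetical, not vLLM's): an in-place overload silently corrupts a second consumer of the same storage.

```python
import math

def rms_norm(x, eps=1e-6):
    # Functional overload: returns a new list; x is untouched.
    scale = 1.0 / math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v * scale for v in x]

def rms_norm_inplace(x, eps=1e-6):
    # maybe_inplace-style overload: overwrites x. Correct only if no
    # other consumer still needs the original values of x.
    scale = 1.0 / math.sqrt(sum(v * v for v in x) / len(x) + eps)
    for i in range(len(x)):
        x[i] *= scale
    return x

x = [3.0, 4.0]
original = x          # a second "user" holding the same storage
y = rms_norm_inplace(x)
# The second user now silently sees normalized values rather than the
# original activations -- the undefined behavior the review flags.
assert original is y
assert original != [3.0, 4.0]
```

With a single user, the mutation is unobservable and saves an allocation; with two, the only safe choices are cloning first or raising, which is why a hard error is preferable to a warning here.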

Comment on lines +169 to +172

```python
continue  # TODO
node.replace_all_uses_with(node.args[0])
graph.erase_node(node)
count += 1
```

high

The CloneCleanupPass is added to the pass manager but its implementation is currently a no-op due to the continue # TODO statement. The logic to remove clone nodes is commented out. Merging incomplete or placeholder code can lead to confusion and makes it unclear if the feature is intended to be active. The pass should either be fully implemented or removed from the pass manager until it's ready.

Suggested change

```diff
-continue  # TODO
-node.replace_all_uses_with(node.args[0])
-graph.erase_node(node)
-count += 1
+# A clone is safe to remove if its input has no other users.
+# This is a conservative check. A more sophisticated analysis
+# could trace back to the `maybe_inplace` call and its metadata.
+if len(node.args[0].users) == 1:
+    node.replace_all_uses_with(node.args[0])
+    graph.erase_node(node)
+    count += 1
```
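The suggested single-user check can be exercised with a toy node graph standing in for torch.fx. This is an illustrative sketch; `Node`, `make_node`, and `clone_cleanup` are hypothetical stand-ins, not vLLM's or torch.fx's classes.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    op: str
    args: tuple = ()
    users: list = field(default_factory=list)

def make_node(op, *args):
    node = Node(op, args)
    for a in args:
        a.users.append(node)
    return node

def clone_cleanup(nodes):
    """Erase clone nodes whose input has no other users."""
    removed = 0
    for node in list(nodes):
        if node.op != "clone":
            continue
        src = node.args[0]
        if len(src.users) != 1:  # conservative: input used elsewhere
            continue
        # Redirect every use of the clone to its input, then erase it.
        for user in node.users:
            user.args = tuple(src if a is node else a for a in user.args)
            src.users.append(user)
        src.users.remove(node)
        nodes.remove(node)
        removed += 1
    return removed

x = Node("placeholder")
c = make_node("clone", x)
y = make_node("rms_norm", c)
removed = clone_cleanup([x, c, y])
assert removed == 1 and y.args == (x,)
```

When the clone's input has a second consumer, the pass leaves the clone in place, trading a redundant copy for guaranteed correctness.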

@ProExpertProg ProExpertProg force-pushed the luka/vllm-ir/rms-norm-inplace branch from 837d6f3 to d5e968e Compare March 11, 2026 21:44
@ProExpertProg ProExpertProg added the torch.compile and vllm-ir labels Mar 11, 2026
@ProExpertProg ProExpertProg changed the title Draft [vLLM IR] 3/N fused_add_rms_norm and maybe_inplace [vLLM IR] 3/N fused_add_rms_norm and maybe_inplace Mar 11, 2026
@ProExpertProg ProExpertProg force-pushed the luka/vllm-ir/rms-norm-batch-invariant branch from 8041106 to d8fe95a Compare March 12, 2026 10:02
@ProExpertProg ProExpertProg force-pushed the luka/vllm-ir/rms-norm-inplace branch from d5e968e to 1939b89 Compare March 12, 2026 17:23
@ProExpertProg ProExpertProg force-pushed the luka/vllm-ir/rms-norm-batch-invariant branch from d8fe95a to 810b9f3 Compare March 12, 2026 19:42
@mergify
Contributor

mergify bot commented Mar 12, 2026

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @ProExpertProg.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Mar 12, 2026
@ProExpertProg ProExpertProg force-pushed the luka/vllm-ir/rms-norm-batch-invariant branch 3 times, most recently from 4868432 to fefd5b0 Compare March 20, 2026 13:58
@ProExpertProg ProExpertProg force-pushed the luka/vllm-ir/rms-norm-batch-invariant branch from fefd5b0 to 81c6989 Compare March 31, 2026 19:39
@ProExpertProg ProExpertProg force-pushed the luka/vllm-ir/rms-norm-inplace branch from 1939b89 to 2725b81 Compare March 31, 2026 22:03
@mergify mergify bot removed the needs-rebase label Mar 31, 2026
@ProExpertProg
Collaborator Author

ProExpertProg commented Mar 31, 2026

Current status of perf:

```
$ vllm bench latency --model=Qwen/Qwen3-0.6B
Avg latency: 0.19557336465998862 seconds
10% percentile latency: 0.1915025576017797 seconds
25% percentile latency: 0.19555562996538356 seconds
50% percentile latency: 0.19578603247646242 seconds
75% percentile latency: 0.1963934577361215 seconds
90% percentile latency: 0.19710877431789414 seconds
99% percentile latency: 0.19903717429377138 seconds

$ vllm bench latency --model=Qwen/Qwen3-0.6B -cc.custom_ops+=+rms_norm --ir-op-priority.rms_norm=vllm_c --ir-op-priority.fused_add_rms_norm=vllm_c
Avg latency: 0.21115826283348724 seconds
10% percentile latency: 0.2069199254270643 seconds
25% percentile latency: 0.21143249873421155 seconds
50% percentile latency: 0.21161350194597617 seconds
75% percentile latency: 0.21171232018969022 seconds
90% percentile latency: 0.21274942716117948 seconds
99% percentile latency: 0.21377979715238327 seconds

$ vllm bench latency --model=Qwen/Qwen3-0.6B --enforce-eager
Avg latency: 1.0280421572815006 seconds
10% percentile latency: 1.013216928683687 seconds
25% percentile latency: 1.014344347815495 seconds
50% percentile latency: 1.019462305586785 seconds
75% percentile latency: 1.0264160812075716 seconds
90% percentile latency: 1.036091740813572 seconds
99% percentile latency: 1.1814956719905605 seconds

$ vllm bench latency --model=Qwen/Qwen3-0.6B -cc.cudagraph_mode=NONE
Avg latency: 1.1771849144715816 seconds
10% percentile latency: 1.1685888864682057 seconds
25% percentile latency: 1.1711883139505517 seconds
50% percentile latency: 1.1759423059993424 seconds
75% percentile latency: 1.1811810505460016 seconds
90% percentile latency: 1.1855081893503665 seconds
99% percentile latency: 1.204472551472718 seconds
```

Main

```
$ vllm bench latency --model=Qwen/Qwen3-0.6B
Avg latency: 0.1961628031722891 seconds
10% percentile latency: 0.19574561258777975 seconds
25% percentile latency: 0.1958314114890527 seconds
50% percentile latency: 0.19613034051144496 seconds
75% percentile latency: 0.19636730520869605 seconds
90% percentile latency: 0.19754589340882375 seconds
99% percentile latency: 0.19999273694236763 seconds

$ vllm bench latency --model=Qwen/Qwen3-0.6B -cc.custom_ops+=+rms_norm
Avg latency: 0.21067771341186017 seconds
10% percentile latency: 0.20676958258263767 seconds
25% percentile latency: 0.21087214731960557 seconds
50% percentile latency: 0.21127225446980447 seconds
75% percentile latency: 0.2117120226903353 seconds
90% percentile latency: 0.21260529567953199 seconds
99% percentile latency: 0.21379286048118956 seconds

$ vllm bench latency --model=Qwen/Qwen3-0.6B --enforce-eager
Avg latency: 0.9905871206894517 seconds
10% percentile latency: 0.9860326998285018 seconds
25% percentile latency: 0.9877241560025141 seconds
50% percentile latency: 0.9896844719769433 seconds
75% percentile latency: 0.9927226560539566 seconds
90% percentile latency: 0.9962949193781242 seconds
99% percentile latency: 1.0001726696128026 seconds

$ vllm bench latency --model=Qwen/Qwen3-0.6B -cc.cudagraph_mode=NONE
Avg latency: 1.2380261707507694 seconds
10% percentile latency: 1.2285222162608989 seconds
25% percentile latency: 1.2319129704847 seconds
50% percentile latency: 1.2369173425249755 seconds
75% percentile latency: 1.2427808310312685 seconds
90% percentile latency: 1.2495167235843838 seconds
99% percentile latency: 1.2555694498249794 seconds
```

old

```
$ vllm bench latency --model=Qwen/Qwen3-0.6B --enforce-eager
Avg latency: 0.9591279058333991 seconds
10% percentile latency: 0.9503665335010737 seconds
25% percentile latency: 0.9520448662506169 seconds
50% percentile latency: 0.9549696334979672 seconds
75% percentile latency: 0.960952906251805 seconds
90% percentile latency: 0.9692660426007933 seconds
99% percentile latency: 0.9955393221892519 seconds

$ vllm bench latency --model=Qwen/Qwen3-0.6B -cc.cudagraph_mode=NONE
Avg latency: 1.949457830299798 seconds
10% percentile latency: 1.9308824338993873 seconds
25% percentile latency: 1.9359367577499142 seconds
50% percentile latency: 1.9469799755006534 seconds
75% percentile latency: 1.9632210410009066 seconds
90% percentile latency: 1.971349612998165 seconds
99% percentile latency: 1.991639568930841 seconds
```

@ProExpertProg ProExpertProg force-pushed the luka/vllm-ir/rms-norm-inplace branch from 2725b81 to 65f848e Compare April 1, 2026 14:18
@ProExpertProg ProExpertProg force-pushed the luka/vllm-ir/rms-norm-batch-invariant branch from 49ddf9b to 06de5e1 Compare April 1, 2026 14:18
@mergify
Contributor

mergify bot commented Apr 9, 2026

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @ProExpertProg.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Apr 9, 2026
@mergify mergify bot added the kv-connector label Apr 9, 2026
@ProExpertProg ProExpertProg force-pushed the luka/vllm-ir/rms-norm-batch-invariant branch from c6f01a4 to dafa54a Compare April 9, 2026 15:26
@ProExpertProg ProExpertProg force-pushed the luka/vllm-ir/rms-norm-inplace branch from ebca8b9 to 8b5e95e Compare April 9, 2026 15:26
@mergify mergify bot removed the tpu and needs-rebase labels Apr 9, 2026
Signed-off-by: Luka Govedič <lgovedic@redhat.com>
@ProExpertProg ProExpertProg force-pushed the luka/vllm-ir/rms-norm-inplace branch from 8b5e95e to 6b2a07f Compare April 18, 2026 01:53