[vLLM IR] 3/N fused_add_rms_norm and maybe_inplace#36823

Draft
ProExpertProg wants to merge 8 commits into luka/vllm-ir/rms-norm-batch-invariant from luka/vllm-ir/rms-norm-inplace

Conversation

@ProExpertProg
Collaborator

@ProExpertProg ProExpertProg commented Mar 11, 2026

Purpose

Test Plan

Test Result


Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

@mergify mergify bot added the nvidia label Mar 11, 2026
@mergify
Contributor

mergify bot commented Mar 11, 2026

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @ProExpertProg.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Mar 11, 2026
@ProExpertProg ProExpertProg force-pushed the luka/vllm-ir/rms-norm-inplace branch from 4b47060 to 837d6f3 Compare March 11, 2026 21:38
@mergify mergify bot removed the needs-rebase label Mar 11, 2026
Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a maybe_inplace mechanism to the vLLM IR, allowing for performance optimizations through in-place operations while maintaining functional semantics for default op calls. It also adds a new fused_add_rms_norm op that leverages this new capability. The changes are extensive, touching the IR definition, compiler passes, and kernel implementations. My review has identified a critical safety issue where a potential misuse of an in-place operation only triggers a warning instead of an error, and a high-severity issue related to an incomplete compiler pass (CloneCleanupPass) that is being added.
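The maybe_inplace mechanism described above can be sketched in plain Python. This is an illustrative sketch only: `register_op`, `choose_overload`, and the toy kernels are hypothetical names, not vLLM's actual API.

```python
# Illustrative sketch only: register_op/choose_overload and the toy
# kernels below are hypothetical stand-ins, not vLLM's actual API.
OPS = {}

def register_op(name, default, inplace=None):
    # Each op pairs a functional default overload with an optional
    # in-place variant.
    OPS[name] = {"default": default, "inplace": inplace}

def choose_overload(name, input_use_count):
    # A compiler pass may substitute the in-place variant, but only
    # when the input has no other users; otherwise the functional
    # default overload keeps its value semantics.
    op = OPS[name]
    if op["inplace"] is not None and input_use_count == 1:
        return op["inplace"]
    return op["default"]

def scale_half(x):
    # Toy functional kernel: returns a new list, input untouched.
    return [v * 0.5 for v in x]

def scale_half_(x):
    # Toy in-place kernel: overwrites x and returns it.
    for i, v in enumerate(x):
        x[i] = v * 0.5
    return x

register_op("scale_half", default=scale_half, inplace=scale_half_)

# Input still has another user -> functional overload is chosen.
assert choose_overload("scale_half", input_use_count=2) is scale_half
# Input is dead after this op -> safe to mutate in place.
assert choose_overload("scale_half", input_use_count=1) is scale_half_
```

The key design point is that the default call site never observes mutation; only the compiler, after proving the input is dead, opts into the in-place variant.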

Comment on lines +69 to +73

```python
logger.warning(
    "Node %s (input to %s) has another use", arg, node
)
# TODO raise error, this is undefined behavior, which should not be allowed.
# Users can just use the default overload if they want to keep activation inputs untouched.
```

critical

The check for other users of an activation input that is about to be modified in-place currently only logs a warning. This can lead to silent correctness issues if the input tensor is used elsewhere after being modified. As the TODO comment suggests, this should raise an error to prevent such undefined behavior. Allowing compilation to proceed with a warning could introduce hard-to-debug errors downstream.

Suggested change

```diff
-logger.warning(
-    "Node %s (input to %s) has another use", arg, node
-)
-# TODO raise error, this is undefined behavior, which should not be allowed.
-# Users can just use the default overload if they want to keep activation inputs untouched.
+raise ValueError(
+    f"Node {arg} (input to {node}) has another use. "
+    "Using maybe_inplace on an input with multiple users is not allowed. "
+    "Use the default overload if you want to keep activation inputs untouched."
+)
```
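To illustrate the hazard this comment describes, here is a minimal plain-Python sketch (lists standing in for tensors; the function names are hypothetical, not vLLM's): an in-place overload silently corrupts a second consumer of the same storage.

```python
import math

def rms_norm(x, eps=1e-6):
    # Functional overload: returns a new list; x is untouched.
    scale = 1.0 / math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v * scale for v in x]

def rms_norm_inplace(x, eps=1e-6):
    # maybe_inplace-style overload: overwrites x. Correct only if no
    # other consumer still needs the original values of x.
    scale = 1.0 / math.sqrt(sum(v * v for v in x) / len(x) + eps)
    for i in range(len(x)):
        x[i] *= scale
    return x

x = [3.0, 4.0]
original = x          # a second "user" holding the same storage
y = rms_norm_inplace(x)
# The second user now silently sees normalized values rather than the
# original activations -- the undefined behavior the review flags.
assert original is y
assert original != [3.0, 4.0]
```

With a single user, the mutation is unobservable and saves an allocation; with two, the only safe choices are cloning first or raising, which is why a hard error is preferable to a warning here.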

Comment on lines +169 to +172

```python
continue  # TODO
node.replace_all_uses_with(node.args[0])
graph.erase_node(node)
count += 1
```

high

The CloneCleanupPass is added to the pass manager but its implementation is currently a no-op due to the continue # TODO statement. The logic to remove clone nodes is commented out. Merging incomplete or placeholder code can lead to confusion and makes it unclear if the feature is intended to be active. The pass should either be fully implemented or removed from the pass manager until it's ready.

Suggested change

```diff
-continue  # TODO
-node.replace_all_uses_with(node.args[0])
-graph.erase_node(node)
-count += 1
+# A clone is safe to remove if its input has no other users.
+# This is a conservative check. A more sophisticated analysis
+# could trace back to the `maybe_inplace` call and its metadata.
+if len(node.args[0].users) == 1:
+    node.replace_all_uses_with(node.args[0])
+    graph.erase_node(node)
+    count += 1
```
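The suggested single-user check can be exercised with a toy node graph standing in for torch.fx. This is an illustrative sketch; `Node`, `make_node`, and `clone_cleanup` are hypothetical stand-ins, not vLLM's or torch.fx's classes.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    op: str
    args: tuple = ()
    users: list = field(default_factory=list)

def make_node(op, *args):
    node = Node(op, args)
    for a in args:
        a.users.append(node)
    return node

def clone_cleanup(nodes):
    """Erase clone nodes whose input has no other users."""
    removed = 0
    for node in list(nodes):
        if node.op != "clone":
            continue
        src = node.args[0]
        if len(src.users) != 1:  # conservative: input used elsewhere
            continue
        # Redirect every use of the clone to its input, then erase it.
        for user in node.users:
            user.args = tuple(src if a is node else a for a in user.args)
            src.users.append(user)
        src.users.remove(node)
        nodes.remove(node)
        removed += 1
    return removed

x = Node("placeholder")
c = make_node("clone", x)
y = make_node("rms_norm", c)
removed = clone_cleanup([x, c, y])
assert removed == 1 and y.args == (x,)
```

When the clone's input has a second consumer, the pass leaves the clone in place, trading a redundant copy for guaranteed correctness.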

@ProExpertProg ProExpertProg force-pushed the luka/vllm-ir/rms-norm-inplace branch from 837d6f3 to d5e968e Compare March 11, 2026 21:44
@ProExpertProg ProExpertProg added the torch.compile and vllm-ir labels Mar 11, 2026
@ProExpertProg ProExpertProg changed the title Draft [vLLM IR] 3/N fused_add_rms_norm and maybe_inplace [vLLM IR] 3/N fused_add_rms_norm and maybe_inplace Mar 11, 2026
@ProExpertProg ProExpertProg force-pushed the luka/vllm-ir/rms-norm-batch-invariant branch from 8041106 to d8fe95a Compare March 12, 2026 10:02
@ProExpertProg ProExpertProg force-pushed the luka/vllm-ir/rms-norm-inplace branch from d5e968e to 1939b89 Compare March 12, 2026 17:23
@ProExpertProg ProExpertProg force-pushed the luka/vllm-ir/rms-norm-batch-invariant branch from d8fe95a to 810b9f3 Compare March 12, 2026 19:42
@mergify
Contributor

mergify bot commented Mar 12, 2026

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @ProExpertProg.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Mar 12, 2026
@ProExpertProg ProExpertProg force-pushed the luka/vllm-ir/rms-norm-batch-invariant branch 3 times, most recently from 4868432 to fefd5b0 Compare March 20, 2026 13:58
@ProExpertProg ProExpertProg force-pushed the luka/vllm-ir/rms-norm-batch-invariant branch from fefd5b0 to 81c6989 Compare March 31, 2026 19:39
@ProExpertProg ProExpertProg force-pushed the luka/vllm-ir/rms-norm-inplace branch from 1939b89 to 2725b81 Compare March 31, 2026 22:03
@mergify mergify bot removed the needs-rebase label Mar 31, 2026
@ProExpertProg
Collaborator Author

ProExpertProg commented Mar 31, 2026

Current status of perf:

```
$ vllm bench latency --model=Qwen/Qwen3-0.6B
Avg latency: 0.19557336465998862 seconds
10% percentile latency: 0.1915025576017797 seconds
25% percentile latency: 0.19555562996538356 seconds
50% percentile latency: 0.19578603247646242 seconds
75% percentile latency: 0.1963934577361215 seconds
90% percentile latency: 0.19710877431789414 seconds
99% percentile latency: 0.19903717429377138 seconds

$ vllm bench latency --model=Qwen/Qwen3-0.6B -cc.custom_ops+=+rms_norm --ir-op-priority.rms_norm=vllm_c --ir-op-priority.fused_add_rms_norm=vllm_c
Avg latency: 0.21115826283348724 seconds
10% percentile latency: 0.2069199254270643 seconds
25% percentile latency: 0.21143249873421155 seconds
50% percentile latency: 0.21161350194597617 seconds
75% percentile latency: 0.21171232018969022 seconds
90% percentile latency: 0.21274942716117948 seconds
99% percentile latency: 0.21377979715238327 seconds

$ vllm bench latency --model=Qwen/Qwen3-0.6B --enforce-eager
Avg latency: 1.0280421572815006 seconds
10% percentile latency: 1.013216928683687 seconds
25% percentile latency: 1.014344347815495 seconds
50% percentile latency: 1.019462305586785 seconds
75% percentile latency: 1.0264160812075716 seconds
90% percentile latency: 1.036091740813572 seconds
99% percentile latency: 1.1814956719905605 seconds

$ vllm bench latency --model=Qwen/Qwen3-0.6B -cc.cudagraph_mode=NONE
Avg latency: 1.1771849144715816 seconds
10% percentile latency: 1.1685888864682057 seconds
25% percentile latency: 1.1711883139505517 seconds
50% percentile latency: 1.1759423059993424 seconds
75% percentile latency: 1.1811810505460016 seconds
90% percentile latency: 1.1855081893503665 seconds
99% percentile latency: 1.204472551472718 seconds
```

Main

```
$ vllm bench latency --model=Qwen/Qwen3-0.6B
Avg latency: 0.1961628031722891 seconds
10% percentile latency: 0.19574561258777975 seconds
25% percentile latency: 0.1958314114890527 seconds
50% percentile latency: 0.19613034051144496 seconds
75% percentile latency: 0.19636730520869605 seconds
90% percentile latency: 0.19754589340882375 seconds
99% percentile latency: 0.19999273694236763 seconds

$ vllm bench latency --model=Qwen/Qwen3-0.6B -cc.custom_ops+=+rms_norm
Avg latency: 0.21067771341186017 seconds
10% percentile latency: 0.20676958258263767 seconds
25% percentile latency: 0.21087214731960557 seconds
50% percentile latency: 0.21127225446980447 seconds
75% percentile latency: 0.2117120226903353 seconds
90% percentile latency: 0.21260529567953199 seconds
99% percentile latency: 0.21379286048118956 seconds

$ vllm bench latency --model=Qwen/Qwen3-0.6B --enforce-eager
Avg latency: 0.9905871206894517 seconds
10% percentile latency: 0.9860326998285018 seconds
25% percentile latency: 0.9877241560025141 seconds
50% percentile latency: 0.9896844719769433 seconds
75% percentile latency: 0.9927226560539566 seconds
90% percentile latency: 0.9962949193781242 seconds
99% percentile latency: 1.0001726696128026 seconds

$ vllm bench latency --model=Qwen/Qwen3-0.6B -cc.cudagraph_mode=NONE
Avg latency: 1.2380261707507694 seconds
10% percentile latency: 1.2285222162608989 seconds
25% percentile latency: 1.2319129704847 seconds
50% percentile latency: 1.2369173425249755 seconds
75% percentile latency: 1.2427808310312685 seconds
90% percentile latency: 1.2495167235843838 seconds
99% percentile latency: 1.2555694498249794 seconds
```

old

```
$ vllm bench latency --model=Qwen/Qwen3-0.6B --enforce-eager
Avg latency: 0.9591279058333991 seconds
10% percentile latency: 0.9503665335010737 seconds
25% percentile latency: 0.9520448662506169 seconds
50% percentile latency: 0.9549696334979672 seconds
75% percentile latency: 0.960952906251805 seconds
90% percentile latency: 0.9692660426007933 seconds
99% percentile latency: 0.9955393221892519 seconds

$ vllm bench latency --model=Qwen/Qwen3-0.6B -cc.cudagraph_mode=NONE
Avg latency: 1.949457830299798 seconds
10% percentile latency: 1.9308824338993873 seconds
25% percentile latency: 1.9359367577499142 seconds
50% percentile latency: 1.9469799755006534 seconds
75% percentile latency: 1.9632210410009066 seconds
90% percentile latency: 1.971349612998165 seconds
99% percentile latency: 1.991639568930841 seconds
```

@ProExpertProg ProExpertProg force-pushed the luka/vllm-ir/rms-norm-inplace branch from 2725b81 to 65f848e Compare April 1, 2026 14:18
@ProExpertProg ProExpertProg force-pushed the luka/vllm-ir/rms-norm-batch-invariant branch from 49ddf9b to 06de5e1 Compare April 1, 2026 14:18
@mergify
Contributor

mergify bot commented Apr 9, 2026

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @ProExpertProg.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Apr 9, 2026
@mergify mergify bot added the kv-connector label Apr 9, 2026
@ProExpertProg ProExpertProg force-pushed the luka/vllm-ir/rms-norm-batch-invariant branch from c6f01a4 to dafa54a Compare April 9, 2026 15:26
@ProExpertProg ProExpertProg force-pushed the luka/vllm-ir/rms-norm-inplace branch from ebca8b9 to 8b5e95e Compare April 9, 2026 15:26
@mergify mergify bot removed the tpu and needs-rebase labels Apr 9, 2026
Signed-off-by: Luka Govedič <lgovedic@redhat.com>
@ProExpertProg ProExpertProg force-pushed the luka/vllm-ir/rms-norm-inplace branch from 8b5e95e to 6b2a07f Compare April 18, 2026 01:53