
[vLLM IR] 3/N fused_add_rms_norm and maybe_inplace#34068

Closed
ProExpertProg wants to merge 19 commits into vllm-project:main from neuralmagic:luka/vllm-ir/rms-norm-inplace

Conversation

@ProExpertProg
Collaborator

@ProExpertProg ProExpertProg commented Feb 7, 2026

This is part two of the RMSNorm-to-vLLM-IR conversion, after #33825. It adds a maybe_inplace overload and proper handling for it.

TODO: remove clones, properly pickle custom_pre_grad_pass

Dispatching overhead seems to be around 20% for now, which is not great.

UPDATE: got the dispatching overhead down to negligible!
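For reference, the fused_add_rms_norm semantics and the role of an inplace flag can be sketched in NumPy. The function name, signature, and flag here are illustrative only, not vLLM's actual API or kernel:

```python
import numpy as np

def fused_add_rms_norm_ref(x, residual, weight, eps=1e-6, inplace=False):
    # Illustrative sketch only -- not vLLM's actual kernel or API.
    # residual <- x + residual, then out <- RMSNorm(residual) * weight.
    # With inplace=True the caller permits its buffers to be overwritten,
    # which a lowering pass could exploit to select an in-place kernel.
    added = np.add(x, residual, out=residual) if inplace else x + residual
    rms = np.sqrt(np.mean(added * added, axis=-1, keepdims=True) + eps)
    out = added / rms * weight
    if inplace:
        np.copyto(x, out)
        return x, added
    return out, added
```

Both paths compute the same values; the in-place path merely reuses the caller's buffers instead of allocating new ones.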

Purpose

Test Plan

Test Result


Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

@mergify mergify bot added nvidia rocm Related to AMD ROCm labels Feb 7, 2026
@github-project-automation github-project-automation bot moved this to Todo in AMD Feb 7, 2026
Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a significant architectural improvement by adding a vLLM Intermediate Representation (IR) layer for custom operations, initially covering rms_norm and fused_add_rms_norm, including a maybe_inplace variant that handles in-place operations gracefully. The changes are comprehensive: a robust registration and dispatching mechanism, lowering passes that translate IR ops into concrete kernel implementations, and a new configuration system for kernel priorities. Existing fusion passes and model layers are cleanly refactored onto the new IR, and the IR system comes with extensive tests. Overall, this refactoring builds a solid foundation for managing and extending kernel implementations in vLLM, improving maintainability and extensibility.

@ProExpertProg ProExpertProg added torch.compile vllm-ir vLLM IR: intermediate representation and kernel registration labels Feb 8, 2026
@mergify
Contributor

mergify bot commented Feb 9, 2026

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @ProExpertProg.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Feb 9, 2026
Signed-off-by: Luka Govedič <lgovedic@redhat.com>

… default application and validation, including more robust schema checks
@ProExpertProg ProExpertProg force-pushed the luka/vllm-ir/rms-norm-inplace branch from 5046f7a to 02d1591 on February 10, 2026 22:12
@mergify mergify bot removed the needs-rebase label Feb 10, 2026
@ProExpertProg
Collaborator Author

ProExpertProg commented Feb 12, 2026

Measuring the dispatching overhead: it's about 24% in eager mode (and ~0% in compiled mode). Skipping the extra layer of torch custom op drops it to 15%. Note that this covers only rms_norm; if every op were wrapped in a vLLM IR op, the overhead would be larger. But this is also Qwen, which has a ton of norms.

This PR:

$ vllm bench latency --enforce-eager --model=Qwen/Qwen3-0.6B
Avg latency: 1.2152492157338808 seconds
10% percentile latency: 1.2036446358077229 seconds
25% percentile latency: 1.2078355035046116 seconds
50% percentile latency: 1.2112166829174384 seconds
75% percentile latency: 1.2189692648826167 seconds
90% percentile latency: 1.2255195130594074 seconds
99% percentile latency: 1.2715601262333804 seconds

# Bypassing torch dispatching
$ vllm bench latency --enforce-eager --model=Qwen/Qwen3-0.6B
Avg latency: 1.1208888423163443 seconds
10% percentile latency: 1.101336152642034 seconds
25% percentile latency: 1.1054603178054094 seconds
50% percentile latency: 1.110795821994543 seconds
75% percentile latency: 1.1208480294444598 seconds
90% percentile latency: 1.1498165469616652 seconds
99% percentile latency: 1.2160589512367734 seconds

# also removing default arg binding
$ vllm bench latency --enforce-eager --model=Qwen/Qwen3-0.6B
Avg latency: 1.0481553357404967 seconds
10% percentile latency: 1.0379706282168626 seconds
25% percentile latency: 1.0412185442983173 seconds
50% percentile latency: 1.0463172430172563 seconds
75% percentile latency: 1.0550272777909413 seconds
90% percentile latency: 1.0605557654052973 seconds
99% percentile latency: 1.0657283452572301 seconds

# also no supports_args check
$ vllm bench latency
Avg latency: 0.9391857257888963 seconds
10% percentile latency: 0.9315106127643957 seconds
25% percentile latency: 0.9336321024456993 seconds
50% percentile latency: 0.9388158750953153 seconds
75% percentile latency: 0.9438988686306402 seconds
90% percentile latency: 0.9466064346954226 seconds
99% percentile latency: 0.9529449114599265 seconds

$ vllm bench latency -cc.mode=NONE -cc.cudagraph_mode=FULL_DECODE_ONLY --model=Qwen/Qwen3-0.6B
Avg latency: 0.22638876158744098 seconds
10% percentile latency: 0.22529362430796027 seconds
25% percentile latency: 0.22640719747869298 seconds
50% percentile latency: 0.22668499848805368 seconds
75% percentile latency: 0.22784376773051918 seconds
90% percentile latency: 0.22901816791854798 seconds
99% percentile latency: 0.23151227492373436 seconds

$ vllm bench latency -cc.cudagraph_mode=FULL_DECODE_ONLY --model=Qwen/Qwen3-0.6B
Avg latency: 0.2009479501362269 seconds
10% percentile latency: 0.19971295476425438 seconds
25% percentile latency: 0.19984736770857126 seconds
50% percentile latency: 0.20022834197152406 seconds
75% percentile latency: 0.20195828197756782 seconds
90% percentile latency: 0.2033034526044503 seconds
99% percentile latency: 0.2035756545769982 seconds

Main:

$ vllm bench latency --enforce-eager --model=Qwen/Qwen3-0.6B
Avg latency: 0.9762042934074998 seconds
10% percentile latency: 0.9623611782211811 seconds
25% percentile latency: 0.9689289649832062 seconds
50% percentile latency: 0.9749053731793538 seconds
75% percentile latency: 0.9816581851337105 seconds
90% percentile latency: 0.991626029717736 seconds
99% percentile latency: 0.9985674665891565 seconds

$ vllm bench latency -cc.mode=NONE -cc.cudagraph_mode=FULL_DECODE_ONLY --model=Qwen/Qwen3-0.6B
Avg latency: 0.2221401832997799 seconds
10% percentile latency: 0.22153066149912776 seconds
25% percentile latency: 0.22170765756163746 seconds
50% percentile latency: 0.2221028634812683 seconds
75% percentile latency: 0.2223803079687059 seconds
90% percentile latency: 0.22383065768517554 seconds
99% percentile latency: 0.22525297554209828 seconds

$ vllm bench latency -cc.cudagraph_mode=FULL_DECODE_ONLY --model=Qwen/Qwen3-0.6B
Avg latency: 0.20247513990228375 seconds
10% percentile latency: 0.20137016247026623 seconds
25% percentile latency: 0.20162693836027756 seconds
50% percentile latency: 0.20182030159048736 seconds
75% percentile latency: 0.20236327144084498 seconds
90% percentile latency: 0.20497609921731055 seconds
99% percentile latency: 0.20644027675967663 seconds
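As a toy illustration (not vLLM code) of why an extra dispatch layer shows up in eager mode but disappears under compilation or CUDA graphs: every eager call pays for an additional Python frame and default binding on top of the kernel itself, and that cost is paid per call.

```python
import timeit

def kernel(x):
    # Stand-in for the underlying fused kernel call.
    return x * 2.0

def dispatched(x, scale=2.0, inplace=False):
    # Stand-in for an extra IR dispatch layer: one more Python frame
    # (and default-argument handling) before reaching the kernel.
    return kernel(x)

n = 100_000
t_direct = timeit.timeit(lambda: kernel(1.0), number=n)
t_dispatched = timeit.timeit(lambda: dispatched(1.0), number=n)
print(f"direct: {t_direct:.4f}s  dispatched: {t_dispatched:.4f}s")
```

A compiled or graph-captured path traces through the wrapper once, so the per-call Python cost vanishes, matching the near-identical FULL_DECODE_ONLY numbers above.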

@ProExpertProg
Collaborator Author

ProExpertProg commented Feb 12, 2026

It seems the inefficiency is coming from apply_arg_defaults; I didn't realize it was that slow. Without it, the overhead should be minimal.
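The kind of cost involved can be illustrated with a toy comparison (the real apply_arg_defaults in vLLM may be shaped differently): binding defaults through inspect.Signature.bind on every call is much more expensive than merging a defaults dict that was computed once at registration time.

```python
import inspect

def op(x, residual=None, eps=1e-6, inplace=False):
    # Stand-in for a registered op; only the calling convention matters here.
    return x

_sig = inspect.signature(op)

def call_with_bound_defaults(*args, **kwargs):
    # Slow path: full signature binding on every call.
    bound = _sig.bind(*args, **kwargs)
    bound.apply_defaults()
    return op(*bound.args, **bound.kwargs)

# Fast path: compute the defaults dict once, then merge per call.
_defaults = {name: p.default for name, p in _sig.parameters.items()
             if p.default is not inspect.Parameter.empty}

def call_with_precomputed_defaults(x, **kwargs):
    return op(x, **{**_defaults, **kwargs})
```

Both helpers produce identical results; the second avoids constructing a BoundArguments object per call, which is where the profiler time above would go.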

[Screenshot: profiler trace, 2026-02-12 5:00 PM]

@tjtanaa
Collaborator

tjtanaa commented Feb 13, 2026

Seems like the inefficiency is coming from apply_arg_defaults - didn't realize it was that slow. I think without that the overhead will be minimal.

[image] The red one is `apply_arg_defaults`; how about the one circled in green? It is almost as long as `apply_arg_defaults`. Are we able to bypass that?

@ProExpertProg
Collaborator Author

That's another apply_arg_defaults - I managed to call it twice accidentally 😅

@mergify
Contributor

mergify bot commented Feb 18, 2026

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @ProExpertProg.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

[Comment thread on vllm/config/kernel.py]
@ProExpertProg ProExpertProg changed the title [vLLM IR] fused_add_rms_norm and maybe_inplace [vLLM IR] 3/N fused_add_rms_norm and maybe_inplace Mar 10, 2026
@ProExpertProg
Collaborator Author

New PR: #36823

@github-project-automation github-project-automation bot moved this to Done in NVIDIA Mar 11, 2026
@github-project-automation github-project-automation bot moved this from To triage to Done in torch.compile integration Mar 11, 2026
@github-project-automation github-project-automation bot moved this from Todo to Done in AMD Mar 11, 2026

Labels

needs-rebase nvidia rocm Related to AMD ROCm torch.compile vllm-ir vLLM IR: intermediate representation and kernel registration

Projects

Status: Done


3 participants