
[vLLM IR] 3/N fused_add_rms_norm and maybe_inplace#34068

Closed
ProExpertProg wants to merge 19 commits into vllm-project:main from neuralmagic:luka/vllm-ir/rms-norm-inplace

Conversation

@ProExpertProg
Collaborator

@ProExpertProg ProExpertProg commented Feb 7, 2026

This is part two of the RMSNorm-to-vLLM-IR conversion, after #33825. It adds a maybe_inplace overload and proper handling for it.

TODO: remove clones, properly pickle custom_pre_grad_pass

Dispatching overhead seems to be around 20% for now, which is not great.

UPDATE: got the dispatching overhead down to negligible!
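For reference, the fused_add_rms_norm semantics and the role of an inplace flag can be sketched in NumPy. The function name, signature, and flag here are illustrative only, not vLLM's actual API or kernel:

```python
import numpy as np

def fused_add_rms_norm_ref(x, residual, weight, eps=1e-6, inplace=False):
    # Illustrative sketch only -- not vLLM's actual kernel or API.
    # residual <- x + residual, then out <- RMSNorm(residual) * weight.
    # With inplace=True the caller permits its buffers to be overwritten,
    # which a lowering pass could exploit to select an in-place kernel.
    added = np.add(x, residual, out=residual) if inplace else x + residual
    rms = np.sqrt(np.mean(added * added, axis=-1, keepdims=True) + eps)
    out = added / rms * weight
    if inplace:
        np.copyto(x, out)
        return x, added
    return out, added
```

Both paths compute the same values; the in-place path merely reuses the caller's buffers instead of allocating new ones.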

Purpose

Test Plan

Test Result


Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

@mergify mergify bot added nvidia rocm Related to AMD ROCm labels Feb 7, 2026
@github-project-automation github-project-automation bot moved this to Todo in AMD Feb 7, 2026
Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a significant architectural improvement by adding a vLLM Intermediate Representation (IR) layer for custom operations, initially covering rms_norm and fused_add_rms_norm, including a maybe_inplace variant that handles in-place operations gracefully. The changes are comprehensive: a robust registration and dispatching mechanism, lowering passes that translate IR ops into concrete kernel implementations, and a new configuration system for kernel priorities. Existing fusion passes and model layers are cleanly refactored onto the new IR, and the IR system comes with extensive tests. Overall, this refactoring builds a solid foundation for managing and extending kernel implementations in vLLM, improving maintainability and extensibility.

@ProExpertProg ProExpertProg added torch.compile vllm-ir vLLM IR: intermediate representation and kernel registration labels Feb 8, 2026
@mergify
Contributor

mergify bot commented Feb 9, 2026

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @ProExpertProg.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Feb 9, 2026
Signed-off-by: Luka Govedič <lgovedic@redhat.com>

… default application and validation, including more robust schema checks
@ProExpertProg ProExpertProg force-pushed the luka/vllm-ir/rms-norm-inplace branch from 5046f7a to 02d1591 on February 10, 2026 22:12
@mergify mergify bot removed the needs-rebase label Feb 10, 2026
@ProExpertProg
Collaborator Author

ProExpertProg commented Feb 12, 2026

Measuring the dispatching overhead: it's about 24% in eager mode (and ~0% in compiled mode). Skipping the extra layer of torch custom op drops it to 15%. Note that this covers only rms_norm; if every op were wrapped in a vLLM IR op, the overhead would be larger. But this is also Qwen, which has a ton of norms.

This PR:

$ vllm bench latency --enforce-eager --model=Qwen/Qwen3-0.6B
Avg latency: 1.2152492157338808 seconds
10% percentile latency: 1.2036446358077229 seconds
25% percentile latency: 1.2078355035046116 seconds
50% percentile latency: 1.2112166829174384 seconds
75% percentile latency: 1.2189692648826167 seconds
90% percentile latency: 1.2255195130594074 seconds
99% percentile latency: 1.2715601262333804 seconds

# Bypassing torch dispatching
$ vllm bench latency --enforce-eager --model=Qwen/Qwen3-0.6B
Avg latency: 1.1208888423163443 seconds
10% percentile latency: 1.101336152642034 seconds
25% percentile latency: 1.1054603178054094 seconds
50% percentile latency: 1.110795821994543 seconds
75% percentile latency: 1.1208480294444598 seconds
90% percentile latency: 1.1498165469616652 seconds
99% percentile latency: 1.2160589512367734 seconds

# also removing default arg binding
$ vllm bench latency --enforce-eager --model=Qwen/Qwen3-0.6B
Avg latency: 1.0481553357404967 seconds
10% percentile latency: 1.0379706282168626 seconds
25% percentile latency: 1.0412185442983173 seconds
50% percentile latency: 1.0463172430172563 seconds
75% percentile latency: 1.0550272777909413 seconds
90% percentile latency: 1.0605557654052973 seconds
99% percentile latency: 1.0657283452572301 seconds

# also no supports_args check
$ vllm bench latency
Avg latency: 0.9391857257888963 seconds
10% percentile latency: 0.9315106127643957 seconds
25% percentile latency: 0.9336321024456993 seconds
50% percentile latency: 0.9388158750953153 seconds
75% percentile latency: 0.9438988686306402 seconds
90% percentile latency: 0.9466064346954226 seconds
99% percentile latency: 0.9529449114599265 seconds

$ vllm bench latency -cc.mode=NONE -cc.cudagraph_mode=FULL_DECODE_ONLY --model=Qwen/Qwen3-0.6B
Avg latency: 0.22638876158744098 seconds
10% percentile latency: 0.22529362430796027 seconds
25% percentile latency: 0.22640719747869298 seconds
50% percentile latency: 0.22668499848805368 seconds
75% percentile latency: 0.22784376773051918 seconds
90% percentile latency: 0.22901816791854798 seconds
99% percentile latency: 0.23151227492373436 seconds

$ vllm bench latency -cc.cudagraph_mode=FULL_DECODE_ONLY --model=Qwen/Qwen3-0.6B
Avg latency: 0.2009479501362269 seconds
10% percentile latency: 0.19971295476425438 seconds
25% percentile latency: 0.19984736770857126 seconds
50% percentile latency: 0.20022834197152406 seconds
75% percentile latency: 0.20195828197756782 seconds
90% percentile latency: 0.2033034526044503 seconds
99% percentile latency: 0.2035756545769982 seconds

Main:

$ vllm bench latency --enforce-eager --model=Qwen/Qwen3-0.6B
Avg latency: 0.9762042934074998 seconds
10% percentile latency: 0.9623611782211811 seconds
25% percentile latency: 0.9689289649832062 seconds
50% percentile latency: 0.9749053731793538 seconds
75% percentile latency: 0.9816581851337105 seconds
90% percentile latency: 0.991626029717736 seconds
99% percentile latency: 0.9985674665891565 seconds

$ vllm bench latency -cc.mode=NONE -cc.cudagraph_mode=FULL_DECODE_ONLY --model=Qwen/Qwen3-0.6B
Avg latency: 0.2221401832997799 seconds
10% percentile latency: 0.22153066149912776 seconds
25% percentile latency: 0.22170765756163746 seconds
50% percentile latency: 0.2221028634812683 seconds
75% percentile latency: 0.2223803079687059 seconds
90% percentile latency: 0.22383065768517554 seconds
99% percentile latency: 0.22525297554209828 seconds

$ vllm bench latency -cc.cudagraph_mode=FULL_DECODE_ONLY --model=Qwen/Qwen3-0.6B
Avg latency: 0.20247513990228375 seconds
10% percentile latency: 0.20137016247026623 seconds
25% percentile latency: 0.20162693836027756 seconds
50% percentile latency: 0.20182030159048736 seconds
75% percentile latency: 0.20236327144084498 seconds
90% percentile latency: 0.20497609921731055 seconds
99% percentile latency: 0.20644027675967663 seconds
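As a toy illustration (not vLLM code) of why an extra dispatch layer shows up in eager mode but disappears under compilation or CUDA graphs: every eager call pays for an additional Python frame and default binding on top of the kernel itself, and that cost is paid per call.

```python
import timeit

def kernel(x):
    # Stand-in for the underlying fused kernel call.
    return x * 2.0

def dispatched(x, scale=2.0, inplace=False):
    # Stand-in for an extra IR dispatch layer: one more Python frame
    # (and default-argument handling) before reaching the kernel.
    return kernel(x)

n = 100_000
t_direct = timeit.timeit(lambda: kernel(1.0), number=n)
t_dispatched = timeit.timeit(lambda: dispatched(1.0), number=n)
print(f"direct: {t_direct:.4f}s  dispatched: {t_dispatched:.4f}s")
```

A compiled or graph-captured path traces through the wrapper once, so the per-call Python cost vanishes, matching the near-identical FULL_DECODE_ONLY numbers above.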

@ProExpertProg
Collaborator Author

ProExpertProg commented Feb 12, 2026

It seems the inefficiency is coming from apply_arg_defaults; I didn't realize it was that slow. Without it, the overhead should be minimal.
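The kind of cost involved can be illustrated with a toy comparison (the real apply_arg_defaults in vLLM may be shaped differently): binding defaults through inspect.Signature.bind on every call is much more expensive than merging a defaults dict that was computed once at registration time.

```python
import inspect

def op(x, residual=None, eps=1e-6, inplace=False):
    # Stand-in for a registered op; only the calling convention matters here.
    return x

_sig = inspect.signature(op)

def call_with_bound_defaults(*args, **kwargs):
    # Slow path: full signature binding on every call.
    bound = _sig.bind(*args, **kwargs)
    bound.apply_defaults()
    return op(*bound.args, **bound.kwargs)

# Fast path: compute the defaults dict once, then merge per call.
_defaults = {name: p.default for name, p in _sig.parameters.items()
             if p.default is not inspect.Parameter.empty}

def call_with_precomputed_defaults(x, **kwargs):
    return op(x, **{**_defaults, **kwargs})
```

Both helpers produce identical results; the second avoids constructing a BoundArguments object per call, which is where the profiler time above would go.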

[Screenshot: profiler trace, 2026-02-12 5:00 PM]

@tjtanaa
Collaborator

tjtanaa commented Feb 13, 2026

Seems like the inefficiency is coming from apply_arg_defaults - didn't realize it was that slow. I think without that the overhead will be minimal.

[image] The red one is `apply_arg_defaults`; how about the one circled in green? It is almost as long as `apply_arg_defaults`. Are we able to bypass that?

@ProExpertProg
Collaborator Author

That's another apply_arg_defaults - I managed to call it twice accidentally 😅

@mergify
Contributor

mergify bot commented Feb 18, 2026

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @ProExpertProg.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

[Comment thread on vllm/config/kernel.py]
@ProExpertProg ProExpertProg changed the title [vLLM IR] fused_add_rms_norm and maybe_inplace [vLLM IR] 3/N fused_add_rms_norm and maybe_inplace Mar 10, 2026
@ProExpertProg
Collaborator Author

New PR: #36823

@github-project-automation github-project-automation bot moved this to Done in NVIDIA Mar 11, 2026
@github-project-automation github-project-automation bot moved this from To triage to Done in torch.compile integration Mar 11, 2026
@github-project-automation github-project-automation bot moved this from Todo to Done in AMD Mar 11, 2026

Labels

needs-rebase nvidia rocm Related to AMD ROCm torch.compile vllm-ir vLLM IR: intermediate representation and kernel registration

Projects

Status: Done


3 participants