[vLLM IR] 1/N Implement IR skeleton and rms_norm op #33825
ProExpertProg merged 56 commits into vllm-project:main
Conversation
This pull request has merge conflicts that must be resolved before it can be merged.
Code Review
This pull request introduces the foundational skeleton for a new Intermediate Representation (IR) in vLLM and undertakes a significant refactoring of the compilation passes and their associated tests. The new IR system provides a clean and extensible way to define and register custom operations. The refactoring effort organizes compilation passes into a more structured passes subdirectory and overhauls the end-to-end fusion tests for improved maintainability. The CI configuration has also been updated accordingly to reflect these structural changes. My review found one area for improvement in the new IR implementation.
# Conflicts:
#	vllm/config/kernel.py
#	vllm/model_executor/layers/layernorm.py
#	vllm/platforms/interface.py
```python
return ir.ops.rms_norm(
    x, self.weight, self.variance_epsilon, self.variance_size_override
)
```
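For context on what this call computes, here is a hedged plain-Python reference of RMS normalization, the semantics the IR op abstracts over. `rms_norm_ref` is an illustrative name (not a vLLM function), `eps` stands in for `variance_epsilon`, and `variance_size_override` is omitted for simplicity.

```python
import math

def rms_norm_ref(x: list[float], weight: list[float], eps: float) -> list[float]:
    # RMS norm: scale each element by 1/sqrt(mean(x^2) + eps),
    # then by the learned per-channel weight.
    variance = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(variance + eps)
    return [v * inv_rms * w for v, w in zip(x, weight)]

# With unit inputs and weights, the variance is 1 and the output equals the input.
print(rms_norm_ref([1.0, 1.0, 1.0, 1.0], [1.0, 1.0, 1.0, 1.0], 0.0))
# → [1.0, 1.0, 1.0, 1.0]
```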
Claude said (and I think it is reasonable): use `self.weight.data` here. Previously, when Dynamo was tracing through this, that is what was being used (see `vllm/model_executor/layers/layernorm.py`, line 279 at 978fc18).

This fixes the issue with `torch.no_grad()`. I'm not sure if there's something larger that is wrong, but changing this back to `self.weight.data` gets us back the previous behavior.
Wow nice find!
zou3519 left a comment:

LGTM. I think Claude figured out the `torch.no_grad()` issue; we should fix that before merging.
Hi @ProExpertProg, the pre-commit checks have failed. Please run:

```shell
uv pip install "pre-commit>=4.5.1"
pre-commit install
pre-commit run --all-files
```

Then, commit the changes and push to your branch.
Signed-off-by: Luka Govedic <luka.govedic@gmail.com>
@ProExpertProg We added a new Intel CI pipeline that only gates Intel PRs, so it does not apply to your PR. Feel free to ignore the result.
Benchmarking results for Qwen-0.6B on B200, main vs. PR (result tables not captured).
I checked out this PR on MI355; here are my findings:

- DeepSeek-R1: (results not captured)
- Llama-3.1-70B-Instruct: (results not captured)

TL;DR: I think everything looks good from my initial look.
Signed-off-by: Luka Govedic <luka.govedic@gmail.com>

# Conflicts:
#	vllm/compilation/passes/pass_manager.py
Woohoo!
Purpose
This PR implements the foundational infrastructure for vLLM IR (Intermediate Representation), a functional IR system for vLLM custom operations, starting with the `rms_norm` operation. This is the first of many PRs addressing RFC #32358.

What is vLLM IR? vLLM IR is a functional intermediate representation that separates operation semantics from implementation and dispatching. It serves as a higher-level torch dialect.
This PR contains the following initial features of vLLM IR:

- Op registration decorator (`vllm.ir.register_op`) returns an `IrOp` object, a callable object containing op metadata and utilities. The impl registration decorator (`IrOp.register_impl`) returns an `IrOpImpl` object which contains implementation metadata and utilities. `IrOp.dispatch` dispatches the call to the selected implementation, according to the priority list and runtime `support_args` predicates.
- `rms_norm` op & implementation registration: the op lives in `vllm/ir/ops/layernorm.py`. The implementations live in `vllm/kernels/*.py`, in different files for different providers.
- `vllm.compilation.passes.ir.lowering_pass.VllmIRLoweringPass` runs at the end of post-grad custom post-passes and lowers `vllm_ir` torch ops into the selected implementation, according to the priority list and runtime `support_args` predicates (consistent with eager-mode dispatching).
- `MatcherRMSNorm` replaced with `torch.ops.vllm_ir.rms_norm`: in custom compile passes, the fragile matching utility can be fully replaced by calling the vLLM IR op in pattern-matcher patterns and replacements.
- `IROpPriorityConfig` in `KernelConfig`, including a top-level CLI flag. Each platform also defines its own default op priority, which is combined with any user-specified values. This priority is then passed down to the IR op priority at the start of every forward pass.

Kernel implementation providers:
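The registration and dispatch flow described above can be sketched in a few lines. This is a simplified toy that only assumes the behavior named in the PR description (`register_op`, `register_impl`, `dispatch`, `support_args`, a priority list); it is not the actual vLLM IR implementation.

```python
# Toy sketch of op/impl registration with priority-based dispatch.
# Names mirror the PR description; the real vLLM IR API may differ.
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class IrOpImpl:
    name: str
    fn: Callable
    # Runtime predicate: can this implementation handle these arguments?
    support_args: Callable[..., bool] = lambda *a, **k: True

@dataclass
class IrOp:
    name: str
    impls: dict[str, IrOpImpl] = field(default_factory=dict)
    priority: list[str] = field(default_factory=list)

    def register_impl(self, impl_name, support_args=None):
        # Decorator that wraps a function into an IrOpImpl and registers it.
        def decorator(fn):
            impl = IrOpImpl(impl_name, fn, support_args or (lambda *a, **k: True))
            self.impls[impl_name] = impl
            return impl
        return decorator

    def dispatch(self, *args, **kwargs):
        # Walk the priority list and pick the first registered impl whose
        # support_args predicate accepts the runtime arguments.
        for impl_name in self.priority:
            impl = self.impls.get(impl_name)
            if impl is not None and impl.support_args(*args, **kwargs):
                return impl.fn(*args, **kwargs)
        raise RuntimeError(f"no implementation supports the given args for {self.name}")

    __call__ = dispatch

def register_op(name: str) -> IrOp:
    return IrOp(name)

# Usage: a toy rms_norm-like op with a picky fast impl and a fallback.
rms_norm = register_op("rms_norm")

@rms_norm.register_impl("fast", support_args=lambda x, eps: len(x) % 4 == 0)
def _fast(x, eps):
    return ("fast", x)

@rms_norm.register_impl("fallback")
def _fallback(x, eps):
    return ("fallback", x)

rms_norm.priority = ["fast", "fallback"]
print(rms_norm([1, 2, 3, 4], 1e-6)[0])  # fast
print(rms_norm([1, 2, 3], 1e-6)[0])     # fallback
```

The same priority walk happens in two places per the description: at eager-mode call time via `dispatch`, and at compile time when the lowering pass rewrites `vllm_ir` torch ops, so both paths select implementations consistently.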
Other non-IR changes:

- `disable_log_dedup` fixture for testing `warning_once` and `info_once`
- `fused_add_rms_norm` & batch invariant: this PR leaves the `fused_add_rms_norm` and batch-invariant parts of the RMSNorm custom op intact. PRs 2/N (#36816) and 3/N (#36823) will address these two, after which we can migrate `RMSNorm` from `CustomOp` to `PluggableLayer`.

Remaining TODOs:
- `direct_dispatch` implementation

Test Plan
- `tests/ir/test_op.py` - IR op registration, dispatching, priority system
- `tests/kernels/ir/test_layernorm.py` - RMS norm implementation tests
- `tests/compile/passes/ir/test_lowering.py` - lowering pass tests
- Qwen-0.6B `--enforce-eager` benchmark to measure worst-case dispatching overhead

Test Result
CI tests passing. Manually validated that Inductor output code is identical for DeepSeek-V3.1 (a complex case because it contains `rms_norm` on q as well as per-layer).
Qwen-0.6B latency configuration sweep on B200:
Command:
`vllm bench latency --model=Qwen/Qwen3-0.6B`, sweeping the flag combinations `--enforce-eager`, `-cc.cudagraph_mode=FULL_DECODE_ONLY`, `-cc.mode=NONE -cc.cudagraph_mode=FULL_DECODE_ONLY`, `-cc.custom_ops+=+rms_norm`, and `--ir-op-priority.rms_norm=vllm_c`

lm_eval (B200): main vs. PR (result tables not captured)

lm_eval (H100): main vs. PR (result tables not captured)

lm_eval (MI355): (result tables not captured)