[vLLM IR] Cache the fx_replacement to avoid re-tracing the same impl#39034

Open
gcanlin wants to merge 2 commits into vllm-project:main from gcanlin:ir-fx-cache

Conversation

@gcanlin
Contributor

@gcanlin gcanlin commented Apr 5, 2026

Purpose

  • Cache the fwd_only trace result in VllmIRLoweringPass to avoid re-tracing the same impl_fn + arg shapes across subgraphs
  • Use match.replace_with_graph with cached GraphModule instead of match.replace_by_example which re-traces every time
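The caching idea behind these two bullets can be sketched in plain Python (names like `LoweringPassSketch` and `rms_norm_impl` are illustrative, not the actual vLLM API): memoize the expensive trace per `(impl_fn, argument-shape metadata)` so repeated subgraph matches reuse one traced graph instead of re-tracing.

```python
# Hedged sketch of the PR's caching scheme, not the real implementation.
class LoweringPassSketch:
    def __init__(self):
        self._replacement_cache = {}  # (impl_fn, arg_meta) -> traced graph
        self.trace_count = 0          # for demonstration only

    def _trace(self, impl_fn, args):
        # Stand-in for the expensive fwd_only trace that builds a GraphModule.
        self.trace_count += 1
        return ("traced", impl_fn.__name__)

    def get_replacement(self, impl_fn, args):
        # Shapes (not tensor values) decide whether a traced graph is reusable.
        arg_meta = tuple(tuple(a) for a in args)  # e.g. shapes as tuples
        key = (impl_fn, arg_meta)
        if key not in self._replacement_cache:
            self._replacement_cache[key] = self._trace(impl_fn, args)
        return self._replacement_cache[key]

def rms_norm_impl(x):  # hypothetical impl_fn
    return x

pass_ = LoweringPassSketch()
pass_.get_replacement(rms_norm_impl, [(2048, 4096)])  # miss: traces once
pass_.get_replacement(rms_norm_impl, [(2048, 4096)])  # hit: reuses the graph
```

With this shape-keyed cache, the second and later `rms_norm` matches with identical argument shapes skip tracing entirely, which is what the timings below demonstrate.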

Test Plan

VLLM_LOGGING_LEVEL=DEBUG vllm serve Qwen/Qwen3-Omni-30B-A3B-Instruct

Test Result

Cache miss (first subgraph): ~30 ms

Cache hit (subsequent subgraphs): ~1.5 ms

(EngineCore pid=634293) DEBUG 04-05 16:53:12 [compilation/passes/vllm_inductor_pass.py:84] NoOpEliminationPass completed in 0.7 ms
(EngineCore pid=634293) DEBUG 04-05 16:53:12 [compilation/passes/vllm_inductor_pass.py:84] PostCleanupPass completed in 1.3 ms
(EngineCore pid=634293) DEBUG 04-05 16:53:12 [compilation/.../ir/lowering_pass.py:110] Traced replacement for rms_norm (cache size: 1)
(EngineCore pid=634293) DEBUG 04-05 16:53:12 [compilation/.../ir/lowering_pass.py:110] Traced replacement for rms_norm (cache size: 2)
(EngineCore pid=634293) DEBUG 04-05 16:53:12 [compilation/.../ir/lowering_pass.py:110] Traced replacement for rms_norm (cache size: 3)
(EngineCore pid=634293) DEBUG 04-05 16:53:12 [compilation/.../ir/lowering_pass.py:171] VllmIRLoweringPass lowered 3 vLLM IR nodes
(EngineCore pid=634293) DEBUG 04-05 16:53:12 [compilation/.../ir/lowering_pass.py:185] Selected implementations: rms_norm=native*3
(EngineCore pid=634293) DEBUG 04-05 16:53:12 [compilation/passes/vllm_inductor_pass.py:84] VllmIRLoweringPass completed in 30.9 ms
(EngineCore pid=634293) DEBUG 04-05 16:53:12 [compilation/passes/vllm_inductor_pass.py:84] PostCleanupPass completed in 1.7 ms
(EngineCore pid=634293) DEBUG 04-05 16:53:12 [compilation/.../utility/fix_functionalization.py:203] De-functionalized 0 nodes, removed 0 nodes
(EngineCore pid=634293) DEBUG 04-05 16:53:12 [compilation/passes/vllm_inductor_pass.py:84] FixFunctionalizationPass completed in 0.2 ms
(APIServer pid=633825) DEBUG 04-05 16:53:13 [v1/engine/utils.py:1122] Waiting for 1 local, 0 remote core engine proc(s) to start.
(EngineCore pid=634293) INFO 04-05 16:53:18 [compilation/backends.py:372] Cache the graph of compile range (1, 2048) for later use
(EngineCore pid=634293) DEBUG 04-05 16:53:18 [compilation/backends.py:377] Store the 0-th graph for compile range(1, 2048) from inductor_standalone via handle ('artifact_compile_range_1_2048_subgraph_0', '/root/.cache/vllm/torch_compile_cache/607d5a75c8/rank_0_0/backbone/artifact_compile_range_1_2048_subgraph_0')
(EngineCore pid=634293) DEBUG 04-05 16:53:18 [compilation/.../utility/noop_elimination.py:105] Removed 0 no-op reshapes and slices
(EngineCore pid=634293) DEBUG 04-05 16:53:18 [compilation/passes/vllm_inductor_pass.py:84] NoOpEliminationPass completed in 0.7 ms
(EngineCore pid=634293) DEBUG 04-05 16:53:18 [compilation/passes/vllm_inductor_pass.py:84] PostCleanupPass completed in 1.8 ms
(EngineCore pid=634293) DEBUG 04-05 16:53:18 [compilation/.../ir/lowering_pass.py:171] VllmIRLoweringPass lowered 2 vLLM IR nodes
(EngineCore pid=634293) DEBUG 04-05 16:53:18 [compilation/.../ir/lowering_pass.py:185] Selected implementations: rms_norm=native*2
(EngineCore pid=634293) DEBUG 04-05 16:53:18 [compilation/passes/vllm_inductor_pass.py:84] VllmIRLoweringPass completed in 1.5 ms
(EngineCore pid=634293) DEBUG 04-05 16:53:18 [compilation/passes/vllm_inductor_pass.py:84] PostCleanupPass completed in 2.1 ms
(EngineCore pid=634293) DEBUG 04-05 16:53:18 [compilation/.../utility/fix_functionalization.py:203] De-functionalized 0 nodes, removed 0 nodes
(EngineCore pid=634293) DEBUG 04-05 16:53:18 [compilation/passes/vllm_inductor_pass.py:84] FixFunctionalizationPass completed in 0.4 ms
(EngineCore pid=634293) DEBUG 04-05 16:53:20 [compilation/backends.py:377] Store the 1-th graph for compile range(1, 2048) from inductor_standalone via handle ('artifact_compile_range_1_2048_subgraph_1', '/root/.cache/vllm/torch_compile_cache/607d5a75c8/rank_0_0/backbone/artifact_compile_range_1_2048_subgraph_1')
(EngineCore pid=634293) DEBUG 04-05 16:53:21 [compilation/backends.py:377] Store the 2-th graph for compile range(1, 2048) from inductor_standalone via handle ('artifact_compile_range_1_2048_subgraph_2', '/root/.cache/vllm/torch_compile_cache/607d5a75c8/rank_0_0/backbone/artifact_compile_range_1_2048_subgraph_2')
(EngineCore pid=634293) DEBUG 04-05 16:53:21 [compilation/.../utility/noop_elimination.py:105] Removed 0 no-op reshapes and slices
(EngineCore pid=634293) DEBUG 04-05 16:53:21 [compilation/passes/vllm_inductor_pass.py:84] NoOpEliminationPass completed in 0.7 ms
(EngineCore pid=634293) DEBUG 04-05 16:53:21 [compilation/passes/vllm_inductor_pass.py:84] PostCleanupPass completed in 1.8 ms
(EngineCore pid=634293) DEBUG 04-05 16:53:21 [compilation/.../ir/lowering_pass.py:171] VllmIRLoweringPass lowered 2 vLLM IR nodes
(EngineCore pid=634293) DEBUG 04-05 16:53:21 [compilation/.../ir/lowering_pass.py:185] Selected implementations: rms_norm=native*2
(EngineCore pid=634293) DEBUG 04-05 16:53:21 [compilation/passes/vllm_inductor_pass.py:84] VllmIRLoweringPass completed in 1.4 ms
(EngineCore pid=634293) DEBUG 04-05 16:53:21 [compilation/passes/vllm_inductor_pass.py:84] PostCleanupPass completed in 2.1 ms
(EngineCore pid=634293) DEBUG 04-05 16:53:21 [compilation/.../utility/fix_functionalization.py:203] De-functionalized 0 nodes, removed 0 nodes
(EngineCore pid=634293) DEBUG 04-05 16:53:21 [compilation/passes/vllm_inductor_pass.py:84] FixFunctionalizationPass completed in 0.2 ms
(EngineCore pid=634293) DEBUG 04-05 16:53:22 [compilation/backends.py:377] Store the 4-th graph for compile range(1, 2048) from inductor_standalone via handle ('artifact_compile_range_1_2048_subgraph_4', '/root/.cache/vllm/torch_compile_cache/607d5a75c8/rank_0_0/backbone/artifact_compile_range_1_2048_subgraph_4')
(APIServer pid=633825) DEBUG 04-05 16:53:23 [v1/engine/utils.py:1122] Waiting for 1 local, 0 remote core engine proc(s) to start.
(EngineCore pid=634293) DEBUG 04-05 16:53:25 [compilation/.../utility/noop_elimination.py:105] Removed 0 no-op reshapes and slices
(EngineCore pid=634293) DEBUG 04-05 16:53:25 [compilation/passes/vllm_inductor_pass.py:84] NoOpEliminationPass completed in 0.7 ms
(EngineCore pid=634293) DEBUG 04-05 16:53:25 [compilation/passes/vllm_inductor_pass.py:84] PostCleanupPass completed in 0.2 ms
(EngineCore pid=634293) DEBUG 04-05 16:53:25 [compilation/.../ir/lowering_pass.py:171] VllmIRLoweringPass lowered 0 vLLM IR nodes
(EngineCore pid=634293) DEBUG 04-05 16:53:25 [compilation/.../ir/lowering_pass.py:185] Selected implementations:
(EngineCore pid=634293) DEBUG 04-05 16:53:25 [compilation/passes/vllm_inductor_pass.py:84] VllmIRLoweringPass completed in 0.4 ms
(EngineCore pid=634293) DEBUG 04-05 16:53:25 [compilation/passes/vllm_inductor_pass.py:84] PostCleanupPass completed in 0.2 ms
(EngineCore pid=634293) DEBUG 04-05 16:53:25 [compilation/.../utility/fix_functionalization.py:203] De-functionalized 0 nodes, removed 0 nodes

Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

Signed-off-by: gcanlin <canlinguosdu@gmail.com>
@gcanlin
Contributor Author

gcanlin commented Apr 5, 2026

@ProExpertProg Could you help review it? I’m still new to vLLM IR, so some parts of the code may not be fully thought through. Appreciate your understanding.

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a caching mechanism for traced replacement graphs in the VllmIRLoweringPass, avoiding redundant tracing of the same implementations. It adds a _replacement_cache and helper methods that generate hashable metadata from operation arguments. One critical issue: the _make_arg_meta implementation lacks support for unhashable types such as lists and symbolic types (e.g., torch.SymInt), which will cause a TypeError when such values are used in cache keys.
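The reviewer's concern can be sketched as a recursive normalizer (the name `make_arg_meta` mirrors the PR, but this is an illustrative fix, not the actual code): convert containers to tuples and reduce symbolic sizes to a stable marker so every cache key is hashable. The SymInt check is duck-typed here so the sketch runs without torch.

```python
# Hedged sketch: hashable cache-key metadata that handles the cases
# the review flagged (lists, dicts, symbolic ints).
def make_arg_meta(val):
    if isinstance(val, (list, tuple)):
        return tuple(make_arg_meta(v) for v in val)   # lists -> tuples
    if isinstance(val, dict):
        return tuple(sorted((k, make_arg_meta(v)) for k, v in val.items()))
    if type(val).__name__ == "SymInt":                # symbolic size (torch.SymInt)
        return ("sym", str(val))                      # stable, hashable marker
    return val

key = make_arg_meta([3, [4, 5], {"eps": 1e-6}])
hash(key)  # the raw list would raise TypeError here
```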

Comment thread vllm/compilation/passes/ir/lowering_pass.py
Signed-off-by: gcanlin <canlinguosdu@gmail.com>
@github-project-automation github-project-automation bot moved this to Todo in vLLM IR Apr 6, 2026
@ProExpertProg ProExpertProg moved this from Todo to In review in vLLM IR Apr 6, 2026
@ProExpertProg ProExpertProg added the vllm-ir vLLM IR: intermediate representation and kernel registration label Apr 6, 2026
Collaborator

@ProExpertProg ProExpertProg left a comment


Thanks for implementing this! Looks good overall, would be nice to get eyes from the PyTorch side.

I think we should wait for #36816 and #36823 to merge first (aiming for this week)

@@ -87,17 +134,33 @@ def lower_matched_op(self, match: Match, *args, **kwargs):

# replace_by_example wants node args, not the fake tensors
Collaborator


Fix this comment: move it below and mention that replace_with_graph requires node args

)(self.lower_matched_op)

@staticmethod
def _make_arg_meta(val: Any) -> Any:
Collaborator


@BoyuanFeng can you check the caching logic?

Return a cached traced replacement graph, or trace and cache a new one.
"""
cache_key = (
impl_fn,
Collaborator


Is it safe to use the impl_fn? Or should we use the IR op's name and provider instead?
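The question can be made concrete with a small sketch (function names are hypothetical): if impl_fn is a freshly created closure or bound method at each match, the function object differs every time and the cache never hits, whereas a key built from the IR op's name and provider stays stable.

```python
# Hedged illustration of keying by function object vs. by op identity.
def make_key_by_fn(impl_fn, arg_meta):
    return (impl_fn, arg_meta)

def make_key_by_op(op_name, provider, arg_meta):
    return (op_name, provider, arg_meta)

def fresh_impl():  # simulates re-creating impl_fn per match
    return lambda x: x

k1 = make_key_by_fn(fresh_impl(), ((2048, 4096),))
k2 = make_key_by_fn(fresh_impl(), ((2048, 4096),))
# k1 != k2: distinct closure objects, so the cache would miss

k3 = make_key_by_op("rms_norm", "native", ((2048, 4096),))
k4 = make_key_by_op("rms_norm", "native", ((2048, 4096),))
# k3 == k4: stable identity, so the cache hits
```

If vLLM IR guarantees impl_fn is a single long-lived object per (op, provider), keying on it is safe; otherwise the (name, provider) key is the more robust choice.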


Labels

vllm-ir vLLM IR: intermediate representation and kernel registration

Projects

Status: In review

2 participants