[vLLM IR] Cache the fx_replacement to avoid re-tracing the same impl#39034

Open
gcanlin wants to merge 2 commits into vllm-project:main from gcanlin:ir-fx-cache

Conversation

@gcanlin
Contributor

@gcanlin gcanlin commented Apr 5, 2026

Purpose

  • Cache the fwd_only trace result in VllmIRLoweringPass to avoid re-tracing the same impl_fn + arg shapes across subgraphs
  • Use match.replace_with_graph with cached GraphModule instead of match.replace_by_example which re-traces every time
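The caching idea behind these two bullets can be sketched in plain Python (names like `LoweringPassSketch` and `rms_norm_impl` are illustrative, not the actual vLLM API): memoize the expensive trace per `(impl_fn, argument-shape metadata)` so repeated subgraph matches reuse one traced graph instead of re-tracing.

```python
# Hedged sketch of the PR's caching scheme, not the real implementation.
class LoweringPassSketch:
    def __init__(self):
        self._replacement_cache = {}  # (impl_fn, arg_meta) -> traced graph
        self.trace_count = 0          # for demonstration only

    def _trace(self, impl_fn, args):
        # Stand-in for the expensive fwd_only trace that builds a GraphModule.
        self.trace_count += 1
        return ("traced", impl_fn.__name__)

    def get_replacement(self, impl_fn, args):
        # Shapes (not tensor values) decide whether a traced graph is reusable.
        arg_meta = tuple(tuple(a) for a in args)  # e.g. shapes as tuples
        key = (impl_fn, arg_meta)
        if key not in self._replacement_cache:
            self._replacement_cache[key] = self._trace(impl_fn, args)
        return self._replacement_cache[key]

def rms_norm_impl(x):  # hypothetical impl_fn
    return x

pass_ = LoweringPassSketch()
pass_.get_replacement(rms_norm_impl, [(2048, 4096)])  # miss: traces once
pass_.get_replacement(rms_norm_impl, [(2048, 4096)])  # hit: reuses the graph
```

With this shape-keyed cache, the second and later `rms_norm` matches with identical argument shapes skip tracing entirely, which is what the timings below demonstrate.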

Test Plan

VLLM_LOGGING_LEVEL=DEBUG vllm serve Qwen/Qwen3-Omni-30B-A3B-Instruct

Test Result

Cache miss (first subgraph): ~30 ms

Cache hit (subsequent subgraphs): ~1.5 ms

(EngineCore pid=634293) DEBUG 04-05 16:53:12 [compilation/passes/vllm_inductor_pass.py:84] NoOpEliminationPass completed in 0.7 ms
(EngineCore pid=634293) DEBUG 04-05 16:53:12 [compilation/passes/vllm_inductor_pass.py:84] PostCleanupPass completed in 1.3 ms
(EngineCore pid=634293) DEBUG 04-05 16:53:12 [compilation/.../ir/lowering_pass.py:110] Traced replacement for rms_norm (cache size: 1)
(EngineCore pid=634293) DEBUG 04-05 16:53:12 [compilation/.../ir/lowering_pass.py:110] Traced replacement for rms_norm (cache size: 2)
(EngineCore pid=634293) DEBUG 04-05 16:53:12 [compilation/.../ir/lowering_pass.py:110] Traced replacement for rms_norm (cache size: 3)
(EngineCore pid=634293) DEBUG 04-05 16:53:12 [compilation/.../ir/lowering_pass.py:171] VllmIRLoweringPass lowered 3 vLLM IR nodes
(EngineCore pid=634293) DEBUG 04-05 16:53:12 [compilation/.../ir/lowering_pass.py:185] Selected implementations: rms_norm=native*3
(EngineCore pid=634293) DEBUG 04-05 16:53:12 [compilation/passes/vllm_inductor_pass.py:84] VllmIRLoweringPass completed in 30.9 ms
(EngineCore pid=634293) DEBUG 04-05 16:53:12 [compilation/passes/vllm_inductor_pass.py:84] PostCleanupPass completed in 1.7 ms
(EngineCore pid=634293) DEBUG 04-05 16:53:12 [compilation/.../utility/fix_functionalization.py:203] De-functionalized 0 nodes, removed 0 nodes
(EngineCore pid=634293) DEBUG 04-05 16:53:12 [compilation/passes/vllm_inductor_pass.py:84] FixFunctionalizationPass completed in 0.2 ms
(APIServer pid=633825) DEBUG 04-05 16:53:13 [v1/engine/utils.py:1122] Waiting for 1 local, 0 remote core engine proc(s) to start.
(EngineCore pid=634293) INFO 04-05 16:53:18 [compilation/backends.py:372] Cache the graph of compile range (1, 2048) for later use
(EngineCore pid=634293) DEBUG 04-05 16:53:18 [compilation/backends.py:377] Store the 0-th graph for compile range(1, 2048) from inductor_standalone via handle ('artifact_compile_range_1_2048_subgraph_0', '/root/.cache/vllm/torch_compile_cache/607d5a75c8/rank_0_0/backbone/artifact_compile_range_1_2048_subgraph_0')
(EngineCore pid=634293) DEBUG 04-05 16:53:18 [compilation/.../utility/noop_elimination.py:105] Removed 0 no-op reshapes and slices
(EngineCore pid=634293) DEBUG 04-05 16:53:18 [compilation/passes/vllm_inductor_pass.py:84] NoOpEliminationPass completed in 0.7 ms
(EngineCore pid=634293) DEBUG 04-05 16:53:18 [compilation/passes/vllm_inductor_pass.py:84] PostCleanupPass completed in 1.8 ms
(EngineCore pid=634293) DEBUG 04-05 16:53:18 [compilation/.../ir/lowering_pass.py:171] VllmIRLoweringPass lowered 2 vLLM IR nodes
(EngineCore pid=634293) DEBUG 04-05 16:53:18 [compilation/.../ir/lowering_pass.py:185] Selected implementations: rms_norm=native*2
(EngineCore pid=634293) DEBUG 04-05 16:53:18 [compilation/passes/vllm_inductor_pass.py:84] VllmIRLoweringPass completed in 1.5 ms
(EngineCore pid=634293) DEBUG 04-05 16:53:18 [compilation/passes/vllm_inductor_pass.py:84] PostCleanupPass completed in 2.1 ms
(EngineCore pid=634293) DEBUG 04-05 16:53:18 [compilation/.../utility/fix_functionalization.py:203] De-functionalized 0 nodes, removed 0 nodes
(EngineCore pid=634293) DEBUG 04-05 16:53:18 [compilation/passes/vllm_inductor_pass.py:84] FixFunctionalizationPass completed in 0.4 ms
(EngineCore pid=634293) DEBUG 04-05 16:53:20 [compilation/backends.py:377] Store the 1-th graph for compile range(1, 2048) from inductor_standalone via handle ('artifact_compile_range_1_2048_subgraph_1', '/root/.cache/vllm/torch_compile_cache/607d5a75c8/rank_0_0/backbone/artifact_compile_range_1_2048_subgraph_1')
(EngineCore pid=634293) DEBUG 04-05 16:53:21 [compilation/backends.py:377] Store the 2-th graph for compile range(1, 2048) from inductor_standalone via handle ('artifact_compile_range_1_2048_subgraph_2', '/root/.cache/vllm/torch_compile_cache/607d5a75c8/rank_0_0/backbone/artifact_compile_range_1_2048_subgraph_2')
(EngineCore pid=634293) DEBUG 04-05 16:53:21 [compilation/.../utility/noop_elimination.py:105] Removed 0 no-op reshapes and slices
(EngineCore pid=634293) DEBUG 04-05 16:53:21 [compilation/passes/vllm_inductor_pass.py:84] NoOpEliminationPass completed in 0.7 ms
(EngineCore pid=634293) DEBUG 04-05 16:53:21 [compilation/passes/vllm_inductor_pass.py:84] PostCleanupPass completed in 1.8 ms
(EngineCore pid=634293) DEBUG 04-05 16:53:21 [compilation/.../ir/lowering_pass.py:171] VllmIRLoweringPass lowered 2 vLLM IR nodes
(EngineCore pid=634293) DEBUG 04-05 16:53:21 [compilation/.../ir/lowering_pass.py:185] Selected implementations: rms_norm=native*2
(EngineCore pid=634293) DEBUG 04-05 16:53:21 [compilation/passes/vllm_inductor_pass.py:84] VllmIRLoweringPass completed in 1.4 ms
(EngineCore pid=634293) DEBUG 04-05 16:53:21 [compilation/passes/vllm_inductor_pass.py:84] PostCleanupPass completed in 2.1 ms
(EngineCore pid=634293) DEBUG 04-05 16:53:21 [compilation/.../utility/fix_functionalization.py:203] De-functionalized 0 nodes, removed 0 nodes
(EngineCore pid=634293) DEBUG 04-05 16:53:21 [compilation/passes/vllm_inductor_pass.py:84] FixFunctionalizationPass completed in 0.2 ms
(EngineCore pid=634293) DEBUG 04-05 16:53:22 [compilation/backends.py:377] Store the 4-th graph for compile range(1, 2048) from inductor_standalone via handle ('artifact_compile_range_1_2048_subgraph_4', '/root/.cache/vllm/torch_compile_cache/607d5a75c8/rank_0_0/backbone/artifact_compile_range_1_2048_subgraph_4')
(APIServer pid=633825) DEBUG 04-05 16:53:23 [v1/engine/utils.py:1122] Waiting for 1 local, 0 remote core engine proc(s) to start.
(EngineCore pid=634293) DEBUG 04-05 16:53:25 [compilation/.../utility/noop_elimination.py:105] Removed 0 no-op reshapes and slices
(EngineCore pid=634293) DEBUG 04-05 16:53:25 [compilation/passes/vllm_inductor_pass.py:84] NoOpEliminationPass completed in 0.7 ms
(EngineCore pid=634293) DEBUG 04-05 16:53:25 [compilation/passes/vllm_inductor_pass.py:84] PostCleanupPass completed in 0.2 ms
(EngineCore pid=634293) DEBUG 04-05 16:53:25 [compilation/.../ir/lowering_pass.py:171] VllmIRLoweringPass lowered 0 vLLM IR nodes
(EngineCore pid=634293) DEBUG 04-05 16:53:25 [compilation/.../ir/lowering_pass.py:185] Selected implementations:
(EngineCore pid=634293) DEBUG 04-05 16:53:25 [compilation/passes/vllm_inductor_pass.py:84] VllmIRLoweringPass completed in 0.4 ms
(EngineCore pid=634293) DEBUG 04-05 16:53:25 [compilation/passes/vllm_inductor_pass.py:84] PostCleanupPass completed in 0.2 ms
(EngineCore pid=634293) DEBUG 04-05 16:53:25 [compilation/.../utility/fix_functionalization.py:203] De-functionalized 0 nodes, removed 0 nodes

Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

Signed-off-by: gcanlin <canlinguosdu@gmail.com>
@gcanlin
Contributor Author

gcanlin commented Apr 5, 2026

@ProExpertProg Could you help review it? I’m still new to vLLM IR, so some parts of the code may not be fully thought through. Appreciate your understanding.

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a caching mechanism for traced replacement graphs in the VllmIRLoweringPass, avoiding redundant tracing of the same implementations. It adds a _replacement_cache and helper methods that generate hashable metadata from operation arguments. One critical issue: the _make_arg_meta implementation lacks support for unhashable types such as lists and symbolic types (e.g., torch.SymInt), which will cause a TypeError when such values are used in cache keys.
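The reviewer's concern can be sketched as a recursive normalizer (the name `make_arg_meta` mirrors the PR, but this is an illustrative fix, not the actual code): convert containers to tuples and reduce symbolic sizes to a stable marker so every cache key is hashable. The SymInt check is duck-typed here so the sketch runs without torch.

```python
# Hedged sketch: hashable cache-key metadata that handles the cases
# the review flagged (lists, dicts, symbolic ints).
def make_arg_meta(val):
    if isinstance(val, (list, tuple)):
        return tuple(make_arg_meta(v) for v in val)   # lists -> tuples
    if isinstance(val, dict):
        return tuple(sorted((k, make_arg_meta(v)) for k, v in val.items()))
    if type(val).__name__ == "SymInt":                # symbolic size (torch.SymInt)
        return ("sym", str(val))                      # stable, hashable marker
    return val

key = make_arg_meta([3, [4, 5], {"eps": 1e-6}])
hash(key)  # the raw list would raise TypeError here
```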

Comment thread vllm/compilation/passes/ir/lowering_pass.py
Signed-off-by: gcanlin <canlinguosdu@gmail.com>
@github-project-automation github-project-automation bot moved this to Todo in vLLM IR Apr 6, 2026
@ProExpertProg ProExpertProg moved this from Todo to In review in vLLM IR Apr 6, 2026
@ProExpertProg ProExpertProg added the vllm-ir vLLM IR: intermediate representation and kernel registration label Apr 6, 2026
Collaborator

@ProExpertProg ProExpertProg left a comment


Thanks for implementing this! Looks good overall, would be nice to get eyes from the PyTorch side.

I think we should wait for #36816 and #36823 to merge first (aiming for this week)

@@ -87,17 +134,33 @@ def lower_matched_op(self, match: Match, *args, **kwargs):

# replace_by_example wants node args, not the fake tensors
Collaborator


Fix this comment: move it below and mention that replace_with_graph requires node args

)(self.lower_matched_op)

@staticmethod
def _make_arg_meta(val: Any) -> Any:
Collaborator


@BoyuanFeng can you check the caching logic?

Return a cached traced replacement graph, or trace and cache a new one.
"""
cache_key = (
impl_fn,
Collaborator


Is it safe to use the impl_fn? Or should we use the IR op's name and provider instead?
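The question can be made concrete with a small sketch (function names are hypothetical): if impl_fn is a freshly created closure or bound method at each match, the function object differs every time and the cache never hits, whereas a key built from the IR op's name and provider stays stable.

```python
# Hedged illustration of keying by function object vs. by op identity.
def make_key_by_fn(impl_fn, arg_meta):
    return (impl_fn, arg_meta)

def make_key_by_op(op_name, provider, arg_meta):
    return (op_name, provider, arg_meta)

def fresh_impl():  # simulates re-creating impl_fn per match
    return lambda x: x

k1 = make_key_by_fn(fresh_impl(), ((2048, 4096),))
k2 = make_key_by_fn(fresh_impl(), ((2048, 4096),))
# k1 != k2: distinct closure objects, so the cache would miss

k3 = make_key_by_op("rms_norm", "native", ((2048, 4096),))
k4 = make_key_by_op("rms_norm", "native", ((2048, 4096),))
# k3 == k4: stable identity, so the cache hits
```

If vLLM IR guarantees impl_fn is a single long-lived object per (op, provider), keying on it is safe; otherwise the (name, provider) key is the more robust choice.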


Labels

vllm-ir vLLM IR: intermediate representation and kernel registration

Projects

Status: In review

2 participants