[compile][graph_partition] Add tensor size handling #36038
vllm-bot merged 1 commit into vllm-project:main
Conversation
Code Review
This pull request introduces an effective optimization by adding a pre-pass to split_graph that repositions sym_size.int nodes. This change prevents unnecessary tensor propagation across subgraph boundaries, which should improve memory efficiency. The implementation is clean and the accompanying tests are relevant. I've identified a minor issue in one of the new tests where an assertion was missing and have suggested a fix. Overall, this is a solid contribution.
Code Review
This pull request introduces an effective optimization by adding a pre-pass to split_graph that moves sym_size.int operations into the producer subgraph. This prevents tensors from being unnecessarily passed across subgraph boundaries just for shape information, which should improve memory efficiency during compilation. The implementation is clean and the new tests correctly validate the core logic. I've included one suggestion to strengthen a check in the tests to make it an explicit assertion, improving its robustness.
Code Review
This pull request introduces an important optimization by moving sym_size.int nodes into the producer subgraph during graph partitioning. This prevents tensors from being unnecessarily passed to consumer subgraphs just for shape information, improving memory efficiency. The implementation in vllm/compilation/backends.py is clean and follows FX best practices. The accompanying tests are mostly thorough, though I've pointed out a small issue in one of the new test cases where a verification loop is ineffective and should be removed.
This pull request has merge conflicts that must be resolved before it can be merged.
@fxdawnn I don't think this PR is solving the right problem. The problem is when we have a sym_size node in the graph, not a sym_size.int node, e.g.:

```python
#!/usr/bin/env python
import torch
import torch.fx as fx
from torch._inductor import standalone_compile
from vllm.compilation.backends import split_graph

captured_graph = None

def capturing_backend(gm: fx.GraphModule, example_inputs: list) -> fx.GraphModule:
    global captured_graph
    captured_graph = gm
    return gm

def model_fn(x: torch.Tensor) -> torch.Tensor:
    shape = x.shape
    x = torch.ops.aten.sigmoid.default(x)
    x = x.clone().view(shape)
    return x

x = torch.randn(4, 8)
torch._dynamo.mark_dynamic(x, 0)
compiled_fn = torch.compile(model_fn, backend=capturing_backend)
compiled_fn(x)

split_gm, split_items = split_graph(captured_graph, ["aten::sigmoid"])
assert len(split_items) == 3

# the shape error
submod_0 = split_gm.submod_0
print(submod_0)
example_input = torch.randn(4, 8)
compiled = standalone_compile(
    submod_0, [example_input, 4], dynamic_shapes="from_example_inputs"
)
```
This method decomposes size() into a list of valid SymInt/int inputs. That costs less memory than adding the tensor as an input to every subgraph that uses it. The trade-off for the memory saving is runtime, so we measured the runtime overhead of the torch.size() decomposition across the major models. After benchmarking on Llama/OpenAI/ZAI/Mistral, the runtime overhead is minimal (all below 10ms, some under 1ms, on 8x H100).
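The decomposition trade-off described above can be sketched in plain Python. This is an illustrative model only, not the vLLM implementation: the `Dim` class and `decompose_size` helper are hypothetical stand-ins for `sym_size.int` FX nodes and the real pre-pass.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Dim:
    """Hypothetical stand-in for a sym_size.int(x, dim) FX node."""
    tensor: str
    index: int

def decompose_size(tensor: str, shape, dynamic_dims):
    """Flatten one size() result into per-dimension scalar values.

    Dynamic dims become symbolic per-dim references (cheap scalar
    inputs to a subgraph); static dims are inlined as literal ints,
    so no torch.Size tuple ever has to cross a subgraph boundary.
    """
    out = []
    for i, extent in enumerate(shape):
        if i in dynamic_dims:
            out.append(Dim(tensor, i))  # like sym_size.int(x, i)
        else:
            out.append(extent)          # inlined constant
    return out

# A (batch, 2048) activation whose batch dimension is dynamic:
print(decompose_size("x", (4, 2048), dynamic_dims={0}))
# [Dim(tensor='x', index=0), 2048]
```

Only the dynamic batch dimension survives as a symbolic input; the static 2048 is inlined, which is why the extra inputs stay cheap.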
vllm/compilation/backends.py
Outdated
    - Dynamic dims (SymInt) → new sym_size.int node
    - Static dims (plain int) → inlined as literal constant
    """
    # torch.compile captures x.size()/x.shape as call_method target="size".
nit: "Dynamo captures ..."
vllm/compilation/backends.py
Outdated
    if skip:
        continue
we don't need the skip case if we raise AssertionError, right?
vllm/compilation/backends.py
Outdated
    elif isinstance(arg, (list, tuple)):
        expanded = []
        for a in arg:
            if a is node:
                expanded.extend(dims)
            else:
                expanded.append(a)
        new_args.append(type(arg)(expanded))
I don't think this case can happen?
great catch! tuples are not valid for crossing...
zou3519
left a comment
this lgtm but had some minor comments, please read
Documentation preview: https://vllm--36038.org.readthedocs.build/en/36038/
…ry crossing Signed-off-by: Xiao Fu <xiaofu@meta.com>
Purpose
Fix #31043
Redo of #32747, since there were some issues with the git sign-off
Problem
When using `torch.compile` with dynamic shapes on models that call `x.size()`/`x.shape` before a splitting op (e.g. sigmoid) and use the shape after it, the `torch.Size` object crosses the split boundary as a submodule output. `aot_autograd`/`standalone_compile` cannot handle `torch.Size` as a submodule output: it expects flat tensors and scalars, so compilation fails. Observed in production with MoE models (e.g. DeepSeek) where `torch.Size([s72, 2048])` crossed a split boundary.
Root Cause
`torch.compile` captures `x.size()`/`x.shape` as a `call_method` node with `target="size"`, which returns a `torch.Size` object (a tuple of ints/SymInts). When this node is in the producer subgraph but its consumer (e.g. `view(x, shape)`) is in a later subgraph after a split point, `split_module` threads the `torch.Size` across the boundary. `aot_autograd` then sees `TreeSpec(Size, ...)` in the output spec instead of flat scalars and raises an assertion error.
Fix
Add a pre-pass (`_decompose_size_nodes`) at the start of `split_graph` that decomposes every `x.size()` call into individual `sym_size.int(x, dim)` calls, one per dimension, with each `sym_size.int(x, dim)` node placed in the producer subgraph. `split_module` automatically handles cross-boundary data flow: when it sees a node in subgraph 0 used by a node in subgraph 2, it makes the result an output of subgraph 0, creates a placeholder (input) in subgraph 2, and wires them in the top-level orchestrator. We don't need to manually thread SymInt inputs; `split_module` does this for any scalar or tensor that crosses a boundary.
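The cross-boundary wiring behavior described above can be modeled in a few lines. This is a toy sketch assuming string node names; the real `split_module` operates on `fx.Node` objects, and `wire_partitions` is a hypothetical helper for illustration.

```python
from collections import defaultdict

def wire_partitions(assignment, edges):
    """Given node -> partition assignments and (producer, consumer)
    edges, record every value that crosses a partition boundary as an
    output of the producer's partition and a placeholder input of the
    consumer's partition, mimicking split_module's wiring."""
    outputs = defaultdict(set)
    inputs = defaultdict(set)
    for src, dst in edges:
        p, c = assignment[src], assignment[dst]
        if p != c:  # value crosses a boundary
            outputs[p].add(src)
            inputs[c].add(src)
    return dict(outputs), dict(inputs)

# A sym_size.int result produced in subgraph 0 and consumed by a view
# in subgraph 2 becomes an output of 0 and a placeholder input of 2:
outs, ins = wire_partitions(
    {"sym_size": 0, "view": 2}, [("sym_size", "view")]
)
print(outs, ins)  # {0: {'sym_size'}} {2: {'sym_size'}}
```

The same wiring applies uniformly to any scalar or tensor, which is why placing the `sym_size.int` nodes in the producer subgraph is all the pre-pass has to do.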
sym_size.intnodes are placed right after their tensor operand, sosplit_modulenaturally puts them in the producer subgraph.example_valuemetadata is propagated to each new node so downstream code can introspect placeholder types.Debug logging (
VLLM_LOGGING_LEVEL=DEBUG) prints the graph before and after decomposition.Tests
5 new tests in `tests/compile/test_graph_partition.py`:

- `test_sym_size_whole_shape_boundary`: basic repro of `x.size()` used across a split boundary; validates that `standalone_compile` passes
- `test_symint_crosses_split_boundary`: SymInt placeholders from `mark_dynamic` thread through multiple split boundaries correctly
- `test_shape_boundary_standalone_compile`: repro of the production MoE error (TreeSpec mismatch); validates that the consumer has SymInt placeholders (not static int placeholders) and that `standalone_compile` works
- `test_size_used_in_multiple_consumer_subgraphs`: the same `x.size()` consumed by two subgraphs across two split points; validates functional correctness
- `test_sym_size_metadata_propagated`: `example_value` metadata is set on all new nodes; `standalone_compile` works on every submodule

Compile Time Assurance
Our changes shouldn't increase runtime overhead. To verify this, we benchmarked gpt-oss-120b and llama3-70b before and after the change. The changes in overhead are marginal and can be considered negligible. The TLParse analysis of the decomposition also showed under 10ms consistently across 4 models.
Graph changes