
[torch.compile] Move torch.Size producers to consumer subgraph in split_graph #35635

Open
bsherifi wants to merge 12 commits into vllm-project:main from bsherifi:main

Conversation


@bsherifi bsherifi commented Mar 1, 2026

Purpose

split_graph can create submodules where a torch.Size is returned by one submodule and consumed by another. AOTAutograd cannot handle torch.Size as a submodule input/output (TreeSpec mismatch: Size vs. tuple), causing runtime errors during piecewise compilation.

This affects 13+ MoE models (granitemoe, qwen2_moe, mixtral, dbrx, etc.) that follow this pattern:

```python
orig_shape = hidden_states.shape           # Subgraph 0 — produces torch.Size
hidden_states = moe(hidden_states)         # Splitting op — creates boundary
result = hidden_states.view(orig_shape)    # Subgraph 2 — torch.Size crosses boundary
```

This PR adds a second pass after the initial node-to-subgraph assignment that moves torch.Size-producing nodes into the earliest consumer's subgraph when all consumers are in strictly later subgraphs. This follows the same pattern as the existing getitem fix from PR #28533, as suggested by @zou3519.
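The second pass can be sketched as a small stand-alone function over the node-to-subgraph assignment. This is a hypothetical illustration; the function and variable names (`move_size_producers`, `size_producers`, `consumers_of`) are mine, not the actual vLLM implementation.

```python
# Hypothetical sketch of the second pass described above; names are
# illustrative, not the actual vLLM code.

def move_size_producers(node_to_subgraph_id, size_producers, consumers_of):
    """Move each torch.Size-producing node into the subgraph of its
    earliest consumer, but only when every consumer lives in a strictly
    later subgraph than the producer."""
    for node in size_producers:
        producer_sg = node_to_subgraph_id[node]
        consumer_sgs = {node_to_subgraph_id[c] for c in consumers_of.get(node, ())}
        if consumer_sgs and all(sg > producer_sg for sg in consumer_sgs):
            node_to_subgraph_id[node] = min(consumer_sgs)
    return node_to_subgraph_id

# Mirror of the pattern above: `shape` (subgraph 0) feeds `view` (subgraph 2).
assignment = {"shape": 0, "moe": 1, "view": 2}
move_size_producers(assignment, ["shape"], {"shape": ["view"]})
assert assignment["shape"] == 2  # the torch.Size producer now lives with its consumer
```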

Relation to prior PRs

This is a simpler alternative to #32747 and #32960 (both currently stale with merge conflicts). Key differences:

Fixes #31043

Test Plan

```shell
pytest tests/compile/test_graph_partition.py -v
ruff check vllm/compilation/backends.py tests/compile/test_graph_partition.py
ruff format --check vllm/compilation/backends.py tests/compile/test_graph_partition.py
```

Test Result

  • ruff check — all checks passed
  • ruff format --check — both files formatted
  • Added test_torch_size_moved_to_consumer_subgraph that verifies the torch.Size-producing node is moved from the pre-split to the post-split submodule

Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.


github-actions bot commented Mar 1, 2026

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only fastcheck CI runs, a small and essential subset of CI tests that quickly catches errors.

You can ask your reviewers to trigger select CI tests on top of fastcheck CI.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

If you have any questions, please reach out to us on Slack at https://slack.vllm.ai.

🚀

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a fix to prevent torch.Size values from crossing subgraph boundaries during graph splitting, which can cause issues with AOTAutograd. The approach is to move torch.Size-producing nodes to their consumer's subgraph. The implementation and accompanying test correctly handle the case with a single consumer. However, I've identified a potential logic issue where the fix is incomplete for cases with multiple consumers spread across different subgraphs, as a torch.Size may still cross a boundary between later subgraphs. I've provided a detailed comment with a suggestion for a more robust fix.

Comment on lines +452 to +454
```python
# Only move if ALL consumers are in strictly later subgraphs.
if all(sg > producer_subgraph for sg in consumer_subgraphs):
    node_to_subgraph_id[node] = min(consumer_subgraphs)
```

Severity: high

This logic is incomplete for cases where a torch.Size-producing node has multiple consumers located in different subsequent subgraphs.

The current implementation moves the producer node to the subgraph of the earliest consumer (min(consumer_subgraphs)). If other consumers exist in later subgraphs, they will still receive the torch.Size value as an input from the submodule where the producer was moved. This means a torch.Size value would still cross a subgraph boundary, which this PR aims to prevent.

For example:

  • Producer P is in subgraph 0.
  • Consumer C1 is in subgraph 2.
  • Consumer C2 is in subgraph 4.

The current logic will move P to subgraph 2. While this works for C1, C2 in subgraph 4 will now receive P's output from subgraph 2, which is still a torch.Size crossing a boundary.

A more robust approach is to only move the producer if all consumers are in the same subgraph. This is a safer, though more restrictive, fix. Another alternative would be to duplicate the producer node in each consumer's subgraph, which is more complex but would handle all cases correctly.

Suggested change

```python
# Only move if ALL consumers are in strictly later subgraphs.
if all(sg > producer_subgraph for sg in consumer_subgraphs):
    node_to_subgraph_id[node] = min(consumer_subgraphs)
```

```python
# Only move if all consumers are in the same, strictly later subgraph.
if consumer_subgraphs and len(consumer_subgraphs) == 1:
    consumer_sg = next(iter(consumer_subgraphs))
    if consumer_sg > producer_subgraph:
        node_to_subgraph_id[node] = consumer_sg
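The gap the review describes can be shown with a small stand-alone sketch. The node names (`P`, `C1`, `C2`) are hypothetical and taken from the example above; this is not vLLM code.

```python
# Illustration of the reviewer's example (hypothetical node names): the
# min()-based move still leaves torch.Size crossing the 2 -> 4 boundary.

node_to_subgraph_id = {"P": 0, "C1": 2, "C2": 4}
producer_subgraph = node_to_subgraph_id["P"]
consumer_subgraphs = {node_to_subgraph_id[c] for c in ("C1", "C2")}  # {2, 4}

if all(sg > producer_subgraph for sg in consumer_subgraphs):
    node_to_subgraph_id["P"] = min(consumer_subgraphs)  # P moves to subgraph 2

# C2 (subgraph 4) still reads P's output from subgraph 2, so a torch.Size
# value still crosses a subgraph boundary.
assert node_to_subgraph_id["P"] == 2
assert node_to_subgraph_id["P"] != node_to_subgraph_id["C2"]
```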

@ProExpertProg
Collaborator

cc @zou3519 @BoyuanFeng

Contributor

fxdawnn commented Mar 2, 2026

Hi! This is very similar to my earliest PR version, but it does not address the problem with multiple-consumer usage. The latest version of my PR root-caused the problem and placed the symint in the producer graph, since symints are already supported across graph boundaries. [https://github.com//pull/32747]

The second PR I made is unrelated to this fix, which is why it was a separate PR: we didn't do DCE for the redundant input, so I added that PR. [https://github.com//pull/35251]

Author

bsherifi commented Mar 3, 2026

> Hi! This is very similar to my earliest PR version but it does not address the problem with multiple consumers usage. The latest version of my PR root caused the problem and placed the symint in the producer graph since they're already supported with graph boundary crossing. [https://github.com//pull/32747]
>
> The second PR that i made is irrelevant to the fix that is why it was a separate PR. We didn't do DCE for the redundant input so I added that PR. [https://github.com//pull/35251]

Thanks for the context @fxdawnn; I went through the full review history on #32747 and #35251 including the discussions with @zou3519 and @gmagogsfm.

You're right, this has the same multi-consumer limitation your earliest version had. I started from zou3519's suggestion on the issue to extend the getitem pattern from #28533, but after reading through your review threads, it seems the sym_size.int pre-pass is the right call.

Both #32747 and #35251 have merge conflicts right now. If you'd prefer to rebase yours, I'm happy to close this. Otherwise I can update this PR to use your sym_size.int approach (with credit) so the fix gets landed. Either way works for me, just want to get #31043 resolved.
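For readers following along, the sym_size.int idea referenced above can be sketched in miniature. This is a toy illustration with made-up helper names (`producer_subgraph_outputs`, `consumer_view_args`), not vLLM or PyTorch internals: instead of letting a whole torch.Size cross a subgraph boundary, the producer subgraph emits one integer per dimension, since plain ints/SymInts already cross boundaries without TreeSpec issues.

```python
# Toy sketch (hypothetical helpers, not vLLM code) of the sym_size.int
# pre-pass idea: only per-dimension ints cross the subgraph boundary.

def producer_subgraph_outputs(shape):
    """Stand-in for emitting sym_size.int nodes: one int per dimension."""
    return [int(d) for d in shape]

def consumer_view_args(dims):
    """Consumer subgraph reassembles the dims for its view() call."""
    return tuple(dims)

dims = producer_subgraph_outputs((4, 8, 16))  # ints cross the boundary
assert consumer_view_args(dims) == (4, 8, 16)
```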

Contributor

fxdawnn commented Mar 3, 2026

I believe the merge conflicts were already resolved, but the GitHub auto comment wouldn't disappear on my PRs.


mergify bot commented Mar 9, 2026

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @bsherifi.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Mar 9, 2026

Successfully merging this pull request may close these issues.

[BugFix]: move torch.Size across graphs in split_graph