[torch.compile] Move torch.Size producers to consumer subgraph in split_graph #35635
bsherifi wants to merge 12 commits into vllm-project:main
Conversation
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default. You can ask your reviewers to trigger select CI tests, and once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. If you have any questions, please reach out to us on Slack at https://slack.vllm.ai. 🚀
…it_graph Signed-off-by: bsherifi <betimsh99@gmail.com>
Code Review
This pull request introduces a fix to prevent torch.Size values from crossing subgraph boundaries during graph splitting, which can cause issues with AoTAutograd. The approach is to move torch.Size-producing nodes to their consumer's subgraph. The implementation and accompanying test correctly handle the case with a single consumer. However, I've identified a potential logic issue where the fix is incomplete for cases with multiple consumers spread across different subgraphs, as a torch.Size may still cross a boundary between later subgraphs. I've provided a detailed comment with a suggestion for a more robust fix.
```python
# Only move if ALL consumers are in strictly later subgraphs.
if all(sg > producer_subgraph for sg in consumer_subgraphs):
    node_to_subgraph_id[node] = min(consumer_subgraphs)
```
This logic is incomplete for cases where a torch.Size-producing node has multiple consumers located in different subsequent subgraphs.
The current implementation moves the producer node to the subgraph of the earliest consumer (min(consumer_subgraphs)). If other consumers exist in later subgraphs, they will still receive the torch.Size value as an input from the submodule where the producer was moved. This means a torch.Size value would still cross a subgraph boundary, which this PR aims to prevent.
For example:

- Producer `P` is in subgraph 0.
- Consumer `C1` is in subgraph 2.
- Consumer `C2` is in subgraph 4.

The current logic will move `P` to subgraph 2. While this works for `C1`, `C2` in subgraph 4 will now receive `P`'s output from subgraph 2, which is still a `torch.Size` crossing a boundary.
A more robust approach is to only move the producer if all consumers are in the same subgraph. This is a safer, though more restrictive, fix. Another alternative would be to duplicate the producer node in each consumer's subgraph, which is more complex but would handle all cases correctly.
Suggested change:

```diff
-# Only move if ALL consumers are in strictly later subgraphs.
-if all(sg > producer_subgraph for sg in consumer_subgraphs):
-    node_to_subgraph_id[node] = min(consumer_subgraphs)
+# Only move if all consumers are in the same, strictly later subgraph.
+if consumer_subgraphs and len(consumer_subgraphs) == 1:
+    consumer_sg = next(iter(consumer_subgraphs))
+    if consumer_sg > producer_subgraph:
+        node_to_subgraph_id[node] = consumer_sg
```
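To make the difference between the two policies concrete, here is a small standalone sketch applied to the `P`/`C1`/`C2` example above. It is a simplification for illustration only: plain dicts stand in for `torch.fx` nodes, and the function names are invented, not from the PR.

```python
def move_if_all_later(assignment, producers, users):
    """Current policy: move a producer to the earliest consumer's
    subgraph when ALL consumers are in strictly later subgraphs."""
    for node in producers:
        sgs = {assignment[u] for u in users.get(node, ())}
        if sgs and all(sg > assignment[node] for sg in sgs):
            assignment[node] = min(sgs)

def move_if_single_later(assignment, producers, users):
    """Suggested policy: move only when there is exactly one consumer
    subgraph and it is strictly later than the producer's."""
    for node in producers:
        sgs = {assignment[u] for u in users.get(node, ())}
        if len(sgs) == 1:
            (sg,) = sgs
            if sg > assignment[node]:
                assignment[node] = sg

# P in subgraph 0, C1 in subgraph 2, C2 in subgraph 4.
a = {"P": 0, "C1": 2, "C2": 4}
move_if_all_later(a, ["P"], {"P": ["C1", "C2"]})
# a["P"] == 2: the value still crosses the boundary from 2 to 4 for C2.

b = {"P": 0, "C1": 2, "C2": 4}
move_if_single_later(b, ["P"], {"P": ["C1", "C2"]})
# b["P"] == 0: the producer stays put in the ambiguous multi-consumer case.
```

The stricter policy leaves the multi-consumer case untouched rather than half-fixing it, at the cost of not helping `C1` either.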
Hi! This is very similar to my earliest PR version, but it does not address the problem with multiple-consumer usage. The latest version of my PR root-caused the problem and placed the symint in the producer graph, since symints are already supported when crossing graph boundaries. [https://github.com//pull/32747] The second PR I made is unrelated to this fix, which is why it was a separate PR: we didn't do DCE for the redundant input, so I added that PR. [https://github.com//pull/35251]
Thanks for the context @fxdawnn; I went through the full review history on #32747 and #35251, including the discussions with @zou3519 and @gmagogsfm. You're right, this has the same multi-consumer limitation your earliest version had. I started from zou3519's suggestion on the issue to extend the `getitem` pattern from #28533, but after reading through your review threads, it seems the `sym_size.int` pre-pass is the right call. Both #32747 and #35251 have merge conflicts right now. If you'd prefer to rebase yours, I'm happy to close this. Otherwise I can update this PR to use your `sym_size.int` approach (with credit) so the fix gets landed. Either way works for me; I just want to get #31043 resolved.
I believe the merge conflicts were already resolved, but the GitHub auto-comment wouldn't disappear for my PRs.
This pull request has merge conflicts that must be resolved before it can be merged.
Purpose
`split_graph` can create submodules where a `torch.Size` is returned by one submodule and consumed by another. AoTAutograd cannot handle `torch.Size` as a submodule input/output (TreeSpec mismatch: `Size` vs `tuple`), causing runtime errors during piecewise compilation.

This affects 13+ MoE models (granitemoe, qwen2_moe, mixtral, dbrx, etc.) that follow this pattern:
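The original snippet did not survive this capture; below is a minimal hedged sketch of the shape-capture pattern being described. The function and variable names are illustrative, not taken from any of the listed models: the `torch.Size` is produced before the expert call (where `split_graph` may cut the graph) and consumed after it.

```python
import torch

def moe_forward(hidden_states: torch.Tensor) -> torch.Tensor:
    orig_shape = hidden_states.shape  # produces a torch.Size
    flat = hidden_states.view(-1, hidden_states.shape[-1])
    # ... experts run here; piecewise compilation may split the graph
    # at this point, so orig_shape would cross a subgraph boundary ...
    out = flat * 2.0  # stand-in for the expert MLP
    return out.view(orig_shape)  # consumes the torch.Size after the split

x = torch.randn(2, 3, 4)
y = moe_forward(x)
```

When the graph is split between the `view` calls, `orig_shape` becomes an output of one submodule and an input of another, which triggers the TreeSpec mismatch described above.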
Adds a second pass after the initial node-to-subgraph assignment that moves `torch.Size`-producing nodes to the earliest consumer's subgraph when all consumers are in strictly later subgraphs. This follows the same pattern as the existing `getitem` fix from PR #28533, as suggested by @zou3519.

Relation to prior PRs
This is a simpler alternative to #32747 and #32960 (both currently stale with merge conflicts). Key differences:
- Only adjusts `node_to_subgraph_id`, does not erase/create nodes
- Follows the `getitem` hoisting pattern from PR [Bugfix] Eliminate tuple inputs to submodules in graph partitioning #28533, keeping the approach consistent

Fixes #31043
Test Plan
Test Result
- `ruff check`: all checks passed
- `ruff format --check`: both files formatted
- Added `test_torch_size_moved_to_consumer_subgraph`, which verifies the `torch.Size`-producing node is moved from the pre-split to the post-split submodule

Essential Elements of an Effective PR Description Checklist
- Update `supported_models.md` and `examples` for a new model.