[Bug] Fix torch inductor issue (shape passing through sub-graphs)#30914
[Bug] Fix torch inductor issue (shape passing through sub-graphs)#30914yewentao256 wants to merge 5 commits into
Conversation
Signed-off-by: yewentao256 <zhyanwentao@126.com>
There was a problem hiding this comment.
Code Review
This pull request addresses a torch.compile issue in Qwen2MoeSparseMoeBlock by refactoring how input tensor shapes are handled. Instead of saving the original shape and using view() at the end, it now uses a boolean flag is_input_1d to conditionally squeeze() the output tensor. This is a good approach to make the code more friendly to torch.compile. I've found one potential issue with the new assertion which could lead to a runtime error.
|
do you get reasonable output? @BoyuanFeng tried this but there was still CG address issues |
|
@LucasWilkinson Thanks for the feedback, I see the issue, which will trigger in debug mode. It is a cuda address replay issue, I will fix it tomorrow. |
There was a problem hiding this comment.
I think we need to audit:
step3_text.py
qwen3_next.py
phimoe.py
mixtral.py
olmoe.py
lfm2_moe.py
ernie45_moe.py
dbrx.py
granitemoe.py
hunyuan_v1.py
grok1.py
jamba.py
flex_olmo.py
ernie45_vl_moe.py
for the same issues
There was a problem hiding this comment.
alot of these models are old-ish or probably too small to be used with wideEP so I think it would probably be good enough if we just audit
step3_text.py
qwen3_next.py
granitemoe.py
aeeb81d to
546bad2
Compare
|
This pull request has merge conflicts that must be resolved before it can be |
ProExpertProg
left a comment
There was a problem hiding this comment.
Can you describe in more detail why this is needed? We're adding a lot of complexity (and potential bugs) so I want to understand the use case.
Also, I thought there was existing logic to copy inputs into cudagraph addresses, can we reuse that?
My understanding of the issue is that the output of the fused moe op is outside the cudagraph and that's why we have to copy the tensor into the cudagraph address. What we could do instead is make sure that the output address of the Moe op appears in the previous cudagraph/compiled graph; that's what we do for attention. Could we try that as well?
cc @youkaichao @BoyuanFeng as well for review.
|
For the changing address in moe, this pr copies it to static tensor address in Another (probably better) way is to modify the moe op such that the output tensors have the same address as the input tensors. Because input tensor comes from previous piecewise cudagraph and has static address, this guarantees that the output tensor has static address. For example, in attention op, instead of |
f1b6e46 to
e877962
Compare
Signed-off-by: yewentao256 <zhyanwentao@126.com>
e877962 to
d9cba20
Compare
Signed-off-by: yewentao256 <zhyanwentao@126.com>
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
|
Hi @LucasWilkinson @ProExpertProg @BoyuanFeng Thanks for the review — after digging deeper, we realized there are two separate issues: Given the complexity/risk of (2), I’ve reverted the current CUDA-graph input-copy change from this PR and will keep this PR scoped to (1) only. For (2), Two possible directions:
I do prefer MOE_out but seems really a lot of changes, so perhaps discussing in another issue/PR |
|
Thanks for clarifying! For issue 1), could we just update the Dynamo/FX partitioning logic like described in #31043: basically make sure that size is never passed through subgraph boundaries? That way we don't have to update model definitions for that? |
|
The original cause is that DeepEPHT doesn't support MOE with cudagraph, after consideration, I don't see the benefits of supporting this feature specifically for DeepEPHT, closing this PR |
Purpose
Context: https://vllm-dev.slack.com/archives/C08U97ZRC0J/p1765934670913979
VLLM_ALL2ALL_BACKEND="deepep_high_throughput" VLLM_USE_DEEP_GEMM=1 VLLM_LOGGING_LEVEL="debug" python3 examples/offline_inference/data_parallel.py --model Qwen/Qwen1.5-MoE-A2.7B --tp-size=1 --dp-size=2There are two issues we found:
Note: For part 2, it involves a lot of change in the design, so let's put this in another PR. This PR focus on the first issue.
Basically, we can 1. copy through changed address; 2. refactor moe to support a output tensor.
Test
Originally:
Now everything fixed