[torch.compile] Improve Cold Start for MoEs #32805
Conversation
Code Review
This pull request introduces a workaround to speed up cold start time for MoE models when using torch.compile. The changes avoid hard-coding layer name strings into the compiled graph by storing them in the ForwardContext and retrieving them at runtime. This relies on the assumption that MoE layers are executed in a fixed order.
The implementation is sound, but I have one suggestion to improve the robustness and type safety of the new get_layer_from_name function. This will help with debugging if the execution-order assumption is ever violated, and it improves code maintainability.
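A minimal sketch of what such a robustness check could look like. The `get_layer_from_name` name comes from this PR, but the error messages, the placeholder `FusedMoE` class, and the exact signature below are illustrative assumptions, not the actual implementation:

```python
# Illustrative sketch only; the real function lives in vllm/forward_context.py.
# `FusedMoE` here is a placeholder standing in for vLLM's MoE layer class.
class FusedMoE:
    pass


def get_layer_from_name(static_forward_context: dict, layer_name: str) -> FusedMoE:
    """Resolve a layer by name, failing loudly if the execution-order
    assumption is violated (e.g. torch.compile reordered the custom ops)."""
    layer = static_forward_context.get(layer_name)
    if layer is None:
        raise AssertionError(
            f"MoE layer {layer_name!r} not found in the forward context; "
            "was the execution order of the custom operators changed?"
        )
    if not isinstance(layer, FusedMoE):
        raise TypeError(
            f"Expected FusedMoE for {layer_name!r}, got {type(layer).__name__}"
        )
    return layer
```

Raising a descriptive error here costs nothing on the happy path but makes an ordering violation immediately diagnosable instead of surfacing as silently corrupted outputs.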
vllm/forward_context.py (outdated)

```python
# There are longer-term solutions, like unwrapping the moe custom operator,
# and/or treating the string as a "symbolic input" to the graph that
# aren't ready yet.
remaining_moe_layers: list[str]
```
@ProExpertProg I could try to clean up how no_compile_layers works (it has both attention layers and MoE layers) by instead:
- making it a list
- popping from it instead of the new remaining_moe_layers list
- deleting the "string" arguments to the moe_forward / moe_forward_shared operators (and maybe unified_attention if it has them too, I don't know)

but I'm not sure how long this change will really stick around. I expect the MoE refactor to expose the MoE internals and obviate this change.
Yeah this isn't meant to be permanent. Do you mind adding a TODO with a link to #31985?
ProExpertProg left a comment:
Looks good, thanks for doing this!
I tried out the PR on my GB200 machine and got an OOM. I don't get an OOM on main, so it seems this PR does cause the memory issue.
Discussed with Woosuk online; it turns out that the model also OOMs on main, so this PR is unrelated.
```python
no_compile_layers = vllm_config.compilation_config.static_forward_context
from vllm.model_executor.layers.fused_moe.layer import FusedMoE

remaining_moe_layers = [
    name for name, layer in no_compile_layers.items() if isinstance(layer, FusedMoE)
]
remaining_moe_layers.reverse()
```
Do we have profiling to show how long this takes? Doing it on every forward pass can be time-consuming.
Maybe we can keep an all_moe_layers list and only increment a current_index inside the forward context.
I can change it to all_moe_layers and current_index. We should be able to cache all_moe_layers on the static_forward_context somewhere so that it also isn't recomputed on each forward pass.
This is a follow-up to the comments on vllm-project#32805. It contains the following two perf optimizations:
- We don't need to recompute all of the MoE layer names on every forward pass; instead, we can collect all of the layer names when the model is being initialized.
- Stop popping strings from a list; instead, maintain a counter.

Signed-off-by: Richard Zou <zou3519@gmail.com>
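The counter-based scheme from the follow-up can be sketched as below. The names `all_moe_layers` and the `next_moe_layer_name` helper follow the discussion above but are simplified assumptions; the real ForwardContext carries much more state:

```python
# Hedged sketch of the counter-based follow-up, not vLLM's actual code.
from dataclasses import dataclass


@dataclass
class ForwardContext:
    # Computed once at model init time, never recomputed per forward pass.
    all_moe_layers: list[str]
    # Reset to 0 at the start of each forward pass.
    moe_layer_index: int = 0

    def next_moe_layer_name(self) -> str:
        # Each MoE custom op calls this in execution order; no string is
        # baked into the compiled graph and no list is mutated.
        name = self.all_moe_layers[self.moe_layer_index]
        self.moe_layer_index += 1
        return name
```

Compared with popping from a reversed copy of the list, this avoids both the per-forward list construction and the per-op list mutation; only an integer changes during the forward pass.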
Avoid hard-coding attention layer name strings into the compiled graph in unified_kv_cache_update. Each layer having a different name prevents Inductor from reusing piecewise graphs across layers, increasing cold start compilation time.

Apply the same approach used for MoE layers (vllm-project#32805, vllm-project#33184): store the list of all KV cache update layer names at model init time and resolve them at runtime via a counter in ForwardContext.

Fixes vllm-project#33267

Signed-off-by: Varun Chawla <varun_6april@hotmail.com>
@zou3519 This PR caused garbled outputs when running the ERNIE-4.5-VL-28B-A3B-PT model. Since it contains two experts but only one expert is used during decode, this leads to misalignment and results in corrupted outputs. Could this be fixed? One possible approach I can think of is to add an input flag to FusedMoE to control this behavior, but that doesn't seem very elegant.
@CSWYF3634076 yes, I believe this PR is the problem. Could you try testing with -cc.fast_moe_cold_start=False as a workaround, please?
@CSWYF3634076 is that an in-tree model? If yes, can the model config init override this field to false? See
@zou3519 @ProExpertProg Thank you for your answers; that worked. I solved the problem by setting the field to false in Ernie4.5-VL directly, so there is no need for users to specify any configuration and it gets fixed without them noticing. Could you please help review #35587?
Purpose
Fixes #29992
For torch.compile cold start times, we need to avoid hard-coding any strings into the graph. Right now, the vllm.moe_forward and vllm.moe_forward_shared custom operators hard-code strings into the graph.
The workaround is to store a list of the strings that each of those custom ops needs, in reverse order, in the ForwardContext. The ForwardContext object is alive for the duration of the forward pass. When the custom op needs the string, pop the string from this list.
This assumes that the custom operators will always be executed in order and that torch.compile will not try to reorder these operations with respect to each other.
There are longer-term solutions, such as unwrapping the moe custom operator and/or treating the string as a "symbolic input" to the graph, that aren't ready yet.
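The pop-based workaround described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the actual vLLM code: the real ForwardContext lives in vllm/forward_context.py, and the layer names below are hypothetical:

```python
# Simplified illustration of the pop-based workaround; names and
# structure are assumptions, not vLLM's actual implementation.
class ForwardContext:
    """Alive for the duration of one forward pass."""

    def __init__(self, moe_layer_names: list[str]):
        # Stored in reverse so that list.pop() (O(1), removes from the
        # end) yields the layers in execution order.
        self.remaining_moe_layers = list(reversed(moe_layer_names))


def moe_forward(ctx: ForwardContext) -> str:
    # Inside the custom op: recover the layer name at runtime instead of
    # reading a string baked into the compiled graph. This relies on the
    # ops executing in a fixed order and never being reordered.
    return ctx.remaining_moe_layers.pop()
```

Because the compiled graph never sees a concrete layer name, Inductor can reuse one compiled artifact across all MoE layers instead of specializing per layer, which is where the cold-start savings come from.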
Test Plan & Test Result
This PR speeds up the torch.compile piece of gpt-oss-120b from 46s to 16s.
I also tested gpt-oss-120b locally with some inputs to sanity check that it was still correct.
Wait for CI