[torch.compile] Speed up MOE handling in forward_context #33184
Merged
zou3519 merged 1 commit into vllm-project:main on Jan 27, 2026
Conversation
This is a follow-up to the comments on vllm-project#32805. It contains the following two perf optimizations:
- We don't need to recompute all of the MOE layer names on every forward pass. Instead, we can gather all of the layer names once when the model is being initialized.
- Stop popping strings from a list. Instead, maintain a counter.

Signed-off-by: Richard Zou <zou3519@gmail.com>
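A minimal sketch of the counter-based approach described above. The class and attribute names here are illustrative only, not vLLM's actual API: the idea is that the layer-name list is built once at model init, and each forward pass walks it with an index instead of mutating it with `pop`.

```python
# Hypothetical sketch, not vLLM's real ForwardContext.
class ForwardContext:
    def __init__(self, all_moe_layer_names: list[str]):
        # Computed once at model init and reused on every forward
        # pass, rather than being rebuilt each time.
        self.all_moe_layer_names = all_moe_layer_names
        # A counter replaces list.pop(0), which is O(n) per call
        # and destroys the list for subsequent forward passes.
        self._moe_layer_index = 0

    def next_moe_layer_name(self) -> str:
        name = self.all_moe_layer_names[self._moe_layer_index]
        self._moe_layer_index += 1
        return name


ctx = ForwardContext(["model.layers.0.mlp", "model.layers.1.mlp"])
assert ctx.next_moe_layer_name() == "model.layers.0.mlp"
assert ctx.next_moe_layer_name() == "model.layers.1.mlp"
```

Because the list is never mutated, it can be shared across forward passes; only the per-pass counter needs resetting.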
Contributor
Code Review
The pull request implements two performance optimizations for MOE handling: pre-computing MOE layer names during model initialization, and using a counter instead of popping from a list in the ForwardContext. The changes match the stated purpose and appear to be correctly implemented across the affected files, reducing redundant computation and list manipulation during the forward pass. No critical or high-severity issues were found.
ProExpertProg approved these changes on Jan 27, 2026
VedantMadane pushed a commit to VedantMadane/vllm that referenced this pull request on Jan 28, 2026
…t#33184) Signed-off-by: Richard Zou <zou3519@gmail.com> Signed-off-by: Vedant Madane <6527493+VedantMadane@users.noreply.github.com>
apd10 pushed a commit to apd10/vllm that referenced this pull request on Jan 31, 2026
…t#33184) Signed-off-by: Richard Zou <zou3519@gmail.com>
khluu pushed a commit that referenced this pull request on Feb 3, 2026
Signed-off-by: Richard Zou <zou3519@gmail.com> (cherry picked from commit d9aa39a)
PiratePai pushed a commit to PiratePai/epd_shm that referenced this pull request on Feb 3, 2026
…t#33184) Signed-off-by: Richard Zou <zou3519@gmail.com> Signed-off-by: Pai <416932041@qq.com>
wangxiyuan added a commit to vllm-project/vllm-ascend that referenced this pull request on Feb 5, 2026
### What this PR does / why we need it?
1. Fix `TypeError: FusedMoEParallelConfig.__init__() missing 1 required positional argument: 'is_sequence_parallel'` due to vllm-project/vllm#32567
2. Fix `TypeError: '>' not supported between instances of 'MagicMock' and 'int'` due to vllm-project/vllm#33035
3. Fix `TypeError: Can't instantiate abstract class AscendMLAImpl with abstract methods forward_mha, forward_mqa` and `AttributeError: 'bool' object has no attribute 'process_weights_after_loading'` due to vllm-project/vllm#33284
4. Fix `'AscendSharedFusedMoE' object has no attribute '_routed_input_transform'` due to vllm-project/vllm#32790
5. Fix `NPUModelRunner._dummy_run() got an unexpected keyword argument 'num_active_loras'` due to vllm-project/vllm#32005
6. Fix the problem caused by `'tuple' object has no attribute 'job_id'` due to vllm-project/vllm#27492
7. Fix the problem that all_moe_layers is not equal to vllm.moe_forward, vllm.moe_forward_shared due to vllm-project/vllm#33184
8. Add a patch to fix the problem "got multiple values for keyword argument 'add_special_tokens'" due to vllm-project/vllm#32863

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?
- vLLM version: v0.15.0
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.15.0

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
Signed-off-by: Meihan-chen <jcccx.cmh@gmail.com>
Signed-off-by: hfadzxy <starmoon_zhang@163.com>
Co-authored-by: wangxiyuan <wangxiyuan1007@gmail.com>
Co-authored-by: hfadzxy <starmoon_zhang@163.com>
veeceey added a commit to veeceey/vllm that referenced this pull request on Feb 7, 2026
Avoid hard-coding attention layer name strings into the compiled graph in unified_kv_cache_update. Each layer having a different name prevents Inductor from reusing piecewise graphs across layers, increasing cold start compilation time. Apply the same approach used for MOE layers (vllm-project#32805, vllm-project#33184): store the list of all KV cache update layer names at model init time and resolve them at runtime via a counter in ForwardContext. Fixes vllm-project#33267 Signed-off-by: Varun Chawla <varun_6april@hotmail.com>
chenchuw886 pushed a commit to chenchuw886/vllm-ascend that referenced this pull request on Feb 12, 2026
ItzDEXX pushed a commit to ItzDEXX/vllm that referenced this pull request on Feb 19, 2026
…t#33184) Signed-off-by: Richard Zou <zou3519@gmail.com>
ZRJ026 pushed a commit to ZRJ026/vllm-ascend that referenced this pull request on Feb 28, 2026
maoxx241 pushed a commit to maoxx241/vllm-ascend that referenced this pull request on Mar 2, 2026
ZRJ026 pushed a commit to ZRJ026/vllm-ascend that referenced this pull request on Mar 4, 2026
SongyouZhong pushed a commit to SongyouZhong/vllm that referenced this pull request on Mar 6, 2026
LCAIZJ pushed a commit to LCAIZJ/vllm-ascend that referenced this pull request on Mar 7, 2026
Purpose
This is a follow-up to the comments on #32805. It contains the following two perf optimizations:
- We don't need to recompute all of the MOE layer names on every forward pass. Instead, we gather all of the layer names once when the model is being initialized.
- Stop popping strings from a list. Instead, maintain a counter.
Test Plan & Test Result
I tested this locally. Compilation time remains good and models still produce reasonable results. Also waiting for CI.