[compile] Allow strings in custom ops without regressing compilation times #38123

Open
zou3519 wants to merge 1 commit into vllm-project:main from zou3519:opaque_everything

Conversation


@zou3519 zou3519 commented Mar 25, 2026

This is a follow-up to #35475 to extend the fix to all custom operators, not just the MOE custom ops.

Previously, string inputs to custom ops would regress compilation times. The problem goes like this:

  • a transformer model (e.g. llama3-70b) has 80 identical layers
  • we capture a full graph and then split it on the attention operations.
  • this produces 81 subgraphs: the middle 79 graphs are all identical (aside from graph inputs - parameters and buffers)
  • vLLM-compile produces 1 compiled artifact for all of the middle 79 subgraphs.
  • If a custom operator with a layer_name string appears in the graph, then this causes the middle 79 subgraphs to now be unique, so vLLM-compile ends up producing 79 compiled artifacts for them.
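The baked-in string can be seen directly in an FX trace. A minimal standalone sketch of the effect (not vLLM code; `fake_custom_op` is a hypothetical stand-in for a `torch.ops.vllm.*` call that takes a `layer_name`):

```python
import torch
import torch.fx as fx

@fx.wrap  # keep the call as a single graph node instead of tracing into it
def fake_custom_op(x, layer_name):
    # stand-in for a custom op that receives a layer_name string
    return x.relu()

def make_layer(name):
    def layer(x):
        return fake_custom_op(x, name)
    return layer

# Two "identical" layers that differ only in their layer_name string:
g0 = fx.symbolic_trace(make_layer("model.layers.0.attn"))
g1 = fx.symbolic_trace(make_layer("model.layers.1.attn"))

# Each string is baked into its graph as a constant, so the graphs are
# no longer identical and cannot share one compiled artifact:
assert g0.code != g1.code
```

With 79 middle layers, this is 79 distinct graphs where there would otherwise be one.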

In PyTorch 2.11, we have added a special class (the OpaqueObject type). The idea is that instead of passing strings to custom operators, we can pass a special LayerName object to the custom operator. This signals to the compiler that it should "lift" the LayerName object into a graph input rather than baking the value directly into the graph.

More things:

  • LayerName used to be named ModuleName. I did the rename in this PR too.
  • VLLM_USE_LAYERNAME=0 can be used to turn this off if something goes wrong. I don't expect anything to go wrong. I'll remove this in < 1 month.
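A sketch of how such an env-var kill switch is typically read (`use_layer_name_objects` is an assumed helper name; the real vLLM envs plumbing may differ):

```python
import os

def use_layer_name_objects() -> bool:
    # VLLM_USE_LAYERNAME=0 falls back to passing plain strings;
    # any other value (or leaving it unset) enables the LayerName path.
    return os.environ.get("VLLM_USE_LAYERNAME", "1") != "0"
```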


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a ModuleName opaque type to improve torch.compile behavior by hoisting layer names as graph inputs, preventing per-layer recompilation for custom operations. This change involves updating various torch.ops.vllm calls across attention, KV cache, and Mamba mixer modules to use _encode_layer_name and _resolve_layer_name for consistent handling of layer names. A critical issue was identified where the kv_cache_dummy_dep variable could be undefined, leading to an UnboundLocalError.

@zou3519 zou3519 force-pushed the opaque_everything branch from 35626a5 to 99e2154 on March 25, 2026 at 17:30
    qkvz_output_size: int,
    ba_output_size: int,
-   layer_name: str,
+   layer_name: _layer_name_type,
Collaborator


Nit: can we call this LayerName or LayerNameType?

Collaborator Author


Updated


mergify bot commented Mar 26, 2026

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @zou3519.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Mar 26, 2026
@zou3519 zou3519 force-pushed the opaque_everything branch 2 times, most recently from 807e2e7 to 594eb86 on March 26, 2026 at 14:00
…times

This is a follow-up to vllm-project#35475
to extend the fix to all custom operators, not just the MOE custom ops.

Previously, string inputs to custom ops would regress compilation times.
The problem goes like this:
- a transformer model (e.g. llama3-70b) has 80 identical layers
- we capture a full graph and then split it on the attention
  operations.
- this produces 81 subgraphs: the middle 79 graphs are all identical
  (aside from graph inputs - parameters and buffers)
- vLLM-compile produces 1 compiled artifact for all of the middle 79
  subgraphs.
- If a custom operator with a layer_name string appears in the graph,
  then this causes the middle 79 subgraphs to now be unique, so
  vLLM-compile ends up producing 79 compiled artifacts for them.

In PyTorch 2.11, we have added a special class (the OpaqueObject type).
The idea is that instead of passing strings to custom operators, we can
pass a special LayerName object to the custom operator. This signifies
to the compiler that it should "lift" the LayerName object to being a
graph input and not bake the value directly into the graph.

More notes:
- the LayerName object used to be called ModuleName. I renamed it here;
  LayerName seemed more appropriate.
- VLLM_USE_LAYERNAME=0 turns this feature off. This option is here just
  in case something breaks. I'll probably remove it in the next month.

Signed-off-by: Richard Zou <zou3519@gmail.com>
@zou3519 zou3519 force-pushed the opaque_everything branch from 594eb86 to 04fc04b on March 26, 2026 at 14:21
@mergify mergify bot removed the needs-rebase label Mar 26, 2026

Labels: qwen (Related to Qwen models), ready (ONLY add when PR is ready to merge/full CI is needed)
