
[Misc][LLaMa4] Compile LLaMa Vision Encoder #30709

Merged
ProExpertProg merged 5 commits into vllm-project:main from Lucaskabela:lucaskabela/mllama4_compilation
Jan 10, 2026
Conversation

@Lucaskabela Lucaskabela commented Dec 15, 2025

Purpose

We want to speed up inference for mllama4 by applying torch.compile to its compute-intensive workload, similar to what was done in #23207. We start by enabling compilation of the VisionEncoder + PixelShuffle.

Test Plan

Unit Test

```shell
with-proxy pytest tests/compile/fullgraph/test_multimodal_compile.py::test_mllama4_vit_compilation
```

Result:

 1 passed, 27 warnings in 176.88s (0:02:56) 

Offline Test

```shell
with-proxy VLLM_USE_V1=1 python examples/offline_inference/vision_language.py -m llama4
```

With compilation_config={"compile_mm_encoder": True} monkey-patched into EngineArgs.

Results in

--------------------------------------------------
The image depicts a tower, likely Tokyo Tower, framed by cherry blossoms. The tower is white and has a distinctive shape, with a large sphere at the top and a long, thin spire extending from it. It appears to be made of metal and has a lattice-like structure.

In the foreground, there are
--------------------------------------------------
The image depicts a tower, likely Tokyo Tower, framed by cherry blossoms. The tower is white and has a distinctive shape, with a large sphere at the top and a series of latticework sections below it. It appears to be made of metal and has a tall, slender design.

In the foreground, there are
--------------------------------------------------
The image depicts a serene scene of Tokyo Tower, partially obscured by the vibrant pink blossoms of cherry blossom trees. The tower's white and gold structure is visible through the branches and flowers, set against a clear blue sky.

**Key Features:**

* **Tokyo Tower:** A prominent landmark in Tokyo, Japan,
--------------------------------------------------
The image depicts a serene scene of Tokyo Tower, partially obscured by blooming cherry blossoms. The tower's distinctive shape and structure are visible through the branches of the trees, which are adorned with vibrant pink flowers.

**Key Features:**

* **Tokyo Tower:** A prominent landmark in Tokyo, Japan, known for
--------------------------------------------------

Server Benchmark

```shell
vllm serve meta-llama/Llama-4-Scout-17B-16E-Instruct --tensor-parallel-size=8 --gpu_memory_utilization=.8 --max_model_len=8192 --compilation-config='{"compile_mm_encoder":"true"}'
vllm bench serve --backend openai-chat --model meta-llama/Llama-4-Scout-17B-16E-Instruct --endpoint /v1/chat/completions --dataset-name hf --dataset-path lmarena-ai/VisionArena-Chat --hf-split train --num-prompts 1000
```

Test Result

| Metric | Main | This PR |
|---|---|---|
| Successful requests | 998 | 998 |
| Benchmark duration (s) | 63.54 | 61.71 |
| Total generated tokens | 117050 | 117397 |
| Request throughput (req/s) | 15.71 | 16.17 |
| Output token throughput (tok/s) | 1842.17 | 1909.91 |
| Mean TTFT (ms) | 29224.49 | 28011.36 |
| Mean TPOT (ms) | 240.21 | 231.26 |
| Mean ITL (ms) | 232.45 | 223.91 |


Note

Speeds up mLLaMA4 vision by compiling its encoder path and tightening related infra.

  • Compile Llama4VisionModel (VisionEncoder + PixelShuffleMLP) via support_torch_compile, gated by compile_mm_encoder; tag with set_model_tag and run under set_forward_context
  • Update CompilationConfig.compile_mm_encoder docs to include mLLaMa4; add test test_mllama4_vit_compilation (forked/skipped in CI)
  • Fix arg order/defaults in ViT flash-attn wrapper and its fake impl; plumb args in MMEncoderAttention
  • Optimize Llama4VisionRotaryEmbedding to avoid in-place cache updates; minor embed path change in LlamaModel to prevent recompiles

Written by Cursor Bugbot for commit e1e0f0a. This will update automatically on new commits.
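The conditional-compilation gating described in the note above (compile the vision encoder only when `compile_mm_encoder` is set) can be sketched in plain Python. The names below are illustrative stand-ins, not vLLM's actual `support_torch_compile` implementation:

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class CompilationConfig:
    # Illustrative stand-in for vLLM's CompilationConfig.compile_mm_encoder flag
    compile_mm_encoder: bool = False


def fake_compile(fn: Callable) -> Callable:
    """Stand-in for torch.compile: wraps fn and marks it as compiled."""
    def compiled(*args, **kwargs):
        return fn(*args, **kwargs)
    compiled.is_compiled = True  # marker so we can observe the gating
    return compiled


def support_torch_compile_sketch(config: CompilationConfig):
    """Class decorator that compiles forward() only when the flag is on."""
    def decorate(cls):
        if config.compile_mm_encoder:
            cls.forward = fake_compile(cls.forward)
        return cls
    return decorate


config = CompilationConfig(compile_mm_encoder=True)


@support_torch_compile_sketch(config)
class VisionEncoder:
    def forward(self, x):
        return [v * 2 for v in x]


enc = VisionEncoder()
print(getattr(VisionEncoder.forward, "is_compiled", False))  # True when gated on
print(enc.forward([1, 2, 3]))  # [2, 4, 6]
```

With the flag left at its default (`False`), the decorator is a no-op and the model runs eagerly, which matches the opt-in behavior described for this PR.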



@mergify mergify bot added the llama Related to Llama models label Dec 15, 2025
@Lucaskabela
Copy link
Contributor Author

cc @ywang96 - resubmit of #27900 (previous PR fell a bit out of date so resubmitting for another review)

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request enables torch.compile for the LLaMa Vision Encoder layers in mllama4 to improve inference performance. The changes primarily involve adapting the model code to be compatible with torch.compile, such as introducing a wrapper for flash attention and decorating vision submodules. A new test is also added to verify that the model runs correctly with compilation enabled. The approach is sound and follows existing patterns in the codebase for torch.compile integration. I have one high-severity suggestion to correct misleading type hints in the new flash attention wrapper function to improve code correctness and maintainability.

@Lucaskabela Lucaskabela changed the title from "[Misc][LLaMa4] Compile LLaMa Vision Encoder layers" to "[Misc][LLaMa4] Compile LLaMa Vision Encode" Dec 15, 2025
@Lucaskabela Lucaskabela force-pushed the lucaskabela/mllama4_compilation branch from dbdda27 to af16562 Compare December 15, 2025 22:35
```python
@@ -407,6 +407,9 @@ def __init__(
        )

    def embed_input_ids(self, input_ids: torch.Tensor) -> torch.Tensor:
        # Need explicit mark here to avoid recompile from 0/1 spec
        # since VocabEmbedding uses a different torch.compile decorator
        torch._dynamo.decorators.mark_unbacked(input_ids, 0)
```
@Lucaskabela Lucaskabela commented Dec 15, 2025
This is needed to avoid a recompile (discovered via tlparse)
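For context on the recompile being avoided: Dynamo specializes on sizes 0 and 1 by default, so a batch that happens to have size 1 gets its own graph unless the dimension is marked dynamic/unbacked. The toy model below illustrates that cache behavior in plain Python; it is a conceptual sketch, not torch internals:

```python
# Toy model of Dynamo's 0/1 specialization: sizes 0 and 1 get their own
# cache entries unless the dimension is marked unbacked/dynamic.
compile_cache: dict = {}
compile_count = 0


def toy_compile(fn, batch, *, unbacked=False):
    """Run fn on batch, 'compiling' a new graph per cache key."""
    global compile_count
    n = len(batch)
    # Specialize on the exact size for 0/1 unless the dim is marked unbacked.
    key = "dynamic" if (unbacked or n > 1) else n
    if key not in compile_cache:
        compile_count += 1          # a new graph is compiled
        compile_cache[key] = fn
    return compile_cache[key](batch)


double = lambda xs: [x * 2 for x in xs]

# Without marking: a size-1 input triggers a second compilation.
toy_compile(double, [1, 2, 3])      # compiles the "dynamic" graph
toy_compile(double, [7])            # compiles a specialized size-1 graph
print(compile_count)                # 2

# With the dim marked unbacked, both inputs reuse one graph.
compile_cache.clear()
compile_count = 0
toy_compile(double, [1, 2, 3], unbacked=True)
toy_compile(double, [7], unbacked=True)
print(compile_count)                # 1
```

In the real code, `torch._dynamo.decorators.mark_unbacked(input_ids, 0)` plays the role of `unbacked=True` here, keeping dim 0 symbolic so the size-1 case does not trigger a fresh compile.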

Collaborator
And we can't handle this via the decorator and the work @laithsakka has been doing on dynamic/unbacked shapes?

Contributor Author
I think it can be handled with the decorator :)

@Lucaskabela Lucaskabela changed the title from "[Misc][LLaMa4] Compile LLaMa Vision Encode" to "[Misc][LLaMa4] Compile LLaMa Vision Encoder" Dec 15, 2025
@Lucaskabela
Copy link
Contributor Author

Note for reviewers: this is off by default and requires `compile_mm_encoder: True` to turn on.

@Lucaskabela
Copy link
Contributor Author

One more interesting note: since rebasing from last Friday (see the #27900 table), there was a pretty sizable performance dip for the compiled artifact. I know the compiled-ranges work landed in that window, but I'm wondering whether any other significant backend or code changes might have caused this.

@DarkLight1337
Copy link
Member

cc @ywang96 @Isotr0py @zou3519

@NickLucche NickLucche left a comment (Collaborator)

Hey @Lucaskabela thanks a lot for your work on another mm model!

Have you looked into the MMEncoderAttention CustomOp (https://github.com/vllm-project/vllm/blob/main/vllm/attention/layers/mm_encoder_attention.py#L44)?

I think it'd be nice to start having a more homogeneous code structure when compiling a new encoder, rather than adding an FA wrapper for each.
@Isotr0py is refactoring this part in #30684, and it should be able to satisfy your use case with q_len != k_len without requiring a separate wrapper.
At the very least, you could reuse the is_rocm + fa_version boilerplate code, which is now taken care of in that MMEncoderAttention class.

@Lucaskabela
Copy link
Contributor Author

Note: will wait for #30684 to land since that is a fairly large refactor to this code (and will eliminate the need for us to add new custom ops)

@mergify
Copy link

mergify bot commented Dec 18, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @Lucaskabela.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Dec 18, 2025
@Lucaskabela Lucaskabela marked this pull request as draft December 19, 2025 00:23
@Lucaskabela Lucaskabela force-pushed the lucaskabela/mllama4_compilation branch from 2be7440 to 38f02d1 Compare December 19, 2025 00:29
@mergify mergify bot removed the needs-rebase label Dec 19, 2025
@Lucaskabela Lucaskabela force-pushed the lucaskabela/mllama4_compilation branch from 38f02d1 to 9fb41ce Compare December 19, 2025 00:36
@mergify
Copy link

mergify bot commented Jan 9, 2026

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @Lucaskabela.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Jan 9, 2026
Signed-off-by: Lucas Kabela <lucaskabela@meta.com>
Signed-off-by: Lucas Kabela <lucaskabela@meta.com>
Signed-off-by: Lucas Kabela <lucaskabela@meta.com>
@Lucaskabela Lucaskabela force-pushed the lucaskabela/mllama4_compilation branch from 3d612d9 to e1e0f0a Compare January 9, 2026 23:21
@mergify mergify bot added v1 and removed needs-rebase labels Jan 9, 2026
@mergify
Copy link

mergify bot commented Jan 9, 2026

Hi @Lucaskabela, the pre-commit checks have failed. Please run:

```shell
uv pip install pre-commit
pre-commit install
pre-commit run --all-files
```

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.

Tip

Is mypy or markdownlint failing?
mypy and markdownlint are run differently in CI. If the failure is related to either of these checks, please use the following commands to run them locally:
```shell
# For mypy (substitute "3.10" with the failing version if needed)
pre-commit run --hook-stage manual mypy-3.10
# For markdownlint
pre-commit run --hook-stage manual markdownlint
```

@Lucaskabela Lucaskabela force-pushed the lucaskabela/mllama4_compilation branch from 84c9260 to 8935253 Compare January 9, 2026 23:47

```python
scale: float | None,
cu_seqlens: torch.Tensor | None,
max_seqlen: torch.Tensor | None,
scale: float | None = None,
```
@Lucaskabela (Contributor Author)
This needs to match the other API (the parameter needs a `= None` default).
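Why the fake (meta) implementation must mirror the real op's defaults can be checked mechanically: if the signatures differ, a call that relies on a default works eagerly but breaks when traced against the fake impl. A minimal sketch with hypothetical op names, using `inspect`:

```python
import inspect


# Hypothetical real op and its fake/meta implementation. For custom ops,
# the fake impl must accept exactly the same signature (defaults included).
def flash_attn_wrapper(q, k, v, scale=None, cu_seqlens=None, max_seqlen=None):
    return [qi * (scale or 1.0) for qi in q]  # stand-in for real attention


def flash_attn_wrapper_fake(q, k, v, scale=None, cu_seqlens=None, max_seqlen=None):
    return list(q)  # shape-only stand-in


def bad_fake(q, k, v, scale, cu_seqlens=None, max_seqlen=None):
    # Missing the "= None" default on scale: signature no longer matches.
    return list(q)


def signatures_match(real, fake):
    """Compare parameter names, kinds, and defaults of two callables."""
    return inspect.signature(real) == inspect.signature(fake)


print(signatures_match(flash_attn_wrapper, flash_attn_wrapper_fake))  # True
print(signatures_match(flash_attn_wrapper, bad_fake))                  # False
```

A check like this could run in a unit test to catch real/fake drift early; it is an illustration of the mismatch this commit fixes, not vLLM's actual registration code.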

Signed-off-by: Lucas Kabela <lucaskabela@meta.com>
@Lucaskabela Lucaskabela force-pushed the lucaskabela/mllama4_compilation branch from 8935253 to 95c4616 Compare January 10, 2026 00:05
@ProExpertProg ProExpertProg enabled auto-merge (squash) January 10, 2026 00:51
@github-actions github-actions bot added the ready ONLY add when PR is ready to merge/full CI is needed label Jan 10, 2026
Signed-off-by: Lucas Kabela <lucaskabela@meta.com>
auto-merge was automatically disabled January 10, 2026 00:55

Head branch was pushed to by a user without write access

@ProExpertProg ProExpertProg merged commit ea6d067 into vllm-project:main Jan 10, 2026
64 checks passed
akh64bit pushed a commit to akh64bit/vllm that referenced this pull request Jan 16, 2026
Signed-off-by: Lucas Kabela <lucaskabela@meta.com>
dsuhinin pushed a commit to dsuhinin/vllm that referenced this pull request Jan 21, 2026
Signed-off-by: Lucas Kabela <lucaskabela@meta.com>
Signed-off-by: dsuhinin <suhinin.dmitriy@gmail.com>
@Lucaskabela Lucaskabela deleted the lucaskabela/mllama4_compilation branch February 19, 2026 16:40
ItzDEXX pushed a commit to ItzDEXX/vllm that referenced this pull request Feb 19, 2026
Signed-off-by: Lucas Kabela <lucaskabela@meta.com>

Labels

llama Related to Llama models ready ONLY add when PR is ready to merge/full CI is needed v1
