[Misc][LLaMa4] Compile LLaMa Vision Encoder #30709

ProExpertProg merged 5 commits into vllm-project:main
Conversation
Code Review
This pull request enables torch.compile for the LLaMa Vision Encoder layers in mllama4 to improve inference performance. The changes primarily involve adapting the model code to be compatible with torch.compile, such as introducing a wrapper for flash attention and decorating vision submodules. A new test is also added to verify that the model runs correctly with compilation enabled. The approach is sound and follows existing patterns in the codebase for torch.compile integration. I have one high-severity suggestion to correct misleading type hints in the new flash attention wrapper function to improve code correctness and maintainability.
Force-pushed from dbdda27 to af16562
vllm/model_executor/models/llama.py (Outdated)

    @@ -407,6 +407,9 @@ def __init__(
            )

    +    def embed_input_ids(self, input_ids: torch.Tensor) -> torch.Tensor:
    +        # Need explicit mark here to avoid recompile from 0/1 spec
    +        # since VocabEmbedding uses a different torch.compile decorator
    +        torch._dynamo.decorators.mark_unbacked(input_ids, 0)
This is needed to avoid a recompile (discovered via tlparse)
And we can't handle this via the decorator and the work @laithsakka has been doing on dynamic/unbacked shapes?
I think it can be handled with the decorator :)
Note for reviewers: This is off by default, and requires `compilation_config={"compile_mm_encoder": True}`.
One more interesting note: since rebasing from last Friday (see the #27900 table), there was a pretty sizable performance dip for the compiled artifact. I know the compiled-ranges work landed in that time, but I'm wondering if there were any other significant backend or code changes that might have caused this.
Hey @Lucaskabela thanks a lot for your work on another mm model!
Have you looked into the MMEncoderAttention CustomOp https://github.com/vllm-project/vllm/blob/main/vllm/attention/layers/mm_encoder_attention.py#L44?
I think it'd be nice to start having a more homogeneous code structure when compiling a new encoder, rather than adding an FA wrapper for each.
@Isotr0py is refactoring this part in #30684, and it should be able to satisfy your use case with q_len != k_len without requiring a separate wrapper.
At the very least, you could re-use the is_rocm+fa_version boilerplate code which is now taken care of in that MMEncoderAttention class.
Note: will wait for #30684 to land, since that is a fairly large refactor of this code (and will eliminate the need for us to add new custom ops).
This pull request has merge conflicts that must be resolved before it can be merged.
Force-pushed from 2be7440 to 38f02d1

Force-pushed from 38f02d1 to 9fb41ce
Signed-off-by: Lucas Kabela <lucaskabela@meta.com>
Force-pushed from 3d612d9 to e1e0f0a
Hi @Lucaskabela, the pre-commit checks have failed. Please run:

    uv pip install pre-commit
    pre-commit install
    pre-commit run --all-files

Then, commit the changes and push to your branch.
Force-pushed from 84c9260 to 8935253
    scale: float | None,
    cu_seqlens: torch.Tensor | None,
    max_seqlen: torch.Tensor | None,

Suggested change:

    scale: float | None = None,

needs to match the other API (needs `= None`)
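The reason the fake variant needs the same defaults: if the real op can be called without `scale` but the fake (shape-propagation) overload cannot, the two diverge on the same call site. A minimal stdlib-only sketch of the invariant, with hypothetical wrapper names modeled loosely on the ones in this PR:

```python
import inspect

# Hypothetical "real" wrapper: scale may be omitted by callers.
def flash_attn_wrapper(q, k, v, scale=None, cu_seqlens=None, max_seqlen=None):
    scale = 1.0 if scale is None else scale
    return [scale * x for x in q]  # stand-in for the real attention kernel

# The fake (shape-propagation) variant must accept every call pattern the
# real wrapper does, so its defaults have to match (hence "= None" above).
def flash_attn_wrapper_fake(q, k, v, scale=None, cu_seqlens=None, max_seqlen=None):
    return list(q)  # only shapes/structure matter for the fake

# Check the invariant: parameter names and defaults line up exactly.
real = inspect.signature(flash_attn_wrapper).parameters
fake = inspect.signature(flash_attn_wrapper_fake).parameters
assert [(n, p.default) for n, p in real.items()] == \
       [(n, p.default) for n, p in fake.items()]
```

The same signature-mirroring rule applies to real custom ops registered with a fake implementation: any default that exists on one side must exist on the other.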
Force-pushed from 8935253 to 95c4616
Head branch was pushed to by a user without write access
Signed-off-by: dsuhinin <suhinin.dmitriy@gmail.com>
Purpose
We want to speed up inference for mllama4 by applying `torch.compile` to the intensive workload, similar to what is done in #23207. We start by enabling the VisionEncoder + PixelShuffle.

Test Plan
Unit Test
Result:
Offline Test
With `compilation_config={"compile_mm_encoder": True}` monkey-patched into `EngineArgs`.

Results in:
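For reference, the config used in this offline test is just a small dict; a sketch of its shape and how it might be serialized for a CLI flag (the exact `EngineArgs` plumbing is the monkey-patch described above, and the flag name here is hypothetical):

```python
import json

# Shape of the compilation config from the offline test above; in the PR it
# is monkey-patched into vLLM's EngineArgs rather than passed normally.
compilation_config = {"compile_mm_encoder": True}

# Serialized form, e.g. for a hypothetical --compilation-config CLI flag.
serialized = json.dumps(compilation_config)
```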
Server Benchmark
Test Result
Essential Elements of an Effective PR Description Checklist
`supported_models.md` and `examples` for a new model.
Note

Speeds up mLLaMA4 vision by compiling its encoder path and aligning related infra.

- `Llama4VisionModel` with `support_torch_compile` (dynamic arg dims; gated by `should_torch_compile_mm_vit`), constructed under `set_current_vllm_config` and tagged via `set_model_tag`; runs image embed under `set_forward_context`
- `CompilationConfig.compile_mm_encoder` extended to include mLLaMa4
- `MMEncoderAttention` call; defaults added in `flash_attn_maxseqlen_wrapper_fake`
- `Llama4VisionRotaryEmbedding` reworked to avoid in-place cache updates; `LlamaModel` compile decorator tweaked to reduce recompiles
- `test_mllama4_vit_compilation` added (forked/skipped due to CI constraints)

Written by Cursor Bugbot for commit 31bc1de. This will update automatically on new commits.
Llama4VisionModelwithsupport_torch_compile(dynamic arg dims; gated byshould_torch_compile_mm_vit), constructs underset_current_vllm_configand tags viaset_model_tag; runs image embed underset_forward_contextCompilationConfig.compile_mm_encoderto includemLLaMa4MMEncoderAttentioncall; add defaults inflash_attn_maxseqlen_wrapper_fakeLlama4VisionRotaryEmbeddingto avoid in-place cache updates; tweaksLlamaModelcompile decorator to reduce recompilestest_mllama4_vit_compilation(forked/skipped due to CI constraints)Written by Cursor Bugbot for commit 31bc1de. This will update automatically on new commits. Configure here.