[Feature] Add LoRA tower/connector support for Llama 4 Vision (mllama4) #35147
Conversation
Code Review
This pull request successfully enables LoRA support for the vision tower and connector of the Llama 4 Vision model. The implementation correctly defines the module mapping for LoRA targeting, using longest-prefix matching to distinguish between tower and connector components. Additionally, it provides the necessary methods to calculate token counts for the vision encoder and connector, accounting for the pixel shuffle and CLS token handling specific to the Llama 4 Vision architecture. The logic aligns with existing patterns in vLLM for multimodal LoRA support.
Implement `get_num_mm_encoder_tokens()` and `get_num_mm_connector_tokens()` for `Llama4ForConditionalGeneration` so LoRA adapters can be applied to the vision encoder (tower) and connector modules. Also update `get_mm_mapping()` to separate `vision_model.vision_adapter` into the connector prefix, since the adapter MLP processes post-pixel-shuffle tokens (a different count from the encoder layers).

Signed-off-by: dorhuri123 <dor.huri1@live.biu.ac.il>
Force-pushed 3aba589 to a53b617.
jeejeelee left a comment:
I assume you have tested this locally.
Yes! Tested end-to-end on 4x H100 80GB with nvidia/Llama-4-Scout-17B-16E-Instruct-FP8 and a LoRA adapter targeting tower, connector, and LM layers. All 3 inference tests passed — baseline and LoRA both produce valid outputs. Full test results are in the PR description.
…4) (vllm-project#35147)
Signed-off-by: dorhuri123 <dor.huri1@live.biu.ac.il>
Co-authored-by: Jee Jee Li <pandaleefree@gmail.com>
Purpose
Enable LoRA adapters for the vision tower and connector of Llama 4 Vision (`Llama4ForConditionalGeneration` in `mllama4.py`), as part of #31479. Previously, LoRA could only be applied to the language model layers. With this change, `--enable-tower-connector-lora` also applies LoRA to:

- the vision tower attention layers (`vision_model.model.layers.*.self_attn`)
- the connector: the vision adapter MLP (`vision_model.vision_adapter.mlp`) and the multi-modal projector (`multi_modal_projector`)

Changes (1 file, 23 lines)

- `get_mm_mapping()` — updated `connector` from a single string to a list that includes both `multi_modal_projector.` and `vision_model.vision_adapter.`. The LoRA manager uses longest-prefix matching, so `vision_model.vision_adapter.*` modules correctly map to the connector wrapper (not the tower).
- `get_num_mm_encoder_tokens()` — converts the LM-level image token count back to the vision encoder token count. The encoder processes `(image_size/patch_size)² + 1` tokens per chunk (raw patches plus the CLS token), while the LM sees `patches_per_chunk` tokens (post pixel-shuffle, fewer).
- `get_num_mm_connector_tokens()` — converts the encoder token count to the connector token count (post pixel-shuffle). The connector (vision adapter MLP + `multi_modal_projector`) processes the reduced token count.

Token flow
Values shown for Llama 4 Scout (image_size=504, patch_size=14, pixel_shuffle_ratio=0.5).
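As a sanity check, the per-chunk conversions can be reproduced with a few lines of arithmetic. This is a sketch using the Scout values quoted above; the real methods read these values from the model config.

```python
# Per-chunk token counts along the Llama 4 Scout vision path, mirroring
# the conversions described for get_num_mm_encoder_tokens() and
# get_num_mm_connector_tokens(). Values are the Scout config quoted above.
image_size = 504
patch_size = 14
pixel_shuffle_ratio = 0.5

patches = (image_size // patch_size) ** 2  # 36 x 36 = 1296 raw patches
encoder_tokens = patches + 1               # + CLS token -> encoder input
# Pixel shuffle folds spatial tokens into channels, reducing the count
# by ratio**2 (consistent with the description above).
connector_tokens = int(patches * pixel_shuffle_ratio ** 2)
lm_tokens = connector_tokens               # what the LM sees per chunk

print(encoder_tokens, connector_tokens, lm_tokens)  # 1297 324 324
```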
Test Plan
Tested on 4x NVIDIA H100 80GB with Llama 4 Scout 17B-16E (`nvidia/Llama-4-Scout-17B-16E-Instruct-FP8`).

1. Create a test LoRA adapter targeting tower + connector + LM layers
Verified LoRA coverage across all 4 module groups: tower (vision encoder attention), connector (vision adapter MLP), connector (multi-modal projector), and language model.
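For illustration, how the four module groups separate can be sketched with the longest-prefix routing the PR description attributes to the LoRA manager. The module names below are illustrative stand-ins for what `model.named_modules()` would return on the real checkpoint.

```python
# Illustrative module names covering the four groups the test adapter
# targets (stand-ins for real named_modules() output).
modules = [
    "vision_model.model.layers.0.self_attn.q_proj",    # tower
    "vision_model.vision_adapter.mlp.fc1",             # connector (adapter MLP)
    "multi_modal_projector.linear_1",                  # connector (projector)
    "language_model.model.layers.0.self_attn.q_proj",  # language model
]

# Longest-prefix matching: vision_model.vision_adapter.* wins over the
# shorter vision_model. prefix, so the adapter MLP lands in the
# connector group rather than the tower.
prefixes = {
    "language_model.": "language_model",
    "vision_model.": "tower",
    "vision_model.vision_adapter.": "connector",
    "multi_modal_projector.": "connector",
}

def route(name: str) -> str:
    best = max((p for p in prefixes if name.startswith(p)), key=len)
    return prefixes[best]

groups = {m: route(m) for m in modules}
```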
2. Serve with tower/connector LoRA enabled
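A sketch of the serve invocation. Only `--enable-tower-connector-lora` comes from this PR; the remaining flags are standard vLLM LoRA/serving options assumed here, and the adapter path is a placeholder.

```shell
# Placeholder adapter path; TP size matches the 4x H100 setup described above.
vllm serve nvidia/Llama-4-Scout-17B-16E-Instruct-FP8 \
  --tensor-parallel-size 4 \
  --enable-lora \
  --enable-tower-connector-lora \
  --lora-modules test-lora=/path/to/test-lora-adapter
```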
Server started successfully with tower/connector LoRA active.
3. Run inference tests
Three tests via the OpenAI-compatible API. All 3 tests passed. The test image is a PNG with four colored dice.
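One of the inference tests could look like the following sketch of a chat-completions request against the local server. The adapter name `test-lora`, the endpoint, and the image URL are placeholders; the `model` field must match the name registered via `--lora-modules`.

```python
# Sketch of a multimodal chat-completions request to a local
# OpenAI-compatible server; "test-lora" and the URL are placeholders.
payload = {
    "model": "test-lora",
    "messages": [{
        "role": "user",
        "content": [
            {"type": "text",
             "text": "How many dice are in this image, and what colors are they?"},
            {"type": "image_url",
             "image_url": {"url": "https://example.com/dice.png"}},
        ],
    }],
    "max_tokens": 64,
}

# Send with, e.g.:
#   requests.post("http://localhost:8000/v1/chat/completions", json=payload)
```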
Reference pattern
This follows the same approach as InternVL2 (#32397), which also handles pixel shuffle with a CLS token, and Qwen2VL, where the merger is a sub-prefix of the vision tower.