
[LoRA] Add LoRA support for Qwen3OmniMoeThinkerForConditionalGeneration#37193

Open
pratapyash wants to merge 6 commits into vllm-project:main from pratapyash:add-lora-qwen3-omni-moe-thinker

Conversation

@pratapyash
Contributor

@pratapyash pratapyash commented Mar 16, 2026

Purpose

Add LoRA support for Qwen3OmniMoeThinkerForConditionalGeneration.

The vLLM supported_models.md documentation marks this model with a LoRA checkmark under both "Multimodal Language Models" and "Speech-to-Text Language Models" tables, but the model class never implemented the SupportsLoRA protocol. This means --enable-lora fails at runtime despite the documentation claiming support.

This PR resolves that gap by adding the minimal required LoRA attributes to the model class in a single file.

FIX #31205
Related: #30461, PR #34097

Changes

  1. Import and inherit SupportsLoRA -- enables --enable-lora for this model
  2. Define packed_modules_mapping -- maps qkv_proj to [q_proj, k_proj, v_proj]. gate_up_proj is intentionally excluded because Qwen3-Omni uses MoE (FusedMoE) for FFN layers, not packed linear projections
  3. Define embedding_modules = {} -- required by the SupportsLoRA protocol; empty because no embedding-layer LoRA is needed
  4. Define lora_skip_prefixes = ["audio_tower.", "visual."] -- gracefully skips audio/vision tower modules during LoRA loading. Without this, adapters trained with broad target_modules (e.g., regex matching all Linear layers) crash with ValueError from check_unexpected_modules even though the thinker modules are valid. Follows the same pattern as NemotronH (lora_skip_prefixes = ["mtp."])
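The four changes above can be sketched in isolation. This is an illustrative standalone snippet, not the actual vLLM class (which also inherits `SupportsLoRA` and contains the full model implementation); the `should_skip` helper only mimics the prefix check vLLM applies while loading adapter tensors:

```python
# Sketch of the class-level attributes added to
# Qwen3OmniMoeThinkerForConditionalGeneration, shown standalone.

# LoRA weights trained against separate q_proj/k_proj/v_proj modules are
# stacked into the fused qkv_proj layer via this mapping.
packed_modules_mapping = {
    "qkv_proj": ["q_proj", "k_proj", "v_proj"],
}

# Required by the SupportsLoRA protocol; empty because no
# embedding-layer LoRA is needed for this model.
embedding_modules = {}

# Adapter weights under these prefixes are skipped instead of raising
# a ValueError during LoRA loading.
lora_skip_prefixes = ["audio_tower.", "visual."]


def should_skip(module_name: str) -> bool:
    """Mimic the skip check applied to each adapter module name."""
    return any(module_name.startswith(p) for p in lora_skip_prefixes)
```

With this, an adapter tensor named `audio_tower.layers.0.self_attn.q_proj` is skipped, while `model.layers.0.self_attn.q_proj` is loaded normally.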

Why gate_up_proj is removed from packed_modules_mapping

Qwen3-Omni is MoE -- FFN uses FusedMoE with per-expert gate_proj/up_proj/down_proj, not a packed gate_up_proj. The inherited mapping came from Qwen2_5OmniThinkerForConditionalGeneration which is a dense model. Qwen3MoeForCausalLM (the authoritative MoE reference) deliberately excludes gate_up_proj and only adds it conditionally when mlp_only_layers is non-empty. Qwen3-Omni has mlp_only_layers: []. Keeping it is harmless (never matched) but misleading -- MoE expert LoRA is handled by FusedMoEWithLoRA, not packed_modules_mapping.
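The conditional pattern described above can be sketched as follows. This is an illustrative paraphrase of the logic, not the literal Qwen3MoeForCausalLM source; the function name is hypothetical:

```python
# Sketch: gate_up_proj only belongs in the packed-modules mapping for
# dense MLP layers. MoE layers use per-expert FusedMoE weights, where
# expert LoRA is handled by FusedMoEWithLoRA instead.
def build_packed_modules_mapping(mlp_only_layers: list) -> dict:
    mapping = {"qkv_proj": ["q_proj", "k_proj", "v_proj"]}
    if mlp_only_layers:
        # Only dense (non-MoE) layers have a packed gate_up_proj linear.
        mapping["gate_up_proj"] = ["gate_proj", "up_proj"]
    return mapping


# Qwen3-Omni ships with mlp_only_layers: [], so the mapping
# stays attention-only.
```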

Test Plan

Public test adapter (random weights): yashpratap/Qwen3-Omni-30B-A3B-LoRA-test-r32

Server launch:

vllm serve Qwen/Qwen3-Omni-30B-A3B-Instruct \
  --tensor-parallel-size 4 \
  --enable-lora \
  --max-lora-rank 32 \
  --max-loras 2 \
  --lora-modules test=yashpratap/Qwen3-Omni-30B-A3B-LoRA-test-r32 \
  --enforce-eager \
  --gpu-memory-utilization 0.90 \
  --max-model-len 8192

Inference:

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "test", "messages": [{"role": "user", "content": "Hello"}], "max_tokens": 50}'
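For reference, an equivalent request can be issued from Python using only the standard library. This sketch assumes the same endpoint and served adapter name (`test`) as the curl command; `build_chat_payload` is a hypothetical helper, not part of any vLLM or OpenAI client API:

```python
import json
from urllib import request


def build_chat_payload(model: str, prompt: str, max_tokens: int = 50) -> dict:
    """Build an OpenAI-compatible chat-completions request body."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }


if __name__ == "__main__":
    payload = build_chat_payload("test", "Hello")
    req = request.Request(
        "http://localhost:8000/v1/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with request.urlopen(req) as resp:
        print(json.load(resp)["choices"][0]["message"]["content"])
```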

Test Result

Tested on 4x L40S (48GB each).

Test 1: Enforce-eager mode -- PASS

  • Base model and LoRA adapter both listed in /v1/models
  • Base model response: standard Qwen-Omni intro
  • LoRA adapter response: different output confirming adapter is applied
  • No errors or warnings

Test 2: CUDA graph compilation -- PASS

  • VLLM_COMPILE + PIECEWISE cudagraph mode
  • Both base model and LoRA inference work correctly

Test 3: Mixed adapter (thinker + tower modules) -- PASS

  • Adapter containing both thinker attention and audio/vision tower LoRA weights
  • Tower modules gracefully skipped via lora_skip_prefixes
  • Thinker modules loaded and applied correctly
  • Server starts, inference works

Test 4: With gate_up_proj in mapping -- PASS (but removed anyway)

  • Tested keeping gate_up_proj in packed_modules_mapping
  • No errors or crashes -- the mapping entry is never matched since no module is named gate_up_proj in the MoE architecture
  • Removed to stay consistent with Qwen3MoeForCausalLM which deliberately excludes it for MoE models

  • Purpose is clearly described
  • Test plan provided
  • Test results included
  • No documentation update needed (LoRA checkmark already present in supported_models.md)

AI assistance was used in developing and testing this PR, per AGENTS.md.

…tedLinear and quantization support

- Replaced nn.Linear with ReplicatedLinear for conv_out, proj1, and proj2 layers to support quantization.
- Added quant_config parameter to Qwen3OmniMoeAudioEncoder constructor.
- Updated method calls to handle outputs from ReplicatedLinear layers.
- Included SupportsLoRA in Qwen3OmniMoeThinkerForConditionalGeneration class.
…oEncoder

- Removed ReplicatedLinear usage for conv_out, proj1, and proj2 layers.
- Eliminated quant_config parameter from Qwen3OmniMoeAudioEncoder constructor.
- Updated method calls to reflect changes in layer outputs.
… loading

- Introduced lora_skip_prefixes to exclude audio_tower and visual modules from LoRA loading.
- This change addresses the requirement for enable_tower_connector_lora, which is not yet supported.
@pratapyash pratapyash requested a review from sighingnow as a code owner March 16, 2026 13:44
@mergify mergify bot added the qwen Related to Qwen models label Mar 16, 2026
Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request adds LoRA support for Qwen3OmniMoeThinkerForConditionalGeneration. The changes are minimal and well-contained, enabling LoRA by inheriting SupportsLoRA and defining the necessary attributes. The packed_modules_mapping is updated to support LoRA for attention layers while correctly excluding MLP layers for this MoE model. Additionally, lora_skip_prefixes is added to prevent errors when loading LoRA adapters that target the vision and audio towers. The changes are well-explained and appear correct.

@pratapyash
Contributor Author

Hi @DarkLight1337 @jeejeelee, requesting your review of this PR.

@pratapyash
Contributor Author

pratapyash commented Mar 30, 2026

Hi @DarkLight1337, @jeejeelee, @NickLucche following up again!


Labels

qwen Related to Qwen models


Development

Successfully merging this pull request may close these issues.

ValueError: Qwen3OmniMoeThinkerForConditionalGeneration does not support LoRA yet.
