7 changes: 4 additions & 3 deletions examples/qwen3.5/122b-a10b-moe-qlora-fsdp.yaml
@@ -31,10 +31,11 @@ lora_target_modules:
 - k_proj
 - v_proj
 - o_proj
-# Regex matching to target shared experts too
-# lora_target_modules: 'model\.(language_model\.)?layers\.[\d]+\.(mlp|self_attn)\.(shared_expert\.)?(up|down|gate|gate_up|q|k|v|o)_proj'
+# Add gate_up_proj and down_proj to also target shared experts (nn.Linear):
+# - gate_up_proj
+# - down_proj

-# Target experts
+# Target routed experts (3D nn.Parameter tensors, not nn.Linear — use lora_target_parameters):
 # lora_target_parameters:
 # - mlp.experts.gate_up_proj
 # - mlp.experts.down_proj
8 changes: 4 additions & 4 deletions examples/qwen3.5/122b-a10b-moe-qlora.yaml
@@ -31,11 +31,11 @@ lora_target_modules:
 - k_proj
 - v_proj
 - o_proj
+# Add gate_up_proj and down_proj to also target shared experts (nn.Linear):
+# - gate_up_proj
+# - down_proj

-# Regex matching to target shared experts too
-# lora_target_modules: 'model\.(language_model\.)?layers\.[\d]+\.(mlp|self_attn)\.(shared_expert\.)?(up|down|gate|gate_up|q|k|v|o)_proj'
-
-# Target experts
+# Target routed experts (3D nn.Parameter tensors, not nn.Linear — use lora_target_parameters):
 # lora_target_parameters:
 # - mlp.experts.gate_up_proj
 # - mlp.experts.down_proj
8 changes: 4 additions & 4 deletions examples/qwen3.5/35b-a3b-moe-qlora-fsdp.yaml
@@ -31,11 +31,11 @@ lora_target_modules:
 - k_proj
 - v_proj
 - o_proj
+# Add gate_up_proj and down_proj to also target shared experts (nn.Linear):
+# - gate_up_proj
+# - down_proj

-# Regex matching to target shared experts too
-# lora_target_modules: 'model\.(language_model\.)?layers\.[\d]+\.(mlp|self_attn)\.(shared_expert\.)?(up|down|gate|gate_up|q|k|v|o)_proj'
-
-# Target experts
+# Target routed experts (3D nn.Parameter tensors, not nn.Linear — use lora_target_parameters):
 # lora_target_parameters:
 # - mlp.experts.gate_up_proj
 # - mlp.experts.down_proj
14 changes: 7 additions & 7 deletions examples/qwen3.5/35b-a3b-moe-qlora.yaml
@@ -42,14 +42,14 @@ lora_target_modules:
 - k_proj
 - v_proj
 - o_proj
+# Add gate_up_proj and down_proj to also target shared experts (nn.Linear):
+# - gate_up_proj
+# - down_proj

-# Regex matching to target shared experts too
-# lora_target_modules: 'model\.(language_model\.)?layers\.[\d]+\.(mlp|self_attn)\.(shared_expert\.)?(up|down|gate|gate_up|q|k|v|o)_proj'
-
-# Target experts
-lora_target_parameters:
-- mlp.experts.gate_up_proj
-- mlp.experts.down_proj
+# Target routed experts (3D nn.Parameter tensors, not nn.Linear — use lora_target_parameters):
+# lora_target_parameters:
+# - mlp.experts.gate_up_proj
+# - mlp.experts.down_proj

 lora_qkv_kernel: true
 lora_o_kernel: true
13 changes: 11 additions & 2 deletions examples/qwen3.5/README.md
@@ -59,12 +59,21 @@ lora_target_parameters:

 ### Shared Experts (MoE)

-Routed experts and shared experts both have `gate_up_proj`/`down_proj`, so a plain module name in `lora_target_modules` would match both. Use a regex to target only attention and shared expert projections, while `lora_target_parameters` above handles routed experts separately:
+Shared experts use `nn.Linear` (unlike routed experts which are 3D `nn.Parameter` tensors), so they can be targeted via `lora_target_modules`. To also train shared expert projections alongside attention, uncomment `gate_up_proj` and `down_proj` in `lora_target_modules`:

 ```yaml
-lora_target_modules: 'model\.(language_model\.)?layers\.[\d]+\.(mlp|self_attn)\.(shared_expert\.)?(up|down|gate|gate_up|q|k|v|o)_proj'
+lora_target_modules:
+- q_proj
+- k_proj
+- v_proj
+- o_proj
+# Add gate_up_proj and down_proj to also target shared experts (nn.Linear):
+# - gate_up_proj
+# - down_proj
 ```

+Use `lora_target_parameters` (see [Routed Experts](#routed-experts-moe) above) to target routed experts separately.
+
 ### TIPS

 - For inference hyp, please see the respective model card details.
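
Read together, the config changes and the README update describe three separate groups of LoRA targets: attention projections, shared experts, and routed experts. The fragment below is a minimal sketch of a config that enables all three at once. It assumes the key and module names shown in the diff, and that the rest of the config matches one of the example files. The combination itself is an illustration, not a configuration the PR ships.

```yaml
# Attention projections plus shared experts (nn.Linear), matched by module name.
lora_target_modules:
  - q_proj
  - k_proj
  - v_proj
  - o_proj
  - gate_up_proj   # shared expert; left commented out in the example configs
  - down_proj      # shared expert; left commented out in the example configs

# Routed experts (3D nn.Parameter tensors), matched by parameter name.
lora_target_parameters:
  - mlp.experts.gate_up_proj
  - mlp.experts.down_proj
```

Each extra target adds trainable adapter weights, so memory use will likely rise as groups are enabled. After this PR, all four example configs ship with both optional groups commented out.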