-
-
Notifications
You must be signed in to change notification settings - Fork 1.4k
add: qwen 3.5 #3442
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Merged
Merged
add: qwen 3.5 #3442
Changes from all commits
Commits
Show all changes
17 commits
Select commit
Hold shift + click to select a range
b77eada
add: qwen 3.5
ved1beta b3660b0
test for qwen , patch
ved1beta cd483da
lint
ved1beta 587ead4
qwen3 fix on main
ved1beta 5251310
Apply suggestions from code review
ved1beta d2f97c4
moe config
ved1beta cc1320f
config moe
ved1beta 53d26bf
Merge branch 'feat/qwen3.5-2' of github.com:ved1beta/axolotl into fea…
ved1beta 5a85851
configs and chore
ved1beta 3ecd4b1
Update examples/qwen3.5/122b-a10b-moe-qlora.yaml
ved1beta 0ffa5fb
Update examples/qwen3.5/35b-a3b-moe-qlora.yaml
ved1beta f32e555
chore for qwen + vlm patch
ved1beta 98b160c
chore lint
ved1beta 9844e4a
Merge branch 'feat/qwen3.5-2' of github.com:ved1beta/axolotl into fea…
ved1beta 56cae1b
qwen lint
ved1beta 88a093e
3_5_moe
ved1beta 6589080
Update examples/qwen3.5/README.md
NanoCode012 File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,71 @@ | ||
| base_model: Qwen/Qwen3.5-122B-A10B | ||
|
|
||
| plugins: | ||
| - axolotl.integrations.cut_cross_entropy.CutCrossEntropyPlugin | ||
| strict: false | ||
|
|
||
| chat_template: qwen3_5 | ||
| datasets: | ||
| - path: mlabonne/FineTome-100k | ||
| type: chat_template | ||
| split: train[:20%] | ||
| field_messages: conversations | ||
| message_property_mappings: | ||
| role: from | ||
| content: value | ||
| val_set_size: 0.0 | ||
| output_dir: ./outputs/out | ||
| dataset_prepared_path: last_run_prepared | ||
|
|
||
| sequence_len: 2048 | ||
| sample_packing: true | ||
|
|
||
| load_in_4bit: true | ||
| quantize_moe_experts: true | ||
| adapter: qlora | ||
| lora_r: 16 | ||
| lora_alpha: 32 | ||
| lora_dropout: 0 | ||
| lora_target_modules: | ||
| - q_proj | ||
| - k_proj | ||
| - v_proj | ||
| - o_proj | ||
|
|
||
| #lora_target_parameters: | ||
| # - mlp.experts.gate_up_proj | ||
| # - mlp.experts.down_proj | ||
|
|
||
| wandb_project: | ||
| wandb_entity: | ||
| wandb_watch: | ||
| wandb_name: | ||
| wandb_log_model: | ||
|
|
||
| gradient_accumulation_steps: 2 | ||
| micro_batch_size: 1 | ||
| num_epochs: 1 | ||
| optimizer: adamw_torch_4bit | ||
| lr_scheduler: cosine | ||
| learning_rate: 0.0002 | ||
|
|
||
| bf16: auto | ||
| tf32: true | ||
|
|
||
|
|
||
| lora_mlp_kernel: false | ||
| lora_qkv_kernel: false | ||
| lora_o_kernel: false | ||
|
|
||
| gradient_checkpointing: true | ||
| gradient_checkpointing_kwargs: | ||
| use_reentrant: false | ||
| resume_from_checkpoint: | ||
| logging_steps: 1 | ||
| flash_attention: true | ||
|
|
||
| warmup_ratio: 0.1 | ||
| evals_per_epoch: 4 | ||
| saves_per_epoch: 1 | ||
| weight_decay: 0.0 | ||
| special_tokens: |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,72 @@ | ||
| base_model: Qwen/Qwen3.5-27B | ||
| # Automatically upload checkpoint and final model to HF | ||
| # hub_model_id: username/custom_model_name | ||
| # Note: Qwen3.5 is an early-fusion VLM (image+text). This config fine-tunes | ||
| # the text-only path. For multimodal (image+text) fine-tuning, add image | ||
| # columns to your dataset following axolotl's multimodal dataset format. | ||
|
|
||
| plugins: | ||
| - axolotl.integrations.cut_cross_entropy.CutCrossEntropyPlugin | ||
| strict: false | ||
|
|
||
| chat_template: qwen3_5 | ||
| datasets: | ||
| - path: mlabonne/FineTome-100k | ||
| type: chat_template | ||
| split: train[:20%] | ||
| field_messages: conversations | ||
| message_property_mappings: | ||
| role: from | ||
| content: value | ||
| val_set_size: 0.0 | ||
| output_dir: ./outputs/out | ||
| dataset_prepared_path: last_run_prepared | ||
|
|
||
| sequence_len: 2048 | ||
| sample_packing: true | ||
|
|
||
| load_in_4bit: true | ||
| adapter: qlora | ||
| lora_r: 16 | ||
| lora_alpha: 32 | ||
| lora_target_modules: | ||
| - q_proj | ||
| - k_proj | ||
| - v_proj | ||
| - o_proj | ||
| - down_proj | ||
| - up_proj | ||
| # Uncomment below to also target the linear attention projections. | ||
| # These use separate in_proj_qkv / in_proj_z / out_proj (Qwen3.5-specific). | ||
| # - linear_attn.in_proj_qkv | ||
| # - linear_attn.in_proj_z | ||
| # - linear_attn.out_proj | ||
|
|
||
| wandb_project: | ||
| wandb_entity: | ||
| wandb_watch: | ||
| wandb_name: | ||
| wandb_log_model: | ||
|
|
||
| gradient_accumulation_steps: 2 | ||
| micro_batch_size: 1 | ||
| num_epochs: 1 | ||
| optimizer: adamw_torch_4bit | ||
| lr_scheduler: cosine | ||
| learning_rate: 0.0002 | ||
|
|
||
| bf16: auto | ||
| tf32: true | ||
|
|
||
| gradient_checkpointing: true | ||
| gradient_checkpointing_kwargs: | ||
| use_reentrant: false | ||
| resume_from_checkpoint: | ||
| logging_steps: 1 | ||
| flash_attention: true | ||
|
|
||
| warmup_ratio: 0.1 | ||
| evals_per_epoch: 4 | ||
| saves_per_epoch: 1 | ||
| weight_decay: 0.0 | ||
| special_tokens: | ||
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,70 @@ | ||
| base_model: Qwen/Qwen3.5-35B-A3B | ||
|
|
||
| plugins: | ||
| - axolotl.integrations.cut_cross_entropy.CutCrossEntropyPlugin | ||
| strict: false | ||
|
|
||
| chat_template: qwen3_5 | ||
| datasets: | ||
| - path: mlabonne/FineTome-100k | ||
| type: chat_template | ||
| split: train[:20%] | ||
| field_messages: conversations | ||
| message_property_mappings: | ||
| role: from | ||
| content: value | ||
| val_set_size: 0.0 | ||
| output_dir: ./outputs/out | ||
| dataset_prepared_path: last_run_prepared | ||
|
|
||
| sequence_len: 2048 | ||
| sample_packing: true | ||
|
|
||
| load_in_4bit: true | ||
| quantize_moe_experts: true | ||
| adapter: qlora | ||
| lora_r: 16 | ||
| lora_alpha: 32 | ||
| lora_dropout: 0 | ||
| lora_target_modules: | ||
| - q_proj | ||
| - k_proj | ||
| - v_proj | ||
| - o_proj | ||
|
|
||
| #lora_target_parameters: | ||
| # - mlp.experts.gate_up_proj | ||
| # - mlp.experts.down_proj | ||
|
|
||
| wandb_project: | ||
| wandb_entity: | ||
| wandb_watch: | ||
| wandb_name: | ||
| wandb_log_model: | ||
|
|
||
| gradient_accumulation_steps: 2 | ||
| micro_batch_size: 1 | ||
| num_epochs: 1 | ||
| optimizer: adamw_torch_4bit | ||
| lr_scheduler: cosine | ||
| learning_rate: 0.0002 | ||
|
|
||
| bf16: auto | ||
| tf32: true | ||
|
|
||
| lora_mlp_kernel: false | ||
| lora_qkv_kernel: false | ||
| lora_o_kernel: false | ||
|
|
||
| gradient_checkpointing: true | ||
| gradient_checkpointing_kwargs: | ||
| use_reentrant: false | ||
| resume_from_checkpoint: | ||
| logging_steps: 1 | ||
| flash_attention: true | ||
|
|
||
| warmup_ratio: 0.1 | ||
| evals_per_epoch: 4 | ||
| saves_per_epoch: 1 | ||
| weight_decay: 0.0 | ||
| special_tokens: |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,72 @@ | ||
| base_model: Qwen/Qwen3.5-7B | ||
| processor_type: AutoProcessor | ||
|
|
||
| # Qwen3.5-7B and above are early-fusion VLMs (Qwen3_5ForConditionalGeneration). | ||
| # Vision and text tokens are processed together by the same transformer layers. | ||
| # Note: Qwen3.5-2B is a text-only model — the smallest VLM is Qwen3.5-7B. | ||
|
|
||
| # These 3 lines are required for vision/multimodal training | ||
| skip_prepare_dataset: true | ||
| remove_unused_columns: false | ||
| sample_packing: false | ||
|
|
||
| chat_template: qwen3_5 | ||
| datasets: | ||
| - path: HuggingFaceH4/llava-instruct-mix-vsft | ||
| type: chat_template | ||
| split: train[:1%] | ||
|
|
||
| dataset_prepared_path: last_run_prepared | ||
| val_set_size: 0.0 | ||
| output_dir: ./outputs/out | ||
|
|
||
| adapter: lora | ||
| lora_model_dir: | ||
|
|
||
| sequence_len: 8192 | ||
| pad_to_sequence_len: false | ||
|
|
||
| lora_r: 32 | ||
| lora_alpha: 16 | ||
| lora_dropout: 0.05 | ||
| # Targets the language model attention and MLP layers. | ||
| # Qwen3.5 is early-fusion: all layers (including those seeing vision tokens) share | ||
| # the same transformer stack, so standard attention targets work for both modalities. | ||
| lora_target_modules: | ||
| - q_proj | ||
| - k_proj | ||
| - v_proj | ||
| - o_proj | ||
| - down_proj | ||
| - up_proj | ||
| # Uncomment to also target the linear attention (GatedDeltaNet) projections: | ||
| # - linear_attn.in_proj_qkv | ||
| # - linear_attn.in_proj_z | ||
| # - linear_attn.out_proj | ||
|
|
||
| wandb_project: | ||
| wandb_entity: | ||
| wandb_watch: | ||
| wandb_name: | ||
| wandb_log_model: | ||
|
|
||
| gradient_accumulation_steps: 4 | ||
| micro_batch_size: 1 | ||
| num_epochs: 1 | ||
| optimizer: adamw_bnb_8bit | ||
| lr_scheduler: cosine | ||
| learning_rate: 0.0002 | ||
|
|
||
| bf16: true | ||
| tf32: true | ||
|
|
||
| gradient_checkpointing: true | ||
| gradient_checkpointing_kwargs: | ||
| use_reentrant: false | ||
| logging_steps: 1 | ||
| flash_attention: true | ||
|
|
||
| warmup_ratio: 0.1 | ||
| evals_per_epoch: 1 | ||
| saves_per_epoch: 1 | ||
| weight_decay: 0.0 |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,61 @@ | ||
| # Finetune Qwen3.5 with Axolotl | ||
|
|
||
| [Qwen3.5](https://huggingface.co/collections/Qwen/qwen35-68452f3bc6e4b7cfb4e1c803) is a hybrid architecture model series combining Gated DeltaNet linear attention with standard Transformer attention. Models from 7B onwards are early-fusion vision-language models (`Qwen3_5ForConditionalGeneration`), meaning vision and text tokens are processed through the same transformer stack. The 2B variant is text-only. | ||
|
|
||
| Available configs: | ||
|
|
||
| | Config | Model | Type | | ||
| |---|---|---| | ||
| | `27b-qlora.yaml` | Qwen3.5-27B | Dense VLM, text-only path | | ||
| | `35b-a3b-moe-qlora.yaml` | Qwen3.5-35B-A3B | MoE, text-only path | | ||
| | `122b-a10b-moe-qlora.yaml` | Qwen3.5-122B-A10B | MoE, text-only path | | ||
| | `7b-lora-vision.yaml` | Qwen3.5-7B | Vision+text (multimodal) | | ||
|
|
||
| ## Getting started | ||
|
|
||
| 1. Install Axolotl following the [installation guide](https://docs.axolotl.ai/docs/installation.html). | ||
|
|
||
| 2. Install [Cut Cross Entropy](https://docs.axolotl.ai/docs/custom_integrations.html#cut-cross-entropy) to reduce training VRAM usage. | ||
|
|
||
| 3. Install FLA for sample packing support with the Gated DeltaNet linear attention layers: | ||
| ```bash | ||
| pip3 uninstall -y causal-conv1d && pip3 install flash-linear-attention==0.4.1 | ||
| ``` | ||
| > FLA is required when `sample_packing: true`. Without it, training raises a `RuntimeError` on packed sequences. Vision configs use `sample_packing: false` so FLA is optional there. | ||
|
|
||
| 4. Run a finetuning example: | ||
|
|
||
| ```bash | ||
| # Dense 27B text-only (QLoRA, ~47 GiB VRAM with sample packing) | ||
| axolotl train examples/qwen3.5/27b-qlora.yaml | ||
|
|
||
| # MoE 35B-A3B text-only (QLoRA) | ||
| axolotl train examples/qwen3.5/35b-a3b-moe-qlora.yaml | ||
|
|
||
| # MoE 122B-A10B text-only (QLoRA) | ||
| axolotl train examples/qwen3.5/122b-a10b-moe-qlora.yaml | ||
|
|
||
| # 7B vision+text (LoRA, multimodal dataset) | ||
| axolotl train examples/qwen3.5/7b-lora-vision.yaml | ||
| ``` | ||
|
|
||
| ### TIPS | ||
|
|
||
| - For inference, you can experiment with `temperature: 0.7`, `top_p: 0.8`, `top_k: 20`, and `min_p: 0`. | ||
| - You can run a full finetuning by removing `adapter: qlora` and `load_in_4bit: true`. See [Multi-GPU](#optimization-guides) below. | ||
| - Read more on loading your own dataset at [docs](https://docs.axolotl.ai/docs/dataset_loading.html). | ||
| - The dataset format follows the OpenAI Messages format as seen [here](https://docs.axolotl.ai/docs/dataset-formats/conversation.html#chat_template). | ||
| - For **multimodal** finetuning, set `processor_type: AutoProcessor`, `skip_prepare_dataset: true`, and `remove_unused_columns: false` as shown in `7b-lora-vision.yaml`. | ||
| - The Gated DeltaNet linear attention layers (`linear_attn.*`) can optionally be added to `lora_target_modules` — they are commented out by default. | ||
|
|
||
| ## Optimization Guides | ||
|
|
||
| - [Optimizations Guide](https://docs.axolotl.ai/docs/optimizations.html) | ||
|
|
||
| ## Related Resources | ||
|
|
||
| - [Qwen3.5 Blog](https://qwenlm.github.io/blog/qwen3.5/) | ||
| - [Axolotl Docs](https://docs.axolotl.ai) | ||
| - [Axolotl Website](https://axolotl.ai) | ||
| - [Axolotl GitHub](https://github.com/axolotl-ai-cloud/axolotl) | ||
| - [Axolotl Discord](https://discord.gg/7m9sfhzaf3) |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Empty file.
Oops, something went wrong.
Oops, something went wrong.
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would be better to have a separate config later with
-visionin its name