add: qwen 3.5 #3442

Merged: winglian merged 17 commits into axolotl-ai-cloud:main from ved1beta:feat/qwen3.5-2 on Mar 6, 2026

Conversation

Member

@ved1beta ved1beta commented Feb 28, 2026

Description

Adds support for Qwen 3.5.

Motivation and Context

#3434

How has this been tested?

'27b-qlora.yaml'

AI Usage Disclaimer

Claude

Screenshots (if appropriate)

image

Types of changes

Uses a single patch for Qwen3.5 and Qwen3_Next; both appear to run fine.

Social Handles (Optional)

ved

Summary by CodeRabbit

  • New Features

    • Added support for Qwen3.5-27B and Qwen3.5-27B MOE model variants
    • Introduced QLoRA configuration example for Qwen3.5-27B fine-tuning
    • Enabled sample packing optimization for Qwen3.5 models
  • Chores

    • Refactored patching infrastructure to consolidate Qwen3_Next and Qwen3.5 implementations

Contributor

coderabbitai Bot commented Feb 28, 2026

Walkthrough

Introduces support for Qwen3.5 and Qwen3.5 MoE model variants through a new example QLoRA configuration, architecture registry entries, dynamic patching infrastructure for sample packing, and refactored monkeypatch implementation shared across Qwen3.5 and Qwen3_Next models.

Changes

| Cohort / File(s) | Summary |
|---|---|
| Configuration & Architecture: `examples/qwen3.5/27b-qlora.yaml`, `src/axolotl/common/architectures.py` | Added example Qwen3.5-27B QLoRA fine-tuning configuration with training parameters and registered "qwen3_5_moe" model architecture entry. |
| Patch Management: `src/axolotl/loaders/patch_manager.py` | Added conditional dynamic patching for qwen3_5 and qwen3_5_moe models when sample packing is enabled. |
| Qwen3.5 Monkeypatch: `src/axolotl/monkeypatch/models/qwen3_5/modeling.py` | Introduced comprehensive patching module supporting FLA kernel injection, position_ids handling with 3-D mrope shapes, separate factory builders for Qwen3_Next and Qwen3.5/Qwen3.5MoE variants, and packing patch applier utilities. |
| Qwen3_Next Refactor: `src/axolotl/monkeypatch/models/qwen3_next/modeling.py` | Consolidated to re-export unified packing patch from qwen3_5.modeling, removing duplicate implementations and simplifying public API. |
| Multipack Support: `src/axolotl/monkeypatch/multipack.py` | Added "qwen3_5" and "qwen3_5_moe" to SUPPORTED_MULTIPACK_MODEL_TYPES list. |
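
The "conditional dynamic patching" described for the patch manager can be pictured roughly as follows. This is a minimal sketch, not the code from this PR: `PACKING_PATCH_MODEL_TYPES`, `needs_packing_patch`, and `apply_patches` are hypothetical names, and only the two model type strings come from the summary above.

```python
# Hypothetical illustration of gating a patch on model type and sample packing.
# Only the two model type strings are taken from the PR summary.
PACKING_PATCH_MODEL_TYPES = {"qwen3_5", "qwen3_5_moe"}


def needs_packing_patch(model_type: str, sample_packing: bool) -> bool:
    """Return True only when packing is enabled for a listed model type."""
    return bool(sample_packing) and model_type in PACKING_PATCH_MODEL_TYPES


def apply_patches(model_type: str, sample_packing: bool, patch_fn) -> bool:
    """Call patch_fn(model_type) when needed and report whether it ran."""
    if not needs_packing_patch(model_type, sample_packing):
        return False
    patch_fn(model_type)
    return True


if __name__ == "__main__":
    applied = []
    apply_patches("qwen3_5", True, applied.append)
    apply_patches("qwen3_5_moe", False, applied.append)
    print(applied)
```

Passing the patch function in as an argument keeps the gating logic testable without importing any model code.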

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Suggested reviewers

  • NanoCode012
  • winglian
🚥 Pre-merge checks | ✅ 1 | ❌ 2

❌ Failed checks (1 warning, 1 inconclusive)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 58.33% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |
| Title check | ❓ Inconclusive | The title 'add: qwen 3.5' is vague and uses generic terminology that does not clearly convey the scope or nature of the implementation. | Expand the title to be more specific about the changes, such as 'Add Qwen 3.5 model support with QLoRA configuration and Flash Attention patching' or similar, to clarify what aspects of Qwen 3.5 are being added. |

✅ Passed checks (1 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |

Contributor

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/axolotl/monkeypatch/models/qwen3_5/modeling.py`:
- Line 455: Re-run the code formatter (ruff format) and commit the changes to
fix formatting lint failures; specifically format the file containing the
LOG.info line that reads "Applied {cls_prefix} packing patch
(fla_causal_conv1d={'available' if fla_causal_conv1d else 'unavailable'})" in
src/axolotl/monkeypatch/models/qwen3_5/modeling.py, then stage and commit the
formatted file so the ruff-format pipeline no longer reports changes.
- Line 145: The unpacking in the Qwen3-Next patched forward currently does
"batch_size, seq_len, _ = hidden_states.shape" but batch_size is unused; update
the unpack to ignore that value (e.g., "_ , seq_len, _ = hidden_states.shape" or
simply derive seq_len with "seq_len = hidden_states.shape[1]") in the patched
forward implementation in modeling.py so Ruff RUF059 is resolved while
preserving existing logic that uses seq_len.

ℹ️ Review info

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 18f26c1 and 81f3f27.

📒 Files selected for processing (7)
  • examples/qwen3.5/27b-qlora.yaml
  • src/axolotl/common/architectures.py
  • src/axolotl/loaders/patch_manager.py
  • src/axolotl/monkeypatch/models/qwen3_5/__init__.py
  • src/axolotl/monkeypatch/models/qwen3_5/modeling.py
  • src/axolotl/monkeypatch/models/qwen3_next/modeling.py
  • src/axolotl/monkeypatch/multipack.py

):
hidden_states = apply_mask_fn(hidden_states, attention_mask)

batch_size, seq_len, _ = hidden_states.shape
Contributor

⚠️ Potential issue | 🟡 Minor

Fix unused unpacked variable in Qwen3-Next patched forward.

Line 145 unpacks batch_size but never uses it, and Ruff flags this (RUF059).

Suggested fix
-        batch_size, seq_len, _ = hidden_states.shape
+        _batch_size, seq_len, _ = hidden_states.shape
🧰 Tools
🪛 Ruff (0.15.2)

[warning] 145-145: Unpacked variable batch_size is never used

Prefix it with an underscore or any other dummy variable pattern

(RUF059)
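
For readers unfamiliar with RUF059, the two fixes mentioned in this thread look roughly like this. It is a standalone sketch that uses a plain tuple in place of `hidden_states.shape`; the function names are made up for illustration.

```python
def seq_len_by_unpacking(shape):
    """Unpack all three dimensions, marking the unused ones with underscores."""
    # Underscore-prefixed names signal "intentionally unused" to the linter.
    _batch_size, seq_len, _hidden = shape
    return seq_len


def seq_len_by_index(shape):
    """Read only the dimension that is needed, so nothing is left unused."""
    return shape[1]


if __name__ == "__main__":
    shape = (2, 128, 64)  # stand-in for hidden_states.shape
    print(seq_len_by_unpacking(shape), seq_len_by_index(shape))
```

Both variants preserve the logic that uses `seq_len`; the unpacking form also fails loudly if the shape is not 3-D, which indexing would not.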

gated_cls = getattr(module, f"{cls_prefix}GatedDeltaNet")
gated_cls.forward = forward_factory(module.apply_mask_to_padding_states)

LOG.info(f"Applied {cls_prefix} packing patch (fla_causal_conv1d={'available' if fla_causal_conv1d else 'unavailable'})")
Contributor

⚠️ Potential issue | 🟡 Minor

Please run formatter to unblock lint.

The lint pipeline reports ruff-format changes; Line 455 is a likely formatter touchpoint in this file. Re-run ruff format and commit the result.


codecov Bot commented Feb 28, 2026

Codecov Report

❌ Patch coverage is 4.92958% with 135 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| src/axolotl/monkeypatch/models/qwen3_5/modeling.py | 0.00% | 119 Missing ⚠️ |
| src/axolotl/processing_strategies.py | 23.07% | 10 Missing ⚠️ |
| src/axolotl/loaders/patch_manager.py | 33.33% | 6 Missing ⚠️ |

Collaborator

winglian commented Mar 2, 2026

do we have cutcrossentropy support?

@TheLocalDrummer

Do you have D*scord, @ved1beta ?

Member Author

ved1beta commented Mar 3, 2026

Yes, it's huihui17

Member Author

ved1beta commented Mar 3, 2026

> do we have cutcrossentropy support?

No, adding it.

@NanoCode012 NanoCode012 added the scheduled_release This PR is slated for the upcoming release label Mar 3, 2026
Collaborator

NanoCode012 commented Mar 4, 2026

> do we have cutcrossentropy support?

Yes, Qwen3.5 (and MoE) CCE support was already added upstream, and the commit hash was updated in the merged PR #3439.

Collaborator

@NanoCode012 NanoCode012 left a comment

Please also add a README in that directory, similar to how it's done for our other architectures.

Comment thread examples/qwen3.5/27b-qlora.yaml Outdated

sequence_len: 2048
sample_packing: true
eval_sample_packing: true
Collaborator

Suggested change (remove this line):
eval_sample_packing: true

Not needed in this example

Comment thread examples/qwen3.5/27b-qlora.yaml Outdated
Comment on lines +37 to +39
- linear_attn.in_proj_qkv
- linear_attn.in_proj_z
- linear_attn.out_proj
Collaborator

Any reason for these in particular? We may optionally want to comment these out by default.

Comment thread examples/qwen3.5/27b-qlora.yaml Outdated
Comment on lines +34 to +36
if position_ids.ndim == 3:
# mrope: [axes, B, T] — use axis 0 (text/temporal positions)
position_ids = position_ids[0]
Collaborator

Could you elaborate? Is it because index 1 is vision? Do you have ref for this?

Collaborator

I believe this is okay. In Qwen3.5, the get_rope_index method returns:

# In a mixed vision + text sequence, vision tokens use 3D RoPE (temporal, height, width) while text tokens use standard 1D RoPE.
position_ids (`torch.LongTensor` of shape `(3, batch_size, sequence_length)`)
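
To make the shape discussion concrete, here is a small sketch of the axis-0 selection using nested Python lists instead of tensors. It assumes the `(3, batch_size, sequence_length)` layout quoted above; `nesting_depth` and `text_position_ids` are hypothetical helpers, and the patched code in this PR checks `position_ids.ndim` on a tensor rather than list depth.

```python
def nesting_depth(value):
    """Count list nesting levels, a rough stand-in for tensor.ndim."""
    depth = 0
    while isinstance(value, list):
        depth += 1
        if not value:
            break
        value = value[0]
    return depth


def text_position_ids(position_ids):
    """Return axis 0 of 3-D mrope position ids, otherwise pass through."""
    if position_ids is not None and nesting_depth(position_ids) == 3:
        return position_ids[0]
    return position_ids


if __name__ == "__main__":
    row = [0, 1, 2, 0, 1]  # two packed sequences in one batch row
    mrope = [[row], [row], [row]]  # assumed layout: [axes=3, B=1, T=5]
    print(text_position_ids(mrope))
```

For a text-only batch the three axes would be expected to carry the same values, which is why taking index 0 loses nothing in this toy input; whether that holds for mixed vision and text inputs is exactly what the question above is probing.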

Comment on lines +101 to +105
fa_position_ids = (
position_ids[0]
if position_ids is not None and position_ids.ndim == 3
else position_ids
)
Collaborator

I don't see this done upstream

# Compute cu_seqlens only when FLA is available (torch fallback doesn't use it)
cu_seqlens = None
if (
fla_causal_conv1d is not None
Collaborator

Adding this first check is not correct. It would silently skip position_ids when FLA is not installed, instead of properly raising the error below.
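
One way to read this concern: the sequence boundaries should be derived (and validated) from position_ids whether or not the optional kernel is importable. The sketch below shows the common convention of recovering cumulative sequence lengths from packed position ids that restart at 0. It is an assumption-laden illustration, not the PR's code: `cu_seqlens_from_position_ids` is a hypothetical name, the input is a flat Python list for a single packed row, and the `ValueError` stands in for whatever error the real code raises.

```python
def cu_seqlens_from_position_ids(position_ids):
    """Return cumulative sequence boundaries for one packed row.

    Assumes each packed sequence restarts its position ids at 0.
    """
    if position_ids is None:
        # Hypothetical error: fail loudly instead of silently skipping.
        raise ValueError("position_ids are required to locate packed sequences")
    # Every 0 marks the start of a new sequence in the packed row.
    starts = [i for i, pos in enumerate(position_ids) if pos == 0]
    # Append the total length so consecutive pairs delimit each sequence.
    return starts + [len(position_ids)]


if __name__ == "__main__":
    packed = [0, 1, 2, 0, 1, 0, 1, 2, 3]  # lengths 3, 2 and 4
    print(cu_seqlens_from_position_ids(packed))
```

Keeping this step independent of the kernel import means a missing position_ids input surfaces as an error on every code path, rather than only when FLA happens to be installed.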

# Compute cu_seqlens only when FLA is available (torch fallback doesn't use it)
cu_seqlens = None
if (
fla_causal_conv1d is not None
Collaborator

Same as above

)
else:
# PyTorch fallback — no cu_seqlens, conv state leaks across packed sequences
LOG.warning_once(
Collaborator

This is not the same as my qwen3_next branch

Comment on lines +4 to +6
# Note: Qwen3.5 is an early-fusion VLM (image+text). This config fine-tunes
# the text-only path. For multimodal (image+text) fine-tuning, add image
# columns to your dataset following axolotl's multimodal dataset format.
Collaborator

Would be better to have a separate config later with -vision in its name

Suggested change (remove these lines):
# Note: Qwen3.5 is an early-fusion VLM (image+text). This config fine-tunes
# the text-only path. For multimodal (image+text) fine-tuning, add image
# columns to your dataset following axolotl's multimodal dataset format.

@winglian winglian requested a review from NanoCode012 March 5, 2026 18:44
Collaborator

@NanoCode012 NanoCode012 left a comment

Thanks, just some small cleanups left:

  • README qwen3.5 (can be based off the current qwen3_next including FLA installation)
  • Could we have a qwen3_5-vision config in examples too

Comment thread examples/qwen3.5/122b-a10b-moe-qlora.yaml Outdated
Comment thread examples/qwen3.5/35b-a3b-moe-qlora.yaml Outdated
Comment on lines +23 to +26
try:
from fla.modules.conv import causal_conv1d as fla_causal_conv1d # FLA < 0.4.1
except ImportError:
fla_causal_conv1d = None
Collaborator

Nice!
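
The guarded import above is an instance of a general pattern: try one or more import locations and fall back to `None` when none is available. A generic sketch of that pattern follows, assuming nothing about FLA's layout beyond the single path shown in the diff; `import_first_available` is a hypothetical helper, and the demo uses standard-library modules so it runs anywhere.

```python
import importlib


def import_first_available(candidates):
    """Return the first importable attribute from (module_path, name) pairs.

    Returns None when no candidate can be imported, so callers can fall back.
    """
    for module_path, attr_name in candidates:
        try:
            module = importlib.import_module(module_path)
        except ImportError:
            continue  # try the next location
        if hasattr(module, attr_name):
            return getattr(module, attr_name)
    return None


if __name__ == "__main__":
    # Stand-in candidates: the first does not exist, the second does.
    sqrt = import_first_available([("no_such_module_xyz", "sqrt"), ("math", "sqrt")])
    print(sqrt(9.0) if sqrt is not None else "fallback path")
```

In the PR's case the candidate list would include `("fla.modules.conv", "causal_conv1d")`, with callers checking for `None` before choosing the kernel or the PyTorch fallback.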

Comment thread src/axolotl/processing_strategies.py Outdated
return Qwen2VLProcessingStrategy(
**processing_kwargs,
)
if chat_template_type == "qwen3_5":
Collaborator

Could we also add qwen3_5moe vlm here? I assume it'll probably use the same processing strategy?

ved1beta and others added 3 commits March 6, 2026 14:22
Co-authored-by: NanoCode012 <kevinvong@rocketmail.com>
Co-authored-by: NanoCode012 <kevinvong@rocketmail.com>
Comment thread examples/qwen3.5/README.md Outdated
@winglian winglian merged commit c119382 into axolotl-ai-cloud:main Mar 6, 2026
12 of 15 checks passed
@coderabbitai coderabbitai Bot mentioned this pull request Mar 17, 2026
@winglian winglian removed the scheduled_release This PR is slated for the upcoming release label Mar 22, 2026