Qwen3Next and Qwen3.5 MoE Patches, Transformers v5 fixes for Gemma, Other misc fixes by Datta0 · Pull Request #495 · unslothai/unsloth-zoo

Datta0 · 2026-02-11T08:19:47Z

Tested on Colab and on a B200 with dtype = fp16, against transformers 5.0.0 and main.

gemini-code-assist · 2026-02-11T08:20:25Z

Summary of Changes

Hello @Datta0, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces a series of compatibility patches for various models, including Gemma, Gemma3n, GLM4-MoE, and CSM, to ensure their proper functioning with updated versions of the transformers library, specifically targeting version 5. The changes focus on making function signatures more robust and adaptable to API changes by introducing optional parameters and dynamic signature checks, thereby preventing breakage when the underlying library evolves.

Highlights

- Gemma3Processor Patching: The Gemma3Processor.__call__ method patching was updated to dynamically check for and handle videos and audio arguments, ensuring compatibility with varying transformers library versions.
- Gemma3Attention Parameter Flexibility: The position_embeddings and attention_mask parameters in Gemma3Attention's forward_function and forward methods were made optional, defaulting to None.
- Gemma3n Model Patching: The input_ids and inputs_embeds parameters in Gemma3nModel.get_placeholder_mask were made optional to improve flexibility.
- Conditional CSM Model Patching: Patching for CsmDepthDecoderForCausalLM.forward and CsmForConditionalGeneration.forward was made conditional, adapting based on whether the target function's signature includes output_attentions.
- Relaxed Patch Matching: Several patch_function calls across different models now use a match_level='relaxed' argument, indicating a more lenient matching strategy for function signatures.
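
The signature-checking idea described in these highlights can be sketched with the standard inspect module. This is a minimal illustration and not the code from the PR: accepted_kwargs, call_old, call_new, and patched_call are hypothetical names that stand in for one callable as it might look in two library versions.

```python
import inspect


def accepted_kwargs(func, **candidates):
    """Return only the candidate keyword arguments that `func` can accept."""
    params = inspect.signature(func).parameters
    takes_var_kwargs = any(
        p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values()
    )
    if takes_var_kwargs:
        # A **kwargs catch-all accepts everything, so pass all candidates on.
        return dict(candidates)
    return {name: value for name, value in candidates.items() if name in params}


# Hypothetical stand-ins for the same callable in an older and a newer release.
def call_old(images=None, text=None):
    return {"images": images, "text": text}


def call_new(images=None, text=None, videos=None, audio=None):
    return {"images": images, "text": text, "videos": videos, "audio": audio}


def patched_call(original, images=None, text=None, videos=None, audio=None):
    # Forward videos/audio only when the wrapped callable declares them.
    extra = accepted_kwargs(original, videos=videos, audio=audio)
    return original(images=images, text=text, **extra)


print(patched_call(call_old, text="hi", videos=[1]))
print(patched_call(call_new, text="hi", videos=[1]))
```

The same membership test on inspect.signature(...).parameters could be used to pick between two patch variants, for example depending on whether output_attentions is declared.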

Changelog

unsloth_zoo/temporary_patches/gemma.py
- Updated patch_Gemma3Processor to conditionally handle videos and audio arguments in Gemma3Processor.__call__.
- Modified position_embeddings and attention_mask parameters in Gemma3Attention's forward_function and forward methods to be optional.
- Applied match_level='relaxed' to patch_function_past_key_values calls.
unsloth_zoo/temporary_patches/gemma3n.py
- Made input_ids and inputs_embeds optional in Gemma3nModel.get_placeholder_mask.
- Removed an extraneous blank line in patch_Gemma3nTextAltUp_predict.
unsloth_zoo/temporary_patches/glm4_moe.py
- Removed the torch.Tensor type hint from the hidden_states parameter in moe_block_forward.
unsloth_zoo/temporary_patches/misc.py
- Implemented conditional patching for CsmDepthDecoderForCausalLM.forward and CsmForConditionalGeneration.forward based on the presence of output_attentions in the target function signature.
- Made input_ids optional in the patched forward methods for CSM models.
- Applied match_level='relaxed' to patch_function calls for CSM models.

Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution. ↩

gemini-code-assist

Code Review

This pull request updates several temporary patches to ensure compatibility with transformers v5. The changes primarily involve using introspection (inspect module) to dynamically adapt the patches based on the function signatures in the installed transformers version. This is a robust approach to handle API changes in the underlying library. The changes look good overall. I've added one minor suggestion to restore a removed type hint for better code clarity.

…h.compile decorators

- Disable @torch.compile on selective_log_softmax and chunked_selective_log_softmax functions
- Disable @torch.compile on chunked_hidden_states_selective_log_softmax (main culprit)
- Disable @torch.compile on accumulate_chunk function within UnslothEfficientGRPO
- Disable @torch.compile on accumulate_chunk in fused_losses/cross_entropy_loss.py
- Comment out grpo_compute_loss_slow torch.compile decorator

This resolves tensor size mismatch errors (s47*s61 vs s47*s87) that occur when torch.compile tries to optimize functions with dynamic VLM sequence lengths.

Fixes Unsloth Issue #4025 for Qwen2-VL and other Vision-Language Models.

Signed-off-by: Daniel Han-Chen <daniel.han.chen@gmail.com>

Datta0 · 2026-02-11T11:22:01Z

Datta0 · 2026-02-19T12:33:17Z

/gemini review

gemini-code-assist

Code Review

This pull request introduces significant updates to ensure compatibility with transformers v5, adds support for new models like Qwen3.5-MoE and Qwen3Next-MoE, and implements a major feature for 4-bit quantized Mixture-of-Experts (MoE) models. The changes include robust patching mechanisms that inspect function signatures, extensive refactoring of MoE patching logic for better code reuse, and deep integration with peft and bitsandbytes for 4-bit quantization. Overall, this is a well-engineered update that enhances functionality and maintainability. My review includes a few suggestions to improve error handling and remove a redundant line of code.

danielhanchen · 2026-02-24T16:50:18Z

Transformers v5 Compatibility Testing Results

Tested PR #495 across 4 configurations to verify transformers v5 compatibility:

Config	Branch	Transformers	Purpose
A	`main`	4.57.6	Baseline (no PR)
B	`combined-test`	4.57.6	PR on current stable (done prior)
C	`main`	5.2.0	Establish v5 regression
D	`combined-test`	5.2.0	Verify PR fixes on v5

Llama 3.1 (8B) Alpaca SFT -- Text-only model

Config	Status	Losses [1,2,3,60,61]	Grad Norms [1,2,3,60,61]	Peak Mem
A (main+v4)	PASSED	[1.888, 1.675, 1.576, 1.014, 1.013]	[0.990, 0.887, 0.814, 0.284, 0.395]	6.55 GB
B (PR+v4)	PASSED	[1.888, 1.675, 1.576, 1.014, 1.013]	[0.990, 0.887, 0.814, 0.284, 0.395]	6.55 GB
C (main+v5)	FAILED	N/A	N/A	N/A
D (PR+v5)	DIVERGENT	[13.107, 13.096, 12.764, 5.513, 5.387]	[12.786, 11.185, 11.591, 5.018, 6.243]	6.64 GB

Analysis: Config C crashes with AssertionError: Fused losses expect grad_output to be all 1.0, but got tensor([0.3333]). This is because transformers v5 divides loss by gradient_accumulation_steps before backward(), changing grad_output from 1.0 to 1/3.

Config D completes training after removing the assertion, but losses are ~7x higher than baseline because the fused CE loss backward ignores grad_output, so gradients are not scaled correctly. This is a pre-existing issue in fused_losses/cross_entropy_loss.py not addressed by this PR.

Gemma3 (4B) Vision SFT -- Instruct model (PR #495 gemma.py target)

Config	Status	Losses [1,2,3,60,61]	Grad Norms [1,2,3,60,61]	Peak Mem
A (main+v4)	PASSED	[5.577, 4.320, 5.169, 5.091, 4.349]	[13.454, 23.417, 12.768, 13.764, 20.979]	4.17 GB
B (PR+v4)	PASSED	[5.577, 4.320, 5.169, 5.091, 4.349]	[13.454, 23.417, 12.768, 13.764, 20.979]	4.17 GB
C (main+v5)	FAILED	N/A (fused CE loss assertion)	N/A	N/A
D (PR+v5)	PASSED	[5.577, 4.320, 5.169, 5.091, 4.349]	[13.454, 23.417, 12.768, 13.764, 20.979]	4.17 GB

BIT-IDENTICAL results across configs A, B, and D. The v5 gemma.py patches produce exactly the same training behavior. Vision models do not use the fused CE loss path, so the grad_output scaling issue does not affect them.

Gemma3 (4B) Vision SFT -- Base model

Config	Status	Losses [1,2,3,60,61]	Grad Norms [1,2,3,60,61]	Peak Mem	Notes
B (PR+v4)	PASSED	[8.661, 7.687, 8.273, 0.337, 0.285]	[5.808, 5.394, 9.170, 2.369, 1.600]	6.59 GB	COMPILE_DISABLE=1
D (PR+v5)	PASSED	[8.661, 7.687, 8.279, 0.335, 0.289]	[1.935, 1.792, 2.098, 0.962, 0.595]	6.59 GB	COMPILE_DISABLE=1

Losses nearly identical (step 61: 0.289 vs 0.285). Grad norms ~3x lower on v5 due to the gradient_accumulation_steps scaling change, but training converges to the same loss values.
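
The "~3x" figure can be checked directly from the two rows above. The snippet below is only an arithmetic check on the reported numbers; it assumes gradient_accumulation_steps = 3, which is what the 1/3 grad_output reported for the Llama run suggests.

```python
# Grad norms at steps [1, 2, 3, 60, 61], copied from the Gemma3 base-model table.
steps = [1, 2, 3, 60, 61]
v4_grad_norms = [5.808, 5.394, 9.170, 2.369, 1.600]  # config B (PR + v4)
v5_grad_norms = [1.935, 1.792, 2.098, 0.962, 0.595]  # config D (PR + v5)

# Ratio of v4 to v5 grad norm at each logged step.
ratios = [v4 / v5 for v4, v5 in zip(v4_grad_norms, v5_grad_norms)]

for step, ratio in zip(steps, ratios):
    print(f"step {step:>2}: v4/v5 grad-norm ratio = {ratio:.2f}")
```

The first two steps land very close to 3, matching a 1/3 scaling; the later steps deviate from it, so "~3x" should be read as approximate.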

Phi-4 Conversational SFT

Config	Status	Losses [1,2,3,60,61]	Grad Norms [1,2,3,60,61]	Peak Mem	Notes
B (PR+v4)	PASSED	[1.180, 0.935, 1.046, 0.986, 0.643]	[0.356, 0.528, 0.346, 0.774, 0.664]	7.49 GB	COMPILE_DISABLE=1
D (PR+v5)	PARTIAL	[1.180, 0.936, 1.045, 1.010, 0.663]	[0.118, 0.177, 0.111, 0.209, 0.155]	8.17 GB	COMPILE_DISABLE=1

Training completed with similar losses but ~3x lower grad norms (same v5 scaling). Inference failed with assert type(input_ids) is torch.Tensor in unsloth/models/vision.py:158.

Gemma3N Vision SFT

Config	Status	Error
B (PR+v4)	FAILED	`per_layer_projection *= ...` inplace view + backward hooks
D (PR+v5)	FAILED	Same error: `RuntimeError: Output 0 of BackwardHookFunctionBackward is a view`

Upstream transformers bug in modeling_gemma3n.py:1768, not PR-related.

Qwen3Next / Qwen3.5 Patch Verification

Both Qwen3NextExperts and Qwen3_5MoeSparseMoeBlock exist in transformers 5.2.0. Patch modules load successfully. These are NOT no-ops on v5.

Key Findings

- PR #495's gemma.py v5 fixes work correctly -- Gemma3 Vision (instruct) produces bit-identical results across v4 and v5.
- Fused CE loss is the main v5 blocker -- cross_entropy_loss.py backward ignores grad_output, causing a gradient scaling mismatch on v5 (3x in these runs) when gradient_accumulation_steps > 1. This affects text-only SFT models (Llama, Phi-4) but NOT vision models.
- Phi-4 inference assertion -- assert type(input_ids) is torch.Tensor in vision.py:158 fails on v5 (likely because the type of input_ids changed).
- Gemma3N -- Upstream per_layer_projection *= ... inplace view bug persists on both v4 and v5.

Recommendation

The PR's v5-specific changes (gemma.py, gemma3n.py patches) are verified working. The fused CE loss backward scaling issue needs a separate fix in cross_entropy_loss.py to properly handle grad_output != 1.0 for full v5 compatibility.

danielhanchen · 2026-02-24T18:06:56Z

Follow-up: Fused CE Loss v5 Fix Needed

The main remaining blocker for full transformers v5 support is unsloth_zoo/fused_losses/cross_entropy_loss.py. The backward pass returns pre-computed gradients without scaling by grad_output:

@staticmethod
def backward(ctx, grad_output,):
    # grad_output is assumed to be always = 1
    (grad_inputs, grad_lm_head, grad_lm_head_bias, ) = ctx.saved_tensors
    return (None, grad_inputs, grad_lm_head, grad_lm_head_bias, None, None, None, None, None, None, None, None, None,)

Transformers v5 divides loss by gradient_accumulation_steps before calling backward(), so grad_output = 1/gradient_accumulation_steps instead of 1.0. Since the backward ignores this, gradients are effectively gradient_accumulation_steps times too large, causing divergent training for text-only SFT models (Llama, Phi-4).

Vision models are unaffected because they don't use this fused CE loss path.

This is separate from PR #495 -- just flagging for a follow-up fix.
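
The effect described above can be reproduced with a toy torch.autograd.Function. This is a sketch under stated assumptions, not the fix for cross_entropy_loss.py: PrecomputedGrad and grad_for are hypothetical names, the loss is a simple sum of squares rather than cross entropy, and gradient_accumulation_steps is assumed to be 3.

```python
import torch


class PrecomputedGrad(torch.autograd.Function):
    """Toy stand-in for a fused loss that computes its gradient in forward."""

    @staticmethod
    def forward(ctx, x, scale_by_grad_output):
        # loss = sum(x**2), so d(loss)/dx = 2*x; both are computed eagerly here.
        loss = (x.detach() ** 2).sum()
        grad_x = 2.0 * x.detach()
        ctx.save_for_backward(grad_x)
        ctx.scale_by_grad_output = scale_by_grad_output
        return loss

    @staticmethod
    def backward(ctx, grad_output):
        (grad_x,) = ctx.saved_tensors
        if ctx.scale_by_grad_output:
            # Apply the chain rule: scale the stored gradient by the upstream one.
            grad_x = grad_x * grad_output
        # One gradient per forward input; the boolean flag gets None.
        return grad_x, None


def grad_for(scale_by_grad_output, steps=3):
    x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
    loss = PrecomputedGrad.apply(x, scale_by_grad_output)
    # Dividing before backward() makes grad_output equal to 1/steps.
    (loss / steps).backward()
    return x.grad


reference = 2.0 * torch.tensor([1.0, 2.0, 3.0]) / 3  # true gradient of loss/3
ignored = grad_for(False)  # backward ignores grad_output
scaled = grad_for(True)    # backward multiplies by grad_output

print("reference:", reference)
print("ignored:  ", ignored)
print("scaled:   ", scaled)
```

In this toy case the variant that ignores grad_output returns gradients that are steps times the reference, while the scaled variant matches it. A real fix would also have to scale the lm_head and bias gradients in the same way.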

Datta0 · 2026-02-25T07:31:42Z

Some saving fixes are handled by #499

GoldenGrapeGentleman · 2026-02-25T08:06:17Z

Just passing by, taking the picture(〃￣︶￣)人(￣︶￣〃)

Fix Gemma and other model patches for v5

df0170f

gemini-code-assist Bot reviewed Feb 11, 2026

Comment thread unsloth_zoo/temporary_patches/glm4_moe.py

Datta0 requested a review from danielhanchen February 11, 2026 11:21

Datta0 added 4 commits February 11, 2026 11:40

update deepseek moe impl to match the current one

7774be7

remove style changes

38856a1

Qwen3Next MoE

a7063f7

4bit moe lora

1f70241

Datta0 mentioned this pull request Feb 16, 2026

Handle missing CSM depth decoder loss during loss aggregation #496

Open

Datta0 added 11 commits February 16, 2026 13:50

Qwen3.5 MoE

be54ecc

Modularise code into functions

77f28e7

Fixup deepseek patch

c3966d9

import fixes

eecc3f6

Properly patch compiler for qwen3*experts

36f89a0

Make qwen3.5 and qwen next patches failsafe

5b507b3

Cleanup imports

d99a6f9

Fix path issues

b89fb03

Disable compile for Qwen3GatedDeltaNet

dc4db93

Disable compile for Qwen3GatedDeltaNet

3f6ff66

Use compiled_cache for forward pass

27f87df

gemini-code-assist Bot reviewed Feb 19, 2026

Comment thread unsloth_zoo/temporary_patches/moe_utils.py

Comment thread unsloth_zoo/compiler.py

Comment thread unsloth_zoo/temporary_patches/deepseek_v3_moe.py Outdated

Cleanup: remove redundant assignments, fix extractor sig, remove stra…

4e21e3f

…y pass/blanks

Datta0 mentioned this pull request Feb 20, 2026

Qwen3-Coder-Next-Base OOM on 2xA100 QLoRA unslothai/unsloth#4040

Open

Datta0 changed the title ~~Fix Gemma and other model patches for v5~~ Qwen3Next and Qwen3.5 MoE Patches, Transformers v5 fixes for Gemma, Other misc fixes Feb 20, 2026

Datta0 requested a review from mmathew23 February 24, 2026 05:59

Datta0 added 2 commits February 25, 2026 08:06

disable compile for qwen 3.5 moe GDN

f2b8752

disable compile for FLA functions

661976e

Datta0 mentioned this pull request Feb 25, 2026

[Feature] Qwen3.5 unslothai/unsloth#4108

Open

patch FLA to disable repetitive autotune

75984c9

danielhanchen reviewed Feb 25, 2026

Comment thread unsloth_zoo/compiler.py Outdated

Update unsloth_zoo/compiler.py

5056d45

Datta0 mentioned this pull request Feb 25, 2026

[WIP] Qwen 3.5 MoE finetuning A100 notebook unslothai/notebooks#193

Merged

danielhanchen merged commit 0f04252 into unslothai:main Feb 25, 2026

Datta0 mentioned this pull request Feb 26, 2026

Fix gptoss 4bit #524

Merged

This was referenced Apr 20, 2026

[MoE] Fix Qwen-family MoE LoRA extractor shape mismatch #601

Merged

[Qwen 3.5][gemma4] Qwen35 and Gemma 4 fast inference (clean-base mirror of #588) #603

Open

Qwen3Next and Qwen3.5 MoE Patches, Transformers v5 fixes for Gemma, Other misc fixes #495
danielhanchen merged 22 commits into unslothai:main from Datta0:transformers_v5_patches

3 participants
