Qwen3Next and Qwen3.5 MoE Patches, Transformers v5 fixes for Gemma, Other misc fixes #495

Merged
danielhanchen merged 22 commits into unslothai:main from Datta0:transformers_v5_patches on Feb 25, 2026

@Datta0 (Collaborator) commented Feb 11, 2026

[image]

Tested on Colab and on a B200 with dtype = fp16, against both transformers 5.0.0 and main.

@gemini-code-assist (Contributor) commented

Summary of Changes

Hello @Datta0, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces a series of compatibility patches for various models, including Gemma, Gemma3n, GLM4-MoE, and CSM, to ensure their proper functioning with updated versions of the transformers library, specifically targeting version 5. The changes focus on making function signatures more robust and adaptable to API changes by introducing optional parameters and dynamic signature checks, thereby preventing breakage when the underlying library evolves.

Highlights

  • Gemma3Processor Patching: The Gemma3Processor.__call__ method patching was updated to dynamically check for and handle videos and audio arguments, ensuring compatibility with varying transformers library versions.
  • Gemma3Attention Parameter Flexibility: The position_embeddings and attention_mask parameters in Gemma3Attention's forward_function and forward methods were made optional, defaulting to None.
  • Gemma3n Model Patching: The input_ids and inputs_embeds parameters in Gemma3nModel.get_placeholder_mask were made optional to improve flexibility.
  • Conditional CSM Model Patching: Patching for CsmDepthDecoderForCausalLM.forward and CsmForConditionalGeneration.forward was made conditional, adapting based on whether the target function's signature includes output_attentions.
  • Relaxed Patch Matching: Several patch_function calls across different models now use a match_level='relaxed' argument, indicating a more lenient matching strategy for function signatures.
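
The signature-based conditional patching described in the highlights can be sketched roughly as follows. This is a minimal illustration only, not the PR's code: `accepts_param`, `choose_patch`, `forward_old`, and `forward_new` are hypothetical names, and the two `forward_*` functions merely stand in for an upstream method as it might look in two library versions.

```python
import inspect

def accepts_param(func, name):
    """Return True if `func` declares a parameter called `name`."""
    try:
        return name in inspect.signature(func).parameters
    except (TypeError, ValueError):
        # Some callables expose no signature; treat that as "not present".
        return False

# Hypothetical stand-ins for one upstream method in two library versions.
def forward_old(input_ids=None, output_attentions=None):
    return "old"

def forward_new(input_ids=None, **kwargs):
    return "new"

def choose_patch(target):
    """Pick a replacement whose signature mirrors the installed target."""
    if accepts_param(target, "output_attentions"):
        def patched(input_ids=None, output_attentions=None):
            return ("with_output_attentions", input_ids)
    else:
        def patched(input_ids=None, **kwargs):
            return ("without_output_attentions", input_ids)
    return patched

print(choose_patch(forward_old)(input_ids=[1, 2])[0])
print(choose_patch(forward_new)(input_ids=[1, 2])[0])
```

Checking the live signature rather than the library version string keeps the patch tied to what is actually installed, which is presumably why the PR prefers introspection.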

Changelog
  • unsloth_zoo/temporary_patches/gemma.py
    • Updated patch_Gemma3Processor to conditionally handle videos and audio arguments in Gemma3Processor.__call__.
    • Modified position_embeddings and attention_mask parameters in Gemma3Attention's forward_function and forward methods to be optional.
    • Applied match_level='relaxed' to patch_function_past_key_values calls.
  • unsloth_zoo/temporary_patches/gemma3n.py
    • Made input_ids and inputs_embeds optional in Gemma3nModel.get_placeholder_mask.
    • Removed an extraneous blank line in patch_Gemma3nTextAltUp_predict.
  • unsloth_zoo/temporary_patches/glm4_moe.py
    • Removed the torch.Tensor type hint from the hidden_states parameter in moe_block_forward.
  • unsloth_zoo/temporary_patches/misc.py
    • Implemented conditional patching for CsmDepthDecoderForCausalLM.forward and CsmForConditionalGeneration.forward based on the presence of output_attentions in the target function signature.
    • Made input_ids optional in the patched forward methods for CSM models.
    • Applied match_level='relaxed' to patch_function calls for CSM models.
@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request updates several temporary patches to ensure compatibility with transformers v5. The changes primarily involve using introspection (inspect module) to dynamically adapt the patches based on the function signatures in the installed transformers version. This is a robust approach to handle API changes in the underlying library. The changes look good overall. I've added one minor suggestion to restore a removed type hint for better code clarity.

Comment thread unsloth_zoo/temporary_patches/glm4_moe.py
…h.compile decorators

- Disable @torch.compile on selective_log_softmax and chunked_selective_log_softmax functions
- Disable @torch.compile on chunked_hidden_states_selective_log_softmax (main culprit)
- Disable @torch.compile on accumulate_chunk function within UnslothEfficientGRPO
- Disable @torch.compile on accumulate_chunk in fused_losses/cross_entropy_loss.py
- Comment out grpo_compute_loss_slow torch.compile decorator

This resolves tensor size mismatch errors (s47*s61 vs s47*s87) that occur
when torch.compile tries to optimize functions with dynamic VLM sequence lengths.

Fixes Unsloth Issue #4025 for Qwen2-VL and other Vision-Language Models.

Signed-off-by: Daniel Han-Chen <daniel.han.chen@gmail.com>
@Datta0 Datta0 requested a review from danielhanchen February 11, 2026 11:21
@Datta0 (Collaborator, Author) commented Feb 11, 2026

[image]

@Datta0 (Collaborator, Author) commented Feb 19, 2026

/gemini review

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request introduces significant updates to ensure compatibility with transformers v5, adds support for new models like Qwen3.5-MoE and Qwen3Next-MoE, and implements a major feature for 4-bit quantized Mixture-of-Experts (MoE) models. The changes include robust patching mechanisms that inspect function signatures, extensive refactoring of MoE patching logic for better code reuse, and deep integration with peft and bitsandbytes for 4-bit quantization. Overall, this is a well-engineered update that enhances functionality and maintainability. My review includes a few suggestions to improve error handling and remove a redundant line of code.

Comment thread unsloth_zoo/temporary_patches/moe_utils.py
Comment thread unsloth_zoo/compiler.py
Comment thread unsloth_zoo/temporary_patches/deepseek_v3_moe.py Outdated
@Datta0 Datta0 changed the title from "Fix Gemma and other model patches for v5" to "Qwen3Next and Qwen3.5 MoE Patches, Transformers v5 fixes for Gemma, Other misc fixes" Feb 20, 2026
@Datta0 Datta0 requested a review from mmathew23 February 24, 2026 05:59
@danielhanchen (Member) commented

Transformers v5 Compatibility Testing Results

Tested PR #495 across 4 configurations to verify transformers v5 compatibility:

| Config | Branch | Transformers | Purpose |
| --- | --- | --- | --- |
| A | main | 4.57.6 | Baseline (no PR) |
| B | combined-test | 4.57.6 | PR on current stable (done prior) |
| C | main | 5.2.0 | Establish v5 regression |
| D | combined-test | 5.2.0 | Verify PR fixes on v5 |

Llama 3.1 (8B) Alpaca SFT -- Text-only model

| Config | Status | Losses (steps 1, 2, 3, 60, 61) | Grad norms (steps 1, 2, 3, 60, 61) | Peak mem |
| --- | --- | --- | --- | --- |
| A (main+v4) | PASSED | [1.888, 1.675, 1.576, 1.014, 1.013] | [0.990, 0.887, 0.814, 0.284, 0.395] | 6.55 GB |
| B (PR+v4) | PASSED | [1.888, 1.675, 1.576, 1.014, 1.013] | [0.990, 0.887, 0.814, 0.284, 0.395] | 6.55 GB |
| C (main+v5) | FAILED | N/A | N/A | N/A |
| D (PR+v5) | DIVERGENT | [13.107, 13.096, 12.764, 5.513, 5.387] | [12.786, 11.185, 11.591, 5.018, 6.243] | 6.64 GB |

Analysis: Config C crashes with AssertionError: Fused losses expect grad_output to be all 1.0, but got tensor([0.3333]). This is because transformers v5 divides loss by gradient_accumulation_steps before backward(), changing grad_output from 1.0 to 1/3.
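
The 0.3333 in the assertion message follows from the chain rule: if the loss is divided by `gradient_accumulation_steps` before `backward()`, the gradient arriving at the loss node is `1 / gradient_accumulation_steps`. A small numerical check in plain Python, using hypothetical helper names (`scaled_loss`, `upstream_grad`) and assuming 3 accumulation steps as the 1/3 above implies:

```python
def scaled_loss(loss, gradient_accumulation_steps):
    """Loss as it reaches backward() when the trainer pre-divides it."""
    return loss / gradient_accumulation_steps

def upstream_grad(gradient_accumulation_steps, loss=2.0, eps=1e-6):
    """Central finite difference of scaled_loss with respect to loss."""
    hi = scaled_loss(loss + eps, gradient_accumulation_steps)
    lo = scaled_loss(loss - eps, gradient_accumulation_steps)
    return (hi - lo) / (2 * eps)

print(round(upstream_grad(1), 4))  # 1.0    (no pre-division)
print(round(upstream_grad(3), 4))  # 0.3333 (matches tensor([0.3333]))
```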

Config D completes training after removing the assertion, but losses are ~7x higher than baseline at step 1 (13.107 vs 1.888) and still ~5x higher at step 61 (5.387 vs 1.013), because the fused CE loss backward ignores grad_output, so gradients are not scaled correctly. This is a pre-existing issue in fused_losses/cross_entropy_loss.py not addressed by this PR.

Gemma3 (4B) Vision SFT -- Instruct model (PR #495 gemma.py target)

| Config | Status | Losses (steps 1, 2, 3, 60, 61) | Grad norms (steps 1, 2, 3, 60, 61) | Peak mem |
| --- | --- | --- | --- | --- |
| A (main+v4) | PASSED | [5.577, 4.320, 5.169, 5.091, 4.349] | [13.454, 23.417, 12.768, 13.764, 20.979] | 4.17 GB |
| B (PR+v4) | PASSED | [5.577, 4.320, 5.169, 5.091, 4.349] | [13.454, 23.417, 12.768, 13.764, 20.979] | 4.17 GB |
| C (main+v5) | FAILED | N/A (fused CE loss assertion) | N/A | N/A |
| D (PR+v5) | PASSED | [5.577, 4.320, 5.169, 5.091, 4.349] | [13.454, 23.417, 12.768, 13.764, 20.979] | 4.17 GB |

BIT-IDENTICAL results across configs A, B, and D. The v5 gemma.py patches produce exactly the same training behavior. Vision models do not use the fused CE loss path, so the grad_output scaling issue does not affect them.

Gemma3 (4B) Vision SFT -- Base model

| Config | Status | Losses (steps 1, 2, 3, 60, 61) | Grad norms (steps 1, 2, 3, 60, 61) | Peak mem | Notes |
| --- | --- | --- | --- | --- | --- |
| B (PR+v4) | PASSED | [8.661, 7.687, 8.273, 0.337, 0.285] | [5.808, 5.394, 9.170, 2.369, 1.600] | 6.59 GB | COMPILE_DISABLE=1 |
| D (PR+v5) | PASSED | [8.661, 7.687, 8.279, 0.335, 0.289] | [1.935, 1.792, 2.098, 0.962, 0.595] | 6.59 GB | COMPILE_DISABLE=1 |

Losses nearly identical (step 61: 0.289 vs 0.285). Grad norms ~3x lower on v5 due to the gradient_accumulation_steps scaling change, but training converges to the same loss values.
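
As a quick sanity check on the "~3x" figure, the per-step ratio of the reported grad norms (config B over config D) can be computed directly from the numbers in the table above. The variable names are illustrative only:

```python
# Grad norms at steps 1, 2, 3, 60, 61, copied from the Gemma3 base table.
grad_norms_b_v4 = [5.808, 5.394, 9.170, 2.369, 1.600]
grad_norms_d_v5 = [1.935, 1.792, 2.098, 0.962, 0.595]

ratios = [round(b / d, 2) for b, d in zip(grad_norms_b_v4, grad_norms_d_v5)]
mean_ratio = sum(ratios) / len(ratios)

for step, ratio in zip([1, 2, 3, 60, 61], ratios):
    print(f"step {step}: v4/v5 grad-norm ratio = {ratio}")
print(f"mean ratio = {mean_ratio:.2f}")
```

The first two steps sit almost exactly at 3, while later steps drift between roughly 2.5 and 4.4, so "~3x" is best read as an approximate average rather than a constant factor.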

Phi-4 Conversational SFT

| Config | Status | Losses (steps 1, 2, 3, 60, 61) | Grad norms (steps 1, 2, 3, 60, 61) | Peak mem | Notes |
| --- | --- | --- | --- | --- | --- |
| B (PR+v4) | PASSED | [1.180, 0.935, 1.046, 0.986, 0.643] | [0.356, 0.528, 0.346, 0.774, 0.664] | 7.49 GB | COMPILE_DISABLE=1 |
| D (PR+v5) | PARTIAL | [1.180, 0.936, 1.045, 1.010, 0.663] | [0.118, 0.177, 0.111, 0.209, 0.155] | 8.17 GB | COMPILE_DISABLE=1 |

Training completed with similar losses but ~3x lower grad norms (same v5 scaling). Inference failed with assert type(input_ids) is torch.Tensor in unsloth/models/vision.py:158.
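
An exact-type assertion of this form rejects anything that is not precisely the named class, including subclasses and container types. The sketch below uses plain-Python stand-ins (`Tensor`, `WrappedTensor`, and the dict are all hypothetical); what v5 actually passes as input_ids is not established here.

```python
class Tensor:
    """Hypothetical stand-in for torch.Tensor."""

class WrappedTensor(Tensor):
    """Hypothetical subclass, e.g. some tensor wrapper type."""

def strict_check(x):
    # Same shape as `assert type(input_ids) is torch.Tensor`.
    return type(x) is Tensor

def lenient_check(x):
    # Accepts subclasses as well as the base class.
    return isinstance(x, Tensor)

plain = Tensor()
wrapped = WrappedTensor()
batch = {"input_ids": Tensor()}  # dict-like container instead of a tensor

for name, value in [("plain", plain), ("wrapped", wrapped), ("batch", batch)]:
    print(name, strict_check(value), lenient_check(value))
```

If the new input is a subclass, `isinstance` would be enough; if it is a container, the tensor would have to be extracted first. Which case applies needs to be confirmed against the failing call.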

Gemma3N Vision SFT

| Config | Status | Error |
| --- | --- | --- |
| B (PR+v4) | FAILED | per_layer_projection *= ... inplace view + backward hooks |
| D (PR+v5) | FAILED | Same error: RuntimeError: Output 0 of BackwardHookFunctionBackward is a view |

Upstream transformers bug in modeling_gemma3n.py:1768, not PR-related.

Qwen3Next / Qwen3.5 Patch Verification

Both Qwen3NextExperts and Qwen3_5MoeSparseMoeBlock exist in transformers 5.2.0. Patch modules load successfully. These are NOT no-ops on v5.
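
A check of this kind could look like the sketch below. The helper `has_symbol` is hypothetical, and the two module paths are assumptions based on the usual transformers layout rather than something stated in this thread. The helper returns False instead of raising when the module cannot be imported.

```python
import importlib

def has_symbol(module_name, attr):
    """Return True if `module_name` imports cleanly and exposes `attr`."""
    try:
        module = importlib.import_module(module_name)
    except Exception:
        # Missing package or a failing optional import: report "not available".
        return False
    return hasattr(module, attr)

# Assumed module paths; adjust to the installed transformers version.
CANDIDATES = {
    "transformers.models.qwen3_next.modeling_qwen3_next": "Qwen3NextExperts",
    "transformers.models.qwen3_5_moe.modeling_qwen3_5_moe": "Qwen3_5MoeSparseMoeBlock",
}

for module_name, attr in CANDIDATES.items():
    print(attr, has_symbol(module_name, attr))
```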

Key Findings

  1. PR #495's gemma.py v5 fixes work correctly -- Gemma3 Vision (instruct) produces bit-identical results across v4 and v5.
  2. Fused CE loss is the main v5 blocker -- cross_entropy_loss.py backward ignores grad_output, causing 3x gradient scaling mismatch on v5 when gradient_accumulation_steps > 1. This affects text-only SFT models (Llama, Phi-4) but NOT vision models.
  3. Phi-4 inference assertion -- assert type(input_ids) is torch.Tensor in vision.py:158 fails on v5 (likely input_ids type changed).
  4. Gemma3N -- Upstream per_layer_projection *= ... inplace view bug persists on both v4 and v5.

Recommendation

The PR's v5-specific changes (gemma.py, gemma3n.py patches) are verified working. The fused CE loss backward scaling issue needs a separate fix in cross_entropy_loss.py to properly handle grad_output != 1.0 for full v5 compatibility.

@danielhanchen (Member) commented

Follow-up: Fused CE Loss v5 Fix Needed

The main remaining blocker for full transformers v5 support is unsloth_zoo/fused_losses/cross_entropy_loss.py. The backward pass returns pre-computed gradients without scaling by grad_output:

@staticmethod
def backward(ctx, grad_output,):
    # grad_output is assumed to be always = 1
    (grad_inputs, grad_lm_head, grad_lm_head_bias, ) = ctx.saved_tensors
    return (None, grad_inputs, grad_lm_head, grad_lm_head_bias, None, None, None, None, None, None, None, None, None,)

Transformers v5 divides loss by gradient_accumulation_steps before calling backward(), so grad_output = 1/gradient_accumulation_steps instead of 1.0. Since the backward ignores this, gradients are effectively gradient_accumulation_steps times too large, causing divergent training for text-only SFT models (Llama, Phi-4).

Vision models are unaffected because they don't use this fused CE loss path.
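
The effect can be illustrated without PyTorch. The class below is a plain-Python stand-in for a custom autograd function that precomputes gradients in forward and hands them back in backward. `FusedLossSketch` and both method names are hypothetical, and `backward_scaled` is only one possible shape for a fix, not the one the maintainers have committed to.

```python
class FusedLossSketch:
    """Stand-in for an autograd function with gradients saved in forward."""

    def __init__(self, precomputed_grads):
        self.saved = list(precomputed_grads)

    def backward_ignoring(self, grad_output):
        # Mirrors the behaviour described above: grad_output is assumed to be 1.
        return list(self.saved)

    def backward_scaled(self, grad_output):
        # Hypothetical fix: apply the chain rule to the saved gradients.
        return [g * grad_output for g in self.saved]

gradient_accumulation_steps = 3
fn = FusedLossSketch([0.6, -1.2, 0.3])
grad_output = 1.0 / gradient_accumulation_steps

ignored = fn.backward_ignoring(grad_output)
scaled = fn.backward_scaled(grad_output)
ratios = [i / s for i, s in zip(ignored, scaled)]
print(ratios)  # each entry is approximately gradient_accumulation_steps
```

With grad_output equal to 1.0 the two variants agree, which is consistent with the issue staying hidden on transformers v4.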

This is separate from PR #495 -- just flagging for a follow-up fix.

@Datta0 (Collaborator, Author) commented Feb 25, 2026

Some saving fixes are handled by #499

@GoldenGrapeGentleman (Contributor) commented

[image]

Just passing by, taking the picture(〃 ̄︶ ̄)人( ̄︶ ̄〃)

Comment thread unsloth_zoo/compiler.py Outdated