Qwen3Next and Qwen3.5 MoE Patches, Transformers v5 fixes for Gemma, Other misc fixes #495
Summary of Changes
This pull request introduces a series of compatibility patches for various models, including Gemma, Gemma3n, GLM4-MoE, and CSM, to ensure their proper functioning with updated versions of the […]
Code Review
This pull request updates several temporary patches to ensure compatibility with transformers v5. The changes primarily involve using introspection (inspect module) to dynamically adapt the patches based on the function signatures in the installed transformers version. This is a robust approach to handle API changes in the underlying library. The changes look good overall. I've added one minor suggestion to restore a removed type hint for better code clarity.
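The signature-introspection approach described in this review can be sketched roughly as follows. This is a minimal illustration and not the PR's actual code: `forward_old`, `forward_new`, `call_compat`, and the `cache_position` keyword are hypothetical stand-ins for a library function whose signature differs between installed versions.

```python
import inspect


def forward_old(hidden_states, attention_mask=None):
    """Hypothetical older signature: no `cache_position` keyword."""
    return ("old", hidden_states, attention_mask)


def forward_new(hidden_states, attention_mask=None, cache_position=None):
    """Hypothetical newer signature: accepts `cache_position`."""
    return ("new", hidden_states, attention_mask, cache_position)


def call_compat(fn, *args, **kwargs):
    """Call `fn`, dropping keyword arguments its signature does not accept."""
    params = inspect.signature(fn).parameters
    accepts_var_kwargs = any(
        p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values()
    )
    if not accepts_var_kwargs:
        # Keep only the keywords that the installed version actually declares.
        kwargs = {k: v for k, v in kwargs.items() if k in params}
    return fn(*args, **kwargs)


# The same call site works against both hypothetical versions.
result_old = call_compat(forward_old, "h", attention_mask="m", cache_position=3)
result_new = call_compat(forward_new, "h", attention_mask="m", cache_position=3)
```

Checking the signature at patch time, rather than comparing version strings, keeps the patch valid for intermediate or development builds where the version number alone does not say which signature is present.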
…h.compile decorators
- Disable @torch.compile on selective_log_softmax and chunked_selective_log_softmax functions
- Disable @torch.compile on chunked_hidden_states_selective_log_softmax (main culprit)
- Disable @torch.compile on accumulate_chunk function within UnslothEfficientGRPO
- Disable @torch.compile on accumulate_chunk in fused_losses/cross_entropy_loss.py
- Comment out grpo_compute_loss_slow torch.compile decorator

This resolves tensor size mismatch errors (s47*s61 vs s47*s87) that occur when torch.compile tries to optimize functions with dynamic VLM sequence lengths. Fixes Unsloth Issue #4025 for Qwen2-VL and other Vision-Language Models.

Signed-off-by: Daniel Han-Chen <daniel.han.chen@gmail.com>
/gemini review
Code Review
This pull request introduces significant updates to ensure compatibility with transformers v5, adds support for new models like Qwen3.5-MoE and Qwen3Next-MoE, and implements a major feature for 4-bit quantized Mixture-of-Experts (MoE) models. The changes include robust patching mechanisms that inspect function signatures, extensive refactoring of MoE patching logic for better code reuse, and deep integration with peft and bitsandbytes for 4-bit quantization. Overall, this is a well-engineered update that enhances functionality and maintainability. My review includes a few suggestions to improve error handling and remove a redundant line of code.
Transformers v5 Compatibility Testing Results
Tested PR #495 across 4 configurations to verify transformers v5 compatibility:
Llama 3.1 (8B) Alpaca SFT -- Text-only model
Analysis: Config C crashes with […]. Config D completes training after removing the assertion, but losses are ~7x higher than baseline because the fused CE loss backward ignores […].
Gemma3 (4B) Vision SFT -- Instruct model (PR #495 gemma.py target)
BIT-IDENTICAL results across configs A, B, and D. The v5 gemma.py patches produce exactly the same training behavior. Vision models do not use the fused CE loss path, so the grad_output scaling issue does not affect them.
Gemma3 (4B) Vision SFT -- Base model
Losses nearly identical (step 61: 0.289 vs 0.285). Grad norms ~3x lower on v5 due to the gradient_accumulation_steps scaling change, but training converges to the same loss values.
Phi-4 Conversational SFT
Training completed with similar losses but ~3x lower grad norms (same v5 scaling). Inference failed with […].
Gemma3N Vision SFT
Upstream transformers bug in […].
Qwen3Next / Qwen3.5 Patch Verification
Both […]
Key Findings
Recommendation
The PR's v5-specific changes (gemma.py, gemma3n.py patches) are verified working. The fused CE loss backward scaling issue needs a separate fix in […].
Follow-up: Fused CE Loss v5 Fix Needed
The main remaining blocker for full transformers v5 support is […]

@staticmethod
def backward(ctx, grad_output,):
    # grad_output is assumed to be always = 1
    (grad_inputs, grad_lm_head, grad_lm_head_bias, ) = ctx.saved_tensors
    return (None, grad_inputs, grad_lm_head, grad_lm_head_bias, None, None, None, None, None, None, None, None, None,)

Transformers v5 divides loss by […]. Vision models are unaffected because they don't use this fused CE loss path. This is separate from PR #495 -- just flagging for a follow-up fix.
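To make the chain-rule point concrete, here is a small pure-Python sketch. It is not the library's code: `fused_forward`, `backward_ignoring_upstream`, `backward_scaled`, and `divisor` are hypothetical names, and the scalar toy loss stands in for the real fused CE loss. It assumes only that the caller scales the loss after the fused function returns, so the upstream gradient is no longer 1.

```python
def fused_forward(x, w):
    """Toy 'fused' forward: return the loss and its precomputed gradient w.r.t. x."""
    loss = 0.5 * (w * x) ** 2
    grad_x = (w ** 2) * x  # d(loss)/dx, computed eagerly in the forward pass
    return loss, grad_x


def backward_ignoring_upstream(saved_grad_x, grad_output):
    """Mirrors the quoted pattern: the upstream gradient is assumed to be 1."""
    return saved_grad_x


def backward_scaled(saved_grad_x, grad_output):
    """Chain rule: scale the saved gradient by the upstream gradient."""
    return saved_grad_x * grad_output


x, w, divisor = 3.0, 2.0, 4.0  # divisor is a hypothetical loss scaling factor
loss, saved = fused_forward(x, w)

# If the caller computes final = loss / divisor, then d(final)/d(loss) = 1 / divisor.
grad_output = 1.0 / divisor

# Reference gradient of (loss / divisor) w.r.t. x via central finite differences.
eps = 1e-6
reference = (
    fused_forward(x + eps, w)[0] - fused_forward(x - eps, w)[0]
) / (2 * eps * divisor)

wrong = backward_ignoring_upstream(saved, grad_output)
right = backward_scaled(saved, grad_output)
```

In this toy setup `right` agrees with the finite-difference reference, while `wrong` is larger by exactly the factor `divisor`. A real fix would presumably apply the same scaling to every saved gradient tensor, but that is an assumption about the follow-up and not something this thread confirms.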
Some saving fixes are handled by #499


Tested on Colab and B200 with dtype = fp16 with transformers 5.0.0 and main