Remove Gemma-4 from FORCE_FLOAT32 #4875
Merged
Gemma-4 does not need FORCE_FLOAT32. Testing shows that both float16 and bfloat16 work correctly without the forced float32 override:
- Inference: identical outputs for float16 and bfloat16 (greedy decoding)
- Training (100 steps, 4-bit LoRA, SFT on FineTome-100k):
  - float16 final loss: 3.048
  - bfloat16 final loss: 3.065
  - Losses converge to within 0.02 by step 60
  - Grad norms healthy and comparable for both dtypes

The FORCE_FLOAT32 path was actually causing training divergence. With it enabled, the compiled float32 run diverged at step ~28 with grad norms collapsing to near zero and loss plateauing at ~12.4. Without it, both dtypes train normally.

This enables float16 on Tesla T4 and other GPUs without bfloat16 support.
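The failure signature described here (grad norms collapsing to near zero while the loss sits on a plateau) can be checked mechanically from training logs. The following is a minimal sketch, not code from this PR or from unsloth: the function name `looks_collapsed`, the window size, and both thresholds are hypothetical choices, and the sample values are synthetic placeholders rather than the measured runs.

```python
from typing import Sequence


def looks_collapsed(
    losses: Sequence[float],
    grad_norms: Sequence[float],
    window: int = 10,        # hypothetical: number of trailing steps to inspect
    grad_eps: float = 1e-3,  # hypothetical: "near zero" cutoff for grad norms
    loss_tol: float = 0.05,  # hypothetical: max loss spread that counts as a plateau
) -> bool:
    """Return True if the last `window` steps show vanishing grads and a flat loss."""
    if len(losses) < window or len(grad_norms) < window:
        return False  # not enough history to judge
    recent_losses = losses[-window:]
    recent_norms = grad_norms[-window:]
    grads_vanished = max(recent_norms) < grad_eps
    loss_flat = (max(recent_losses) - min(recent_losses)) < loss_tol
    return grads_vanished and loss_flat


if __name__ == "__main__":
    # Synthetic logs for illustration only, not the measured runs.
    healthy_losses = [5.0 - 0.02 * step for step in range(100)]
    healthy_norms = [0.5] * 100
    stalled_losses = [12.4] * 40
    stalled_norms = [1e-5] * 40
    print(looks_collapsed(healthy_losses, healthy_norms))
    print(looks_collapsed(stalled_losses, stalled_norms))
```

The check deliberately requires both conditions: a flat loss with non-trivial gradients can simply mean a converged run, so it is not flagged.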
Code Review
This pull request removes 'gemma4,' and 'gemma4_text' from the list of model names in unsloth/models/loader.py, which appears to manage model-specific compilation exclusions. There are no review comments to address, and I have no feedback to provide.
shibizhao pushed a commit to shibizhao/unsloth-npu that referenced this pull request on Apr 7, 2026
Summary
Remove 'gemma4,' and 'gemma4_text' from the FORCE_FLOAT32 list in loader.py
Background
The FORCE_FLOAT32 override was forcing Gemma-4 to load in bfloat16/float32 when the user requested float16. This prevented float16 from working on Tesla T4 and other GPUs without bfloat16 support.

Testing shows that Gemma-4's activation magnitudes stay well within float16 range (max ~2080 vs fp16 max 65504). The forced float32 path was actually causing training divergence: with it enabled, the compiled run diverged at step ~28 with grad norms collapsing to near zero and loss plateauing at ~12.4.
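To make the range argument concrete, the sketch below compares the quoted peak activation magnitude against the largest finite float16 value. It is an illustration only: the helper names `fits_in_fp16` and `fp16_headroom` are hypothetical, and 2080 is the approximate figure quoted above, not a value measured by this snippet.

```python
import struct

# Largest finite IEEE 754 half-precision value, decoded from its bit pattern 0x7BFF.
FP16_MAX = struct.unpack("<e", bytes.fromhex("ff7b"))[0]


def fits_in_fp16(value: float) -> bool:
    """Return True if the magnitude is representable as a finite float16."""
    return abs(value) <= FP16_MAX


def fp16_headroom(max_abs_activation: float) -> float:
    """Return how many times the peak could grow before exceeding FP16_MAX."""
    if max_abs_activation <= 0:
        raise ValueError("max_abs_activation must be positive")
    return FP16_MAX / max_abs_activation


if __name__ == "__main__":
    peak = 2080.0  # approximate peak activation magnitude quoted in the PR text
    print(FP16_MAX)
    print(fits_in_fp16(peak))
    print(round(fp16_headroom(peak), 1))
```

With a peak near 2080, the headroom works out to roughly 31x. This is a range check only; it says nothing about float16 precision or gradient underflow, which the training runs address separately.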
Test results
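Training (Gemma-4 E2B, 4-bit LoRA, SFT on FineTome-100k, 100 steps, no patches): final loss 3.048 with float16 and 3.065 with bfloat16, with the two loss curves within 0.02 of each other by step 60.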
Inference: float16 and bfloat16 produce identical outputs.
Companion PR
gemma4.py
Test plan