Remove Gemma-4 from FORCE_FLOAT32 #4875

Merged: danielhanchen merged 1 commit into main from gemma4-remove-force-float32 on Apr 6, 2026

Conversation

@danielhanchen
Member

Summary

  • Remove gemma4, and gemma4_text from the FORCE_FLOAT32 list in loader.py
  • Gemma-4 works correctly in both float16 and bfloat16 without forcing float32

Background

The FORCE_FLOAT32 override was forcing Gemma-4 to load in bfloat16/float32 when the user requested float16. This prevented float16 from working on Tesla T4 and other GPUs without bfloat16 support.
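As a rough illustration of the kind of override described here, the sketch below shows how a force list could replace a requested dtype. It is not the code in loader.py: the function name `resolve_dtype`, its parameters, exact-match lookup, and the "bfloat16 if supported, else float32" fallback are all assumptions made for this example.

```python
def resolve_dtype(model_type, requested, force_float32, bf16_supported):
    """Return the dtype a loader would end up using (hypothetical logic).

    force_float32 is a list of model-type names, standing in for the
    FORCE_FLOAT32 list. Matching by exact name is an assumption.
    """
    if model_type in force_float32 and requested == "float16":
        # Assumed fallback: the float16 request is ignored for listed models.
        return "bfloat16" if bf16_supported else "float32"
    return requested


# With the model listed, a float16 request on a GPU without bfloat16 is overridden.
before = resolve_dtype("gemma4", "float16", ["gemma4"], bf16_supported=False)
# With the entry removed, the request is honoured.
after = resolve_dtype("gemma4", "float16", [], bf16_supported=False)
```

Under these assumptions, removing the entry is enough to let the user's float16 request pass through unchanged.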

Testing shows that Gemma-4's activation magnitudes stay well within float16 range (max ~2080 vs. the fp16 maximum of 65504). The forced float32 path was itself causing training divergence: with it enabled, the compiled run diverged at around step 28, with grad norms collapsing to near zero and loss plateauing at ~12.4.
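The headroom claim can be checked with a little arithmetic. The sketch below derives the largest finite float16 value from the format's bit layout (5 exponent bits, 10 mantissa bits) and compares it with the reported peak of ~2080. The helper `fits_in_fp16` is a name invented for this example, and its check is deliberately conservative: it ignores values slightly above the maximum that would still round down to it.

```python
# Largest finite IEEE 754 half-precision value: (2 - 2**-10) * 2**15
FP16_MAX = (2 - 2**-10) * 2**15


def fits_in_fp16(value):
    """Conservative range check: True if |value| does not exceed FP16_MAX."""
    return abs(value) <= FP16_MAX


observed_max = 2080.0  # approximate peak activation magnitude reported above
headroom = FP16_MAX / observed_max  # how many times larger the limit is
```

By this measure the reported peak sits roughly 31x below the overflow threshold.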

Test results

Training (Gemma-4 E2B, 4-bit LoRA, SFT on FineTome-100k, 100 steps, no patches):

| Metric | float16 | bfloat16 |
| --- | --- | --- |
| Final loss (step 100) | 3.048 | 3.065 |
| Min loss | 2.389 (step 76) | 2.396 (step 76) |
| Avg loss (last 20 steps) | 3.198 | 3.211 |
| Grad norms | Healthy (~3.0) | Healthy (~3.0) |
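To make "the two dtypes match" concrete, the figures reported above can be compared directly. This is a minimal sketch: the dictionary layout and the helper `max_abs_gap` are illustrative, and the 0.02 tolerance is an assumed threshold for this example rather than a documented acceptance criterion.

```python
# Loss figures copied from the results above.
results = {
    "float16": {"final": 3.048, "min": 2.389, "avg_last_20": 3.198},
    "bfloat16": {"final": 3.065, "min": 2.396, "avg_last_20": 3.211},
}


def max_abs_gap(a, b):
    """Largest absolute difference across the metrics shared by both runs."""
    return max(abs(a[key] - b[key]) for key in a)


gap = max_abs_gap(results["float16"], results["bfloat16"])
TOLERANCE = 0.02  # assumed threshold, chosen for illustration
losses_match = gap < TOLERANCE
```

The largest gap is on the final loss (about 0.017), with the minimum and trailing-average losses closer still.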

Inference: float16 and bfloat16 produce identical outputs.

Companion PR

Test plan

  • Verify float16 inference produces correct output
  • Verify bfloat16 inference produces correct output
  • Verify float16 training converges (100 steps)
  • Verify bfloat16 training converges (100 steps)
  • Verify losses match between float16 and bfloat16
  • Test on Tesla T4 (float16-only GPU)
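For the Tesla T4 item, it may help to spell out why the card is float16-only. Native bfloat16 support arrives with NVIDIA compute capability 8.0 (Ampere), while the T4 reports 7.5. The helper below is a hypothetical sketch of a dtype choice keyed on that tuple. In a real script the tuple could come from `torch.cuda.get_device_capability()`, but this example takes it as a plain argument so it runs without a GPU.

```python
def pick_dtype(compute_capability):
    """Choose a half-precision dtype name from a (major, minor) capability tuple.

    Assumes native bfloat16 requires compute capability 8.0 or newer.
    """
    return "bfloat16" if compute_capability >= (8, 0) else "float16"


t4_dtype = pick_dtype((7, 5))    # Tesla T4 (Turing)
a100_dtype = pick_dtype((8, 0))  # A100 (Ampere)
```

With Gemma-4 off the force list, the float16 choice on a T4 is no longer overridden at load time.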

Gemma-4 does not need FORCE_FLOAT32. Testing shows that both float16 and
bfloat16 work correctly without the forced float32 override:

- Inference: identical outputs for float16 and bfloat16 (greedy decoding)
- Training (100 steps, 4-bit LoRA, SFT on FineTome-100k):
  - float16 final loss: 3.048
  - bfloat16 final loss: 3.065
  - Losses converge to within 0.02 by step 60
  - Grad norms healthy and comparable for both dtypes

The FORCE_FLOAT32 path was actually causing training divergence. With
it enabled, the compiled float32 run diverged at step ~28 with grad norms
collapsing to near zero and loss plateauing at ~12.4. Without it, both
dtypes train normally.
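The failure signature described here (grad norms near zero while loss sits flat at a high value) is easy to watch for in training logs. The heuristic below is only a sketch: the function name, window size, and thresholds are assumptions, and the sample series are synthetic values made up to exercise the check, not logs from these runs.

```python
def looks_collapsed(grad_norms, losses, window=5, norm_floor=1e-3, loss_tol=0.05):
    """Heuristic collapse check over the last `window` logged steps.

    Returns True when every recent grad norm is below norm_floor and the
    recent losses vary by less than loss_tol.
    """
    if len(grad_norms) < window or len(losses) < window:
        return False
    recent_norms = grad_norms[-window:]
    recent_losses = losses[-window:]
    norms_tiny = max(recent_norms) < norm_floor
    loss_flat = max(recent_losses) - min(recent_losses) < loss_tol
    return norms_tiny and loss_flat


# Synthetic examples (not real logs): a healthy run and a collapsed one.
healthy = looks_collapsed([3.1, 2.9, 3.0, 3.2, 3.0], [3.4, 3.2, 3.3, 3.1, 3.0])
collapsed = looks_collapsed([1e-5, 2e-5, 1e-5, 1e-5, 3e-5], [12.4, 12.4, 12.41, 12.4, 12.39])
```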

This enables float16 on Tesla T4 and other GPUs without bfloat16 support.
@danielhanchen danielhanchen requested a review from mmathew23 as a code owner April 6, 2026 14:30
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request removes 'gemma4,' and 'gemma4_text' from the list of model names in unsloth/models/loader.py, which appears to manage model-specific compilation exclusions. There are no review comments to address, and I have no feedback to provide.

@danielhanchen danielhanchen merged commit 07b6fcc into main Apr 6, 2026
5 checks passed
@danielhanchen danielhanchen deleted the gemma4-remove-force-float32 branch April 6, 2026 14:33
shibizhao pushed a commit to shibizhao/unsloth-npu that referenced this pull request Apr 7, 2026