Normalise weight_decay to 0.001 across all templates by danielhanchen · Pull Request #270 · unslothai/notebooks

danielhanchen · 2026-05-24T04:57:33Z

Summary

Normalises the weight_decay hyperparameter to 0.001 across all original_template/*.ipynb notebooks, then regenerates nb/, kaggle/, and python_scripts/ via update_all_notebooks.py.

Why

In full fine-tuning, AdamW weight decay shrinks the parameter directly,

$$W \leftarrow W - \eta\, g_L - \eta\lambda W$$

so the implicit prior is $W \to 0$.

In LoRA the trained parameters are $A, B$ but the effective weight is

$$W_\text{eff} = W_\text{init} + \tfrac{\alpha}{r}\, B A$$

Standard AdamW decays $A$ and $B$ separately,

$$A \leftarrow A - \eta\, g_A - \eta\lambda A, \qquad B \leftarrow B - \eta\, g_B - \eta\lambda B$$

so the implicit prior shifts: $BA \to 0$, hence $W_\text{eff} \to W_\text{init}$, not $0$. The composed adapter is being pulled back toward the frozen base instead of being regularised in magnitude. See LoRA and Weight Decay for the full derivation.
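To make the limit concrete, consider the decay-only case. This is an idealisation, not part of the linked derivation: it assumes the gradient terms $g_A$ and $g_B$ are zero and that $\eta$ and $\lambda$ are constant, with $A_0, B_0$ denoting the values at the start. After $t$ steps,

$$A_t = (1 - \eta\lambda)^t A_0, \qquad B_t = (1 - \eta\lambda)^t B_0$$

$$W_{\text{eff},t} - W_\text{init} = \tfrac{\alpha}{r}\, B_t A_t = (1 - \eta\lambda)^{2t}\, \tfrac{\alpha}{r}\, B_0 A_0 \;\to\; 0 \quad (t \to \infty)$$

Because both factors are decayed, the product shrinks at twice the per-factor rate, and the limit of $W_\text{eff}$ is $W_\text{init}$ rather than $0$.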

At $\lambda = 0.01$ the pure-decay term shrinks $A$ and $B$ by a relative factor of $\eta\lambda$ per step (about $2 \times 10^{-6}$ at $\eta = 2 \times 10^{-4}$), which compounds across a few thousand steps into a measurable bias toward init. $\lambda = 0.001$ keeps a small Frobenius-norm prior $\|A\|_F^2 + \|B\|_F^2$ for numerical stability without meaningfully dragging the merged adapter back to base.
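The compounding can be estimated with a few lines of arithmetic. The sketch below is illustrative only: it assumes a constant learning rate, ignores the gradient terms entirely, and uses a hypothetical run length of 3000 steps. Under those assumptions each of $A$ and $B$ is multiplied by $(1 - \eta\lambda)$ per step, so the product $BA$ is multiplied by $(1 - \eta\lambda)^2$ per step.

```python
def adapter_shrink_factor(lr: float, weight_decay: float, steps: int) -> float:
    """Factor multiplying B @ A after `steps` decay-only updates.

    Assumes a constant learning rate and zero gradients, so A and B are each
    scaled by (1 - lr * weight_decay) on every step.
    """
    per_step = 1.0 - lr * weight_decay
    # Both A and B shrink, so the product B @ A shrinks twice per step.
    return per_step ** (2 * steps)


if __name__ == "__main__":
    # lr = 2e-4 matches the example in the text; 3000 steps is a hypothetical run length.
    for wd in (0.1, 0.01, 0.001):
        factor = adapter_shrink_factor(lr=2e-4, weight_decay=wd, steps=3000)
        print(f"weight_decay={wd}: B @ A scaled by {factor:.4f}")
```

With these inputs the decay-only shrinkage of $BA$ comes out at roughly 11% for $\lambda = 0.1$, 1.2% for $\lambda = 0.01$, and 0.12% for $\lambda = 0.001$. Real runs use learning-rate schedules and non-zero gradients, so these figures are a rough guide rather than a prediction.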

Most templates already use $0.001$. This PR normalises the rest.

Change

original_template/*.ipynb: 18 lines across 16 templates:

- 7 lines weight_decay = 0.01 -> 0.001
  - Gemma2_(2B)-Alpaca.ipynb
  - Falcon_H1_(0.5B)-Alpaca.ipynb
  - Llama_FP8_GRPO.ipynb (x2)
  - Qwen3_8B_FP8_GRPO.ipynb (x2)
  - gpt_oss_(20B)_Reinforcement_Learning_GRPO_Minesweeper_Game_BF16.ipynb
- 11 lines weight_decay = 0.1 -> 0.001
  - Advanced_Llama3_2_(3B)_GRPO_LoRA.ipynb, Advanced_Llama3_1_(3B)_GRPO_LoRA.ipynb
  - Llama3.1_(8B)-GRPO.ipynb, Gemma3_(1B)-GRPO.ipynb, Gemma3_(4B)-Vision-GRPO.ipynb
  - Qwen3_VL_(8B)-Vision-GRPO.ipynb, Qwen2_5_7B_VL_GRPO.ipynb, Qwen2.5_(3B)-GRPO.ipynb
  - Mistral_v0.3_(7B)-GRPO.ipynb, Phi_4_(14B)-GRPO.ipynb, TinyLlama_(1.1B)-Alpaca.ipynb

nb/*.ipynb, kaggle/*.ipynb, python_scripts/*.py: regenerated via python update_all_notebooks.py. 102 files modified, 116 insertions and 116 deletions, all weight_decay value swaps (verified via spot diff).

After this PR all weight_decay values in templates are either 0.001 (91 lines) or 0.0 / 0.00 (3 lines, intentionally disabled).
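A tally like the one above could be reproduced with a short script. The following is a minimal sketch, not the tooling used for this PR: the function name `count_weight_decay` and the regular expression are assumptions, and it only looks at assignments of the form `weight_decay = <number>` inside code cells.

```python
import json
import re
from collections import Counter
from pathlib import Path

# Matches e.g. "weight_decay = 0.001" or "weight_decay=1e-3" (illustrative pattern).
WD_PATTERN = re.compile(r"weight_decay\s*=\s*([0-9]*\.?[0-9]+(?:[eE][+-]?[0-9]+)?)")


def count_weight_decay(template_dir) -> Counter:
    """Count weight_decay values in the code cells of every *.ipynb in a directory."""
    counts = Counter()
    for path in sorted(Path(template_dir).glob("*.ipynb")):
        notebook = json.loads(path.read_text(encoding="utf-8"))
        for cell in notebook.get("cells", []):
            if cell.get("cell_type") != "code":
                continue
            source = cell.get("source", [])
            # Notebook JSON stores cell source as a list of lines or a single string.
            text = "".join(source) if isinstance(source, list) else source
            for match in WD_PATTERN.finditer(text):
                counts[float(match.group(1))] += 1
    return counts


if __name__ == "__main__":
    for value, lines in sorted(count_weight_decay("original_template").items()):
        print(f"weight_decay = {value}: {lines} lines")
```

Values are parsed as floats, so `0.0` and `0.00` land in the same bucket.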

Test plan

Inspect sample regenerated notebooks under nb/ and kaggle/ to confirm only weight_decay values changed.
Spot-check that one Alpaca SFT and one GRPO notebook still train cleanly with the new default.
Companion source PR: Lower default RL weight_decay from 0.01 to 0.001 for LoRA unsloth#5747
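The first check in the test plan can also be approximated programmatically. This is a hedged sketch under the assumption that the old and new versions of a file are available as strings (for example read from two checkouts); `only_weight_decay_changed` is a hypothetical helper, not something in the repository.

```python
import re

# Captures the "weight_decay = " prefix so the numeric value can be masked out.
WD_VALUE = re.compile(r"(weight_decay\s*=\s*)[0-9.eE+-]+")


def only_weight_decay_changed(old_text: str, new_text: str) -> bool:
    """Return True if the two texts differ only in weight_decay numeric values."""
    old_lines = old_text.splitlines()
    new_lines = new_text.splitlines()
    # A pure value swap never adds or removes lines.
    if len(old_lines) != len(new_lines):
        return False
    for old, new in zip(old_lines, new_lines):
        if old == new:
            continue
        # Mask the numeric value on both sides; anything left over is an unrelated edit.
        if WD_VALUE.sub(r"\1<wd>", old) != WD_VALUE.sub(r"\1<wd>", new):
            return False
    return True
```

For example, comparing a line `weight_decay = 0.1,` against `weight_decay = 0.001,` passes, while any change to another argument on the same or a different line fails.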

In full FT, AdamW weight decay drives the trained parameter toward 0. In LoRA the trained parameters are A and B but the effective weight is W = W_init + (alpha/r) * B @ A, so decaying A and B individually drives BA -> 0, hence W -> W_init rather than 0. The previous mix of 0.01 and 0.1 across templates produced a measurable pull on the merged adapter back toward the base model over a few thousand steps.

Most templates already used 0.001. This change brings the remaining 16 templates in line:

- 7 lines weight_decay = 0.01 -> 0.001 (Gemma2 / Falcon_H1 Alpaca SFT, Llama_FP8 / Qwen3_8B_FP8 GRPO, gpt-oss Minesweeper GRPO)
- 11 lines weight_decay = 0.1 -> 0.001 (Advanced Llama3.{1,2} GRPO LoRA, Llama3.1 / Gemma3 / Gemma3-Vision / Qwen3-VL / Qwen2.5-VL / Mistral v0.3 / Qwen2.5 / Phi-4 GRPO, TinyLlama Alpaca SFT)

nb/, kaggle/ and python_scripts/ regenerated via update_all_notebooks.py.

After this PR all weight_decay values in templates are either 0.001 (91 lines) or 0.0 (3 lines, intentionally disabled).

gemini-code-assist

Code Review

This pull request performs a global update of the weight_decay hyperparameter across numerous Jupyter notebooks, templates, and Python scripts. The weight_decay value in GRPOConfig and other training arguments is consistently changed to 0.001 from previous values of 0.1 or 0.01 for models including Llama, Gemma, Qwen, Mistral, and Phi. No review comments were provided, and I have no additional feedback on these changes.

gemini-code-assist Bot reviewed May 24, 2026


danielhanchen mentioned this pull request May 24, 2026

Lower default RL weight_decay from 0.01 to 0.001 for LoRA unslothai/unsloth#5747

Merged

3 tasks

danielhanchen merged commit ff0685a into main May 24, 2026
8 checks passed

danielhanchen merged 1 commit into main from lora-weight-decay-defaults
