Normalise weight_decay to 0.001 across all templates #270
Merged
In full FT, AdamW weight decay drives the trained parameter toward 0.
In LoRA the trained parameters are A and B but the effective weight is
W = W_init + (alpha/r) * B @ A, so decaying A and B individually drives
BA -> 0, hence W -> W_init rather than 0. The previous mix of 0.01 and
0.1 across templates produced a measurable pull on the merged adapter
back toward the base model over a few thousand steps.
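The pull toward the base model can be illustrated with a small decay-only sketch. It assumes zero gradient, a constant learning rate, and toy rank-1 matrices; the values of W_init, A, B, alpha and the step count are illustrative, not taken from any template.

```python
# Decay-only AdamW steps on a toy rank-1 LoRA pair (gradient assumed zero).
# All matrices and hyperparameters here are illustrative assumptions.

def matmul(B, A):
    return [[sum(B[i][k] * A[k][j] for k in range(len(A)))
             for j in range(len(A[0]))] for i in range(len(B))]

def decay(M, lr, wd):
    # Decoupled weight decay: p <- p - lr * wd * p
    return [[x * (1.0 - lr * wd) for x in row] for row in M]

def frob(M):
    return sum(x * x for row in M for x in row) ** 0.5

W_init = [[1.0, -2.0], [0.5, 3.0]]  # frozen base weight, never updated
B = [[0.3], [-0.1]]                 # shape 2 x r with r = 1
A = [[0.2, 0.4]]                    # shape r x 2
alpha, r = 1.0, 1
lr, wd, steps = 2e-4, 0.1, 3000

def delta_norm(B, A):
    # Frobenius norm of W_eff - W_init, where W_eff = W_init + (alpha/r) * B @ A
    BA = matmul(B, A)
    W_eff = [[W_init[i][j] + (alpha / r) * BA[i][j] for j in range(2)]
             for i in range(2)]
    return frob([[W_eff[i][j] - W_init[i][j] for j in range(2)]
                 for i in range(2)])

before = delta_norm(B, A)
for _ in range(steps):
    A, B = decay(A, lr, wd), decay(B, lr, wd)
after = delta_norm(B, A)

ratio = after / before  # fraction of the adapter's contribution that survives
print(f"||W_eff - W_init|| retained after {steps} steps: {ratio:.4f}")
```

Because A and B each shrink by (1 - lr * wd) per step, their product shrinks by the square of that factor, while W_init is untouched. W_eff therefore moves toward W_init, not toward 0.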
Most templates already used 0.001. This change brings the remaining 16
templates in line:
- 7 lines weight_decay = 0.01 -> 0.001 (Gemma2 / Falcon_H1 Alpaca SFT,
Llama_FP8 / Qwen3_8B_FP8 GRPO, gpt-oss Minesweeper GRPO)
- 11 lines weight_decay = 0.1 -> 0.001 (Advanced Llama3.{1,2} GRPO LoRA,
Llama3.1 / Gemma3 / Gemma3-Vision / Qwen3-VL / Qwen2.5-VL / Mistral v0.3 /
Qwen2.5 / Phi-4 GRPO, TinyLlama Alpaca SFT)
nb/, kaggle/ and python_scripts/ regenerated via update_all_notebooks.py.
After this PR all weight_decay values in templates are either 0.001
(91 lines) or 0.0 (3 lines, intentionally disabled).
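A count like the one above could be reproduced with a small audit script. This is a sketch under assumptions: the regex, the helper names count_weight_decay and audit_templates, and matching on raw notebook text are illustrative choices, not the repository's own tooling.

```python
import re
from collections import Counter
from pathlib import Path

# Matches assignments such as "weight_decay = 0.001" or "weight_decay=1e-3".
WD_PATTERN = re.compile(
    r"weight_decay\s*=\s*([0-9]*\.?[0-9]+(?:[eE][-+]?[0-9]+)?)"
)

def count_weight_decay(text):
    """Count weight_decay assignments in a blob of text, keyed by float value."""
    return Counter(float(v) for v in WD_PATTERN.findall(text))

def audit_templates(root="original_template"):
    """Sum the counts over every notebook under root (hypothetical helper)."""
    total = Counter()
    for path in sorted(Path(root).glob("*.ipynb")):
        total += count_weight_decay(path.read_text(encoding="utf-8"))
    return total
```

Keying on the parsed float means 0.0 and 0.00 are counted together. Running audit_templates() from the repository root would be expected to show only the keys 0.001 and 0.0 if the normalisation is complete.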
Code Review
This pull request performs a global update of the weight_decay hyperparameter across numerous Jupyter notebooks, templates, and Python scripts. The weight_decay value in GRPOConfig and other training arguments is consistently changed to 0.001 from previous values of 0.1 or 0.01 for models including Llama, Gemma, Qwen, Mistral, and Phi. No review comments were provided, and I have no additional feedback on these changes.
Summary
Normalises the weight_decay hyperparameter to 0.001 across all original_template/*.ipynb notebooks, then regenerates nb/, kaggle/, and python_scripts/ via update_all_notebooks.py.
Why
In full fine-tuning, AdamW weight decay shrinks the parameter directly, so the implicit prior is $W \to 0$.
In LoRA the trained parameters are $A, B$ but the effective weight is
$$W_\text{eff} = W_\text{init} + \frac{\alpha}{r} BA$$
Standard AdamW decays $A$ and $B$ separately, so the implicit prior shifts: $BA \to 0$, hence $W_\text{eff} \to W_\text{init}$, not $0$. The composed adapter is being pulled back toward the frozen base instead of being regularised in magnitude. See LoRA and Weight Decay for the full derivation.
At $\lambda = 0.01$ the pure-decay shrinkage per step is $\eta\lambda$ (about $2 \times 10^{-6}$ at a typical $\eta = 2 \times 10^{-4}$), which compounds across a few thousand steps into a measurable bias toward init. $\lambda = 0.001$ keeps a small Frobenius-norm prior $\|A\|_F^2 + \|B\|_F^2$ for numerical stability without meaningfully dragging the merged adapter back to base.
Most templates already use $0.001$. This PR normalises the rest.
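The compounding can be made concrete with a short closed-form calculation. It is a sketch assuming zero gradient, a constant learning rate of 2e-4, and 3000 steps as a stand-in for "a few thousand"; real runs with schedules and non-zero gradients will differ.

```python
# Closed-form decay-only shrinkage under decoupled weight decay.
# Assumptions: zero gradient, constant lr, steps = 3000 (illustrative).
lr = 2e-4
steps = 3000

def retained(wd, lr=lr, steps=steps):
    """Return (fraction of A or B retained, fraction of the product BA retained)."""
    per_factor = (1.0 - lr * wd) ** steps
    return per_factor, per_factor ** 2  # BA shrinks as the product of both factors

for wd in (0.1, 0.01, 0.001):
    factor, product = retained(wd)
    print(f"wd={wd:<6} A,B retained={factor:.5f}  BA retained={product:.5f}")
```

Under these assumptions the product $BA$ loses roughly 11% of its magnitude at $\lambda = 0.1$, roughly 1.2% at $\lambda = 0.01$, and roughly 0.1% at $\lambda = 0.001$.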
Change
- original_template/*.ipynb: 18 lines across 16 templates:
  - weight_decay = 0.01 -> 0.001
    - Gemma2_(2B)-Alpaca.ipynb
    - Falcon_H1_(0.5B)-Alpaca.ipynb
    - Llama_FP8_GRPO.ipynb (x2)
    - Qwen3_8B_FP8_GRPO.ipynb (x2)
    - gpt_oss_(20B)_Reinforcement_Learning_GRPO_Minesweeper_Game_BF16.ipynb
  - weight_decay = 0.1 -> 0.001
    - Advanced_Llama3_2_(3B)_GRPO_LoRA.ipynb, Advanced_Llama3_1_(3B)_GRPO_LoRA.ipynb
    - Llama3.1_(8B)-GRPO.ipynb, Gemma3_(1B)-GRPO.ipynb, Gemma3_(4B)-Vision-GRPO.ipynb
    - Qwen3_VL_(8B)-Vision-GRPO.ipynb, Qwen2_5_7B_VL_GRPO.ipynb, Qwen2.5_(3B)-GRPO.ipynb
    - Mistral_v0.3_(7B)-GRPO.ipynb, Phi_4_(14B)-GRPO.ipynb, TinyLlama_(1.1B)-Alpaca.ipynb
- nb/*.ipynb, kaggle/*.ipynb, python_scripts/*.py: regenerated via python update_all_notebooks.py. 102 files modified, 116 insertions and 116 deletions, all weight_decay value swaps (verified via spot diff).
After this PR all weight_decay values in templates are either 0.001 (91 lines) or 0.0 / 0.00 (3 lines, intentionally disabled).
Test plan
nb/ and kaggle/ to confirm only weight_decay values changed.
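A check of this kind could be automated roughly as follows. This is a minimal sketch: the function only_weight_decay_changed and its line-by-line comparison are hypothetical, and it assumes the old and new versions of a file have the same number of lines, which matches the equal insertion and deletion counts reported above.

```python
import re

# Captures the "weight_decay = " prefix so the numeric value can be masked out.
WD_VALUE = re.compile(r"(weight_decay\s*=\s*)[0-9.eE+-]+")

def only_weight_decay_changed(old_text, new_text):
    """True if every differing line differs only in its weight_decay value."""
    old_lines = old_text.splitlines()
    new_lines = new_text.splitlines()
    if len(old_lines) != len(new_lines):
        return False  # lines were added or removed, not just edited in place
    for old, new in zip(old_lines, new_lines):
        if old == new:
            continue
        # Replace the numeric value with a placeholder and compare the rest.
        if WD_VALUE.sub(r"\1<wd>", old) != WD_VALUE.sub(r"\1<wd>", new):
            return False
    return True
```

The two arguments would be the before and after contents of one regenerated file, for example read from two checkouts of the repository.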