Normalise weight_decay to 0.001 across all templates #270

Merged
danielhanchen merged 1 commit into main from lora-weight-decay-defaults
May 24, 2026
@danielhanchen danielhanchen commented May 24, 2026

Member

Summary

Normalises the weight_decay hyperparameter to 0.001 across all original_template/*.ipynb notebooks, then regenerates nb/, kaggle/, and python_scripts/ via update_all_notebooks.py.

Why

In full fine-tuning, AdamW weight decay shrinks the parameter directly,

$$W \leftarrow W - \eta\, g_L - \eta\lambda W$$

so the implicit prior is $W \to 0$.

In LoRA the trained parameters are $A, B$ but the effective weight is

$$W_\text{eff} = W_\text{init} + \tfrac{\alpha}{r}\, B A$$

Standard AdamW decays $A$ and $B$ separately,

$$A \leftarrow A - \eta\, g_A - \eta\lambda A, \qquad B \leftarrow B - \eta\, g_B - \eta\lambda B$$

so the implicit prior shifts: $BA \to 0$, hence $W_\text{eff} \to W_\text{init}$, not $0$. The composed adapter is being pulled back toward the frozen base instead of being regularised in magnitude. See LoRA and Weight Decay for the full derivation.
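As a rough worked step (a sketch that isolates the decay term by setting $g_A = g_B = 0$ and holding $\eta$ constant, neither of which holds in real training), the updates above reduce to $A \leftarrow (1 - \eta\lambda) A$ and $B \leftarrow (1 - \eta\lambda) B$. The product therefore shrinks by the square of that factor per step, and after $T$ steps from hypothetical starting values $A_0, B_0$:

$$B A \leftarrow (1 - \eta\lambda)^2\, B A, \qquad W_\text{eff} - W_\text{init} = \tfrac{\alpha}{r}\, (1 - \eta\lambda)^{2T}\, B_0 A_0 \to 0 \quad \text{as } T \to \infty$$

In this simplified picture the gap between the effective weight and the frozen base is what decays, which is the sense in which the prior is $W_\text{eff} \to W_\text{init}$.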

At $\lambda = 0.01$ the pure-decay step multiplies each of $A$ and $B$ by $1 - \eta\lambda$, where $\eta\lambda = 2 \times 10^{-6}$ at a typical $\eta = 2 \times 10^{-4}$. This compounds across a few thousand steps into a measurable bias toward init. $\lambda = 0.001$ keeps a small Frobenius-norm prior $\|A\|_F^2 + \|B\|_F^2$ for numerical stability without meaningfully dragging the merged adapter back to base.
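To put rough numbers on the compounding, here is a minimal sketch that evaluates the decay-only factor $(1 - \eta\lambda)^T$ for the three values that appear in the templates. It assumes a constant learning rate of $2 \times 10^{-4}$, 3000 steps, and no gradient signal; `decay_factor` is a hypothetical helper, not part of any library, and real runs with warmup or LR schedules will differ.

```python
import math


def decay_factor(lr: float, weight_decay: float, steps: int) -> float:
    """Shrinkage of one LoRA factor (A or B) from the decay term alone.

    Assumes decoupled (AdamW-style) decay, a constant learning rate,
    and zero gradients, so each step multiplies by (1 - lr * weight_decay).
    """
    return (1.0 - lr * weight_decay) ** steps


if __name__ == "__main__":
    lr, steps = 2e-4, 3000  # assumed values, for illustration only
    for wd in (0.1, 0.01, 0.001):
        per_factor = decay_factor(lr, wd, steps)
        # B @ A is a product of two decayed factors, so it shrinks by the square.
        product = per_factor ** 2
        print(f"weight_decay={wd:<6} A or B x{per_factor:.4f}   BA x{product:.4f}")
```

Under these assumptions the shrinkage is well approximated by $e^{-\eta\lambda T}$ per factor, so reducing $\lambda$ by 10x reduces the exponent by 10x.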

Most templates already use $0.001$. This PR normalises the rest.

Change

original_template/*.ipynb: 18 lines across 16 templates:

  • 7 lines weight_decay = 0.01 -> 0.001
    • Gemma2_(2B)-Alpaca.ipynb
    • Falcon_H1_(0.5B)-Alpaca.ipynb
    • Llama_FP8_GRPO.ipynb (x2)
    • Qwen3_8B_FP8_GRPO.ipynb (x2)
    • gpt_oss_(20B)_Reinforcement_Learning_GRPO_Minesweeper_Game_BF16.ipynb
  • 11 lines weight_decay = 0.1 -> 0.001
    • Advanced_Llama3_2_(3B)_GRPO_LoRA.ipynb, Advanced_Llama3_1_(3B)_GRPO_LoRA.ipynb
    • Llama3.1_(8B)-GRPO.ipynb, Gemma3_(1B)-GRPO.ipynb, Gemma3_(4B)-Vision-GRPO.ipynb
    • Qwen3_VL_(8B)-Vision-GRPO.ipynb, Qwen2_5_7B_VL_GRPO.ipynb, Qwen2.5_(3B)-GRPO.ipynb
    • Mistral_v0.3_(7B)-GRPO.ipynb, Phi_4_(14B)-GRPO.ipynb, TinyLlama_(1.1B)-Alpaca.ipynb

nb/*.ipynb, kaggle/*.ipynb, python_scripts/*.py: regenerated via python update_all_notebooks.py. 102 files modified, 116 insertions and 116 deletions, all weight_decay value swaps (verified via spot diff).

After this PR all weight_decay values in templates are either 0.001 (91 lines) or 0.0 / 0.00 (3 lines, intentionally disabled).
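One way to re-check the counts above is to scan the templates for `weight_decay = <number>` assignments and tally the values. The following is a minimal sketch, not the script used for this PR: `count_weight_decay` and `scan` are hypothetical helpers, and the regex assumes the value is written as a plain numeric literal on the same line as the assignment.

```python
import re
from collections import Counter
from pathlib import Path

# Matches e.g. "weight_decay = 0.001", "weight_decay=0.0", "weight_decay = 1e-3".
WEIGHT_DECAY_RE = re.compile(
    r"weight_decay\s*=\s*([0-9]*\.?[0-9]+(?:[eE][+-]?[0-9]+)?)"
)


def count_weight_decay(text: str) -> Counter:
    """Tally weight_decay values in raw file text, keyed by float value."""
    return Counter(float(value) for value in WEIGHT_DECAY_RE.findall(text))


def scan(root: str, pattern: str = "*.ipynb") -> Counter:
    """Sum the tallies over every file under `root` matching `pattern`."""
    totals = Counter()
    for path in sorted(Path(root).glob(pattern)):
        totals += count_weight_decay(path.read_text(encoding="utf-8"))
    return totals
```

Calling `scan("original_template")` from the repository root would return a `Counter` whose keys should be limited to `0.001` and `0.0` if the statement above holds. Keying by float means `0.0` and `0.00` are counted together.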

Test plan

In full FT, AdamW weight decay drives the trained parameter toward 0.
In LoRA the trained parameters are A and B but the effective weight is
W = W_init + (alpha/r) * B @ A, so decaying A and B individually drives
BA -> 0, hence W -> W_init rather than 0. The previous mix of 0.01 and
0.1 across templates produced a measurable pull on the merged adapter
back toward the base model over a few thousand steps.

Most templates already used 0.001. This change brings the remaining 16
templates in line:

- 7 lines weight_decay = 0.01 -> 0.001 (Gemma2 / Falcon_H1 Alpaca SFT,
  Llama_FP8 / Qwen3_8B_FP8 GRPO, gpt-oss Minesweeper GRPO)
- 11 lines weight_decay = 0.1  -> 0.001 (Advanced Llama3.{1,2} GRPO LoRA,
  Llama3.1 / Gemma3 / Gemma3-Vision / Qwen3-VL / Qwen2.5-VL / Mistral v0.3 /
  Qwen2.5 / Phi-4 GRPO, TinyLlama Alpaca SFT)

nb/, kaggle/ and python_scripts/ regenerated via update_all_notebooks.py.
After this PR all weight_decay values in templates are either 0.001
(91 lines) or 0.0 (3 lines, intentionally disabled).

@gemini-code-assist gemini-code-assist Bot left a comment

Contributor

Code Review

This pull request performs a global update of the weight_decay hyperparameter across numerous Jupyter notebooks, templates, and Python scripts. The weight_decay value in GRPOConfig and other training arguments is consistently changed to 0.001 from previous values of 0.1 or 0.01 for models including Llama, Gemma, Qwen, Mistral, and Phi. No review comments were provided, and I have no additional feedback on these changes.

@danielhanchen danielhanchen merged commit ff0685a into main May 24, 2026
8 checks passed