New WR 159.3s: Dynamically Incorporate YaRN during Training and Validation#122
Merged
ClassicLarry merged 7 commits into KellerJordan:master, Oct 15, 2025
Conversation
bernard24 pushed a commit to bernard24/modded-nanogpt that referenced this pull request on Sep 11, 2025: "…aining, improve skip connection gating, and enhance bfloat16 usage"
This PR sets a new WR of 159.3s by dynamically incorporating YaRN into the training window schedule and the final validation. YaRN paper: https://arxiv.org/pdf/2309.00071.
This submission includes all recent WR improvements, including dropping the initial MLP layer by @EmelyanenkoK in PR 120.
Longer attention windows take longer to train, but produce models with lower loss. Two phenomena occur in RoPE when the attention window is increased during or after training:
A single copy of rotary embeddings is stored in the model root to reduce update time, reduce memory size, and potentially improve cache performance.
Based on empirical testing, the 0.1 constant in YaRN's 0.1*log(curr/prev)+1 formula is updated to 0.2.

The constant attn_scale of 0.12 is replaced with a starting value of 0.1, so that the distribution over training has a similar mean, ranging between 0.1 and 0.14.
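Taken together, the two scale changes above can be sketched as follows; the names and the compounding over the window schedule are my assumptions, not the PR's actual code:

```python
import math

# Sketch of the modified YaRN attention-temperature rule described above.
# BASE_ATTN_SCALE and yarn_mscale are illustrative names, not the PR's code.
BASE_ATTN_SCALE = 0.1  # starting value (replaces the old constant 0.12)

def yarn_mscale(curr_ws: int, prev_ws: int, coef: float = 0.2) -> float:
    # The YaRN paper's formula is 0.1*log(s)+1; this PR raises 0.1 to 0.2.
    return coef * math.log(curr_ws / prev_ws) + 1.0

# Growing the window (e.g. 3 -> 7 -> 11) compounds the scale multiplicatively,
# so the effective attention scale drifts upward from 0.1 over training.
scale = BASE_ATTN_SCALE
for prev, curr in [(3, 7), (7, 11)]:
    scale *= yarn_mscale(curr, prev)
```

With this assumed schedule the scale stays inside the stated 0.1 to 0.14 band.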
YaRN has a straightforward implementation, shown below. alpha and beta are left at the default constants of 1 and 32 from the original YaRN paper, which was tuned for Llama. The frequency update incurred by YaRN is most notable from ws 3->7 and in dimensions 5 to 10.
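For reference, here is a minimal sketch of YaRN's "NTK-by-parts" frequency update from the paper linked above, with alpha=1 and beta=32; the function name and the example head-dim/context values are hypothetical, not the PR's actual code:

```python
import math

def yarn_inv_freq(head_dim, orig_ctx, scale, base=10000.0, alpha=1.0, beta=32.0):
    """Blend interpolated and original RoPE frequencies per dimension (YaRN)."""
    out = []
    for i in range(0, head_dim, 2):
        freq = base ** (-i / head_dim)  # standard RoPE inverse frequency
        # Full rotations this dimension completes over the original context.
        rotations = orig_ctx * freq / (2 * math.pi)
        # Ramp: 0 below alpha rotations (interpolate fully by 1/scale),
        # 1 above beta rotations (leave the frequency untouched).
        gamma = min(max((rotations - alpha) / (beta - alpha), 0.0), 1.0)
        out.append(freq * (gamma + (1.0 - gamma) / scale))
    return out
```

Growing the window from 3 to 7 corresponds to scale = 7/3: the fastest-rotating dimensions pass through unchanged, the slowest are divided by the scale, and mid-range dimensions blend between the two, consistent with the update being most visible in the middle dimensions.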

The ws_validate arg enables validating the model at a longer attention window than it was trained on. This arg is set to 13, which differs from the final training window size of 11.

Attention args are batched to improve readability. cooldown_frac is increased from 0.45 to 0.5 to complement the reduction from 1705 to 1670 steps, following the heuristic of a fixed number of cooldown steps. Dropping below 1695 steps has a secondary benefit of eliminating the 9th file read, saving roughly 200ms.
Without YaRN, there is a substantial spike in validation loss when the attention window is abruptly increased from 3 to 7.

Extending the final validation window shows roughly a 0.0015 improvement in loss for 11->13. Interestingly, odd increments perform substantially better. @varunneal has noted: "One thing to note is that floor division (ws_short = ws_long // 2) has different behavior for odd vs short window sizes. I generally found odd window sizes performed surprisingly better." The attention schedule follows (long/short): (3/1) -> (7/3) -> (11/5). It may be that the short attention window performs better when it is under 50% of the long window; the model may learn to fit the long/short ratio and perform poorly when that ratio is substantially altered; or there may be a completely different explanation.
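The floor-division behavior quoted above can be illustrated in a few lines (the helper name is hypothetical):

```python
def short_window(ws_long: int) -> int:
    # Mirrors the quoted rule: ws_short = ws_long // 2.
    return ws_long // 2

# Odd long windows put the short window strictly under half the long window;
# even long windows land exactly at half.
for ws_long in (3, 7, 11, 12, 13):
    ws_short = short_window(ws_long)
    print(ws_long, ws_short, round(ws_short / ws_long, 3))
```

For the schedule above, the short/long ratio climbs from 1/3 toward but never reaching 1/2, which is one way to read the odd-vs-even observation.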
Ablations were run to measure the impact of each change:
Future Considerations:
Validation: