update loss value for flakey e2e test#2786
Conversation
WalkthroughThe test for Llama pretraining was updated by increasing the baseline loss threshold from 3.5 to 3.6. The logic for adjusting the threshold under certain conditions remains unchanged. No modifications were made to public or exported entities. Changes
Poem
✨ Finishing Touches
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. 🪧 TipsChatThere are 3 ways to chat with CodeRabbit:
SupportNeed help? Create a ticket on our support page for assistance with any issues or questions. Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments. CodeRabbit Commands (Invoked using PR comments)
Other keywords and placeholders
CodeRabbit Configuration File (
|
There was a problem hiding this comment.
Actionable comments posted: 0
🔭 Outside diff range comments (1)
tests/e2e/test_llama_pretrain.py (1)
25-26: 🛠️ Refactor suggestionUse
pytest.skip()instead of a barereturnA bare
returnsilently terminates the test without recording the reason.pytest.skip()makes the intention explicit in the test report and avoids confusion when analysing CI results.- if not sample_packing and pretrain_multipack_attn: - return + if not sample_packing and pretrain_multipack_attn: + pytest.skip("Combination not supported (multipack attention without sample packing)")
🧹 Nitpick comments (1)
tests/e2e/test_llama_pretrain.py (1)
68-70: Bumping the loss threshold is a band-aid; stabilise the run insteadRaising the threshold from 3.5 → 3.6 will let more noisy runs pass but also allows genuine regressions to slip through.
Two quick wins to reduce flakiness without loosening assertions:
- Fix the random seed (Python, NumPy, PyTorch) so training is deterministic.
- Pass the seed into the Axolotl config to propagate it downstream.
@@ def test_pretrain(self, temp_dir, sample_packing, pretrain_multipack_attn): + # Make the training deterministic to curb loss variance + import random, numpy as np, torch + _seed = 42 + random.seed(_seed) + np.random.seed(_seed) + torch.manual_seed(_seed) + torch.cuda.manual_seed_all(_seed) + @@ "bf16": "auto", "use_tensorboard": True, + "seed": _seed, }If determinism still doesn’t eliminate the occasional > 3.5 loss, consider asserting on the trend (e.g. last step < first step) or average of last n steps instead of a single absolute value.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
tests/e2e/test_llama_pretrain.py(1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (8)
- GitHub Check: PyTest from Source Dist (3.11, 2.7.1)
- GitHub Check: PyTest from Source Dist (3.11, 2.5.1)
- GitHub Check: PyTest from Source Dist (3.11, 2.6.0)
- GitHub Check: PyTest (3.11, 2.7.1)
- GitHub Check: PyTest (3.11, 2.6.0)
- GitHub Check: PyTest (3.11, 2.5.1)
- GitHub Check: pre-commit
- GitHub Check: pre-commit
🔇 Additional comments (1)
tests/e2e/test_llama_pretrain.py (1)
68-76: Large gap between 3.6 and 6.5 thresholds may hide regressionsWhen
sample_packingisTrueandpretrain_multipack_attnisFalse, the acceptable loss jumps to 6.5 – almost double the default. Verify that this gap is genuinely necessary; otherwise shrink it or apply a relative delta (e.g.base * 1.8) to keep the guardrail meaningful.
Codecov ReportAll modified and coverable lines are covered by tests ✅ 📢 Thoughts on this report? Let us know! |
Summary by CodeRabbit