
cp: fix: fix several nightly tests that were flaky (1724) into r0.5.0 #1735

Merged
terrykong merged 1 commit into r0.5.0 from cherry-pick-1724-r0.5.0 on Jan 7, 2026
Conversation

chtruong814 (Contributor) commented Jan 7, 2026

beep boop [🤖]: Hi @terrykong 👋,

we've cherry-picked #1724 into r0.5.0 for you! 🚀

Please review and approve this cherry-pick at your convenience!

Summary by CodeRabbit

  • Tests

    • Relaxed metric thresholds across DPO and SFT test suites to reduce flaky failures.
    • Enabled TensorBoard logging in performance test configurations.
    • Adjusted GPU memory constraints and timeout values in various test scripts.
  • Chores

    • Modified model checkpointing configuration settings.
    • Updated nightly compute resource validation limits.


Signed-off-by: Terry Kong <terryk@nvidia.com>
Signed-off-by: NeMo Bot <nemo-bot@nvidia.com>
coderabbitai bot (Contributor) commented Jan 7, 2026

📝 Walkthrough

This pull request updates test-suite configurations and metric thresholds across multiple LLM training test scripts. Changes include adjusting GPU memory thresholds in SFT tests, increasing NUM_MINUTES timeout values in configuration scripts, enabling TensorBoard logging in performance tests, updating DPO training loss thresholds, and raising the nightly GPU-hours budget limit from 1140 to 1180 hours.

Changes

  • Configuration save format (examples/configs/recipes/llm/dapo-qwen2.5-7b.yaml)
    Removed the explicit model save format by changing checkpointing.model_save_format from "dcp" to null.
  • DPO test metric thresholds (tests/test_suites/llm/dpo-llama3.1-8b-instruct-4n8g-fsdp2tp4.sh, tests/test_suites/llm/dpo-llama3.1-8b-instruct-4n8g-megatron.v2.sh)
    Raised the train/loss metric threshold at step 1 from 3.60/3.6 to 3.65 in the final metrics validation.
  • Performance test timeouts & logging (tests/test_suites/llm/performance/grpo-qwen3-235b-16n8g.sh, tests/test_suites/llm/performance/grpo-qwen3-235b-32n8g-async-1off.sh)
    Increased NUM_MINUTES from 100 to 115; enabled TensorBoard logging via the logger.tensorboard_enabled=True parameter.
  • SFT test memory & runtime adjustments (tests/test_suites/llm/sft-llama3.1-70b-8n8g-tp4pp2-long-megatron.sh, tests/test_suites/llm/sft-llama3.1-8b-1n8g-fsdp2tp1-dynamicbatch.sh, tests/test_suites/llm/sft-llama3.2-1b-1n8g-fsdp2tp1.v3.sh, tests/test_suites/llm/sft-nanov3-30BA3B-2n8g-fsdp2-lora.sh, tests/test_suites/llm/sft-nanov3-30BA3B-2n8g-fsdp2.sh)
    Removed the GPU memory constraint check from one script; raised the max GPU memory threshold from 70 to 75 GB in another; increased the NUM_MINUTES timeout from 15 to 30 in multiple LoRA and nanov3 variants.
  • Nightly test budget threshold (tests/unit/test_recipes_and_test_suites.py)
    Renamed test_nightly_compute_stays_below_1140_hours to test_nightly_compute_stays_below_1180_hours and updated the GPU-hours assertion and error message from 1140 to 1180.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~12 minutes


Suggested labels

CI:L0, r0.5.0

Suggested reviewers

  • terrykong
  • yfw
🚥 Pre-merge checks | ✅ 3 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage ⚠️ Warning: Docstring coverage is 50.00%, below the required threshold of 80.00%. Run @coderabbitai generate docstrings to improve docstring coverage.

✅ Passed checks (3 passed)

  • Description Check ✅ Passed: Check skipped; CodeRabbit's high-level summary is enabled.
  • Title Check ✅ Passed: The title accurately describes a cherry-pick of PR #1724 (fixing flaky nightly tests) into the r0.5.0 branch, which aligns with the changeset's purpose.
  • Test Results For Major Changes ✅ Passed: Minor adjustments to test configurations and thresholds aimed at fixing flaky nightly tests, with no new features or breaking changes.


coderabbitai bot (Contributor) left a comment

Actionable comments posted: 0

🧹 Nitpick comments (1)
tests/test_suites/llm/sft-nanov3-30BA3B-2n8g-fsdp2-lora.sh (1)

10-10: LGTM: Timeout increase with clarifying comment.

The timeout increase from 15 to 30 minutes appropriately addresses test flakiness. The comment helpfully explains that the additional time buffers for initial checkpoint download. Based on learnings, NUM_MINUTES is consumed by external launch tooling.

Optional: Clarify the comment wording

The comment mentions both "3 minutes" and "30min", which is slightly confusing. Consider:

-NUM_MINUTES=30 # Usually 15 minutes is enough for 20 steps, but we add a buffer of 3 minutes in metrics check (30min to buffer for initial ckpt download)
+NUM_MINUTES=30 # 15 minutes typically suffices for 20 steps; extended to 30 minutes to buffer for initial checkpoint download
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 06584bb and a14f97b.

📒 Files selected for processing (11)
  • examples/configs/recipes/llm/dapo-qwen2.5-7b.yaml
  • tests/test_suites/llm/dpo-llama3.1-8b-instruct-4n8g-fsdp2tp4.sh
  • tests/test_suites/llm/dpo-llama3.1-8b-instruct-4n8g-megatron.v2.sh
  • tests/test_suites/llm/performance/grpo-qwen3-235b-16n8g.sh
  • tests/test_suites/llm/performance/grpo-qwen3-235b-32n8g-async-1off.sh
  • tests/test_suites/llm/sft-llama3.1-70b-8n8g-tp4pp2-long-megatron.sh
  • tests/test_suites/llm/sft-llama3.1-8b-1n8g-fsdp2tp1-dynamicbatch.sh
  • tests/test_suites/llm/sft-llama3.2-1b-1n8g-fsdp2tp1.v3.sh
  • tests/test_suites/llm/sft-nanov3-30BA3B-2n8g-fsdp2-lora.sh
  • tests/test_suites/llm/sft-nanov3-30BA3B-2n8g-fsdp2.sh
  • tests/unit/test_recipes_and_test_suites.py
💤 Files with no reviewable changes (1)
  • tests/test_suites/llm/sft-llama3.1-70b-8n8g-tp4pp2-long-megatron.sh
🧰 Additional context used
📓 Path-based instructions (7)
**/*.sh

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.sh: Use uv run instead of python to execute scripts
Follow the Google Shell Style Guide for shell scripts

Files:

  • tests/test_suites/llm/sft-nanov3-30BA3B-2n8g-fsdp2.sh
  • tests/test_suites/llm/performance/grpo-qwen3-235b-32n8g-async-1off.sh
  • tests/test_suites/llm/sft-llama3.1-8b-1n8g-fsdp2tp1-dynamicbatch.sh
  • tests/test_suites/llm/dpo-llama3.1-8b-instruct-4n8g-fsdp2tp4.sh
  • tests/test_suites/llm/sft-nanov3-30BA3B-2n8g-fsdp2-lora.sh
  • tests/test_suites/llm/dpo-llama3.1-8b-instruct-4n8g-megatron.v2.sh
  • tests/test_suites/llm/performance/grpo-qwen3-235b-16n8g.sh
  • tests/test_suites/llm/sft-llama3.2-1b-1n8g-fsdp2tp1.v3.sh
tests/test_suites/**/*.sh

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

tests/test_suites/**/*.sh: When adding support for a new model, create a corresponding driver shell script under tests/test_suites/ in the matching domain
Driver shell scripts should match the YAML base name with .sh extension and invoke training entrypoint with uv run

Files:

  • tests/test_suites/llm/sft-nanov3-30BA3B-2n8g-fsdp2.sh
  • tests/test_suites/llm/performance/grpo-qwen3-235b-32n8g-async-1off.sh
  • tests/test_suites/llm/sft-llama3.1-8b-1n8g-fsdp2tp1-dynamicbatch.sh
  • tests/test_suites/llm/dpo-llama3.1-8b-instruct-4n8g-fsdp2tp4.sh
  • tests/test_suites/llm/sft-nanov3-30BA3B-2n8g-fsdp2-lora.sh
  • tests/test_suites/llm/dpo-llama3.1-8b-instruct-4n8g-megatron.v2.sh
  • tests/test_suites/llm/performance/grpo-qwen3-235b-16n8g.sh
  • tests/test_suites/llm/sft-llama3.2-1b-1n8g-fsdp2tp1.v3.sh
!(**/tests/**|**/test_*.py|**/test_*.sh)

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Add the NVIDIA copyright header to all Python files and shell scripts (excluding tests). The header should include the current year

Files:

  • tests/test_suites/llm/sft-nanov3-30BA3B-2n8g-fsdp2.sh
  • tests/test_suites/llm/performance/grpo-qwen3-235b-32n8g-async-1off.sh
  • tests/test_suites/llm/sft-llama3.1-8b-1n8g-fsdp2tp1-dynamicbatch.sh
  • tests/test_suites/llm/dpo-llama3.1-8b-instruct-4n8g-fsdp2tp4.sh
  • tests/test_suites/llm/sft-nanov3-30BA3B-2n8g-fsdp2-lora.sh
  • tests/test_suites/llm/dpo-llama3.1-8b-instruct-4n8g-megatron.v2.sh
  • tests/unit/test_recipes_and_test_suites.py
  • examples/configs/recipes/llm/dapo-qwen2.5-7b.yaml
  • tests/test_suites/llm/performance/grpo-qwen3-235b-16n8g.sh
  • tests/test_suites/llm/sft-llama3.2-1b-1n8g-fsdp2tp1.v3.sh
**/*.{py,sh}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

The NVIDIA copyright header should appear at the top of all Python files and shell scripts (excluding tests)

Files:

  • tests/test_suites/llm/sft-nanov3-30BA3B-2n8g-fsdp2.sh
  • tests/test_suites/llm/performance/grpo-qwen3-235b-32n8g-async-1off.sh
  • tests/test_suites/llm/sft-llama3.1-8b-1n8g-fsdp2tp1-dynamicbatch.sh
  • tests/test_suites/llm/dpo-llama3.1-8b-instruct-4n8g-fsdp2tp4.sh
  • tests/test_suites/llm/sft-nanov3-30BA3B-2n8g-fsdp2-lora.sh
  • tests/test_suites/llm/dpo-llama3.1-8b-instruct-4n8g-megatron.v2.sh
  • tests/unit/test_recipes_and_test_suites.py
  • tests/test_suites/llm/performance/grpo-qwen3-235b-16n8g.sh
  • tests/test_suites/llm/sft-llama3.2-1b-1n8g-fsdp2tp1.v3.sh
**/*.py

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.py: Conform code to Python 3.12+
Indent code with 4 spaces. Do not use tabs
Use snake_case for file names
Use PascalCase for class names
Use snake_case for function and method names
Use snake_case for local variables
Prefix variable names that start with a number with 'k' (e.g., k_99th_percentile)
Use upper snake_case with 'G' prefix for global variables (e.g., G_MY_GLOBAL)
Use upper snake_case for constants
Avoid shadowing variables declared in an outer scope
Initialize all externally visible members of a class in the constructor
Prefer docstrings over comments for interfaces that may be used outside a file
Reserve comments for code within a function or interfaces that are local to a file
If a piece of code is commented out, include a comment describing its usage and why it's commented out. Remove debug comments before merging
Use Google style docstrings for classes and functions in Python, which can be parsed by Sphinx
Avoid using reflection when functionality can be easily achieved without reflection
When using try-except blocks, limit the except clause to the smallest set of specific errors possible
When using try-except blocks for duck-typing, keep the body of the try as small as possible and use the else block for logic
YAML is the single source of truth for configuration defaults. Do not set non-None defaults in code for configuration values
For required configuration attributes, access config directly and expect presence (e.g., policy_cfg['precision']) without hidden defaults
Use typing.NotRequired to mark optional attributes in TypedDict for configuration
When adding a new config key to a TypedDict subclass, document the key's purpose, valid values/types, and recommended default, and reflect the default in exemplar YAMLs under examples/configs/*.yaml
Follow the Google Python Style Guide for Python code

Files:

  • tests/unit/test_recipes_and_test_suites.py
examples/configs/recipes/**/*.yaml

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

When adding support for a new model, create a recipe YAML under examples/configs/recipes/ in the appropriate domain subdirectory (llm, vlm, etc.)

Files:

  • examples/configs/recipes/llm/dapo-qwen2.5-7b.yaml
examples/configs/recipes/llm/*.yaml

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Recipe YAML files should follow the naming pattern: --ng-[-modifiers][-long][.vN].yaml for LLM recipes

Files:

  • examples/configs/recipes/llm/dapo-qwen2.5-7b.yaml
🧠 Learnings (2)
📓 Common learnings
Learnt from: zpqiu
Repo: NVIDIA-NeMo/RL PR: 1324
File: tests/test_suites/llm/distillation-qwen3-32b-to-1.7b-base-1n8g-megatron-tp2pp2cp2-pack.sh:6-11
Timestamp: 2025-10-12T14:46:57.171Z
Learning: Test scripts in tests/test_suites/llm/ follow a standard configuration pattern that includes NUM_NODES, STEPS_PER_RUN, MAX_STEPS, NUM_RUNS (calculated as `$(( (MAX_STEPS + STEPS_PER_RUN - 1) / STEPS_PER_RUN ))`), and NUM_MINUTES. These variables are part of the test infrastructure's standard interface and should not be flagged as unused even if not directly referenced within the individual script, as they are consumed by external launch tooling or common.env.
📚 Learning: 2025-10-12T14:46:57.171Z
Learnt from: zpqiu (Repo: NVIDIA-NeMo/RL, PR 1324); same learning as under Common learnings above.

Applied to files:

  • tests/test_suites/llm/sft-nanov3-30BA3B-2n8g-fsdp2.sh
  • tests/test_suites/llm/sft-llama3.1-8b-1n8g-fsdp2tp1-dynamicbatch.sh
  • tests/test_suites/llm/dpo-llama3.1-8b-instruct-4n8g-fsdp2tp4.sh
  • tests/test_suites/llm/sft-nanov3-30BA3B-2n8g-fsdp2-lora.sh
  • tests/test_suites/llm/dpo-llama3.1-8b-instruct-4n8g-megatron.v2.sh
  • tests/test_suites/llm/performance/grpo-qwen3-235b-16n8g.sh
  • tests/test_suites/llm/sft-llama3.2-1b-1n8g-fsdp2tp1.v3.sh
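The standard pattern described in this learning computes NUM_RUNS with shell ceiling division: NUM_RUNS=$(( (MAX_STEPS + STEPS_PER_RUN - 1) / STEPS_PER_RUN )). As a quick illustration, the same arithmetic in Python (the names simply mirror the shell variables):

```python
def num_runs(max_steps: int, steps_per_run: int) -> int:
    """Number of runs needed to cover max_steps, rounding up.

    Mirrors the shell arithmetic used by the test scripts:
    NUM_RUNS=$(( (MAX_STEPS + STEPS_PER_RUN - 1) / STEPS_PER_RUN ))
    """
    return (max_steps + steps_per_run - 1) // steps_per_run

assert num_runs(20, 20) == 1  # exactly divisible: a single run
assert num_runs(25, 20) == 2  # any remainder adds one more run
```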
🧬 Code graph analysis (1)
tests/unit/test_recipes_and_test_suites.py (1)
tests/unit/conftest.py (1)
  • tracker (265-296)
🪛 Shellcheck (0.11.0)
tests/test_suites/llm/sft-nanov3-30BA3B-2n8g-fsdp2.sh

[warning] 10-10: NUM_MINUTES appears unused. Verify use (or export if used externally).

(SC2034)

tests/test_suites/llm/sft-nanov3-30BA3B-2n8g-fsdp2-lora.sh

[warning] 10-10: NUM_MINUTES appears unused. Verify use (or export if used externally).

(SC2034)

tests/test_suites/llm/performance/grpo-qwen3-235b-16n8g.sh

[warning] 12-12: NUM_MINUTES appears unused. Verify use (or export if used externally).

(SC2034)

tests/test_suites/llm/sft-llama3.2-1b-1n8g-fsdp2tp1.v3.sh

[warning] 10-10: NUM_MINUTES appears unused. Verify use (or export if used externally).

(SC2034)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)
  • GitHub Check: Lint check
  • GitHub Check: Lint check
  • GitHub Check: Lint check
  • GitHub Check: Post automodel integration comment / Comment on PR
  • GitHub Check: Post submodule check comment / Comment on PR
🔇 Additional comments (11)
tests/test_suites/llm/dpo-llama3.1-8b-instruct-4n8g-megatron.v2.sh (1)

37-37: LGTM: Reasonable threshold adjustment to fix test flakiness.

The threshold increase from 3.60 to 3.65 for the initial training loss at step 1 is a minimal adjustment (1.4% increase) that appropriately addresses test flakiness while maintaining test effectiveness. This change is consistent with the PR objective to fix flaky nightly tests.
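The test scripts express this check in shell against logged metrics; purely as an illustration, here is how the relaxed step-1 loss check might look as a small Python helper (the function name and data shape are hypothetical, not the repository's actual API):

```python
def check_metric_at_step(
    metrics: dict[int, dict[str, float]],
    step: int,
    key: str,
    max_value: float,
) -> None:
    """Assert that a logged metric at a given step stays at or below a threshold."""
    value = metrics[step][key]
    assert value <= max_value, (
        f"{key} at step {step} was {value}, exceeding threshold {max_value}"
    )

# A step-1 train/loss of 3.62 would have tripped the old 3.60 threshold
# but passes the relaxed 3.65 threshold introduced by this PR.
observed = {1: {"train/loss": 3.62}}
check_metric_at_step(observed, step=1, key="train/loss", max_value=3.65)
```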

tests/test_suites/llm/dpo-llama3.1-8b-instruct-4n8g-fsdp2tp4.sh (1)

37-37: LGTM: Consistent threshold adjustment across test configurations.

The threshold increase from 3.60 to 3.65 matches the identical change in the megatron.v2 test script, demonstrating a systematic and consistent approach to fixing flakiness across both megatron and fsdp2tp4 configurations.

tests/test_suites/llm/sft-llama3.1-8b-1n8g-fsdp2tp1-dynamicbatch.sh (2)

37-37: Good practice: Documenting observed memory behavior.

The comment provides helpful context for the threshold adjustment, noting both the typical memory usage and its variability.


41-41: LGTM: Threshold adjustment addresses test flakiness.

The increase from 70 GB to 75 GB is appropriate given the observed memory usage of ~72.6 GB (noted in line 37). The 5 GB buffer provides reasonable headroom for measurement variability while still catching actual memory issues. This change aligns with the PR objective of fixing flaky nightly tests.
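To make the headroom concrete, a minimal sketch of such a guard (a hypothetical helper, not the script's actual shell check):

```python
def check_peak_gpu_memory(peak_gb: float, limit_gb: float = 75.0) -> None:
    """Assert that observed peak GPU memory stays within the configured limit."""
    assert peak_gb <= limit_gb, (
        f"Peak GPU memory {peak_gb:.1f} GB exceeds the {limit_gb:.1f} GB limit"
    )

# The observed ~72.6 GB fits under the new 75 GB limit with ~2.4 GB headroom,
# whereas the old 70 GB limit would have flagged it as a failure.
check_peak_gpu_memory(72.6)
```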

examples/configs/recipes/llm/dapo-qwen2.5-7b.yaml (1)

28-32: This configuration change is architecturally required for DTensorPolicyWorker v1 compatibility.

Setting model_save_format to null is not a workaround but a mandatory constraint enforced by the codebase. When model_save_format is null, the system automatically defaults to "safetensors" format. This is consistent across other DAPO/GRPO recipes using _v2: false configuration.

tests/test_suites/llm/sft-nanov3-30BA3B-2n8g-fsdp2.sh (1)

10-10: LGTM: Timeout increase to mitigate flaky tests.

Doubling the timeout from 15 to 30 minutes is a reasonable adjustment for addressing test flakiness. Based on learnings, NUM_MINUTES is consumed by external launch tooling and is correctly defined here.

tests/test_suites/llm/sft-llama3.2-1b-1n8g-fsdp2tp1.v3.sh (1)

10-10: LGTM: Consistent timeout increase.

The timeout increase from 15 to 30 minutes aligns with similar adjustments across other test suites in this PR to reduce flakiness. Based on learnings, NUM_MINUTES is correctly configured for consumption by external launch tooling.

tests/test_suites/llm/performance/grpo-qwen3-235b-16n8g.sh (2)

12-12: LGTM: Conservative timeout increase for large-scale test.

The 15% timeout increase (100→115 minutes) provides additional headroom for this 16-node performance test while remaining conservative. Based on learnings, NUM_MINUTES is consumed by external launch tooling.


27-27: LGTM: Explicit TensorBoard logging enabled.

Explicitly enabling TensorBoard logging ensures consistent metrics collection, which the script already depends on (line 34 converts TB logs to JSON). This change improves test reliability.

tests/test_suites/llm/performance/grpo-qwen3-235b-32n8g-async-1off.sh (1)

27-27: LGTM: Consistent TensorBoard logging enablement.

Explicitly enabling TensorBoard logging aligns with the change in the 16-node configuration (grpo-qwen3-235b-16n8g.sh) and ensures reliable metrics collection for this 32-node async test.

tests/unit/test_recipes_and_test_suites.py (1)

183-218: Threshold increase is consistent and properly implemented across all references.

All three locations correctly reflect the new 1180 GPU-hour limit: the function name, the assertion condition, and the error message. A search confirms no remaining references to the old 1140 threshold, indicating a clean migration. This ~3.5% budget increase is a reasonable accommodation for execution variance when addressing flaky tests, and the tracker.track() call enables continued monitoring of actual usage.
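As a rough sketch of the budget guard described above (names and the example hours value are illustrative; the real test lives in tests/unit/test_recipes_and_test_suites.py and derives total GPU-hours from the test-suite definitions):

```python
G_NIGHTLY_GPU_HOURS_LIMIT = 1180  # raised from 1140 by this PR

def check_nightly_budget(total_gpu_hours: float) -> None:
    """Fail if the nightly test suite exceeds its GPU-hours budget."""
    assert total_gpu_hours < G_NIGHTLY_GPU_HOURS_LIMIT, (
        f"Nightly compute uses {total_gpu_hours:.1f} GPU-hours, which exceeds "
        f"the {G_NIGHTLY_GPU_HOURS_LIMIT} GPU-hour budget"
    )

# A hypothetical 1150-hour nightly run fits the new 1180-hour budget,
# but would have failed under the old 1140-hour limit.
check_nightly_budget(1150.0)
```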

yuki-97 added the CI:L1 (Run doctests, unit tests, and functional tests) label on Jan 7, 2026
terrykong merged commit 5aeca40 into r0.5.0 on Jan 7, 2026
69 of 72 checks passed
terrykong deleted the cherry-pick-1724-r0.5.0 branch on January 7, 2026 at 17:35

Labels

  • cherry-pick
  • CI:L1 (Run doctests, unit tests, and functional tests)
  • Run CICD


3 participants