Conversation
Signed-off-by: Terry Kong <terryk@nvidia.com>
📝 Walkthrough

Configuration and test parameter adjustments across multiple LLM test suites and recipes. Changes include updating training loss thresholds, increasing time limits (NUM_MINUTES), adjusting memory thresholds, enabling TensorBoard logging, and updating overall compute-hour limits in unit tests.
Estimated code review effort: 🎯 2 (Simple) | ⏱️ ~12 minutes
🚥 Pre-merge checks
❌ Failed checks (2 warnings)
✅ Passed checks (2 passed)
Signed-off-by: Yuki Huang <yukih@nvidia.com>
Actionable comments posted: 0
🧹 Nitpick comments (1)
examples/configs/recipes/llm/dapo-qwen2.5-7b.yaml (1)
32-32: Clarify the rationale for changing the checkpoint format from DCP to the default (safetensors).

Setting `model_save_format` to `null` will cause the system to use the default `"safetensors"` format instead of the distributed checkpoint (DCP) format. While this aligns with the PR's goal of fixing flaky nightly tests (safetensors is a more standard, potentially more reliable format for downloads), this behavioral change is not documented in the PR description. Please add a note to the PR description explaining why DCP was replaced with the safetensors default, particularly how this change addresses the checkpoint download delays mentioned in the PR objectives.
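For reference, the change under discussion has roughly this shape. This is a hedged sketch: the surrounding `checkpointing` key path is an assumption, since the review only shows `model_save_format` being set to `null`.

```yaml
# Hypothetical recipe fragment; only model_save_format: null is from the diff.
checkpointing:
  model_save_format: null  # falls back to the "safetensors" default instead of DCP
```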
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (11)
- examples/configs/recipes/llm/dapo-qwen2.5-7b.yaml
- tests/test_suites/llm/dpo-llama3.1-8b-instruct-4n8g-fsdp2tp4.sh
- tests/test_suites/llm/dpo-llama3.1-8b-instruct-4n8g-megatron.v2.sh
- tests/test_suites/llm/performance/grpo-qwen3-235b-16n8g.sh
- tests/test_suites/llm/performance/grpo-qwen3-235b-32n8g-async-1off.sh
- tests/test_suites/llm/sft-llama3.1-70b-8n8g-tp4pp2-long-megatron.sh
- tests/test_suites/llm/sft-llama3.1-8b-1n8g-fsdp2tp1-dynamicbatch.sh
- tests/test_suites/llm/sft-llama3.2-1b-1n8g-fsdp2tp1.v3.sh
- tests/test_suites/llm/sft-nanov3-30BA3B-2n8g-fsdp2-lora.sh
- tests/test_suites/llm/sft-nanov3-30BA3B-2n8g-fsdp2.sh
- tests/unit/test_recipes_and_test_suites.py
💤 Files with no reviewable changes (1)
- tests/test_suites/llm/sft-llama3.1-70b-8n8g-tp4pp2-long-megatron.sh
🧰 Additional context used
📓 Path-based instructions (7)
**/*.sh
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
**/*.sh: Use uv run instead of python to execute scripts
Follow the Google Shell Style Guide for shell scripts
Files:
- tests/test_suites/llm/performance/grpo-qwen3-235b-16n8g.sh
- tests/test_suites/llm/sft-llama3.2-1b-1n8g-fsdp2tp1.v3.sh
- tests/test_suites/llm/sft-nanov3-30BA3B-2n8g-fsdp2-lora.sh
- tests/test_suites/llm/dpo-llama3.1-8b-instruct-4n8g-fsdp2tp4.sh
- tests/test_suites/llm/sft-nanov3-30BA3B-2n8g-fsdp2.sh
- tests/test_suites/llm/performance/grpo-qwen3-235b-32n8g-async-1off.sh
- tests/test_suites/llm/sft-llama3.1-8b-1n8g-fsdp2tp1-dynamicbatch.sh
- tests/test_suites/llm/dpo-llama3.1-8b-instruct-4n8g-megatron.v2.sh
tests/test_suites/**/*.sh
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
tests/test_suites/**/*.sh: When adding support for a new model, create a corresponding driver shell script under tests/test_suites/ in the matching domain
Driver shell scripts should match the YAML base name with .sh extension and invoke training entrypoint with uv run
Files:
- tests/test_suites/llm/performance/grpo-qwen3-235b-16n8g.sh
- tests/test_suites/llm/sft-llama3.2-1b-1n8g-fsdp2tp1.v3.sh
- tests/test_suites/llm/sft-nanov3-30BA3B-2n8g-fsdp2-lora.sh
- tests/test_suites/llm/dpo-llama3.1-8b-instruct-4n8g-fsdp2tp4.sh
- tests/test_suites/llm/sft-nanov3-30BA3B-2n8g-fsdp2.sh
- tests/test_suites/llm/performance/grpo-qwen3-235b-32n8g-async-1off.sh
- tests/test_suites/llm/sft-llama3.1-8b-1n8g-fsdp2tp1-dynamicbatch.sh
- tests/test_suites/llm/dpo-llama3.1-8b-instruct-4n8g-megatron.v2.sh
!(**/tests/**|**/test_*.py|**/test_*.sh)
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
Add the NVIDIA copyright header to all Python files and shell scripts (excluding tests). The header should include the current year
Files:
- tests/test_suites/llm/performance/grpo-qwen3-235b-16n8g.sh
- tests/test_suites/llm/sft-llama3.2-1b-1n8g-fsdp2tp1.v3.sh
- examples/configs/recipes/llm/dapo-qwen2.5-7b.yaml
- tests/test_suites/llm/sft-nanov3-30BA3B-2n8g-fsdp2-lora.sh
- tests/unit/test_recipes_and_test_suites.py
- tests/test_suites/llm/dpo-llama3.1-8b-instruct-4n8g-fsdp2tp4.sh
- tests/test_suites/llm/sft-nanov3-30BA3B-2n8g-fsdp2.sh
- tests/test_suites/llm/performance/grpo-qwen3-235b-32n8g-async-1off.sh
- tests/test_suites/llm/sft-llama3.1-8b-1n8g-fsdp2tp1-dynamicbatch.sh
- tests/test_suites/llm/dpo-llama3.1-8b-instruct-4n8g-megatron.v2.sh
**/*.{py,sh}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
The NVIDIA copyright header should appear at the top of all Python files and shell scripts (excluding tests)
Files:
- tests/test_suites/llm/performance/grpo-qwen3-235b-16n8g.sh
- tests/test_suites/llm/sft-llama3.2-1b-1n8g-fsdp2tp1.v3.sh
- tests/test_suites/llm/sft-nanov3-30BA3B-2n8g-fsdp2-lora.sh
- tests/unit/test_recipes_and_test_suites.py
- tests/test_suites/llm/dpo-llama3.1-8b-instruct-4n8g-fsdp2tp4.sh
- tests/test_suites/llm/sft-nanov3-30BA3B-2n8g-fsdp2.sh
- tests/test_suites/llm/performance/grpo-qwen3-235b-32n8g-async-1off.sh
- tests/test_suites/llm/sft-llama3.1-8b-1n8g-fsdp2tp1-dynamicbatch.sh
- tests/test_suites/llm/dpo-llama3.1-8b-instruct-4n8g-megatron.v2.sh
examples/configs/recipes/**/*.yaml
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
When adding support for a new model, create a recipe YAML under examples/configs/recipes/ in the appropriate domain subdirectory (llm, vlm, etc.)
Files:
examples/configs/recipes/llm/dapo-qwen2.5-7b.yaml
examples/configs/recipes/llm/*.yaml
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
Recipe YAML files should follow the naming pattern `<algo>-<model>-<N>n<G>g-[<modifiers>][-long][.vN].yaml` for LLM recipes
Files:
examples/configs/recipes/llm/dapo-qwen2.5-7b.yaml
**/*.py
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
**/*.py: Conform code to Python 3.12+
Indent code with 4 spaces. Do not use tabs
Use snake_case for file names
Use PascalCase for class names
Use snake_case for function and method names
Use snake_case for local variables
Prefix variable names that start with a number with 'k' (e.g., k_99th_percentile)
Use upper snake_case with 'G' prefix for global variables (e.g., G_MY_GLOBAL)
Use upper snake_case for constants
Avoid shadowing variables declared in an outer scope
Initialize all externally visible members of a class in the constructor
Prefer docstrings over comments for interfaces that may be used outside a file
Reserve comments for code within a function or interfaces that are local to a file
If a piece of code is commented out, include a comment describing its usage and why it's commented out. Remove debug comments before merging
Use Google style docstrings for classes and functions in Python, which can be parsed by Sphinx
Avoid using reflection when functionality can be easily achieved without reflection
When using try-except blocks, limit the except clause to the smallest set of specific errors possible
When using try-except blocks for duck-typing, keep the body of the try as small as possible and use the else block for logic
YAML is the single source of truth for configuration defaults. Do not set non-None defaults in code for configuration values
For required configuration attributes, access config directly and expect presence (e.g., policy_cfg['precision']) without hidden defaults
Use typing.NotRequired to mark optional attributes in TypedDict for configuration
When adding a new config key to a TypedDict subclass, document the key's purpose, valid values/types, and recommended default, and reflect the default in exemplar YAMLs under examples/configs/*.yaml
Follow the Google Python Style Guide for Python code
Files:
tests/unit/test_recipes_and_test_suites.py
🧠 Learnings (2)
📓 Common learnings
Learnt from: zpqiu
Repo: NVIDIA-NeMo/RL PR: 1324
File: tests/test_suites/llm/distillation-qwen3-32b-to-1.7b-base-1n8g-megatron-tp2pp2cp2-pack.sh:6-11
Timestamp: 2025-10-12T14:46:57.171Z
Learning: Test scripts in tests/test_suites/llm/ follow a standard configuration pattern that includes NUM_NODES, STEPS_PER_RUN, MAX_STEPS, NUM_RUNS (calculated as `$(( (MAX_STEPS + STEPS_PER_RUN - 1) / STEPS_PER_RUN ))`), and NUM_MINUTES. These variables are part of the test infrastructure's standard interface and should not be flagged as unused even if not directly referenced within the individual script, as they are consumed by external launch tooling or common.env.
📚 Learning: 2025-10-12T14:46:57.171Z
Learnt from: zpqiu
Repo: NVIDIA-NeMo/RL PR: 1324
File: tests/test_suites/llm/distillation-qwen3-32b-to-1.7b-base-1n8g-megatron-tp2pp2cp2-pack.sh:6-11
Timestamp: 2025-10-12T14:46:57.171Z
Learning: Test scripts in tests/test_suites/llm/ follow a standard configuration pattern that includes NUM_NODES, STEPS_PER_RUN, MAX_STEPS, NUM_RUNS (calculated as `$(( (MAX_STEPS + STEPS_PER_RUN - 1) / STEPS_PER_RUN ))`), and NUM_MINUTES. These variables are part of the test infrastructure's standard interface and should not be flagged as unused even if not directly referenced within the individual script, as they are consumed by external launch tooling or common.env.
Applied to files:
- tests/test_suites/llm/performance/grpo-qwen3-235b-16n8g.sh
- tests/test_suites/llm/sft-llama3.2-1b-1n8g-fsdp2tp1.v3.sh
- tests/test_suites/llm/sft-nanov3-30BA3B-2n8g-fsdp2-lora.sh
- tests/test_suites/llm/sft-nanov3-30BA3B-2n8g-fsdp2.sh
- tests/test_suites/llm/sft-llama3.1-8b-1n8g-fsdp2tp1-dynamicbatch.sh
- tests/test_suites/llm/dpo-llama3.1-8b-instruct-4n8g-megatron.v2.sh
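The standard variable block this learning describes can be sketched as follows. The concrete values are hypothetical; only the `NUM_RUNS` ceiling-division formula is taken from the learning itself.

```shell
#!/usr/bin/env bash
# Hypothetical standard header for a tests/test_suites/llm/ driver script.
NUM_NODES=2        # example value
STEPS_PER_RUN=20   # example value
MAX_STEPS=50       # example value
# Ceiling division: enough runs to cover MAX_STEPS even when it is not
# an exact multiple of STEPS_PER_RUN (formula from the learning above).
NUM_RUNS=$(( (MAX_STEPS + STEPS_PER_RUN - 1) / STEPS_PER_RUN ))
NUM_MINUTES=30     # wall-clock budget consumed by external launch tooling
echo "NUM_RUNS=${NUM_RUNS}"   # prints NUM_RUNS=3 for the values above
```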
🧬 Code graph analysis (1)
tests/unit/test_recipes_and_test_suites.py (1)
tests/unit/conftest.py (1)
tracker(265-296)
🪛 Shellcheck (0.11.0)
tests/test_suites/llm/performance/grpo-qwen3-235b-16n8g.sh
[warning] 12-12: NUM_MINUTES appears unused. Verify use (or export if used externally).
(SC2034)
tests/test_suites/llm/sft-llama3.2-1b-1n8g-fsdp2tp1.v3.sh
[warning] 10-10: NUM_MINUTES appears unused. Verify use (or export if used externally).
(SC2034)
tests/test_suites/llm/sft-nanov3-30BA3B-2n8g-fsdp2-lora.sh
[warning] 10-10: NUM_MINUTES appears unused. Verify use (or export if used externally).
(SC2034)
tests/test_suites/llm/sft-nanov3-30BA3B-2n8g-fsdp2.sh
[warning] 10-10: NUM_MINUTES appears unused. Verify use (or export if used externally).
(SC2034)
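These SC2034 findings are the known false positives described in the learnings above. A hedged sketch of how a driver script can silence them at the source; the `# shellcheck disable=` directive is standard ShellCheck syntax, and the value is hypothetical:

```shell
#!/usr/bin/env bash
# NUM_MINUTES is read by external launch tooling (common.env), not by this
# script itself, so ShellCheck's "appears unused" warning is a false positive.
# shellcheck disable=SC2034
NUM_MINUTES=30
# Exporting is an alternative that also satisfies SC2034's
# "export if used externally" suggestion:
export NUM_MINUTES
echo "NUM_MINUTES=${NUM_MINUTES}"
```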
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (6)
- GitHub Check: sphinx-build / Build docs
- GitHub Check: build-container / main
- GitHub Check: Lint check
- GitHub Check: Lint check
- GitHub Check: Post submodule check comment / Comment on PR
- GitHub Check: Post automodel integration comment / Comment on PR
🔇 Additional comments (10)
tests/test_suites/llm/dpo-llama3.1-8b-instruct-4n8g-megatron.v2.sh (1)
37-37: LGTM! Threshold adjustment aligns with documented root cause.

The relaxation of the initial loss threshold from 3.6 to 3.65 is consistent with the PR's explanation that a `num_workers` change altered the dataset seed, affecting training initialization. The fact that only the step-1 threshold is adjusted (while step-150 and other metrics remain unchanged) confirms that convergence behavior is unaffected, as expected from a seed change.

tests/test_suites/llm/dpo-llama3.1-8b-instruct-4n8g-fsdp2tp4.sh (1)
37-37: LGTM! Consistent threshold adjustment across the test suite.

The threshold relaxation from 3.6 to 3.65 matches the identical change in the megatron variant (`dpo-llama3.1-8b-instruct-4n8g-megatron.v2.sh`), ensuring consistent test criteria across different parallelization strategies for the same model. This consistency is appropriate given that the underlying cause (dataset seed change) affects both test configurations equally.

tests/test_suites/llm/sft-llama3.2-1b-1n8g-fsdp2tp1.v3.sh (1)
10-10: Timeout increase appropriately addresses checkpoint download delays.

Doubling `NUM_MINUTES` from 15 to 30 provides an adequate buffer for initial checkpoint downloads, consistent with the PR's goal of reducing flakiness due to timing variability. Based on learnings, `NUM_MINUTES` is consumed by external launch tooling; the Shellcheck warning can be safely ignored.

tests/test_suites/llm/sft-llama3.1-8b-1n8g-fsdp2tp1-dynamicbatch.sh (1)
37-41: Memory threshold adjustment accounts for observed variance.

Increasing the threshold from 70 GB to 75 GB provides reasonable headroom above the observed 72.6 GB (with noise), preventing spurious failures while maintaining protection against genuine memory regressions.
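A guard of the kind this comment describes might look like the sketch below. The variable names and the way peak memory is obtained are hypothetical; only the 75 GB threshold and the 72.6 GB observation come from the review.

```shell
#!/usr/bin/env bash
# Hypothetical post-run memory check against the relaxed threshold.
observed_gb=72.6   # example peak GPU memory reported by the run
threshold_gb=75    # relaxed from 70 to absorb observed noise
# awk does the floating-point comparison; shell arithmetic is integer-only.
if awk -v o="$observed_gb" -v t="$threshold_gb" 'BEGIN { exit !(o <= t) }'; then
  echo "peak memory OK (${observed_gb} GB <= ${threshold_gb} GB)"
else
  echo "memory regression: ${observed_gb} GB > ${threshold_gb} GB" >&2
  exit 1
fi
```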
tests/test_suites/llm/sft-nanov3-30BA3B-2n8g-fsdp2.sh (1)
10-10: Timeout increase provides adequate buffer for checkpoint downloads.

With ~5 minutes of compute time for 20 steps (line 7: step_time ~ 15 sec), the increased 30-minute limit provides 25 minutes for setup and checkpoint downloads, addressing the flakiness mentioned in the PR objectives. Based on learnings, `NUM_MINUTES` is consumed by external launch tooling; the Shellcheck warning can be safely ignored.

tests/test_suites/llm/sft-nanov3-30BA3B-2n8g-fsdp2-lora.sh (1)
10-10: Timeout increase with clear documentation.

The updated comment clearly explains the rationale for the 30-minute buffer, making the intent transparent for future maintainers. The adjustment is consistent with other similar changes in this PR. Based on learnings, `NUM_MINUTES` is consumed by external launch tooling; the Shellcheck warning can be safely ignored.

tests/unit/test_recipes_and_test_suites.py (1)
183-218: Compute-hour limit adjustment reflects timeout increases across test suites.

The 40 GPU-hour increase (3.5%) appropriately accommodates the timeout extensions applied to multiple test suites in this PR, ensuring nightly tests have sufficient time to complete, including checkpoint downloads.
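As a sanity check on the figures quoted here, a 40 GPU-hour increase amounting to 3.5% implies a prior limit of roughly 1,140 GPU-hours. This is illustrative arithmetic only; the actual limit lives in the unit test.

```shell
#!/usr/bin/env bash
# Back-of-envelope check: increase / (percent / 100) = implied baseline.
increase_h=40
percent=3.5
implied_baseline=$(awk -v i="$increase_h" -v p="$percent" \
  'BEGIN { printf "%.0f", i * 100 / p }')
echo "implied prior limit is about ${implied_baseline} GPU-hours"
```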
tests/test_suites/llm/performance/grpo-qwen3-235b-16n8g.sh (2)
12-12: LGTM! Timeout increase addresses checkpoint download delays.

The 15% increase in NUM_MINUTES (100→115) aligns with the PR objective to handle checkpoint download delays and tests finishing near the time limit.
Note: The Shellcheck warning about NUM_MINUTES being unused is a false positive. Based on learnings, this variable is part of the test infrastructure's standard interface and is consumed by external launch tooling.
27-27: LGTM! TensorBoard enablement fixes missing-logs issue.

Enabling TensorBoard logging ensures that logs are available when line 34 converts them to JSON, preventing test failures from missing TensorBoard logs as described in the PR objectives.
tests/test_suites/llm/performance/grpo-qwen3-235b-32n8g-async-1off.sh (1)
27-27: LGTM! TensorBoard enablement consistent with sibling test.

Enabling TensorBoard logging ensures that logs are available when line 34 converts them to JSON, preventing test failures. This change is consistent with the identical fix applied in `grpo-qwen3-235b-16n8g.sh`.
Signed-off-by: Terry Kong <terryk@nvidia.com> Signed-off-by: NeMo Bot <nemo-bot@nvidia.com>
Signed-off-by: Terry Kong <terryk@nvidia.com> Signed-off-by: Parth Mannan <pmannan@nvidia.com>
Signed-off-by: Terry Kong <terryk@nvidia.com> Signed-off-by: yuanhangs <yuanhangs@nvidia.com>
`num_workers` change that changed the dataset seed; needed to update these test metrics.

Summary by CodeRabbit
Tests
Chores