
cp: fix: fix several nightly tests that were flaky (1724) into r0.5.0 #1735

Merged
terrykong merged 1 commit into r0.5.0 from cherry-pick-1724-r0.5.0 on Jan 7, 2026
Conversation

chtruong814 (Contributor) commented Jan 7, 2026

beep boop [🤖]: Hi @terrykong 👋,

we've cherry-picked #1724 into r0.5.0 for you! 🚀

Please review and approve this cherry-pick at your convenience!

Summary by CodeRabbit

  • Tests

    • Relaxed metric thresholds across DPO and SFT test suites to reduce flaky failures.
    • Enabled TensorBoard logging in performance test configurations.
    • Adjusted GPU memory constraints and timeout values in various test scripts.
  • Chores

    • Modified model checkpointing configuration settings.
    • Updated nightly compute resource validation limits.


Signed-off-by: Terry Kong <terryk@nvidia.com>
Signed-off-by: NeMo Bot <nemo-bot@nvidia.com>
coderabbitai bot (Contributor) commented Jan 7, 2026

📝 Walkthrough

This pull request updates test-suite configurations and metric thresholds across multiple LLM training test scripts. Changes include adjusting GPU memory thresholds in SFT tests, increasing NUM_MINUTES timeout values in configuration scripts, enabling TensorBoard logging in performance tests, updating DPO training loss thresholds, and raising the nightly GPU-hours budget limit from 1140 to 1180 hours.

Changes

  • Configuration save format (examples/configs/recipes/llm/dapo-qwen2.5-7b.yaml)
    Removed the explicit model save format by changing checkpointing.model_save_format from "dcp" to null.
  • DPO test metric thresholds (tests/test_suites/llm/dpo-llama3.1-8b-instruct-4n8g-fsdp2tp4.sh, tests/test_suites/llm/dpo-llama3.1-8b-instruct-4n8g-megatron.v2.sh)
    Raised the train/loss metric threshold at step 1 from 3.60/3.6 to 3.65 in the final metrics validation.
  • Performance test timeouts & logging (tests/test_suites/llm/performance/grpo-qwen3-235b-16n8g.sh, tests/test_suites/llm/performance/grpo-qwen3-235b-32n8g-async-1off.sh)
    Increased NUM_MINUTES from 100 to 115; enabled TensorBoard logging via the logger.tensorboard_enabled=True parameter.
  • SFT test memory & runtime adjustments (tests/test_suites/llm/sft-llama3.1-70b-8n8g-tp4pp2-long-megatron.sh, tests/test_suites/llm/sft-llama3.1-8b-1n8g-fsdp2tp1-dynamicbatch.sh, tests/test_suites/llm/sft-llama3.2-1b-1n8g-fsdp2tp1.v3.sh, tests/test_suites/llm/sft-nanov3-30BA3B-2n8g-fsdp2-lora.sh, tests/test_suites/llm/sft-nanov3-30BA3B-2n8g-fsdp2.sh)
    Removed the GPU memory constraint check from one script; raised the max GPU memory threshold from 70 to 75 GB in another; increased the NUM_MINUTES timeout from 15 to 30 in multiple LoRA and nanov3 variants.
  • Nightly test budget threshold (tests/unit/test_recipes_and_test_suites.py)
    Renamed test_nightly_compute_stays_below_1140_hours to test_nightly_compute_stays_below_1180_hours and updated the GPU-hours assertion and error message from 1140 to 1180.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~12 minutes


Suggested labels

CI:L0, r0.5.0

Suggested reviewers

  • terrykong
  • yfw
🚥 Pre-merge checks | ✅ 3 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage ⚠️ Warning: Docstring coverage is 50.00%, below the required threshold of 80.00%. Run @coderabbitai generate docstrings to improve docstring coverage.

✅ Passed checks (3 passed)

  • Description Check ✅ Passed: Check skipped; CodeRabbit's high-level summary is enabled.
  • Title Check ✅ Passed: The title accurately describes a cherry-pick of PR #1724 (fixing flaky nightly tests) into the r0.5.0 branch, which aligns with the changeset's purpose.
  • Test Results For Major Changes ✅ Passed: Minor adjustments to test configurations and thresholds aimed at fixing flaky nightly tests, with no new features or breaking changes.


coderabbitai bot (Contributor) left a comment

Actionable comments posted: 0

🧹 Nitpick comments (1)
tests/test_suites/llm/sft-nanov3-30BA3B-2n8g-fsdp2-lora.sh (1)

10-10: LGTM: Timeout increase with clarifying comment.

The timeout increase from 15 to 30 minutes appropriately addresses test flakiness. The comment helpfully explains that the additional time buffers for initial checkpoint download. Based on learnings, NUM_MINUTES is consumed by external launch tooling.

Optional: Clarify the comment wording

The comment mentions both "3 minutes" and "30min", which is slightly confusing. Consider:

-NUM_MINUTES=30 # Usually 15 minutes is enough for 20 steps, but we add a buffer of 3 minutes in metrics check (30min to buffer for initial ckpt download)
+NUM_MINUTES=30 # 15 minutes typically suffices for 20 steps; extended to 30 minutes to buffer for initial checkpoint download
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 06584bb and a14f97b.

📒 Files selected for processing (11)
  • examples/configs/recipes/llm/dapo-qwen2.5-7b.yaml
  • tests/test_suites/llm/dpo-llama3.1-8b-instruct-4n8g-fsdp2tp4.sh
  • tests/test_suites/llm/dpo-llama3.1-8b-instruct-4n8g-megatron.v2.sh
  • tests/test_suites/llm/performance/grpo-qwen3-235b-16n8g.sh
  • tests/test_suites/llm/performance/grpo-qwen3-235b-32n8g-async-1off.sh
  • tests/test_suites/llm/sft-llama3.1-70b-8n8g-tp4pp2-long-megatron.sh
  • tests/test_suites/llm/sft-llama3.1-8b-1n8g-fsdp2tp1-dynamicbatch.sh
  • tests/test_suites/llm/sft-llama3.2-1b-1n8g-fsdp2tp1.v3.sh
  • tests/test_suites/llm/sft-nanov3-30BA3B-2n8g-fsdp2-lora.sh
  • tests/test_suites/llm/sft-nanov3-30BA3B-2n8g-fsdp2.sh
  • tests/unit/test_recipes_and_test_suites.py
💤 Files with no reviewable changes (1)
  • tests/test_suites/llm/sft-llama3.1-70b-8n8g-tp4pp2-long-megatron.sh
🧰 Additional context used
📓 Path-based instructions (7)
**/*.sh

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.sh: Use uv run instead of python to execute scripts
Follow the Google Shell Style Guide for shell scripts

Files:

  • tests/test_suites/llm/sft-nanov3-30BA3B-2n8g-fsdp2.sh
  • tests/test_suites/llm/performance/grpo-qwen3-235b-32n8g-async-1off.sh
  • tests/test_suites/llm/sft-llama3.1-8b-1n8g-fsdp2tp1-dynamicbatch.sh
  • tests/test_suites/llm/dpo-llama3.1-8b-instruct-4n8g-fsdp2tp4.sh
  • tests/test_suites/llm/sft-nanov3-30BA3B-2n8g-fsdp2-lora.sh
  • tests/test_suites/llm/dpo-llama3.1-8b-instruct-4n8g-megatron.v2.sh
  • tests/test_suites/llm/performance/grpo-qwen3-235b-16n8g.sh
  • tests/test_suites/llm/sft-llama3.2-1b-1n8g-fsdp2tp1.v3.sh
tests/test_suites/**/*.sh

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

tests/test_suites/**/*.sh: When adding support for a new model, create a corresponding driver shell script under tests/test_suites/ in the matching domain
Driver shell scripts should match the YAML base name with .sh extension and invoke training entrypoint with uv run

Files:

  • tests/test_suites/llm/sft-nanov3-30BA3B-2n8g-fsdp2.sh
  • tests/test_suites/llm/performance/grpo-qwen3-235b-32n8g-async-1off.sh
  • tests/test_suites/llm/sft-llama3.1-8b-1n8g-fsdp2tp1-dynamicbatch.sh
  • tests/test_suites/llm/dpo-llama3.1-8b-instruct-4n8g-fsdp2tp4.sh
  • tests/test_suites/llm/sft-nanov3-30BA3B-2n8g-fsdp2-lora.sh
  • tests/test_suites/llm/dpo-llama3.1-8b-instruct-4n8g-megatron.v2.sh
  • tests/test_suites/llm/performance/grpo-qwen3-235b-16n8g.sh
  • tests/test_suites/llm/sft-llama3.2-1b-1n8g-fsdp2tp1.v3.sh
!(**/tests/**|**/test_*.py|**/test_*.sh)

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Add the NVIDIA copyright header to all Python files and shell scripts (excluding tests). The header should include the current year

Files:

  • tests/test_suites/llm/sft-nanov3-30BA3B-2n8g-fsdp2.sh
  • tests/test_suites/llm/performance/grpo-qwen3-235b-32n8g-async-1off.sh
  • tests/test_suites/llm/sft-llama3.1-8b-1n8g-fsdp2tp1-dynamicbatch.sh
  • tests/test_suites/llm/dpo-llama3.1-8b-instruct-4n8g-fsdp2tp4.sh
  • tests/test_suites/llm/sft-nanov3-30BA3B-2n8g-fsdp2-lora.sh
  • tests/test_suites/llm/dpo-llama3.1-8b-instruct-4n8g-megatron.v2.sh
  • tests/unit/test_recipes_and_test_suites.py
  • examples/configs/recipes/llm/dapo-qwen2.5-7b.yaml
  • tests/test_suites/llm/performance/grpo-qwen3-235b-16n8g.sh
  • tests/test_suites/llm/sft-llama3.2-1b-1n8g-fsdp2tp1.v3.sh
**/*.{py,sh}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

The NVIDIA copyright header should appear at the top of all Python files and shell scripts (excluding tests)

Files:

  • tests/test_suites/llm/sft-nanov3-30BA3B-2n8g-fsdp2.sh
  • tests/test_suites/llm/performance/grpo-qwen3-235b-32n8g-async-1off.sh
  • tests/test_suites/llm/sft-llama3.1-8b-1n8g-fsdp2tp1-dynamicbatch.sh
  • tests/test_suites/llm/dpo-llama3.1-8b-instruct-4n8g-fsdp2tp4.sh
  • tests/test_suites/llm/sft-nanov3-30BA3B-2n8g-fsdp2-lora.sh
  • tests/test_suites/llm/dpo-llama3.1-8b-instruct-4n8g-megatron.v2.sh
  • tests/unit/test_recipes_and_test_suites.py
  • tests/test_suites/llm/performance/grpo-qwen3-235b-16n8g.sh
  • tests/test_suites/llm/sft-llama3.2-1b-1n8g-fsdp2tp1.v3.sh
**/*.py

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.py: Conform code to Python 3.12+
Indent code with 4 spaces. Do not use tabs
Use snake_case for file names
Use PascalCase for class names
Use snake_case for function and method names
Use snake_case for local variables
Prefix variable names that start with a number with 'k' (e.g., k_99th_percentile)
Use upper snake_case with 'G' prefix for global variables (e.g., G_MY_GLOBAL)
Use upper snake_case for constants
Avoid shadowing variables declared in an outer scope
Initialize all externally visible members of a class in the constructor
Prefer docstrings over comments for interfaces that may be used outside a file
Reserve comments for code within a function or interfaces that are local to a file
If a piece of code is commented out, include a comment describing its usage and why it's commented out. Remove debug comments before merging
Use Google style docstrings for classes and functions in Python, which can be parsed by Sphinx
Avoid using reflection when functionality can be easily achieved without reflection
When using try-except blocks, limit the except clause to the smallest set of specific errors possible
When using try-except blocks for duck-typing, keep the body of the try as small as possible and use the else block for logic
YAML is the single source of truth for configuration defaults. Do not set non-None defaults in code for configuration values
For required configuration attributes, access config directly and expect presence (e.g., policy_cfg['precision']) without hidden defaults
Use typing.NotRequired to mark optional attributes in TypedDict for configuration
When adding a new config key to a TypedDict subclass, document the key's purpose, valid values/types, and recommended default, and reflect the default in exemplar YAMLs under examples/configs/*.yaml
Follow the Google Python Style Guide for Python code

Files:

  • tests/unit/test_recipes_and_test_suites.py
examples/configs/recipes/**/*.yaml

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

When adding support for a new model, create a recipe YAML under examples/configs/recipes/ in the appropriate domain subdirectory (llm, vlm, etc.)

Files:

  • examples/configs/recipes/llm/dapo-qwen2.5-7b.yaml
examples/configs/recipes/llm/*.yaml

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Recipe YAML files should follow the naming pattern: --ng-[-modifiers][-long][.vN].yaml for LLM recipes

Files:

  • examples/configs/recipes/llm/dapo-qwen2.5-7b.yaml
🧠 Learnings (2)
📓 Common learnings
Learnt from: zpqiu
Repo: NVIDIA-NeMo/RL PR: 1324
File: tests/test_suites/llm/distillation-qwen3-32b-to-1.7b-base-1n8g-megatron-tp2pp2cp2-pack.sh:6-11
Timestamp: 2025-10-12T14:46:57.171Z
Learning: Test scripts in tests/test_suites/llm/ follow a standard configuration pattern that includes NUM_NODES, STEPS_PER_RUN, MAX_STEPS, NUM_RUNS (calculated as `$(( (MAX_STEPS + STEPS_PER_RUN - 1) / STEPS_PER_RUN ))`), and NUM_MINUTES. These variables are part of the test infrastructure's standard interface and should not be flagged as unused even if not directly referenced within the individual script, as they are consumed by external launch tooling or common.env.
📚 Learning: 2025-10-12T14:46:57.171Z
Learnt from: zpqiu (Repo: NVIDIA-NeMo/RL, PR 1324); same learning as under Common learnings above.

Applied to files:

  • tests/test_suites/llm/sft-nanov3-30BA3B-2n8g-fsdp2.sh
  • tests/test_suites/llm/sft-llama3.1-8b-1n8g-fsdp2tp1-dynamicbatch.sh
  • tests/test_suites/llm/dpo-llama3.1-8b-instruct-4n8g-fsdp2tp4.sh
  • tests/test_suites/llm/sft-nanov3-30BA3B-2n8g-fsdp2-lora.sh
  • tests/test_suites/llm/dpo-llama3.1-8b-instruct-4n8g-megatron.v2.sh
  • tests/test_suites/llm/performance/grpo-qwen3-235b-16n8g.sh
  • tests/test_suites/llm/sft-llama3.2-1b-1n8g-fsdp2tp1.v3.sh
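The standard pattern described in this learning computes NUM_RUNS with shell ceiling division: NUM_RUNS=$(( (MAX_STEPS + STEPS_PER_RUN - 1) / STEPS_PER_RUN )). As a quick illustration, the same arithmetic in Python (the names simply mirror the shell variables):

```python
def num_runs(max_steps: int, steps_per_run: int) -> int:
    """Number of runs needed to cover max_steps, rounding up.

    Mirrors the shell arithmetic used by the test scripts:
    NUM_RUNS=$(( (MAX_STEPS + STEPS_PER_RUN - 1) / STEPS_PER_RUN ))
    """
    return (max_steps + steps_per_run - 1) // steps_per_run

assert num_runs(20, 20) == 1  # exactly divisible: a single run
assert num_runs(25, 20) == 2  # any remainder adds one more run
```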
🧬 Code graph analysis (1)
tests/unit/test_recipes_and_test_suites.py (1)
tests/unit/conftest.py (1)
  • tracker (265-296)
🪛 Shellcheck (0.11.0)
tests/test_suites/llm/sft-nanov3-30BA3B-2n8g-fsdp2.sh

[warning] 10-10: NUM_MINUTES appears unused. Verify use (or export if used externally).

(SC2034)

tests/test_suites/llm/sft-nanov3-30BA3B-2n8g-fsdp2-lora.sh

[warning] 10-10: NUM_MINUTES appears unused. Verify use (or export if used externally).

(SC2034)

tests/test_suites/llm/performance/grpo-qwen3-235b-16n8g.sh

[warning] 12-12: NUM_MINUTES appears unused. Verify use (or export if used externally).

(SC2034)

tests/test_suites/llm/sft-llama3.2-1b-1n8g-fsdp2tp1.v3.sh

[warning] 10-10: NUM_MINUTES appears unused. Verify use (or export if used externally).

(SC2034)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)
  • GitHub Check: Lint check
  • GitHub Check: Lint check
  • GitHub Check: Lint check
  • GitHub Check: Post automodel integration comment / Comment on PR
  • GitHub Check: Post submodule check comment / Comment on PR
🔇 Additional comments (11)
tests/test_suites/llm/dpo-llama3.1-8b-instruct-4n8g-megatron.v2.sh (1)

37-37: LGTM: Reasonable threshold adjustment to fix test flakiness.

The threshold increase from 3.60 to 3.65 for the initial training loss at step 1 is a minimal adjustment (1.4% increase) that appropriately addresses test flakiness while maintaining test effectiveness. This change is consistent with the PR objective to fix flaky nightly tests.
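The test scripts express this check in shell against logged metrics; purely as an illustration, here is how the relaxed step-1 loss check might look as a small Python helper (the function name and data shape are hypothetical, not the repository's actual API):

```python
def check_metric_at_step(
    metrics: dict[int, dict[str, float]],
    step: int,
    key: str,
    max_value: float,
) -> None:
    """Assert that a logged metric at a given step stays at or below a threshold."""
    value = metrics[step][key]
    assert value <= max_value, (
        f"{key} at step {step} was {value}, exceeding threshold {max_value}"
    )

# A step-1 train/loss of 3.62 would have tripped the old 3.60 threshold
# but passes the relaxed 3.65 threshold introduced by this PR.
observed = {1: {"train/loss": 3.62}}
check_metric_at_step(observed, step=1, key="train/loss", max_value=3.65)
```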

tests/test_suites/llm/dpo-llama3.1-8b-instruct-4n8g-fsdp2tp4.sh (1)

37-37: LGTM: Consistent threshold adjustment across test configurations.

The threshold increase from 3.60 to 3.65 matches the identical change in the megatron.v2 test script, demonstrating a systematic and consistent approach to fixing flakiness across both megatron and fsdp2tp4 configurations.

tests/test_suites/llm/sft-llama3.1-8b-1n8g-fsdp2tp1-dynamicbatch.sh (2)

37-37: Good practice: Documenting observed memory behavior.

The comment provides helpful context for the threshold adjustment, noting both the typical memory usage and its variability.


41-41: LGTM: Threshold adjustment addresses test flakiness.

The increase from 70 GB to 75 GB is appropriate given the observed memory usage of ~72.6 GB (noted in line 37). The 5 GB buffer provides reasonable headroom for measurement variability while still catching actual memory issues. This change aligns with the PR objective of fixing flaky nightly tests.
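To make the headroom concrete, a minimal sketch of such a guard (a hypothetical helper, not the script's actual shell check):

```python
def check_peak_gpu_memory(peak_gb: float, limit_gb: float = 75.0) -> None:
    """Assert that observed peak GPU memory stays within the configured limit."""
    assert peak_gb <= limit_gb, (
        f"Peak GPU memory {peak_gb:.1f} GB exceeds the {limit_gb:.1f} GB limit"
    )

# The observed ~72.6 GB fits under the new 75 GB limit with ~2.4 GB headroom,
# whereas the old 70 GB limit would have flagged it as a failure.
check_peak_gpu_memory(72.6)
```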

examples/configs/recipes/llm/dapo-qwen2.5-7b.yaml (1)

28-32: This configuration change is architecturally required for DTensorPolicyWorker v1 compatibility.

Setting model_save_format to null is not a workaround but a mandatory constraint enforced by the codebase. When model_save_format is null, the system automatically defaults to "safetensors" format. This is consistent across other DAPO/GRPO recipes using _v2: false configuration.

tests/test_suites/llm/sft-nanov3-30BA3B-2n8g-fsdp2.sh (1)

10-10: LGTM: Timeout increase to mitigate flaky tests.

Doubling the timeout from 15 to 30 minutes is a reasonable adjustment for addressing test flakiness. Based on learnings, NUM_MINUTES is consumed by external launch tooling and is correctly defined here.

tests/test_suites/llm/sft-llama3.2-1b-1n8g-fsdp2tp1.v3.sh (1)

10-10: LGTM: Consistent timeout increase.

The timeout increase from 15 to 30 minutes aligns with similar adjustments across other test suites in this PR to reduce flakiness. Based on learnings, NUM_MINUTES is correctly configured for consumption by external launch tooling.

tests/test_suites/llm/performance/grpo-qwen3-235b-16n8g.sh (2)

12-12: LGTM: Conservative timeout increase for large-scale test.

The 15% timeout increase (100→115 minutes) provides additional headroom for this 16-node performance test while remaining conservative. Based on learnings, NUM_MINUTES is consumed by external launch tooling.


27-27: LGTM: Explicit TensorBoard logging enabled.

Explicitly enabling TensorBoard logging ensures consistent metrics collection, which the script already depends on (line 34 converts TB logs to JSON). This change improves test reliability.

tests/test_suites/llm/performance/grpo-qwen3-235b-32n8g-async-1off.sh (1)

27-27: LGTM: Consistent TensorBoard logging enablement.

Explicitly enabling TensorBoard logging aligns with the change in the 16-node configuration (grpo-qwen3-235b-16n8g.sh) and ensures reliable metrics collection for this 32-node async test.

tests/unit/test_recipes_and_test_suites.py (1)

183-218: Threshold increase is consistent and properly implemented across all references.

All three locations correctly reflect the new 1180 GPU-hour limit: the function name, the assertion condition, and the error message. A search confirms no remaining references to the old 1140 threshold, indicating a clean migration. This ~3.5% budget increase is a reasonable accommodation for execution variance when addressing flaky tests, and the tracker.track() call enables continued monitoring of actual usage.
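As a rough sketch of the budget guard described above (names and the example hours value are illustrative; the real test lives in tests/unit/test_recipes_and_test_suites.py and derives total GPU-hours from the test-suite definitions):

```python
G_NIGHTLY_GPU_HOURS_LIMIT = 1180  # raised from 1140 by this PR

def check_nightly_budget(total_gpu_hours: float) -> None:
    """Fail if the nightly test suite exceeds its GPU-hours budget."""
    assert total_gpu_hours < G_NIGHTLY_GPU_HOURS_LIMIT, (
        f"Nightly compute uses {total_gpu_hours:.1f} GPU-hours, which exceeds "
        f"the {G_NIGHTLY_GPU_HOURS_LIMIT} GPU-hour budget"
    )

# A hypothetical 1150-hour nightly run fits the new 1180-hour budget,
# but would have failed under the old 1140-hour limit.
check_nightly_budget(1150.0)
```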

yuki-97 added the CI:L1 (Run doctests, unit tests, and functional tests) label on Jan 7, 2026
terrykong merged commit 5aeca40 into r0.5.0 on Jan 7, 2026
69 of 72 checks passed
terrykong deleted the cherry-pick-1724-r0.5.0 branch on January 7, 2026 at 17:35

Labels

  • cherry-pick
  • CI:L1 (Run doctests, unit tests, and functional tests)
  • Run CICD


3 participants