
fix: relax nanov3 nightly test metrics strict#1712

Merged
terrykong merged 1 commit into main from ruit/relax_nanov3_strict
Jan 5, 2026
Conversation

@RayenTian RayenTian (Contributor) commented Jan 5, 2026

What does this PR do ?

Add a one line overview of what this PR aims to accomplish.

Issues

List issues that this PR closes (syntax):

Usage

  • You can potentially add a usage example below
# Add a code snippet demonstrating how to use this

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you run the unit tests and functional tests locally? Visit our Testing Guide for how to run tests.
  • Did you add or update any necessary documentation? Visit our Document Development Guide for how to write, build and test the docs.

Additional Information

  • ...

Summary by CodeRabbit

  • Tests
    • Updated training loss metric thresholds in test suite validations to accommodate variations in model training performance.


Signed-off-by: ruit <ruit@nvidia.com>
@RayenTian RayenTian marked this pull request as ready for review January 5, 2026 07:35
@RayenTian RayenTian requested a review from a team as a code owner January 5, 2026 07:35
@RayenTian RayenTian requested a review from terrykong January 5, 2026 07:35
@RayenTian RayenTian added the CI:L1 Run doctests, unit tests, and functional tests label Jan 5, 2026
@terrykong terrykong added the CI:docs Run doctest and r0.5.0 labels and removed the CI:L1 Run doctests, unit tests, and functional tests and CI:docs Run doctest labels Jan 5, 2026
@terrykong terrykong enabled auto-merge (squash) January 5, 2026 07:36
@coderabbitai coderabbitai bot commented Jan 5, 2026

📝 Walkthrough

Two LLM training test suite scripts have their loss metric validation thresholds increased. The sft-nanov3-30BA3B-2n8g-fsdp2-lora.sh script now asserts data["train/loss"]["20"] < 2.05 instead of < 2.03, and the sft-nanov3-30BA3B-2n8g-fsdp2.sh script updates its threshold from < 1.98 to < 2.05. All other logic remains unchanged.
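The checks described above boil down to loading logged training metrics and comparing one value against a threshold. A minimal Python sketch of that pattern, assuming a hypothetical metrics file with the same `train/loss` key layout as the assertion quoted in the walkthrough (the real shell scripts may obtain the data differently):

```python
import json

# Stand-in for a metrics file a nightly test run would produce;
# the JSON layout here is an assumption for illustration only.
metrics_json = '{"train/loss": {"20": 2.01}}'
data = json.loads(metrics_json)

# The relaxed check from this PR: step-20 training loss must stay
# below 2.05 (previously 1.98 for the standard variant and 2.03
# for the LoRA variant).
assert data["train/loss"]["20"] < 2.05, "train/loss at step 20 regressed"
print("threshold check passed")
```

If the logged loss drifts above the threshold, the assertion fails and the nightly test is flagged, which is exactly the strictness this PR relaxes.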

Changes

  • LLM SFT Test Suite Thresholds (tests/test_suites/llm/sft-nanov3-30BA3B-2n8g-fsdp2-lora.sh, tests/test_suites/llm/sft-nanov3-30BA3B-2n8g-fsdp2.sh): updated the training loss metric thresholds. The LoRA variant was raised from 2.03 to 2.05 and the standard variant from 1.98 to 2.05.

Estimated code review effort

🎯 1 (Trivial) | ⏱️ ~3 minutes

Suggested labels

CI:L1

Suggested reviewers

  • joyang-nv
  • yuki-97
  • terrykong

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)

  • Test Results For Major Changes (⚠️ Warning): the PR relaxes training loss thresholds in test files without explanation. Changes to convergence metrics require justification to rule out regression. Resolution: update the PR description to explain the threshold changes, document what caused the higher loss values, and demonstrate that they are expected rather than masking a regression.

✅ Passed checks (3 passed)

  • Description Check (✅ Passed): check skipped; CodeRabbit's high-level summary is enabled.
  • Title Check (✅ Passed): the title clearly describes the main change: relaxing the strictness of the nanov3 nightly test metrics by raising thresholds from 1.98 to 2.05 and from 2.03 to 2.05.
  • Docstring Coverage (✅ Passed): no functions found in the changed files to evaluate; docstring coverage check skipped.
✨ Finishing touches
  • 📝 Generate docstrings

📜 Recent review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 13c3cd6 and 3534a18.

📒 Files selected for processing (2)
  • tests/test_suites/llm/sft-nanov3-30BA3B-2n8g-fsdp2-lora.sh
  • tests/test_suites/llm/sft-nanov3-30BA3B-2n8g-fsdp2.sh
🧰 Additional context used
📓 Path-based instructions (4)
**/*.sh

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.sh: Use uv run instead of python to execute scripts
Follow the Google Shell Style Guide for shell scripts

Files:

  • tests/test_suites/llm/sft-nanov3-30BA3B-2n8g-fsdp2.sh
  • tests/test_suites/llm/sft-nanov3-30BA3B-2n8g-fsdp2-lora.sh
tests/test_suites/**/*.sh

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

tests/test_suites/**/*.sh: When adding support for a new model, create a corresponding driver shell script under tests/test_suites/ in the matching domain
Driver shell scripts should match the YAML base name with .sh extension and invoke training entrypoint with uv run

Files:

  • tests/test_suites/llm/sft-nanov3-30BA3B-2n8g-fsdp2.sh
  • tests/test_suites/llm/sft-nanov3-30BA3B-2n8g-fsdp2-lora.sh
!(**/tests/**|**/test_*.py|**/test_*.sh)

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Add the NVIDIA copyright header to all Python files and shell scripts (excluding tests). The header should include the current year

Files:

  • tests/test_suites/llm/sft-nanov3-30BA3B-2n8g-fsdp2.sh
  • tests/test_suites/llm/sft-nanov3-30BA3B-2n8g-fsdp2-lora.sh
**/*.{py,sh}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

The NVIDIA copyright header should appear at the top of all Python files and shell scripts (excluding tests)

Files:

  • tests/test_suites/llm/sft-nanov3-30BA3B-2n8g-fsdp2.sh
  • tests/test_suites/llm/sft-nanov3-30BA3B-2n8g-fsdp2-lora.sh
🧠 Learnings (1)
📚 Learning: 2025-10-12T14:46:57.171Z
Learnt from: zpqiu
Repo: NVIDIA-NeMo/RL PR: 1324
File: tests/test_suites/llm/distillation-qwen3-32b-to-1.7b-base-1n8g-megatron-tp2pp2cp2-pack.sh:6-11
Timestamp: 2025-10-12T14:46:57.171Z
Learning: Test scripts in tests/test_suites/llm/ follow a standard configuration pattern that includes NUM_NODES, STEPS_PER_RUN, MAX_STEPS, NUM_RUNS (calculated as `$(( (MAX_STEPS + STEPS_PER_RUN - 1) / STEPS_PER_RUN ))`), and NUM_MINUTES. These variables are part of the test infrastructure's standard interface and should not be flagged as unused even if not directly referenced within the individual script, as they are consumed by external launch tooling or common.env.

Applied to files:

  • tests/test_suites/llm/sft-nanov3-30BA3B-2n8g-fsdp2.sh
  • tests/test_suites/llm/sft-nanov3-30BA3B-2n8g-fsdp2-lora.sh
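The `NUM_RUNS` expression cited in the learning above is a standard integer ceiling division. A small Python sketch of the same arithmetic (variable names come from the learning note; the step counts are made-up values):

```python
# Shell original: NUM_RUNS=$(( (MAX_STEPS + STEPS_PER_RUN - 1) / STEPS_PER_RUN ))
# Illustrative values, not taken from any particular test script.
STEPS_PER_RUN = 40
MAX_STEPS = 100

# Adding (divisor - 1) before integer division rounds up, so the
# computed number of runs always covers all MAX_STEPS steps.
NUM_RUNS = (MAX_STEPS + STEPS_PER_RUN - 1) // STEPS_PER_RUN
print(NUM_RUNS)  # 3 runs of 40 steps cover 100 steps
```

Plain integer division would give 2 here and leave the last 20 steps unscheduled, which is why the test infrastructure rounds up.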
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
  • GitHub Check: pre-flight
  • GitHub Check: sphinx-build / Build docs
  • GitHub Check: Lint check
  • GitHub Check: Post automodel integration comment / Comment on PR
🔇 Additional comments (2)
tests/test_suites/llm/sft-nanov3-30BA3B-2n8g-fsdp2.sh (1)

36-36: Verify the rationale for the relaxed loss threshold.

The loss threshold has been increased from 1.98 to 2.05 (a ~3.5% increase), which aligns with the PR objective to relax nanov3 metrics strictness. The change is technically correct and consistent with the other nanov3 test script.

However, it would be helpful to document the reasoning for this specific threshold value. Was this based on recent nightly test results, observed training variance, or infrastructure changes?

Consider adding a comment in the PR description explaining what data or observations motivated these threshold adjustments to help future maintainers understand the decision.

tests/test_suites/llm/sft-nanov3-30BA3B-2n8g-fsdp2-lora.sh (1)

36-36: LGTM - Threshold aligned with the non-LoRA variant.

The loss threshold has been updated from 2.03 to 2.05, matching the threshold in the non-LoRA variant (sft-nanov3-30BA3B-2n8g-fsdp2.sh). This consistency simplifies threshold management across related test configurations.



@terrykong terrykong merged commit c84a1b6 into main Jan 5, 2026
74 of 81 checks passed
@terrykong terrykong deleted the ruit/relax_nanov3_strict branch January 5, 2026 12:11
chtruong814 pushed a commit that referenced this pull request Jan 5, 2026
Signed-off-by: ruit <ruit@nvidia.com>
Signed-off-by: NeMo Bot <nemo-bot@nvidia.com>
parthmannan pushed a commit to parthmannan/RL that referenced this pull request Jan 15, 2026
Signed-off-by: ruit <ruit@nvidia.com>
Signed-off-by: Parth Mannan <pmannan@nvidia.com>
yuanhangsu1986 pushed a commit to yuanhangsu1986/RL-Nemontron-Edge-Omni that referenced this pull request Feb 12, 2026
Signed-off-by: ruit <ruit@nvidia.com>
Signed-off-by: yuanhangs <yuanhangs@nvidia.com>
yuanhangsu1986 pushed a commit to yuanhangsu1986/RL-Nemontron-Edge-Omni that referenced this pull request Feb 21, 2026
Signed-off-by: ruit <ruit@nvidia.com>
Signed-off-by: yuanhangs <yuanhangs@nvidia.com>
seonjinn pushed a commit that referenced this pull request Mar 8, 2026
Signed-off-by: ruit <ruit@nvidia.com>
seonjinn pushed a commit that referenced this pull request Mar 8, 2026
Signed-off-by: ruit <ruit@nvidia.com>
seonjinn pushed a commit that referenced this pull request Mar 9, 2026
Signed-off-by: ruit <ruit@nvidia.com>
