
fix: relax nanov3 nightly test metrics strict#1712

Merged
terrykong merged 1 commit into main from ruit/relax_nanov3_strict
Jan 5, 2026
Conversation

@RayenTian RayenTian (Contributor) commented Jan 5, 2026

What does this PR do ?

Add a one line overview of what this PR aims to accomplish.

Issues

List issues that this PR closes (syntax):

Usage

  • You can potentially add a usage example below
# Add a code snippet demonstrating how to use this

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you run the unit tests and functional tests locally? Visit our Testing Guide for how to run tests.
  • Did you add or update any necessary documentation? Visit our Document Development Guide for how to write, build and test the docs.

Additional Information

  • ...

Summary by CodeRabbit

  • Tests
    • Updated training loss metric thresholds in test suite validations to accommodate variations in model training performance.


Signed-off-by: ruit <ruit@nvidia.com>
@RayenTian RayenTian marked this pull request as ready for review January 5, 2026 07:35
@RayenTian RayenTian requested a review from a team as a code owner January 5, 2026 07:35
@RayenTian RayenTian requested a review from terrykong January 5, 2026 07:35
@RayenTian RayenTian added the CI:L1 Run doctests, unit tests, and functional tests label Jan 5, 2026
@terrykong terrykong added the CI:docs Run doctest and r0.5.0 labels and removed the CI:L1 Run doctests, unit tests, and functional tests and CI:docs Run doctest labels Jan 5, 2026
@terrykong terrykong enabled auto-merge (squash) January 5, 2026 07:36
@coderabbitai coderabbitai bot commented Jan 5, 2026

📝 Walkthrough

Two LLM training test suite scripts have their loss metric validation thresholds increased. The sft-nanov3-30BA3B-2n8g-fsdp2-lora.sh script now asserts data["train/loss"]["20"] < 2.05 instead of < 2.03, and the sft-nanov3-30BA3B-2n8g-fsdp2.sh script updates its threshold from < 1.98 to < 2.05. All other logic remains unchanged.
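The checks described above boil down to loading logged training metrics and comparing one value against a threshold. A minimal Python sketch of that pattern, assuming a hypothetical metrics file with the same `train/loss` key layout as the assertion quoted in the walkthrough (the real shell scripts may obtain the data differently):

```python
import json

# Stand-in for a metrics file a nightly test run would produce;
# the JSON layout here is an assumption for illustration only.
metrics_json = '{"train/loss": {"20": 2.01}}'
data = json.loads(metrics_json)

# The relaxed check from this PR: step-20 training loss must stay
# below 2.05 (previously 1.98 for the standard variant and 2.03
# for the LoRA variant).
assert data["train/loss"]["20"] < 2.05, "train/loss at step 20 regressed"
print("threshold check passed")
```

If the logged loss drifts above the threshold, the assertion fails and the nightly test is flagged, which is exactly the strictness this PR relaxes.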

Changes

  • LLM SFT Test Suite Thresholds (tests/test_suites/llm/sft-nanov3-30BA3B-2n8g-fsdp2-lora.sh, tests/test_suites/llm/sft-nanov3-30BA3B-2n8g-fsdp2.sh): updated the training loss metric thresholds. The LoRA variant was raised from 2.03 to 2.05 and the standard variant from 1.98 to 2.05.

Estimated code review effort

🎯 1 (Trivial) | ⏱️ ~3 minutes

Suggested labels

CI:L1

Suggested reviewers

  • joyang-nv
  • yuki-97
  • terrykong

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)

  • Test Results For Major Changes (⚠️ Warning): the PR relaxes training loss thresholds in test files without explanation. Changes to convergence metrics require justification to rule out regression. Resolution: update the PR description to explain the threshold changes, document what caused the higher loss values, and demonstrate that they are expected rather than masking a regression.

✅ Passed checks (3 passed)

  • Description Check (✅ Passed): check skipped; CodeRabbit's high-level summary is enabled.
  • Title Check (✅ Passed): the title clearly describes the main change: relaxing the strictness of the nanov3 nightly test metrics by raising thresholds from 1.98 to 2.05 and from 2.03 to 2.05.
  • Docstring Coverage (✅ Passed): no functions found in the changed files to evaluate; docstring coverage check skipped.
✨ Finishing touches
  • 📝 Generate docstrings

📜 Recent review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 13c3cd6 and 3534a18.

📒 Files selected for processing (2)
  • tests/test_suites/llm/sft-nanov3-30BA3B-2n8g-fsdp2-lora.sh
  • tests/test_suites/llm/sft-nanov3-30BA3B-2n8g-fsdp2.sh
🧰 Additional context used
📓 Path-based instructions (4)
**/*.sh

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.sh: Use uv run instead of python to execute scripts
Follow the Google Shell Style Guide for shell scripts

Files:

  • tests/test_suites/llm/sft-nanov3-30BA3B-2n8g-fsdp2.sh
  • tests/test_suites/llm/sft-nanov3-30BA3B-2n8g-fsdp2-lora.sh
tests/test_suites/**/*.sh

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

tests/test_suites/**/*.sh: When adding support for a new model, create a corresponding driver shell script under tests/test_suites/ in the matching domain
Driver shell scripts should match the YAML base name with .sh extension and invoke training entrypoint with uv run

Files:

  • tests/test_suites/llm/sft-nanov3-30BA3B-2n8g-fsdp2.sh
  • tests/test_suites/llm/sft-nanov3-30BA3B-2n8g-fsdp2-lora.sh
!(**/tests/**|**/test_*.py|**/test_*.sh)

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Add the NVIDIA copyright header to all Python files and shell scripts (excluding tests). The header should include the current year

Files:

  • tests/test_suites/llm/sft-nanov3-30BA3B-2n8g-fsdp2.sh
  • tests/test_suites/llm/sft-nanov3-30BA3B-2n8g-fsdp2-lora.sh
**/*.{py,sh}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

The NVIDIA copyright header should appear at the top of all Python files and shell scripts (excluding tests)

Files:

  • tests/test_suites/llm/sft-nanov3-30BA3B-2n8g-fsdp2.sh
  • tests/test_suites/llm/sft-nanov3-30BA3B-2n8g-fsdp2-lora.sh
🧠 Learnings (1)
📚 Learning: 2025-10-12T14:46:57.171Z
Learnt from: zpqiu
Repo: NVIDIA-NeMo/RL PR: 1324
File: tests/test_suites/llm/distillation-qwen3-32b-to-1.7b-base-1n8g-megatron-tp2pp2cp2-pack.sh:6-11
Timestamp: 2025-10-12T14:46:57.171Z
Learning: Test scripts in tests/test_suites/llm/ follow a standard configuration pattern that includes NUM_NODES, STEPS_PER_RUN, MAX_STEPS, NUM_RUNS (calculated as `$(( (MAX_STEPS + STEPS_PER_RUN - 1) / STEPS_PER_RUN ))`), and NUM_MINUTES. These variables are part of the test infrastructure's standard interface and should not be flagged as unused even if not directly referenced within the individual script, as they are consumed by external launch tooling or common.env.

Applied to files:

  • tests/test_suites/llm/sft-nanov3-30BA3B-2n8g-fsdp2.sh
  • tests/test_suites/llm/sft-nanov3-30BA3B-2n8g-fsdp2-lora.sh
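The `NUM_RUNS` expression cited in the learning above is a standard integer ceiling division. A small Python sketch of the same arithmetic (variable names come from the learning note; the step counts are made-up values):

```python
# Shell original: NUM_RUNS=$(( (MAX_STEPS + STEPS_PER_RUN - 1) / STEPS_PER_RUN ))
# Illustrative values, not taken from any particular test script.
STEPS_PER_RUN = 40
MAX_STEPS = 100

# Adding (divisor - 1) before integer division rounds up, so the
# computed number of runs always covers all MAX_STEPS steps.
NUM_RUNS = (MAX_STEPS + STEPS_PER_RUN - 1) // STEPS_PER_RUN
print(NUM_RUNS)  # 3 runs of 40 steps cover 100 steps
```

Plain integer division would give 2 here and leave the last 20 steps unscheduled, which is why the test infrastructure rounds up.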
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
  • GitHub Check: pre-flight
  • GitHub Check: sphinx-build / Build docs
  • GitHub Check: Lint check
  • GitHub Check: Post automodel integration comment / Comment on PR
🔇 Additional comments (2)
tests/test_suites/llm/sft-nanov3-30BA3B-2n8g-fsdp2.sh (1)

36-36: Verify the rationale for the relaxed loss threshold.

The loss threshold has been increased from 1.98 to 2.05 (a ~3.5% increase), which aligns with the PR objective to relax nanov3 metrics strictness. The change is technically correct and consistent with the other nanov3 test script.

However, it would be helpful to document the reasoning for this specific threshold value. Was this based on recent nightly test results, observed training variance, or infrastructure changes?

Consider adding a comment in the PR description explaining what data or observations motivated these threshold adjustments to help future maintainers understand the decision.

tests/test_suites/llm/sft-nanov3-30BA3B-2n8g-fsdp2-lora.sh (1)

36-36: LGTM - Threshold aligned with the non-LoRA variant.

The loss threshold has been updated from 2.03 to 2.05, matching the threshold in the non-LoRA variant (sft-nanov3-30BA3B-2n8g-fsdp2.sh). This consistency simplifies threshold management across related test configurations.



@terrykong terrykong merged commit c84a1b6 into main Jan 5, 2026
74 of 81 checks passed
@terrykong terrykong deleted the ruit/relax_nanov3_strict branch January 5, 2026 12:11
chtruong814 pushed a commit that referenced this pull request Jan 5, 2026
Signed-off-by: ruit <ruit@nvidia.com>
Signed-off-by: NeMo Bot <nemo-bot@nvidia.com>
parthmannan pushed a commit to parthmannan/RL that referenced this pull request Jan 15, 2026
Signed-off-by: ruit <ruit@nvidia.com>
Signed-off-by: Parth Mannan <pmannan@nvidia.com>
yuanhangsu1986 pushed a commit to yuanhangsu1986/RL-Nemontron-Edge-Omni that referenced this pull request Feb 12, 2026
Signed-off-by: ruit <ruit@nvidia.com>
Signed-off-by: yuanhangs <yuanhangs@nvidia.com>
yuanhangsu1986 pushed a commit to yuanhangsu1986/RL-Nemontron-Edge-Omni that referenced this pull request Feb 21, 2026
Signed-off-by: ruit <ruit@nvidia.com>
Signed-off-by: yuanhangs <yuanhangs@nvidia.com>
seonjinn pushed a commit that referenced this pull request Mar 8, 2026
Signed-off-by: ruit <ruit@nvidia.com>
seonjinn pushed a commit that referenced this pull request Mar 8, 2026
Signed-off-by: ruit <ruit@nvidia.com>
seonjinn pushed a commit that referenced this pull request Mar 9, 2026
Signed-off-by: ruit <ruit@nvidia.com>
