
feat: Add Nemotron‑3 Nano 30B A3B BF16 SFT nightly tests (FSDP2, +LoRA)#1648

Merged
terrykong merged 10 commits into main from ruit/nano_v3_recipe
Dec 24, 2025
Conversation

@RayenTian (Contributor) commented Dec 17, 2025

Summary:

Introduces nightly coverage for SFT on the Nemotron‑3 Nano 30B A3B BF16 model, including both a base FSDP2 configuration and a LoRA-enabled variant. Adds runnable test scripts with metric thresholds and registers them in the nightly test suite.

Changes:

New configs:
examples/configs/recipes/llm/sft-nanov3-30BA3B-2n8g-fsdp2.yaml
examples/configs/recipes/llm/sft-nanov3-30BA3B-2n8g-fsdp2-lora.yaml

New nightly test scripts:
tests/test_suites/llm/sft-nanov3-30BA3B-2n8g-fsdp2.sh
tests/test_suites/llm/sft-nanov3-30BA3B-2n8g-fsdp2-lora.sh

Nightly registration:
Appends the two new scripts to tests/test_suites/nightly.txt under “Nemotron 3 Nano 30B A3B BF16 tests”.
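Per the nightly-registration guideline (driver-script paths are appended relative to tests/test_suites/), the appended block presumably looks like the following sketch; the section-header wording is quoted from this PR, but its exact position within nightly.txt is not shown here:

```text
# Nemotron 3 Nano 30B A3B BF16 tests
llm/sft-nanov3-30BA3B-2n8g-fsdp2.sh
llm/sft-nanov3-30BA3B-2n8g-fsdp2-lora.sh
```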

Results

With the checkpoint period set to 10, the sft-nanov3-30BA3B-2n8g-fsdp2 configuration uses significantly more memory, causing a notable slowdown after step 10. Speed returns to normal once checkpointing is disabled.

nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-Base-BF16

AdamW optimizer

  • lora dim: 256
  • lora alpha: 512

[training-curve screenshot]

Adam optimizer

[training-curve screenshot]

Enable CKPT

[training-curve screenshots]

memory

[memory-usage screenshot]

Disable CKPT

[training-curve screenshots]

memory

[memory-usage screenshot]

Known Issue

#1688

@RayenTian RayenTian added the CI:L1 Run doctests, unit tests, and functional tests label Dec 18, 2025
@RayenTian RayenTian requested a review from joyang-nv December 18, 2025 03:06
@RayenTian RayenTian added CI:L1 Run doctests, unit tests, and functional tests and removed CI:L1 Run doctests, unit tests, and functional tests labels Dec 22, 2025
@RayenTian (Contributor, Author)

@ZhiyuLi-Nvidia @hemildesai @samodi-nv Do you have any ideas why lora is slower than normal SFT?

@ZhiyuLi-Nvidia (Contributor)

> @ZhiyuLi-Nvidia @hemildesai @samodi-nv Do you have any ideas why lora is slower than normal SFT?

LoRA should be computationally efficient unless the training pipeline is bound by other overheads.

Could you try the following to see if LoRA is faster?

  • Use 1 node instead (to minimize cross-node communication)
  • Maximize the batch size (to better utilize compute relative to overheads)

@terrykong terrykong removed the r0.5.0 label Dec 22, 2025
@RayenTian RayenTian added CI:L1 Run doctests, unit tests, and functional tests and removed CI:L1 Run doctests, unit tests, and functional tests labels Dec 23, 2025
@RayenTian RayenTian marked this pull request as ready for review December 23, 2025 03:25
@RayenTian RayenTian requested review from a team as code owners December 23, 2025 03:25
@RayenTian RayenTian requested a review from terrykong December 23, 2025 03:26
@RayenTian RayenTian requested a review from yuki-97 December 23, 2025 03:26
@coderabbitai (bot) commented Dec 23, 2025

📝 Walkthrough

Adds two new SFT configuration files and corresponding test scripts for the NVIDIA Nemotron Nano 30B model on 2-node, 8-GPU-per-node clusters, with and without LoRA. Updates the nightly test suite manifest to include both variants.

Changes

SFT Configuration Files (examples/configs/recipes/llm/sft-nanov3-30BA3B-2n8g-fsdp2.yaml, examples/configs/recipes/llm/sft-nanov3-30BA3B-2n8g-fsdp2-lora.yaml):
New configurations for Nemotron Nano 30B SFT experiments. The base config includes the Adam optimizer, global batch size 16, max sequence length 2048, and logging for wandb/tensorboard/mlflow. The LoRA variant adds dtensor_cfg with LoRA enabled (use_triton: false). Both disable checkpointing and set max_num_steps: 100.

Test Scripts (tests/test_suites/llm/sft-nanov3-30BA3B-2n8g-fsdp2.sh, tests/test_suites/llm/sft-nanov3-30BA3B-2n8g-fsdp2-lora.sh):
New bash test scripts that configure and execute SFT experiments (NUM_NODES=2, STEPS_PER_RUN calculation), convert TensorBoard logs to JSON, and conditionally run metrics checks enforcing train/loss < 4.20 at step 20 and step timing < 15 seconds.

Test Suite Manifest (tests/test_suites/nightly.txt):
Adds both new test scripts to the nightly test suite. Entries appear in two separate sections, resulting in duplicate insertions of the same test references.
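The STEPS_PER_RUN bookkeeping mentioned above follows the repo's standard driver-script pattern; as a quick standalone sketch of the ceiling-division formula used to derive NUM_RUNS (the STEPS_PER_RUN value below is illustrative only, not taken from these scripts):

```shell
# NUM_RUNS is the number of sequential runs needed to cover MAX_STEPS
# in chunks of STEPS_PER_RUN (integer ceiling division).
MAX_STEPS=100      # matches max_num_steps in the new configs
STEPS_PER_RUN=30   # illustrative value only
NUM_RUNS=$(( (MAX_STEPS + STEPS_PER_RUN - 1) / STEPS_PER_RUN ))
echo "NUM_RUNS=${NUM_RUNS}"   # 100 steps in chunks of 30 rounds up to 4
```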

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~12 minutes

Suggested labels

Run CICD

Suggested reviewers

  • joyang-nv
  • yfw

Pre-merge checks and finishing touches

✅ Passed checks (4 passed)

  • Title check: Passed. The PR title clearly and specifically describes the main change: adding Nemotron-3 Nano 30B A3B BF16 SFT nightly tests with FSDP2 and LoRA variants, which aligns directly with the changeset content.
  • Docstring Coverage: Passed. No functions found in the changed files to evaluate docstring coverage; check skipped.
  • Test Results For Major Changes: Passed. The PR adds test infrastructure and configuration files for an existing model variant, which are minor changes. Performance and memory testing results were documented with screenshots, and metric thresholds were established based on empirical testing.
  • Description Check: Passed. Check skipped; CodeRabbit's high-level summary is enabled.

@coderabbitai (bot) left a comment


Actionable comments posted: 1

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 56e8fcb and 8e5aaaf.

📒 Files selected for processing (5)
  • examples/configs/recipes/llm/sft-nanov3-30BA3B-2n8g-fsdp2-lora.yaml
  • examples/configs/recipes/llm/sft-nanov3-30BA3B-2n8g-fsdp2.yaml
  • tests/test_suites/llm/sft-nanov3-30BA3B-2n8g-fsdp2-lora.sh
  • tests/test_suites/llm/sft-nanov3-30BA3B-2n8g-fsdp2.sh
  • tests/test_suites/nightly.txt
🧰 Additional context used
📓 Path-based instructions (7)
examples/configs/recipes/**/*.yaml

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

When adding support for a new model, create a recipe YAML under examples/configs/recipes/ in the appropriate domain subdirectory (llm, vlm, etc.)

Files:

  • examples/configs/recipes/llm/sft-nanov3-30BA3B-2n8g-fsdp2-lora.yaml
  • examples/configs/recipes/llm/sft-nanov3-30BA3B-2n8g-fsdp2.yaml
examples/configs/recipes/llm/*.yaml

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Recipe YAML files should follow the naming pattern: <algo>-<model>-<nodes>n<gpus>g-<strategy-and-params>[-modifiers][-long][.vN].yaml for LLM recipes

Files:

  • examples/configs/recipes/llm/sft-nanov3-30BA3B-2n8g-fsdp2-lora.yaml
  • examples/configs/recipes/llm/sft-nanov3-30BA3B-2n8g-fsdp2.yaml
!(**/tests/**|**/test_*.py|**/test_*.sh)

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Add the NVIDIA copyright header to all Python files and shell scripts (excluding tests). The header should include the current year

Files:

  • examples/configs/recipes/llm/sft-nanov3-30BA3B-2n8g-fsdp2-lora.yaml
  • tests/test_suites/nightly.txt
  • examples/configs/recipes/llm/sft-nanov3-30BA3B-2n8g-fsdp2.yaml
  • tests/test_suites/llm/sft-nanov3-30BA3B-2n8g-fsdp2.sh
  • tests/test_suites/llm/sft-nanov3-30BA3B-2n8g-fsdp2-lora.sh
tests/test_suites/nightly.txt

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

When adding a nightly test for a new model, append the driver script path (relative to tests/test_suites/) to tests/test_suites/nightly.txt

Files:

  • tests/test_suites/nightly.txt
**/*.sh

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.sh: Use uv run instead of python to execute scripts
Follow the Google Shell Style Guide for shell scripts

Files:

  • tests/test_suites/llm/sft-nanov3-30BA3B-2n8g-fsdp2.sh
  • tests/test_suites/llm/sft-nanov3-30BA3B-2n8g-fsdp2-lora.sh
tests/test_suites/**/*.sh

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

tests/test_suites/**/*.sh: When adding support for a new model, create a corresponding driver shell script under tests/test_suites/ in the matching domain
Driver shell scripts should match the YAML base name with .sh extension and invoke training entrypoint with uv run

Files:

  • tests/test_suites/llm/sft-nanov3-30BA3B-2n8g-fsdp2.sh
  • tests/test_suites/llm/sft-nanov3-30BA3B-2n8g-fsdp2-lora.sh
**/*.{py,sh}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

The NVIDIA copyright header should appear at the top of all Python files and shell scripts (excluding tests)

Files:

  • tests/test_suites/llm/sft-nanov3-30BA3B-2n8g-fsdp2.sh
  • tests/test_suites/llm/sft-nanov3-30BA3B-2n8g-fsdp2-lora.sh
🧠 Learnings (10)
📓 Common learnings
Learnt from: CR
Repo: NVIDIA-NeMo/RL PR: 0
File: CODING_GUIDELINES.md:0-0
Timestamp: 2025-11-24T17:24:41.976Z
Learning: Applies to tests/test_suites/nightly.txt : When adding a nightly test for a new model, append the driver script path (relative to tests/test_suites/) to tests/test_suites/nightly.txt
📚 Learning: 2025-11-24T17:24:41.976Z
Learnt from: CR
Repo: NVIDIA-NeMo/RL PR: 0
File: CODING_GUIDELINES.md:0-0
Timestamp: 2025-11-24T17:24:41.976Z
Learning: Applies to examples/configs/recipes/llm/*.yaml : Recipe YAML files should follow the naming pattern: <algo>-<model>-<nodes>n<gpus>g-<strategy-and-params>[-modifiers][-long][.vN].yaml for LLM recipes

Applied to files:

  • examples/configs/recipes/llm/sft-nanov3-30BA3B-2n8g-fsdp2-lora.yaml
  • examples/configs/recipes/llm/sft-nanov3-30BA3B-2n8g-fsdp2.yaml
📚 Learning: 2025-11-24T17:24:41.976Z
Learnt from: CR
Repo: NVIDIA-NeMo/RL PR: 0
File: CODING_GUIDELINES.md:0-0
Timestamp: 2025-11-24T17:24:41.976Z
Learning: Applies to examples/configs/recipes/vlm/*.yaml : Recipe YAML files should follow the naming pattern: vlm_<algo>-<model>-<nodes>n<gpus>g-<strategy>[-modifiers][.vN].yaml for VLM recipes

Applied to files:

  • examples/configs/recipes/llm/sft-nanov3-30BA3B-2n8g-fsdp2-lora.yaml
  • examples/configs/recipes/llm/sft-nanov3-30BA3B-2n8g-fsdp2.yaml
📚 Learning: 2025-09-24T18:36:06.287Z
Learnt from: terrykong
Repo: NVIDIA-NeMo/RL PR: 1024
File: examples/configs/recipes/llm/dpo-llama3.1-8b-instruct-4n8g-fsdp2tp4.yaml:1-1
Timestamp: 2025-09-24T18:36:06.287Z
Learning: In the NVIDIA NeMo RL repository, when working with Hydra config defaults, the scalar string format (defaults: ../../dpo.yaml) is acceptable and preferred over the list format, even though Hydra typically expects defaults to be a list.

Applied to files:

  • examples/configs/recipes/llm/sft-nanov3-30BA3B-2n8g-fsdp2-lora.yaml
  • examples/configs/recipes/llm/sft-nanov3-30BA3B-2n8g-fsdp2.yaml
📚 Learning: 2025-11-24T17:24:41.976Z
Learnt from: CR
Repo: NVIDIA-NeMo/RL PR: 0
File: CODING_GUIDELINES.md:0-0
Timestamp: 2025-11-24T17:24:41.976Z
Learning: Applies to tests/test_suites/nightly.txt : When adding a nightly test for a new model, append the driver script path (relative to tests/test_suites/) to tests/test_suites/nightly.txt

Applied to files:

  • tests/test_suites/nightly.txt
  • tests/test_suites/llm/sft-nanov3-30BA3B-2n8g-fsdp2.sh
📚 Learning: 2025-09-19T07:28:29.887Z
Learnt from: shuo-nvidia
Repo: NVIDIA-NeMo/RL PR: 1006
File: tests/test_suites/llm/distillation-qwen3-32b-to-4b-base-2n8g-fsdp2tp2-long.v1.sh:1-4
Timestamp: 2025-09-19T07:28:29.887Z
Learning: The NVIDIA-NeMo/RL project prefers to maintain consistent formatting across test scripts rather than applying individual bash hardening improvements like `set -euo pipefail` or proper quoting for sourcing files.

Applied to files:

  • tests/test_suites/nightly.txt
📚 Learning: 2025-10-12T14:46:57.171Z
Learnt from: zpqiu
Repo: NVIDIA-NeMo/RL PR: 1324
File: tests/test_suites/llm/distillation-qwen3-32b-to-1.7b-base-1n8g-megatron-tp2pp2cp2-pack.sh:6-11
Timestamp: 2025-10-12T14:46:57.171Z
Learning: Test scripts in tests/test_suites/llm/ follow a standard configuration pattern that includes NUM_NODES, STEPS_PER_RUN, MAX_STEPS, NUM_RUNS (calculated as `$(( (MAX_STEPS + STEPS_PER_RUN - 1) / STEPS_PER_RUN ))`), and NUM_MINUTES. These variables are part of the test infrastructure's standard interface and should not be flagged as unused even if not directly referenced within the individual script, as they are consumed by external launch tooling or common.env.

Applied to files:

  • tests/test_suites/nightly.txt
  • tests/test_suites/llm/sft-nanov3-30BA3B-2n8g-fsdp2.sh
  • tests/test_suites/llm/sft-nanov3-30BA3B-2n8g-fsdp2-lora.sh
📚 Learning: 2025-11-24T17:24:41.976Z
Learnt from: CR
Repo: NVIDIA-NeMo/RL PR: 0
File: CODING_GUIDELINES.md:0-0
Timestamp: 2025-11-24T17:24:41.976Z
Learning: Applies to tests/test_suites/**/*.sh : When adding support for a new model, create a corresponding driver shell script under tests/test_suites/ in the matching domain

Applied to files:

  • tests/test_suites/llm/sft-nanov3-30BA3B-2n8g-fsdp2.sh
  • tests/test_suites/llm/sft-nanov3-30BA3B-2n8g-fsdp2-lora.sh
📚 Learning: 2025-11-24T17:24:41.976Z
Learnt from: CR
Repo: NVIDIA-NeMo/RL PR: 0
File: CODING_GUIDELINES.md:0-0
Timestamp: 2025-11-24T17:24:41.976Z
Learning: Applies to tests/test_suites/**/*.sh : Driver shell scripts should match the YAML base name with .sh extension and invoke training entrypoint with uv run

Applied to files:

  • tests/test_suites/llm/sft-nanov3-30BA3B-2n8g-fsdp2.sh
  • tests/test_suites/llm/sft-nanov3-30BA3B-2n8g-fsdp2-lora.sh
📚 Learning: 2025-10-12T14:46:55.513Z
Learnt from: zpqiu
Repo: NVIDIA-NeMo/RL PR: 1324
File: tests/test_suites/llm/distillation-qwen3-32b-to-1.7b-base-1n8g-megatron-tp2pp2cp2-pack.sh:16-30
Timestamp: 2025-10-12T14:46:55.513Z
Learning: In the NVIDIA-NeMo/RL repository, test scripts under tests/ follow a consistent pattern: use `cd $PROJECT_ROOT` without quotes or error handling, and pass arguments with `$@` unquoted. Maintain this consistency when adding new test scripts.

Applied to files:

  • tests/test_suites/llm/sft-nanov3-30BA3B-2n8g-fsdp2.sh
  • tests/test_suites/llm/sft-nanov3-30BA3B-2n8g-fsdp2-lora.sh
🪛 Shellcheck (0.11.0)
tests/test_suites/llm/sft-nanov3-30BA3B-2n8g-fsdp2.sh

[warning] 6-6: NUM_NODES appears unused. Verify use (or export if used externally).

(SC2034)


[warning] 9-9: NUM_RUNS appears unused. Verify use (or export if used externally).

(SC2034)


[warning] 10-10: NUM_MINUTES appears unused. Verify use (or export if used externally).

(SC2034)


[warning] 16-16: Use 'cd ... || exit' or 'cd ... || return' in case cd fails.

(SC2164)


[error] 28-28: Double quote array expansions to avoid re-splitting elements.

(SC2068)

tests/test_suites/llm/sft-nanov3-30BA3B-2n8g-fsdp2-lora.sh

[warning] 6-6: NUM_NODES appears unused. Verify use (or export if used externally).

(SC2034)


[warning] 9-9: NUM_RUNS appears unused. Verify use (or export if used externally).

(SC2034)


[warning] 10-10: NUM_MINUTES appears unused. Verify use (or export if used externally).

(SC2034)


[warning] 16-16: Use 'cd ... || exit' or 'cd ... || return' in case cd fails.

(SC2164)


[error] 28-28: Double quote array expansions to avoid re-splitting elements.

(SC2068)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: Docs_Tests
  • GitHub Check: Post submodule check comment / Comment on PR
  • GitHub Check: Post automodel integration comment / Comment on PR
🔇 Additional comments (5)
tests/test_suites/nightly.txt (1)

90-93: LGTM!

The nightly test entries are correctly added under the SFT section with an appropriate comment header. The paths follow the expected pattern relative to tests/test_suites/. Based on learnings, this correctly appends the driver script paths to nightly.txt.

tests/test_suites/llm/sft-nanov3-30BA3B-2n8g-fsdp2.sh (1)

1-39: LGTM!

The script follows the established test infrastructure patterns:

  • Standard configuration variables (NUM_NODES, STEPS_PER_RUN, etc.) are consumed by external launch tooling
  • Uses uv run per coding guidelines
  • Matches the YAML base name with .sh extension
  • Metric thresholds are defined appropriately

Based on learnings, the cd $PROJECT_ROOT without error handling and unquoted $@ are consistent with this repository's conventions.

examples/configs/recipes/llm/sft-nanov3-30BA3B-2n8g-fsdp2-lora.yaml (1)

1-15: Configuration structure looks correct.

The defaults scalar format is acceptable per learnings. LoRA configuration (lora_cfg.enabled: true) is properly nested under dtensor_cfg. Cluster settings match the filename pattern (2n8g).

Also applies to: 24-26
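Putting the reviewed pieces together, the LoRA overrides presumably look something like this minimal sketch. Only lora_cfg.enabled: true nested under dtensor_cfg and use_triton: false are confirmed by the review; the top-level policy key and the dim/alpha key names are assumptions (values taken from this PR's results section):

```yaml
# Hypothetical sketch, not the actual file contents.
policy:
  dtensor_cfg:
    lora_cfg:
      enabled: true      # confirmed by the review comment
      dim: 256           # value from the PR results; key name assumed
      alpha: 512         # value from the PR results; key name assumed
      use_triton: false  # confirmed by the walkthrough
```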

examples/configs/recipes/llm/sft-nanov3-30BA3B-2n8g-fsdp2.yaml (1)

1-22: LGTM!

The base SFT configuration is well-structured:

  • Uses scalar defaults format (acceptable per learnings)
  • Logger names correctly match the filename
  • Cluster settings (2n8g) align with the filename pattern
  • No LoRA settings as expected for the base variant
tests/test_suites/llm/sft-nanov3-30BA3B-2n8g-fsdp2-lora.sh (1)

1-39: LGTM!

The LoRA test script follows the same established patterns as the base script. The higher loss threshold (4.20 vs 3.20) appropriately accounts for potentially different convergence characteristics with LoRA.

Note: The comment on line 7 states step_time ~ 8sec while the base config says ~15sec. Given the PR discussion about LoRA being slower than expected, you may want to verify this comment reflects actual observed timing after your testing.
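As a hedged illustration of the kind of post-run check these scripts perform (the real scripts parse the converted TensorBoard JSON; the loss value and variable names below are stand-ins, not values from an actual run):

```shell
# Assert train/loss at step 20 is under the 4.20 threshold used by the
# LoRA variant. "loss" is a stand-in for a value parsed from the
# TensorBoard-to-JSON conversion step.
loss=3.85
threshold=4.20
if awk -v l="$loss" -v t="$threshold" 'BEGIN { exit !(l < t) }'; then
  echo "train/loss check passed (${loss} < ${threshold})"
else
  echo "train/loss check FAILED (${loss} >= ${threshold})"
  exit 1
fi
```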

Signed-off-by: ruit <ruit@nvidia.com>
Signed-off-by: ruit <ruit@nvidia.com>
@RayenTian RayenTian added CI:L1 Run doctests, unit tests, and functional tests and removed CI:L1 Run doctests, unit tests, and functional tests labels Dec 24, 2025
Signed-off-by: ruit <ruit@nvidia.com>
Signed-off-by: ruit <ruit@nvidia.com>
@RayenTian RayenTian added CI:L1 Run doctests, unit tests, and functional tests and removed CI:L1 Run doctests, unit tests, and functional tests labels Dec 24, 2025
@terrykong terrykong enabled auto-merge (squash) December 24, 2025 05:41
@terrykong terrykong merged commit 433eaa1 into main Dec 24, 2025
41 of 42 checks passed
@terrykong terrykong deleted the ruit/nano_v3_recipe branch December 24, 2025 06:55
chtruong814 pushed a commit that referenced this pull request Dec 24, 2025
…A) (#1648)

Signed-off-by: ruit <ruit@nvidia.com>
Signed-off-by: NeMo Bot <nemo-bot@nvidia.com>
DeL-TaiseiOzaki pushed a commit to DeL-TaiseiOzaki/RL that referenced this pull request Jan 8, 2026
parthmannan pushed a commit to parthmannan/RL that referenced this pull request Jan 15, 2026
…A) (NVIDIA-NeMo#1648)

Signed-off-by: ruit <ruit@nvidia.com>
Signed-off-by: Parth Mannan <pmannan@nvidia.com>
yuanhangsu1986 pushed a commit to yuanhangsu1986/RL-Nemontron-Edge-Omni that referenced this pull request Feb 12, 2026
…A) (NVIDIA-NeMo#1648)

Signed-off-by: ruit <ruit@nvidia.com>
Signed-off-by: yuanhangs <yuanhangs@nvidia.com>
yuanhangsu1986 pushed a commit to yuanhangsu1986/RL-Nemontron-Edge-Omni that referenced this pull request Feb 21, 2026
…A) (NVIDIA-NeMo#1648)

Signed-off-by: ruit <ruit@nvidia.com>
Signed-off-by: yuanhangs <yuanhangs@nvidia.com>
seonjinn pushed a commit that referenced this pull request Mar 8, 2026
seonjinn pushed a commit that referenced this pull request Mar 8, 2026
seonjinn pushed a commit that referenced this pull request Mar 9, 2026

Labels

CI:L1 Run doctests, unit tests, and functional tests r0.5.0

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants