feat: Add Nemotron‑3 Nano 30B A3B BF16 SFT nightly tests (FSDP2, +LoRA)#1648
Conversation
@ZhiyuLi-Nvidia @hemildesai @samodi-nv Do you have any ideas why LoRA is slower than normal SFT?
LoRA should be computationally efficient unless the training pipeline is bound by other overheads. Could you try the following to see if LoRA is faster?
📝 Walkthrough

Adds two new SFT configuration files and corresponding test scripts for the NVIDIA Nemotron Nano 30B model on 2-node, 8-GPU-per-node clusters, with and without LoRA. Updates the nightly test suite manifest to include both variants.
Estimated code review effort: 🎯 2 (Simple) | ⏱️ ~12 minutes
Pre-merge checks and finishing touches: ✅ Passed checks (4 passed)
Actionable comments posted: 1
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (5)
- examples/configs/recipes/llm/sft-nanov3-30BA3B-2n8g-fsdp2-lora.yaml
- examples/configs/recipes/llm/sft-nanov3-30BA3B-2n8g-fsdp2.yaml
- tests/test_suites/llm/sft-nanov3-30BA3B-2n8g-fsdp2-lora.sh
- tests/test_suites/llm/sft-nanov3-30BA3B-2n8g-fsdp2.sh
- tests/test_suites/nightly.txt
🧰 Additional context used
📓 Path-based instructions (7)
examples/configs/recipes/**/*.yaml
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
When adding support for a new model, create a recipe YAML under examples/configs/recipes/ in the appropriate domain subdirectory (llm, vlm, etc.)
Files:
examples/configs/recipes/llm/sft-nanov3-30BA3B-2n8g-fsdp2-lora.yaml
examples/configs/recipes/llm/sft-nanov3-30BA3B-2n8g-fsdp2.yaml
examples/configs/recipes/llm/*.yaml
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
Recipe YAML files should follow the naming pattern: <algo>-<model>-<nodes>n<gpus>g-<strategy-and-params>[-modifiers][-long][.vN].yaml for LLM recipes
Files:
examples/configs/recipes/llm/sft-nanov3-30BA3B-2n8g-fsdp2-lora.yaml
examples/configs/recipes/llm/sft-nanov3-30BA3B-2n8g-fsdp2.yaml
!(**/tests/**|**/test_*.py|**/test_*.sh)
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
Add the NVIDIA copyright header to all Python files and shell scripts (excluding tests). The header should include the current year
Files:
examples/configs/recipes/llm/sft-nanov3-30BA3B-2n8g-fsdp2-lora.yaml
tests/test_suites/nightly.txt
examples/configs/recipes/llm/sft-nanov3-30BA3B-2n8g-fsdp2.yaml
tests/test_suites/llm/sft-nanov3-30BA3B-2n8g-fsdp2.sh
tests/test_suites/llm/sft-nanov3-30BA3B-2n8g-fsdp2-lora.sh
tests/test_suites/nightly.txt
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
When adding a nightly test for a new model, append the driver script path (relative to tests/test_suites/) to tests/test_suites/nightly.txt
Files:
tests/test_suites/nightly.txt
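Per the guideline above, registering a nightly test is just an append to the manifest. A runnable sketch of that step, using a temporary file in place of the real tests/test_suites/nightly.txt so it is safe to execute anywhere:

```shell
# Sketch of registering nightly driver scripts: append each path,
# relative to tests/test_suites/, to the manifest. A temp file stands
# in for the real nightly.txt here.
manifest="$(mktemp)"
echo "llm/sft-nanov3-30BA3B-2n8g-fsdp2.sh" >> "$manifest"
echo "llm/sft-nanov3-30BA3B-2n8g-fsdp2-lora.sh" >> "$manifest"
count=$(wc -l < "$manifest")
rm -f "$manifest"
echo "entries added: $count"
```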
**/*.sh
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
**/*.sh: Use uv run instead of python to execute scripts
Follow the Google Shell Style Guide for shell scripts
Files:
tests/test_suites/llm/sft-nanov3-30BA3B-2n8g-fsdp2.sh
tests/test_suites/llm/sft-nanov3-30BA3B-2n8g-fsdp2-lora.sh
tests/test_suites/**/*.sh
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
tests/test_suites/**/*.sh: When adding support for a new model, create a corresponding driver shell script under tests/test_suites/ in the matching domain
Driver shell scripts should match the YAML base name with .sh extension and invoke training entrypoint with uv run
Files:
tests/test_suites/llm/sft-nanov3-30BA3B-2n8g-fsdp2.sh
tests/test_suites/llm/sft-nanov3-30BA3B-2n8g-fsdp2-lora.sh
**/*.{py,sh}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
The NVIDIA copyright header should appear at the top of all Python files and shell scripts (excluding tests)
Files:
tests/test_suites/llm/sft-nanov3-30BA3B-2n8g-fsdp2.sh
tests/test_suites/llm/sft-nanov3-30BA3B-2n8g-fsdp2-lora.sh
🧠 Learnings (10)
📓 Common learnings
Learnt from: CR
Repo: NVIDIA-NeMo/RL PR: 0
File: CODING_GUIDELINES.md:0-0
Timestamp: 2025-11-24T17:24:41.976Z
Learning: Applies to tests/test_suites/nightly.txt : When adding a nightly test for a new model, append the driver script path (relative to tests/test_suites/) to tests/test_suites/nightly.txt
📚 Learning: 2025-11-24T17:24:41.976Z
Learnt from: CR
Repo: NVIDIA-NeMo/RL PR: 0
File: CODING_GUIDELINES.md:0-0
Timestamp: 2025-11-24T17:24:41.976Z
Learning: Applies to examples/configs/recipes/llm/*.yaml : Recipe YAML files should follow the naming pattern: <algo>-<model>-<nodes>n<gpus>g-<strategy-and-params>[-modifiers][-long][.vN].yaml for LLM recipes
Applied to files:
examples/configs/recipes/llm/sft-nanov3-30BA3B-2n8g-fsdp2-lora.yaml
examples/configs/recipes/llm/sft-nanov3-30BA3B-2n8g-fsdp2.yaml
📚 Learning: 2025-11-24T17:24:41.976Z
Learnt from: CR
Repo: NVIDIA-NeMo/RL PR: 0
File: CODING_GUIDELINES.md:0-0
Timestamp: 2025-11-24T17:24:41.976Z
Learning: Applies to examples/configs/recipes/vlm/*.yaml : Recipe YAML files should follow the naming pattern: vlm_<algo>-<model>-<nodes>n<gpus>g-<strategy>[-modifiers][.vN].yaml for VLM recipes
Applied to files:
examples/configs/recipes/llm/sft-nanov3-30BA3B-2n8g-fsdp2-lora.yaml
examples/configs/recipes/llm/sft-nanov3-30BA3B-2n8g-fsdp2.yaml
📚 Learning: 2025-09-24T18:36:06.287Z
Learnt from: terrykong
Repo: NVIDIA-NeMo/RL PR: 1024
File: examples/configs/recipes/llm/dpo-llama3.1-8b-instruct-4n8g-fsdp2tp4.yaml:1-1
Timestamp: 2025-09-24T18:36:06.287Z
Learning: In the NVIDIA NeMo RL repository, when working with Hydra config defaults, the scalar string format (defaults: ../../dpo.yaml) is acceptable and preferred over the list format, even though Hydra typically expects defaults to be a list.
Applied to files:
examples/configs/recipes/llm/sft-nanov3-30BA3B-2n8g-fsdp2-lora.yaml
examples/configs/recipes/llm/sft-nanov3-30BA3B-2n8g-fsdp2.yaml
📚 Learning: 2025-11-24T17:24:41.976Z
Learnt from: CR
Repo: NVIDIA-NeMo/RL PR: 0
File: CODING_GUIDELINES.md:0-0
Timestamp: 2025-11-24T17:24:41.976Z
Learning: Applies to tests/test_suites/nightly.txt : When adding a nightly test for a new model, append the driver script path (relative to tests/test_suites/) to tests/test_suites/nightly.txt
Applied to files:
tests/test_suites/nightly.txt
tests/test_suites/llm/sft-nanov3-30BA3B-2n8g-fsdp2.sh
📚 Learning: 2025-09-19T07:28:29.887Z
Learnt from: shuo-nvidia
Repo: NVIDIA-NeMo/RL PR: 1006
File: tests/test_suites/llm/distillation-qwen3-32b-to-4b-base-2n8g-fsdp2tp2-long.v1.sh:1-4
Timestamp: 2025-09-19T07:28:29.887Z
Learning: The NVIDIA-NeMo/RL project prefers to maintain consistent formatting across test scripts rather than applying individual bash hardening improvements like `set -euo pipefail` or proper quoting for sourcing files.
Applied to files:
tests/test_suites/nightly.txt
📚 Learning: 2025-10-12T14:46:57.171Z
Learnt from: zpqiu
Repo: NVIDIA-NeMo/RL PR: 1324
File: tests/test_suites/llm/distillation-qwen3-32b-to-1.7b-base-1n8g-megatron-tp2pp2cp2-pack.sh:6-11
Timestamp: 2025-10-12T14:46:57.171Z
Learning: Test scripts in tests/test_suites/llm/ follow a standard configuration pattern that includes NUM_NODES, STEPS_PER_RUN, MAX_STEPS, NUM_RUNS (calculated as `$(( (MAX_STEPS + STEPS_PER_RUN - 1) / STEPS_PER_RUN ))`), and NUM_MINUTES. These variables are part of the test infrastructure's standard interface and should not be flagged as unused even if not directly referenced within the individual script, as they are consumed by external launch tooling or common.env.
Applied to files:
tests/test_suites/nightly.txt
tests/test_suites/llm/sft-nanov3-30BA3B-2n8g-fsdp2.sh
tests/test_suites/llm/sft-nanov3-30BA3B-2n8g-fsdp2-lora.sh
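The standard configuration interface this learning describes can be sketched as follows; the numeric values are illustrative placeholders, not the actual recipe settings:

```shell
# Standard test-script configuration variables (values are illustrative).
# They are consumed by external launch tooling / common.env, which is why
# shellcheck flags them as "unused" (SC2034).
NUM_NODES=2
STEPS_PER_RUN=30
MAX_STEPS=60
# Ceiling division: number of launches needed to cover MAX_STEPS.
NUM_RUNS=$(( (MAX_STEPS + STEPS_PER_RUN - 1) / STEPS_PER_RUN ))
NUM_MINUTES=120
echo "NUM_RUNS=$NUM_RUNS"
```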
📚 Learning: 2025-11-24T17:24:41.976Z
Learnt from: CR
Repo: NVIDIA-NeMo/RL PR: 0
File: CODING_GUIDELINES.md:0-0
Timestamp: 2025-11-24T17:24:41.976Z
Learning: Applies to tests/test_suites/**/*.sh : When adding support for a new model, create a corresponding driver shell script under tests/test_suites/ in the matching domain
Applied to files:
tests/test_suites/llm/sft-nanov3-30BA3B-2n8g-fsdp2.sh
tests/test_suites/llm/sft-nanov3-30BA3B-2n8g-fsdp2-lora.sh
📚 Learning: 2025-11-24T17:24:41.976Z
Learnt from: CR
Repo: NVIDIA-NeMo/RL PR: 0
File: CODING_GUIDELINES.md:0-0
Timestamp: 2025-11-24T17:24:41.976Z
Learning: Applies to tests/test_suites/**/*.sh : Driver shell scripts should match the YAML base name with .sh extension and invoke training entrypoint with uv run
Applied to files:
tests/test_suites/llm/sft-nanov3-30BA3B-2n8g-fsdp2.sh
tests/test_suites/llm/sft-nanov3-30BA3B-2n8g-fsdp2-lora.sh
📚 Learning: 2025-10-12T14:46:55.513Z
Learnt from: zpqiu
Repo: NVIDIA-NeMo/RL PR: 1324
File: tests/test_suites/llm/distillation-qwen3-32b-to-1.7b-base-1n8g-megatron-tp2pp2cp2-pack.sh:16-30
Timestamp: 2025-10-12T14:46:55.513Z
Learning: In the NVIDIA-NeMo/RL repository, test scripts under tests/ follow a consistent pattern: use `cd $PROJECT_ROOT` without quotes or error handling, and pass arguments with `$@` unquoted. Maintain this consistency when adding new test scripts.
Applied to files:
tests/test_suites/llm/sft-nanov3-30BA3B-2n8g-fsdp2.sh
tests/test_suites/llm/sft-nanov3-30BA3B-2n8g-fsdp2-lora.sh
🪛 Shellcheck (0.11.0)
tests/test_suites/llm/sft-nanov3-30BA3B-2n8g-fsdp2.sh
[warning] 6-6: NUM_NODES appears unused. Verify use (or export if used externally).
(SC2034)
[warning] 9-9: NUM_RUNS appears unused. Verify use (or export if used externally).
(SC2034)
[warning] 10-10: NUM_MINUTES appears unused. Verify use (or export if used externally).
(SC2034)
[warning] 16-16: Use 'cd ... || exit' or 'cd ... || return' in case cd fails.
(SC2164)
[error] 28-28: Double quote array expansions to avoid re-splitting elements.
(SC2068)
tests/test_suites/llm/sft-nanov3-30BA3B-2n8g-fsdp2-lora.sh
[warning] 6-6: NUM_NODES appears unused. Verify use (or export if used externally).
(SC2034)
[warning] 9-9: NUM_RUNS appears unused. Verify use (or export if used externally).
(SC2034)
[warning] 10-10: NUM_MINUTES appears unused. Verify use (or export if used externally).
(SC2034)
[warning] 16-16: Use 'cd ... || exit' or 'cd ... || return' in case cd fails.
(SC2164)
[error] 28-28: Double quote array expansions to avoid re-splitting elements.
(SC2068)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
- GitHub Check: Docs_Tests
- GitHub Check: Post submodule check comment / Comment on PR
- GitHub Check: Post automodel integration comment / Comment on PR
🔇 Additional comments (5)
tests/test_suites/nightly.txt (1)
90-93: LGTM! The nightly test entries are correctly added under the SFT section with an appropriate comment header. The paths follow the expected pattern relative to tests/test_suites/. Based on learnings, this correctly appends the driver script paths to nightly.txt.

tests/test_suites/llm/sft-nanov3-30BA3B-2n8g-fsdp2.sh (1)
1-39: LGTM! The script follows the established test infrastructure patterns:
- Standard configuration variables (NUM_NODES, STEPS_PER_RUN, etc.) are consumed by external launch tooling
- Uses uv run per coding guidelines
- Matches the YAML base name with the .sh extension
- Metric thresholds are defined appropriately

Based on learnings, the cd $PROJECT_ROOT without error handling and the unquoted $@ are consistent with this repository's conventions.

examples/configs/recipes/llm/sft-nanov3-30BA3B-2n8g-fsdp2-lora.yaml (1)
1-15: Configuration structure looks correct. The defaults scalar format is acceptable per learnings. LoRA configuration (lora_cfg.enabled: true) is properly nested under dtensor_cfg. Cluster settings match the filename pattern (2n8g).

Also applies to: 24-26
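A sketch of the nesting the review describes, assuming the usual policy.dtensor_cfg layout; every key except lora_cfg.enabled is an illustrative placeholder, not read from the actual recipe:

```yaml
# Illustrative fragment only -- shows lora_cfg nested under dtensor_cfg.
# Key names other than lora_cfg.enabled are assumptions.
policy:
  dtensor_cfg:
    enabled: true
    lora_cfg:
      enabled: true   # the -lora variant of the recipe flips this on
```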
examples/configs/recipes/llm/sft-nanov3-30BA3B-2n8g-fsdp2.yaml (1)
1-22: LGTM! The base SFT configuration is well-structured:
- Uses the scalar defaults format (acceptable per learnings)
- Logger names correctly match the filename
- Cluster settings (2n8g) align with the filename pattern
- No LoRA settings, as expected for the base variant
tests/test_suites/llm/sft-nanov3-30BA3B-2n8g-fsdp2-lora.sh (1)
1-39: LGTM! The LoRA test script follows the same established patterns as the base script. The higher loss threshold (4.20 vs 3.20) appropriately accounts for potentially different convergence characteristics with LoRA.

Note: The comment on line 7 states step_time ~ 8sec while the base config says ~15sec. Given the PR discussion about LoRA being slower than expected, you may want to verify this comment reflects actual observed timing after your testing.
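The loss-threshold check these scripts encode can be sketched as a floating-point comparison; the values and variable names below are illustrative, mirroring the thresholds quoted in the review rather than the actual script contents:

```shell
# Illustrative metric-threshold check (3.20 base, 4.20 LoRA per the
# review); not the actual script logic.
loss=3.95
threshold=4.20
# awk handles the floating-point comparison that [ ] cannot.
if awk -v l="$loss" -v t="$threshold" 'BEGIN { exit !(l <= t) }'; then
  result="pass"
else
  result="fail"
fi
echo "loss=$loss threshold=$threshold -> $result"
```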
Signed-off-by: ruit <ruit@nvidia.com>
…A) (#1648) Signed-off-by: ruit <ruit@nvidia.com> Signed-off-by: NeMo Bot <nemo-bot@nvidia.com>
Summary:
Introduces nightly coverage for SFT on the Nemotron‑3 Nano 30B A3B BF16 model, including both a base FSDP2 configuration and a LoRA-enabled variant. Adds runnable test scripts with metric thresholds and registers them in the nightly test suite.
Changes:
New configs:
examples/configs/recipes/llm/sft-nanov3-30BA3B-2n8g-fsdp2.yaml
examples/configs/recipes/llm/sft-nanov3-30BA3B-2n8g-fsdp2-lora.yaml

New nightly test scripts:
tests/test_suites/llm/sft-nanov3-30BA3B-2n8g-fsdp2.sh
tests/test_suites/llm/sft-nanov3-30BA3B-2n8g-fsdp2-lora.sh

Nightly registration:
Appends the two new scripts to tests/test_suites/nightly.txt under "Nemotron 3 Nano 30B A3B BF16 tests".

Results:
When the checkpoint period is set to 10, the sft-nanov3-30BA3B-2n8g-fsdp2 configuration incurs a significant amount of additional memory usage, leading to a notable slowdown after 10 steps. Speed returns to normal after disabling checkpointing.

Model: nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-Base-BF16

[Result charts omitted: memory usage with the AdamW optimizer vs. the Adam optimizer, with checkpointing enabled vs. disabled]
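A minimal sketch of the mitigation described above, assuming a top-level checkpointing block in the recipe YAML (key names are illustrative, not confirmed against the actual config):

```yaml
# Illustrative only: with checkpointing enabled at period 10, the run
# slows down after step 10 due to extra memory; disabling it avoids this.
checkpointing:
  enabled: false
```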
Known Issue: #1688