Skip to content

chore: add assert for dtensor v2 cpu offload#1817

Merged
terrykong merged 1 commit intomainfrom
yukih/assert-cpuoffload
Jan 30, 2026
Merged

chore: add assert for dtensor v2 cpu offload#1817
terrykong merged 1 commit intomainfrom
yukih/assert-cpuoffload

Conversation

@yuki-97
Copy link
Contributor

@yuki-97 yuki-97 commented Jan 23, 2026

For world size 1, FSDP uses extra memory which is not necessary, so AutoModel disable it for now. This PR add an assert for it.

Summary by CodeRabbit

  • Bug Fixes

    • Added validation to prevent unsupported CPU offload configuration on single-GPU setups.
  • Improvements

    • Enhanced distributed setup configuration for improved generation performance when resources are not colocated.

✏️ Tip: You can customize this high-level summary in your review settings.

Signed-off-by: Yuki Huang <yukih@nvidia.com>
@yuki-97 yuki-97 requested review from a team as code owners January 23, 2026 15:12
@coderabbitai
Copy link
Contributor

coderabbitai bot commented Jan 23, 2026

📝 Walkthrough

Walkthrough

Modified environment configuration for non-colocated generation by setting NCCL_CUMEM_ENABLE, and added validation to prevent CPU offload usage on single-GPU environments during distributed setup initialization.

Changes

Cohort / File(s) Summary
Environment and Validation Setup
nemo_rl/models/automodel/setup.py
Set NCCL_CUMEM_ENABLE environment variable when generation is not colocated with rationale documentation; added guard in setup_distributed to raise NotImplementedError if cpu_offload is enabled on single-GPU setups (world_size == 1)

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~8 minutes

🚥 Pre-merge checks | ✅ 3 | ❌ 1
❌ Failed checks (1 warning)
Check name Status Explanation Resolution
Title check ⚠️ Warning The title mentions 'assert for dtensor v2 cpu offload' but the actual changes add a NotImplementedError guard (not an assert) and modify NCCL_CUMEM_ENABLE environment variable setup, which is not mentioned in the title. Update the title to accurately reflect the main changes, such as 'Prevent single-GPU CPU offload in setup_distributed and configure NCCL_CUMEM_ENABLE' or similar.
✅ Passed checks (3 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
Test Results For Major Changes ✅ Passed PR contains only minor changes (7 lines added, 1 removed) classified as chore: setting environment variable and adding guard clause for unsupported configuration.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing touches
  • 📝 Generate docstrings

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@nemo_rl/models/automodel/setup.py`:
- Around line 286-290: The NotImplementedError raised when cpu_offload is True
contains an incorrect GitHub URL; update the error message in the cpu_offload
guard (the NotImplementedError raised near cpu_offload and before
_setup_distributed()) to point to the correct repository URL (e.g.,
https://github.com/NVIDIA-NeMo/RL or the canonical
https://github.com/NVIDIA/NeMo) so the message directs users to the right issue
tracker.

RayenTian added a commit that referenced this pull request Jan 26, 2026
Signed-off-by: ruit <ruit@nvidia.com>
RayenTian added a commit that referenced this pull request Jan 28, 2026
Signed-off-by: ruit <ruit@nvidia.com>
@terrykong terrykong added the CI:L1 Run doctests, unit tests, and functional tests label Jan 29, 2026
@terrykong terrykong enabled auto-merge (squash) January 29, 2026 17:53
@terrykong terrykong merged commit 748b9ca into main Jan 30, 2026
42 of 43 checks passed
@terrykong terrykong deleted the yukih/assert-cpuoffload branch January 30, 2026 05:32
RayenTian added a commit that referenced this pull request Feb 2, 2026
Signed-off-by: ruit <ruit@nvidia.com>
yuanhangsu1986 pushed a commit to yuanhangsu1986/RL-Nemontron-Edge-Omni that referenced this pull request Feb 12, 2026
Signed-off-by: Yuki Huang <yukih@nvidia.com>
Signed-off-by: yuanhangs <yuanhangs@nvidia.com>
yuanhangsu1986 pushed a commit to yuanhangsu1986/RL-Nemontron-Edge-Omni that referenced this pull request Feb 21, 2026
Signed-off-by: Yuki Huang <yukih@nvidia.com>
Signed-off-by: yuanhangs <yuanhangs@nvidia.com>
seonjinn pushed a commit that referenced this pull request Mar 8, 2026
Signed-off-by: Yuki Huang <yukih@nvidia.com>
seonjinn pushed a commit that referenced this pull request Mar 8, 2026
Signed-off-by: Yuki Huang <yukih@nvidia.com>
seonjinn pushed a commit that referenced this pull request Mar 9, 2026
Signed-off-by: Yuki Huang <yukih@nvidia.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

CI:L1 Run doctests, unit tests, and functional tests

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants