chore: add assert for dtensor v2 cpu offload by yuki-97 · Pull Request #1817 · NVIDIA-NeMo/RL

yuki-97 · 2026-01-23T15:12:53Z

For world size 1, FSDP uses extra memory that is not necessary, so AutoModel disables it for now. This PR adds an assert for it.
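As a rough illustration of the check described above, here is a minimal sketch. The helper name `check_cpu_offload_supported`, its parameters, and the error text are hypothetical and not taken from the PR; the only things the PR discussion confirms are that the guard lives in `setup_distributed` and fires when `cpu_offload` is enabled with `world_size == 1`.

```python
def check_cpu_offload_supported(cpu_offload: bool, world_size: int) -> None:
    """Hypothetical helper: reject CPU offload when there is only one rank.

    Sketch only. The real guard in nemo_rl/models/automodel/setup.py may
    read these values from a config object and use a different message.
    """
    if cpu_offload and world_size == 1:
        raise NotImplementedError(
            "cpu_offload is not supported when world_size == 1 "
            "(FSDP is disabled for single-GPU runs)."
        )
```

With this shape, a multi-GPU run with offload enabled and a single-GPU run without offload both pass through untouched; only the unsupported combination fails early, before any distributed setup work is done.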

Summary by CodeRabbit

Bug Fixes
- Added validation to prevent unsupported CPU offload configuration on single-GPU setups.
Improvements
- Enhanced distributed setup configuration for improved generation performance when resources are not colocated.

Signed-off-by: Yuki Huang <yukih@nvidia.com>

coderabbitai · 2026-01-23T15:16:33Z

Walkthrough

Modified environment configuration for non-colocated generation by setting NCCL_CUMEM_ENABLE, and added validation to prevent CPU offload usage on single-GPU environments during distributed setup initialization.
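The environment change mentioned in the walkthrough could look roughly like the sketch below. The function name `configure_nccl_for_generation`, the `colocated` flag, and the injectable `env` mapping are hypothetical. The value `"1"` is also an assumption: the walkthrough says the variable is set but does not state to what.

```python
import os
from typing import MutableMapping, Optional


def configure_nccl_for_generation(
    colocated: bool,
    env: Optional[MutableMapping[str, str]] = None,
    value: str = "1",  # assumed value; the PR discussion does not state it
) -> None:
    """Hypothetical helper: set NCCL_CUMEM_ENABLE only for non-colocated generation."""
    target = os.environ if env is None else env
    if not colocated:
        # Must run before any NCCL communicator is created to take effect.
        target["NCCL_CUMEM_ENABLE"] = value
```

Passing a plain dictionary as `env` keeps the helper testable without mutating the real process environment.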

Changes

| Cohort / File(s) | Summary |
|---|---|
| Environment and Validation Setup `nemo_rl/models/automodel/setup.py` | Set NCCL_CUMEM_ENABLE environment variable when generation is not colocated with rationale documentation; added guard in `setup_distributed` to raise NotImplementedError if cpu_offload is enabled on single-GPU setups (world_size == 1) |

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~8 minutes

🚥 Pre-merge checks | ✅ 3 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Title check | ⚠️ Warning | The title mentions 'assert for dtensor v2 cpu offload' but the actual changes add a NotImplementedError guard (not an assert) and modify NCCL_CUMEM_ENABLE environment variable setup, which is not mentioned in the title. | Update the title to accurately reflect the main changes, such as 'Prevent single-GPU CPU offload in setup_distributed and configure NCCL_CUMEM_ENABLE' or similar. |
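The distinction the title check draws between an `assert` and a raised `NotImplementedError` has a practical side: Python drops `assert` statements when code is compiled with optimization (for example under `python -O`), while an explicit `raise` always runs. The sketch below demonstrates this with the built-in `compile()` and its `optimize` argument. The guard sources and the `guard_fires` helper are illustrative only and are not code from the PR.

```python
# Two illustrative ways to write the same guard (not the PR's actual code).
ASSERT_GUARD = "assert not (cpu_offload and world_size == 1), 'unsupported'"
RAISE_GUARD = (
    "if cpu_offload and world_size == 1:\n"
    "    raise NotImplementedError('unsupported')\n"
)


def guard_fires(source: str, optimize: int) -> bool:
    """Return True if the guard raises for cpu_offload=True, world_size=1."""
    # optimize=0 keeps asserts; optimize=1 strips them, like `python -O`.
    code = compile(source, "<guard>", "exec", optimize=optimize)
    try:
        exec(code, {"cpu_offload": True, "world_size": 1})
    except (AssertionError, NotImplementedError):
        return True
    return False
```

Here `guard_fires(ASSERT_GUARD, 1)` returns `False` because the assert is compiled away, whereas `guard_fires(RAISE_GUARD, 1)` still returns `True`.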

✅ Passed checks (3 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%. |
| Test Results For Major Changes | ✅ Passed | PR contains only minor changes (7 lines added, 1 removed) classified as chore: setting environment variable and adding guard clause for unsupported configuration. |

coderabbitai

Actionable comments posted: 1

🤖 Fix all issues with AI agents

In `@nemo_rl/models/automodel/setup.py`:
- Around line 286-290: The NotImplementedError raised when cpu_offload is True
contains an incorrect GitHub URL; update the error message in the cpu_offload
guard (the NotImplementedError raised near cpu_offload and before
_setup_distributed()) to point to the correct repository URL (e.g.,
https://github.com/NVIDIA-NeMo/RL or the canonical
https://github.com/NVIDIA/NeMo) so the message directs users to the right issue
tracker.

add assert for dtensor v2 cpu offload

7dd3f7b

Signed-off-by: Yuki Huang <yukih@nvidia.com>

yuki-97 requested review from a team as code owners January 23, 2026 15:12

yuki-97 requested review from hemildesai and terrykong January 23, 2026 15:13

coderabbitai bot reviewed Jan 23, 2026

nemo_rl/models/automodel/setup.py (resolved)

RayenTian added a commit that referenced this pull request Jan 26, 2026

remove cpu_offload unit test when enable lora due to #1817

7db41f8

Signed-off-by: ruit <ruit@nvidia.com>

RayenTian added a commit that referenced this pull request Jan 28, 2026

remove cpu_offload unit test when enable lora due to #1817

4c65d94

Signed-off-by: ruit <ruit@nvidia.com>

hemildesai approved these changes Jan 29, 2026

terrykong approved these changes Jan 29, 2026

terrykong added the CI:L1 Run doctests, unit tests, and functional tests label Jan 29, 2026

terrykong enabled auto-merge (squash) January 29, 2026 17:53

terrykong temporarily deployed to nemo-ci January 29, 2026 17:53 — with GitHub Actions Inactive

terrykong temporarily deployed to nemo-ci January 29, 2026 19:42 — with GitHub Actions Inactive

terrykong temporarily deployed to nemo-ci January 30, 2026 00:53 — with GitHub Actions Inactive

terrykong merged commit 748b9ca into main Jan 30, 2026
42 of 43 checks passed

terrykong deleted the yukih/assert-cpuoffload branch January 30, 2026 05:32

RayenTian added a commit that referenced this pull request Feb 2, 2026

remove cpu_offload unit test when enable lora due to #1817

0ee1a7d

Signed-off-by: ruit <ruit@nvidia.com>

yuanhangsu1986 pushed a commit to yuanhangsu1986/RL-Nemontron-Edge-Omni that referenced this pull request Feb 12, 2026

chore: add assert for dtensor v2 cpu offload (NVIDIA-NeMo#1817)

82d9e96

Signed-off-by: Yuki Huang <yukih@nvidia.com> Signed-off-by: yuanhangs <yuanhangs@nvidia.com>

yuanhangsu1986 pushed a commit to yuanhangsu1986/RL-Nemontron-Edge-Omni that referenced this pull request Feb 21, 2026

chore: add assert for dtensor v2 cpu offload (NVIDIA-NeMo#1817)

1af304e

Signed-off-by: Yuki Huang <yukih@nvidia.com> Signed-off-by: yuanhangs <yuanhangs@nvidia.com>

seonjinn pushed a commit that referenced this pull request Mar 8, 2026

chore: add assert for dtensor v2 cpu offload (#1817)

d6f7b52

Signed-off-by: Yuki Huang <yukih@nvidia.com>

seonjinn pushed a commit that referenced this pull request Mar 8, 2026

chore: add assert for dtensor v2 cpu offload (#1817)

f7492d9

Signed-off-by: Yuki Huang <yukih@nvidia.com>

seonjinn pushed a commit that referenced this pull request Mar 9, 2026

chore: add assert for dtensor v2 cpu offload (#1817)

408abc8

Signed-off-by: Yuki Huang <yukih@nvidia.com>
