
cp: fix: Fix crash when using cp in dtensor path (1663) into r0.5.0#1665

Merged
terrykong merged 1 commit into r0.5.0 from cherry-pick-1663-r0.5.0
Dec 19, 2025

Conversation

@chtruong814 (Contributor) commented Dec 19, 2025

beep boop [🤖]: Hi @yfw 👋,

we've cherry-picked #1663 into r0.5.0 for you! 🚀

Please review and approve this cherry-pick at your convenience!

Summary by CodeRabbit

Release Notes

  • New Features

    • Added scaled dot-product attention (SDPA) backend selection for context-parallel configurations, preferring the FLASH_ATTENTION and EFFICIENT_ATTENTION backends when context parallelism is enabled.
  • Improvements

    • Model initialization now aligns attention backend selection with the context-parallelism setting, fixing a crash in the DTensor path when cp_size > 1.


Signed-off-by: Yi-Fu Wu <yifu.wu@gmail.com>
Signed-off-by: NeMo Bot <nemo-bot@nvidia.com>
@github-actions

⚠️ File Consistency Check

Check based on commit: 19a8968 (PR #1665 from cherry-pick-1663-r0.5.0)

⚠️ DTensor Policy Worker Synchronization Warning

The file nemo_rl/models/policy/workers/dtensor_policy_worker_v2.py was modified in this PR, but nemo_rl/models/policy/workers/dtensor_policy_worker.py was not updated.

Why this matters:
These files contain related DTensor policy worker implementations that should be kept synchronized to ensure consistency across different versions.

Action required:

  • Please review if the changes in nemo_rl/models/policy/workers/dtensor_policy_worker_v2.py should also be applied to nemo_rl/models/policy/workers/dtensor_policy_worker.py
  • Update nemo_rl/models/policy/workers/dtensor_policy_worker.py if necessary to maintain consistency
  • If the files are intentionally different, please add a comment in the PR explaining why

Files to check:

  • Modified: nemo_rl/models/policy/workers/dtensor_policy_worker_v2.py
  • Not modified: nemo_rl/models/policy/workers/dtensor_policy_worker.py

This check ensures that related file implementations remain synchronized across the codebase. If you believe this warning is incorrect or the files should intentionally differ, please add a comment explaining the reasoning.

coderabbitai bot (Contributor) commented Dec 19, 2025

📝 Walkthrough

Modified DTensorPolicyWorkerV2 to support context-parallel SDPA handling. When context parallelism is enabled (cp_size > 1), the code imports SDPBackend and constructs an sdpa_method list containing FLASH_ATTENTION and EFFICIENT_ATTENTION backends, then passes this to model construction via model_class.from_config.
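A minimal sketch of that conditional, assuming the behavior described above. The helper name is hypothetical, and a local enum stands in for `torch.nn.attention.SDPBackend` so the snippet runs without PyTorch installed:

```python
from enum import Enum, auto


class SDPBackend(Enum):
    # Stand-in for torch.nn.attention.SDPBackend; the real enum also
    # includes other members such as MATH.
    FLASH_ATTENTION = auto()
    EFFICIENT_ATTENTION = auto()
    MATH = auto()


def select_sdpa_method(cp_size: int):
    """Return the preferred SDPA backends when context parallelism is on.

    Mirrors the logic described in the walkthrough: only populate
    sdpa_method when cp_size > 1; otherwise return None so the default
    backend resolution is used.
    """
    if cp_size > 1:
        return [SDPBackend.FLASH_ATTENTION, SDPBackend.EFFICIENT_ATTENTION]
    return None
```

With `cp_size = 1` the helper returns `None`, matching the documented behavior that a `None` `sdpa_method` falls back to default backend resolution.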

Changes

Cohort / File(s) Summary
Context-Parallel SDPA Handling
nemo_rl/models/policy/workers/dtensor_policy_worker_v2.py
Added conditional logic to import SDPBackend and build sdpa_method list when context_parallel_size > 1; integrated sdpa_method parameter into model_class.from_config call alongside existing attn_implementation selection

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

  • Focus areas: verification that sdpa_method is correctly populated only when cp_size > 1; confirmation that the sdpa_method list (FLASH_ATTENTION, EFFICIENT_ATTENTION) is appropriate for context-parallel scenarios; review of the integration with existing attn_implementation logic to ensure no conflicts.

Suggested labels

CI:L1, r0.5.0

Suggested reviewers

  • yfw
  • yuki-97
  • joyang-nv
  • terrykong

Pre-merge checks and finishing touches

✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title indicates a fix for context-parallel (cp) crashes in the dtensor path, which aligns with the changes made to handle context-parallel SDPA methods in the DTensorPolicyWorkerV2 class.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
Test Results For Major Changes ✅ Passed This PR is a bug fix for a crash in context parallelism handling within a single file, making it a minor change that does not require explicit test result documentation.

📜 Recent review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 48dbb37 and 19a8968.

📒 Files selected for processing (1)
  • nemo_rl/models/policy/workers/dtensor_policy_worker_v2.py (2 hunks)
🧰 Additional context used
📓 Path-based instructions (4)
**/*.py

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.py: Conform code to Python 3.12+
Indent code with 4 spaces. Do not use tabs
Use snake_case for file names
Use PascalCase for class names
Use snake_case for function and method names
Use snake_case for local variables
Prefix variable names that start with a number with 'k' (e.g., k_99th_percentile)
Use upper snake_case with 'G' prefix for global variables (e.g., G_MY_GLOBAL)
Use upper snake_case for constants
Avoid shadowing variables declared in an outer scope
Initialize all externally visible members of a class in the constructor
Prefer docstrings over comments for interfaces that may be used outside a file
Reserve comments for code within a function or interfaces that are local to a file
If a piece of code is commented out, include a comment describing its usage and why it's commented out. Remove debug comments before merging
Use Google style docstrings for classes and functions in Python, which can be parsed by Sphinx
Avoid using reflection when functionality can be easily achieved without reflection
When using try-except blocks, limit the except clause to the smallest set of specific errors possible
When using try-except blocks for duck-typing, keep the body of the try as small as possible and use the else block for logic
YAML is the single source of truth for configuration defaults. Do not set non-None defaults in code for configuration values
For required configuration attributes, access config directly and expect presence (e.g., policy_cfg['precision']) without hidden defaults
Use typing.NotRequired to mark optional attributes in TypedDict for configuration
When adding a new config key to a TypedDict subclass, document the key's purpose, valid values/types, and recommended default, and reflect the default in exemplar YAMLs under examples/configs/*.yaml
Follow the Google Python Style Guide for Python code

Files:

  • nemo_rl/models/policy/workers/dtensor_policy_worker_v2.py
nemo_rl/**/*.py

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

For any source file under nemo_rl/*.py that defines a class or function decorated with @ray.remote, add a coverage pragma (# pragma: no cover) because these run in separate Ray processes

Files:

  • nemo_rl/models/policy/workers/dtensor_policy_worker_v2.py
!(**/tests/**|**/test_*.py|**/test_*.sh)

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Add the NVIDIA copyright header to all Python files and shell scripts (excluding tests). The header should include the current year

Files:

  • nemo_rl/models/policy/workers/dtensor_policy_worker_v2.py
**/*.{py,sh}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

The NVIDIA copyright header should appear at the top of all Python files and shell scripts (excluding tests)

Files:

  • nemo_rl/models/policy/workers/dtensor_policy_worker_v2.py
🧠 Learnings (3)
📓 Common learnings
Learnt from: adil-a
Repo: NVIDIA-NeMo/RL PR: 1440
File: examples/configs/sft_automodel.yaml:48-58
Timestamp: 2025-10-30T20:50:44.126Z
Learning: In DTensor configurations for MoE (Mixture of Experts) models, expert_parallel_size and data_parallel_size can be applied together without multiplying the GPU requirements. Expert Parallelism (EP) only applies to MoE layers, while Data Parallelism/FSDP applies to non-MoE layers. Therefore, configurations like expert_parallel_size: 8 and data_parallel_size: 8 are valid on an 8-GPU cluster for MoE models.
📚 Learning: 2025-10-30T20:50:44.126Z
Learnt from: adil-a
Repo: NVIDIA-NeMo/RL PR: 1440
File: examples/configs/sft_automodel.yaml:48-58
Timestamp: 2025-10-30T20:50:44.126Z
Learning: In DTensor configurations for MoE (Mixture of Experts) models, expert_parallel_size and data_parallel_size can be applied together without multiplying the GPU requirements. Expert Parallelism (EP) only applies to MoE layers, while Data Parallelism/FSDP applies to non-MoE layers. Therefore, configurations like expert_parallel_size: 8 and data_parallel_size: 8 are valid on an 8-GPU cluster for MoE models.

Applied to files:

  • nemo_rl/models/policy/workers/dtensor_policy_worker_v2.py
📚 Learning: 2025-09-19T03:00:58.662Z
Learnt from: shuo-nvidia
Repo: NVIDIA-NeMo/RL PR: 1006
File: examples/configs/recipes/llm/distillation-qwen3-32b-to-1.7b-base-1n8g-fsdp2tp1.v1.yaml:85-101
Timestamp: 2025-09-19T03:00:58.662Z
Learning: In distillation and GRPO configurations, max_new_tokens is intentionally set to the full context window (max_total_sequence_length) for consistency across the codebase. Overflow cases when prompt + generation tokens exceed max_model_len are handled by safeguards implemented in vllm_worker.py.

Applied to files:

  • nemo_rl/models/policy/workers/dtensor_policy_worker_v2.py
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
  • GitHub Check: Lint check
  • GitHub Check: Lint check
  • GitHub Check: Lint check
  • GitHub Check: Post submodule check comment / Comment on PR
🔇 Additional comments (3)
nemo_rl/models/policy/workers/dtensor_policy_worker_v2.py (3)

280-280: LGTM: Context parallel size retrieval.

The early retrieval of cp_size is appropriate for the subsequent conditional logic and aligns with its existing usage throughout the file.


285-296: The code correctly implements SDPBackend backend selection.

torch.nn.attention.SDPBackend is an enum-like class that exposes the available backends for scaled dot-product attention, and both the FLASH_ATTENTION and EFFICIENT_ATTENTION backends are available in PyTorch 2.9.0. The conditional import and backend list match the usage patterns shown in the official documentation. The code will work as intended.


298-307: No issues with sdpa_method parameter — it is properly supported.

The sdpa_method parameter is a standard parameter in NeMoAutoModelForCausalLM.from_config() that accepts "One or multiple SDPA back-ends to prefer when applying SDPA patching," and "When None, the default backend resolution logic is used." The implementation correctly passes this parameter to control SDPA kernel optimization.
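To illustrate the call shape only: `StubModelClass` below is a hypothetical stand-in for `NeMoAutoModelForCausalLM`, and its `from_config` merely records its arguments rather than building a model, so this is a sketch of the parameter flow, not the real API.

```python
from enum import Enum, auto


class SDPBackend(Enum):  # stand-in for torch.nn.attention.SDPBackend
    FLASH_ATTENTION = auto()
    EFFICIENT_ATTENTION = auto()


class StubModelClass:
    """Hypothetical stand-in for the real model class."""

    @classmethod
    def from_config(cls, config, attn_implementation=None, sdpa_method=None):
        # The real from_config constructs a model; here we just echo the
        # arguments so the call pattern can be inspected.
        return {
            "config": config,
            "attn_implementation": attn_implementation,
            "sdpa_method": sdpa_method,
        }


cp_size = 2
# Populate sdpa_method only when context parallelism is enabled.
sdpa_method = (
    [SDPBackend.FLASH_ATTENTION, SDPBackend.EFFICIENT_ATTENTION]
    if cp_size > 1
    else None
)
model = StubModelClass.from_config(
    {"hidden_size": 128},        # placeholder config
    attn_implementation="sdpa",  # existing selection logic, kept as-is
    sdpa_method=sdpa_method,
)
```

When `cp_size` is 1, `sdpa_method` stays `None` and, per the documentation quoted above, the default backend resolution logic applies.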



@terrykong terrykong added the CI:L1 Run doctests, unit tests, and functional tests label Dec 19, 2025
@terrykong terrykong enabled auto-merge (squash) December 19, 2025 18:45
@terrykong terrykong merged commit d488064 into r0.5.0 Dec 19, 2025
68 of 71 checks passed
@terrykong terrykong deleted the cherry-pick-1663-r0.5.0 branch December 19, 2025 21:36

Labels

cherry-pick CI:L1 Run doctests, unit tests, and functional tests Run CICD


3 participants