
cp: fix: Fix crash when using cp in dtensor path (1663) into r0.5.0#1665

Merged
terrykong merged 1 commit into r0.5.0 from cherry-pick-1663-r0.5.0
Dec 19, 2025

Conversation

@chtruong814 (Contributor) commented Dec 19, 2025

beep boop [🤖]: Hi @yfw 👋,

we've cherry-picked #1663 into r0.5.0 for you! 🚀

Please review and approve this cherry-pick at your convenience!

Summary by CodeRabbit

Release Notes

  • New Features

    • Added scaled dot-product attention (SDPA) backend selection for context-parallel configurations, preferring the FLASH_ATTENTION and EFFICIENT_ATTENTION backends when context parallelism is enabled.
  • Improvements

    • Model initialization now aligns attention backend selection with the context-parallelism setting, fixing a crash in the DTensor path when cp_size > 1.


Signed-off-by: Yi-Fu Wu <yifu.wu@gmail.com>
Signed-off-by: NeMo Bot <nemo-bot@nvidia.com>
@github-actions

⚠️ File Consistency Check

Check based on commit: 19a8968 (PR #1665 from cherry-pick-1663-r0.5.0)

⚠️ DTensor Policy Worker Synchronization Warning

The file nemo_rl/models/policy/workers/dtensor_policy_worker_v2.py was modified in this PR, but nemo_rl/models/policy/workers/dtensor_policy_worker.py was not updated.

Why this matters:
These files contain related DTensor policy worker implementations that should be kept synchronized to ensure consistency across different versions.

Action required:

  • Please review if the changes in nemo_rl/models/policy/workers/dtensor_policy_worker_v2.py should also be applied to nemo_rl/models/policy/workers/dtensor_policy_worker.py
  • Update nemo_rl/models/policy/workers/dtensor_policy_worker.py if necessary to maintain consistency
  • If the files are intentionally different, please add a comment in the PR explaining why

Files to check:

  • Modified: nemo_rl/models/policy/workers/dtensor_policy_worker_v2.py
  • Not modified: nemo_rl/models/policy/workers/dtensor_policy_worker.py

This check ensures that related file implementations remain synchronized across the codebase. If you believe this warning is incorrect or the files should intentionally differ, please add a comment explaining the reasoning.

coderabbitai bot (Contributor) commented Dec 19, 2025

📝 Walkthrough

Modified DTensorPolicyWorkerV2 to support context-parallel SDPA handling. When context parallelism is enabled (cp_size > 1), the code imports SDPBackend and constructs an sdpa_method list containing FLASH_ATTENTION and EFFICIENT_ATTENTION backends, then passes this to model construction via model_class.from_config.
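A minimal sketch of that conditional, assuming the behavior described above. The helper name is hypothetical, and a local enum stands in for `torch.nn.attention.SDPBackend` so the snippet runs without PyTorch installed:

```python
from enum import Enum, auto


class SDPBackend(Enum):
    # Stand-in for torch.nn.attention.SDPBackend; the real enum also
    # includes other members such as MATH.
    FLASH_ATTENTION = auto()
    EFFICIENT_ATTENTION = auto()
    MATH = auto()


def select_sdpa_method(cp_size: int):
    """Return the preferred SDPA backends when context parallelism is on.

    Mirrors the logic described in the walkthrough: only populate
    sdpa_method when cp_size > 1; otherwise return None so the default
    backend resolution is used.
    """
    if cp_size > 1:
        return [SDPBackend.FLASH_ATTENTION, SDPBackend.EFFICIENT_ATTENTION]
    return None
```

With `cp_size = 1` the helper returns `None`, matching the documented behavior that a `None` `sdpa_method` falls back to default backend resolution.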

Changes

Cohort / File(s) Summary
Context-Parallel SDPA Handling
nemo_rl/models/policy/workers/dtensor_policy_worker_v2.py
Added conditional logic to import SDPBackend and build sdpa_method list when context_parallel_size > 1; integrated sdpa_method parameter into model_class.from_config call alongside existing attn_implementation selection

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

  • Focus areas: verification that sdpa_method is correctly populated only when cp_size > 1; confirmation that the sdpa_method list (FLASH_ATTENTION, EFFICIENT_ATTENTION) is appropriate for context-parallel scenarios; review of the integration with existing attn_implementation logic to ensure no conflicts.

Suggested labels

CI:L1, r0.5.0

Suggested reviewers

  • yfw
  • yuki-97
  • joyang-nv
  • terrykong

Pre-merge checks and finishing touches

✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title indicates a fix for context-parallel (cp) crashes in the dtensor path, which aligns with the changes made to handle context-parallel SDPA methods in the DTensorPolicyWorkerV2 class.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
Test Results For Major Changes ✅ Passed This PR is a bug fix for a crash in context parallelism handling within a single file, making it a minor change that does not require explicit test result documentation.

📜 Recent review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 48dbb37 and 19a8968.

📒 Files selected for processing (1)
  • nemo_rl/models/policy/workers/dtensor_policy_worker_v2.py (2 hunks)
🧰 Additional context used
📓 Path-based instructions (4)
**/*.py

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.py: Conform code to Python 3.12+
Indent code with 4 spaces. Do not use tabs
Use snake_case for file names
Use PascalCase for class names
Use snake_case for function and method names
Use snake_case for local variables
Prefix variable names that start with a number with 'k' (e.g., k_99th_percentile)
Use upper snake_case with 'G' prefix for global variables (e.g., G_MY_GLOBAL)
Use upper snake_case for constants
Avoid shadowing variables declared in an outer scope
Initialize all externally visible members of a class in the constructor
Prefer docstrings over comments for interfaces that may be used outside a file
Reserve comments for code within a function or interfaces that are local to a file
If a piece of code is commented out, include a comment describing its usage and why it's commented out. Remove debug comments before merging
Use Google style docstrings for classes and functions in Python, which can be parsed by Sphinx
Avoid using reflection when functionality can be easily achieved without reflection
When using try-except blocks, limit the except clause to the smallest set of specific errors possible
When using try-except blocks for duck-typing, keep the body of the try as small as possible and use the else block for logic
YAML is the single source of truth for configuration defaults. Do not set non-None defaults in code for configuration values
For required configuration attributes, access config directly and expect presence (e.g., policy_cfg['precision']) without hidden defaults
Use typing.NotRequired to mark optional attributes in TypedDict for configuration
When adding a new config key to a TypedDict subclass, document the key's purpose, valid values/types, and recommended default, and reflect the default in exemplar YAMLs under examples/configs/*.yaml
Follow the Google Python Style Guide for Python code

Files:

  • nemo_rl/models/policy/workers/dtensor_policy_worker_v2.py
nemo_rl/**/*.py

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

For any source file under nemo_rl/*.py that defines a class or function decorated with @ray.remote, add a coverage pragma (# pragma: no cover) because these run in separate Ray processes

Files:

  • nemo_rl/models/policy/workers/dtensor_policy_worker_v2.py
!(**/tests/**|**/test_*.py|**/test_*.sh)

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Add the NVIDIA copyright header to all Python files and shell scripts (excluding tests). The header should include the current year

Files:

  • nemo_rl/models/policy/workers/dtensor_policy_worker_v2.py
**/*.{py,sh}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

The NVIDIA copyright header should appear at the top of all Python files and shell scripts (excluding tests)

Files:

  • nemo_rl/models/policy/workers/dtensor_policy_worker_v2.py
🧠 Learnings (3)
📓 Common learnings
Learnt from: adil-a
Repo: NVIDIA-NeMo/RL PR: 1440
File: examples/configs/sft_automodel.yaml:48-58
Timestamp: 2025-10-30T20:50:44.126Z
Learning: In DTensor configurations for MoE (Mixture of Experts) models, expert_parallel_size and data_parallel_size can be applied together without multiplying the GPU requirements. Expert Parallelism (EP) only applies to MoE layers, while Data Parallelism/FSDP applies to non-MoE layers. Therefore, configurations like expert_parallel_size: 8 and data_parallel_size: 8 are valid on an 8-GPU cluster for MoE models.
📚 Learning: 2025-10-30T20:50:44.126Z
Learnt from: adil-a
Repo: NVIDIA-NeMo/RL PR: 1440
File: examples/configs/sft_automodel.yaml:48-58
Timestamp: 2025-10-30T20:50:44.126Z
Learning: In DTensor configurations for MoE (Mixture of Experts) models, expert_parallel_size and data_parallel_size can be applied together without multiplying the GPU requirements. Expert Parallelism (EP) only applies to MoE layers, while Data Parallelism/FSDP applies to non-MoE layers. Therefore, configurations like expert_parallel_size: 8 and data_parallel_size: 8 are valid on an 8-GPU cluster for MoE models.

Applied to files:

  • nemo_rl/models/policy/workers/dtensor_policy_worker_v2.py
📚 Learning: 2025-09-19T03:00:58.662Z
Learnt from: shuo-nvidia
Repo: NVIDIA-NeMo/RL PR: 1006
File: examples/configs/recipes/llm/distillation-qwen3-32b-to-1.7b-base-1n8g-fsdp2tp1.v1.yaml:85-101
Timestamp: 2025-09-19T03:00:58.662Z
Learning: In distillation and GRPO configurations, max_new_tokens is intentionally set to the full context window (max_total_sequence_length) for consistency across the codebase. Overflow cases when prompt + generation tokens exceed max_model_len are handled by safeguards implemented in vllm_worker.py.

Applied to files:

  • nemo_rl/models/policy/workers/dtensor_policy_worker_v2.py
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
  • GitHub Check: Lint check
  • GitHub Check: Lint check
  • GitHub Check: Lint check
  • GitHub Check: Post submodule check comment / Comment on PR
🔇 Additional comments (3)
nemo_rl/models/policy/workers/dtensor_policy_worker_v2.py (3)

280-280: LGTM: Context parallel size retrieval.

The early retrieval of cp_size is appropriate for the subsequent conditional logic and aligns with its existing usage throughout the file.


285-296: The code correctly implements SDPBackend backend selection.

torch.nn.attention.SDPBackend is an enum-like class that exposes the available backends for scaled dot-product attention, and both the FLASH_ATTENTION and EFFICIENT_ATTENTION backends are available in PyTorch 2.9.0. The conditional import and backend list match the usage patterns shown in the official documentation. The code will work as intended.


298-307: No issues with sdpa_method parameter — it is properly supported.

The sdpa_method parameter is a standard parameter in NeMoAutoModelForCausalLM.from_config() that accepts "One or multiple SDPA back-ends to prefer when applying SDPA patching," and "When None, the default backend resolution logic is used." The implementation correctly passes this parameter to control SDPA kernel optimization.
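To illustrate the call shape only: `StubModelClass` below is a hypothetical stand-in for `NeMoAutoModelForCausalLM`, and its `from_config` merely records its arguments rather than building a model, so this is a sketch of the parameter flow, not the real API.

```python
from enum import Enum, auto


class SDPBackend(Enum):  # stand-in for torch.nn.attention.SDPBackend
    FLASH_ATTENTION = auto()
    EFFICIENT_ATTENTION = auto()


class StubModelClass:
    """Hypothetical stand-in for the real model class."""

    @classmethod
    def from_config(cls, config, attn_implementation=None, sdpa_method=None):
        # The real from_config constructs a model; here we just echo the
        # arguments so the call pattern can be inspected.
        return {
            "config": config,
            "attn_implementation": attn_implementation,
            "sdpa_method": sdpa_method,
        }


cp_size = 2
# Populate sdpa_method only when context parallelism is enabled.
sdpa_method = (
    [SDPBackend.FLASH_ATTENTION, SDPBackend.EFFICIENT_ATTENTION]
    if cp_size > 1
    else None
)
model = StubModelClass.from_config(
    {"hidden_size": 128},        # placeholder config
    attn_implementation="sdpa",  # existing selection logic, kept as-is
    sdpa_method=sdpa_method,
)
```

When `cp_size` is 1, `sdpa_method` stays `None` and, per the documentation quoted above, the default backend resolution logic applies.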



@terrykong terrykong added the CI:L1 Run doctests, unit tests, and functional tests label Dec 19, 2025
@terrykong terrykong enabled auto-merge (squash) December 19, 2025 18:45
@terrykong terrykong merged commit d488064 into r0.5.0 Dec 19, 2025
68 of 71 checks passed
@terrykong terrykong deleted the cherry-pick-1663-r0.5.0 branch December 19, 2025 21:36

Labels

cherry-pick CI:L1 Run doctests, unit tests, and functional tests Run CICD


3 participants