fix: Fix the sequence padding for FP8 case #1569
Signed-off-by: root <[email protected]>
📝 Walkthrough

The changes consolidate and enhance the padding logic for Megatron sequence packing to address FP8 alignment requirements. A new centralized helper function derives padding parameters from the Megatron configuration (including FP8 settings and parallelism sizes), and these parameters are threaded through the policy worker workflow to ensure proper tensor dimension alignment.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Worker as megatron_policy_worker
    participant Common as common.py
    participant PackSeq as _pack_sequences_for_megatron()
    Worker->>Common: _get_pack_sequence_parameters_for_megatron(megatron_cfg, max_seq_len)
    Note over Common: Derive padding params from:<br/>- Megatron config<br/>- FP8 settings<br/>- CP/TP/SP sizes
    Common-->>Worker: (pad_individual, pad_packed_multiple, pad_packed_to)
    Worker->>Worker: Compute pad_factor from params
    alt sequence_packing enabled
        Worker->>PackSeq: Call with pad_packed_seq_to_multiple_of
        PackSeq->>Common: _round_up_to_multiple(pad_to, pad_multiple)
        Common-->>PackSeq: Rounded pad target
        PackSeq->>PackSeq: Apply padding to achieve<br/>FP8-compliant dimensions
        PackSeq-->>Worker: Padded sequences
    else sequence_packing disabled
        Worker->>Worker: Use standard padding
    end
```
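In code, the flow in the diagram reduces to a derive-then-pack pattern: pad each individual sequence to one multiple, then pad the total packed length to another. The sketch below uses illustrative stand-in names, not the actual NeMo RL helpers:

```python
def round_up(value: int, multiple: int) -> int:
    return value if value % multiple == 0 else ((value // multiple) + 1) * multiple

def pack_sequences(seq_lens, pad_individual: int, pad_packed_multiple: int):
    """Pad each sequence, then pad the total packed length (toy model of the flow)."""
    padded_lens = [round_up(length, pad_individual) for length in seq_lens]
    packed_len = round_up(sum(padded_lens), pad_packed_multiple)
    return padded_lens, packed_len

# e.g. FP8 (non-blockwise) without CP/SP: individual multiple 1, packed multiple 16
padded, total = pack_sequences([33, 57, 71], pad_individual=1, pad_packed_multiple=16)
print(padded, total)  # [33, 57, 71] 176
```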
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
Actionable comments posted: 2
📒 Files selected for processing (3)

- nemo_rl/models/megatron/common.py (9 hunks)
- nemo_rl/models/policy/lm_policy.py (0 hunks)
- nemo_rl/models/policy/megatron_policy_worker.py (8 hunks)
💤 Files with no reviewable changes (1)
- nemo_rl/models/policy/lm_policy.py
🔇 Additional comments (10)
nemo_rl/models/policy/megatron_policy_worker.py (3)

115-119: LGTM on import addition. The new import for `_get_pack_sequence_parameters_for_megatron` is correctly added alongside related imports from `nemo_rl.models.megatron.common`.
1020-1047: LGTM on centralized padding parameter retrieval in `train()`. The padding parameters are correctly initialized with defaults for the non-packing path (lines 1022-1024) and only overridden when sequence packing is enabled (lines 1040-1047). This ensures the forward step always has valid values.
1068-1070: Padding parameter correctly threaded through forward step. The `pad_packed_seq_to_multiple_of` parameter is now passed to `forward_step`, ensuring FP8 alignment requirements are honored during training.

nemo_rl/models/megatron/common.py (7)
36-41: LGTM on `_round_up_to_multiple` helper. The implementation correctly rounds up to the nearest multiple. The conditional check `if value % multiple != 0` avoids unnecessary computation when the value is already aligned.
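Based on that description, the helper presumably looks something like the sketch below (illustrative reconstruction, not the exact source; the early-return guard is the optimization the comment refers to):

```python
def round_up_to_multiple(value: int, multiple: int) -> int:
    """Round `value` up to the nearest multiple of `multiple` (illustrative sketch)."""
    if multiple > 1 and value % multiple != 0:
        return ((value // multiple) + 1) * multiple
    return value  # already aligned: no extra computation needed
```

For example, `round_up_to_multiple(1000, 128)` yields 1024, while an already-aligned value such as 1024 is returned unchanged.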
74-88: LGTM on `needs_padding` flag and initial rounding. The centralized `needs_padding` flag correctly captures when any padding is required. Rounding `pad_packed_seq_to` to a multiple of `pad_packed_seq_to_multiple_of` upfront (lines 85-88) ensures FP8 alignment is satisfied when PP forces a specific packed length.
104-120: Correct handling of padded cumulative sequence lengths. The logic properly updates `cu_seqlens_padded` based on the padding requirements:

- Individual sequences are padded to `pad_factor` (line 106)
- The final packed length is either set to `pad_packed_seq_to` or rounded up to `pad_packed_seq_to_multiple_of` (lines 115-120)
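That `cu_seqlens_padded` update can be sketched as a small standalone function (hypothetical; the real code operates on the packed batch rather than a list of lengths):

```python
def pad_cu_seqlens(seq_lens, pad_factor, pad_packed_seq_to=None,
                   pad_packed_seq_to_multiple_of=1):
    """Build padded cumulative sequence lengths (cu_seqlens_padded), a sketch."""
    def round_up(v: int, m: int) -> int:
        return v if m <= 1 or v % m == 0 else ((v // m) + 1) * m

    cu = [0]
    for length in seq_lens:
        cu.append(cu[-1] + round_up(length, pad_factor))  # pad each sequence
    if pad_packed_seq_to is not None:
        cu[-1] = pad_packed_seq_to  # PP forces a fixed (pre-rounded) packed length
    else:
        cu[-1] = round_up(cu[-1], pad_packed_seq_to_multiple_of)  # FP8 alignment
    return cu
```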
144-159: Complex padding logic for the last sequence element is correct. When `b == batch_size - 1` and padding is needed, the code correctly computes the padded length to satisfy both the individual sequence padding and the overall packed-sequence alignment requirements.
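The two constraints on the final element can be illustrated with a small hypothetical helper. It assumes the packed multiple is itself a multiple of `pad_factor` (which holds when the FP8 multiple carries the CP/SP multipliers, per the parameter derivation above):

```python
def pad_last_sequence(prev_total: int, last_len: int,
                      pad_factor: int, pad_packed_multiple: int) -> int:
    """Padded length for the last sequence so that it is a multiple of
    pad_factor AND the overall packed length is a multiple of pad_packed_multiple."""
    def round_up(v: int, m: int) -> int:
        return v if v % m == 0 else ((v // m) + 1) * m

    last_padded = round_up(last_len, pad_factor)                     # individual alignment
    total = round_up(prev_total + last_padded, pad_packed_multiple)  # packed alignment
    return total - prev_total
```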
188-205: Padding logic when `pad_factor == 1` is correct. When no individual sequence padding is needed but packed-sequence padding is required (e.g., FP8 without CP/SP), this path correctly pads the entire packed sequence to satisfy alignment.
229-281: Well-designed centralized padding parameter calculation. The `_get_pack_sequence_parameters_for_megatron` function correctly derives the padding requirements:

- Individual sequence padding accounts for CP (× cp_size × 2) and SP (× tp_size)
- Packed-sequence padding for FP8 uses 128 for the blockwise recipe, 16 otherwise, with CP/SP multipliers
- PP requires padding to the maximum sequence length in the batch

This directly addresses the FP8 dimension divisibility issue in #1551.
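Putting those three bullets together, the derivation could be sketched as below. Every name here, and the exact way the FP8 alignment composes with the CP/SP multipliers, is an assumption reconstructed from the review notes, not the actual `_get_pack_sequence_parameters_for_megatron` signature:

```python
def get_pack_sequence_parameters(cp_size: int, tp_size: int,
                                 sequence_parallel: bool,
                                 fp8_enabled: bool, fp8_blockwise: bool,
                                 pp_size: int, max_seq_len: int):
    """Sketch of the padding-parameter derivation described in the review."""
    # Individual sequences must be divisible by 2*cp_size (CP splits each
    # sequence into 2*cp_size chunks) and by tp_size when sequence
    # parallelism shards the sequence dimension.
    pad_individual = 1
    if cp_size > 1:
        pad_individual *= 2 * cp_size
    if sequence_parallel:
        pad_individual *= tp_size

    # FP8 GEMMs need the packed length divisible by 16 (128 for the
    # blockwise recipe), again scaled by the CP/SP chunking factors.
    pad_packed_multiple = 1
    if fp8_enabled:
        fp8_align = 128 if fp8_blockwise else 16
        pad_packed_multiple = fp8_align * pad_individual

    # Pipeline parallelism needs every microbatch padded to a fixed length.
    pad_packed_to = max_seq_len if pp_size > 1 else None
    return pad_individual, pad_packed_multiple, pad_packed_to
```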
358-426: LGTM on `forward_step_arbitrary_loss` updates. The new `pad_packed_seq_to_multiple_of` parameter is correctly added to the function signature and passed through to `_pack_sequences_for_megatron`.
terrykong left a comment:
thanks for the fix!
Addressed the CodeRabbit comments.
What does this PR do?

Consolidates the sequence padding logic for the Mcore sequence-packing case and fixes bugs for the FP8 case.
Sequence packing now relies on three hyperparameters, returned as a tuple by `_get_pack_sequence_parameters_for_megatron`:

- the multiple each individual sequence is padded to (driven by the CP and SP sizes)
- the multiple the packed sequence length is padded to (driven by FP8 alignment)
- an optional fixed packed length (required when pipeline parallelism is used)

This functionality is now maintained in a separate utility function.
closes #1551