cp: Revert packed seq extra checks (2180) into r0.3.0 #2196

Merged

ko3n1g merged 1 commit into r0.3.0 from cherry-pick-2180-r0.3.0 on Feb 3, 2026
Conversation

@ko3n1g
Contributor

@ko3n1g ko3n1g commented Feb 3, 2026

beep boop [🤖]: Hi @cuichenx 👋,

we've cherry-picked #2180 into r0.3.0 for you! 🚀

Please review and approve this cherry-pick at your convenience!

Summary by CodeRabbit

  • Refactor
    • Optimized sequence padding handling logic for improved efficiency.
  • Tests
    • Removed redundant test cases to streamline test coverage.

Signed-off-by: Chen Cui <chcui@nvidia.com>
Signed-off-by: NeMo Bot <nemo-bot@nvidia.com>
@ko3n1g
Contributor Author

ko3n1g commented Feb 3, 2026

/ok to test 75f3195

@copy-pr-bot

copy-pr-bot bot commented Feb 3, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@coderabbitai
Contributor

coderabbitai bot commented Feb 3, 2026

📝 Walkthrough

Refactored packed sequence utilities to replace assertion-based conditional logic with direct tensor slicing operations. Simplified control flow by removing assertions that enforced specific argmin positions and padding values. Corresponding unit tests for no-padding and unpadded scenarios were removed to align with the updated implementation.

Changes

  • Packed Sequence Utils Refactoring — src/megatron/bridge/training/utils/packed_seq_utils.py
    Replaced assertion-based logic with direct slicing using cu_seqlens_argmin or torch.argmin. Simplified conditional branches for both padded and unpadded sequence-length handling.
  • Unit Test Cleanup — tests/unit_tests/training/test_gpt_step.py
    Removed three test functions (test_packed_seq_params_no_padding, test_packed_seq_params_no_padding_in_cu_seqlens, test_packed_seq_params_cu_seqlens_unpadded_no_padding) and the associated assertions that validated no-padding scenarios.
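The trimming that the refactor performs can be sketched in plain Python. This is a hypothetical illustration, not the repository's API: `trim_cu_seqlens` and the list stand-in for the tensor are invented for this sketch, and it assumes the cumulative-seqlen buffer is monotonically increasing and right-padded with -1, so the first minimum marks where padding begins.

```python
def trim_cu_seqlens(cu_seqlens, argmin=None):
    # Mirrors the refactored flow: prefer a precomputed argmin
    # (cu_seqlens_argmin in the real code); otherwise locate the
    # first minimum ourselves, like torch.argmin() on a -1-padded
    # buffer, and slice the padding tail off.
    if argmin is None:
        argmin = cu_seqlens.index(min(cu_seqlens))
    return cu_seqlens[:argmin]

# Three packed sequences of lengths 5, 4, 3, with two -1 padding slots:
print(trim_cu_seqlens([0, 5, 9, 12, -1, -1]))  # [0, 5, 9, 12]
```

Note that the slice bound is an ordinary Python int here; the real code slices torch tensors, where the distinction between a Python int and a 0-dim tensor index matters (see the review comment below).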

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs

Suggested labels

r0.3.0

Suggested reviewers

  • yaoyu-33
🚥 Pre-merge checks | ✅ 3 passed | ❌ 1 failed

❌ Failed checks (1 warning)
  • Test Results For Major Changes — ⚠️ Warning. The PR removes validation checks and 48 lines of tests (net -4 lines) without test results, regression analysis, or explanation; a CUDA compatibility bug was identified with missing .item() conversions. Resolution: add test results demonstrating the removed checks don't cause regressions, fix the CUDA bug with .item() conversions on torch.argmin() results, and include regression testing across CPU and CUDA.

✅ Passed checks (3 passed)
  • Description Check — ✅ Passed. Check skipped: CodeRabbit's high-level summary is enabled.
  • Title Check — ✅ Passed. The title accurately describes the PR as a cherry-pick of a revert (PR #2180) into the r0.3.0 branch, which matches the changeset that removes assertions and test cases from packed sequence utilities.
  • Docstring Coverage — ✅ Passed. Docstring coverage is 100.00%, which meets the required threshold of 80.00%.


Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@src/megatron/bridge/training/utils/packed_seq_utils.py`:
- Around line 46-55: The slice bounds use torch.argmin() (which returns a 0-d
tensor) causing CUDA/device-index issues; update the fallback branches to
convert the argmin results to Python ints (use .item()) before slicing
cu_seqlens_padded and cu_seqlens_unpadded so that cu_seqlens_padded =
cu_seqlens_padded[: torch.argmin(cu_seqlens_padded).item()] and
cu_seqlens_unpadded = cu_seqlens_unpadded[:
torch.argmin(cu_seqlens_unpadded).item()], matching the existing .item() usage
in the other branches for consistent, device-safe indexing of cu_seqlens_padded
and cu_seqlens_unpadded.

Comment on lines 46 to 55

```python
if cu_seqlens_argmin is not None:
    argmin_idx = cu_seqlens_argmin.item()
    assert argmin_idx == 0 or cu_seqlens_padded[argmin_idx] == -1  # cu_seqlens padding is -1
    cu_seqlens_padded = cu_seqlens_padded[:argmin_idx]
elif torch.min(cu_seqlens_padded) == -1:
    cu_seqlens_padded = cu_seqlens_padded[: cu_seqlens_argmin.item()]
else:
    cu_seqlens_padded = cu_seqlens_padded[: torch.argmin(cu_seqlens_padded)]

if cu_seqlens_unpadded is not None:
    if cu_seqlens_unpadded_argmin is not None:
        argmin_idx = cu_seqlens_unpadded_argmin.item()
        assert argmin_idx == 0 or cu_seqlens_unpadded[argmin_idx] == -1  # cu_seqlens padding is -1
        cu_seqlens_unpadded = cu_seqlens_unpadded[:argmin_idx]
    elif torch.min(cu_seqlens_unpadded) == -1:
        cu_seqlens_unpadded = cu_seqlens_unpadded[: cu_seqlens_unpadded_argmin.item()]
    else:
        cu_seqlens_unpadded = cu_seqlens_unpadded[: torch.argmin(cu_seqlens_unpadded)]
```
Contributor


⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🌐 Web query:

PyTorch tensor slicing with 0-dimensional tensor indices CUDA CPU compatibility

💡 Result:

In PyTorch, a 0‑dim tensor used as an index is still a “tensor index” (advanced indexing), so PyTorch applies the device/dtype rules for index tensors:

  • Index tensors must be on CPU or on the same device as the tensor being indexed; otherwise you’ll get errors like: “indices should be either on cpu or on the same device as the indexed tensor”. [1]
  • On CUDA, some invalid tensor-indexing cases (e.g., out-of-bounds when the index is a tensor) may surface as a device-side assert rather than a clean IndexError, which can be harder to catch/debug. [2]

Practical patterns for CPU/CUDA compatibility

1) If you want “scalar slicing”, convert the 0‑dim tensor to a Python int

```python
i = int(i0.item())  # i0 can be a CPU or CUDA 0-dim tensor
y = x[i]            # basic indexing; avoids tensor-index device rules
```

This is the most portable fix. (Note: .item() on a CUDA tensor synchronizes.)

2) If you want to keep it as a tensor index, keep devices consistent

```python
idx = idx.to(x.device)  # and ensure integer dtype (usually torch.long)
y = x[idx]
```

3) Debugging CUDA “device-side assert” from indexing
If you hit a device-side assert during indexing, reproduce with CUDA_LAUNCH_BLOCKING=1 to get a more accurate stack trace; this class of issue is commonly triggered by invalid CUDA indexing with tensor indices. [2]

Sources:
[1] PyTorch Forums: “Indices should be either on cpu or on the same device as the indexed tensor” (discussion of the device rule for index tensors). [1]
[2] PyTorch GitHub issue discussing CUDA tensor indexing leading to device-side asserts instead of IndexError. [2]
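Pattern (1) can be demonstrated without torch. `ZeroDim` below is a hypothetical stand-in for a 0-dim tensor (only `.item()` is modeled); it shows why converting the bound to a Python int sidesteps the index-tensor device rules entirely:

```python
class ZeroDim:
    # Hypothetical stand-in for a 0-dim tensor. Real torch tensors
    # also expose .item(), which returns a Python scalar (and
    # synchronizes when the tensor lives on a CUDA device).
    def __init__(self, value):
        self.value = value

    def item(self):
        return self.value

def slice_upto(xs, idx):
    # Pattern (1): convert the bound to a Python int first, so the
    # slice uses basic indexing and no device matching is required.
    return xs[: int(idx.item())]

print(slice_upto([0, 5, 9, 12, -1, -1], ZeroDim(4)))  # [0, 5, 9, 12]
```

With a real 0-dim CUDA tensor as the bound, the same `int(idx.item())` conversion is what makes the slice device-safe.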


🏁 Script executed:

```shell
# First, let's examine the actual file to understand the context
cat -n src/megatron/bridge/training/utils/packed_seq_utils.py | head -70
```

Repository: NVIDIA-NeMo/Megatron-Bridge

Length of output: 3436


🏁 Script executed:

```shell
# Let's verify the full context and see if there's any processing of these cumulative seqlens
cat -n src/megatron/bridge/training/utils/packed_seq_utils.py | sed -n '36,70p'
```

Repository: NVIDIA-NeMo/Megatron-Bridge

Length of output: 1818


Convert torch.argmin() results to Python int for CUDA compatibility and consistent slicing behavior.

Lines 49 and 55 use torch.argmin() directly as a slice bound, which returns a 0-dimensional tensor. On CUDA, this can trigger device-side assertion errors because PyTorch requires index tensors to match the device of the indexed tensor. Lines 47 and 53 already use .item() correctly; apply the same pattern to the fallback paths.

Proposed fix

```diff
     if cu_seqlens_argmin is not None:
         cu_seqlens_padded = cu_seqlens_padded[: cu_seqlens_argmin.item()]
     else:
-        cu_seqlens_padded = cu_seqlens_padded[: torch.argmin(cu_seqlens_padded)]
+        cu_seqlens_padded = cu_seqlens_padded[: torch.argmin(cu_seqlens_padded).item()]

     if cu_seqlens_unpadded is not None:
         if cu_seqlens_unpadded_argmin is not None:
             cu_seqlens_unpadded = cu_seqlens_unpadded[: cu_seqlens_unpadded_argmin.item()]
         else:
-            cu_seqlens_unpadded = cu_seqlens_unpadded[: torch.argmin(cu_seqlens_unpadded)]
+            cu_seqlens_unpadded = cu_seqlens_unpadded[: torch.argmin(cu_seqlens_unpadded).item()]
```

@ko3n1g ko3n1g merged commit bf5ee44 into r0.3.0 Feb 3, 2026
45 of 47 checks passed
@ko3n1g ko3n1g deleted the cherry-pick-2180-r0.3.0 branch February 3, 2026 20:48

3 participants