cp: Revert packed seq extra checks (2180) into r0.3.0 #2196

Merged

ko3n1g merged 1 commit into r0.3.0 from cherry-pick-2180-r0.3.0 on Feb 3, 2026
Conversation

@ko3n1g
Contributor

@ko3n1g ko3n1g commented Feb 3, 2026

beep boop [🤖]: Hi @cuichenx 👋,

we've cherry-picked #2180 into r0.3.0 for you! 🚀

Please review and approve this cherry-pick at your convenience!

Summary by CodeRabbit

  • Refactor
    • Optimized sequence padding handling logic for improved efficiency.
  • Tests
    • Removed redundant test cases to streamline test coverage.

Signed-off-by: Chen Cui <chcui@nvidia.com>
Signed-off-by: NeMo Bot <nemo-bot@nvidia.com>
@ko3n1g
Contributor Author

ko3n1g commented Feb 3, 2026

/ok to test 75f3195

@copy-pr-bot

copy-pr-bot bot commented Feb 3, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@coderabbitai
Contributor

coderabbitai bot commented Feb 3, 2026

📝 Walkthrough

Refactored packed sequence utilities to replace assertion-based conditional logic with direct tensor slicing operations. Simplified control flow by removing assertions that enforced specific argmin positions and padding values. Corresponding unit tests for no-padding and unpadded scenarios were removed to align with the updated implementation.

Changes

  • Packed Sequence Utils Refactoring — src/megatron/bridge/training/utils/packed_seq_utils.py
    Replaced assertion-based logic with direct slicing using cu_seqlens_argmin or torch.argmin. Simplified conditional branches for both padded and unpadded sequence-length handling.
  • Unit Test Cleanup — tests/unit_tests/training/test_gpt_step.py
    Removed three test functions (test_packed_seq_params_no_padding, test_packed_seq_params_no_padding_in_cu_seqlens, test_packed_seq_params_cu_seqlens_unpadded_no_padding) and the associated assertions that validated no-padding scenarios.
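The trimming that the refactor performs can be sketched in plain Python. This is a hypothetical illustration, not the repository's API: `trim_cu_seqlens` and the list stand-in for the tensor are invented for this sketch, and it assumes the cumulative-seqlen buffer is monotonically increasing and right-padded with -1, so the first minimum marks where padding begins.

```python
def trim_cu_seqlens(cu_seqlens, argmin=None):
    # Mirrors the refactored flow: prefer a precomputed argmin
    # (cu_seqlens_argmin in the real code); otherwise locate the
    # first minimum ourselves, like torch.argmin() on a -1-padded
    # buffer, and slice the padding tail off.
    if argmin is None:
        argmin = cu_seqlens.index(min(cu_seqlens))
    return cu_seqlens[:argmin]

# Three packed sequences of lengths 5, 4, 3, with two -1 padding slots:
print(trim_cu_seqlens([0, 5, 9, 12, -1, -1]))  # [0, 5, 9, 12]
```

Note that the slice bound is an ordinary Python int here; the real code slices torch tensors, where the distinction between a Python int and a 0-dim tensor index matters (see the review comment below).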

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs

Suggested labels

r0.3.0

Suggested reviewers

  • yaoyu-33
🚥 Pre-merge checks | ✅ 3 passed | ❌ 1 failed

❌ Failed checks (1 warning)
  • Test Results For Major Changes — ⚠️ Warning. The PR removes validation checks and 48 lines of tests (net -4 lines) without test results, regression analysis, or explanation; a CUDA compatibility bug was identified with missing .item() conversions. Resolution: add test results demonstrating the removed checks don't cause regressions, fix the CUDA bug with .item() conversions on torch.argmin() results, and include regression testing across CPU and CUDA.

✅ Passed checks (3 passed)
  • Description Check — ✅ Passed. Check skipped: CodeRabbit's high-level summary is enabled.
  • Title Check — ✅ Passed. The title accurately describes the PR as a cherry-pick of a revert (PR #2180) into the r0.3.0 branch, which matches the changeset that removes assertions and test cases from packed sequence utilities.
  • Docstring Coverage — ✅ Passed. Docstring coverage is 100.00%, which meets the required threshold of 80.00%.


Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@src/megatron/bridge/training/utils/packed_seq_utils.py`:
- Around line 46-55: The slice bounds use torch.argmin() (which returns a 0-d
tensor) causing CUDA/device-index issues; update the fallback branches to
convert the argmin results to Python ints (use .item()) before slicing
cu_seqlens_padded and cu_seqlens_unpadded so that cu_seqlens_padded =
cu_seqlens_padded[: torch.argmin(cu_seqlens_padded).item()] and
cu_seqlens_unpadded = cu_seqlens_unpadded[:
torch.argmin(cu_seqlens_unpadded).item()], matching the existing .item() usage
in the other branches for consistent, device-safe indexing of cu_seqlens_padded
and cu_seqlens_unpadded.

Comment on lines 46 to 55

```python
if cu_seqlens_argmin is not None:
    argmin_idx = cu_seqlens_argmin.item()
    assert argmin_idx == 0 or cu_seqlens_padded[argmin_idx] == -1  # cu_seqlens padding is -1
    cu_seqlens_padded = cu_seqlens_padded[:argmin_idx]
elif torch.min(cu_seqlens_padded) == -1:
    cu_seqlens_padded = cu_seqlens_padded[: cu_seqlens_argmin.item()]
else:
    cu_seqlens_padded = cu_seqlens_padded[: torch.argmin(cu_seqlens_padded)]

if cu_seqlens_unpadded is not None:
    if cu_seqlens_unpadded_argmin is not None:
        argmin_idx = cu_seqlens_unpadded_argmin.item()
        assert argmin_idx == 0 or cu_seqlens_unpadded[argmin_idx] == -1  # cu_seqlens padding is -1
        cu_seqlens_unpadded = cu_seqlens_unpadded[:argmin_idx]
    elif torch.min(cu_seqlens_unpadded) == -1:
        cu_seqlens_unpadded = cu_seqlens_unpadded[: cu_seqlens_unpadded_argmin.item()]
    else:
        cu_seqlens_unpadded = cu_seqlens_unpadded[: torch.argmin(cu_seqlens_unpadded)]
```
Contributor


⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🌐 Web query:

PyTorch tensor slicing with 0-dimensional tensor indices CUDA CPU compatibility

💡 Result:

In PyTorch, a 0‑dim tensor used as an index is still a “tensor index” (advanced indexing), so PyTorch applies the device/dtype rules for index tensors:

  • Index tensors must be on CPU or on the same device as the tensor being indexed; otherwise you’ll get errors like: “indices should be either on cpu or on the same device as the indexed tensor”. [1]
  • On CUDA, some invalid tensor-indexing cases (e.g., out-of-bounds when the index is a tensor) may surface as a device-side assert rather than a clean IndexError, which can be harder to catch/debug. [2]

Practical patterns for CPU/CUDA compatibility

1) If you want “scalar slicing”, convert the 0‑dim tensor to a Python int

```python
i = int(i0.item())  # i0 can be a CPU or CUDA 0-dim tensor
y = x[i]            # basic indexing; avoids tensor-index device rules
```

This is the most portable fix. (Note: .item() on a CUDA tensor synchronizes.)

2) If you want to keep it as a tensor index, keep devices consistent

```python
idx = idx.to(x.device)  # and ensure integer dtype (usually torch.long)
y = x[idx]
```

3) Debugging CUDA “device-side assert” from indexing
If you hit a device-side assert during indexing, reproduce with CUDA_LAUNCH_BLOCKING=1 to get a more accurate stack trace; this class of issue is commonly triggered by invalid CUDA indexing with tensor indices. [2]

Sources:
[1] PyTorch Forums: “Indices should be either on cpu or on the same device as the indexed tensor” (discussion of the device rule for index tensors). [1]
[2] PyTorch GitHub issue discussing CUDA tensor indexing leading to device-side asserts instead of IndexError. [2]
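Pattern (1) can be demonstrated without torch. `ZeroDim` below is a hypothetical stand-in for a 0-dim tensor (only `.item()` is modeled); it shows why converting the bound to a Python int sidesteps the index-tensor device rules entirely:

```python
class ZeroDim:
    # Hypothetical stand-in for a 0-dim tensor. Real torch tensors
    # also expose .item(), which returns a Python scalar (and
    # synchronizes when the tensor lives on a CUDA device).
    def __init__(self, value):
        self.value = value

    def item(self):
        return self.value

def slice_upto(xs, idx):
    # Pattern (1): convert the bound to a Python int first, so the
    # slice uses basic indexing and no device matching is required.
    return xs[: int(idx.item())]

print(slice_upto([0, 5, 9, 12, -1, -1], ZeroDim(4)))  # [0, 5, 9, 12]
```

With a real 0-dim CUDA tensor as the bound, the same `int(idx.item())` conversion is what makes the slice device-safe.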


🏁 Script executed:

```shell
# First, let's examine the actual file to understand the context
cat -n src/megatron/bridge/training/utils/packed_seq_utils.py | head -70
```

Repository: NVIDIA-NeMo/Megatron-Bridge

Length of output: 3436


🏁 Script executed:

```shell
# Let's verify the full context and see if there's any processing of these cumulative seqlens
cat -n src/megatron/bridge/training/utils/packed_seq_utils.py | sed -n '36,70p'
```

Repository: NVIDIA-NeMo/Megatron-Bridge

Length of output: 1818


Convert torch.argmin() results to Python int for CUDA compatibility and consistent slicing behavior.

Lines 49 and 55 use torch.argmin() directly as a slice bound, which returns a 0-dimensional tensor. On CUDA, this can trigger device-side assertion errors because PyTorch requires index tensors to match the device of the indexed tensor. Lines 47 and 53 already use .item() correctly; apply the same pattern to the fallback paths.

Proposed fix

```diff
     if cu_seqlens_argmin is not None:
         cu_seqlens_padded = cu_seqlens_padded[: cu_seqlens_argmin.item()]
     else:
-        cu_seqlens_padded = cu_seqlens_padded[: torch.argmin(cu_seqlens_padded)]
+        cu_seqlens_padded = cu_seqlens_padded[: torch.argmin(cu_seqlens_padded).item()]

     if cu_seqlens_unpadded is not None:
         if cu_seqlens_unpadded_argmin is not None:
             cu_seqlens_unpadded = cu_seqlens_unpadded[: cu_seqlens_unpadded_argmin.item()]
         else:
-            cu_seqlens_unpadded = cu_seqlens_unpadded[: torch.argmin(cu_seqlens_unpadded)]
+            cu_seqlens_unpadded = cu_seqlens_unpadded[: torch.argmin(cu_seqlens_unpadded).item()]
```

@ko3n1g ko3n1g merged commit bf5ee44 into r0.3.0 Feb 3, 2026
45 of 47 checks passed
@ko3n1g ko3n1g deleted the cherry-pick-2180-r0.3.0 branch February 3, 2026 20:48

3 participants