[None][feat] reuse triton slicing kernel for GDN prefill transpose #12737
Conversation
📝 Walkthrough
The changes refactor tensor transposition and slicing logic for prefill tokens by introducing a new utility function.
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes
🚥 Pre-merge checks: ✅ 1 | ❌ 2
❌ Failed checks (2 warnings)
✅ Passed checks (1 passed)
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
tensorrt_llm/_torch/models/modeling_qwen3_next.py (1)
701-709: ⚠️ Potential issue | 🟠 Major
Explicitly copy back the decode result to ensure correctness across all backends.

Lines 701–709 assign the decode result to `mixed_qkv_d` without explicitly copying it back to the split view, whereas the prefill result is explicitly copied at line 709 via `mixed_qkv_p.copy_(...)`. This asymmetry creates a latent bug: if `causal_conv1d_update` materializes a new tensor (as the Triton backend does), the parent `mixed_qkv` tensor retains stale decode rows, which corrupts the subsequent Q/K/V split.
Suggested fix with data_ptr guard:

```diff
-    mixed_qkv_d = causal_conv1d_update(
+    mixed_qkv_d_out = causal_conv1d_update(
         mixed_qkv_d,
         conv_states_to_use,
         self.conv1d.weight,
         self.conv1d.bias,
         activation=self.activation,
         conv_state_indices=state_indices_d,
     )
+    if mixed_qkv_d_out.data_ptr() != mixed_qkv_d.data_ptr():
+        mixed_qkv_d.copy_(mixed_qkv_d_out)
     mixed_qkv_p.copy_(mixed_qkv_p_t.transpose(0, 1))
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tensorrt_llm/_torch/models/modeling_qwen3_next.py` around lines 701-709: the decode path may materialize a new tensor in causal_conv1d_update, causing mixed_qkv's decode rows to remain stale; mirror the prefill handling by explicitly copying the updated decode data back into the split view. After calling causal_conv1d_update (for mixed_qkv_d), perform an in-place copy into the original decode view (the same way mixed_qkv_p.copy_(...) is used) and optionally guard with a data_ptr comparison between the returned tensor and the destination to avoid redundant copies; update references around mixed_qkv_d, mixed_qkv_p.copy_, causal_conv1d_update and mixed_qkv accordingly.
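The copy-back-with-guard pattern described above can be illustrated outside of torch. Below is a minimal NumPy sketch (toy array names, not the model's actual tensors; `np.shares_memory` stands in for torch's `data_ptr()` comparison):

```python
import numpy as np

# Parent buffer holding decode rows alongside other rows (toy sizes).
mixed_qkv = np.zeros((4, 3))
mixed_qkv_d = mixed_qkv[:2]          # decode rows: a view into the parent

# Simulate a backend that materializes a NEW array instead of writing
# in place (as the review says the Triton causal_conv1d_update does).
result = mixed_qkv_d + 1.0

# Guarded copy-back: only copy when the result is a fresh buffer,
# so the parent tensor sees the updated decode rows.
if not np.shares_memory(result, mixed_qkv_d):
    mixed_qkv_d[...] = result
```

Without the copy-back, `mixed_qkv[:2]` would still hold the stale zeros even though `result` carries the updated values, which is exactly the latent bug the reviewer flags.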
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@tensorrt_llm/_torch/modules/mamba/fuse_elementwise_ops.py`:
- Around line 54-80: The store-side offset computation in the Triton kernel
_extract_transpose_prefill_kernel is done in 32-bit arithmetic and can overflow
when width * num_prefill_tokens exceeds 2,147,483,647; mirror the fix used for
src_offsets by casting the operands to tl.int64 (or performing the
multiplication in tl.int64) before computing dst_offsets and before any
subsequent index arithmetic or calls to tl.store so the write addresses cannot
overflow. Locate the dst_offsets/dst_ptr computation inside
_extract_transpose_prefill_kernel and change the multiplication/additions to use
tl.int64 (e.g., cast row/col/width or the product) and ensure the tl.store uses
the 64-bit offset variable.
---
Outside diff comments:
In `@tensorrt_llm/_torch/models/modeling_qwen3_next.py`:
- Around line 701-709: The decode path may materialize a new tensor in
causal_conv1d_update causing mixed_qkv's decode rows to remain stale; mirror the
prefill handling by explicitly copying the updated decode data back into the
split view. After calling causal_conv1d_update (for mixed_qkv_d), perform an
in-place copy into the original decode view (the same way mixed_qkv_p.copy_(...)
is used) and optionally guard with a data_ptr comparison between the returned
tensor and the destination to avoid redundant copies; update references around
mixed_qkv_d, mixed_qkv_p.copy_, causal_conv1d_update and mixed_qkv accordingly.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 124c4bc8-c452-4797-af80-6edb29df5311
📒 Files selected for processing (2)
- tensorrt_llm/_torch/models/modeling_qwen3_next.py
- tensorrt_llm/_torch/modules/mamba/fuse_elementwise_ops.py
Force-pushed: 1f78c5a to d522412
/bot run --add-multi-gpu-test

PR_Github #41693 [ run ] triggered by Bot. Commit:

PR_Github #41693 [ run ] completed with state

/bot run

PR_Github #41775 [ run ] triggered by Bot. Commit:

PR_Github #41775 [ run ] completed with state
Signed-off-by: nv-guomingz <137257613+nv-guomingz@users.noreply.github.com>
Force-pushed: d522412 to 114834c
/bot run

PR_Github #41867 [ run ] triggered by Bot. Commit:

PR_Github #41867 [ run ] completed with state

/bot run
rosenrodt left a comment
LGTM as long as tests pass
PR_Github #41886 [ run ] triggered by Bot. Commit:

PR_Github #41886 [ run ] completed with state
…VIDIA#12737) Signed-off-by: nv-guomingz <137257613+nv-guomingz@users.noreply.github.com>
Summary by CodeRabbit
Description
Test Coverage
PR Checklist
Please review the following before submitting your PR:
- [ ] PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
- [ ] PR follows TRT-LLM CODING GUIDELINES to the best of your knowledge.
- [ ] Test cases are provided for new code paths (see test instructions).
- [ ] Any new dependencies have been scanned for license and vulnerabilities.
- [ ] CODEOWNERS updated if ownership changes.
- [ ] Documentation updated as needed.
- [ ] Update tava architecture diagram if there is a significant design change in PR.
- [ ] The reviewers assigned automatically/manually are appropriate for the PR.
Please check this after reviewing the above items as appropriate for this PR.
GitHub Bot Help
To see a list of available CI bot commands, please comment `/bot help`.