
Conversation

Collaborator

@pggPL pggPL commented Nov 6, 2025

Description

JAX calls nvte_fused_attn_fwd_kvpacked(), nvte_fused_attn_fwd_qkvpacked(), or nvte_fused_attn_fwd(). The first two will be deprecated by #2287, so this PR changes the JAX extension code to use only the last one.
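
For orientation, a minimal sketch of the pointer handling this PR moves into the JAX extension, pieced together from the diff hunks quoted in the review comments below. The helper name, its signature, and the surrounding context are hypothetical, and the KV-packed stride uses the reviewer-suggested v_head_dim form (a runtime check enforces qk_head_dim == v_head_dim):

// Hypothetical helper, not the literal implementation. Assumes <cstdint>, the
// NVTE_QKV_Layout_Group enum from transformer_engine/fused_attn.h, and the JAX
// extension's DType/typeToSize utilities. typeToSize(dtype) is the element size in bytes.
static void GetQKVPointers(NVTE_QKV_Layout_Group layout_group, void *q, void *k, void *v,
                           DType dtype, size_t attn_heads, size_t num_gqa_groups,
                           size_t qk_head_dim, size_t v_head_dim,
                           void **q_ptr, void **k_ptr, void **v_ptr) {
  *q_ptr = q;
  *k_ptr = k;
  *v_ptr = v;
  if (layout_group == NVTE_QKV_Layout_Group::NVTE_3HD) {
    // QKV packed in q: [batch*seqlen, 3, attn_heads, qk_head_dim]
    size_t stride = typeToSize(dtype) * attn_heads * qk_head_dim;
    *k_ptr = static_cast<void *>(static_cast<int8_t *>(q) + stride);
    *v_ptr = static_cast<void *>(static_cast<int8_t *>(q) + 2 * stride);
  } else if (layout_group == NVTE_QKV_Layout_Group::NVTE_HD_2HD) {
    // KV packed in k: [batch*seqlen, 2, num_gqa_groups, v_head_dim]
    size_t stride = typeToSize(dtype) * num_gqa_groups * v_head_dim;
    *v_ptr = static_cast<void *>(static_cast<int8_t *>(k) + stride);
  }
  // NVTE_HD_HD_HD: q, k, v are already separate and used as-is.
}

The resulting pointers are wrapped with shape/dtype metadata (e.g. TensorWrapper(q_ptr, q_shape, dtype), as seen in the diff) and handed to the single nvte_fused_attn_fwd / nvte_fused_attn_bwd entry point, which is why the packed-specific APIs are no longer needed on the JAX side.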

Type of change

  • Documentation change (change only to the documentation, either a fix or new content)
  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Infra/Build change
  • Code refactoring

Checklist:

  • I have read and followed the contributing guidelines
  • The functionality is complete
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

Signed-off-by: Pawel Gadzinski <[email protected]>
@pggPL pggPL changed the title from "[JAX] Make all jax attention calls to use non-packed common calls" to "[JAX] Make all jax attention calls use non-packed common calls" on Nov 6, 2025
Contributor

@greptile-apps greptile-apps bot left a comment


Greptile Overview

Greptile Summary

Refactors JAX attention extension to use only nvte_fused_attn_fwd/bwd instead of the deprecated packed variants (nvte_fused_attn_fwd_qkvpacked and nvte_fused_attn_fwd_kvpacked). The PR moves pointer arithmetic from the common API layer into the JAX extension code.

Key changes:

  • Unified all three layout types (QKV packed, KV packed, separate) to call single nvte_fused_attn_fwd/bwd API
  • Added pointer arithmetic in JAX extension to extract K and V pointers from packed tensors
  • Removed unused tensor shape definitions and layout-specific branching in workspace size calculations
  • Updated gradient zeroing logic in backward pass to correctly handle packed tensor memory layouts

Critical issue found:

  • Lines 287 and 517: Stride calculation for KV-packed layout uses qk_head_dim but should use v_head_dim since KV packed tensors have shape [batch*seqlen, 2, num_gqa_groups, v_head_dim]
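
To make the flagged offset concrete, a worked example with hypothetical sizes (not code from the PR):

// Byte offset of the V slab inside one packed KV row of shape
// [2, num_gqa_groups, v_head_dim], assuming bf16 (2 bytes per element),
// num_gqa_groups = 8, v_head_dim = 128:
size_t stride = typeToSize(dtype) * num_gqa_groups * v_head_dim;  // 2 * 8 * 128 = 2048 bytes
// Using qk_head_dim gives the same value only while qk_head_dim == v_head_dim,
// which an NVTE_CHECK currently enforces; the two formulas diverge if that
// restriction is ever relaxed.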

Confidence Score: 1/5

  • This PR contains critical pointer arithmetic bugs that will cause memory corruption or incorrect results
  • Score of 1 (critical issues) due to incorrect stride calculation in the KV-packed layout at lines 287 and 517. Using qk_head_dim instead of v_head_dim will cause the V pointer to point to the wrong memory location when these dimensions differ, leading to incorrect attention computation or potential memory access violations
  • transformer_engine/jax/csrc/extensions/attention.cpp - lines 287 and 517 must be fixed before merge

Important Files Changed

File Analysis

Filename Score Overview
transformer_engine/jax/csrc/extensions/attention.cpp 1/5 Refactors JAX attention to use only nvte_fused_attn_fwd/bwd, but has critical pointer arithmetic bug in KV-packed layout stride calculation using qk_head_dim instead of v_head_dim

Sequence Diagram

sequenceDiagram
    participant JAX as JAX Extension
    participant Old as Old API (deprecated)
    participant New as nvte_fused_attn_fwd
    participant Kernel as Attention Kernel

    Note over JAX,Kernel: Before PR #2358
    JAX->>Old: nvte_fused_attn_fwd_qkvpacked(QKV)
    Old->>Kernel: Extract Q,K,V pointers internally
    Kernel-->>Old: Result
    Old-->>JAX: Output

    JAX->>Old: nvte_fused_attn_fwd_kvpacked(Q, KV)
    Old->>Kernel: Extract K,V pointers internally
    Kernel-->>Old: Result
    Old-->>JAX: Output

    Note over JAX,Kernel: After PR #2358
    JAX->>JAX: Extract Q,K,V pointers from packed data
    JAX->>New: nvte_fused_attn_fwd(Q, K, V)
    New->>Kernel: Forward separate pointers
    Kernel-->>New: Result
    New-->>JAX: Output

    Note over JAX: Pointer arithmetic handles:<br/>- NVTE_3HD: QKV packed<br/>- NVTE_HD_2HD: KV packed<br/>- NVTE_HD_HD_HD: All separate

1 file reviewed, 2 comments


// Python passes: q=query, k=packed_kv, v=dummy
// Extract V pointer from the packed k data
// Note: For KV packed, qk_head_dim must equal v_head_dim
size_t stride = (typeToSize(dtype) * num_gqa_groups * qk_head_dim);
Contributor


logic: stride calculation uses qk_head_dim but KV packed tensor has shape [batch*seqlen, 2, num_gqa_groups, v_head_dim] - should use v_head_dim

Suggested change
size_t stride = (typeToSize(dtype) * num_gqa_groups * qk_head_dim);
size_t stride = (typeToSize(dtype) * num_gqa_groups * v_head_dim);

Collaborator


I think this comment can be ignored

cudaMemsetAsync(dv, 0, transformer_engine::jax::product(v_shape) * typeToSize(dtype), stream);
// Q separate, KV packed in k: [batch*seqlen, 2, num_gqa_groups, dim]
// Note: For KV packed, qk_head_dim must equal v_head_dim
size_t stride = (typeToSize(dtype) * num_gqa_groups * qk_head_dim);
Contributor


logic: stride calculation uses qk_head_dim but KV packed tensor has shape [batch*seqlen, 2, num_gqa_groups, v_head_dim] - should use v_head_dim

Suggested change
size_t stride = (typeToSize(dtype) * num_gqa_groups * qk_head_dim);
size_t stride = (typeToSize(dtype) * num_gqa_groups * v_head_dim);

pggPL and others added 2 commits November 6, 2025 23:01
Contributor

@greptile-apps greptile-apps bot left a comment


Greptile Overview

Greptile Summary

Refactors JAX attention code to use unified nvte_fused_attn_fwd() and nvte_fused_attn_bwd() APIs instead of deprecated packed variants (nvte_fused_attn_fwd_kvpacked() and nvte_fused_attn_fwd_qkvpacked()). The pointer arithmetic for extracting K/V pointers from packed tensors is now handled in the JAX layer.

Key Changes:

  • Removed conditional calls to packed-specific attention APIs
  • Added pointer calculation logic to extract K/V pointers from packed QKV/KV tensors based on layout
  • Unified all attention calls to use the single nvte_fused_attn_fwd/bwd API
  • Updated workspace size calculation functions similarly

Issue Found:

  • For the KV-packed layout (NVTE_HD_2HD), the stride calculation uses qk_head_dim but should use v_head_dim to match the actual tensor shape [batch*seqlen, 2, num_gqa_groups, v_head_dim]. While the two are enforced to be equal by a runtime check, using v_head_dim is semantically correct.

Confidence Score: 4/5

  • Safe to merge after fixing stride calculation to use v_head_dim instead of qk_head_dim for KV-packed layout
  • The refactoring is well-structured and aligns with the goal of deprecating packed-specific APIs. However, the stride calculation issue (using qk_head_dim instead of v_head_dim) in the KV-packed layout needs to be fixed for semantic correctness, even though runtime checks enforce equality. The logic is sound otherwise, with proper handling of different layouts and appropriate memory clearing for ragged sequences.
  • transformer_engine/jax/csrc/extensions/attention.cpp - Fix stride calculation on lines 290 and 523 to use v_head_dim instead of qk_head_dim

Important Files Changed

File Analysis

Filename Score Overview
transformer_engine/jax/csrc/extensions/attention.cpp 4/5 Refactors JAX attention to use unified nvte_fused_attn_fwd/bwd API; has stride calculation issue using qk_head_dim instead of v_head_dim for KV-packed layout

Sequence Diagram

sequenceDiagram
    participant JAX as JAX Python Layer
    participant FwdImpl as FusedAttnForwardImpl
    participant LayoutCheck as Layout Detection
    participant PtrCalc as Pointer Calculation
    participant API as nvte_fused_attn_fwd

    JAX->>FwdImpl: Call with q, k, v pointers
    FwdImpl->>LayoutCheck: Check layout_group (NVTE_3HD/HD_2HD/HD_HD_HD)
    
    alt NVTE_3HD (QKV packed)
        LayoutCheck->>PtrCalc: Extract K, V from packed Q
        PtrCalc->>PtrCalc: k_ptr = q + stride<br/>v_ptr = q + 2*stride<br/>stride = typeSize * attn_heads * qk_head_dim
    else NVTE_HD_2HD (KV packed)
        LayoutCheck->>PtrCalc: Extract V from packed K
        PtrCalc->>PtrCalc: v_ptr = k + stride<br/>stride = typeSize * num_gqa_groups * qk_head_dim
    else NVTE_HD_HD_HD (separate)
        LayoutCheck->>PtrCalc: Use pointers as-is
    end
    
    PtrCalc->>API: Call with separate q_ptr, k_ptr, v_ptr
    API-->>FwdImpl: Return results
    FwdImpl-->>JAX: Return output

1 file reviewed, no comments


Collaborator Author

pggPL commented Nov 6, 2025

/te-ci jax

phu0ngng previously approved these changes Nov 13, 2025
Collaborator

@phu0ngng phu0ngng left a comment


LGTM. Thanks.

Contributor

greptile-apps bot commented Nov 13, 2025

Greptile Overview

Greptile Summary

Refactors JAX attention extension to use only nvte_fused_attn_fwd and nvte_fused_attn_bwd instead of deprecated nvte_fused_attn_fwd_kvpacked and nvte_fused_attn_fwd_qkvpacked functions (related to #2287).

Key changes:

  • Removes conditional branching based on layout groups in forward/backward passes
  • Adds pointer arithmetic to extract K and V pointers from packed QKV/KV tensors
  • Simplifies workspace size calculation by removing packed-specific tensor definitions
  • Improves cudaMemset logic for ragged tensors to correctly handle packed layouts

The refactoring consolidates three code paths into one, reducing duplication while maintaining functional equivalence.
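
A hedged sketch of the zeroing pattern described above, assuming the gradient buffers mirror the packed input buffers (dq spans the packed QKV buffer for NVTE_3HD, dk spans the packed KV buffer for NVTE_HD_2HD); the names is_ragged, dq, dk, dv and the shape variables are illustrative, not the exact code:

// Illustrative only: for ragged (THD) batches, zero the gradient buffers up
// front so that padding regions the kernel never writes are well defined.
// One cudaMemsetAsync per physical buffer suffices for packed layouts.
if (is_ragged) {
  if (layout_group == NVTE_QKV_Layout_Group::NVTE_3HD) {
    // dQ, dK, dV share one packed buffer; q_shape covers [batch*seqlen, 3, heads, dim].
    cudaMemsetAsync(dq, 0, transformer_engine::jax::product(q_shape) * typeToSize(dtype), stream);
  } else if (layout_group == NVTE_QKV_Layout_Group::NVTE_HD_2HD) {
    // dQ separate; dK and dV share the packed buffer behind dk.
    cudaMemsetAsync(dq, 0, transformer_engine::jax::product(q_shape) * typeToSize(dtype), stream);
    cudaMemsetAsync(dk, 0, transformer_engine::jax::product(k_shape) * typeToSize(dtype), stream);
  } else {  // NVTE_HD_HD_HD: three separate buffers.
    cudaMemsetAsync(dq, 0, transformer_engine::jax::product(q_shape) * typeToSize(dtype), stream);
    cudaMemsetAsync(dk, 0, transformer_engine::jax::product(k_shape) * typeToSize(dtype), stream);
    cudaMemsetAsync(dv, 0, transformer_engine::jax::product(v_shape) * typeToSize(dtype), stream);
  }
}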

Confidence Score: 3/5

  • PR has logical correctness issues in stride calculation that should be fixed before merging
  • The refactoring successfully consolidates code paths and the overall logic is sound. However, there are stride calculation issues in the KV packed case (lines 290, 523) that use qk_head_dim instead of v_head_dim. While these are enforced to be equal via NVTE_CHECK, using v_head_dim is more semantically correct since we're computing the offset to the V tensor. This has already been flagged in previous comments.
  • Pay close attention to transformer_engine/jax/csrc/extensions/attention.cpp lines 290 and 523 - stride calculations for KV packed tensors

Important Files Changed

File Analysis

Filename Score Overview
transformer_engine/jax/csrc/extensions/attention.cpp 4/5 Refactors JAX attention calls to use only nvte_fused_attn_fwd/bwd instead of deprecated packed variants. Includes pointer arithmetic for unpacking QKV/KV tensors and improved cudaMemset logic for ragged tensors.

Sequence Diagram

sequenceDiagram
    participant JAX as JAX Python
    participant CPP as attention.cpp
    participant NVTE as nvte_fused_attn_fwd
    
    JAX->>CPP: FusedAttnForwardImpl(q, k, v, ...)
    Note over CPP: Determine layout_group from qkv_layout
    
    alt QKV Packed (NVTE_3HD)
        Note over CPP: Extract K, V from packed Q tensor<br/>stride = typeToSize * attn_heads * qk_head_dim<br/>k_ptr = q + stride<br/>v_ptr = q + 2*stride
    else KV Packed (NVTE_HD_2HD)
        Note over CPP: Extract V from packed K tensor<br/>stride = typeToSize * num_gqa_groups * qk_head_dim<br/>k_ptr = k<br/>v_ptr = k + stride
    else Separate (NVTE_HD_HD_HD)
        Note over CPP: Use pointers as-is<br/>q_ptr = q, k_ptr = k, v_ptr = v
    end
    
    CPP->>CPP: Create TensorWrappers(q_ptr, k_ptr, v_ptr)
    CPP->>NVTE: nvte_fused_attn_fwd(q_tensor, k_tensor, v_tensor, ...)
    NVTE-->>CPP: Compute attention
    CPP-->>JAX: Return output

Contributor

@greptile-apps greptile-apps bot left a comment


1 file reviewed, no comments


Collaborator Author

pggPL commented Nov 13, 2025

/te-ci jax

Collaborator

@KshitijLakhani KshitijLakhani left a comment


EDIT: Please ignore the below comment (I incorrectly thought that you had launched only L0 tests)

Please run L1 tests on this as well so as to exercise the L1 dist attn tests due to the nature of the change (though L0 should mostly be enough - better safe than sorry :) )

auto ragged_offset_tensor =
TensorWrapper(nullptr, std::vector<size_t>{num_segments + 1}, DType::kInt32);
if (layout_group == NVTE_QKV_Layout_Group::NVTE_3HD) {
NVTE_CHECK(q_max_seqlen == kv_max_seqlen, "q_max_seqlen must equal to kv_max_seqlen");
Collaborator


As part of this consolidation, will we lose this check? Is that okay, or does it need to be looked into?

"For QKV packed layout, qk_head_dim must equal v_head_dim");
size_t stride = (typeToSize(dtype) * attn_heads * qk_head_dim);
q_ptr = q;
k_ptr = static_cast<void *>(static_cast<int8_t *>(q) + stride);
Collaborator


Quick question @pggPL: why the choice of int8_t for static-casting the q and k void pointers?
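
Editorial context, not the author's reply: arithmetic on a void * is ill-formed in standard C++, and the stride here is a byte count, so the pointer is cast to a byte-sized type before the offset is added and then cast back to void *. A tiny self-contained illustration with made-up sizes:

#include <cstdint>
#include <vector>

int main() {
  // One hypothetical packed KV row: [2, num_gqa_groups = 8, head_dim = 128] in bf16 (2 bytes).
  std::vector<std::int8_t> packed_kv(2 * 8 * 128 * 2);
  void *kv = packed_kv.data();
  // Byte offset of the V slab that follows the K slab within the row.
  const std::size_t stride_bytes = 2 * 8 * 128;
  // void * cannot be offset directly, so cast to a byte-sized pointer first.
  void *v_ptr = static_cast<void *>(static_cast<std::int8_t *>(kv) + stride_bytes);
  (void)v_ptr;  // v_ptr now addresses the start of the V slab
  return 0;
}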

Comment on lines +272 to +274
// QKV packed in q: [batch*seqlen, 3, heads, dim]
// Python passes: q=packed_qkv, k=dummy, v=dummy
// Extract K and V pointers from the packed q data
Collaborator


Thanks for the comments

// Python passes: q=query, k=packed_kv, v=dummy
// Extract V pointer from the packed k data
// Note: For KV packed, qk_head_dim must equal v_head_dim
size_t stride = (typeToSize(dtype) * num_gqa_groups * qk_head_dim);
Collaborator


I think this comment can be ignored

@KshitijLakhani
Collaborator

Thanks for PR 2287. Quick nit from 2287: in calculate_qkv_stride, could you add "stride in bytes" in the comments instead of just "stride"?

v_shape = k_shape;
}

auto q_tensor = TensorWrapper(q_ptr, q_shape, dtype);
Collaborator


nit: Do we need these tensor wrappers? Or can we pass the pointers directly? They don't seem to do anything.

Collaborator

@mgoldfarb-nvidia mgoldfarb-nvidia left a comment


Overall LGTM assuming our CI passes.
