
Conversation

@cyanguwa (Collaborator) commented Sep 2, 2025

Description

This PR adds sink attention support (forward + backward) to TE-PyTorch; a reference sketch of the sink softmax follows the list below.

  • FusedAttention backend for FP16/BF16 and BSHD/SBHD formats (requires cuDNN 9.13.1+ and cudnn-frontend 1.14.1)
  • UnfusedDotProductAttention backend for FP32/FP16/BF16 and BSHD/SBHD formats
  • Context-parallel support for cp_comm_type="a2a" with the FusedAttention backend
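For reference, a minimal sketch of the sink ("off-by-one") softmax that this feature implements, in plain PyTorch rather than TE's actual kernels; the function name, BHSD layout, and per-head sink parameter are illustrative assumptions, not TE's API:

```python
import torch

def sink_attention_reference(q, k, v, sink):
    """Attention with a learnable per-head sink logit (illustrative only).

    q, k, v: [batch, heads, seq, dim]; sink: [heads] learnable logits.
    The sink joins the softmax as an extra column and absorbs probability
    mass, but contributes no value vector to the output.
    """
    scale = q.shape[-1] ** -0.5
    logits = torch.matmul(q, k.transpose(-2, -1)) * scale   # [b, h, sq, sk]
    b, h, sq, _ = logits.shape
    sink_col = sink.view(1, h, 1, 1).expand(b, h, sq, 1)    # one sink per row
    probs = torch.softmax(torch.cat([logits, sink_col], dim=-1), dim=-1)
    return torch.matmul(probs[..., :-1], v)                 # drop the sink column
```

Because the sink column is dropped after the softmax, each row of attention probabilities sums to less than one, which is what lets the model park attention mass on no token at all.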

Type of change

  • Documentation change (change only to the documentation, either a fix or new content)
  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Infra/Build change
  • Code refactoring

Changes

Please list the changes introduced in this PR:

  • Added sink attention support via cuDNN
  • Added cp_comm_type and softmax_type to AttentionParams (a sketch of the a2a layout exchange follows this list)
  • Improved tensor allocation in pytorch/csrc/extensions/attention.cpp and tensor indexing in Aux_CTX_Tensors in fused_attn_f16_arbitrary_seqlen.cu
  • Improved CP unit tests
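For context on the cp_comm_type="a2a" path: all-to-all context parallelism trades sequence sharding for head sharding before attention and trades back afterwards. Below is a single-process simulation of that layout exchange under assumed BSHD-like shapes; the real implementation uses torch.distributed collectives, and the function name is hypothetical:

```python
import torch

def a2a_seq_to_head(shards, cp_size):
    """Simulate the all-to-all reshard used by cp_comm_type="a2a".

    shards: [cp_size, b, s_local, h, d] -- each "rank" holds a sequence shard.
    returns: [cp_size, b, s_local * cp_size, h // cp_size, d] -- each "rank"
             now holds the full sequence but only h // cp_size heads.
    """
    cp, b, s_local, h, d = shards.shape
    x = shards.view(cp, b, s_local, cp, h // cp, d)  # split heads into cp groups
    x = x.permute(3, 1, 0, 2, 4, 5)                  # head-group i collects every seq chunk
    return x.reshape(cp, b, cp * s_local, h // cp, d)
```

With whole heads and the full sequence on each rank, the fused attention kernel (including the sink softmax) runs unmodified on the local head group; a second all-to-all restores the sequence sharding afterwards.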

Checklist:

  • I have read and followed the contributing guidelines
  • The functionality is complete
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

cyanguwa and others added 30 commits September 2, 2025 15:00
@cyanguwa cyanguwa added the 2.8.0 label Sep 8, 2025
@cyanguwa (Collaborator, Author)

/te-ci L1

@jeremyyx

@cyanguwa Thank you for the excellent work! I have merged the feature into my own Megatron fork, but I found that every attention backend reports "not supported" when qkv_format is set to 'thd'. Are there any workarounds? At the moment, packing mode seems to require qkv_format to be 'thd'.

@cyanguwa (Collaborator, Author)

/te-ci L1

@cuichenx (Contributor) left a comment

LGTM. Tested this branch with NeMo training and convergence looks good.

@cyanguwa (Collaborator, Author) commented Sep 21, 2025

Tested with cuDNN 9.13.1 in pipeline 35266475. Everything looks good except the cp_4_0 tests, which I have reported to cuDNN as bug 5522629. Based on @cuichenx's review and convergence testing, I'm merging the PR.

@cyanguwa cyanguwa requested a review from ptrendx September 21, 2025 06:03
@cyanguwa (Collaborator, Author)

> Nice work! How could I find cudnn-frontend 1.14.1? I can only install 1.14.0, and even the latest cudnn-frontend clone still does not have "set_dsink_token".

Sorry about the late reply. You have probably found the frontend update to 1.14.1 in this PR; the FE commit is now 1a7b4b7.

@cyanguwa (Collaborator, Author)

> @cyanguwa Thank you for the excellent work! I have merged the feature into my own Megatron fork, but I found that every attention backend reports "not supported" when qkv_format is set to 'thd'. Are there any workarounds? At the moment, packing mode seems to require qkv_format to be 'thd'.

We will add support for 'thd' in our next PR. Due to the release schedule, we didn't have time to include more changes, so 'bshd' and 'sbhd' are supported for now. Thanks!
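Until 'thd' lands, one possible workaround is to unpack thd tensors into padded bshd and run with a padding mask. A rough sketch, assuming cu_seqlens in the usual cumulative-lengths convention; the helper name is illustrative, not a TE utility:

```python
import torch

def thd_to_bshd(packed, cu_seqlens, max_seqlen):
    """Unpack a thd tensor [total_tokens, h, d] into padded bshd [b, s, h, d]."""
    b = cu_seqlens.numel() - 1
    out = packed.new_zeros(b, max_seqlen, packed.shape[1], packed.shape[2])
    for i in range(b):
        start, end = cu_seqlens[i].item(), cu_seqlens[i + 1].item()
        out[i, : end - start] = packed[start:end]
    return out  # run attention with a padding mask over the padded positions
```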

@cyanguwa cyanguwa merged commit 5e4e0b2 into NVIDIA:main Sep 22, 2025
12 checks passed
KshitijLakhani pushed a commit that referenced this pull request Sep 25, 2025
* first draft; debug plan failure
* debug uid error
* tweak params
* add grad in output
* clean up prints
* fix prints in test
* Apply 1 suggestion(s) to 1 file(s)
* address review comments
* fix unfused grad; add softmax_type; add sink to bwd
* Apply 1 suggestion(s) to 1 file(s)
* fix padding mask; add swa tests; remove requires_grad for off-by-one
* update FE
* Apply 1 suggestion(s) to 1 file(s) (×9)
* fix indent
* fix non-determinism and shapes
* clean up prints
* add GQA
* add CP A2A; dq/dk mismatches
* fix CP A2A; need cleaner solution
* fix CP A2A; pending cudnn kernel change
* minor fixes
* fix world size in unit test; avoid thd format
* fix kernel_backend, dtype in unit test; fix head_dim for FP8 Hopper
* fix thd logic
* fix fp8 context
* tweak CP logging
* allow no_mask/padding for SWA(left,0)
* Revert "allow no_mask/padding for SWA(left,0)" (reverts commit 08b4ccc67a08b6882080b06aa715f541bb832aca)
* add softmax_type to Jax
* add cuDNN version control
* prettify tests
* skip 9.13 for MLA, non 192/128
* rename compare_with_error
* small cleanups and improvements
* fix minor CI failures
* force sink/dsink to be float32
* switch FE to GH FE
* return to GH TE main FE commit
* [pre-commit.ci] auto fixes from pre-commit.com hooks (see https://pre-commit.ci)
* update FE to 1.14.1
* clean up before CI
* [pre-commit.ci] auto fixes from pre-commit.com hooks
* fix lint
* bump up cudnn version
* add backend selection guard for unit tests
* add docstring for softmax type enums in C
---------

Signed-off-by: Charlene Yang <[email protected]>
Co-authored-by: Chen Cui <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Successfully merging this pull request may close these issues: Add attention sink to flash attention

4 participants