
Conversation

@Gasoonjia Gasoonjia (Contributor) commented Nov 17, 2025

Stack from ghstack (oldest at bottom):

* #15859 (this PR)
* #15857

Introduce Triton SDPA Kernel to CUDA Backend

This diff introduces a kernel-generator (https://github.com/meta-pytorch/KernelAgent) driven, Triton-optimized implementation of the scaled dot-product attention (SDPA) kernel for the CUDA backend. The new kernel replaces the default Edge SDPA operator during graph transformation, accelerating model inference and eliminating the SDPA decomposition.

Changes

  • Added a new file `sdpa.py` to the `fbcode/executorch/backends/cuda/triton/kernels` directory, containing the Triton-optimized SDPA kernel implementation.
  • Added a new replacement pass, `fbcode/executorch/backends/cuda/triton/replacement_pass`, which replaces targeted edge ops with the corresponding Triton kernels.
  • Added tests for SDPA export with the Triton kernel; without it, the SDPA model cannot be exported.

Purpose

The purpose of this diff is to provide a high-performance SDPA kernel for the CUDA backend, which can be used to accelerate attention-based models on NVIDIA GPUs.

Differential Revision: D87259044

Gasoonjia added a commit that referenced this pull request Nov 17, 2025
@pytorch-bot pytorch-bot bot commented Nov 17, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/15859

Note: Links to docs will display an error until the docs builds have been completed.

❗ 1 Active SEV

There is 1 currently active SEV. If your PR is affected, please view it below:

❌ 4 New Failures, 3 Unrelated Failures

As of commit d88e987 with merge base 179a155:

NEW FAILURES - The following jobs have failed:

FLAKY - The following job failed but was likely due to flakiness present on trunk:

BROKEN TRUNK - The following jobs failed but were present on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@meta-cla meta-cla bot added the CLA Signed label Nov 17, 2025
@github-actions

This PR needs a release notes: label

If your change should be included in the release notes (i.e. would users of this library care about this change?), please use a label starting with release notes:. This helps us keep track and include your important work in the next release notes.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "release notes: none"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.

Gasoonjia added a commit that referenced this pull request Nov 17, 2025
Gasoonjia added a commit that referenced this pull request Nov 17, 2025
Gasoonjia added a commit that referenced this pull request Nov 17, 2025
@mergennachin mergennachin requested a review from Copilot November 17, 2025 22:58
Copilot finished reviewing on behalf of mergennachin November 17, 2025 23:01
Copilot AI (Contributor) left a comment

Pull Request Overview

This PR introduces a Triton-optimized implementation of scaled dot-product attention (SDPA) kernel to the ExecuTorch CUDA backend, along with a graph transformation pass that replaces the default Edge SDPA operator with the optimized Triton kernel during compilation.

Key Changes:

  • Added Triton SDPA kernel with autotuning and support for different query/key-value sequence lengths
  • Implemented a replacement pass to swap Edge dialect SDPA operations with Triton kernels (a conceptual sketch follows this list)
  • Added SdpaModule toy model and test coverage for SDPA export

Reviewed Changes

Copilot reviewed 12 out of 12 changed files in this pull request and generated 7 comments.

Summary per file:

| File | Description |
| --- | --- |
| `backends/cuda/triton/kernels/sdpa.py` | Implements the Triton-optimized SDPA kernel with fused attention computation, autotuning configs, and a fake implementation for torch.export |
| `backends/cuda/triton/kernels/__init__.py` | Exports the sdpa kernel from the kernels module |
| `backends/cuda/triton/replacement_pass.py` | Implements ReplaceEdgeOpWithTritonOpPass to replace edge dialect SDPA operations with Triton kernels during graph transformation |
| `backends/cuda/triton/__init__.py` | Module initialization that imports kernels and exports the replacement pass |
| `backends/cuda/cuda_backend.py` | Integrates the Triton replacement pass into the CUDA backend preprocessing pipeline and removes the SDPBackend.MATH constraint |
| `backends/cuda/tests/test_cuda_export.py` | Adds test_sdpa_single_kernel to verify SDPA export functionality |
| `backends/cuda/tests/TARGETS` | Adds dependency on toy_model for SDPA testing |
| `backends/cuda/TARGETS` | Adds triton_kernels and triton_replacement_pass build targets with appropriate dependencies |
| `examples/models/toy_model/model.py` | Adds SdpaModule toy model with example inputs using bfloat16 dtype |
| `examples/models/toy_model/__init__.py` | Exports SdpaModule from the toy_model package |
| `examples/models/__init__.py` | Registers "sdpa" in the Model enum and model registry |
| `.github/workflows/cuda.yml` | Adds "sdpa" to the CI test matrix |


class SdpaModule(torch.nn.Module, EagerModelBase):
    def __init__(self):
        super().__init__()

Copilot AI commented Nov 17, 2025

This class does not call `EagerModelBase.__init__` during initialization (`SdpaModule.__init__` may be missing a call to a base-class `__init__`).

Suggested change:

        EagerModelBase.__init__(self)

Gasoonjia added a commit that referenced this pull request Nov 18, 2025
Stack from [ghstack](https://github.com/ezyang/ghstack) (oldest at bottom):
* #15859
* __->__ #15857

# aoti_torch_new_tensor_handle Shim for CUDA Backend

## Summary

This diff introduces the `aoti_torch_new_tensor_handle` shim for the CUDA backend. The shim wraps the underlying `aoti_torch_new_tensor_handle` function, which creates a new tensor handle from an existing tensor, unblocking custom Triton kernel usage.

The diff affects the following files:

* `fbcode/executorch/backends/aoti/common_shims.h`: the function prototype is declared.
* `fbcode/executorch/backends/apple/metal/runtime/shims/memory.cpp`: a fallback implementation is provided for the Apple Metal backend to keep its behavior.
* `fbcode/executorch/backends/cuda/runtime/shims/tests/test_aoti_torch_new_tensor_handle.cpp`: a test file is added for the CUDA backend.

Differential Revision: [D87254968](https://our.internmc.facebook.com/intern/diff/D87254968/)
Gasoonjia added a commit that referenced this pull request Nov 18, 2025

@larryliu0820 larryliu0820 left a comment


Please take a look at Copilot comments as well


@larryliu0820 larryliu0820 left a comment


Review automatically exported from Phabricator review in Meta.

Gasoonjia added a commit that referenced this pull request Nov 18, 2025
Gasoonjia added a commit that referenced this pull request Nov 18, 2025
@Gasoonjia Gasoonjia closed this Nov 19, 2025

Labels

CLA Signed (managed by the Facebook bot; authors must sign the CLA before a PR can be reviewed), fb-exported, meta-exported
