
Commit 7a573e2

committed
Update on "introduce triton sdpa kernel to cuda backend"
**Introduce Triton SDPA Kernel to CUDA Backend**

This diff introduces a Triton-optimized implementation of the scaled dot-product attention (SDPA) kernel to the CUDA backend. The new kernel replaces the default Edge SDPA operator during graph transformation, accelerating model inference and avoiding SDPA decomposition.

**Changes**

* Added a new file `sdpa.py` to the `fbcode/executorch/backends/cuda/triton/kernels` directory, containing the Triton-optimized SDPA kernel implementation.
* Added a new file `__init__.py` to `fbcode/executorch/backends/cuda/triton/replacement_pass`, which replaces the given existing edge ops with the target Triton kernels.
* Added tests for SDPA export with the Triton kernel. Without the Triton kernel, the SDPA model cannot be exported.

**Purpose**

The purpose of this diff is to provide a high-performance SDPA kernel for the CUDA backend, which can be used to accelerate attention-based models on NVIDIA GPUs.

Differential Revision: [D87259044](https://our.internmc.facebook.com/intern/diff/D87259044/)

[ghstack-poisoned]
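For reference, the computation the SDPA kernel implements is softmax(QKᵀ/√d)V. The following is a minimal plain-Python sketch of that math, not the Triton kernel itself; `sdpa_reference` is a hypothetical name introduced here for illustration:

```python
import math


def sdpa_reference(q, k, v):
    """Reference scaled dot-product attention: softmax(Q K^T / sqrt(d)) V.

    q, k, v are lists of row vectors (seq_len x head_dim). A plain-Python
    sketch of the math the Triton kernel computes, not the kernel itself.
    """
    d = len(q[0])
    scale = 1.0 / math.sqrt(d)
    out = []
    for qi in q:
        # Scaled dot products of this query against every key row.
        scores = [scale * sum(a * b for a, b in zip(qi, kj)) for kj in k]
        # Numerically stable softmax over the scores.
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        z = sum(exps)
        weights = [e / z for e in exps]
        # Output row is the attention-weighted sum of the value rows.
        out.append([
            sum(w * vj[c] for w, vj in zip(weights, v))
            for c in range(len(v[0]))
        ])
    return out
```

With a single key/value pair the softmax weight is exactly 1, so the output equals the value row; with identical value rows, the output equals that row regardless of the attention weights.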
2 parents 8bdb6b5 + 81875ab commit 7a573e2

File tree

1 file changed: 0 additions, 2 deletions


backends/cuda/triton/replacement_pass.py

Lines changed: 0 additions & 2 deletions
@@ -92,8 +92,6 @@ def _should_replace_node(self, node: Node) -> bool:
         if node.op != "call_function":
             return False
 
-        print("Checking:", node.target)
-
         return node.target in EDGE_TO_TRITON_KERNELS
 
     def _replace_node_with_triton(self, graph_module: GraphModule, node: Node) -> None:
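The predicate in the diff gates the replacement pass: only `call_function` nodes whose target appears in the edge-to-Triton mapping are swapped. A minimal self-contained sketch of that logic follows; the real pass operates on `torch.fx` graph nodes, and `Node`, `EDGE_TO_TRITON_KERNELS`, and `triton_sdpa` here are stand-ins for illustration:

```python
from dataclasses import dataclass


def triton_sdpa(*args):
    """Placeholder for the Triton SDPA kernel entry point (hypothetical)."""


# Maps edge operators (by name, in this sketch) to their Triton replacements.
EDGE_TO_TRITON_KERNELS = {
    "aten.scaled_dot_product_attention.default": triton_sdpa,
}


@dataclass
class Node:
    """Stand-in for a torch.fx Node: an opcode plus a call target."""
    op: str
    target: str


def should_replace_node(node: Node) -> bool:
    # Only call_function nodes invoke an operator that a kernel can replace;
    # placeholders, get_attr, and output nodes are left untouched.
    if node.op != "call_function":
        return False
    return node.target in EDGE_TO_TRITON_KERNELS
```

The membership check keeps the pass conservative: any op not explicitly listed in the mapping passes through the graph unchanged.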
