Alternatively, you can build your own custom kernels by leveraging Kraken's low-level primitives. This allows you to create highly optimized kernels tailored to your specific needs. We provide PTX implementations of low-level primitives in `kraken._ptx_utils`.

Here's an example of how to use `kraken._ptx_utils.symm_mem_sync` to synchronize blocks with matching `block_id` across participating devices in a custom kernel. This is often necessary before and after accessing symmetric memory tensors.
```python
import torch
import torch.distributed as dist
import torch.distributed._symmetric_memory as symm_mem

import triton
import triton.language as tl

import kraken
import os

@triton.jit
def custom_distributed_kernel(
    a_shared_ptrs,
    a_signal_pad_ptrs,
    rank: tl.constexpr,
    world_size: tl.constexpr,
):
    # Synchronizes blocks with matching block_id across participating devices.
    # Ensures that all writes to a_shared from previous kernels across all devices
```
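To make the "sync before and after access" pattern more concrete, here is a toy host-side model in plain Python threads. It is not Kraken or PyTorch code: `SignalPads`, `send_signal`, `wait_signal`, and `run` are hypothetical names invented for this sketch, and the flag-per-peer scheme is only an assumed approximation of how a signal-pad barrier can behave. Each thread plays one rank, writes its own slot of a shared list, passes a barrier so that every write is visible, reads all slots, and passes a second barrier so that no rank overwrites a slot that a peer is still reading.

```python
import threading


class SignalPads:
    """Toy stand-in for per-rank signal pads: pads[dst][src] is a 0/1 flag."""

    def __init__(self, world_size):
        self.world_size = world_size
        self.pads = [[0] * world_size for _ in range(world_size)]
        self.cond = threading.Condition()

    def send_signal(self, src, dst):
        # Wait until the slot is free, then raise the flag on the peer's pad.
        with self.cond:
            self.cond.wait_for(lambda: self.pads[dst][src] == 0)
            self.pads[dst][src] = 1
            self.cond.notify_all()

    def wait_signal(self, src, dst):
        # Wait until the peer has raised the flag, then clear it for reuse.
        with self.cond:
            self.cond.wait_for(lambda: self.pads[dst][src] == 1)
            self.pads[dst][src] = 0
            self.cond.notify_all()

    def barrier(self, rank):
        # Signal every rank, then wait for a signal from every rank.
        for peer in range(self.world_size):
            self.send_signal(rank, peer)
        for peer in range(self.world_size):
            self.wait_signal(peer, rank)


def run(world_size=4, rounds=3):
    pads = SignalPads(world_size)
    shared = [0] * world_size  # toy "symmetric memory": one slot per rank
    seen = [[] for _ in range(world_size)]

    def worker(rank):
        for step in range(rounds):
            shared[rank] = (step + 1) * (rank + 1)  # write own slot
            pads.barrier(rank)  # all writes land before anyone reads
            seen[rank].append(sum(shared))  # read every rank's slot
            pads.barrier(rank)  # all reads finish before the next write

    threads = [threading.Thread(target=worker, args=(r,)) for r in range(world_size)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return seen
```

Calling `run(4, 3)` returns one list of per-round sums for each rank. If either barrier is removed, a rank may read a slot that a peer has not yet written or has already overwritten, which is the hazard the synchronization in a real kernel is meant to prevent.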