[Tools][Translator] Add AMD backend support for Triton-to-Gluon translator #9717
Conversation
Some tests are expected to fail as I'm waiting for the rework of #9709. I'll mark this as a draft again until that happens.
…lator

Add support for translating Triton kernels to Gluon for AMD targets (gfx1250, gfx942, gfx950). This includes:
- Architecture detection helpers (_is_gfx1250, _is_cdna, etc.)
- AMD WMMA dot path for gfx1250 and MFMA dot path for CDNA3/CDNA4
- TDM tensor descriptor support (load/store/gather/scatter) for gfx1250
- Correct warp size handling (32 for gfx1250, 64 for CDNA)
- Cross-compilation via _current_target / _make_target
- tl_atomic_add and convert_to_expand_dims_layout as builtins
- Parametrized tests across all targets (nvidia, gfx1250, gfx942, gfx950)
- Fix segfault in getTensorDescMetadata for unencoded tensor descriptors

Made-with: Cursor
…d of @Builtin Replace attribute-stashing @ttgl._core.builtin with a @tl.core._aggregate (AMDTensorDescriptorArgs) holding desc + base_ptr. Load/store use desc directly via @gluon.jit (generic 1D-5D). Gather/scatter reconstruct the descriptor with [num_indices, N] block_shape using desc.block_shape (plain ints from type metadata) and base_ptr, avoiding constexpr-to-tensor conversion in JIT list literals. Only _create_tdm_descriptor remains as a thin builtin for block_shape list construction. Made-with: Cursor
Use constexpr annotations on block_shape parameters to prevent constexpr-to-tensor decay in JIT list literals. This eliminates the last builtin helper -- all functions are now @gluon.jit. Made-with: Cursor
cc @Mogball in case you have any comments
… tests Avoid pytest-xdist load imbalance by detecting the available target at runtime via current_target() rather than parametrizing over every target and skipping the unavailable ones. Made-with: Cursor
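The runtime-detection approach can be sketched as follows. `current_target` is a stand-in stub here, not the real Triton API (the real helper queries the active GPU driver):

```python
# Sketch of runtime target detection for tests. `current_target` is a
# stand-in stub; the real helper lives in Triton and queries the GPU.
def current_target() -> str:
    return "gfx1250"  # stub value for illustration


def targets_to_test() -> list[str]:
    # Rather than parametrizing over every target and skipping the
    # unavailable ones (which skews pytest-xdist load balancing),
    # only emit the target that is actually present at runtime.
    return [current_target()]
```

Each xdist worker then runs only tests for hardware it can actually exercise, instead of burning scheduling slots on skips.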
Removed cdna3 and below from the
Kind ping on this @Mogball / @jeffniu-openai
jeffniu-openai
left a comment
Thanks for the ping. Really happy to see the work here move forward.
In general, I think it would be best if the hardware dispatch logic were coded directly into the translator/AST rewriter via a target abstraction, instead of at compile time in the translator helpers themselves. This would ensure the final translated kernel only contains (translator-generated) code relevant to the specific hardware, rather than comptime hardware switches, and would produce cleaner output.
```python
    translate_to_gluon: bool = False
    inline_helpers: ordered_set[str] = field(default_factory=ordered_set[str])
    cvt_context: list[bool] = field(default_factory=lambda: [False])
    target: str = "nvidia"
```
nit: can you make this a StrEnum
Done, except I had to use str and Enum separately because StrEnum doesn't seem to be supported on Python 3.10, which the CI uses.
```python
        raise e
```

```python
    def _is_amd_target(self) -> bool:
        return self.target.startswith("gfx")
```
this can be deleted in favor of the target StrEnum when there is one
```python
        if self._is_amd_target():
            self.imports.add("from triton.experimental.gluon.language.amd.gfx1250.tdm import tensor_descriptor")
        else:
            self.imports.add("from triton.experimental.gluon.language.nvidia.hopper.tma import tensor_descriptor")
```
would it be possible to build a tiny hardware abstraction, and dispatch something like target.get_tensor_descriptor_import?
Added a tensor_descriptor_import to the target abstraction.
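A minimal sketch of what such a dispatch property could look like. The class and property names mirror the discussion, but this is illustrative, not the PR's exact code:

```python
class Target:
    """Tiny hardware abstraction; illustrative sketch only."""

    def __init__(self, name: str):
        self.name = name

    @property
    def is_amd(self) -> bool:
        return self.name.startswith("gfx")

    @property
    def tensor_descriptor_import(self) -> str:
        # Same import strings as in the quoted diff, now owned by the
        # target abstraction instead of an if/else at the call site.
        if self.is_amd:
            return ("from triton.experimental.gluon.language.amd"
                    ".gfx1250.tdm import tensor_descriptor")
        return ("from triton.experimental.gluon.language.nvidia"
                ".hopper.tma import tensor_descriptor")
```

The call site then collapses to something like `self.imports.add(target.tensor_descriptor_import)`.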
```python
            return self.generic_visit(node)
        value, _, _ = ref
        if value is tl.make_tensor_descriptor and self.target.startswith("gfx"):
            node.func = parse_expr("helpers.tl_make_tensor_descriptor_amd")
```
why is this the only function that does hardware dispatch in the translator side, vs. at the translator helper level?
You're right, turns out this isn't needed. Updated, PTAL ^^
- Add TranslatorTarget StrEnum with is_amd property and tensor_descriptor_import dispatch, accepting any gfx* string via _missing_() for forward-compat with new AMD architectures.
- Replace raw target strings and _is_amd_target() in SliceRewriter and Translator with the enum.
- Unify tl_make_tensor_descriptor and tl_make_tensor_descriptor_amd into a single helper that dispatches via current_target() at runtime, matching how tl_dot already works.

Made-with: Cursor
Use (str, Enum) instead of StrEnum which requires Python 3.11+. Made-with: Cursor
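A sketch of the resulting pattern: a `(str, Enum)` mixin (works on Python 3.10, unlike `enum.StrEnum`) whose `_missing_` hook accepts any `gfx*` string as a pseudo-member for forward compatibility. Member names follow the commit messages; the details are illustrative:

```python
from enum import Enum


class TranslatorTarget(str, Enum):
    # (str, Enum) works on Python 3.10; enum.StrEnum requires 3.11+.
    NVIDIA = "nvidia"
    GFX1250 = "gfx1250"
    GFX942 = "gfx942"
    GFX950 = "gfx950"

    @classmethod
    def _missing_(cls, value):
        # Accept any gfx* string so new AMD architectures work
        # without being listed above (forward compatibility).
        if isinstance(value, str) and value.startswith("gfx"):
            member = str.__new__(cls, value)
            member._name_ = value.upper()
            member._value_ = value
            return member
        return None

    @property
    def is_amd(self) -> bool:
        return self.value.startswith("gfx")
```

Because the mixin is `str`, pseudo-members created by `_missing_` still compare equal to their raw string, so downstream code can keep treating the target as a plain string.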
I still think we should have separate translator helper modules based on the target. This would reduce polluting the translated kernel with unrelated hardware helpers. This can be abstracted through the hardware target, and you can re-use helpers. Something like
```
      /--- common_helpers.py ---\
      |                         |
      v                         v
 amd_helpers            nvidia_helpers
```

and the hardware target specifies the Python module. The slicer should automatically pull through the transitive deps if there are any.
I did have separate amd and nvidia helpers initially, but wasn't sure whether that would be accepted given the renaming of NVIDIA-specific code from AMD :D. Now that I have confirmation, I'll get on the refactoring right away.
Split the monolithic translator_helpers.py into:
- common_helpers.py: vendor-neutral utilities (layouts, portable ops)
- nvidia_helpers.py: NVIDIA-specific helpers (TMA, mbarrier, Blackwell)
- amd_helpers.py: AMD-specific helpers (TDM, WMMA, MFMA)

Each target module re-exports common helpers via star import so the generated kernel sees a single unified `helpers` namespace. The TranslatorTarget.helpers_module property selects which module to import, so translated kernels no longer pull in unrelated hardware modules. translator_helpers.py is kept as a backward-compat re-export shim.

Made-with: Cursor
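The star-import re-export can be modeled in a few lines. The toy in-memory modules below stand in for the real files under the translator package:

```python
import types

# Toy stand-in for common_helpers.py: vendor-neutral utilities.
common = types.ModuleType("common_helpers")
exec("def build_layout(shape):\n    return ('layout', tuple(shape))",
     common.__dict__)

# Toy stand-in for amd_helpers.py. The dict update mimics
# `from .common_helpers import *`: public names are copied in, so the
# generated kernel sees one unified `helpers` namespace per target.
amd = types.ModuleType("amd_helpers")
amd.__dict__.update(
    {k: v for k, v in common.__dict__.items() if not k.startswith("_")})
exec("def tl_dot_amd(a, b):\n    return ('wmma', a, b)", amd.__dict__)
```

`amd.build_layout` and `amd.tl_dot_amd` are both reachable through the single `amd` namespace, while an `nvidia_helpers` sibling would never be imported for AMD targets.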
Mogball
left a comment
mostly lgtm just a few nits
```python
    NVIDIA = "nvidia"
    # AMD targets currently exercised by the translator test suite:
    GFX1250 = "gfx1250"
```
nit: can you put this in its own file
```python
def convert_to_expand_dims_layout(value, expand_dims: list[int]) -> Any:
    layout: ttgl.constexpr = build_expand_dims_layout(value.shape, expand_dims, ttgl.num_warps())
    return ttgl.convert_layout(value, layout)

from triton.tools.triton_to_gluon_translator.common_helpers import *  # noqa: F401,F403
```
is this file even still needed?
Some tests still seem to need it (test_slice_kernel.py). If I remove it, I'll have to update the imports in those other test files.
Would you like me to remove it and update the imports in test_slice_kernel.py?
Facing some CI failures on the NVIDIA backend after the refactor. Will attempt to fix those: https://github.com/triton-lang/triton/actions/runs/24442531609/job/71410815124?pr=9717#step:11:104
Split the monolithic translator_helpers.py into:
- target.py: TranslatorTarget enum (hardware abstraction)
- common_helpers.py: vendor-neutral utilities (layouts, portable ops)
- nvidia_helpers.py: NVIDIA-specific helpers (TMA, mbarrier, Blackwell)
- amd_helpers.py: AMD-specific helpers (TDM, WMMA, MFMA)

Each target module re-exports common helpers via star import so the generated kernel sees a single unified `helpers` namespace. The TranslatorTarget.helpers_module property selects which module to import, so translated kernels no longer pull in unrelated hardware modules. translator_helpers.py is kept as a backward-compat re-export shim.

Made-with: Cursor
The gluon JIT compiler evaluates all branches at compile time. With only NVIDIA types in scope, the else branch fails because tensor_descriptor lacks .store()/.load() methods. Since NVIDIA descriptors are always TMA type, call the TMA functions directly. Made-with: Cursor
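A toy model of this failure mode (illustrative names only; the real descriptors live in Gluon): when a tracing compiler evaluates both sides of an `if`, an attribute access in the never-taken branch still errors.

```python
class TMADescriptor:
    """Stand-in for an NVIDIA TMA descriptor: it exposes a TMA-specific
    entry point but no generic .load()/.store() methods."""

    def tma_load(self, offsets):
        return ("tma_load", offsets)


def trace(branch):
    # Toy model of a tracing JIT: every branch body gets evaluated
    # at compile time, even ones that would never run at runtime.
    return branch()


desc = TMADescriptor()

# Dispatching through a generic method fails at trace time...
try:
    trace(lambda: desc.load((0, 0)))
    generic_ok = True
except AttributeError:
    generic_ok = False

# ...while calling the TMA entry point directly traces fine, which is
# the fix the commit message describes.
direct = trace(lambda: desc.tma_load((0, 0)))
```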
Update test imports to use the target-specific helper modules directly and delete the re-export shim. Use lazy import for convert_host_descriptor in tests so the correct target module is loaded at runtime. Made-with: Cursor
…lator (triton-lang#9717)

Extends the existing NVIDIA Triton-to-Gluon translator to support AMD GPU architectures, with gfx1250 as the primary target and CDNA3/CDNA4 support for non-descriptor ops.

### Changes

- **Added AMD backend to translator helpers**: Translates Triton ops to Gluon equivalents using AMD-specific hardware features -- WMMA for `tl.dot`, TDM for tensor descriptors (`tl.make_tensor_descriptor`, `desc.load`, `desc.store`), and `PaddedSharedLayout` for shared memory.
- **TDM gather/scatter** (`translator_helpers.py`): Implements `tl_obj_gather_amd` / `tl_obj_scatter_amd` as `@builtin` helpers that recreate the TDM descriptor with the correct block shape for multi-row gather/scatter operations.
- **CDNA3/CDNA4 support** (`translator_helpers.py`): Adds MFMA-based `tl.dot` translation, 64-thread-per-warp layouts, and target-aware dispatch for gfx942/gfx950.
- **Target propagation** (`translator.py`, `translator_helpers.py`): Injects `helpers._current_target = helpers._make_target("{target}")` into generated code for non-NVIDIA targets. Architecture detection helpers (`get_num_threads_per_warp`, `_is_gfx1250`, `_is_cdna`, etc.) accept an optional `target` parameter with fallback to `current_target()`.
- **Host descriptor support** (`inline_helpers.py`): Adds `convert_host_descriptor_amd` using `gluon.amd.gfx1250.TensorDescriptor` with `PaddedSharedLayout`. Deduplicates the shared `torch_dtype_to_triton` helper.
- **Segfault fix** (`ir.cc`): Uses `isa_and_nonnull` / `dyn_cast_or_null` in `getTensorDescMetadata` to handle unencoded tensor descriptor block types from the `make_ttgir` path. This was needed to make the reference Triton tensor descriptor tests pass, where the descriptors were created on the host side.
- **Tests** (`test_triton_to_gluon.py`): Parametrizes existing tests across `nvidia`/`gfx1250`/`gfx942`/`gfx950` targets. Adds roundtrip tests for `tl.cat`, `tl.make_tensor_descriptor`, and gather/scatter.

### Known limitation

`tl.make_tensor_descriptor` / `desc.load` / `desc.store` translation is not yet supported for CDNA3/CDNA4 targets. The Gluon builder lacks generic `DescriptorLoad`/`DescriptorStore` ops, so descriptor-based kernels currently only translate for gfx1250 (TDM) and NVIDIA (TMA).