[Tools][Translator] Add AMD backend support for Triton-to-Gluon translator #9717
Conversation
Some tests are expected to fail as I'm waiting for the rework of #9709. I'll mark this as a draft again until that happens.
…lator

Add support for translating Triton kernels to Gluon for AMD targets (gfx1250, gfx942, gfx950). This includes:
- Architecture detection helpers (_is_gfx1250, _is_cdna, etc.)
- AMD WMMA dot path for gfx1250 and MFMA dot path for CDNA3/CDNA4
- TDM tensor descriptor support (load/store/gather/scatter) for gfx1250
- Correct warp size handling (32 for gfx1250, 64 for CDNA)
- Cross-compilation via _current_target / _make_target
- tl_atomic_add and convert_to_expand_dims_layout as builtins
- Parametrized tests across all targets (nvidia, gfx1250, gfx942, gfx950)
- Fix segfault in getTensorDescMetadata for unencoded tensor descriptors

Made-with: Cursor
…d of @Builtin Replace attribute-stashing @ttgl._core.builtin with a @tl.core._aggregate (AMDTensorDescriptorArgs) holding desc + base_ptr. Load/store use desc directly via @gluon.jit (generic 1D-5D). Gather/scatter reconstruct the descriptor with [num_indices, N] block_shape using desc.block_shape (plain ints from type metadata) and base_ptr, avoiding constexpr-to-tensor conversion in JIT list literals. Only _create_tdm_descriptor remains as a thin builtin for block_shape list construction. Made-with: Cursor
Use constexpr annotations on block_shape parameters to prevent constexpr-to-tensor decay in JIT list literals. This eliminates the last builtin helper -- all functions are now @gluon.jit. Made-with: Cursor
cc @Mogball in case you have any comments
… tests Avoid pytest-xdist load imbalance by detecting the available target at runtime via current_target() rather than parametrizing over every target and skipping the unavailable ones. Made-with: Cursor
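The runtime-detection approach can be sketched as follows. `current_target` is a stand-in stub here, not the real Triton API (the real helper queries the active GPU driver):

```python
# Sketch of runtime target detection for tests. `current_target` is a
# stand-in stub; the real helper lives in Triton and queries the GPU.
def current_target() -> str:
    return "gfx1250"  # stub value for illustration


def targets_to_test() -> list[str]:
    # Rather than parametrizing over every target and skipping the
    # unavailable ones (which skews pytest-xdist load balancing),
    # only emit the target that is actually present at runtime.
    return [current_target()]
```

Each xdist worker then runs only tests for hardware it can actually exercise, instead of burning scheduling slots on skips.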
Removed cdna3 and below from the
Kind ping on this @Mogball / @jeffniu-openai
jeffniu-openai
left a comment
Thanks for the ping. Really happy to see the work here move forward.
In general, I think it would be best if the hardware dispatch logic were coded directly into the translator/AST rewriter via a target abstraction, instead of at compile time in the translator helpers themselves. This would ensure the final translated kernel only contains (translator-generated) code relevant to the specific hardware, rather than comptime hardware switches, and would produce cleaner output.
```python
    translate_to_gluon: bool = False
    inline_helpers: ordered_set[str] = field(default_factory=ordered_set[str])
    cvt_context: list[bool] = field(default_factory=lambda: [False])
    target: str = "nvidia"
```
nit: can you make this a StrEnum
Done, except I had to use str and Enum separately because StrEnum doesn't seem to be supported on Python 3.10, which the CI uses.
```python
        raise e
```

```python
    def _is_amd_target(self) -> bool:
        return self.target.startswith("gfx")
```
this can be deleted in favor of the target StrEnum when there is one
```python
        if self._is_amd_target():
            self.imports.add("from triton.experimental.gluon.language.amd.gfx1250.tdm import tensor_descriptor")
        else:
            self.imports.add("from triton.experimental.gluon.language.nvidia.hopper.tma import tensor_descriptor")
```
would it be possible to build a tiny hardware abstraction, and dispatch something like target.get_tensor_descriptor_import?
Added a tensor_descriptor_import to the target abstraction.
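A minimal sketch of what such a dispatch property could look like. The class and property names mirror the discussion, but this is illustrative, not the PR's exact code:

```python
class Target:
    """Tiny hardware abstraction; illustrative sketch only."""

    def __init__(self, name: str):
        self.name = name

    @property
    def is_amd(self) -> bool:
        return self.name.startswith("gfx")

    @property
    def tensor_descriptor_import(self) -> str:
        # Same import strings as in the quoted diff, now owned by the
        # target abstraction instead of an if/else at the call site.
        if self.is_amd:
            return ("from triton.experimental.gluon.language.amd"
                    ".gfx1250.tdm import tensor_descriptor")
        return ("from triton.experimental.gluon.language.nvidia"
                ".hopper.tma import tensor_descriptor")
```

The call site then collapses to something like `self.imports.add(target.tensor_descriptor_import)`.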
```python
            return self.generic_visit(node)
        value, _, _ = ref
        if value is tl.make_tensor_descriptor and self.target.startswith("gfx"):
            node.func = parse_expr("helpers.tl_make_tensor_descriptor_amd")
```
why is this the only function that does hardware dispatch in the translator side, vs. at the translator helper level?
You're right, turns out this isn't needed. Updated, PTAL ^^
- Add TranslatorTarget StrEnum with is_amd property and tensor_descriptor_import dispatch, accepting any gfx* string via _missing_() for forward-compat with new AMD architectures.
- Replace raw target strings and _is_amd_target() in SliceRewriter and Translator with the enum.
- Unify tl_make_tensor_descriptor and tl_make_tensor_descriptor_amd into a single helper that dispatches via current_target() at runtime, matching how tl_dot already works.

Made-with: Cursor
Use (str, Enum) instead of StrEnum which requires Python 3.11+. Made-with: Cursor
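A sketch of the resulting pattern: a `(str, Enum)` mixin (works on Python 3.10, unlike `enum.StrEnum`) whose `_missing_` hook accepts any `gfx*` string as a pseudo-member for forward compatibility. Member names follow the commit messages; the details are illustrative:

```python
from enum import Enum


class TranslatorTarget(str, Enum):
    # (str, Enum) works on Python 3.10; enum.StrEnum requires 3.11+.
    NVIDIA = "nvidia"
    GFX1250 = "gfx1250"
    GFX942 = "gfx942"
    GFX950 = "gfx950"

    @classmethod
    def _missing_(cls, value):
        # Accept any gfx* string so new AMD architectures work
        # without being listed above (forward compatibility).
        if isinstance(value, str) and value.startswith("gfx"):
            member = str.__new__(cls, value)
            member._name_ = value.upper()
            member._value_ = value
            return member
        return None

    @property
    def is_amd(self) -> bool:
        return self.value.startswith("gfx")
```

Because the mixin is `str`, pseudo-members created by `_missing_` still compare equal to their raw string, so downstream code can keep treating the target as a plain string.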
I still think we should have separate translator helper modules based on the target. This would reduce polluting the translated kernel with unrelated hardware helpers. This can be abstracted through the hardware target, and you can re-use helpers. Something like
```
      /--- common_helpers.py ---\
      |                         |
      v                         v
 amd_helpers            nvidia_helpers
```

and the hardware target specifies the Python module. The slicer should automatically pull through the transitive deps if there are any.
I did have separate amd and nvidia helpers initially, but wasn't sure whether that would be accepted given the renaming of NVIDIA-specific code from AMD :D. Now that I have confirmation, I'll get on the refactoring right away.
Split the monolithic translator_helpers.py into:
- common_helpers.py: vendor-neutral utilities (layouts, portable ops)
- nvidia_helpers.py: NVIDIA-specific helpers (TMA, mbarrier, Blackwell)
- amd_helpers.py: AMD-specific helpers (TDM, WMMA, MFMA)

Each target module re-exports common helpers via star import so the generated kernel sees a single unified `helpers` namespace. The TranslatorTarget.helpers_module property selects which module to import, so translated kernels no longer pull in unrelated hardware modules. translator_helpers.py is kept as a backward-compat re-export shim.

Made-with: Cursor
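The star-import re-export can be modeled in a few lines. The toy in-memory modules below stand in for the real files under the translator package:

```python
import types

# Toy stand-in for common_helpers.py: vendor-neutral utilities.
common = types.ModuleType("common_helpers")
exec("def build_layout(shape):\n    return ('layout', tuple(shape))",
     common.__dict__)

# Toy stand-in for amd_helpers.py. The dict update mimics
# `from .common_helpers import *`: public names are copied in, so the
# generated kernel sees one unified `helpers` namespace per target.
amd = types.ModuleType("amd_helpers")
amd.__dict__.update(
    {k: v for k, v in common.__dict__.items() if not k.startswith("_")})
exec("def tl_dot_amd(a, b):\n    return ('wmma', a, b)", amd.__dict__)
```

`amd.build_layout` and `amd.tl_dot_amd` are both reachable through the single `amd` namespace, while an `nvidia_helpers` sibling would never be imported for AMD targets.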
Mogball
left a comment
mostly lgtm just a few nits
```python
    NVIDIA = "nvidia"
    # AMD targets currently exercised by the translator test suite:
    GFX1250 = "gfx1250"
```
nit: can you put this in its own file
```python
def convert_to_expand_dims_layout(value, expand_dims: list[int]) -> Any:
    layout: ttgl.constexpr = build_expand_dims_layout(value.shape, expand_dims, ttgl.num_warps())
    return ttgl.convert_layout(value, layout)

from triton.tools.triton_to_gluon_translator.common_helpers import *  # noqa: F401,F403
```
is this file even still needed?
Some tests still seem to need it (test_slice_kernel.py). If I remove it, I'll have to update the imports in those other test files.
Would you like me to remove it and update the imports in test_slice_kernel.py?
Facing some CI failures on the NVIDIA backend after the refactor. Will attempt to fix those: https://github.com/triton-lang/triton/actions/runs/24442531609/job/71410815124?pr=9717#step:11:104
Split the monolithic translator_helpers.py into:
- target.py: TranslatorTarget enum (hardware abstraction)
- common_helpers.py: vendor-neutral utilities (layouts, portable ops)
- nvidia_helpers.py: NVIDIA-specific helpers (TMA, mbarrier, Blackwell)
- amd_helpers.py: AMD-specific helpers (TDM, WMMA, MFMA)

Each target module re-exports common helpers via star import so the generated kernel sees a single unified `helpers` namespace. The TranslatorTarget.helpers_module property selects which module to import, so translated kernels no longer pull in unrelated hardware modules. translator_helpers.py is kept as a backward-compat re-export shim.

Made-with: Cursor
The gluon JIT compiler evaluates all branches at compile time. With only NVIDIA types in scope, the else branch fails because tensor_descriptor lacks .store()/.load() methods. Since NVIDIA descriptors are always TMA type, call the TMA functions directly. Made-with: Cursor
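A toy model of this failure mode (illustrative names only; the real descriptors live in Gluon): when a tracing compiler evaluates both sides of an `if`, an attribute access in the never-taken branch still errors.

```python
class TMADescriptor:
    """Stand-in for an NVIDIA TMA descriptor: it exposes a TMA-specific
    entry point but no generic .load()/.store() methods."""

    def tma_load(self, offsets):
        return ("tma_load", offsets)


def trace(branch):
    # Toy model of a tracing JIT: every branch body gets evaluated
    # at compile time, even ones that would never run at runtime.
    return branch()


desc = TMADescriptor()

# Dispatching through a generic method fails at trace time...
try:
    trace(lambda: desc.load((0, 0)))
    generic_ok = True
except AttributeError:
    generic_ok = False

# ...while calling the TMA entry point directly traces fine, which is
# the fix the commit message describes.
direct = trace(lambda: desc.tma_load((0, 0)))
```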
Update test imports to use the target-specific helper modules directly and delete the re-export shim. Use lazy import for convert_host_descriptor in tests so the correct target module is loaded at runtime. Made-with: Cursor
…lator (triton-lang#9717)

Extends the existing NVIDIA Triton-to-Gluon translator to support AMD GPU architectures, with gfx1250 as the primary target and CDNA3/CDNA4 support for non-descriptor ops.

### Changes

- **Added AMD backend to translator helpers**: Translates Triton ops to Gluon equivalents using AMD-specific hardware features -- WMMA for `tl.dot`, TDM for tensor descriptors (`tl.make_tensor_descriptor`, `desc.load`, `desc.store`), and `PaddedSharedLayout` for shared memory.
- **TDM gather/scatter** (`translator_helpers.py`): Implements `tl_obj_gather_amd` / `tl_obj_scatter_amd` as `@builtin` helpers that recreate the TDM descriptor with the correct block shape for multi-row gather/scatter operations.
- **CDNA3/CDNA4 support** (`translator_helpers.py`): Adds MFMA-based `tl.dot` translation, 64-thread-per-warp layouts, and target-aware dispatch for gfx942/gfx950.
- **Target propagation** (`translator.py`, `translator_helpers.py`): Injects `helpers._current_target = helpers._make_target("{target}")` into generated code for non-NVIDIA targets. Architecture detection helpers (`get_num_threads_per_warp`, `_is_gfx1250`, `_is_cdna`, etc.) accept an optional `target` parameter with fallback to `current_target()`.
- **Host descriptor support** (`inline_helpers.py`): Adds `convert_host_descriptor_amd` using `gluon.amd.gfx1250.TensorDescriptor` with `PaddedSharedLayout`. Deduplicates the shared `torch_dtype_to_triton` helper.
- **Segfault fix** (`ir.cc`): Uses `isa_and_nonnull` / `dyn_cast_or_null` in `getTensorDescMetadata` to handle unencoded tensor descriptor block types from the `make_ttgir` path. This was needed to make the reference Triton tensor descriptor tests pass, where the descriptors were created on the host side.
- **Tests** (`test_triton_to_gluon.py`): Parametrizes existing tests across `nvidia`/`gfx1250`/`gfx942`/`gfx950` targets. Adds roundtrip tests for `tl.cat`, `tl.make_tensor_descriptor`, and gather/scatter.

### Known limitation

`tl.make_tensor_descriptor` / `desc.load` / `desc.store` translation is not yet supported for CDNA3/CDNA4 targets. The Gluon builder lacks generic `DescriptorLoad`/`DescriptorStore` ops, so descriptor-based kernels currently only translate for gfx1250 (TDM) and NVIDIA (TMA).