
[Tools][Translator] Add AMD backend support for Triton-to-Gluon translator#9717

Merged
Mogball merged 30 commits into triton-lang:main from jammm:jam/amd_translator on Apr 15, 2026

Conversation

@jammm (Contributor) commented Mar 13, 2026

Extends the existing NVIDIA Triton-to-Gluon translator to support AMD GPU architectures, with gfx1250 as the primary target and CDNA3/CDNA4 support for non-descriptor ops.

Changes

  • Added AMD backend to translator helpers: Translates Triton ops to Gluon equivalents using AMD-specific hardware features -- WMMA for tl.dot, TDM for tensor descriptors (tl.make_tensor_descriptor, desc.load, desc.store), and PaddedSharedLayout for shared memory.
  • TDM gather/scatter (translator_helpers.py): Implements tl_obj_gather_amd / tl_obj_scatter_amd as @builtin helpers that recreate the TDM descriptor with the correct block shape for multi-row gather/scatter operations.
  • CDNA3/CDNA4 support (translator_helpers.py): Adds MFMA-based tl.dot translation, 64-thread-per-warp layouts, and target-aware dispatch for gfx942/gfx950.
  • Target propagation (translator.py, translator_helpers.py): Injects helpers._current_target = helpers._make_target("{target}") into generated code for non-NVIDIA targets. Architecture detection helpers (get_num_threads_per_warp, _is_gfx1250, _is_cdna, etc.) accept an optional target parameter with fallback to current_target().
  • Host descriptor support (inline_helpers.py): Adds convert_host_descriptor_amd using gluon.amd.gfx1250.TensorDescriptor with PaddedSharedLayout. Deduplicates shared torch_dtype_to_triton helper.
  • Segfault fix (ir.cc): Uses isa_and_nonnull / dyn_cast_or_null in getTensorDescMetadata to handle unencoded tensor descriptor block types from the make_ttgir path. This was needed to make the reference Triton tensor-descriptor tests pass, where the descriptors are created on the host side.
  • Tests (test_triton_to_gluon.py): Parametrizes existing tests across nvidia/gfx1250/gfx942/gfx950 targets. Adds roundtrip tests for tl.cat, tl.make_tensor_descriptor, and gather/scatter.

Known limitation

tl.make_tensor_descriptor / desc.load / desc.store translation is not yet supported for CDNA3/CDNA4 targets. The Gluon builder lacks generic DescriptorLoad/DescriptorStore ops, so descriptor-based kernels currently only translate for gfx1250 (TDM) and NVIDIA (TMA).

@jammm jammm changed the title [WIP][Tools][Translator] AMD Translator [Tools][Translator] Add AMD backend support for Triton-to-Gluon translator Mar 17, 2026
@jammm jammm marked this pull request as ready for review March 17, 2026 07:50
@jammm jammm requested a review from ptillet as a code owner March 17, 2026 07:50
@jammm jammm marked this pull request as draft March 17, 2026 07:51
@jammm jammm force-pushed the jam/amd_translator branch from c44bce8 to 570f0c3 Compare March 19, 2026 10:45
@jammm jammm marked this pull request as ready for review March 19, 2026 10:46
@jammm jammm force-pushed the jam/amd_translator branch 2 times, most recently from ddd5f05 to 9f5b442 Compare March 24, 2026 09:24
@jammm jammm force-pushed the jam/amd_translator branch from ce4347d to 6a6fe08 Compare March 25, 2026 16:00
@jammm (Contributor, Author) commented Mar 25, 2026

Some tests are expected to fail as I'm waiting for the rework of #9709. I'll mark this as a draft again until that happens.

@jammm jammm marked this pull request as draft March 25, 2026 16:06
@antiagainst (Member) commented

#9792 landed so this should be unblocked now. @jammm, can you rebase and address all comments?

jammm added 7 commits April 7, 2026 01:00
…lator

Add support for translating Triton kernels to Gluon for AMD targets
(gfx1250, gfx942, gfx950). This includes:

- Architecture detection helpers (_is_gfx1250, _is_cdna, etc.)
- AMD WMMA dot path for gfx1250 and MFMA dot path for CDNA3/CDNA4
- TDM tensor descriptor support (load/store/gather/scatter) for gfx1250
- Correct warp size handling (32 for gfx1250, 64 for CDNA)
- Cross-compilation via _current_target / _make_target
- tl_atomic_add and convert_to_expand_dims_layout as builtins
- Parametrized tests across all targets (nvidia, gfx1250, gfx942, gfx950)
- Fix segfault in getTensorDescMetadata for unencoded tensor descriptors

Made-with: Cursor
…d of @Builtin

Replace attribute-stashing @ttgl._core.builtin with a @tl.core._aggregate
(AMDTensorDescriptorArgs) holding desc + base_ptr. Load/store use desc
directly via @gluon.jit (generic 1D-5D). Gather/scatter reconstruct the
descriptor with [num_indices, N] block_shape using desc.block_shape (plain
ints from type metadata) and base_ptr, avoiding constexpr-to-tensor
conversion in JIT list literals. Only _create_tdm_descriptor remains as a
thin builtin for block_shape list construction.

Made-with: Cursor
@jammm jammm force-pushed the jam/amd_translator branch from 6a6fe08 to 83f1730 Compare April 7, 2026 08:03
@jammm jammm marked this pull request as ready for review April 7, 2026 08:03
@jammm jammm force-pushed the jam/amd_translator branch from 83f1730 to cd9dbd9 Compare April 7, 2026 08:04
@jammm (Contributor, Author) commented Apr 7, 2026

> #9792 landed so this should be unblocked now. @jammm, can you rebase and address all comments?

Done. Sorry for the force-push, I had mistakenly rebased on top of an older main branch.

jammm added 2 commits April 7, 2026 02:01
Use constexpr annotations on block_shape parameters to prevent
constexpr-to-tensor decay in JIT list literals. This eliminates
the last builtin helper -- all functions are now @gluon.jit.

Made-with: Cursor
@peterbell10 (Contributor) commented

cc @Mogball in case you have any comments

… tests

Avoid pytest-xdist load imbalance by detecting the available target at
runtime via current_target() rather than parametrizing over every target
and skipping the unavailable ones.

Made-with: Cursor
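
A rough sketch of the runtime-detection idea in this commit (the stubbed `current_target` stands in for the real runtime query, and `targets_to_test` is a hypothetical name for illustration):

```python
# Sketch: build the test parameter list from the device that is actually
# present, instead of parametrizing over every target and skipping the
# rest -- skipped cases still get scheduled, which skews pytest-xdist's
# load balancing across workers.

def current_target() -> str:
    return "gfx950"  # stub: the real helper queries the active device

KNOWN_TARGETS = ("nvidia", "gfx1250", "gfx942", "gfx950")

def targets_to_test() -> list:
    target = current_target()
    return [target] if target in KNOWN_TARGETS else []
```

The resulting list can then feed a single `pytest.mark.parametrize`, so each worker only sees tests that will actually run.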
@peterbell10 (Contributor) left a comment

LGTM

@jammm jammm force-pushed the jam/amd_translator branch from d98bbad to 28095a7 Compare April 11, 2026 09:50
@jammm (Contributor, Author) commented Apr 11, 2026
jammm commented Apr 11, 2026

Removed cdna3 and below from the test_simple_matmul test because it was running out of shared memory. test_triton_to_gluon_dot_minimal already covers tl.dot for cdna3 and below, so coverage is preserved.

@antiagainst (Member) commented
Kind ping on this @Mogball / @jeffniu-openai

@jeffniu-openai (Collaborator) left a comment
Thanks for the ping. Really happy to see the work here move forward.

In general, I think it would be best if hardware dispatch logic were coded directly in the translator/AST rewriter via a target abstraction, instead of at compile time in the translator helpers themselves. That way the final translated kernel would contain only the (translator-generated) code relevant to the specific hardware, rather than compile-time hardware switches, producing cleaner output.

translate_to_gluon: bool = False
inline_helpers: ordered_set[str] = field(default_factory=ordered_set[str])
cvt_context: list[bool] = field(default_factory=lambda: [False])
target: str = "nvidia"
@jeffniu-openai (Collaborator) commented

nit: can you make this a StrEnum

@jammm (Contributor, Author) replied

Done, except I had to use str and Enum separately because StrEnum isn't supported on Python 3.10, which the CI uses.

raise e

def _is_amd_target(self) -> bool:
    return self.target.startswith("gfx")
@jeffniu-openai (Collaborator) commented

this can be deleted in favor of the target StrEnum when there is one

@jammm (Contributor, Author) replied

Done. PTAL

Comment on lines +569 to +572
if self._is_amd_target():
    self.imports.add("from triton.experimental.gluon.language.amd.gfx1250.tdm import tensor_descriptor")
else:
    self.imports.add("from triton.experimental.gluon.language.nvidia.hopper.tma import tensor_descriptor")
@jeffniu-openai (Collaborator) commented

would it be possible to build a tiny hardware abstraction, and dispatch something like target.get_tensor_descriptor_import?

@jammm (Contributor, Author) replied

Added a tensor_descriptor_import to the target abstraction.

return self.generic_visit(node)
value, _, _ = ref
if value is tl.make_tensor_descriptor and self.target.startswith("gfx"):
    node.func = parse_expr("helpers.tl_make_tensor_descriptor_amd")
@jeffniu-openai (Collaborator) commented

why is this the only function that does hardware dispatch on the translator side, vs. at the translator helper level?

@jammm (Contributor, Author) replied

You're right, turns out this isn't needed. Updated, PTAL ^^

jammm added 3 commits April 13, 2026 22:45
- Add TranslatorTarget StrEnum with is_amd property and
  tensor_descriptor_import dispatch, accepting any gfx* string via
  _missing_() for forward-compat with new AMD architectures.
- Replace raw target strings and _is_amd_target() in SliceRewriter and
  Translator with the enum.
- Unify tl_make_tensor_descriptor and tl_make_tensor_descriptor_amd into
  a single helper that dispatches via current_target() at runtime,
  matching how tl_dot already works.

Made-with: Cursor
Use (str, Enum) instead of StrEnum which requires Python 3.11+.

Made-with: Cursor
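
A minimal sketch of what these two commits describe. The class name and the `_missing_` / `is_amd` / `tensor_descriptor_import` behavior follow the commit messages, and the import strings come from the review hunk above, but the details are assumptions rather than the exact merged code:

```python
from enum import Enum

class TranslatorTarget(str, Enum):
    # (str, Enum) rather than StrEnum, which requires Python 3.11+.
    NVIDIA = "nvidia"
    GFX1250 = "gfx1250"
    GFX942 = "gfx942"
    GFX950 = "gfx950"

    @classmethod
    def _missing_(cls, value):
        # Accept any "gfx*" string as a pseudo-member so new AMD
        # architectures work without adding an enum member for each.
        if isinstance(value, str) and value.startswith("gfx"):
            member = str.__new__(cls, value)
            member._name_ = value.upper()
            member._value_ = value
            return member
        return None

    @property
    def is_amd(self) -> bool:
        return self.value.startswith("gfx")

    @property
    def tensor_descriptor_import(self) -> str:
        # Hardware-abstracted dispatch of the tensor_descriptor import.
        if self.is_amd:
            return "from triton.experimental.gluon.language.amd.gfx1250.tdm import tensor_descriptor"
        return "from triton.experimental.gluon.language.nvidia.hopper.tma import tensor_descriptor"
```

Known values resolve to their members, while an unknown `gfx*` string still round-trips through `_missing_` as an AMD target.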
@jammm jammm requested a review from jeffniu-openai April 14, 2026 07:51
@jeffniu-openai (Collaborator) left a comment

I still think we should have separate translator helper modules based on the target. This would reduce polluting the translated kernel with unrelated hardware helpers. This can be abstracted through the hardware target, and you can re-use helpers. Something like

  /---common_helpers.py ---\
  |                        \
  v                         v
amd_helpers      nvidia_helpers

And the hardware target specifies the Python module. The slicer should automatically pull through the transitive deps if there are any.

@jammm (Contributor, Author) commented Apr 15, 2026

> I still think we should have separate translator helper modules based on the target. This would reduce polluting the translated kernel with unrelated hardware helpers. This can be abstracted through the hardware target, and you can re-use helpers. Something like
>
>   /---common_helpers.py ---\
>   |                        \
>   v                         v
> amd_helpers      nvidia_helpers
>
> and the hardware target specifies the python module. the slicer should automatically pull through the transitive deps if there are any

I did have separate amd and nvidia helpers initially, but wasn't sure whether that would be accepted given the renaming of NVIDIA-specific code it would involve :D. Now that I have confirmation, I'll get on the refactoring right away.

jammm added 2 commits April 15, 2026 00:13
Split the monolithic translator_helpers.py into:
- common_helpers.py: vendor-neutral utilities (layouts, portable ops)
- nvidia_helpers.py: NVIDIA-specific helpers (TMA, mbarrier, Blackwell)
- amd_helpers.py: AMD-specific helpers (TDM, WMMA, MFMA)

Each target module re-exports common helpers via star import so the
generated kernel sees a single unified `helpers` namespace. The
TranslatorTarget.helpers_module property selects which module to
import, so translated kernels no longer pull in unrelated hardware
modules.

translator_helpers.py is kept as a backward-compat re-export shim.

Made-with: Cursor
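
The module-selection mechanism might look roughly like this. The `helpers_module` property name comes from the commit message and the module paths follow the layout it describes; everything else is an assumption for illustration:

```python
import importlib
from enum import Enum

class TranslatorTarget(str, Enum):
    NVIDIA = "nvidia"
    GFX1250 = "gfx1250"
    GFX942 = "gfx942"
    GFX950 = "gfx950"

    @property
    def is_amd(self) -> bool:
        return self.value.startswith("gfx")

    @property
    def helpers_module(self) -> str:
        # Each target module star-imports common_helpers, so the
        # generated kernel sees one unified `helpers` namespace while
        # pulling in only its own vendor's hardware helpers.
        pkg = "triton.tools.triton_to_gluon_translator"
        return f"{pkg}.amd_helpers" if self.is_amd else f"{pkg}.nvidia_helpers"

def load_helpers(target: TranslatorTarget):
    # What the translated kernel's import preamble would effectively do.
    return importlib.import_module(target.helpers_module)
```

The translator only needs the module path string when emitting the generated kernel's imports; `load_helpers` shows the equivalent runtime lookup.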
@jammm jammm requested a review from jeffniu-openai April 15, 2026 15:29
@Mogball (Collaborator) left a comment
mostly LGTM, just a few nits


NVIDIA = "nvidia"
# AMD targets currently exercised by the translator test suite:
GFX1250 = "gfx1250"
@Mogball (Collaborator) commented

nit: can you put this in its own file

def convert_to_expand_dims_layout(value, expand_dims: list[int]) -> Any:
    layout: ttgl.constexpr = build_expand_dims_layout(value.shape, expand_dims, ttgl.num_warps())
    return ttgl.convert_layout(value, layout)
from triton.tools.triton_to_gluon_translator.common_helpers import *  # noqa: F401,F403
@Mogball (Collaborator) commented

is this file even still needed?

@jammm (Contributor, Author) commented Apr 15, 2026

Some tests seem to need it (test_slice_kernel.py). If I remove it, I'll have to modify those imports in the other test files.

@jammm (Contributor, Author) replied

Would you like me to remove it and change the imports in test_slice_kernel.py?

@Mogball (Collaborator) replied

yes

@jammm (Contributor, Author) commented Apr 15, 2026

Facing some CI failures on the NVIDIA backend after the refactor. Will attempt to fix those: https://github.com/triton-lang/triton/actions/runs/24442531609/job/71410815124?pr=9717#step:11:104

jammm added 3 commits April 15, 2026 12:51
Split the monolithic translator_helpers.py into:
- target.py: TranslatorTarget enum (hardware abstraction)
- common_helpers.py: vendor-neutral utilities (layouts, portable ops)
- nvidia_helpers.py: NVIDIA-specific helpers (TMA, mbarrier, Blackwell)
- amd_helpers.py: AMD-specific helpers (TDM, WMMA, MFMA)

Each target module re-exports common helpers via star import so the
generated kernel sees a single unified `helpers` namespace. The
TranslatorTarget.helpers_module property selects which module to
import, so translated kernels no longer pull in unrelated hardware
modules.

translator_helpers.py is kept as a backward-compat re-export shim.

Made-with: Cursor
The gluon JIT compiler evaluates all branches at compile time. With
only NVIDIA types in scope, the else branch fails because
tensor_descriptor lacks .store()/.load() methods. Since NVIDIA
descriptors are always TMA type, call the TMA functions directly.

Made-with: Cursor
@jammm jammm requested a review from Mogball April 15, 2026 20:02
Update test imports to use the target-specific helper modules directly
and delete the re-export shim. Use lazy import for convert_host_descriptor
in tests so the correct target module is loaded at runtime.

Made-with: Cursor
@Mogball (Collaborator) left a comment
LGTM, awesome work!

@Mogball Mogball merged commit 58816a1 into triton-lang:main Apr 15, 2026
9 checks passed
raymondtay pushed a commit to raymondtay/triton that referenced this pull request Apr 18, 2026
…lator (triton-lang#9717)

