
[Tracking Issue] Hopper TMA support for bulk asynchronous loads #15956

@adstraw

Description

This issue is to track progress for Hopper TMA support for bulk asynchronous loads.

The NVIDIA Hopper GPU adds support for a Tensor Memory Accelerator (TMA) engine, which implements a range of new features to hide memory latency. The following is an excerpt from the H100 GPU Architecture Overview.

To help feed the powerful new H100 Tensor Cores, data fetch efficiency is improved with a new Tensor Memory Accelerator (TMA) that can transfer large blocks of data and multi-dimensional tensors from global memory to shared memory and vice-versa.

TMA operations are launched using a copy descriptor which specifies data transfers using tensor dimensions and block coordinates instead of per-element addressing. Large blocks of data (up to the shared memory capacity) can be specified and loaded from global memory into shared memory or stored from shared memory back to global memory. TMA significantly reduces addressing overhead and improves efficiency with support for different tensor layouts (1D-5D tensors), different memory access modes, reductions, and other features.

The TMA operation is asynchronous and leverages the shared memory-based asynchronous barriers introduced in A100. Additionally, the TMA programming model is single-threaded, where a single thread in a warp is elected to issue an asynchronous TMA operation (cuda::memcpy_async) to copy a tensor, and subsequently multiple threads can wait on a cuda::barrier for completion of the data transfer. To further improve performance, the H100 SM adds hardware to accelerate these asynchronous barrier wait operations.
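For reference, the single-threaded issue / multi-threaded wait pattern described in the excerpt can be written with the libcu++ cuda::barrier and cuda::memcpy_async APIs. The kernel below is a minimal hand-written sketch, not TVM output; the buffer names, sizes, and launch configuration are illustrative, and whether the copy actually lowers to TMA depends on alignment, size, and toolkit version.

```cuda
#include <cuda/barrier>
#include <cooperative_groups.h>

namespace cg = cooperative_groups;

// One elected thread issues the asynchronous copy; every thread in the
// block then waits on the shared-memory barrier for the data to land.
__global__ void bulk_load_sketch(const float* __restrict__ gmem, int n) {
  extern __shared__ float smem[];                       // destination buffer
  __shared__ cuda::barrier<cuda::thread_scope_block> bar;

  auto block = cg::this_thread_block();
  if (block.thread_rank() == 0) {
    init(&bar, block.size());                           // one arrival per thread
  }
  block.sync();

  if (block.thread_rank() == 0) {
    // A single thread launches the async copy; completion is tied to `bar`.
    cuda::memcpy_async(smem, gmem, sizeof(float) * n, bar);
  }

  // All threads arrive and wait for both the copy and each other.
  bar.arrive_and_wait();

  // ... compute on smem ...
}
```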

Hopper TMA implements a wide feature space. To keep this issue tractable, it will focus on a single feature: bulk asynchronous loads (global -> shared) with barrier synchronization for 1D cases using memcpy (destination, source, size) semantics (a low-level sketch of the targeted pattern follows the list below). Specifically not addressed by this issue are:

  • Bulk asynchronous stores (shared -> global)
  • Tensor layouts (1D-5D tensors), different memory access modes, reductions, and other features
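At the PTX level, the targeted feature corresponds to a cp.async.bulk load whose completion is tracked by a shared-memory mbarrier. The inline-asm sketch below is hand-written and illustrative, not what TVM currently emits; the function and variable names are assumptions, the mbarrier is assumed to have already been initialized (mbarrier.init) with an arrival count of 1 and its phase tracking is elided, and addresses/sizes must satisfy the 16-byte alignment rules for bulk copies on sm_90.

```cuda
#include <cstdint>

// Sketch of a 1D bulk asynchronous load (global -> shared) synchronized
// with an mbarrier, i.e. memcpy(destination, source, size) semantics.
__device__ void bulk_load_1d(void* smem_dst, const void* gmem_src,
                             uint32_t bytes, uint64_t* smem_mbar) {
  uint32_t dst  = static_cast<uint32_t>(__cvta_generic_to_shared(smem_dst));
  uint32_t mbar = static_cast<uint32_t>(__cvta_generic_to_shared(smem_mbar));

  if (threadIdx.x == 0) {
    // Tell the barrier how many bytes the transaction will deliver.
    asm volatile("mbarrier.arrive.expect_tx.shared::cta.b64 _, [%0], %1;"
                 :: "r"(mbar), "r"(bytes) : "memory");
    // Issue the bulk copy; completion is signalled on the mbarrier.
    asm volatile("cp.async.bulk.shared::cluster.global.mbarrier::complete_tx::bytes"
                 " [%0], [%1], %2, [%3];"
                 :: "r"(dst), "l"(gmem_src), "r"(bytes), "r"(mbar) : "memory");
  }

  // All threads spin on the current barrier phase (phase parity 0 here for
  // brevity; real code must track the parity across reuses of the barrier).
  asm volatile(
      "{\n"
      "  .reg .pred p;\n"
      "WAIT:\n"
      "  mbarrier.try_wait.parity.shared::cta.b64 p, [%0], %1;\n"
      "  @!p bra WAIT;\n"
      "}\n"
      :: "r"(mbar), "r"(0) : "memory");
}
```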

Notably, Hopper bulk asynchronous loads require barrier-based synchronization, whereas on Ampere asynchronous loads can be synchronized with either barriers or group-based methods.

TVM implements the group-based synchronization method for Ampere asynchronous loads, where a commit_group instruction marks the end of a group of cp.async instructions and a wait_group instruction waits for the completion of prior asynchronous copy operations.
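For comparison, this Ampere-style group-based pattern can be expressed with the CUDA pipeline primitives, where __pipeline_commit() lowers to cp.async.commit_group and __pipeline_wait_prior(0) lowers to cp.async.wait_group 0. Again a hand-written sketch rather than TVM-generated code; the kernel and buffer names are illustrative.

```cuda
#include <cuda_pipeline_primitives.h>

// Group-based synchronization: each thread issues cp.async copies, the
// group is committed, and wait_prior(0) blocks until all previously
// committed groups have completed.
__global__ void group_sync_load_sketch(const float4* __restrict__ gmem, int n_vec) {
  extern __shared__ float4 smem[];

  for (int i = threadIdx.x; i < n_vec; i += blockDim.x) {
    // 16-byte asynchronous copy, global -> shared (lowers to cp.async).
    __pipeline_memcpy_async(&smem[i], &gmem[i], sizeof(float4));
  }
  __pipeline_commit();        // cp.async.commit_group
  __pipeline_wait_prior(0);   // cp.async.wait_group 0
  __syncthreads();

  // ... compute on smem ...
}
```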

The intention of this issue is to pivot TVM support for both Ampere and Hopper to use barrier synchronization for asynchronous loads. Here are the development items:

Confusingly, bulk asynchronous stores (shared -> global), which are also new for Hopper, require group-based (not barrier-based) synchronization. This is mentioned here to avoid over-optimizing for barrier synchronization, since group synchronization may still be required on Hopper. A first pass of bulk asynchronous store support for Hopper in TVM might be "fire and forget": all stores are issued in a single commit group at the appropriate point in a compute stage (end of operator or pipeline stage), and that group must complete (wait group) before proceeding to the next compute stage.
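A "fire and forget" store stage of that shape might look roughly like the inline-asm sketch below (hand-written and illustrative, not a proposed TVM implementation; function and variable names are assumptions, and addresses/sizes must satisfy the 16-byte alignment rules for bulk copies). The proxy fence makes the preceding shared-memory writes visible to the async proxy before the bulk store reads them.

```cuda
#include <cstdint>

// Sketch of a "fire and forget" bulk-store stage: issue the bulk async
// store (shared -> global), commit it as one group, and wait for the
// group to complete before moving on to the next compute stage.
__device__ void bulk_store_stage(void* gmem_dst, const void* smem_src,
                                 uint32_t bytes) {
  if (threadIdx.x == 0) {
    uint32_t src = static_cast<uint32_t>(__cvta_generic_to_shared(smem_src));
    // Make prior shared-memory writes visible to the async proxy.
    asm volatile("fence.proxy.async.shared::cta;" ::: "memory");
    // Bulk asynchronous store, shared -> global, tracked by a bulk group.
    asm volatile("cp.async.bulk.global.shared::cta.bulk_group [%0], [%1], %2;"
                 :: "l"(gmem_dst), "r"(src), "r"(bytes) : "memory");
    // End of the group for this compute stage.
    asm volatile("cp.async.bulk.commit_group;" ::: "memory");
    // Block until all previously committed bulk groups have completed.
    asm volatile("cp.async.bulk.wait_group 0;" ::: "memory");
  }
  __syncthreads();  // remaining threads wait for the issuing thread
}
```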
