
[Tracking Issue] Hopper TMA support for bulk asynchronous loads #15956

@adstraw

Description

This issue is to track progress for Hopper TMA support for bulk asynchronous loads.

The NVIDIA Hopper GPU adds support for a Tensor Memory Accelerator (TMA) engine, which implements a range of new features to hide memory latency. The following is an excerpt from the H100 GPU Architecture Overview.

To help feed the powerful new H100 Tensor Cores, data fetch efficiency is improved with a new Tensor Memory Accelerator (TMA) that can transfer large blocks of data and multi-dimensional tensors from global memory to shared memory and vice-versa.

TMA operations are launched using a copy descriptor which specifies data transfers using tensor dimensions and block coordinates instead of per-element addressing. Large blocks of data (up to the shared memory capacity) can be specified and loaded from global memory into shared memory or stored from shared memory back to global memory. TMA significantly reduces addressing overhead and improves efficiency with support for different tensor layouts (1D-5D tensors), different memory access modes, reductions, and other features.

The TMA operation is asynchronous and leverages the shared memory-based asynchronous barriers introduced in A100. Additionally, the TMA programming model is single-threaded, where a single thread in a warp is elected to issue an asynchronous TMA operation (cuda::memcpy_async) to copy a tensor, and subsequently multiple threads can wait on a cuda::barrier for completion of the data transfer. To further improve performance, the H100 SM adds hardware to accelerate these asynchronous barrier wait operations.
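For reference, the single-threaded issue / multi-threaded wait pattern described in the excerpt can be written with the libcu++ cuda::barrier and cuda::memcpy_async APIs. The kernel below is a minimal hand-written sketch, not TVM output; the buffer names, sizes, and launch configuration are illustrative, and whether the copy actually lowers to TMA depends on alignment, size, and toolkit version.

```cuda
#include <cuda/barrier>
#include <cooperative_groups.h>

namespace cg = cooperative_groups;

// One elected thread issues the asynchronous copy; every thread in the
// block then waits on the shared-memory barrier for the data to land.
__global__ void bulk_load_sketch(const float* __restrict__ gmem, int n) {
  extern __shared__ float smem[];                       // destination buffer
  __shared__ cuda::barrier<cuda::thread_scope_block> bar;

  auto block = cg::this_thread_block();
  if (block.thread_rank() == 0) {
    init(&bar, block.size());                           // one arrival per thread
  }
  block.sync();

  if (block.thread_rank() == 0) {
    // A single thread launches the async copy; completion is tied to `bar`.
    cuda::memcpy_async(smem, gmem, sizeof(float) * n, bar);
  }

  // All threads arrive and wait for both the copy and each other.
  bar.arrive_and_wait();

  // ... compute on smem ...
}
```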

Hopper TMA implements a wide feature space. To keep this issue tractable, it will focus on a single feature: bulk asynchronous loads (global -> shared) with barrier synchronization for 1D cases using memcpy (destination, source, size) semantics (a low-level sketch of the targeted pattern follows the list below). Specifically not addressed by this issue are:

  • Bulk asynchronous stores (shared -> global)
  • Tensor layouts (1D-5D tensors), different memory access modes, reductions, and other features
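At the PTX level, the targeted feature corresponds to a cp.async.bulk load whose completion is tracked by a shared-memory mbarrier. The inline-asm sketch below is hand-written and illustrative, not what TVM currently emits; the function and variable names are assumptions, the mbarrier is assumed to have already been initialized (mbarrier.init) with an arrival count of 1 and its phase tracking is elided, and addresses/sizes must satisfy the 16-byte alignment rules for bulk copies on sm_90.

```cuda
#include <cstdint>

// Sketch of a 1D bulk asynchronous load (global -> shared) synchronized
// with an mbarrier, i.e. memcpy(destination, source, size) semantics.
__device__ void bulk_load_1d(void* smem_dst, const void* gmem_src,
                             uint32_t bytes, uint64_t* smem_mbar) {
  uint32_t dst  = static_cast<uint32_t>(__cvta_generic_to_shared(smem_dst));
  uint32_t mbar = static_cast<uint32_t>(__cvta_generic_to_shared(smem_mbar));

  if (threadIdx.x == 0) {
    // Tell the barrier how many bytes the transaction will deliver.
    asm volatile("mbarrier.arrive.expect_tx.shared::cta.b64 _, [%0], %1;"
                 :: "r"(mbar), "r"(bytes) : "memory");
    // Issue the bulk copy; completion is signalled on the mbarrier.
    asm volatile("cp.async.bulk.shared::cluster.global.mbarrier::complete_tx::bytes"
                 " [%0], [%1], %2, [%3];"
                 :: "r"(dst), "l"(gmem_src), "r"(bytes), "r"(mbar) : "memory");
  }

  // All threads spin on the current barrier phase (phase parity 0 here for
  // brevity; real code must track the parity across reuses of the barrier).
  asm volatile(
      "{\n"
      "  .reg .pred p;\n"
      "WAIT:\n"
      "  mbarrier.try_wait.parity.shared::cta.b64 p, [%0], %1;\n"
      "  @!p bra WAIT;\n"
      "}\n"
      :: "r"(mbar), "r"(0) : "memory");
}
```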

Notably, Hopper bulk asynchronous loads require barrier-based synchronization, whereas on Ampere asynchronous loads can be synchronized with either barriers or group-based methods.

TVM implements the group-based synchronization method for Ampere asynchronous loads, where a commit_group instruction marks the end of a group of cp.async instructions and a wait_group instruction waits for the completion of prior asynchronous copy operations.
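For comparison, this Ampere-style group-based pattern can be expressed with the CUDA pipeline primitives, where __pipeline_commit() lowers to cp.async.commit_group and __pipeline_wait_prior(0) lowers to cp.async.wait_group 0. Again a hand-written sketch rather than TVM-generated code; the kernel and buffer names are illustrative.

```cuda
#include <cuda_pipeline_primitives.h>

// Group-based synchronization: each thread issues cp.async copies, the
// group is committed, and wait_prior(0) blocks until all previously
// committed groups have completed.
__global__ void group_sync_load_sketch(const float4* __restrict__ gmem, int n_vec) {
  extern __shared__ float4 smem[];

  for (int i = threadIdx.x; i < n_vec; i += blockDim.x) {
    // 16-byte asynchronous copy, global -> shared (lowers to cp.async).
    __pipeline_memcpy_async(&smem[i], &gmem[i], sizeof(float4));
  }
  __pipeline_commit();        // cp.async.commit_group
  __pipeline_wait_prior(0);   // cp.async.wait_group 0
  __syncthreads();

  // ... compute on smem ...
}
```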

The intention of this issue is to pivot TVM support for both Ampere and Hopper to use barrier synchronization for asynchronous loads. Here are the development items:

Confusingly, bulk asynchronous stores (shared -> global), which are also new for Hopper, require group-based (not barrier-based) synchronization. This is mentioned here to avoid over-optimizing for barrier synchronization, since group synchronization may still be required on Hopper. A first pass of bulk asynchronous store support for Hopper in TVM might be "fire and forget": all stores are issued in a single commit group at the appropriate point in a compute stage (end of operator or pipeline stage), and that group must complete (wait group) before proceeding to the next compute stage.
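A "fire and forget" store stage of that shape might look roughly like the inline-asm sketch below (hand-written and illustrative, not a proposed TVM implementation; function and variable names are assumptions, and addresses/sizes must satisfy the 16-byte alignment rules for bulk copies). The proxy fence makes the preceding shared-memory writes visible to the async proxy before the bulk store reads them.

```cuda
#include <cstdint>

// Sketch of a "fire and forget" bulk-store stage: issue the bulk async
// store (shared -> global), commit it as one group, and wait for the
// group to complete before moving on to the next compute stage.
__device__ void bulk_store_stage(void* gmem_dst, const void* smem_src,
                                 uint32_t bytes) {
  if (threadIdx.x == 0) {
    uint32_t src = static_cast<uint32_t>(__cvta_generic_to_shared(smem_src));
    // Make prior shared-memory writes visible to the async proxy.
    asm volatile("fence.proxy.async.shared::cta;" ::: "memory");
    // Bulk asynchronous store, shared -> global, tracked by a bulk group.
    asm volatile("cp.async.bulk.global.shared::cta.bulk_group [%0], [%1], %2;"
                 :: "l"(gmem_dst), "r"(src), "r"(bytes) : "memory");
    // End of the group for this compute stage.
    asm volatile("cp.async.bulk.commit_group;" ::: "memory");
    // Block until all previously committed bulk groups have completed.
    asm volatile("cp.async.bulk.wait_group 0;" ::: "memory");
  }
  __syncthreads();  // remaining threads wait for the issuing thread
}
```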
