Commit 2e16557

update

1 parent f648b83 commit 2e16557

File tree

3 files changed: +326 −1 lines changed
Lines changed: 109 additions & 0 deletions

@@ -0,0 +1,109 @@
---
title: CUTLASS CuTe Arch Summary of Architectures, Instructions, and Precisions
date: 2025-10-31
categories:
- CUDA
- CUTLASS
tags:
- Tensor Core
- Architecture comparison
- Precision support
- Matrix multiplication
abbrlink: cute-arch-summary
description: A reference mapping each generation of CUDA Tensor Cores (SM architectures) to the MMA instructions, tile shapes, and precisions supported by CUTLASS CuTe, to help developers understand GPU architecture evolution and precision characteristics.
---


# CUTLASS CuTe Arch Summary of Architectures, Instructions, and Precisions

## Matrix Multiply Accelerator (MMA): Architecture, Instruction, and Precision Table

| Architecture | Codename | Instruction | MMA shape | Input precision A×B | Accumulator C | Output D | Layout | Notes |
|------|------|----------|---------|-------------|-----------|-----------|------|----------|
| **SM61** | Pascal | `dp4a.s32.s32` | - | U8×U8 | S32 | S32 | - | Dot product |
| **SM61** | Pascal | `dp2a.s32.s32` | - | U16×U8 | S32 | S32 | - | Dot product |
| **SM70** | Volta | `mma.sync` | 8×8×4 | F16×F16 | F16 | F16 | TN/NT/NN/TT | First Tensor Core |
| **SM75** | Turing | `mma.sync` | 16×8×8 | F16×F16 | F32 | F32 | TN | Improved Tensor Core |
| **SM75** | Turing | `mma.sync` | 8×8×16 | S8×S8 | S32 | S32 | TN | INT8 support |
| **SM80** | Ampere | `mma.sync` | 16×8×8 | F16×F16 | F16/F32 | F16/F32 | TN/NT | Multiple shapes |
| **SM80** | Ampere | `mma.sync` | 16×8×16 | F16×F16 | F16/F32 | F16/F32 | TN/NT | Multiple shapes |
| **SM80** | Ampere | `mma.sync` | 16×8×8 | BF16×BF16 | F32 | F32 | TN/NT | BF16 support |
| **SM80** | Ampere | `mma.sync` | 16×8×16 | BF16×BF16 | F32 | F32 | TN/NT | BF16 support |
| **SM80** | Ampere | `mma.sync` | 16×8×32 | TF32×TF32 | F32 | F32 | TN/NT | TF32 support |
| **SM80** | Ampere | `mma.sync` | 16×8×16 | S8×S8 | S32 | S32 | TN/NT | INT8 |
| **SM80** | Ampere | `mma.sync` | 16×8×32 | S8×U8/S8×S8 | S32 | S32 | TN/NT | INT8 variants |
| **SM80** | Ampere | `mma.sync` | 16×8×8 | S4×S4 | S32 | S32 | TN | INT4 support |
| **SM80** | Ampere | `mma.sync` | 16×8×32 | S4×U4 | S32 | S32 | TN | INT4 variant |
| **SM89** | Ada Lovelace | `mma.sync` | 16×8×32 | E4M3×E4M3 | F32 | F32 | TN | FP8 (E4M3) |
| **SM89** | Ada Lovelace | `mma.sync` | 16×8×32 | E5M2×E5M2 | F32 | F32 | TN | FP8 (E5M2) |
| **SM89** | Ada Lovelace | `mma.sync` | 16×8×32 | E4M3×E5M2 | F32 | F32 | TN | Mixed FP8 |
| **SM89** | Ada Lovelace | `mma.sync` | 16×8×32 | E4M3×E4M3 | F16 | F16 | TN | FP8→F16 |
| **SM89** | Ada Lovelace | `mma.sync` | 16×8×32 | E5M2×E5M2 | F16 | F16 | TN | FP8→F16 |
| **SM90** | Hopper | `mma.sync` | 16×8×4 | F64×F64 | F64 | F64 | TN | Double precision |
| **SM90** | Hopper | `mma.sync` | 16×8×8 | F64×F64 | F64 | F64 | TN | Double precision |
| **SM90** | Hopper | `mma.sync` | 16×8×16 | F64×F64 | F64 | F64 | TN | Double precision |
| **SM90** | Hopper | `wgmma.mma_async` | 64×N×16 | F16×F16 | F16/F32 | F16/F32 | SS/RS | Large GMMA |
| **SM90** | Hopper | `wgmma.mma_async` | 64×N×16 | BF16×BF16 | F32 | F32 | SS/RS | Large GMMA |
| **SM90** | Hopper | `wgmma.mma_async` | 64×N×8 | TF32×TF32 | F32 | F32 | SS/RS/TN | Large GMMA |
| **SM90** | Hopper | `wgmma.mma_async` | 64×N×32 | S8×S8 | S32 | S32 | SS/RS/TN | Large GMMA |
| **SM90** | Hopper | `wgmma.mma_async.sp` | 64×N×32 | F16×F16 | F16/F32 | F16/F32 | SS/RS | Sparse GMMA |
| **SM90** | Hopper | `wgmma.mma_async.sp` | 64×N×32 | BF16×BF16 | F32 | F32 | SS/RS | Sparse GMMA |
| **SM100** | Blackwell | `fma(float2)` | 2×1×1 | F32×F32 | F32 | F32 | - | float2 math |
| **SM100** | Blackwell | `fma(float2)` | 1×2×1 | F32×F32 | F32 | F32 | - | float2 math |
| **SM100** | Blackwell | UMMA | 64×N×8 | TF32×TF32 | F32 | F32 | SS | UMMA op |
| **SM100** | Blackwell | UMMA | 64×N×16 | F16×F16 | F32 | F32 | SS | UMMA op |
| **SM100** | Blackwell | UMMA | 128×N×8 | TF32×TF32 | F32 | F32 | SS | UMMA op |
| **SM120** | Latest | `mma.sync` | 16×8×32 | E2M1×E2M1 | F32 | F32 | TN | F4 (E2M1) |
| **SM120** | Latest | `mma.sync` | 16×8×32 | E2M1×E3M2 | F32 | F32 | TN | Mixed F4/F6 |
| **SM120** | Latest | `mma.sync` | 16×8×32 | E2M1×E2M3 | F32 | F32 | TN | Mixed F4/F6 |
| **SM120** | Latest | `mma.sync` | 16×8×32 | E2M1×E4M3 | F32 | F32 | TN | Mixed F4/F8 |
| **SM120** | Latest | `mma.sync` | 16×8×32 | E3M2×E2M1 | F32 | F32 | TN | F6 variant |
| **SM120** | Latest | `mma.sync` | 16×8×32 | E4M3×E2M1 | F32 | F32 | TN | Mixed F8/F4 |
| **SM120** | Latest | `mma.sync` | 16×8×32 | E5M2×E2M1 | F32 | F32 | TN | Mixed F8/F4 |

**Notes:**
- **Layout**: TN = transposed × non-transposed, NT = non-transposed × transposed, NN = non-transposed × non-transposed, TT = transposed × transposed, SS = operands sourced from shared memory, RS = A sourced from registers
- **Precision abbreviations**: F16 = FP16, F32 = FP32, F64 = FP64, BF16 = bfloat16, TF32 = TF32, S8/U8 = INT8, S4/U4 = INT4
- **E4M3/E5M2**: FP8 formats (4 exponent + 3 mantissa bits / 5 exponent + 2 mantissa bits)
- **E2M1/E3M2/E2M3**: FP4/FP6 formats
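
To make one row concrete, here is a minimal sketch of the SM80 16×8×16 F16×F16→F32 TN entry as an inline-PTX call. The wrapper name and argument packing are illustrative; the instruction string is the PTX form the table abbreviates.

```cpp
#include <cstdint>

// One thread's share of the warp-wide 16x8x16 MMA: A is 8 halves packed
// into 4 .b32 registers, B is 4 halves in 2 registers, C/D are 4 floats.
__device__ void mma_16x8x16_f32f16f16f32_tn(float (&d)[4],
                                            uint32_t const (&a)[4],
                                            uint32_t const (&b)[2],
                                            float const (&c)[4])
{
  asm volatile(
    "mma.sync.aligned.m16n8k16.row.col.f32.f16.f16.f32 "
    "{%0,%1,%2,%3}, {%4,%5,%6,%7}, {%8,%9}, {%10,%11,%12,%13};\n"
    : "=f"(d[0]), "=f"(d[1]), "=f"(d[2]), "=f"(d[3])
    : "r"(a[0]), "r"(a[1]), "r"(a[2]), "r"(a[3]),
      "r"(b[0]), "r"(b[1]),
      "f"(c[0]), "f"(c[1]), "f"(c[2]), "f"(c[3]));
}
```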

## Memory Copy (Copy): Architecture, Instruction, and Precision Table

| Architecture | Codename | Instruction | Operation | Data types | Cache level | Notes |
|------|------|----------|----------|----------|----------|----------|
| **SM50** | Maxwell | `shfl.sync` | Shuffle | U32 | - | Intra-warp data exchange |
| **SM75** | Turing | `ldmatrix.sync` | LDSM | U16/U32 | Shared | Matrix load from shared memory |
| **SM75** | Turing | `movmatrix.sync` | MOVM | U32 | Register | Matrix transpose in registers |
| **SM80** | Ampere | `cp.async` | Async Copy | Various | Shared | Asynchronous copy |
| **SM90** | Hopper | `cp.async.bulk.tensor` | TMA | Various | Shared/L2 | Tensor Memory Accelerator |
| **SM90** | Hopper | `cp.async.bulk.prefetch.tensor` | TMA Prefetch | Various | L2 | TMA prefetch |
| **SM100** | Blackwell | `ld.global.L1::no_allocate.v8.f32` | 256-bit load | F32 | L1 | 256-bit load |
| **SM100** | Blackwell | `st.global.L1::no_allocate.v8.f32` | 256-bit store | F32 | L1 | 256-bit store |
| **SM100** | Blackwell | `ldsm.sync` | LDSM | U8/U16/U32 | Shared | Shared memory load |
| **SM100** | Blackwell | `stsm.sync` | STSM | U8/U16/U32 | Shared | Shared memory store |
| **SM100** | Blackwell | `cp.async.bulk.tensor` | TMA | Various | Shared/L2 | Optimized TMA |

**Notes:**
- **LDSM**: Load Matrix (load a matrix from shared memory into registers)
- **STSM**: Store Matrix (store a matrix from registers into shared memory)
- **TMA**: Tensor Memory Accelerator
- **MOVM**: Move Matrix (matrix data movement and transpose)
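
As a sketch of how the SM80 `cp.async` row is used (pointer names are illustrative; 16 bytes is the largest size a single `cp.async` moves):

```cpp
#include <cstdint>

// Issue one asynchronous 16B global->shared copy, then commit and wait.
__device__ void copy_async_16B(void* smem_dst, void const* gmem_src)
{
  uint32_t smem_addr = static_cast<uint32_t>(__cvta_generic_to_shared(smem_dst));
  asm volatile("cp.async.cg.shared.global [%0], [%1], 16;\n"
               :: "r"(smem_addr), "l"(gmem_src));
  asm volatile("cp.async.commit_group;\n" ::);
  asm volatile("cp.async.wait_group 0;\n" ::);
}
```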

## Full Precision Support Summary

### Supported numeric types
1. **Floating point**: F16, BF16, TF32, F32, F64
2. **FP8 formats**: E4M3, E5M2
3. **FP6/FP4 formats** (SM120): E2M1, E3M2, E2M3
4. **Integer**: S8, U8, S4, U4
5. **Complex**: C64 (complex double)
6. **Mixed precision**: F16→F32, BF16→F32, TF32→F32, FP8→F32/F16
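
For reference, these types are spelled as follows in CUTLASS (aliases from `cutlass/numeric_types.h` and `cutlass/complex.h`; listing only, no kernel code):

```cpp
#include <cutlass/numeric_types.h>
#include <cutlass/complex.h>

using cutlass::half_t;        // F16
using cutlass::bfloat16_t;    // BF16
using cutlass::tfloat32_t;    // TF32
using cutlass::float_e4m3_t;  // FP8 E4M3
using cutlass::float_e5m2_t;  // FP8 E5M2
using cutlass::int4b_t;       // S4
using cutlass::uint4b_t;      // U4
using C64 = cutlass::complex<double>;  // complex double
```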

### Architecture evolution highlights
- **SM61-SM75**: basic MMA and copy operations
- **SM80**: major improvements, with many precisions and shapes
- **SM89**: introduces FP8 support
- **SM90**: large GMMA operations and sparse matrix support
- **SM100**: float2 math and UMMA operations
- **SM120**: FP6/FP4 mixed-precision support

source/_posts/notes/cute_mma.md

Lines changed: 217 additions & 1 deletion
@@ -5,6 +5,12 @@ tags: [cutlass, cute]
excerpt: intro for cute mma
---

[TOC]

## arch
### mma
### copy

## MMA
```cpp
struct SM80_16x8x8_F32F16F16F32_TN
@@ -32,6 +38,52 @@
        "f"(c0), "f"(c1), "f"(c2), "f"(c3));
  }
};


// (T32,V1) -> (M8,N8)
using SM80_8x4 = Layout<Shape <Shape < _4,_8>,_1>,
                        Stride<Stride< _8,_1>,_0>>;
// (T32,V2) -> (M8,N8)
using SM80_8x8_Row = Layout<Shape <Shape < _4,_8>,_2>,
                            Stride<Stride<_16,_1>,_8>>;
// (T32,V4) -> (M8,N16)
using SM80_8x16_Row = Layout<Shape <Shape < _4,_8>,_4>,
                             Stride<Stride<_32,_1>,_8>>;
// (T32,V4) -> (M16,N8)
using SM80_16x8_Row = Layout<Shape <Shape < _4,_8>,Shape < _2,_2>>,
                             Stride<Stride<_32,_1>,Stride<_16,_8>>>;

////////////////////////////////////////////
//////// fp16 = fp16 * fp16 + fp16 /////////
////////////////////////////////////////////
template <>
struct MMA_Traits<SM80_16x8x8_F16F16F16F16_TN>
{
  using ValTypeD = half_t;
  using ValTypeA = half_t;
  using ValTypeB = half_t;
  using ValTypeC = half_t;

  using Shape_MNK = Shape<_16,_8,_8>;
  using ThrID   = Layout<_32>;
  using ALayout = SM80_16x8_Row;
  using BLayout = SM80_8x8_Row;
  using CLayout = SM80_16x8_Row;
};

//////////////////////////////////////////
/////// fp32 = fp16 * fp16 + fp32 ////////
//////////////////////////////////////////
template <>
struct MMA_Traits<SM80_16x8x8_F32F16F16F32_TN>
     : MMA_Traits<SM80_16x8x8_F16F16F16F16_TN>
{
  using ValTypeD = float;
  using ValTypeA = half_t;
  using ValTypeB = half_t;
  using ValTypeC = float;
};

```
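
To make the Thr-Val layouts above concrete, here is a worked reading of `SM80_16x8_Row`. The arithmetic below is my own expansion of the layout algebra, not part of the original post.

```cpp
// SM80_16x8_Row maps (tid,vid) -> the column-major index m + 16*n of the
// 16x8 tile. tid splits as (tid%4, tid/4) with strides (32,1); vid splits
// as (2,2) with strides (16,8). So for thread 0:
//   vid 0 -> idx  0 -> (m,n) = (0,0)
//   vid 1 -> idx 16 -> (m,n) = (0,1)
//   vid 2 -> idx  8 -> (m,n) = (8,0)
//   vid 3 -> idx 24 -> (m,n) = (8,1)
// matching the C/D fragment ownership of the ptx m16n8k* instructions.
```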
### MMA Operation
- Operation struct name
@@ -43,4 +95,168 @@
- F32F16F16F32 names the element types of the four matrix operands. MMA computes D = A*B + C, and the types are read left to right (D-F32, A-F16, B-F16, C-F32), matching the ptx suffix .f32.f16.f16.f32.
- NT means the A matrix is column major (M-major) and the B matrix is row major (N-major), matching the ptx modifiers .col.row.
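
Putting the naming rules together on the Operation above (an illustrative decomposition):

```cpp
// SM80_16x8x8_F32F16F16F32_TN, read piecewise:
//   SM80         -> requires sm_80 (Ampere) or newer
//   16x8x8       -> MNK shape: M=16, N=8, K=8
//   F32F16F16F32 -> D=f32, A=f16, B=f16, C=f32  => ptx .f32.f16.f16.f32
//   TN           -> A row major (K-major), B column major (K-major) => ptx .row.col
using Atom = cute::MMA_Atom<cute::SM80_16x8x8_F32F16F16F32_TN>;
```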
### MMA_Traits
```cpp
template <class MMAOperation, class... MMAOpArgs>
struct MMA_Traits
{
  static_assert(sizeof(MMAOperation) == 0, "MMA_Traits not implemented for this MMA_Operation.");
};

template <class D, class A, class B, class C>
struct MMA_Traits<UniversalFMA<D,A,B,C>>
{
  using ValTypeD = D;
  using ValTypeA = A;
  using ValTypeB = B;
  using ValTypeC = C;

  // Logical shape of the MMA
  using Shape_MNK = Shape<_1,_1,_1>;

  // Logical thread id (tid) -> tidx
  using ThrID = Layout<_1>;

  // (Logical thread id (tid), Logical value id (vid)) -> coord

  // (tid,vid) -> (m,k)
  using ALayout = Layout<Shape<_1,_1>>;
  // (tid,vid) -> (n,k)
  using BLayout = Layout<Shape<_1,_1>>;
  // (tid,vid) -> (m,n)
  using CLayout = Layout<Shape<_1,_1>>;
};

// Extract an MMA_Op from an MMA_Traits
template <class MMA_Traits>
struct MMA_Op {};

template <class MMA_Op_Arg, class... Args>
struct MMA_Op<MMA_Traits<MMA_Op_Arg, Args...>> {
  using type = MMA_Op_Arg;
};
```
### TiledMMA

## Atom
### MMA_Atom
```cpp
template <class... Args>
struct MMA_Atom;

template <class MMAOperation>
struct MMA_Atom<MMAOperation> : MMA_Atom<MMA_Traits<MMAOperation>>
{};

template <class MMAOperation, class... Args>
struct MMA_Atom<MMA_Traits<MMAOperation, Args...>>
  : MMA_Traits<MMAOperation, Args...>
{
  using MMA_Op = MMAOperation;
  using Traits = MMA_Traits<MMAOperation, Args...>;

  // Element value types from the MMA_Traits
  using ValTypeD = typename Traits::ValTypeD;
  using ValTypeA = typename Traits::ValTypeA;
  using ValTypeB = typename Traits::ValTypeB;
  using ValTypeC = typename Traits::ValTypeC;

  // Thr-Val layouts from the MMA_Traits
  using Shape_MNK  = typename Traits::Shape_MNK;
  using ThrID      = typename Traits::ThrID;
  using LayoutC_TV = typename Traits::CLayout;
  using LayoutA_TV = typename Traits::ALayout;
  using LayoutB_TV = typename Traits::BLayout;

  // Fragment value types from the MMA_Traits (optional, defaults to Val type)
  using FrgTypeD = typename detail::FrgTypeC_or_Default<Traits>::type;
  using FrgTypeA = typename detail::FrgTypeA_or_Default<Traits>::type;
  using FrgTypeB = typename detail::FrgTypeB_or_Default<Traits>::type;
  using FrgTypeC = typename detail::FrgTypeC_or_Default<Traits>::type;
};

template <class TiledMMA, class ThrCoord>
struct ThrMMA;

// @tparam MMA_Atom The MMA_Atom to use in the TiledMMA
// @tparam AtomLayoutMNK The MNK-tiling of the Atom to be performed.
// @tparam PermutationMNK Permutations to apply to each MNK-mode before tiling for the Atom.
template <class MMA_Atom,
          class AtomLayoutMNK,
          class PermutationMNK = Tile<Underscore,Underscore,Underscore>>
struct TiledMMA : MMA_Atom
{
  using Atom           = MMA_Atom;
  using AtomShape_MNK  = typename MMA_Atom::Shape_MNK;
  using AtomThrID      = typename MMA_Atom::ThrID;
  using AtomLayoutC_TV = typename MMA_Atom::LayoutC_TV;
  using AtomLayoutA_TV = typename MMA_Atom::LayoutA_TV;
  using AtomLayoutB_TV = typename MMA_Atom::LayoutB_TV;

  static_assert(   rank_v<AtomLayoutMNK> == 3, "TiledMMA requires rank-3 AtomLayoutMNK");
  static_assert(   rank_v<PermutationMNK> == 3, "TiledMMA requires rank-3 PermutationMNK");
  static_assert( is_tuple<PermutationMNK>::value, "TiledMMA requires independent permutations of MNK.");
  static_assert(is_static<PermutationMNK>::value, "TiledMMA requires static permutations of MNK.");

  using ThrLayoutVMNK = decltype(tiled_product(AtomThrID{}, AtomLayoutMNK{}));
  ThrLayoutVMNK thr_layout_vmnk_;

  ...
};

template <class TiledMMA, class ThrVMNK>
struct ThrMMA : TiledMMA
{
  ...
};
```
- make_tiled_mma
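
A minimal usage sketch (the atom-layout choice here is illustrative, not prescriptive):

```cpp
// Tile the 16x8x8 atom over a 2x2x1 grid of warps: 128 threads cooperate
// on a 32x16x8 MMA per step.
using TiledMma = decltype(cute::make_tiled_mma(
    cute::MMA_Atom<cute::SM80_16x8x8_F32F16F16F32_TN>{},
    cute::Layout<cute::Shape<cute::_2, cute::_2, cute::_1>>{}));  // AtomLayoutMNK

// Inside a kernel, each thread then takes its slice:
//   auto thr_mma = TiledMma{}.get_thread_slice(threadIdx.x);
```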
### Copy_Atom
```cpp
template <class... Args>
struct Copy_Atom;

template <class CopyOperation, class CopyInternalType>
struct Copy_Atom<CopyOperation, CopyInternalType> : Copy_Atom<Copy_Traits<CopyOperation>, CopyInternalType>
{};

template <class... Args, class CopyInternalType>
struct Copy_Atom<Copy_Traits<Args...>, CopyInternalType>
  : Copy_Traits<Args...>
{
  ...
};

template <class TiledCopy, class ThrIdx>
struct ThrCopy;

template <class Copy_Atom,
          class LayoutCopy_TV,  // (tid,vid) -> coord   [Need not be 2D...]
          class ShapeTiler_MN>  // coord space
struct TiledCopy : Copy_Atom
{
  ...
};

template <class TiledCopy, class ThrIdx>
struct ThrCopy
{
  ...
};

template <class... Args,
          class LayoutCopy_TV,
          class Tiler>
CUTE_HOST_DEVICE
auto
make_tiled_copy_impl(Copy_Atom<Args...> const& atom,
                     LayoutCopy_TV     const&,
                     Tiler             const&)
{
  return TiledCopy<Copy_Atom<Args...>, LayoutCopy_TV, Tiler>{atom};
}
```
- make_tiled_copy
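
A minimal usage sketch (the copy op and thread/value layouts are illustrative):

```cpp
// 32 threads, each moving 8 contiguous half_t values (16 B) per copy,
// i.e. a 32x8 tile of halves per step, backed by SM80 cp.async.
using CopyAtom = cute::Copy_Atom<cute::SM80_CP_ASYNC_CACHEGLOBAL<cute::uint128_t>,
                                 cute::half_t>;
using TiledCopyG2S = decltype(cute::make_tiled_copy(
    CopyAtom{},
    cute::Layout<cute::Shape<cute::_32, cute::_1>>{},    // thread arrangement
    cute::Layout<cute::Shape<cute::_1,  cute::_8>>{}));  // values per thread
```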

source/_posts/notes/gpu/image.png

163 KB